Site Reliability Engineer Remote Jobs

47 Results

+30d

Senior Site Reliability Engineer (Brazil)

Sezzle | Brazil, Remote

Sales ● DevOPS ● Bachelor's degree ● terraform ● sql ● Design ● c++ ● docker ● kubernetes ● linux ● python ● AWS

Sezzle is hiring a Remote Senior Site Reliability Engineer (Brazil)

About the Role:

We are looking for a Site Reliability Engineer to work on our core Infrastructure and Security team, to assist us with designing, building, running, improving and scaling the infrastructure that engineering and data teams use to power their services. Your duties will include the development, testing, and maintenance of our serving and data platforms, using a combination of cloud products, open source tools and internal applications. Your duties will blend software development and operations in order to continuously automate our environments. You should be able to build high-quality, scalable solutions for a variety of problems.

Our Company:

Sezzle is a cutting-edge fintech company whose long-standing mission is to financially empower the next generation. Sezzle has built a payment platform that increases purchasing power for consumers by offering interest-free installment plans. This increase in purchasing power for consumers leads to increased sales and basket sizes for the numerous eCommerce merchants that currently work with Sezzle.

What Makes Working at Sezzle Awesome?

At Sezzle, we are more than just brilliant engineers, passionate data enthusiasts, out-of-the-box thinkers, and determined innovators; we are skilled musicians, yogis, cyclists, chefs, golfers, dog-lovers, and rock-climbers. We believe in surrounding ourselves with not only the best and the brightest individuals, but those that are unique and purpose-driven in all that they do. Our culture is not defined by a certain set of perks designed to give the illusion of the traditional startup culture, but rather, it is the visible example living in every employee that we hire.

Responsibilities:

Design, build and maintain scalable infrastructure for running our systems, based on Kubernetes, Redshift and additional AWS services and products.
Help the product teams quickly build out MVP products to test new solutions on the market.
Maintain and develop monitoring and alerting solutions to improve the on-call experience.
Assist product developers in debugging and triaging production issues.
Be the first line of defense for our operational environments, triaging and resolving problems as they occur. You will be on an on-call rotation.
Design and scale platform and data architectures to sustain rapid user growth.
Level up the teams through pairing, code review, and mentoring.
Bring and share with our team extensive experience with industry best practices in software development.

Minimum Requirements:

Bachelor's in computer science (preferred) or equivalent related experience
At least 5 years of overall software, data, deployments and platform infrastructure experience.

Ideal Skills & Experience:

Experience with building and/or serving REST APIs using Go or a similar language.
Experience with Relational Databases, SQL and ORM technologies.
Strong overall Linux knowledge.
DevOps experience with CI/CD pipelines, Docker and Kubernetes, and cloud computing platforms like AWS.
Experience with deployment/provisioning tools like Terraform, Helm, Ansible.
Experience with implementing and maintaining observability and monitoring tools - Prometheus, Datadog, New Relic, Grafana, Loki or similar.
Experience in ETL/ELT pipelines using Python and open-source tools such as dbt.
Proficiency in building and maintaining large-scale data warehousing technologies such as Redshift.

About You:

A+ character. We are team-first here at Sezzle.
A hard-working mentality. It’s early and there is still a lot to build.
An excellent communicator.
A fun attitude. Life’s too short. We can have fun while we work hard on cool things.
Smarts. We need people that are smart enough to make decisions on their own and also smart enough to know when they need input from others.

Compensation

The compensation range for the role is as follows:

4,600 - 9,000 USD Monthly

Equal Employment Opportunity: Sezzle Inc. is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We do not discriminate based on race, color, religion, sex, national origin, age, disability, genetic information, pregnancy, or any other legally protected status. Sezzle recognizes and values the importance of diversity and inclusion in enriching the employment experience of its employees and in supporting our mission.

+30d

Senior Site Reliability Engineer (Chile)

Sezzle | Chile, Remote

Sales ● DevOPS ● Bachelor's degree ● terraform ● sql ● Design ● c++ ● docker ● kubernetes ● linux ● python ● AWS

Sezzle is hiring a Remote Senior Site Reliability Engineer (Chile)

The salary range for this role is $5,000 - $9,200 per month (Gross in USD)

About Sezzle:

With a mission to financially empower the next generation, Sezzle is revolutionizing the shopping experience beyond payments, blending cutting-edge tech with seamless, interest-free installment plans that make shopping smarter and more accessible. We’re not just transforming payments; we’re redefining how people discover, interact with, and purchase the things they love while driving real impact on merchant sales through increased conversions and higher order values. As we continue to shape the future of fintech and retail, we’re building an innovative, dynamic team passionate about creating more than just a transaction but a truly unique shopping journey. If you’re excited about pushing boundaries in tech and delivering a game-changing experience for consumers and merchants alike, come join us at Sezzle and help create the future of shopping!

About the Role:

We are seeking a talented and motivated Senior Site Reliability Engineer who is best in class with a high IQ plus a high EQ. This role presents an exciting opportunity to thrive in a dynamic, fast-paced environment within a rapidly growing team, with abundant prospects for career advancement. In this role you will work on our core Infrastructure and Security team, to assist us with designing, building, running, improving and scaling the infrastructure that engineering and data teams use to power their services. Your duties will include the development, testing, and maintenance of our serving and data platforms, using a combination of cloud products, open source tools and internal applications. Your duties will blend software development and operations in order to continuously automate our environments. You should be able to build high-quality, scalable solutions for a variety of problems.

Compensation

Sezzle is a remote U.S.-based company listed on NASDAQ. Our salary ranges are as follows:

Senior: $7,000 - $9,200 USD per month

Responsibilities:

Design, build and maintain scalable infrastructure for running our systems, based on Kubernetes, Redshift and additional AWS services and products.
Help the product teams quickly build out MVP products to test new solutions on the market.
Maintain and develop monitoring and alerting solutions to improve the on-call experience.
Assist product developers in debugging and triaging production issues.
Be the first line of defense for our operational environments, triaging and resolving problems as they occur. You will be on an on-call rotation.
Design and scale platform and data architectures to sustain rapid user growth.
Level up the teams through pairing, code review, and mentoring.
Bring and share with our team extensive experience with industry best practices in software development.

Minimum Requirements:

Bachelor's in computer science (preferred) or equivalent related experience
At least 5 years of overall software, data, deployments and platform infrastructure experience.

Ideal Skills & Experience:

Experience with building and/or serving REST APIs using Go or a similar language.
Experience with Relational Databases, SQL and ORM technologies.
Strong overall Linux knowledge.
DevOps experience with CI/CD pipelines, Docker and Kubernetes, and cloud computing platforms like AWS.
Experience with deployment/provisioning tools like Terraform, Helm, Ansible.
Experience with implementing and maintaining observability and monitoring tools - Prometheus, Datadog, New Relic, Grafana, Loki or similar.
Experience in ETL/ELT pipelines using Python and open-source tools such as dbt.
Proficiency in building and maintaining large-scale data warehousing technologies such as Redshift.

Sezzle’s Technology Stack:

Languages: Golang, TypeScript, Python
Frontend: TypeScript - React and React Native
Backend: Golang
Database: MySQL, Postgres, Elasticsearch
DevOps & Cloud: AWS, Kubernetes
Version Control: Git
CI/CD: GitLab
Testing: Developer-driven, focus on automated unit, integration, and end-to-end tests
Sezzle is focused on using open source, and we build what we can before buying!

About You:

You have relentlessly high standards - many people may think your standards are unreasonably high. You are continually raising the bar and driving those around you to deliver great results. You make sure that defects do not get sent down the line and that problems are fixed so they stay fixed.
You’re not bound by convention - your success, and much of the fun, lies in developing new ways to do things.
You need action - speed matters in business. Many decisions and actions are reversible and do not need extensive study. We value calculated risk-taking.
You earn trust - you listen attentively, speak candidly, and treat others respectfully.
You have backbone; disagree, then commit - you can respectfully challenge decisions when you disagree, even when doing so is uncomfortable or exhausting. You have conviction and are tenacious. You do not compromise for the sake of social cohesion. Once a decision is determined, you commit wholly.
You deliver results - you focus on the key inputs and deliver them with the right quality and in a timely fashion. Despite setbacks, you rise to the occasion and never settle.

What Makes Working at Sezzle Awesome:

At Sezzle, we are more than just brilliant engineers, passionate data enthusiasts, out-of-the-box thinkers, and determined innovators. We believe in surrounding ourselves with only the best and the brightest individuals. Our culture is not defined by a certain set of perks designed to give the illusion of the traditional startup culture, but rather, it is the visible example living in every employee that we hire.

+30d

Senior Site Reliability Engineer (Colombia)

Sezzle | Colombia, Remote

Sales ● DevOPS ● Bachelor's degree ● terraform ● sql ● Design ● c++ ● docker ● kubernetes ● linux ● python ● AWS

Sezzle is hiring a Remote Senior Site Reliability Engineer (Colombia)

The salary range for this role is $5,000 - $9,200 per month (Gross in USD)

Compensation

Sezzle is a remote U.S.-based company listed on NASDAQ. Our salary ranges are as follows:

Senior: $7,000 - $9,200 USD per month

Responsibilities:

Design, build and maintain scalable infrastructure for running our systems, based on Kubernetes, Redshift and additional AWS services and products.
Help the product teams quickly build out MVP products to test new solutions on the market.
Maintain and develop monitoring and alerting solutions to improve the on-call experience.
Assist product developers in debugging and triaging production issues.
Be the first line of defense for our operational environments, triaging and resolving problems as they occur. You will be on an on-call rotation.
Design and scale platform and data architectures to sustain rapid user growth.
Level up the teams through pairing, code review, and mentoring.
Bring and share with our team extensive experience with industry best practices in software development.

Minimum Requirements:

Bachelor's in computer science (preferred) or equivalent related experience
At least 5 years of overall software, data, deployments and platform infrastructure experience.

Ideal Skills & Experience:

Experience with building and/or serving REST APIs using Go or a similar language.
Experience with Relational Databases, SQL and ORM technologies.
Strong overall Linux knowledge.
DevOps experience with CI/CD pipelines, Docker and Kubernetes, and cloud computing platforms like AWS.
Experience with deployment/provisioning tools like Terraform, Helm, Ansible.
Experience with implementing and maintaining observability and monitoring tools - Prometheus, Datadog, New Relic, Grafana, Loki or similar.
Experience in ETL/ELT pipelines using Python and open-source tools such as dbt.
Proficiency in building and maintaining large-scale data warehousing technologies such as Redshift.

Sezzle’s Technology Stack:

Languages: Golang, TypeScript, Python
Frontend: TypeScript - React and React Native
Backend: Golang
Database: MySQL, Postgres, Elasticsearch
DevOps & Cloud: AWS, Kubernetes
Version Control: Git
CI/CD: GitLab
Testing: Developer-driven, focus on automated unit, integration, and end-to-end tests
Sezzle is focused on using open source, and we build what we can before buying!

About You:

You have relentlessly high standards - many people may think your standards are unreasonably high. You are continually raising the bar and driving those around you to deliver great results. You make sure that defects do not get sent down the line and that problems are fixed so they stay fixed.
You’re not bound by convention - your success, and much of the fun, lies in developing new ways to do things.
You need action - speed matters in business. Many decisions and actions are reversible and do not need extensive study. We value calculated risk-taking.
You earn trust - you listen attentively, speak candidly, and treat others respectfully.
You have backbone; disagree, then commit - you can respectfully challenge decisions when you disagree, even when doing so is uncomfortable or exhausting. You have conviction and are tenacious. You do not compromise for the sake of social cohesion. Once a decision is determined, you commit wholly.
You deliver results - you focus on the key inputs and deliver them with the right quality and in a timely fashion. Despite setbacks, you rise to the occasion and never settle.

+30d

Site Reliability Engineer - II (SRE II)

LivePerson | Hyderabad, Telangana, India (Remote)

DevOPS ● terraform ● nosql ● postgres ● sql ● ansible ● mongodb ● azure ● elasticsearch ● MySQL ● kubernetes ● linux ● jenkins ● AWS

LivePerson is hiring a Remote Site Reliability Engineer - II (SRE II)

LivePerson (NASDAQ: LPSN) is the global leader in enterprise conversations. Hundreds of the world’s leading brands — including HSBC, Chipotle, and Virgin Media — use our award-winning Conversational Cloud platform to connect with millions of consumers. We power nearly a billion conversational interactions every month, providing a uniquely rich data set and safety tools to unlock the power of Conversational AI for better customer experiences.

At LivePerson, we foster an inclusive workplace culture that encourages meaningful connection, collaboration, and innovation. Everyone is invited to ask questions, actively seek new ways to achieve success, and reach their full potential. We are continually looking for ways to improve our products and make things better. This means spotting opportunities, solving ambiguities, and seeking effective solutions to the problems our customers care about.

Overview:

LivePerson is looking for a Site Reliability Engineer for the GPT (Global Product & Technology) Division. You will be part of the LivePerson SRE team building and managing highly available, distributed systems. You will have the opportunity to be part of a strong team and enjoy the work environment of a start-up, with a robust product and the benefits of a leading company in its field.

You will:

Ensure high product uptime and reliability 24x7
Manage Linux servers in a multi-cloud environment
Manage high-availability Kubernetes resources using Helm charts
Assist with deploying upgrades and patches using Chef/Ansible/Puppet/Helm
Monitor and troubleshoot warnings and alerts related to the reporting platform’s performance
Develop monitoring resources and alerting systems using tools such as Grafana, Prometheus, Kibana, Datadog and PagerDuty
Coordinate with DBAs and developers to manage SQL and NoSQL database systems, including MongoDB, Elasticsearch, Postgres, MySQL and others
Manage message bus systems such as Kafka and Pulsar
Build and maintain CI/CD pipelines using Jenkins/GitLab/TeamCity

You have:

A minimum of 4 years of experience managing cloud-based production environments (AWS, GCP, Azure, etc.)
Extensive experience working in the Linux environment, with good scripting skills in Bash/Python
Extensive experience working with configuration management systems like OpsCode Chef, Ansible, Puppet, etc.
Strong experience in Terraform, CloudFormation or other IaC tools
Experience in SQL, including DDL and complex queries
Experience working in the Kubernetes platform
Experience working in a microservices architecture using a message bus
Good knowledge of CI/CD pipeline orchestrators like TeamCity, Jenkins, GitLab
Ability to integrate security best practices into the SRE workflow
High motivation and independence
A team-player mindset and excellent interpersonal skills
Excellent written and verbal communication skills
BS in Computer Science or a related field, or equivalent work experience
A strong background in cloud, network and application security and compliance
Experience with GPT or other LLMs is a strong advantage

Benefits

Health: Medical, Dental, and Vision
Time away: Vacation and holidays
Development: Generous tuition reimbursement and access to internal professional development resources.
Equal opportunity employer

Why You’ll Love Working Here

As leaders in enterprise customer conversations, we celebrate diversity, empowering our team to forge impactful conversations globally. LivePerson is a place where uniqueness is embraced, growth is constant, and everyone is empowered to create their own success. And, we're very proud to have earned recognition from Fast Company, Newsweek, and BuiltIn for being a top innovative, beloved, and remote-friendly workplace.

Belonging At LivePerson

We are proud to be an equal opportunity employer. All qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable laws, regulations and ordinances. We also consider qualified applicants with criminal histories, consistent with applicable federal, state, and local law.

We are committed to the accessibility needs of applicants and employees. We provide reasonable accommodations to job applicants with physical or mental disabilities. Applicants with a disability who require reasonable accommodation for any part of the application or hiring process should inform their recruiting contact upon initial connection.

+30d

Senior Site Reliability Engineer

Workable | Athens, Attica, Greece, Remote Hybrid

kubernetes

Workable is hiring a Remote Senior Site Reliability Engineer

Workable makes software to help companies find and hire great people. We get recruiting and its role in building healthy workplaces — which is why we’re proud more than 20,000 teams around the world use Workable to do exactly that.

At Workable, you’ll find smart people who have fun, learn and innovate, and help others do the same. We brainstorm, we laugh, and, occasionally, we party (there’s a lot to celebrate), but we also appreciate people’s need for quiet time and focused work. We respect everyone, we hire the best, and make sure every experience is special.

We’re growing fast and we want to make sure that we scale from thousands to hundreds of thousands, so we’re looking for a Senior Site Reliability Engineer to join our SRE team.

Our product is built with a microservices architecture deployed on the Kubernetes platform. Our SRE team is responsible for deploying, monitoring, optimizing, and securing our cloud infrastructure and company software, both of which are rapidly expanding. Automation is at the core of what we do. If you love working with new technologies, open-source software, and solving complex problems on highly distributed systems, then this is the job for you! You will be part of a talented team of engineers that demonstrate superb technical competency, delivering mission-critical infrastructure and ensuring the highest levels of availability, performance, and security.

As a Senior Site Reliability Engineer in this team with an emphasis on Tools and Automations, you will be responsible for the following:

Develop tools and automations to make operations and deployments simpler and more robust.
Operate, deploy, and monitor cloud services from development to production.
Work in a highly cross-functional team with Developers on designing, releasing, and troubleshooting production systems.
Be responsible for the availability, scalability, and performance of our systems.
Troubleshoot issues, do capacity planning, and analyze system performance.
Lead projects within the team and be responsible for their timely delivery.

Requirements:

BS/MS degree in Computer Science or Engineering (or a proven strong background)
Excellent communication skills in English, particularly written communication.
Analytical and troubleshooting skills on large-scale distributed systems
Work autonomously and be able to deliver projects on time.
Passion for cutting-edge cloud technologies and automation
Strong curiosity for discovering new insights and eager to challenge the status quo
5+ years of relevant work experience, including programming experience
Experience with the Kubernetes platform and technology stack
Experience with a major cloud provider (GCP and AWS preferred)
Experience with configuration management and orchestration tools (e.g., Ansible, Terraform)
Experience with centralized logging, monitoring systems, and tooling frameworks
Deep knowledge of Linux systems
Familiarity with at least one programming language (preferably Go, Python, Java, C++)
Familiarity with Relational and NoSQL (MongoDB, Redis, Elastic, etc.) databases
Oh, and if you're into DevOps technologies and the CNCF ecosystem, but have experience with other frameworks, please do apply. We value quality engineers, not the tools they've used.

Preferred qualifications:

Bonus: Networking skills, especially TCP/IP, HTTP, DNS and load balancers

Benefits:

Our employees enjoy benefits that make them more productive and contribute directly to the development of their professional skills. We want to be able to attract the best of the best and make sure they keep getting better. On top of an exciting, vibrant, and intellectually challenging environment, we are offering:

An attractive salary and a bonus plan
Health insurance plan including dependents
Mobile data plan
Apple gear and access to the best productivity tools
Annual retreats in awesome locations

Workable is most decidedly an equal-opportunity employer. We want applicants of diverse backgrounds and hire without regard to color, gender, religion, national origin, citizenship, disability, age, sexual orientation, or any other characteristic protected by law.

+30d

Lead Site Reliability Engineer

hims & hers | Remote

Bachelor's degree ● kotlin ● terraform ● sql ● Design ● ansible ● git ● java ● c++ ● docker ● postgresql ● MySQL ● typescript ● kubernetes ● python

hims & hers is hiring a Remote Lead Site Reliability Engineer

Hims & Hers Health, Inc. (better known as Hims & Hers) is the leading health and wellness platform, on a mission to help the world feel great through the power of better health. We are revolutionizing telehealth for providers and their patients alike. Making personalized solutions accessible is of paramount importance to Hims & Hers and we are focused on continued innovation in this space. Hims & Hers offers nonprescription products and access to highly personalized prescription solutions for a variety of conditions related to mental health, sexual health, hair care, skincare, heart health, and more.

Hims & Hers is a public company, traded on the NYSE under the ticker symbol “HIMS”. To learn more about the brand and offerings, you can visit hims.com and forhers.com, or visit our investor site. For information on the company’s outstanding benefits, culture, and its talent-first flexible/remote work approach, see below and visit www.hims.com/careers-professionals.

About the Role:

We are seeking a Lead Site Reliability Engineer to help build a reliable web experience for our users. We believe that moving fast is our competitive advantage, and enables us to better serve our users. We also know that the faster we move, the more likely we are to break things.

You Will:

Design and implement SRE practices ensuring availability, scalability and observability of production systems with a strong focus on excellent customer experience
Actively seek and identify opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation
Use automation extensively to design, configure, manage, and monitor systems in support of our product development teams
Understand infrastructure and infrastructure automation (Infrastructure as Code)
Manage incidents and emergency response, track outages, ensure data integrity and engineer releases to promote safe, efficient and rapid deployments
Handle emergency response either by being on-call or by reacting to symptoms according to monitoring and escalation when needed
Improve the codebase by resolving logic issues, deprecating unused code, etc.
Implement monitoring, logging, alerting and SLO Reporting
Identify Service Level Indicators (SLIs) that will align the team to meet the availability and performance objectives
Perform and run blameless RCAs on incidents and outages, aggressively looking for answers that will prevent incident reoccurrence
Provide reviews on design documents from internal and external teams
Perform more complex tasks using highly specialized knowledge and advanced business experience
Resolve complex tickets creatively
Develop and lead large and highly complex cross-functional projects or programs
Determine solutions to blockers, identify tasks, and develop solutions as appropriate
Be responsible for at least one major delivery domain and accountable for all aspects of SRE for that domain
Develop standards, tools, and knowledge requirements for skill and career development

You Have:

10+ years as a software engineer, shipping production code
5+ years of experience as a Site Reliability Engineer or Production support Engineer
Bachelor's degree in Computer Science, Engineering, or related field, or relevant years of work experience
Experience with service-oriented architectures and microservices at scale
Strong proficiency with RDBMS databases (PostgreSQL, MySQL, SQL Server, etc.)
Strong proficiency in SQL scripting
Proficiency developing in one or more languages such as Java, Kotlin, Python, and/or others
Ability to use containers and orchestration frameworks (Kubernetes, Docker, container registries, etc.)
Knowledge of CDN, TypeScript frameworks, and GQL
Knowledge and good understanding of pub/sub and queue messaging systems
Proficiency in Git or other VCS
Experience with configuring, customizing, and extending monitoring tools (Datadog, Prometheus, New Relic etc.)
Excellent debugging and troubleshooting skills
Strong technical competency, with a data-driven analytical approach towards solving complex challenges
Have a systematic problem-solving approach, coupled with strong and effective communication skills and a sense of drive
Nice-to-have: Experience with Terraform or other IaC tools such as Chef, Puppet or Ansible

Our Benefits (there are more but here are some highlights):

Competitive salary & equity compensation for full-time roles
Unlimited PTO, company holidays, and quarterly mental health days
Comprehensive health benefits including medical, dental & vision, and parental leave
Employee Stock Purchase Program (ESPP)
Employee discounts on hims & hers & Apostrophe online products
401k benefits with employer matching contribution
Offsite team retreats

Outlined below is a reasonable estimate of H&H’s compensation range for this role for US-based candidates. If you're based outside of the US, your recruiter will be able to provide you with an estimated salary range for your location.

The actual amount will take into account a range of factors that are considered in making compensation decisions, including but not limited to skill sets, experience and training, licensure and certifications, and location. H&H also offers a comprehensive Total Rewards package that may include an equity grant.

Consult with your Recruiter during any potential screening to determine a more targeted range based on location and job-related factors.

An estimate of the current salary range is

$150,000—$175,000 USD

We are focused on building a diverse and inclusive workforce. If you’re excited about this role, but do not meet 100% of the qualifications listed above, we encourage you to apply.

Hims considers all qualified applicants for employment, including applicants with arrest or conviction records, in accordance with the San Francisco Fair Chance Ordinance, the Los Angeles County Fair Chance Ordinance, the California Fair Chance Act, and any similar state or local fair chance laws.

Hims & Hers is committed to providing reasonable accommodations for qualified individuals with disabilities and disabled veterans in our job application procedures. If you need assistance or an accommodation due to a disability, please contact us at accommodations@forhims.com and describe the needed accommodation. Your privacy is important to us, and any information you share will only be used for the legitimate purpose of considering your request for accommodation. Hims & Hers gives consideration to all qualified applicants without regard to any protected status, including disability. Please do not send resumes to this email address.

For our California-based applicants – Please see our California Employment Candidate Privacy Policy to learn more about how we collect, use, retain, and disclose Personal Information.

+30d

Site Reliability Engineer

hims & hers | Remote

Bachelor's degree ● kotlin ● sql ● Design ● git ● java ● c++ ● postgresql ● MySQL ● python

hims & hers is hiring a Remote Site Reliability Engineer

About the Role:

We are seeking a Site Reliability Engineer to help build a reliable web experience for our users. We believe that moving fast is our competitive advantage, and enables us to better serve our users. We also know that the faster we move, the more likely we are to break things.

You Will:

Actively seek and identify opportunities to improve the availability and performance of the system by applying the learnings from monitoring and observation.
Use automation extensively to design, configure, manage, and monitor systems in support of our product development teams
Implement SRE practices ensuring availability, scalability and observability of production systems with a strong focus on excellent customer experience
Understanding of Infrastructure as Code
Incident management and emergency response, track outages, ensure data integrity and engineer releases to promote rapid deployments
Handle emergency response either by being on-call or by reacting to symptoms according to monitoring and escalation when needed
Implement monitoring, logging, alerting and SLO Reporting
Identify Service Level Indicators (SLIs) that will align the team to meet the availability and performance objectives.
Perform and run blameless RCAs on incidents and outages aggressively looking for answers that will prevent incident reoccurrence.
Demonstrate strong technical skills and expertise in at least one object-oriented programming language
Independently handle complex technical tasks in projects.

You Have:

3+ years as a software engineer, shipping production code.
1+ years of experience as a Site Reliability Engineer or Production support Engineer
Bachelor's degree in Computer Science, Engineering, or related field, or relevant years of work experience
Proficiency with RDBMS databases (PostgreSQL, MySQL, SQL Server, etc.)
Proficiency in SQL scripting
Proficiency developing in one or more languages such as Java, Kotlin, Python, and/or others
Proficiency in Git or other VCS
Good debugging and troubleshooting skills
Strong technical competency, with a data-driven analytical approach towards solving complex challenges

Our Benefits (there are more but here are some highlights):

Competitive salary & equity compensation for full-time roles
Unlimited PTO, company holidays, and quarterly mental health days
Comprehensive health benefits including medical, dental & vision, and parental leave
Employee Stock Purchase Program (ESPP)
Employee discounts on hims & hers & Apostrophe online products
401k benefits with employer matching contribution
Offsite team retreats

#LI-Remote

Outlined below is a reasonable estimate of H&H’s compensation range for this role for US-based candidates. If you're based outside of the US, your recruiter will be able to provide you with an estimated salary range for your location.

The actual amount will take into account a range of factors that are considered in making compensation decisions including but not limited to skill sets, experience and training, licensure and certifications, and location. H&H also offers a comprehensive Total Rewards package that may include an equity grant.

Consult with your Recruiter during any potential screening to determine a more targeted range based on location and job-related factors.

An estimate of the current salary range for US-based employees is

$103,000—$117,000 USD

We are focused on building a diverse and inclusive workforce. If you’re excited about this role, but do not meet 100% of the qualifications listed above, we encourage you to apply.

For our California-based applicants – Please see our California Employment Candidate Privacy Policy to learn more about how we collect, use, retain, and disclose Personal Information.

See more jobs at hims & hers

Apply for this job

+30d

Site Reliability Engineer

Float.comNew York,United States, Remote

terraform ● slack ● qa ● kubernetes ● python ● PHP

Float.com is hiring a Remote Site Reliability Engineer

Who We Are

Float is the world’s leading software for teams to plan their time. Launched in 2012, we’ve grown every year since, and remain proudly independent, self-funded and profitable. As a certified B Corporation, we’re committed to making a positive contribution to our team, customers, the environment, and the remote community. We’re a team of 50 working 100% remotely who believe in living our Best Work Life. You’ll partner with team members globally, including Australia, Mexico, Italy, Nigeria, Canada, and the USA. Hear what our team has to say by browsing our blog, or reading our Glassdoor reviews. Check out what our customers think of Float from our G2 reviews.

We’re on a scale up journey, and we’re seeking people who thrive in this stage, given the autonomy, and the opportunity, to do the best work of their career.

Why We’re Hiring For This Role

The role of Site Reliability Engineers at Float is to increase the autonomy of the product and engineering teams by growing their capabilities to focus on solving problems. SRE makes sure our engineers get scalable infrastructure to build software on top of, and that pipelines from idea to customer run smoothly and are easily built upon. We also deal with broad areas of security around our network, and with defining internal security policy and practices.

Our goals for the Engineering team are to increase the pace with which they deliver improvements for our customers, provide an increasingly sophisticated and reliable service from our teams, and mitigate external threats as we grow.

You will help us tackle those problems by increasing reliability of our services to support larger clients joining Float, and increasing the robust security systems we’ve implemented to continue protecting our growing customer base.

Chris Nash, our Team Lead (SRE & QA), explains the important role you will play within our SRE team. Watch this video.

You’ll be working asynchronously with a bright, dedicated team from across the globe, with a strong focus on taking complex problems and creating solutions that feel simple and intuitive for our customers.

What You’ll Be Responsible For

Early on, you’ll jump right into:

Continuing to support the regular maintenance of all the engineering systems supporting Float’s customers
Identifying areas requiring support to scale
Identifying areas for improving service resilience, ultimately delivering the ability to be resilient within the product and engineering teams themselves
Optimizing our monitoring and observability stack, building on the knowledge to create a standard set of tools and configurations for the product and engineering teams
Understanding Float’s SLOs in context, and building out SLO patterns and procedures for product and engineering teams

Once you are settled, we expect that you will jump into the following projects:

Building a repeatable and trustworthy disaster recovery program using chaos engineering techniques
Migrating all of our deployment configurations to a global single source of truth
Expanding Float’s infrastructure across multiple regions to create a global network

What You’ll Need To Be Successful

We want you to love your work and believe that these skills will allow you to succeed in the role.

Applying these skills requires:

An excellent understanding of how SRE operates as an enabling team
A very good understanding of Service Level Objectives
Working experience with Terraform, Bash, and a go-to language which ideally would be one of PHP, NodeJS, Python
Experience with Kubernetes and GCP would be highly valued

As a fully remote team, we’re looking for someone comfortable with asynchronous communication as the default, which means you have previous remote experience and are comfortable using tools like Slack, Loom, and Linear to communicate as needed. Don’t worry—you will have significant deep work time since we have very few meetings.

Why Join Us

Pay for this role is US $167,471 (Level 3). Here’s a blog post with more information on how we determine our salaries.

We’re a global async remote company with a diverse team of people from all over the world who share a common belief in living our best work life. We believe deeply in the idea of transparency and share our Float Handbook publicly so potential new team members can see first hand our perks & benefits as well as our ways of working. If you feel like you can thrive at Float to do your best work, we would love to hear from you.

Hiring Process For This Role

You’ll find a lot of useful information about our interview process and what it’s like to join our global team on the Float careers page. The hiring process for this role looks like this:

Initial First Meet (20 min): You'll meet with Julia Fulton, Talent Manager, to discuss your interest in the role and review your questions about working at Float.
Take-Home Assignment: Candidates that move forward will be invited to complete a take-home assignment for the engineering team to review. This is a 4-hour assignment. Candidates will receive high-level feedback from the hiring team and those that move forward will proceed to the technical interview stage to discuss results further in more detail.
Technical Interview (45 min): You’ll meet with Chris Nash (Team Lead, SRE & QA) and Bogdan Frunza (Senior SRE) to discuss more about your technical experience. This will be a great opportunity for you to ask any questions and talk about goals for the role.
Leadership Interview (45 min): You’ll meet with Lars Gelfin (CTO) and Colin Ross (Director of Engineering) to discuss more about your experience. This will be a great opportunity for you to ask any questions and talk about goals for the role.
Founder Interview (30 min): You’ll meet with Glenn, Float’s CEO, to get to know you and see if you have potential to be a great addition to the team.

Note: Industry research shows that women and those in traditionally underrepresented groups generally don’t apply to jobs unless they check all the boxes for the role. If you feel strongly that you have what it takes for this role but don’t check 100% of the boxes—that’s okay—we encourage you to apply anyway and highlight what you can bring to the table.

See more jobs at Float.com

Apply for this job

+30d

Site Reliability Engineer - III

Live PersonHyderabad, Telangana, India (Remote)

terraform ● nosql ● postgres ● sql ● ansible ● mongodb ● azure ● elasticsearch ● MySQL ● kubernetes ● linux ● jenkins ● AWS

Live Person is hiring a Remote Site Reliability Engineer - III

Overview:

LivePerson is looking for a Site Reliability/DevOps Engineer for the GPT (Global Product & Technology) Division. You will be part of the LivePerson SRE team building and managing highly available, distributed systems. You will have the opportunity to be part of a strong team and enjoy the work environment of a start-up, with a robust product and the benefits of a leading company in its field.

You will:

Lead a team to ensure product high uptime and reliability 24x7.
Give technical leadership and support to your team and stakeholders such as development teams, data team, and leadership.
Be hands-on with your team in the day-to-day operations of cloud environments, the Kubernetes platform, the data and messaging platform, and security/compliance.

You have:

7+ years of experience managing cloud-based production environments (AWS, GCP, Azure, etc.)
Strong verbal and written communication skills, proficient in collaborating with various stakeholders
Solid hands-on experience and understanding of system architecture
Highly experienced working in the Linux environment, good scripting in Bash / Python.
Highly experienced working with configuration management systems like Puppet, OpsCode Chef, Ansible, etc.
Strong experience in Terraform, CloudFormation or other IAC
Strong experience in SQL (including DDL and complex queries) and managing SQL and NoSQL database systems, including MongoDB, ElasticSearch, Postgres, MySQL and others
Strong experience working with the Kubernetes platform and Helm
Strong experience working in a microservices architecture using a message bus such as Kafka and Pulsar
Experience in monitoring resources and alerting systems such as Grafana, Prometheus, Kibana, DataDog and PagerDuty
Good knowledge of CI/CD pipeline orchestrators like TeamCity, Jenkins, GitLab.
Highly motivated and independent.
Team player with excellent interpersonal skills.
BS in Computer Science or a related field, or equivalent work experience.
A strong background in cloud, network and application security and compliance
Experience with GPT or other LLMs a strong advantage

Benefits

Health: Medical, Dental, and Vision
Time away: Vacation and holidays
Development: Generous tuition reimbursement and access to internal professional development resources.
Equal opportunity employer

Apply for this job

+30d

Site Reliability Engineer III

SinchFrance, Remote

terraform ● mongodb ● linux ● AWS

Sinch is hiring a Remote Site Reliability Engineer III

We’re seeking a Site Reliability Engineer to join our Site Reliability Engineering team. This fully remote role is based in France.

Be a part of the team that builds and operates the infrastructure at the heart of every Sinch Mailjet service. You’ll be instrumental for the day-to-day management of our global infrastructure. This includes monitoring and tracking key performance indicators (KPIs), collaborating with engineers to ensure our products and services are appropriately resourced, automating processes, and planning for future growth and scalability.

As our Site Reliability Engineer, you will:

Partner with product engineering teams to identify system requirements.
Build and support our cloud-based microservices infrastructure.
Automate routine processes and remediation tasks.
Develop, monitor and track Service Level Objectives (SLOs) for the systems under management.
Proactively troubleshoot, resolve, and plan for issues that typically come from support staff, other engineering teams, and our automated monitoring system.
Ensure our datastores are healthy and operate at optimal performance levels.
Contribute to the growth and culture of our engineering team

To contribute to this role, we believe you have:

Background in infrastructure, operations, or software engineering.

Expertise with containers and orchestration systems like Nomad.
Experience with cloud providers such as AWS and GCP.
Proficiency in configuration management tools such as Terraform and Ansible.
Hands-on proficiency with modern monitoring tools like Prometheus and Grafana.
Experience with distributed data stores such as Cassandra, MongoDB, and ElasticSearch.
Strong technical skills across various infrastructure technologies.
Proven ability to break down complex tasks into manageable ones.
Strong communication skills and a history of building solid relationships with peers and leadership.
Experience operating and maintaining production systems in a Linux and public cloud environment.
Demonstrated ability to mentor and guide team members.

Are you ready? Join us on our journey!

We review applications continuously and may un-post the job posting earlier based on candidate inflow. Submit your application in English to join us in making seamless and impactful connections worldwide.

At Sinch, we value learning, embrace change, and offer opportunities for personal and professional growth. Unfortunately, we are not able to support relocation outside EU at the moment and therefore we will take into consideration only applicants that:

Hold French citizenship.
Hold EU/EEA citizenship.
Have a valid work permit for working in France.

Our Hiring Process

At Sinch, we are committed to following a recruitment process that is fair, objective, consistent, and non-discriminatory. We use pre-employment assessment to create an inclusive application experience to help foster diverse and high performing teams.

Even if you do not meet all job requirements, don't let that stop you from considering Sinch for the next step in your career. We are always looking for people that could help us pioneer the way the world communicates.

If this role isn't what you're looking for, please consider other open roles on our career page: https://www.sinch.com/careers/

See more jobs at Sinch

Apply for this job

+30d

Senior Site Reliability Engineer

People Can FlyWarszawa, Poland, Remote

kotlin ● Design ● mobile ● java ● c++ ● jenkins ● python

People Can Fly is hiring a Remote Senior Site Reliability Engineer

Job Description

Design, develop, deploy and operate reliable and scalable infrastructure for the online services platform
Collaborate with cross-functional teams to translate business requirements into technical solutions, balancing user needs with technical constraints.
Automate deployment of the online services platform to cloud providers, including provisioning for various stages like development, testing, and external publishers.
Develop and implement systems to maximise reliability, scalability, and uptime while also optimising for cost
Design and develop systems and tooling that support efficient maintenance, updates, and recovery
Create tooling, data sources, monitoring dashboards, and alerting for all online services products, with a particular focus on real time service health
Lead Incident Management of live issues, as well as troubleshooting, break-fix and resolution of those issues
Create, review and maintain essential operational documentation such as run books, post-mortem reports, and root cause analysis
Assist leads with recruiting, onboarding, development and mentorship of engineers.
Stay updated on emerging SRE technologies and industry trends, evaluating their potential impact on our development processes and strategies.

Qualifications

4+ years of extensive experience in infrastructure engineering, with a specific focus on Cloud Infrastructure
Strong knowledge of, and experience with, writing and optimising Terraform.
Strong knowledge of, and experience with Infrastructure-as-Code (IaC) and related best practices
Strong in at least one programming language (Python, Go, Kotlin, Java or similar) as well as with scripting and automation in general
Good grasp of network architecture and security best practices.
Familiarity with CI/CD pipelines and tools like Github Actions, Jenkins
Proficient with Source Control and Code Review tools (Git/Github, Perforce/Swarm etc.).
Experience setting up monitoring and alerting systems
Experience with Incident Management and troubleshooting live issues
Ability to analyse and improve system performance, strong troubleshooting skills across various technology layers.
Knowledge in designing and implementing disaster recovery strategies.
Strong mentoring skills.
Strong verbal and written communication skills in English.

Nice to have:

Experience in the Video Games Industry
Unreal Engine knowledge (C++ in particular)
Experience in content distribution, ad-tech, news, mobile gaming, or finance domains
Additional language proficiency
Additional project management and bug tracking software knowledge

See more jobs at People Can Fly

Apply for this job

+30d

Site Reliability Engineer - SRE

Now1Atlanta, GA, Remote

DevOPS ● Design ● kubernetes ● linux ● jenkins ● python

Now1 is hiring a Remote Site Reliability Engineer - SRE

Job Description

Role: Site Reliability Engineer
Location: Atlanta, GA OR Dallas OR Austin, TX
Duration: Long Term or 6+ Months contract to Hire

Note: Remote Possible, however candidates will move to work onsite/Hybrid eventually. Please make sure you are comfortable with this.

EADs are allowed for candidates who can work on W2. C2C works for candidates who have their own corporation.

Job description:

3-5 years of Site Reliability Engineer experience on the Google platform.

Strong experience in Google Cloud platform.

As a Staff Software Engineer, you will be a core player on the product team and are expected to build and grow the skillsets of the more junior Engineers. As a Staff Site Reliability Engineer you will be responsible for building and supporting the platform/application infrastructure of one of the largest retailers in the world. This will require you to maintain high site uptime/availability while embracing rapid change and growth using a strong devops mindset of continuous delivery and site automation.

Qualifications

Preferred Qualifications:

3-7 years of professional experience in engineering
Hands on experience in Site Reliability Engineering and solving problems through automation and instrumentation
Experience with Jenkins for CI/CD pipeline creation and CI/CD automation
Experience with Kubernetes implementation with Google
Proficient in a Linux or Unix based environment.
Proficiency in supporting a 24x7 operation.
Experience in a cloud computing platform and the associated automation patterns it provides, preferably GCP.
Deep understanding of an object-oriented language, preferably the latest version of Java.
Proficient in a modern scripting language like Go or Python
Proficient in production systems design including High Availability, Disaster Recovery, Performance, Efficiency, and Security user, application performance, system, log, time-series, and dashboarding.
Proficient in a modern infrastructure automation toolkit such as Terraform/Helm

See more jobs at Now1

Apply for this job

+30d

Site Reliability Engineer

NewselaRemote

redis ● agile ● jira ● terraform ● postgres ● sql ● Design ● c++ ● docker ● MySQL ● python ● AWS

Newsela is hiring a Remote Site Reliability Engineer

The role:

As a member of our Technology team, the Site Reliability Engineer will be on an on-call rotation to respond to incidents that impact Newsela.com availability and provide support for developers during internal and external incidents.
Maintain and assist in extending our infrastructure with Terraform, Github Actions CI/CD, Prefect, and AWS services.
Build monitoring that alerts on symptoms rather than outages using Datadog, Sentry and CloudWatch.
Look for ways to turn repeatable manual actions into automations to reduce on-call toil.
Improve operational processes (such as deployments, releases, migrations, etc) to make them run seamlessly with fault tolerance in mind.
Design, build and maintain core cloud infrastructure on AWS and GCP that enables scaling to support thousands of concurrent users
Debug production issues across services and levels of the stack.
Provide infrastructure and architectural planning support as an embedded team member within a domain of Newsela’s application developers.

Why you’ll love this role:

As a member of our growing Technology team, you will have the opportunity to make a real and immediate impact by:
- being involved in the growth of Newsela’s infrastructure.
- influencing improved resiliency and reliability of the Newsela product.
You'll impact Newsela.com's availability, which will ultimately scale Newsela’s ability to bring engaging, culturally responsive learning content to K-12 classrooms nationwide.

Why you’re a great fit:

2+ years of experience as a Site Reliability Engineer.
Background in Infrastructure as code: use Terraform and Github CI/CD for automation, containerize our environments (Docker, ECS), and leverage cloud technologies to meet our goals.
Systems experience managing, configuring and troubleshooting operating system issues, storage (block and object), and networking (VPCs, proxies and CDNs), and administering high-availability datastores (MySQL, Postgres, Neo4j) and Redis clusters.
Monitoring and instrumentation: implement metrics in Datadog, Sentry, log management and related systems, and Slack/JIRA integrations.
Understanding of engineering practices: availability, reliability and scalability, as well as disaster recovery.
Ability to work in a variety of languages: Shell, IaC, Python, and SQL.
Be able to plan using your familiarity with agile methodologies; use epics and issues to drive projects.
Personal and team workload organization and ability to self-organize and accomplish tasks asynchronously.
Contributing to Newsela architecture diagrams, process diagrams and runbook documentation.
Completing Root Cause Analysis (RCA) investigations and performing readiness reviews.
Improving team practices through code reviews, handoffs of work, and incidents.
Self-awareness, handling conflict in the team, providing and receiving feedback, and maintaining good relationships with other engineering teams.
Willingness to proactively step in and do the right thing while providing candid and constructive feedback.

Why you’ll love working at Newsela:

Health & Wellness: Access to the world’s leading medical experts for healthcare (pets included!). Discounts and resources to stay healthy: mind, body, and soul.
Work From Home: Almost all of our roles are fully remote - tech stipend included!
Supporting ALL Families: Supplemental programs and time off to take care of your family and yourself.
Time Off: Flexible PTO to recharge, including Sabbatical Leave
Professional Development: Annual stipends for continued learning and education
Make A Difference: No matter your role or department, the work you do each day helps shape the future of education and improves the lives of students and teachers.

Base Compensation: $95,000 - $105,000. Total compensation for this role also includes incentive stock options and benefits. This compensation range may be adjusted based on actual experience.

See more jobs at Newsela

Apply for this job

+30d

Senior Site Reliability Engineer (SRE)

CLEAR - CorporateNew York, New York, United States (Hybrid)

Design ● java

CLEAR - Corporate is hiring a Remote Senior Site Reliability Engineer (SRE)

Today, CLEAR is well-known as a leader in digital and biometric identification, reducing friction for our members wherever an ID check is needed. We’re looking for a Senior Site Reliability Engineer (SRE) to establish our SRE function. You will join us to accelerate building and scaling our innovative systems that support our growing identity platform. You will drive on SLOs, using them to find and fix gaps in our observability and our overall systems. You will lead reliability-focused practices such as load testing, capacity planning, game days, chaos testing, and incident post-mortems. You will work hand-in-hand with the Software Engineering and Product team on the design, architecture, and implementation of new systems and services.

What You Will Do:

Embed within an Engineering and Product pillar to deeply understand the product and implement observability across all key flows
Facilitate and build load testing cases, ensuring we understand the limits and scaling factors of our services and systems
Contribute to architecture and design of new services and systems, ensuring highly reliable and scalable concepts are implemented
Work closely with Infrastructure, Developer Experience, Networking, and other teams to ensure Product Engineering requirements are met on future roadmaps and technical implementations
Build and lead practices such as game days, chaos engineering, and failure analysis
Build long-term capacity plans, with an eye toward reliability and cost-efficiency

Who You Are:

A software engineer who has worked as an embedded Site Reliability Engineer
Experience writing production-grade software in a modern language, such as Java
Strong knowledge of distributed systems concepts (think CAP theorem), microservices architecture, and distributed tracing
Experience with modern observability systems such as Datadog
Experience with performance debugging tools and patterns. You should be able to read a flame graph
A strong product and user-centric mindset
Desire to continuously improve systems and environments

How You'll be Rewarded:

At CLEAR we help YOU move forward - because when you’re at your best, we’re at our best. You’ll work with talented team members who are motivated by our mission of making experiences safer and easier. Our hybrid work environment provides flexibility. In our offices, you’ll enjoy benefits like meals and snacks. We invest in your well-being and learning & development with our stipend and reimbursement programs.

We offer holistic total rewards, including comprehensive healthcare plans, family building benefits (fertility and adoption/surrogacy support), flexible time off, free OneMedical memberships for you and your dependents, and a 401(k) retirement plan with employer match. The base salary range for this role is $175,000 - $215,000, depending on levels of skills and experience.

The base salary range represents the low and high end of CLEAR’s salary range for this position. Salaries will vary depending on various factors which include, but are not limited to location, education, skills, experience and performance. The range listed is just one component of CLEAR’s total compensation package for employees and other rewards may include annual bonuses, commission, Restricted Stock Units

About CLEAR

Have you ever had that green-light feeling? When you hit every green light and the day just feels like magic. CLEAR's mission is to create frictionless experiences where every day has that feeling. With more than 22 million passionate members and hundreds of partners around the world, CLEAR’s identity platform is transforming the way people live, work, and travel. Whether it’s at the airport, stadium, or right on your phone, CLEAR connects you to the things that make you, you - unlocking easier, more secure, and more seamless experiences - making them all feel like magic.

CLEAR provides reasonable accommodation to qualified individuals with disabilities or protected needs. Please let us know if you require a reasonable accommodation to apply for a job or perform your job. Examples of reasonable accommodation include, but are not limited to, time off, extra breaks, making a change to the application process or work procedures, policy exceptions, providing documents in an alternative format, live captioning or using a sign language interpreter, or using specialized equipment.

See more jobs at CLEAR - Corporate

Apply for this job

+30d

Principal Site Reliability Engineer

BrightspeedCharlotte, NC, Remote

DevOPS ● Master’s Degree ● terraform ● ansible ● docker ● kubernetes ● AWS

Brightspeed is hiring a Remote Principal Site Reliability Engineer

Job Description

We are currently looking for a Principal Site Reliability Engineer to join our growing team. In this role, you will implement and maintain monitoring systems to track the performance and availability of business-critical systems and infrastructure using metrics to identify trends and potential issues. You will also work closely with development teams, operations, and other stakeholders to ensure that new services and features are reliable and scalable.

As a Principal Site Reliability Engineer, your duties and responsibilities will include:

Implement and maintain monitoring systems to track the performance and availability of Business-critical systems and infrastructure. Use metrics to identify trends and potential issues.
Respond to system outages and performance issues, performing root cause analysis to prevent recurrence
Develop scripts and tools to automate repetitive tasks, such as deployment, scaling, and monitoring
Work closely with development teams, operations, and other stakeholders to ensure that new services and features are reliable and scalable
Work on reducing latency and improving the speed of data transmission across the network
Define and measure Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to ensure services meet required performance and availability targets
Conduct postmortems after incidents to identify what went wrong and what can be improved
Work with Lead Application owners and internal Change Management to review code changes and support deployments
Lead the team of site reliability engineers onshore/offshore, mentor them for support activities required for system reliability
Must have ability to communicate and abstract the messaging to multiple target audiences including Sr business & IT leadership, technology, and business teams.

Qualifications

WHAT IT TAKES TO CATCH OUR EYE:

Master’s degree in computer science, telecommunications, or similar areas, with a minimum of 10 years software engineering experience, including a minimum of 5 years as a site reliability engineer
Proven track record of managing mission critical customer facing applications for reliability
5+ years of experience supporting operations and maintenance for cloud-native applications in production that are fault-tolerant, self-healing, scalable and highly available
Excellent troubleshooting and problem-solving skills, with a keen attention to detail to identify and resolve complex production issues
Deep understanding of cloud computing platforms (GCP) and containerization technologies (e.g., Docker, Kubernetes)
Solid experience with core Kubernetes concepts such as Pods, Workloads, Services, Ingress/Egress, Deployments, ConfigMaps, HPA, Liveness Probes, and Secrets
Strong knowledge of infrastructure as code tools (e.g., Terraform, Ansible, ArgoCD) and CI/CD pipelines
Strong experience integrating code quality tools (SonarQube or Checkmarx) with CI/CD pipelines
Strong experience with monitoring, logging, and observability tools such as Splunk, GCP logging, and Dynatrace
Ability to work independently and as part of a collaborative team, effectively communicating technical concepts to both technical and non-technical stakeholders
Must have proven written and verbal communication skills, including presentations using tools like PowerPoint
Must have ability to communicate and abstract the messaging to multiple target audiences including Sr business & IT leadership, technology and business teams

BONUS POINTS FOR:

Certifications such as Google Professional Cloud DevOps Engineer or AWS Certified DevOps Engineer

See more jobs at Brightspeed

+30d

Copy of Senior Site Reliability Engineer - Brazil

Podium - Remote, Brazil

Bachelor's degree ● terraform ● Design ● ansible ● azure ● ruby ● docker ● kubernetes ● linux ● python ● AWS

Podium is hiring a Remote Copy of Senior Site Reliability Engineer - Brazil

At Podium, our mission is to help local businesses win. Our lead conversion platform, powered by AI and integrations, helps local businesses convert leads faster, communicate easier, and make more sales. Every day, thousands of local businesses utilize our review management, communication, marketing, and payments products.

Our work and focus on helping local businesses thrive has been recognized across the industry, including Forbes’ Next Billion Dollar Startups, Forbes’ Cloud 100, the Inc. 5000, and Fast Company’s World’s Most Innovative Companies.

At Podium, we believe in fostering a culture that thrives on hiring and developing exceptional talent. Our operating principles serve as a compass, guiding daily behavior and decision-making, and ensure we hire people who will thrive at Podium. If you resonate with our operating principles and are energized by our mission, Podium will be a great place for you!

The Role:

A Site Reliability Engineer borders the worlds of software engineering and systems engineering. At Podium, the SRE team drives our products to success by building a stable, scalable, sustainable, and slick system. We permanently sit and sup with the product engineering teams to address all of their needs, and work as an SRE guild to build a world-class platform for our products to run on. We're currently targeting a senior SRE to come in and deliver impact from day one.

What you will be doing:

Work with the following technologies: Kubernetes, Helm, Docker, AWS, Terraform, Datadog, Prometheus, Ansible, StrongDM, Python, Go, Ruby, GitLab and GitLab CI.
Engaging with Podium's engineering community to identify potential areas of improvement or pain points and making Podium's systems safer and more pleasant to operate.
Participating in an on-call rotation for the services the team owns, triaging and addressing production as well as development issues.
Working cross-functionally with different teams to make sure that there is no downtime for our products.
Mentoring junior engineers on the team.

What you should have:

Bachelor’s degree in a technical field or relevant work experience.
4+ years of experience working alongside a production system in a software engineer or systems engineer role
3+ years deploying, operating and debugging server software on Linux
Curiosity and the desire to learn
Ability to take a rotating on-call shift

What we hope you have:

Experience with distributed systems and microservices
Practical knowledge of system design
Cloud computing, such as AWS, GCP, or Azure
SOC2, HIPAA, PCI, or other regulatory or compliance standards
Building and maintaining a CI/CD pipeline
Heavy Infrastructure experience

See more jobs at Podium

+30d

Junior Site Reliability Engineer (Azure)

Medfar - Montréal, Canada, Remote

DevOPS ● 2 years of experience ● terraform ● sql ● azure ● c++ ● .net

Medfar is hiring a Remote Junior Site Reliability Engineer (Azure)

Job Description

As a Junior Site Reliability Engineer (SRE), you will play a crucial role within the R&D and Innovation department. You will be called upon to collaborate with the Plexia product-aligned and core architecture teams. Because health and medical systems are highly sensitive, the availability and reliability of our systems are of paramount importance to MEDFAR.

The goal of the Site Reliability Engineering (SRE) team is to enable the Plexia team to deliver work with substantial autonomy. To that end, the SRE team collaborates with team members across the company to help them achieve better outcomes and to provide them with the necessary tools and technologies. As part of the SRE team, you will be joining the team accountable for the operation, resilience and backup of the organization’s tools, products, data and services.

What you will be working on:

Refining and extending current monitoring capabilities to track essential service-level indicators and ensure visibility of these metrics.
Improving our infrastructure and software by collaborating extensively with the core architecture and product-aligned teams to identify and deliver improvements that enhance site availability through scalable, secure, and resilient architectures.
Defining and executing test plans that aim to ensure the robustness and resilience of our infrastructure and software systems.
Managing incidents and emergency response, tracking outages, ensuring data integrity and participating in release management to promote safe, efficient and rapid deployments.

Qualifications

Contribute to our team with your strengths:

1-2 years of experience working in site reliability engineering-related projects (required) plus additional experience in system administration, DevOps or software engineering roles (an asset)
Knowledge of Microsoft Azure, specifically high-reliability architecture and security hardening.
Experience with CI/CD processes and Azure DevOps pipelines.
Proficient in PowerShell.
Experience with Windows and Network setup and management
Experience in C#, .NET frameworks, and SQL programming
Experience in SQL Database Management
Strong ability and rigor in documenting tasks and procedures with detail
Experience working with Terraform or another IaC framework, an asset
Bilingual (FR/EN). The ability to communicate in English is required as many team members are located in BC.

Working conditions:

Full-time permanent role, 40 hours per week schedule.
'Emergency working hours' may occasionally be necessary to ensure system stability and address critical issues promptly.
Flexibility in working hours is important to collaborate with team members in the Pacific Standard Time zone.

See more jobs at Medfar

+30d

Junior Site Reliability Engineer

Nextiva - Poland (Remote)

DevOPS ● sql ● oracle ● Design ● java ● linux

Nextiva is hiring a Remote Junior Site Reliability Engineer

Redefine the future of customer experiences. One conversation at a time.

We’re changing the game with a first-of-its-kind, conversation-centric platform that unifies team collaboration and customer experience in one place. Powered by AI, built by amazing humans.

Our culture is forward-thinking, customer-obsessed and built on an unwavering belief that connection fuels business and life: connections to our customers with our signature Amazing Service®, our products and services, and most importantly, each other. Since 2008, 100,000+ companies and 1M+ users have relied on Nextiva for customer and team communication.

If you’re ready to collaborate and create with amazing people, let your personality shine and be on the frontlines of helping businesses deliver amazing experiences, you’re in the right place.

Build Amazing - Deliver Amazing - Live Amazing - Be Amazing

We are looking for an Operations Site Reliability Engineer to enhance, support, and troubleshoot our SaaS and VoIP platforms for our Business Technology program. We’re looking for someone with a wide breadth of knowledge, experience, and interest in a range of technology domains. This role will ensure the continued stability of our production applications while improving automation, alerting, and monitoring. We deal with many different technologies; a desire to learn and a hunger to work on challenging projects are a must.

Key Responsibilities:

Triage, troubleshoot, and fix production problems in every layer of the stack, with a focus on Oracle and billing systems
Design, develop, improve, and tune logging, monitoring, and alerting
Create actionable alerts to fix system outages before they occur
Write software to improve reliability and recoverability of production systems
Identify manual work, document the fix in the form of a runbook, then automate it away
Perform and automate system administration tasks
Participate in 24/7 on-call rotation supporting production systems

Qualifications:

Bachelor’s degree in Computer Science or related field, or equivalent work experience
0-2 years of Oracle systems experience
0-2 years of software development experience
0-2 years of Linux system administration experience
0-2 years of performance engineering experience
Understanding and experience working with RESTful APIs
Experience with triaging and troubleshooting complex systems
Experience working with source control
Experience with containerization and container orchestration
Experience with application performance monitoring
Experience with web technology components including relational and SQL Databases, Apache, Tomcat, Java, packet monitoring
Experience with microservice environments and distributed systems
Familiarity with front-end technologies
Ability to clearly communicate technical concepts
Understanding of general SRE concepts and DevOps principles
Familiarity with SIP concepts and troubleshooting

Nextiva Core Competencies / DNA:

Drives Results: The successful candidate will be action oriented, with a passion for solving problems. They will bring clarity and simplicity to ambiguous situations. This individual will challenge the status quo; asking what we can do differently and finding ways to create and build more success. S/he is a change agent, prepared to lead and drive changes as we transform.
Critical Thinker: The successful candidate is fact based and data driven, able to understand and articulate the “why,” identifying key drivers and learning from the past. They are forward-thinking, anticipating problems before they arise. They’ll recommend and action well thought out solutions, understanding the risks and dependencies.
Right Attitude: The successful candidate will be team-oriented, collaborative and competitive with a winning mindset; they’re resilient and able to easily bounce back from setbacks. S/he will be able to zoom in / out, willing to be hands-on to help solve important problems while being a motivating figure for the team along the way. S/he will embrace a culture of service and learning with a focus on caring, supporting and respecting our customers and team members.

Rewards & Benefits:

Nextiva provides a comprehensive employee benefits package that includes a highly competitive salary, medical and life insurance after probation, paid parental leave as per Company policy, employee recognition initiatives, various employee wellness programs and loads of learning and development opportunities, coupled with career paths to last a lifetime. The opportunity to work and build a career in an international environment is complemented by a friendly atmosphere and a professional team.

+30d

Junior Site Reliability Engineer

Nextiva - United States (Remote)

DevOPS ● sql ● oracle ● Design ● java ● c++ ● linux

Nextiva is hiring a Remote Junior Site Reliability Engineer

Redefine the future of customer experiences. One conversation at a time.

We’re changing the game with a first-of-its-kind, conversation-centric platform that unifies team collaboration and customer experience in one place. Powered by AI, built by amazing humans.

If you’re ready to collaborate and create with amazing people, let your personality shine and be on the frontlines of helping businesses deliver amazing experiences, you’re in the right place.

Build Amazing - Deliver Amazing - Live Amazing - Be Amazing

Key Responsibilities:

Triage, troubleshoot, and fix production problems in every layer of the stack, with a focus on Oracle and billing systems
Design, develop, improve, and tune logging, monitoring, and alerting
Create actionable alerts to fix system outages before they occur
Write software to improve reliability and recoverability of production systems
Identify manual work, document the fix in the form of a runbook, then automate it away
Perform and automate system administration tasks
Participate in 24/7 on-call rotation supporting production systems

Qualifications:

Bachelor’s degree in Computer Science or related field, or equivalent work experience
0-2 years of Oracle systems experience
0-2 years of software development experience
0-2 years of Linux system administration experience
0-2 years of performance engineering experience
Understanding and experience working with RESTful APIs
Experience with triaging and troubleshooting complex systems
Experience working with source control
Experience with containerization and container orchestration
Experience with application performance monitoring
Experience with web technology components including relational and SQL Databases, Apache, Tomcat, Java, packet monitoring
Experience with microservice environments and distributed systems
Familiarity with front-end technologies
Ability to clearly communicate technical concepts
Understanding of general SRE concepts and DevOps principles
Familiarity with SIP concepts and troubleshooting

Nextiva Core Competencies / DNA:

Drives Results: The successful candidate will be action oriented, with a passion for solving problems. They will bring clarity and simplicity to ambiguous situations. This individual will challenge the status quo; asking what we can do differently and finding ways to create and build more success. They are a change agent, prepared to lead and drive changes as we transform.
Critical Thinker: The successful candidate is fact based and data driven, able to understand and articulate the “why,” identifying key drivers and learning from the past. They are forward-thinking, anticipating problems before they arise. They’ll recommend and action well thought out solutions, understanding the risks and dependencies.
Right Attitude: The successful candidate will be team-oriented, collaborative and competitive with a winning mindset; they’re resilient and able to easily bounce back from setbacks. They will be able to zoom in / out, willing to be hands-on to help solve important problems while being a motivating figure for the team along the way. They will embrace a culture of service and learning with a focus on caring, supporting and respecting our customers and team members.

Compensation, Rewards & Benefits:

The salary or hourly wage offered by Nextiva to external candidates considers a wide range of factors, including but not limited to skill sets, experience, training, licensure and certifications. Our compensation decisions are dependent on the facts and circumstances of each case. Our estimate of the expected hiring range for the position as posted is $57,000 - $84,650. A different level in the job hierarchy may apply to a specific candidate, resulting in a different hiring range.

Nextiva provides a comprehensive employee benefits package that includes medical (including supplemental plans for accident, hospitalization and critical illness), telemedicine, dental, vision, disability, life insurance, legal assistance, an Employee Assistance Plan, paid parental bonding leave, PTO for hourly employees and Flexible Time Off (FTO) for salaried employees, an employee long-term savings plan (401k) through Fidelity with Nextiva matching, comprehensive employee wellness programs and loads of learning and development opportunities which are coupled with career paths to last a lifetime.

Interested in joining our amazing team at Nextiva HQ? Apply today as we launch the future of business conversations!

Established in 2008 and headquartered in Scottsdale, Arizona, Nextiva secured $200M from Goldman Sachs in late 2021, valuing the company at $2.7B. To check out what’s going on at Nextiva, check us out on Instagram, Instagram (MX), YouTube, LinkedIn, and the Nextiva blog.

Nextiva is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees. We prohibit discrimination and harassment of any type without regard to race, color, religion, age, sex, national origin, disability status, genetics, protected veteran status, sexual orientation, gender identity or expression, or any other characteristic protected by federal, state or local laws. Nextiva participates in the E-Verify Program where and as required by law. For additional information about E-Verify visit USCIS.

See more jobs at Nextiva

+30d

Site Reliability Engineer - Brazil

Podium - Remote, Brazil

Bachelor's degree ● terraform ● Design ● ansible ● azure ● ruby ● docker ● kubernetes ● linux ● python ● AWS

Podium is hiring a Remote Site Reliability Engineer - Brazil

The Role:

What you will be doing:

Work with the following technologies: Kubernetes, Helm, Docker, AWS, Terraform, Datadog, Prometheus, Ansible, StrongDM, Python, Go, Ruby, GitLab and GitLab CI.
Engaging with Podium's engineering community to identify potential areas of improvement or pain points and making Podium's systems safer and more pleasant to operate.
Participating in an on-call rotation for the services the team owns, triaging and addressing production as well as development issues.
Working cross-functionally with different teams to make sure that there is no downtime for our products.
Mentoring junior engineers on the team.

What you should have:

Bachelor’s degree in a technical field or relevant work experience.
4+ years of experience working alongside a production system in a software engineer or systems engineer role
3+ years deploying, operating and debugging server software on Linux
Curiosity and the desire to learn
Ability to take a rotating on-call shift

What we hope you have:

Experience with distributed systems and microservices
Practical knowledge of system design
Cloud computing, such as AWS, GCP, or Azure
SOC2, HIPAA, PCI, or other regulatory or compliance standards
Building and maintaining a CI/CD pipeline
Heavy Infrastructure experience

See more jobs at Podium
