Site Reliability Engineer Remote Jobs

208 Results

4d

Site Reliability Engineer (Fully Remote)

agile, 3 years of experience, nosql, azure, api, ruby, linux, python, AWS, javascript

Latitude, Inc. is hiring a Remote Site Reliability Engineer (Fully Remote)

 

Salary: $71,000 - $91,000/year

We are looking for an experienced Site Reliability Engineer with operational and/or site reliability engineering background with a passion for providing superior system availability and customer experience. We are looking for candidates who can lead a 24/7 support organization, drive reliability and performance across a massive scale by mastering the full depth of the stack. As an SRE, you will have the opportunity to tackle complex problems of scale which are unique to tech companies while using your expertise in delivery and support of critical services.

Job Responsibilities:

  • Effectively manage troubleshooting and recovery of complex production incidents, ranging from low to critical impacts

  • Drive incident resolution through a systematic problem solving approach, coupled with a strong sense of ownership and drive

  • Actively participate in teams’ Agile stories (project work) to streamline and enhance day to day operations of the team

  • Create, manage and utilize appropriate technical procedural documentation (run books)

  • Proactively monitor all of the applications and infrastructure behind Capital One’s external and internal customer facing services including their availability, latency, performance, and capacity

  • Influence resiliency and scalability in production environments in Amazon Web Services (AWS)

  • Identify opportunities and develop proactive automated monitoring and alerting solutions by utilizing available tools (Splunk, DataDog, etc.)

  • Assist with conducting Root Cause Analysis (RCA) on critical production outages, develop and implement mitigation strategies

  • Utilize production support expertise to influence and support new designs, architectures, standards and methods maintaining stability and availability for large-scale distributed systems

  • Proactively identify and implement opportunities for automation of routine maintenance tasks, data gathering and resolution of common issues

  • Continuously seek to develop new skills and technical expertise, as well as proactively share knowledge with others

Basic Qualifications:

  • At least 3 years of experience in technology production support

  • Azure DevOps experience

  • 2+ years of experience with Linux, UNIX, python, Ruby, Go, JavaScript, or NoSQL

  • 2+ years of experience with AWS, Azure or GCP

  • 2+ years experience with web API services

  • 2+ years of experience with Splunk, New Relic, or DataDog monitoring and alerts

 


4d

Senior Site Reliability Engineer - Germany

Incopro, Remote
Design, openstack, kubernetes, AWS

Incopro is hiring a Remote Senior Site Reliability Engineer - Germany

Germany / Remote 

Corsearch’s solutions are revolutionizing how companies commercialize and protect their growth. Trusted by thousands of customers worldwide, Corsearch delivers data, analytics, and services that support brands to market their assets and reduce commercial risks.

 

From IP clearance to brand protection, Corsearch provides a comprehensive program that enables businesses to secure brand value and thrive commercially. Behind the world’s best-known brands, there’s Corsearch.

 

Corsearch has more than 1500 team members serving over 5,000 clients on five continents, and we’re growing and changing rapidly. We are a fantastic company to work for — with great benefits, growth opportunities, and a terrific internal culture — and we truly believe that it’s people who make us thrive.

 

Corsearch is growing fast and is always looking for new talented people to be part of the journey.

 

About the Position

  • Keep the customer-facing services available at top performance by maintaining the constant health of the supporting systems.
  • Own the incident response system to alert service owners when their services need their attention, thereby further enabling teams to own their code from their desktop to production
  • Problem Management - populate and participate in Root Cause Analyses (RCAs) and hand them off to the appropriate team
  • Ensuring that work carried out by the Site Reliability team is executed in such a way as to comply with the company’s internal compliance policy and directives
  • Improve the observability of the platforms to measure system health as well as see historic metrics to allow for faster diagnosis of pending issues or retroactive analysis of production issues.
  • Being available to discuss and resolve technical issues and escalations with other technical teams, with clear communication
  • Document, develop, and improve operational practices and procedures
  • Maintain configuration management and orchestration tooling.
  • Work with and lead other members of the team in staying on top of key industry innovation and technology, and assist in team development growth
  • Identifying work opportunities and preparing or assisting with the preparation of technical proposals as required
  • Ability to operate in a high-pressure environment, troubleshoot complex issues quickly, and successfully handle multiple priorities
  • Work to automate detection and resolution of recurring issues in the production environment

 

Requirements:

  • Experience with monitoring, logging, and alerting technologies: Datadog, CloudWatch, Grafana, Prometheus, ELK stack and related
  • Experience with software engineering and data structure principles and practices.
  • Experience with object-oriented and structured programming principles and practices.
  • Experience with distributed computing, storage, and networking design, monitoring and administration.
  • Experience with public cloud services including AWS, GCP, and Azure.
  • Experience with virtualization and containerization solutions such as OpenStack, VMware, Kubernetes, and Docker.
  • Experience with CI/CD tools, configuration management, and IaC.
  • Experience with application metrics, performance monitoring, and optimization.
  • Experience automating, maintaining, and improving systems and applications.
  • Strong ability to understand and translate technical needs into actionable solutions.
  • Proactive mindset with strong attention to details, patterns, and potential bottlenecks.
  • Provable success collaborating across teams and tiers within an enterprise organization.


4d

Sr. Site Reliability Engineer

GOGOX, Remote
git, kubernetes, AWS, backend

GOGOX is hiring a Remote Sr. Site Reliability Engineer

We’re looking for a Senior Infrastructure Engineer with a passion for developing and providing stable infrastructure for backend applications. In this role, you will work with modern infrastructure architecture, CI/CD flow build-out, SRE culture, IaC concepts, etc.

What you will do:

  • Maintain infrastructure stability and scalability.
  • Develop and maintain code for Kubernetes clusters.
  • Build and maintain CI/CD flows
  • Use services that public clouds provide to ensure our infrastructure stability. 

Who you are:

  • Experience with Git, Kubernetes, and building CI/CD flows
  • Familiar with public clouds like AWS, GCP, and Azure. 
  • Understanding the IaC concept and SRE culture will be a plus. 

What we offer

  • Clear growth path
  • Casual working environment
  • Hybrid work
  • A fast growing technology startup providing on-demand mobility solutions and more
  • A multi-cultural team
  • A software engineering team striving for technical excellence
  • A company that promotes learning, continuous improvement and personal growth

GOGOX is the first on-demand logistics and transportation platform in Asia. As a pioneer among tech and logistics startups, we transform the logistics industry, by making use of the trending sharing economy concept and embracing the beauty of simplicity and efficiency.

Over the years, GOGOX has expanded its business from Hong Kong to Singapore, South Korea, Mainland China, Taiwan and India and will continue to expand globally. If you share our vision and enjoy working in a creative, innovative and fun environment, apply to join our team and start your GOGOVanture today.


7d

Site Reliability Engineer

Verimatrix, Remote
4 years of experience, agile, mobile, AWS

Verimatrix is hiring a Remote Site Reliability Engineer

Verimatrix is seeking a talented and experienced engineer to join our Site Reliability Engineering (SRE) team. We are looking for someone who is passionate about SRE and can help evangelize proper practices and mindsets. This position will also help build out and refine our observability stack to provide actionable data to other teams, provide fine-grained metrics for our alerting and on-call management system, and take a more proactive approach to issues. Lastly, as a member of the SRE Team, this position will provide Tier 2 support and help improve the quality of our Runbooks. Bring your experience to help us prepare for scale by adopting industry best practices in availability, security, observability, reliability, and automation.

If this sounds like a challenge for you and you are a problem solver who loves collaboration, this position may be for you! We are operational, but now we need you to help us reach operational excellence. Be prepared to partner with other teams and collaborate across all functions in our organization. Learn what it is like to work in a company where transparency and visibility are valued. We encourage shared goals and objectives across teams. We care about our culture and our people.

Bring your SRE skills to Verimatrix and help us become proactive and able to anticipate problems rather than just be reactive. Solve hard problems with software and automation. Be part of a team and company who support each other and strive to have a positive impact on our customers.

What we are looking for:

  • Services with high availability and reliability
  • A strong culture of inclusion and forward thinking.
  • Better reporting and the ability to use data for decision making.
  • Improved monitoring and alerting
  • Improved and effective observability.
  • Common deployment practices
  • A blameless post-mortem culture where we are always reflecting and improving.
  • Movement toward industry best practices.

 

QUALIFICATION REQUIREMENTS: To perform this job successfully, an individual must be able to perform each essential duty satisfactorily.  The requirements listed below are representative of the knowledge, skill, and/or ability required.  Reasonable accommodations may be made to enable individuals with disabilities to perform the essential functions.

 

  • 2-4 years of experience working in a DevOps role
  • Preferably 1 year or more of experience working on an SRE team
  • Experience with AWS (EC2, EKS, Fargate)
  • Expertise in maintaining infrastructure as code (AWS CloudFormation or similar)
  • Knowledge of observability stacks (Grafana, Prometheus, AWS X-Ray, Datadog)
  • Experience with CI/CD and working in Agile environments
  • Familiarity with SLOs, SLIs, and Error Budgets
  • Experience working with teams in different countries
  • Demonstrated ability to collaborate
  • Strong communication skills

 

Verimatrix (Euronext Paris: VMX) helps power the modern connected world with security made for people. We protect digital content, applications, and devices with intuitive, people-centered, and frictionless security. Leading brands turn to Verimatrix to secure everything from premium movies and live streaming sports to sensitive financial and healthcare data, to mission-critical mobile applications. We enable the trusted connections our customers depend on to deliver compelling content and experiences to millions of consumers around the world. Verimatrix helps partners get to market faster, scale easily, protect valuable revenue streams, and win new business. To learn more, visit www.verimatrix.com.


12d

Site Reliability Engineer (SRE) (h/f)

api.video, Remote
terraform, Design, ansible, api, git, linux, python

api.video is hiring a Remote Site Reliability Engineer (SRE) (h/f)

Today, video accounts for over 80% of all internet traffic!

We are increasingly living in a video-first world where our online experiences are dominated by real-time, streaming, and on-demand video.

At api.video our mission is to connect people through their cameras and videos. We are a global API-first platform managing and delivering online video at scale and our goal is to become the standard for how modern teams bring video experiences into their products and services.

Just like Stripe for payments, Twilio for text/VOIP and Sendgrid for email; we're making video accessible to every client and developer via our api, the world over.

What's the opportunity?

As our company is scaling we’re looking to double the size of the team within the year.

We’re looking to add talented minds able to work during CET/CEST business hours.

In this role you’ll participate in the design, development, and operation of api.video's infrastructure, enabling developers and clients in more than 100 countries to quickly integrate all the features needed to deliver live or on-demand video into their applications and services.

A unique opportunity to be an early member of a success story. A welcoming and collaborative environment with people who love working on complex issues. With ambitious objectives, this role offers the opportunity to push your learning curve.

What will you be doing?

As a Site Reliability Engineer, you will be in charge of service continuity and focusing on creating and maintaining solutions to achieve that goal. You will be the owner of the reliability of the infrastructure stack.

Among the subjects you will be working on :

  • Constantly refining observability.
  • Automating incident remediation.
  • Design and implement automation of operating tasks.
  • Configuration, maintenance and administration of systems infrastructure including core services, internal tools, monitoring solutions and working with baremetal servers.
  • Work closely with other team members and cross team to ensure ease of operation and quality of service guarantee.
  • Participating in architecture, forecasting, and ensuring the success of projects.

What can you expect at api.video?

  • Global presence with an international working environment.
  • 100% Remote possible (offices in Bordeaux)
  • We offer competitive salaries.
  • Flexible timetable - we value results over presence.
  • Work in your preferred System and OS (Mac, Linux, Microsoft).
  • Your voice is valued and will count in our decision making.
  • Personal Growth. We invest in your career development; do you need books or to attend conferences? We got you covered!


15d

Site Reliability Engineer (SRE) (PeopleFluent) UK, Remote

LTG, Brighton, London, Sheffield, GB Remote
agile, Bachelor's degree, terraform, ansible, scrum, ruby, java, c++, elasticsearch, kubernetes, linux, jenkins, python, AWS, javascript

LTG is hiring a Remote Site Reliability Engineer (SRE) (PeopleFluent) UK, Remote

PeopleFluent is hiring! We have an exciting opportunity for a Site Reliability Engineer to join our Hosting team.

The ideal candidate will genuinely enjoy solving operational and development problems using the latest and greatest technologies / methodologies. We also need someone who knows how to play well with others (especially the super fun and interesting people we have on our team).

A little bit more about what we expect from a candidate …

  • Experience with automation such as Terraform and Ansible.
  • Experience with CI/CD tooling, e.g. Jenkins.
  • Experience coding in one or more programming languages.
  • Experience architecting and developing large scale systems both in Data Centers and in the cloud.
  • Experience with Kubernetes implementation and administration.
  • Experience with Linux systems and administration.
  • Experience debugging and automating routine tasks.
  • Experience using a systematic problem-solving approach and being able to effectively communicate with team members.
  • Ability to focus on highly portable common approaches that fit ‘the big picture’ and can work for many product lines and production environments

About You

We expect you to have at least 3 years of professional experience in Systems Administration, Applications Development, Software Engineering, and/or Configuration Management. At least 1 year of professional experience (or more!) as an SRE is highly desired!

We would like (but don't require) you to have:

  • Completed coursework in Computer Science; a Bachelor's Degree is a plus.
  • Advanced expertise with cloud computing platforms like Amazon Web Services; relevant AWS Certification (e.g. Developer, Solutions Architect, and/or SysOps Administrator) is a plus!
  • Advanced knowledge across all areas of network infrastructure in AWS (e.g. load balancers, subnets, gateways, NAT, bastion servers, SSL certs, DNS, etc.).
  • Advanced expertise with data centers and hybrid cloud approaches.
  • Advanced experience with web automation tools (e.g. Jenkins, Ansible, Selenium, Terraform, CloudFormation, etc.).
  • Advanced experience with CI/CD methodologies and tools (e.g. ArgoCD, etc.).
  • Advanced experience working with container orchestration (e.g. Kubernetes, ECS, etc.).
  • Advanced skills with scripting and development languages (preferably C#, Java, Python, Ruby, JavaScript, PowerShell, and/or Bash).
  • Experience with Applications, Systems, and Database Monitoring tools & resources (e.g. Elasticsearch, Prometheus, Grafana, etc.).
  • Experience working with Agile software development methodologies; expertise with Scrum and/or Kanban is a plus.
  • Excellent communication & interpersonal skills.

About the Company

PeopleFluent provides flexible cloud solutions that put learning at the heart of talent strategy. As a market leader in integrated talent management and learning solutions, PeopleFluent helps companies hire, develop, and advance a skilled and motivated workforce. Whether they're deployed separately or as a suite, our Recruiting, Onboarding, Performance, Succession, Compensation, and Learning solutions deliver a superior user experience that guides managers and employees with contextual learning – right in the flow of work!

PeopleFluent Learning is part of Learning Technologies Group plc (LTG).

For more information, please visit www.peoplefluent.com and/or www.ltgplc.com.

We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, disability, or any other federal, state or local protected class.


Tech9 is hiring a Remote Site Reliability Engineer (SRE) in Mexico

If you are a talented Site Reliability Engineer(SRE), this is the position for you! This is a great opportunity to work with a company that has a primary focus of making our customers happy by delivering value without all the burdensome policies and rules that have become typical for outsourced software development companies. If you work at Tech9, we will ensure that you are happy because at Tech9, we #techhappily! 

**Note: This role is 100% remote. You will not be required to come to the office.  

If that sounds attractive, please apply! We'd love to talk to you.

Minimum Qualifications:

  • 3+ years of experience as an SRE 
  • Comfortable looking at .NET code
  • Able to code scripts such as Powershell, Batch, AWS, 
  • Strong AWS knowledge and experience 
  • High intermediate level of English


19d

Site Reliability Engineer (Laravel/Vue/AWS/ECS) at

terraform, laravel, api, mysql, javascript, frontend, PHP

SportsRecruits is hiring a Remote Site Reliability Engineer (Laravel/Vue/AWS/ECS) at

About SportsRecruits

SportsRecruits is the leading sports recruiting network, connecting athletes, clubs, events, and college coaches in the recruiting process. The company’s network and tools are trusted by sports organizations such as the IWLCA, IMLCA, and Junior Volleyball Association. Every year, millions of connections are made on the network, resulting in commitments to the best academic and athletic institutions.

SportsRecruits is an equal opportunity employer and embraces diversity and equal opportunity on our team. Just like the student-athletes we support, we are trying to get better and stronger as a team every day. We are committed to building a team that represents a variety of backgrounds, perspectives, and skills. We strongly believe that the more inclusive our team is, the better we can serve all student-athletes, as well as their families and coaches, who are pursuing their dreams.

About the Position

We are a product development team full of fun, intelligent, happy, and hardworking engineers, designers and product managers distributed across the United States. We are profitable, funded and giving more high school athletes the ability to play college athletics than any other recruiting tool out there. Your input and coding/problem solving skills will make a direct impact in how we scale and grow the company.  

We are looking for an SRE to join our team working remotely. We are looking for someone who is a programmer first and is capable of debugging code, refactoring code and writing tests. Our two main technologies that we are investing most of our resources in are Laravel (a modern PHP framework), which we use for our API and Vue.js (javascript frontend framework), which currently powers the frontend application. 

As a Site Reliability Engineer on the Infra/DevOps team, you will take over setting up incident response protocols, the continuous integration pipeline, and performance indicators, reduce technical debt, and diagnose and identify issues coming from New Relic and Sentry.

You will spearhead SRE practices like: 

  • Track performance indicators ranging from Core Web Vitals to DB queries per transaction and failures per request, currently all monitored using New Relic and Sentry.
  • Set up performance targets (SLIs and SLOs) and identify the most impactful projects to improve performance metrics across the existing platform features.
  • Tackle technical debt projects which inhibit performance
  • Write scalable code and optimizations across Javascript, JS/CSS assets, CDN, PHP cycles, Memory usage and DB query performance 

You will aid with devops / infrastructure responsibilities: 

  • Help manage dev and production infrastructure currently running on ECS-Fargate and leveraging Terraform

Requirements: 

  • 5+ years of experience developing web based applications
  • Strong knowledge of OOP, refactoring, and unit testing
  • Advanced working knowledge of ORMs, MySQL and MySQL optimization
  • Comfortable with command line tools 
  • Experience debugging applications based on Javascript and PHP 
  • Laravel PHP experience is a big plus 
  • VueJS experience is a big plus  

Nice to have: 

  • Interest and some experience in DevOps, SRE, or related roles
  • Experience working with CI tools
  • Experience managing Infrastructure as code using Terraform or CloudFormation

What we offer:

It’s important to us that our team is happy, and we're always looking for ways to improve our overall work culture and support our employees’ well-being. Here are a few of the benefits we offer at this time:

  • Comprehensive medical, vision, and dental coverage
  • 401(k)
  • Unlimited time-off policy
  • Option to work remote or in our future office

This is a full time position available as remote or in NYC, no freelancers please.  Principals only, no recruiters please. 

 


22d

Site Reliability Engineer, evertz.io (Poland)

2 years of experience, agile, 3 years of experience, ui, scrum, java, typescript, linux, angular, jenkins, python, AWS

Evertz Microsystems Limited is hiring a Remote Site Reliability Engineer, evertz.io (Poland)

Skills and experience you will bring:

•    3 years of experience managing critical production infrastructure and maintaining reliability and uptime of serverless applications running on the cloud. 
•    2 years of experience with monitoring, log-aggregation, and observability services like Datadog, CloudWatch, Honeycomb, Splunk, and New Relic.
•    2 years of experience implementing and managing production CI/CD pipelines using modern deployment mechanisms such as blue/green deployment
•    2 years of experience translating SLOs and SLIs into actionable improvements. Reliability, monitoring, and observability are not just words to you.
•    Solid foundation in Linux systems administration, networking, and security. 

Additional skills and experience that will be useful:

•    Experience with security frameworks such as OWASP, ISO, CSA and PCI. 
•    Experience conducting threat assessments and creating remediation plans based on the results of threat assessments. 
•    Experience with penetration testing, threat modelling, open-source, and commercial security tools. 
•    Experience developing new deployment mechanisms for webapp infrastructure, such as: canary, A/B, blue/green, red-line and other deployment patterns 
•    Deep knowledge of performance tuning of core AWS services like Lambda, DynamoDB, API Gateway, SQS, EventBus, EC2
•    Experience with chaos engineering that pushes systems and products to their limits to see how they will respond to unexpected events. 


About the Role

The evertz.io Engineering Team builds next-generation systems for content management and distribution in the Media and Entertainment industry. Disney, NBCUniversal, Discovery, BBC, and many other content producers and publishers use our products and services to make the most of their file-based and live content for the least effort.

We work with high quality video in real-time and non-real-time scenarios across a wide range of cutting-edge tech. Specializations within the group span from low-level video manipulation and analysis, through back-end management and orchestration services, to web delivered UIs. Working in parallel with these teams is the Scientific Computing Group who work in computer vision, data science and machine learning, taking experiments in Jupyter notebooks through to deployment in production. This makes for a challenging and rewarding engineering experience of continual learning and plenty of opportunity to explore different parts of the stack.

Our technology stack includes a Serverless microservice architecture that capitalizes on the full breadth of AWS services, with code written in Python, Rust and Java. Our UI uses the latest versions of Angular, TypeScript and NgRx. Our CI/CD pipelines leverage AWS, Jenkins, Nexus, and Bazel, in addition to our in-house release-management application, to build and release hundreds of software components.

As a Site Reliability Engineer, you will join our talented and passionate team building evertz.io: a collection of services that will be used by the biggest names in the exciting broadcast and media industry. Our services are hosted in AWS, with a Serverless First mindset.

“Work is a thing you do, not a place you go”

We work in agile, low-bureaucracy, high-creativity, cross-functional teams spread across the world. It’s a highly creative work environment where we support your growth with opportunities for career progression, mentoring others and third-party education. The team is built on trust and is relaxed, open and welcoming to all, and there’s fun to be had with regular social events and sports teams.

Responsibilities
As part of this role, you will be expected to:
•    Establish and measure reliability goals like Uptime, Downtime, Mean time between failures, Mean time to resolution, etc.
•    Define operational maturity by defining and implementing SLIs and SLOs, enable faster detection and isolation of failures, and proactively work to mitigate them
•    Participate in an on-call rotation.
•    Participate in daily scrum standups, sprint planning, and other team rituals including retrospectives.
•    Implement and maintain CI/CD pipelines on AWS using CodeCommit, CodePipeline and CodeDeploy 
•    Evaluate, implement, and use various monitoring, log-aggregation, and observability services like AWS CloudWatch and Honeycomb to troubleshoot and resolve issues rapidly
•    Conduct and document root cause analyses (RCAs) and post-incident reviews.
•    Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding

Location
This role allows you to work with “Full Flexibility” - for any work where being physically close to fixed equipment is not a requirement, you have the option to work remotely.
Remote working is not the same as working from home, WFH is just one very common option. You can work from wherever gets the creative juices flowing: coffee shops, co-working places, the park, a different country even! Anywhere with Internet access.
Of course, working from an office is an option too especially if you’re craving some ad hoc in-person interaction! Evertz has offices in Canada, England, Scotland, India, Singapore, Hong Kong, Virginia, California, Arizona, Ohio, Hungary, Belgium, Poland and Australia. Many have great spaces for meet-ups as well as permanent or floating desk space.

Working Hours
This role allows you to work asynchronously meaning you can contribute at the times when you do your best work. Some people are early-birds, some are night-owls, maybe Saturday is better than Wednesday? Whilst some overlap for core meetings is needed, you don’t have to do your deep work between 9 and 5.

Salary & Benefits
We offer a competitive salary with annual performance-based bonus and stock option schemes. Benefits include a pension plan; an employer funded health and medical plan; life insurance plan; long term disability coverage; paid time off; an employee assistance program; and a discount platform. The availability and specifics of these benefits vary by location, details of which will be provided during the hiring process.

 


+30d

Site Reliability Engineer

Fivetran, Remote, Any, United States, AMER
terraform, salesforce, Design, ansible, azure, api, java, postgresql, kubernetes, linux, AWS, backend

Fivetran is hiring a Remote Site Reliability Engineer

Site Reliability Engineer at Fivetran (W13)
The global leader in modern data integration
Remote, Any, United States, AMER
Full-time
About Fivetran

Our mission is to save engineers from building in-house data pipelines, by building one automated data pipeline that everyone can use. Every single company that uses SaaS tools to run their business will eventually need to analyze the data that sits in those tools. Fivetran unlocks this data with automated connectors that convert messy, chaotic APIs into normalized, standard schemas.

About the role

From Fivetran’s founding until now, our mission has remained the same: to make access to data as simple and reliable as electricity. With Fivetran, customer data arrives in their warehouses, canonical and ready to query, with no engineering or maintenance required. We’re proud that more organizations continue to leverage our technology every day to become truly data-driven.

Fivetran is looking for a high-performance, experienced engineer to be a part of a team of Site Reliability Engineers. You will be working closely with engineering teams, product managers, as well as support and sales engineers to build the future of Fivetran Data Platform reliability.

As a member of the Site Reliability Engineering team, you will take ownership over the overall performance and reliability of Fivetran’s infrastructure, the robustness of the deployment pipeline, as well as timely and effective incident response and resolution. You will take responsibility for the growth and stability of Fivetran’s infrastructure, and be a key player driving effective incident response and overall issue avoidance.

  • Responsible for ongoing reliability and robustness of Fivetran’s production infrastructure by monitoring availability, capacity, and throughput
  • Evolve systems by adding reliability into our product roadmap
  • Coordinate the re-prioritization or fixing of critical bugs for support or sales requirements as needed
  • Recommend improvements to production infrastructure by interfacing with engineering to ensure 100% availability
  • Ensure scalable artifact deployment to all environments through automation scripts
  • Constantly monitor infrastructure vulnerabilities and remedy them by working with the security team

Minimum Requirements:

  • 1+ years of experience working on Site Reliability Engineering or DevOps
  • Working knowledge of Kubernetes and Terraform
  • Knowledge of one of the major cloud platforms, such as AWS, GCP, or Azure
  • Experience in Python/Shell scripting
  • Experience with Linux operating system internals and administration

Preferred experience:

  • Working experience in Golang
  • Experience with databases such as PostgreSQL
  • Knowledge of all three major cloud platforms (AWS, GCP, Azure)
  • Working with SaaS products at scale 
  • Bonus if you also have Java experience
  • Configuration management such as Ansible
  • CircleCI experience
  • Networking experience (VPC, VPN, reverse SSH…)

Perks and Benefits:

  • 100% paid Medical, Dental, Vision and Basic Life Insurance. Benefits begin on your first day!
  • Option of Health Savings Account (HSA) or Flexible Savings Account (FSA)
  • Generous paid time off (PTO) plus paid sick time, holidays, parental leave, and volunteer days off
  • 401k match program
  • Eligible donation match program
  • Monthly cell phone stipend
  • Work-from-home equipment reimbursement for your home office setup!
  • Professional development and training opportunities
  • Company virtual happy hours, free food, and fun team building activities
  • Pet Insurance -- and yes, you can bring your well-behaved fur babies to work
  • Commuter benefits to help with transit and parking costs
  • Employee Assistance Program (EAP)
  • Referral Bonuses
  • Stock equity -- every employee is granted stock options when they walk in the door   
  • Annual Camp Fivetran trip that brings together every employee from around the world

We’re honored to be valued at over $5.6 billion, but more importantly, we’re proud of our core values of Get Stuck In, Do the Right Thing, and One Team, One Dream. To learn more about Fivetran’s culture and what it’s like to be part of the team, click here and enjoy our video.

To learn more about our candidate privacy policy, you can read our statement here.

Technology

We've built a huge product with a small team by dividing our platform into simple, independent pieces and building our software in a disciplined, pragmatic way. We use Java, Google Cloud Platform, PostgreSQL, and React.

See more jobs at Fivetran

Apply for this job

+30d

Site Reliability Engineer (SRE) (PeopleFluent) US, Remote

LTG - Raleigh, NC Remote
agile, Bachelor's degree, terraform, ansible, scrum, ruby, java, c++, elasticsearch, kubernetes, linux, jenkins, python, AWS, javascript

LTG is hiring a Remote Site Reliability Engineer (SRE) (PeopleFluent) US, Remote

PeopleFluent is hiring! We have an exciting opportunity for a Site Reliability Engineer to join our Hosting team.

The ideal candidate will genuinely enjoy solving operational and development problems using the latest and greatest technologies / methodologies. We also need someone who knows how to play well with others (especially the super fun and interesting people we have on our team).

A little bit more about what we expect from a candidate …

  • Experience with automation such as Terraform and Ansible.
  • Experience with CI/CD tooling, e.g. Jenkins.
  • Experience coding in one or more programming languages.
  • Experience architecting and developing large scale systems both in Data Centers and in the cloud.
  • Experience with Kubernetes implementation and administration.
  • Experience with Linux systems and administration.
  • Experience debugging and automating routine tasks.
  • Experience using a systematic problem-solving approach and being able to effectively communicate with team members.
  • Ability to focus on highly portable common approaches that fit ‘the big picture’ and can work for many product lines and production environments

About You

We expect you to have at least 3 years of professional experience in Systems Administration, Applications Development, Software Engineering, and/or Configuration Management. At least 1 year of professional experience (or more!) as an SRE is highly desired!

We would like (but don't require) you to have:

  • Completed coursework in Computer Science; a Bachelor's Degree is a plus.
  • Advanced expertise with cloud computing platforms like Amazon Web Services; relevant AWS Certification (e.g. Developer, Solutions Architect, and/or SysOps Administrator) is a plus!
  • Advanced knowledge across all areas of network infrastructure in AWS (e.g. load balancers, subnets, gateways, NAT, bastion servers, SSL certs, DNS, etc.).
  • Advanced expertise with data centers and hybrid cloud approaches.
  • Advanced experience with web automation tools (e.g. Jenkins, Ansible, Selenium, Terraform, CloudFormation, etc.).
  • Advanced experience with CI/CD methodologies and tools (e.g. ArgoCD, etc.).
  • Advanced experience working with container orchestration (e.g. Kubernetes, ECS, etc.).
  • Advanced skills with scripting and development languages (preferably C#, Java, Python, Ruby, JavaScript, PowerShell, and/or Bash).
  • Experience with Applications, Systems, and Database Monitoring tools & resources (e.g. Elasticsearch, Prometheus, Grafana, etc.).
  • Experience working with Agile software development methodologies; expertise with Scrum and/or Kanban is a plus.
  • Excellent communication & interpersonal skills.

What we offer

In addition to vacation benefits, you will be eligible upon your date of hire to participate in our comprehensive benefits program which includes medical, dental, and vision insurance; we also offer HSA and FSA plans as well as life insurance offerings. Additionally, you will be eligible to participate in our 401(k) plan.

About the Company

PeopleFluent provides flexible cloud solutions that put learning at the heart of talent strategy. As a market leader in integrated talent management and learning solutions, PeopleFluent helps companies hire, develop, and advance a skilled and motivated workforce. Whether they're deployed separately or as a suite, our Recruiting, Onboarding, Performance, Succession, Compensation, and Learning solutions deliver a superior user experience that guides managers and employees with contextual learning – right in the flow of work!

PeopleFluent Learning is part of Learning Technologies Group plc (LTG).

For more information, please visit www.peoplefluent.com and/or www.ltgplc.com.

We are an Equal Opportunity Employer and do not discriminate against any employee or applicant for employment because of race, color, sex, age, national origin, religion, sexual orientation, gender identity, status as a veteran, disability, or any other federal, state or local protected class.

See more jobs at LTG

Apply for this job

+30d

Senior Site Reliability Engineer (APAC)

MariaDB Corporation Ab - Singapore, SG Remote
5 years of experience, terraform, mariadb, Design, azure, docker, kubernetes, linux, jenkins, python, AWS

MariaDB Corporation Ab is hiring a Remote Senior Site Reliability Engineer (APAC)

MariaDB is making a big impact on the world. Whether you’re checking your bank account, buying a coffee, shopping online, making a phone call, listening to music, taking out a loan or ordering takeout – MariaDB is the backbone of applications used every day. Companies small and large, including 75% of the Fortune 500, run MariaDB, touching the lives of billions of people. With massive reach through Linux distributions, enterprise deployments and public clouds, MariaDB is uniquely positioned as the leading database for modern application development.

The Opportunity

MariaDB is building a web-based management tool to help our customers easily configure and manage enterprise MariaDB configurations. This role will join an existing team to help build and accelerate the delivery of this product. This is a high impact role where you will have the opportunity to work on hundreds of clusters in a multi-cloud environment.

Responsibilities:

  • Design, develop, and test major features of our monitoring infrastructure
  • Interact with other product development groups to build new features and bring out business value for our customers
  • Work as part of a broader team to ensure the right and best product gets built
  • Support other divisions of MariaDB with technical skill and experience
  • Be part of the team

Requirements:

  • Minimum 5 years of experience as a DevOps engineer
  • Minimum 7 years of overall experience in software development
  • Experience running Kubernetes clusters in production
  • Experience supporting a PaaS, IaaS, CP, and/or Azure
  • Hands-on experience with technologies like Terraform
  • Excellent knowledge of Linux/Unix environments
  • Experience with Cloud Networking
  • Experience coding in one or more of the following languages: Go, Python, Bash
  • Experience with occasional on-call rotation

Nice To Have Experience:

  • Experience working with Kubernetes in production multi-cloud environments
  • Experience building complex CI/CD pipelines using Jenkins
  • Advanced experience with Google Cloud and AWS
  • Familiarity with the CNCF stack
  • Kubernetes certification
  • Docker certification
  • Google Cloud, AWS, and/or Azure certification
  • Experience with ServiceNow Platform
  • Networking knowledge/certification(s)

Location: APAC (Remote)

What’s in It for You?

Impact the world of technology by pushing the boundaries of technology and business models, working at MariaDB. Be part of a game-changing organization that encourages outside-the-box thinking, values empowerment, and is truly shaping the future of the software industry. You’ll be collaborating with high-caliber colleagues around the world, offering unparalleled learning and growth opportunities. We provide a very competitive compensation package, 25 days paid annual leave (plus holidays), stock options, a massive degree of flexibility and freedom, and more.

How to Apply

If you are interested in this position, please submit your application along with your resume/CV.

MariaDB does not sponsor work visas or relocation.

MariaDB is committed to providing any necessary accommodations for individuals with disabilities within our application and interview process. To request an accommodation due to a disability, please inform your recruiter.

MariaDB is an equal opportunities employer.

See more jobs at MariaDB Corporation Ab

Apply for this job

+30d

Senior Site Reliability Engineer

Convex - Remote or San Francisco, CA
B2B, oracle, Design, python, AWS, frontend

Convex is hiring a Remote Senior Site Reliability Engineer

Senior Site Reliability Engineer at Convex (W19)
Software for the commercial services industry.
Remote or San Francisco, CA / Remote
Full-time
About Convex

At Convex (YC W19), we’re building the leading B2B full-stack software platform for the $400bn+ commercial services market. It's a 100-year-old industry impacting millions of people every day. We already work with some of the largest enterprise companies in the sector and were one of the fastest growing companies in the Winter 2019 YC batch. Our team is a unique mix of industry veterans from Carrier, Siemens, and Honeywell as well as founders from MIT, Harvard, and Georgia Tech. Based in San Francisco, our investors include Emergence Capital, 1984 Ventures, UP2398, Liquid2 (Joe Montana), Y Combinator, the founders of PlanGrid, and others.

About the role

At Convex, we build the leading B2B platform for the fast growing commercial and building services industry. Our software provides rich data on every commercial property in the US (~63M) and workflow software built on top of that. For our users who serve these properties, that data and workflow become their secret weapon; there's nothing else like it available in the market today. Our customers rely on Convex to identify, win, and manage new growth opportunities.

We are based in, and love, the seven square miles of San Francisco, but our customers (and employees) live and work in almost every state in America. They include some of the largest enterprises in the country, like Siemens and Carrier, and smaller businesses we care just as deeply about.

The Product

Our flagship product, Atlas, is a “consumer-grade enterprise product.” Think Apple experience with Oracle utility. Atlas supercharges our users’ work by providing them with information on virtually every commercial property in the country. There is literally no other data source like this available anywhere. All that data is interesting, but it isn’t powerful unless you have the ability to work with it, which is why we are building a full suite of specialized software tools on top of it.

Your Role

  • Manage and monitor the infrastructure powering Atlas
  • Build tools and processes needed to scale our product, data pipeline, and our engineering and data teams
  • Be a core part of our development process, and work closely with other software engineers to continue growing Atlas both quickly and reliably

Requirements

  • 5+ years of professional software engineering, SRE and/or DevOps experience working on external-facing applications
  • Solid programming skills in Python
  • Advanced knowledge and experience with AWS
  • Have built robust, complete CI/CD pipelines from source to validation
  • Deep experience with monitoring tools and best practices
  • Strong understanding of Infrastructure as Code (IaC)
  • Confident running and debugging production systems
  • An advocate for security best practices

Nice to Have

  • You are interested in building tech, real estate tech, construction tech, and/or mechanical engineering, plumbing, electrical systems, security systems, fire and life safety systems, or building control systems
  • You stay up to date with technology and how to apply it to solve real-world business problems
  • You want to form the core of a growing engineering team and develop its culture

Benefits

It’s important to us that we provide our employees with meaningful benefits & resources that support them through every stage of their life & career with us, so we’ve built a robust wellness plan to do just that.

  • Generous employer contributions towards medical, dental, and vision insurance

  • Paid parental leave of up to 6 months with 100% pay

  • Flexible & generous time-off plans (including mental health days!)

  • Income protection through short-term and long-term disability plans

  • Tax-favored benefits such as retirement savings plans and flexible spending accounts

  • Commuter programs

  • Healthy lunch, drink, and snack options at our corporate office

  • Flexible hybrid & remote work options

About Convex

At Convex (YC W19), we’re building the leading B2B full-stack software platform for the $400bn+ commercial services market. It's a 100-year-old industry impacting millions of people every day. We already work with some of the largest enterprise companies in the sector and were one of the fastest growing companies in the Winter 2019 YC batch. Based in San Francisco, our investors include Fifth Wall, Emergence Capital, GGV, 1984 Ventures, UP2398, Liquid2 (Joe Montana), Y Combinator, the founders of PlanGrid, and others.

At Convex, we welcome diverse perspectives and people who think rigorously and aren't afraid to challenge assumptions. Join us!

Convex is an equal opportunity employer and values diversity at our company. We do not discriminate on the basis of race, religion, color, national origin, gender, sexual orientation, age, marital status, veteran status, or disability status.

If you need assistance or an accommodation due to a disability, please let your recruiter know.

Technology

Convex is looking for engineers to help our customers serve an under-the-radar but massive and ubiquitous industry: commercial building services. Our customers service the systems that provide the air we breathe and the water we drink, as well as the lighting, safety, and security systems that power daily life for billions of people.

We are based in, and love, the seven square miles of San Francisco, but our customers are in every corner of America. We found our foothold in small to medium sized businesses, and have quickly been pulled up-market to some of the largest enterprises in the country, like Siemens and Carrier. We have shipped an impressive amount of product with a lean team, but now we have to scale to meet the demands of our customer base which is growing in both size and sophistication.

That is why we need you.

See more jobs at Convex

Apply for this job

+30d

Senior Site Reliability Engineer (Backend Platform team)

doxy.me - Remote
terraform, ui, scrum, UX, git, docker, typescript, kubernetes, AWS, backend, Node.js

doxy.me is hiring a Remote Senior Site Reliability Engineer (Backend Platform team)

Help us build meaningful software in healthcare used by doctors, patients, and researchers worldwide. 


Our Company

Doxy.me is the simple, free, and secure telemedicine solution used by over 1 million healthcare providers worldwide. Our mission is to eliminate barriers to telemedicine like cost and accessibility, so we are constantly striving to make doxy.me more accessible to everyone, everywhere. With over 350,000 telemedicine calls made through our platform every day, there are millions of people relying on us to simplify their healthcare services.


Our Culture

  • Authentic. We are sincere and care personally. We don't let egos get in the way; getting to the right answer is more important than being right. We aren’t afraid to challenge someone directly, but not like a jerk. We focus on doing the right thing; we are the type of person who always takes the shopping cart back.
  • Bright. We use our innate intelligence, talent, and curiosity to create simple, innovative, world-class solutions to problems. We just "get it". We are constantly seeking to increase our own brightness through self-improvement and combining our brightness with others.
  • Effective. We are hungry self-starters who will get the job done regardless of circumstances. We don't need to be managed or told what to do. We are reliable and pride ourselves in producing high-quality, world-class results on time.


Overview

We are seeking a Senior Site Reliability Engineer motivated by unique, interesting, meaningful challenges in the healthcare sector. We help doctors provide remote medical care, researchers collect structured data, the general public understand personal risks of disease, and much, much more.

What Will You Do

  • Build highly scalable infrastructure, automate continuous integration and deployments, and collaborate closely with engineering teams to improve a globally distributed cloud-based infrastructure, powering the entire doxy.me platform
  • Partake in the whole cycle of feature development, from ideation to the actual delivery
  • Conduct high-level root-cause analysis for service interruptions and establish preventive measures
  • Participate in the LeSS process and Scrum ceremonies

Our Expectations

  • Extensive experience as a software engineer building globally available, scalable cloud-native infrastructure
  • Strong cloud experience with AWS, Docker, and infrastructure-as-code (Terraform)
  • Hands-on experience with CI/CD processes, such as building/bundling and deploying code with technologies like Docker, GitLab CI/CircleCI, etc.
  • Comfortable with error reporting and monitoring tools like DataDog
  • An eye for detail when it comes to UX and UI
  • Knowledge of Git or other version control systems, preferably experience with Gerrit
  • Familiarity with Kubernetes, Helm, and microservice architecture

Nice to have

  • Background writing backend systems in Node.js and TypeScript
  • Experience working with databases (MySQL/PostgreSQL)

Quick Info

  • Benefits: 20 days paid time off, sick leave, flexible public holidays, extensive educational program, Macbook, remote working environment
  • Doxy.me tech stack: 
    • React, Node.js, TypeScript, WebRTC, Loopback 4, AWS, Kubernetes, Docker, AngularJS
    • 3rd party: Vonage, Pubnub, Segment, Twilio, Stripe
  • Our products: 
    • Doxy.me: The simple, free, and secure telemedicine solution currently used by over 700,000 doctors worldwide and helping over 500,000 patients/day.
    • dokbot.io: Patient-focused data collection for healthcare.
    • Adhere.ly: A digital adherence tool for providers and their patients.
    • ItRunsInMyFamily.com: Using health history to identify the risks of cancer and other diseases that run in families.
  • Our team: technologists, academics, researchers, and innovators from all over the world. English is the language used in all internal communication.
  • To ensure HIPAA compliance we perform background checks after extending a job offer.

See more jobs at doxy.me

Apply for this job

+30d

Site Reliability Engineer, UK

UJET - United Kingdom Remote
terraform, Design, azure, api, ruby, docker, kubernetes, linux, python, AWS, Node.js

UJET is hiring a Remote Site Reliability Engineer, UK

About Us

UJET is the world’s first and only cloud contact center platform for smartphone-era CX. By modernizing digital and in-app experiences, UJET unifies the enterprise brand experience across sales, marketing, and support, eliminating the frustration of channel switching between voice, digital, and self-service for consumers. Offering unsurpassed resiliency and the flexibility to deploy across leading public cloud infrastructures, UJET powers the world’s largest elastic CCaaS tenant at up to 22,000 agents globally and is trusted by innovative, customer-centric enterprises like Instacart, Turo, Wag!, and Atom Tickets to intelligently orchestrate predictive, contextual, conversational customer experiences.

Opportunity

We are looking to add a Site Reliability Engineer to our growing engineering team! The SRE teams own UJET’s cloud based infrastructure and scaling. We work closely with the Security team and other Engineering teams. Our ideal candidate is an experienced SRE who has built and maintained cloud infrastructure at scale and has meticulous code style and quality.

Responsibilities

  • Design, build and maintain critical cloud-based systems (on platforms such as GCP, AWS, and Azure)
  • Monitor site stability, performance, and security using common Site Reliability Engineering practices
  • Plan upgrades for scaling, capacity, API performance in a complex multi-tenant environment
  • Improve deployment, management, and scalability of our services
  • Champion the implementation of processes to improve visibility across the entire technology stack
  • Document system design and procedures
  • Provide clear status updates on projects in a timely manner
  • Participate in monthly on-call duties
  • Participate in weekly meetings as required

Requirements

  • BS in Computer Science, or equivalent experience
  • Strong programming and/or scripting skills in any of Python, Go, Node.js, Ruby
  • Strong experience with Terraform or other Infrastructure as Code tools
  • Solid understanding of Linux containerization with Docker
  • 4+ years production experience with one or more public Cloud providers (AWS/GCP/Azure)
  • 2+ years production experience with Kubernetes (both operational and application design)
  • Experience with Prometheus / New Relic for monitoring and dashboards
  • Proficiency with Linux system administration
  • Strong Networking skills as they pertain to Cloud/Kubernetes infrastructure
  • Experience with test automation and CI/CD, such as GitOps
  • Understanding of Kafka from an Operational perspective
  • Desire to automate everything
  • Knowledge of best practices related to security, performance, and disaster recovery
  • Intellectual curiosity that motivates you to keep on top of technical trends
  • Highly organized, with the ability to juggle many tasks without losing sight of the highest priority items
  • Stay focused under pressure, prioritizing and managing multiple projects simultaneously in a very fast-paced environment
  • Extremely detail oriented, organized, a self-starter
  • Demonstrate high ownership and ability to drive issues to resolution
  • Excellent communication skills, both written and verbal
  • You are self-motivated with the ability to work independently and in globally distributed teams
  • You are service-oriented and enjoy working with engineers to make the software development process as painless as possible, providing continuous improvement

UJET is an Equal Opportunity Employer

Research shows that while men apply to jobs when they meet an average of 60% of the criteria, women and other marginalized folks tend to only apply when they check every box. So if you think you have what it takes, but don't necessarily meet every single point on the job description, please still get in touch. We'd love to have a chat and see if you could be a great fit. (Thanks to CultureAmp, who came up with this statement; it’s too good and too important not to repeat.)

Compliance Responsibilities

Security, data protection and compliance (SDPC) are paramount to the success of our partnerships. All roles at UJET require compliance with legal and regulatory requirements and acceptance and adherence to all policies and standards within UJET. Personnel acknowledges they are personally responsible for reporting any suspected violations or abuse and are required to complete SDPC training and fulfill role-specific SDPC responsibilities.

Why UJET?

In addition to our great team and disruptive technology, we offer our teammates a competitive compensation and benefits package, work/life balance, unlimited vacation, stock options, monthly game nights, and more!

See more jobs at UJET

Apply for this job

+30d

Staff Site Reliability Engineer

RevenueCat - Remote, Americas or EMEA
remote-first, terraform, Design, mobile, api, postgresql, python, AWS, backend, frontend

RevenueCat is hiring a Remote Staff Site Reliability Engineer

Staff Site Reliability Engineer at RevenueCat (S18)
$218k - $245k
Developer tools to easily build in-app purchases and subscriptions.
Remote, Americas or EMEA / Remote
Full-time
6+ years
About RevenueCat

RevenueCat is a simple API for developers to manage subscriptions. We provide all the infrastructure needed for app developers to build, analyze and grow their subscription business.

About the role

About us:

RevenueCat makes building, analyzing and growing mobile subscriptions easy. We launched as part of Y Combinator's summer 2018 batch and today are handling more than $1.2B of in-app purchases annually across thousands of apps.

We are a mission driven, remote-first company that is building the standard for mobile subscription infrastructure. Top apps like VSCO, Notion, and ClassDojo count on RevenueCat to power their subscriptions at scale.

Our 40 team members (and growing!) are located all over the world, from San Francisco to Madrid to Taipei. We're a close-knit, product-driven team, and we strive to live our core values: Customer Obsession, Always Be Shipping, Own It, and Balance.

We are looking for a Senior Site Reliability Engineer to help design, build and support reliable core systems and infrastructure. We drive cross-team collaboration to improve scalability and end-to-end reliability. Our SDK is shipped on over 10k apps, and our APIs receive more than 20 billion requests per month. Our stability affects the experience of millions of users.

We want to bring somebody onboard that is passionate about reliability, scalability and understanding the limits of computers and people. This person should be excited about all the technical challenges we will face growing our API throughput to millions of requests per minute.

About you:

  • You have 8+ years of experience designing and maintaining complex/large/growing systems.
  • You collaborate well with others, and can communicate effectively in a fully-remote culture.
  • When reviewing new system designs or code, you naturally think about what can go wrong: edge cases, failure modes, bottlenecks, migrations, releases, interesting metrics, etc.
  • You love debugging and finding the root cause of production issues.
  • You can't sleep if something doesn't have enough metrics to ensure everything is working properly.
  • You are proactive: when you see something broken, you jump on it to fix it or suggest improvements.
  • You move fast, test and iterate quickly.
  • You love the Linux/Unix shell, but hate manual processes and love to automate all the things.

Preferred Experience:

  • Experience with AWS cloud, Terraform, Prometheus and PostgreSQL
  • Experience with highly available, high-throughput, REST APIs
  • Solid knowledge of Python

In the first month, you'll:

  • Meet frequently with your team and mentor to get up to speed
  • Set up: familiarize yourself with repositories, task management, and the dev environment
  • Implement and ship your first project
  • Familiarize yourself with the RevenueCat dashboards, logging, debugging tools, cloud providers, infrastructure management and general architecture
  • Familiarize yourself with workflows and subscription business concepts.

In the first three months, you'll:

  • Be able to scope and work on tasks self-sufficiently.
  • Start on-call training
  • Participate in code reviews and contribute in other ways to improve reliability and quality of services

In the first six months, you'll:

  • Contribute to risk assessment, disaster planning and response strategies
  • Be obsessed with our uptime
  • Detect our blind spots and add observability to mitigate them
  • Work closely with product engineers to design reliable rollouts of new features
  • Review code, proposals and participate in architectural discussions.

In the first twelve months, you'll:

  • Know all the major components of our system and be able to debug complex issues
  • Have your own initiatives for improving the services and our infrastructure
  • Be able to spec and architect medium-large projects, gather feedback and design validation and rollout plans.
  • Mentor other engineers
  • Influence the org to improve general reliability, scalability and performance

What we offer:

  • $218,000 to $245,000 USD salary regardless of your location
  • Competitive equity in a fast-growing, Series B startup backed by top tier investors including Y Combinator
  • 10 year window to exercise vested equity options
  • Fully remote work environment that promotes autonomy and flexibility
  • Suggested 4 to 5 weeks time off to recharge and focus on mental, physical, and emotional health
  • $2,000 USD to build your personal workspace
  • $1,000 USD annual stipend for your continuous learning and growth
Technology

We have an API, a web dashboard, and a proliferation of mobile SDKs.

The API is Flask + PSQL, the web dashboard is a React app, and the mobile SDKs are written in whatever language the target platform is.

Our API has to handle a massive number of requests, and there are going to be many interesting scaling problems for us in the future.

On the mobile SDK side, providing sane, native-feeling SDKs for many platforms is a great challenge. It is a great opportunity for a polyglot who cares about developer experience.

+30d

Site Reliability Engineer - Remote

VetCentric, Washington, DC, Remote
agile, Master’s Degree, oracle, Design, azure, AWS

VetCentric is hiring a Remote Site Reliability Engineer - Remote

About Us:

VetCentric is focused on delivering outstanding services to the federal government. We have extensive experience in the fields of cyber security, supply chain & logistics management, strategy, business analytics, and IT services such as system design, continuous improvement, virtualization, and data center management. VetCentric is an SBA certified HUBZone company and VA CVE certified Service-Disabled Veteran Owned Small Business (SDVOSB). We operate in 15 states with offices in Washington DC and Northern Virginia.

Perks Working with Us:

  • Competitive compensation
  • Comprehensive health, vision, dental benefits
  • 15 days leave and 11 days of paid Federal Holidays  
  • 401(k) with matching plan
  • Annual training budget
  • Fantastic company culture

Location(s): Anywhere, US. Candidates from HUBZones preferred.

Employment Eligibility: Eligible to work for any employer in the United States without requiring sponsorship. Sponsorship is not available currently.

As a Site Reliability Engineer (SRE) on our team, you will use your monitoring subject matter expertise to improve the reliability of the VA’s applications through enterprise monitoring tools. You will be responsible for determining why an application covered by enterprise monitoring still experienced a high priority incident (HPI) or a critical priority incident (CPI). You'll work with the Enterprise Command Center’s (ECC) Business Line Management (BLM) Teams, the ECC Event Management (EM) Team and the Enterprise Command Operations’ (ECO) Incident Management Team to detect, investigate, and diagnose monitoring problems and defects across enterprise-level applications and technology stacks. This position is on a team dedicated to providing recommendations and instrumenting the approved recommendations in ECC’s monitoring tools to improve VA enterprise reliability and the quality of services provided to veterans. The ECC monitoring tools in focus are Splunk Enterprise/ITSI, AppDynamics, DynaTrace, SolarWinds, ScienceLogic and Aternity. You will work with system and application owners to understand existing design and functionality, apply your understanding of workflow systems and application processes across multiple system environments, and work across technology and development teams to diagnose outages caused by inadequate monitoring instrumentation and recommend changes to increase reliability.

You Have:

  • 6+ years of monitoring and troubleshooting experience with two or more of the following monitoring tools: AppDynamics, DynaTrace, Splunk/ITSI, SolarWinds, ScienceLogic or Aternity
  • 8+ years of experience working with key indicators for IT system operability, reliability, application performance and code quality
  • 8+ years of experience deploying, maintaining and troubleshooting complex applications at an enterprise scale while working with cross-functional teams
  • Experience in one or more Technology Areas (Network, Windows, Desktop, Unix/Linux, AWS or Azure Cloud, WebSphere Middleware, Java/JS Development, Microsoft or Oracle Database)
  • 1+ years of experience in service virtualization, AWS or Azure Cloud technologies, and SaaS and PaaS implementation.
  • 2+ years of experience leading teams
  • Experience with using Microsoft Office, including Word, Excel, and PowerPoint
  • Ability to work independently with little supervision
  • Master’s Degree in Computer Science, Engineering, or Equivalent and 10 total years of experience; or 20 total years of experience in lieu of a degree

Nice If You Have:

  • Experience with test-driven development, distributed systems, microservices and cloud-native application implementation
  • Experience with the following tools: Oracle Enterprise Manager, Power BI and ServiceNow
  • Possession of excellent written and verbal communication skills
  • Possession of strong critical thinking and error assessment capabilities
  • Experience working in an Agile framework such as Kanban or Scrum
  • Public Trust Clearance

+30d

Associate Site Reliability Engineer

IFS, Itasca, IL, USA, Remote
Commercial experience, jira, sql, oracle, azure, java, c++, linux, AWS

IFS is hiring a Remote Associate Site Reliability Engineer

Company Description

At IFS you will work in a growing, global enterprise software company built upon committed and empowered colleagues who come to work knowing they are making a difference. We work every day with customers who continue to challenge their markets and competitors. As a challenger ourselves, we partner with our customers to guide them through their digital transformations and extract the most value out of our software solutions. We take pride in ensuring that our employees are able to achieve the company goals as well as develop their careers. We believe empowered autonomy, committed colleagues and being part of a winning team are the keys to our success and what makes us great! We are #ForTheChallengers and if that resonates with you, we would love to hear from you!

Job Description

Associate Site Reliability Engineer (SRE – US ITAR) 

United States: Remote job role

The IFS Associate Site Reliability Engineer exists within the global Cloud Operations organization. The role forms part of a team, which reports into a Cloud Services manager who is responsible for the operational and people management aspects of the team. The team provides 24x7x365 operations support to the IFS customer base who have subscribed to the IFS Cloud Services for ITAR (International Traffic in Arms Regulations). The role handles multiple aspects of incident, service request, problem, and change management, as well as working with multiple internal and external stakeholders related to Cloud Services for ITAR.  At times, the need to aid other areas of the global Cloud Services team will also be necessary.

Although not a role with people management duties, the selected individual will typically have an area(s) of technical expertise that not all members on the team share. Mentoring, handling escalations, writing documentation, promoting best practices, and taking a primary role in shared team initiatives will be required. At times, working with other members of the larger Cloud Operations organization, Application Support, R&D, Consulting, other groups within IFS, as well as external vendors will also be required.

Work performed is subject to ITAR compliance.  Strict adherence to established processes is critically important to executing job responsibilities for maintaining compliance.

 

Key Duties

  • Manage an incoming queue of cases, incidents and service requests within SLA, OLA and KPI targets
  • Support the event management team and their work to enhance the related event processes and tools.
  • Support the triage team in their work to assess and correctly route incoming incidents and service requests
  • Work with other Service Center functions and appropriate stakeholders to resolve long running, complex or major incidents
  • Deliver a top tier customer experience through clear communication, precise management of expectations and good customer focused service delivery
  • Lifecycle management (creating, updating, deleting, etc.) of Knowledge Articles, FAQs, SOPs and Job Aids for the documentation library
  • Work with the Problem Management team to perform and provide Root Cause Analysis activities for customer incidents (including postmortems, incident timelines, and identifying and implementing corrective actions)
  • Support the implementation of corrective actions from the problem management process
  • Manage scheduling of future dated activities while understanding the time specific resource limitations of the team
  • Provide ongoing feedback to improve the service request process
  • Support the automation team in creating and enhancing the tooling and documentation for standard service requests
  • Support the change management process across the service
  • Support the supplier management process across the service
  • Perform operational items within the service transition process for new and updated products
  • Work with other Service Center functions to define and produce various internal and customer reports on a recurring and ad-hoc basis

 

Personal Abilities

  • Ability to manage own time efficiently and effectively
  • Ability to work to deadlines and targets
  • Flexibility to work to deadlines and needs of the role
  • Ability to work in international, multi-discipline, cross-functional teams
  • Proactivity in all aspects of the technical and team role
  • Ability to mentor and act as a positive role model for other team members
  • Excellent verbal and written communication skills in English
  • Ability to read and understand technical documentation
  • Ability to convey ideas and needs to technical and non-technical audiences
  • Problem-solving skills and the ability to change approach based on information gathered during the process
  • Effective use of multiple types of resources (Knowledge Base, internal Subject Matter Experts, vendor-specific resources, Internet-based resources, documentation, teams) to identify and resolve support cases
  • Strong organizational skills and ability to multi-task
  • A positive team player with a can-do attitude
  • Proactivity and ownership of work items in all aspects of the technical and team role
  • Ability to self-learn and quickly understand new and changing technologies in a fast-moving service driven technology landscape

 

Experience

  • Mandatory
    • Experience in cloud computing services, enterprise IT service delivery or an SRE role
    • Demonstrated knowledge of cloud computing services or IT service management methodologies and best practices
    • Experience in a modern ticket/service desk tooling such as ServiceNow, Jira Service Desk, or a similar tool
    • Experience of 24x7 service delivery in an SLA/KPI driven environment
  • Optional Value Add
    • Experience in ITIL, ISO 20000, or a similar service delivery framework
    • Experience in the provision of cloud computing services or IT service delivery

 

Technical Skills

The successful candidate must have commercial experience or a suitable professional-grade qualification in one or more of the following areas:

  • Oracle Middleware/Java
  • WebLogic Server administration including Java debug/fault finding at the server/JVM level
  • Linux or Windows Server administration
  • Microsoft SQL Server administration
  • Oracle Database Administration
  • Docker/Kubernetes Administration
  • Microsoft Azure Administration
  • Terraform/Ansible/Powershell

In addition to experience in one of the above areas, experience in further areas from the same list is also desired.

The following are value-add skills, if available:

  • GCP administration and operations
  • Working knowledge of ERP systems
  • Usage of ITSM tools in a service desk environment

 

Qualifications

Mandatory

A formal qualification (Degree, HND, etc.) in Computer Science, Information Technology or similar.

Optional Value Add

  • ITIL qualifications, at foundation or higher levels
  • Specialist Technical Qualifications, suitable examples:
    • Windows Server MCP or Red Hat RHCE groups of certifications
    • Microsoft Azure, AWS or GCP certifications
    • Cisco CC or Juniper JNCP groups of certifications
    • CompTIA group of certifications

 

Working Environment

The team provides support 24x7x365. Flexibility to work some holidays, nights, and weekends, and to assist with escalations at short notice, is required.

Note: This role profile serves to provide objective criteria for selecting a candidate who best fits the requirements.  This document summarizes the main duties and responsibilities of the role and is not intended as an exhaustive list.

Additional Information

All qualified applicants will receive consideration for employment without regard to race, color, religion, sex, sexual orientation, gender identity, national origin, disability, or status as a protected veteran. VEVRAA Federal Contractor, Equal Opportunity Employer

+30d

Site Reliability Engineer (full remote working)

lastminute.com, Wrocław, Poland, Remote
terraform, Design, mobile, ansible, azure, docker, kubernetes, ubuntu, linux, python, AWS

lastminute.com is hiring a Remote Site Reliability Engineer (full remote working)

Company Description

Launched in 1998, this pioneering British-born brand has specialised in creating amazing experiences and unforgettable memories - from hotels, city breaks and holidays to theatre, entertainment and spa days. Experts in brightening up online travel, lastminute.com is among the worldwide leaders in the field, helping hundreds of thousands of customers every year find, and do, "whatever makes them pink".

lastminute.com is part of lm group, a publicly traded multinational group that is among the worldwide leaders in the online travel industry. Every month, across all its websites and mobile apps (in 17 languages and 40 countries), the Group reaches 60 million unique users who search for and book their travel and leisure experiences. More than 1,200 people enjoy working with us and contribute to providing our audience with a comprehensive and inspiring offering of travel-related products and services.

At the heart of our culture is a commitment to inclusion across race, gender, age, sexual orientation, religion, gender identity or expression and accessibility. We strongly believe in an equal opportunity space, which is welcoming and celebrates the uniqueness of everyone who works here. We value different lived experiences and respect viewpoints, as we know uniqueness drives innovation. We want to make sure our people reflect the communities across the world we help travel.

Job Description

*Please note that this is a full remote working position/on-site*

*This vacancy is also eligible for the External Referral Programme: do you have a friend who you think might be interested in this position? Don’t keep it to yourself, suggest their profile to us!*

To support and participate in company-wide Continuous Deployment introductions and SRE projects, we are looking for a Site Reliability Engineer with certified experience as an SRE for our Technology department.

“Hope is not a strategy. Engineering solutions to design, build, and maintain efficient large-scale systems is a true strategy, and a good one.”

Key Responsibilities 

  • As Site Reliability Engineers, we are responsible for the availability, performance, monitoring, and incident response of the platform and services running on multiple environments.
  • Improve infrastructure automation, automate repetitive tasks, and build scalable infrastructure
  • Improve and evolve the Self-Service Capabilities for developers and other stakeholders
  • Collaborate closely with architects, developers, and database administrators to handle the reliability and scalability of the infrastructure.
  • Work closely with the Infrastructure team to define and implement solutions necessary for the success of the development teams.
  • Participate in periodic on-call duties

Qualifications

Essential

  • 3+ years of experience in DevOps
  • Strong experience with Linux operating systems (Ubuntu, RHEL) internals and administration
  • Strong knowledge of web applications and high-traffic web architecture
  • Strong knowledge of Docker and orchestration frameworks (Kubernetes preferred, OpenShift, Nomad)
  • Experience working in microservices-based architectures
  • Good understanding of configuration management tools (Ansible), IaC tools (Terraform) and their best practices
  • Good knowledge and hands-on experience using continuous delivery and deployment tools like GitLab CI, Spinnaker or similar (CircleCI / GoCD / GitHub Actions …)
  • Experience in virtualization technologies (VMware)
  • Good knowledge of languages like Go, Python and system scripting languages
  • Good knowledge of major public cloud providers' technologies (AWS, Google Cloud, Azure)
  • Good Knowledge of data centre management
  • Experience with traditional and modern website architecture
  • Familiarity with Centralized logs solutions (Fluentd, Logstash, Splunk)
  • Familiarity with change management and incident management processes
  • Familiarity with observability

Desirable

  • Travel domain experience
  • Certifications in one of the above-described fields
  • Good understanding of hybrid cloud architecture
  • VMware NSX
  • Sysadmin background

Abilities/qualities

  • Good communication skills, written and verbal     
  • Enthusiasm to learn new technologies
  • Aptitude for teamwork and ability to work in multi-location teams

Additional Information

By joining our company, you will have the chance to:

  • Join a dynamic team in an inclusive-international environment
  • Grow thanks to the career journey and our internal mobility perspective
  • Manage your own schedule thanks to the flexible start and end of the working day
  • Work a shorter working week (36h), with 4 of those hours on Friday morning
  • Get focus time for learning, development and deep work on Friday mornings
  • Work partially or fully remote according to local laws
  • Enjoy continuous training thanks to our company platform
  • Benefit from employee discounts on travel
  • Receive 2 days off per year for the purpose of volunteering
  • Receive a bonus after 5 and one after 10 years in the company
  • Get free snacks / fruit / hot drinks / water / beverages at our offices
  • Participate in amazing winter and summer corporate events
  • Benefit from extended parental or marriage leave

+30d

Azure DevOps Site Reliability Engineer (Remote)

Vendavo, Chicago, IL, USA, Remote
sql, mobile, azure, git, c++, .net, docker, kubernetes, linux, jenkins, python, AWS

Vendavo is hiring a Remote Azure DevOps Site Reliability Engineer (Remote)

Company Description

Vendavo is the leading provider of price management and optimization solutions for business-to-business companies worldwide.  Vendavo solutions (On-premise, Mobile and SaaS) include comprehensive pricing analysis, optimization, price setting, and deal execution capabilities that help companies improve profits through the art of science and big data.  Leading companies across chemicals, high-tech, industrial manufacturing, and distribution industries leverage Vendavo solutions to drive higher profits.  We’re making a difference in business, and we’re looking for energetic, experienced, and talented professionals to grow our team. If you are someone who is driven to make a global impact and believes in a culture of mutual respect, then you need to join us here at Vendavo!

We collaborate with our customers like few others in our industry.  That’s how we help global businesses achieve extraordinary outcomes in driving predictable, profitable outcomes and growth, by combining the best technology, processes, and – most importantly – people.

It doesn’t stop with unlocking opportunities for customers: We’re committed to creating growth, opportunity, diversity, and inclusion for our employees, too.

Our team is growing. You will too.

Job Description

The Opportunity:  We are seeking a DevOps Site Reliability Engineer to embed with our Cloud Services team. In this role, you will help maintain, develop, and scale the Vendavo Cloud platform to support our rapid growth and ambitious goals. Members of this team take a collaborative and customer-oriented approach. You will have the opportunity to offer new ideas and make valuable contributions to the team every day. If you love automating infrastructure as code, and enjoy the variety of systems administration, cloud services, and database administration, this role is for you!

  • Drive system reliability and performance improvements to delight our customers
  • Champion infrastructure as code, deployment automation, and observability to enable reliable, rapid, and effortless releases to production
  • Play a key role in evaluating and integrating Azure and AWS platform technologies into our platform architecture
  • Enthusiastically participate in a culture of continuous improvement
  • Use a variety of monitoring and APM tools to ensure the health of the system and identify opportunities for improvement in the application and the database
  • Create repeatable patterns to automate production and non-production infrastructure to meet scalability, reliability, security, and availability requirements
  • Help evaluate new tools and technologies and collaborate closely with the development team
  • Maintain platform security controls

Qualifications

  • Experience with development in .NET, SQL and C#
  • Expertise with CI/CD tools - Jenkins, TeamCity, Azure DevOps or similar tools
  • Experience with cloud services – Azure, or similar 
  • Experience with Docker and container orchestration tools such as Kubernetes is a plus
  • Scripting experience with PowerShell, Python, or batch
  • Great interpersonal skills and an ability to work in a team environment
  • Self-starter willing to work in a dynamic environment with minimal supervision
  • Experience with Windows Server and Linux based environments
  • Experience with Git
  • Experience with infrastructure monitoring tools such as Icinga, Prometheus, or Nagios
  • Strong desire to acquire and master new skills

Additional Information

  • Competitive base salary + bonus
  • Comprehensive health benefits including medical and dental
  • Unlimited paid time off
  • Flexible working hours

Accommodations

Vendavo is an inclusive community, and we know that everyone has their own needs. If you have a disability or special need that requires accommodation during the interview process, please contact your recruiter with your request. Your message will be confidential, and we will be happy to assist you.

All your information will be kept confidential according to EEO guidelines.
