Site Reliability Engineer Remote Jobs

141 Results

29d

Site Reliability Engineer III - (Build & Release Focus)

Renaissance
Remote, Any City, REMOTE, United States, Remote
agile, terraform, Design, git, c++, .net, jenkins, python, AWS, javascript

Renaissance is hiring a Remote Site Reliability Engineer III - (Build & Release Focus)

Company Description

When you join Renaissance®, you join a global leader in pre-K–12 education technology. 

Renaissance’s solutions help educators analyze, customize, and plan personalized learning paths for students, allowing time for what matters—creating energizing learning experiences in the classroom. Our fiercely passionate employees and educational partners have helped drive phenomenal student growth, with Renaissance solutions being used in over one-third of US schools and in more than 100 countries worldwide.

Every day, we are connected to our mission by exemplifying our values: trust each other, win together, strive for the best, own our actions, and grow and evolve.

Job Description

We are looking for an experienced Site Reliability Engineer III with a focus on Build & Release Engineering.  

  • This role will work on engineering efficiency initiatives and will design & implement continuous integration and continuous delivery pipelines.

  • We are on a quest to enable developers to perform self-service at-will deployments all the way through production.  

  • You are a developer. 

  • You are a self-motivated problem solver who communicates well, values teamwork, and has a passion for DevOps & SRE principles.

  • You will be reporting directly to the Director of Engineering. 

In this role, you will work with engineering, security & operations teams to: 

  •  Improve engineering efficiency by designing & building automated CI/CD pipelines. 
  • Help shape build & release systems & processes.  

  • Implement zero-downtime deployments, immutable and reproducible builds, and blue/green or canary deployments. 

  • Evangelize DevOps philosophy. 

  • Reduce toil by improving automation.

  • Be part of an on-call rotation & support off-hours deployments.

Qualifications

For this role, you must have: 

  • 5+ years of software development experience, with recent specialization in Build & Release Systems. 

  • Expert-level experience in continuous integration, continuous delivery/deployment, zero-downtime deployments, and Blue/Green or Canary Deployments. 

  • Experience with Python. 

  • Experience with Configuration Management and Orchestration Tools such as Terraform or Ansible. 

  • Experience with Docker. 

  • Experience with Amazon Web Services including ECS, Fargate, Lambdas. 

  • Experience setting up & managing Jenkins at scale. 

  • Experience with Git and related best practices (branching/merging/rebasing). 

  • Experience working in an agile environment. 

  • Great written and verbal communication skills. 

  Bonus points for experience with: 

  •  AWS CodeDeploy & other AWS developer tools. 
  • Writing larger applications in Python (e.g., knowledge of module namespaces, unit testing, database APIs)

  • GitHub Enterprise administration and deployment 

  • GitHub Actions 

  • JFrog Artifactory or Sonatype Nexus

  • EKS 

  • LaunchDarkly 

  • New Relic 

  • .NET Core 

  • C# 

  • JavaScript

  • Packer 

  

Additional Information

All your information will be kept confidential according to EEO guidelines.

Salary Range: 93,000 – 145,000. This range is based on national market data and may vary by location.

Benefits:

  • Medical, Dental & Vision
  • 401K with generous matching
  • Generous Tuition & Professional Development Reimbursement
  • 10+ paid holidays and Vacation Time Off
  • 2 Volunteer Days off yearly
  • 14 weeks fully paid family leave

Renaissance is committed to maintaining a safe and healthful environment for our employees and customers.  To uphold this commitment, Renaissance requires all employees to receive a full COVID-19 vaccination as a condition of employment, unless an individual has been granted an exemption as an accommodation due to a disability or for religious reasons.

Frequently cited statistics show that some women, minorities, individuals with disabilities, and protected veterans may only apply to roles if they meet 100% of the qualifications. At Renaissance, we encourage all applications! Roles evolve over time, especially with innovation, and you may be just the person we need into the future. We hope you're open to learning new skills to grow with us. Make our team your team!

Renaissance is an equal opportunity employer and does not discriminate with respect to any term, condition or privilege of employment based on race, color, religion, sex, sexual orientation, gender identity or expression, age, disability, military or veteran status, marital status, or status of an individual in any group or class protected by applicable federal, state, or local law.

At Renaissance our mission is: “To accelerate learning for all children and adults of all ability levels and ethnic and social backgrounds, worldwide.” Inherent in that guiding principle is dedication to serving all identities by recognizing the importance of Diversity, Equity, and Inclusion (DEI) in our organization, our work and our products.

Renaissance also provides reasonable accommodations for qualified individuals with disabilities in accordance with the Americans with Disabilities Act and applicable state and local laws. If an accommodation is needed to participate in the job application or interview process, please contact Talent Acquisition.  

29d

Senior Site Reliability Engineer

Procore Technologies
San Francisco, CA, USA, Remote
terraform, ansible, ruby, java, kubernetes, jenkins, AWS, Node.js

Procore Technologies is hiring a Remote Senior Site Reliability Engineer

Job Description

What if you could use your technology skills to develop a product that impacts the way communities' hospitals, homes, sports stadiums, and schools across the world are built? Construction impacts the lives of nearly everyone in the world, and yet it's also one of the world's least digitized industries, not to mention one of the most dangerous. That's why we're looking for an experienced Senior Software Engineer to join Procore's journey to revolutionize a historically underserved industry.

As a Senior Software Engineer on our Reliability Engineering team, you'll help champion solutions to systemic issues affecting every team at Procore. Leveraging your software and systems architecture expertise, you'll conduct consultative engagements with our service authors that improve our software's reliability. If you have a passion for solving complex problems unique to running large, highly scalable, resilient systems, we would love for you to join us!

This position will report to the Manager of the Reliability Engineering team with the opportunity to be located in our Carpinteria, CA headquarters, New York City, or Austin, TX office. Remote candidates will be considered based on experience with the expectation of occasional travel to these offices. We're looking for someone to join our team immediately.

What you'll do:

  • Work on projects within a small team of Reliability Engineers to continually improve the reliability of Procore's services through engineering and process improvement

  • Collaborate with your peers to develop solutions in your respective area with a bias toward reusability, toil reduction, and resiliency

  • Identify and surface opportunities for solving larger systemic issues

  • Use a collaborative approach to make technical decisions that align with Procore's architectural vision

  • Partner with internal customers, peers, and leadership in planning, prioritization, and roadmap development

  • Collaborate with teammates through code reviews and pairing

  • Serve as a subject matter expert on tools, processes, and procedures and help guide others to create and maintain a healthy codebase

  • Facilitate an "open source" mindset and culture both across teams internally and outside of Procore through active participation in and contributions to the greater community

What we're looking for:

  • BS or MS degree in Computer Science or related discipline; or comparable work experience. Technical Certifications are a plus

  • 5+ years of combined experience as a Software, Resiliency, or Reliability Engineer, with proficiency in one or more languages (Ruby, Node.js, Java preferred)

  • Experience designing and developing services within distributed systems

  • Curiosity to seek out and solve complex problems

  • Experience working with software, platforms, and infrastructure at scale (we run thousands of containers and have millions of users) 

  • Experience with the following is preferred:

    • Public cloud (AWS, GCP)

    • Container orchestration (Kubernetes)

    • Cloud automation tooling (e.g., CloudFormation, Terraform, Ansible)

    • Continuous Integration Tooling (e.g., CircleCI, Jenkins, Travis, etc.)

    • Continuous Deployment Tooling (e.g., ArgoCD, Spinnaker)

    • Service Mesh / Discovery Tooling (e.g., Consul, Envoy, Istio, Linkerd)

    • Contributions to open-source projects

Additional Information

If you'd like to stay in touch and be the first to hear about new roles at Procore, join our Talent Community.

About Us

Procore Technologies is building the software that builds the world. We provide cloud-based construction management software that helps clients more efficiently build skyscrapers, hospitals, retail centers, airports, housing complexes, and more. At Procore, we have worked hard to create and maintain a culture where you can own your work and are encouraged and given resources to try new ideas. Check us out on Glassdoor to see what others are saying about working at Procore. 

We are an equal opportunity employer and welcome builders of all backgrounds. We thrive in a diverse, dynamic, and inclusive environment. We do not tolerate discrimination against employees on the basis of age, color, disability, gender, gender identity or expression, marital status, national origin, political affiliation, race, religion, sexual orientation, veteran status, or any other classification protected by law.

Perks & Benefits

You are a person with dreams, goals, and ambitions—both personally and professionally. That's why we believe in providing benefits that not only match our Procore values (Openness, Optimism, and Ownership) but enhance the lives of our team members. Here are just a few of our benefit offerings: generous paid vacation, employee stock purchase plan, enrichment and development programs, and friends and family events.

30d

Site Reliability Engineer - US/CA (Remote position)

agile, terraform, azure, ruby, c++, kubernetes, jenkins, python, AWS

Nozomi Networks is hiring a Remote Site Reliability Engineer - US/CA (Remote position)

Please note, this is a fully remote job opportunity. Do not apply for this role if you are not physically located in the US or Canada.

While this is a remote position, we cannot consider candidates that are not based in these regions.


Nozomi Networks is the leader of industrial cybersecurity. Whether our clients need fast product enhancements, onsite engineering support, or rapid deployment across continents, we deliver. We accelerate digital transformation by providing exceptional network visibility, threat detection and operational insight for OT and IoT environments.

Here at Nozomi Networks, you will have the opportunity to develop ground-breaking technology for Industrial Cybersecurity, where Deep Packet Inspection and Artificial Intelligence are used together with Agile Methodologies to build our product. 

 

We are looking for an experienced Site Reliability Engineer to support and expand our globally distributed team. Together with the rest of the team, we are responsible for the availability, performance, monitoring, and incident response of our cloud-based services.

You will play a critical role in improving our infrastructure while automating repetitive tasks. You will also work closely with our development teams, helping them define and implement the solutions they need for their productivity and success.

 

Responsibilities

  • Build and maintain our cloud-based infrastructure and related services
  • Promote and implement automated solutions across application deployment, maintenance, and operational activities
  • Troubleshoot issues across the entire stack, from network and OS to applications
  • Performance tuning and optimization of cloud services, including cost saving strategies
  • Drive post-incident analysis according to our no-blame culture
  • Participate in periodic on-call duties
  • Ability to operate in settings with strong confidentiality and data privacy protocols

 

Requirements

  • 3+ years of professional experience as DevOps, SRE, or Cloud engineer
  • Strong hands-on experience with Kubernetes
  • Hands-on experience with at least one cloud provider (preferably AWS)
  • Good knowledge of at least one programming language (Ruby, Python, Go)
  • Experience defining infrastructure as code using tools such as Terraform or CloudFormation
  • Experience in the definition of CI/CD pipelines using technologies like Jenkins, GHA, or similar

 

Nice to have

  • Experience with distributed systems
  • Experience with observability concepts/tools
  • Experience with any other cloud provider (preferably GCP and Azure)
  • Experience with Agile methodologies

 

What we offer

  • Competitive compensation
  • Budget for professional development (courses, certifications, conferences, etc.)
  • Flexible working hours
  • Work from anywhere you prefer
  • Professional growth plan based on your performance and interests

 

Top Product reviews:  Gartner Peer Reviews

 

30d

Senior Site Reliability Engineer (RAVN)

iManage
Remote
terraform, azure, git, docker, kubernetes, python, AWS

iManage is hiring a Remote Senior Site Reliability Engineer (RAVN)

This is a remote position. We are a global team that leverages the latest technology to communicate with our colleagues across the globe. When it’s safe to do so, there may be times when this role is required to travel to a local office for in-person collaboration with your team.

Being a Sr. Site Reliability Engineer at iManage means…

You are a senior-level engineer with solid experience in SRE tools who can also demonstrate the ability to pick up new concepts and technologies quickly. As an SRE on our team, you'll get your hands in every part of the technology stack, understanding the inner workings and helping drive our solutions forward. Our solution makes it simple to uncover relevant content and seek best practices from experts based on their own activities, experience and relationships. It goes beyond the basic productivity you’d expect from a document management system. By incorporating other sources of institutional, analytical and practical data, we help our clients discover contextually relevant content, expertise, best practices, and insights. You will play the key role of supporting the teams to deliver excellent software to the clients by solving challenging and difficult problems, all while having fun!

One of our leaders, Site Reliability Engineer Team Lead (Andrea Coccodi) describes this opportunity best: “You will work closely with the engineering team, providing guidance and your expertise around Docker, Helm, ArgoCD and Kubernetes. More importantly, you will work on complex and exciting challenges every day. The engineering team is a vibrant and ambitious group that is responsible for driving our product forward and executing our product roadmap - and so you will be helping to manage and improve our development pipeline as well as working closely with developers to support their technology stack”.

iM Responsible For…

  • Keeping the technology stack up to date and increasing reliability
  • Driving innovation and platform evolution   
  • Adhering to security best practices 
  • Driving the productization and observability of our applications
  • Coordinating and participating in production support and on-call rotations   
  • Working cross functionally with cloud operations, security team, development and product team 

iM Qualified Because I Have…

  • 5+ years in SRE roles demonstrating increasing responsibilities 
  • Solid understanding of working with GIT source control  
  • The ability to troubleshoot and debug VM/container issues at any level, including networking   
  • Experience deploying resources via Terraform
  • Strong experience running Kubernetes clusters in Production and at scale  
  • Hands on experience with Microsoft Azure, AWS or Google Cloud  
  • Strong knowledge and understanding of monitoring tools (Prometheus, Grafana, EFK) 
  • Knowledge and experience of scripting (Bash or Python); we want someone who can contribute to development of tools and services, and grow our automation 
  • Passion for technology and solving challenging problems  

iM Getting To…

  • Join a supportive, experienced team benefiting from continuous growth within an inclusive, encouraging and vibrant culture 
  • Onboard remotely and be included in all aspects of iManage life 
  • Collaborate cross functionally 
  • Help mentor, lead, and coach junior team members  
  • Focus on meaningful work, solving complex, real world issues utilizing the latest technologies and protocols 
  • Own your learning and growth within our career development support framework, plus access to a wide-ranging online learning library
  • Receive competitive benefits that include: attractive salary based on market data, health/vision/dental/life insurance, 401k matching, performance bonuses, flexible working environment, generous PTO, unlimited sick days and so much more!

About iManage…

iManage is dedicated to Making Knowledge Work™. Over one million professionals across 65+ countries rely on our intelligent, cloud-enabled, secure knowledge work platform to uncover and activate the knowledge that exists inside their business content and communications.   

We are continuously innovating to solve the most complex professional challenges and enable better business outcomes. Our work is not always easy, but it is ambitious and rewarding.

So we’re looking for people who love a challenge. People who are happiest when they’re solving problems and collaborating with the industry’s best and brightest. That’s the iManage way. It’s how we do things that might appear impossible. How we develop our employees’ strengths and unlock their potential. How we find meaning in everything we do.  

Whoever you are, whatever you do, however you work. Make it mean something at iManage. 

Learn more at: www.imanage.com  

Please see our privacy statement for more information on how we handle your personal data: https://imanage.com/privacy-policy/  

+30d

Site Reliability Engineer

Xplor
Washington, DC, USA, Remote
agile, terraform, postgresql, python, AWS

Xplor is hiring a Remote Site Reliability Engineer

Company Description

At Xplor, we help businesses thrive by making life simple for daily activities with a recurring nature. We do that by offering smart software, payments, and commerce-enabling solutions across five “everyday life” verticals: Education, Health and Fitness, Boutique Wellness, Field Services and Personal Services.

You’ll join our Boutique Wellness team where we work with all types of fitness professionals. Personal training. Yoga. Bootcamp. Barre. Fitness - you name it. We offer the services they need, so that they can focus on what really matters - helping their members and clients succeed.

Job Description

We are looking for an experienced Site Reliability Engineer who is excited about the challenge of reliably operating and scaling Boutique product platforms, which are already making waves in the fitness industry.

As a member of the Cloud Infrastructure team, you will help us develop the delivery pipelines, monitoring services, and infrastructure that serve our teams' best-in-class software and customer-facing products. You will work closely with a smart and dedicated team to solve challenging problems across a variety of domains, including monitoring and alerting, containers, networking, scalability, and many more.

Our technical stack includes: Terraform, PostgreSQL, DataDog, Python, numerous AWS services including RDS, ECS, Lambda, CloudFront, EventBridge, CodePipeline and more.

Responsibilities and Expectations

  • Contributing clean infrastructure code to optimize system reliability, performance, and efficiency
  • Collaborating closely with product stakeholders and engineering teams to deliver and scale secure, stable, observable applications
  • Proactively writing monitors and alerts to ensure actionable data insights are provided to application and infrastructure teams
  • Documenting to ensure the utility and maintainability of the platform and shared services
  • Participating in operations and on-call duties
  • Providing constructive feedback on your colleagues’ pull requests, and accepting constructive feedback on your own PRs in return
  • Participating actively and thoughtfully in the full Agile development lifecycle, from planning to testing to release

Reports to: Principal Engineer

Location: Washington, DC, USA (Or remotely in one of the following US States: AZ, CA, CO, DC, FL, GA, IL, MA, MD, MI, NC, NE, NJ, NV, NY, OH, OR, PA, SC, TN, TX, UT, VA.)

Qualifications

Technical Qualifications

  • 3+ years experience in professional software development
  • Thorough knowledge of web application performance monitoring and troubleshooting
  • History of automating complex workloads
  • Proficiency in analyzing time-series data
  • Experience working on a high-traffic tiered web application
  • Experience with container orchestration frameworks
  • Experience with serverless technologies
  • Experience with event-driven systems
  • Experience with continuous integration
  • Experience administering UNIX-like operating systems
  • Experience with agile software development methodologies

Work Skills/Personal Characteristics

  • Ability to communicate clearly and kindly with technical and non-technical colleagues
  • Able to work as a part of a team, but also pursue individual objectives
  • Willing to be a mentor to your colleagues and learn from them as well
  • An interest in learning about various technologies, our products, and the boutique fitness industry

Additional Information

All your information will be kept confidential according to EEO guidelines.

Sheryl Sandberg once said, “If you're offered a seat on a rocket ship, don't ask what seat! Just get on.” We couldn't agree more. So, are you ready to get on board?

To learn more about us and our products, please visit www.xplortechnologies.com. 

Xplor is proud to be an Equal Employment Opportunity employer. We're dedicated to attracting, retaining and developing our people regardless of gender identity, ethnicity, sexual orientation, disability, veteran status and age. Applications are encouraged from all sectors of the community. 

We’re committed to replying to each application and look forward to getting in touch with you soon.

+30d

Senior Site Reliability Engineer (Core Team)

iManage
Remote
agile, terraform, Design, azure, git, kubernetes, linux, python, AWS

iManage is hiring a Remote Senior Site Reliability Engineer (Core Team)

This is a remote position. We are a global team that leverages the latest technology to communicate with our colleagues across the globe. When it’s safe to do so, there may be times when this role is required to travel to a local office for in-person collaboration with your team.

Being a Sr. Site Reliability Engineer at iManage means…

You are a Sr. SRE who is interested in building something from the ground up with our new and exciting cloud platform. In this role, you will contribute to our SaaS platform with the global SRE team. You will participate in architectural and design discussions, along with efforts to avoid and reduce toil, and of course, provide a scalable, reliable platform for the success of our customers and organization. Because we value collaboration, you will work with other SRE team leads as well as cross functional teams to make critical and unified decisions.

Here is what one of our leaders, Team Lead - Site Reliability Engineer (Kevin Richner), has to say about the culture of the team: “Our group has the unique challenge of tackling a brand new initiative. Being a part of this project means you will get the opportunity to provide structure as a senior level contributor, while expanding your knowledge of new technologies and helping drive decision making. Our team works closely together to share ideas, have open conversations, and solve complex problems.”

iM Responsible For…

  • Participating in, and facilitating, agile sprints and associated ceremonies   
  • Driving innovation and platform evolution   
  • Scaling cloud infrastructure to support our growing ecosystem based on Kubernetes   
  • Providing reliable, predictable deployment and maintenance of distributed systems   
  • Adhering to security best practices
  • Writing and designing automation, monitoring, diagnosing, and debugging tooling   
  • Coordinating and participating in production support and on-call rotations   
  • Conducting incident management and contributing to associated retrospective/postmortem as needed   
  • Working cross functionally with cloud operations, development, and product team 

iM Qualified Because I Have…

  • 3+ years as an SRE, or 5+ years in a DevOps related role with increasing responsibilities demonstrated
  • Working knowledge of HashiCorp Terraform, Vault, and Consul 
  • Experience running Kubernetes clusters in Production and at scale  
  • Hands on experience with Microsoft Azure, AWS, or Google Cloud  
  • Solid understanding of working with git source control  
  • The ability to troubleshoot and debug container issues at any level, including container networking   
  • Familiarity with container networking, including different network plugins and frameworks such as Calico  
  • Strong knowledge and understanding of microservices based architectures   
  • Good understanding of networking including L2 and L3 concepts   
  • A background in administering and maintaining Linux-based systems
  • Knowledge and experience of one or more programming languages with ability to contribute to development of tools and services
  • Working knowledge of Kustomize, Helm, Tekton, and other related tools
  • Proficiency in at least one language, with Python or Golang preferred
  • Able to identify and mitigate reliability risks  
  • Passion for technology and solving challenging problems  

iM Getting To…

  • Join a supportive, experienced team benefiting from continuous growth within an inclusive, encouraging and vibrant culture 
  • Onboard remotely and be included in all aspects of iManage life 
  • Collaborate cross functionally 
  • Help mentor, lead, and coach junior team members  
  • Focus on meaningful work, solving complex, real world issues utilizing the latest technologies and protocols 
  • Own your learning and growth within our career development support framework, plus access to a wide-ranging online learning library
  • Receive competitive benefits that include: attractive salary based on market data, health/vision/dental/life insurance, 401k matching, performance bonuses, flexible working environment, generous PTO, unlimited sick days and so much more!

About iManage…

iManage is dedicated to Making Knowledge Work™.  Over one million professionals across 65+ countries rely on our intelligent, cloud-enabled, secure knowledge work platform to uncover and activate the knowledge that exists inside their business content and communications.   

We are continuously innovating to solve the most complex professional challenges and enable better business outcomes. Our work is not always easy, but it is ambitious and rewarding.

So we’re looking for people who love a challenge. People who are happiest when they’re solving problems and collaborating with the industry’s best and brightest. That’s the iManage way. It’s how we do things that might appear impossible. How we develop our employees’ strengths and unlock their potential. How we find meaning in everything we do.  

Whoever you are, whatever you do, however you work. Make it mean something at iManage. 

Learn more at: www.imanage.com

Please see our privacy statement for more information on how we handle your personal data: https://imanage.com/privacy-policy/  

+30d

Lead Site Reliability Engineer

YouGov
New York, NY, USA, Remote
agile, sql, Design, kubernetes, linux, python, AWS

YouGov is hiring a Remote Lead Site Reliability Engineer

Company Description

YouGov is an international market research and data analytics group.

Our mission is to supply a continuous stream of accurate data and insight into what the world thinks, so that companies, governments and institutions can better serve the people and communities that sustain them. 

We have the best data and the best tools. We continuously challenge conventional approaches to research, and we disrupt our industry to ensure that our clients always get the best solutions.

We are driven by a set of shared values. We are fast, fearless and innovative. We work diligently to get it right. We are guided by accuracy, ethics and proven methodologies. We respect and trust each other, and bring these values into everything that we do.

Each day, our highly engaged proprietary global panel of over 15 million people provides us with thousands of data points on consumer opinions, attitudes and behaviours. We combine this continuous stream of data with our research expertise to provide insights that enable intelligent decision-making and informed conversations.

With operations in the UK, North America, Mainland Europe, the Nordics, the Middle East, India and Asia Pacific, YouGov has one of the world’s largest research networks.

The Culture

Diversity and inclusion are fundamental to YouGov. We are committed to giving the world a voice by capturing the opinions of all groups, including the ones that are often under-represented in research. We are also committed to making sure that our products and tools are free from any bias, as accuracy is key to what we do. None of the above can be done without having a truly diverse workforce, in an inclusive workplace. We are very keen on attracting and retaining the best talent. And best talent also means a diverse pool of talent, with various backgrounds and perspectives.

Supporting the wellbeing of our staff, including maintaining a good work and life balance, is important to us. We support flexible working arrangements where appropriate for a role, with many locations offering a hybrid office-and-remote working approach. 

As an Equal Opportunity Employer, qualified applicants will receive consideration for employment without regard to race, colour, religion, sex, sexual orientation, gender perception or identity, national origin, age, marital status, veteran status, disability status, or any other characteristic protected by law. All employment decisions are made on the basis of occupational qualifications, merit, and business need.

Job Description

YouGov is searching for a Lead Site Reliability Engineer to help with technical planning and execution for our Site Reliability Engineering (SRE) team. 

In this collaborative role, you'll work with senior Directors, and a group of Engineers to be collectively responsible for the delivery, optimization, resilience, and availability of high-value and high-transaction-rate services trusted and used by both the general public and some of the largest brands in the world. You'll collaborate on planning technical aspects, participate in selecting vendors, and help drive the adoption of best practices across all YouGov technology groups. You'll work at a fast pace with autonomy, and you will have the opportunity to train fellow SREs. 

 What you will do: 

  • Collaborate on planning for SRE projects by helping translate high level business goals to project goals 

  • Provide input on the technical design and do the technical implementation of SRE plans and projects. 

  • Work on the selection of vendors to solve SRE requirements 

  • Provide strong and positive mentorship to fellow SREs and to other engineers 

  • Participate in support requests for YouGov’s production environment (not on-call) 

  • Establish Error Budgets for the products by monitoring SLIs, measuring SLOs and publishing them to dashboards that are useful for the business. 

  • Drive blameless post-mortems with all the technology teams and use the Error Budget to establish priorities for any necessary changes 

  • Identify and solve critical problems and build automation to prevent their recurrence. 

  • Design, develop, and implement supporting cloud services on the Kubernetes platform. 

Qualifications

  • 5+ years' work experience in a similar job role. 
  • Strong analytical and problem-solving skills. 
  • Strong experience with log aggregation, status monitoring applications, and APMs including New Relic, Sentry, ELK, and Prometheus 
  • Kubernetes knowledge and experience (50+ nodes) 
  • Experience with cloud (AWS) and on-premises setups 
  • Strong Linux background and understanding of networking. 
  • Significant knowledge of and familiarity with SRE best practices 
  • Experience working with fully remote teams 
  • Experience administering and/or designing databases - SQL and NoSQL. (preferred but not required) 
  • Exposure to Python web applications (preferred but not required)  
  • Experience working with Agile project management methodologies 

Additional Information

This is a full time, permanent remote role, which can be based in a YouGov Office or remote location in the UK or Europe. We are a global team with developers in the US, South America, Europe, and India.

+30d

Senior Site Reliability Engineer

Avaloq | Ayala Ave, Makati, Metro Manila, Philippines, Remote
terraform, Design, ansible, git, ruby, java, docker, kubernetes, jenkins, python, AWS

Avaloq is hiring a Remote Senior Site Reliability Engineer

Company Description

Writing the future. Together. 

Avaloq is a value driven, fast-paced financial technology and services company and we are committed to developing the banking solutions of tomorrow. 

By joining Avaloq, you’ll become a key part of our effort to power the digital transformation of the financial services industry. Our ambition is big and bold – to provide full end-to-end digital solutions by combining our leading efficiency with a flexible, responsible digital user experience. Headquartered in Zurich, Avaloq has over 2,000 employees globally. More information is available at www.avaloq.com  

 

Job Description

Your team:

We are an international and multidisciplinary team providing infrastructure as code to run the lifecycle of our clients' SaaS product implementations effectively on the public cloud.

We believe our colleagues come first, thinking differently is an asset, and innovation comes from putting customer experience at the centre of our design thinking.

As we expand our customer SaaS deployments, we are currently seeking experienced site reliability engineers (SREs) bringing fresh ideas, demonstrating a unique and informed viewpoint, and enjoying collaborating with a cross-functional team to develop reliable real-world solutions and positive user experiences at every interaction.

Your mission:

  • Build & run our cloud SaaS environments by monitoring availability, optimizing system performance and taking a holistic view of system health
  • Provide primary operational support and engineering for multiple large distributed software applications
  • Build software and systems to manage and automate platform infrastructure and applications
  • Balance feature development speed and reliability with well-defined service level objectives
  • Improve reliability, quality, and time-to-market of our suite of software as a service solutions
  • Partner with development teams to improve services through rigorous testing and release procedures
  • Participate in system design consulting, platform management, and capacity planning to create sustainable systems and services through automation and uplifts

Qualifications

What you need:

  • University degree in Computer Science or related discipline
  • Solid experience in implementing public cloud concepts and models
  • Programming experience with at least one modern language such as Python, Ruby, Go or Java including object-oriented design
  • Experience with distributed storage technologies like S3 as well as dynamic resource management frameworks such as Kubernetes, Mesos, or Docker
  • A proactive approach to spotting problems, areas for improvement, and performance bottlenecks
  • Drive to standardize, streamline, and automate processes
  • Experience in operating highly available services and designing large-scale distributed systems, preferably using AWS technologies
  • Experience with Infrastructure as Code (e.g. Terraform, CloudFormation)
  • Experience with Observability (e.g. CloudWatch, Prometheus / Grafana, Kibana, Elasticsearch)


You will get extra points for the following:

  • Experience configuring and operating AWS EKS
  • Configuration management tooling (e.g. Puppet, Ansible)
  • Expertise in cloud SaaS security best practices
  • Git, Gradle and Jenkins CI tool

Additional Information

What you can expect:

It’s all about getting to know our teams and to e-meet with us. We will use video interviews to give you the opportunity to meet your future colleagues and get a first insight into Avaloq’s unique culture.

What we will offer you:

We offer competitive base salaries and a benefits package with private health and dental care as well as a generous pension. If you go the extra mile, you might be entitled to an extraordinary achievement reward.
Avaloq aims to share its success with all its employees by paying out “Success Share Units” depending on its performance in a given year.

Don’t be shy – apply!
Please only apply online.

Note to Agencies: All unsolicited résumés will be considered direct applicants and no referral fee will be acknowledged.

+30d

Sr. Site Reliability Engineer (SM-CO) - *** Python, Ansible, Linux ***

Zscaler | San Jose, CA, USA, Remote
Bachelor degree, terraform, Design, ansible, linux, python

Zscaler is hiring a Remote Sr. Site Reliability Engineer (SM-CO) - *** Python, Ansible, Linux ***

Company Description

 

*** US citizenship is Required *** due to the nature of the customers assigned.

For over 10 years, Zscaler has been disrupting and transforming the security industry. Our 100% purpose-built cloud platform delivers the entire gateway security stack as a service through 150 global data centers to securely connect users to their applications, regardless of device, location, or network, in over 185 countries, protecting over 3,500 companies and detecting 100 million threats a day.

We work in a fast-paced, dynamic, make-it-happen culture. Our people are some of the brightest and most passionate in the industry and thrive on being the first to solve problems. We are always looking to hire highly passionate, collaborative and humble people who want to make a difference.

Job Description

  • Managing patching and fixing vulnerabilities
  • Writing new tools and automation in Python, Ansible
  • Evaluating new technologies and services for the operations environment
  • Create and deploy scalable systems for massively growing global infrastructure
  • Design and deployment of our customer-facing Linux and BSD based systems infrastructure
  • Develop, augment and maintain Ops documentation

Qualifications

  • 8+ years' experience in a SaaS/Cloud/Distributed environment growing at rapid scale
  • Minimum 4+ years of scripting experience (Bash/Python) is required
  • Experience with automation frameworks (Ansible preferred, Chef, Puppet, Terraform)
  • Developing and debugging tools, automation and scripts
  • Strong CentOS/UNIX skills, FreeBSD specific experience is a plus
  • Ability to analyze and troubleshoot systems performance and work with time series data
  • Basic Networking skills (TCP/IP, DNS, LACP, CARP) for testing and troubleshooting is required

Education:

  • Bachelor degree in Computer Science, Computer Engineering or similar discipline with MS degree in Computer Science or Computer Engineering preferred

Additional Information

All your information will be kept confidential according to EEO guidelines.

What You Can Expect From Us:

  • An environment where you will be working on cutting edge technologies and architectures
  • A fun, passionate and collaborative workplace
  • Competitive salary and benefits, including equity

Why Zscaler?

People who excel at Zscaler are smart, motivated and share our values. Ask yourself: Do you want to team with the best talent in the industry? Do you want to work on disruptive technology? Do you thrive in a fluid work environment? Do you appreciate a company culture that enables individual and group success and celebrates achievement? If you said yes, we’d love to talk to you about joining our award-winning team. 

Additional information about Zscaler (NASDAQ: ZS) is available at https://www.zscaler.com

Zscaler is an equal opportunity employer. We celebrate diversity and are committed to creating an inclusive environment for all employees.

SD Solutions is hiring a Remote PokerStars | Site Reliability Engineer

On behalf of PokerStars, SD Solutions is looking for a Site Reliability Engineer to help monitor, diagnose, triage, debug and support production issues. You will be playing a key part in the operations and support of Flutter International Marketing Technology solutions diagnosing and resolving application issues to ensure optimal performance and usability for our players and system users.

Profile and Experience:

  • 3+ years industry experience providing support to a variety of software applications, particularly (but ideally not limited to) gaming applications.
  • Diligent attention to detail coupled with thorough analysis, judgment and problem-solving skills.
  • Detail-oriented, highly motivated individual who can work on multiple tasks simultaneously and context switch seamlessly.
  • The ability to work under aggressive timelines and manage competing priorities.
  • Highly proficient in logging and monitoring tools (Grafana, Prometheus, Kibana, Splunk, ELK).
  • Strong oral and written communication skills
  • Fluency in the English language

Job Responsibilities:

  • collaborate with other teams to assist, troubleshoot, debug and support production issues.
  • develop and maintain rich instrumentation to proactively monitor and alert for thresholds or defined health checks.
  • work professionally with 3rd party vendors to have application issues fixed and long-term visions implemented.
  • lead and influence cross-functional teams to promote operational excellence, suggest best practices for site reliability and be a driver in our dedication to continuous improvement.
  • be responsible for the creation of software operations and support related documentation.
  • acquire strong domain knowledge of complex systems, both front end and back end, in order to research, recommend and coordinate software development and configuration changes.
  • make disciplined technical decisions and suggestions that are communicated to stakeholders clearly and in a timely manner.

The Group

PokerStars is part of Flutter Entertainment Plc, a global sports betting, gaming and entertainment provider headquartered in Dublin and part of the FTSE 100 index of the London Stock Exchange. Flutter brings together exceptional brands, products and businesses and a diverse global presence in a safe, responsible and ultimately sustainable way.


+30d

Senior Site Reliability Engineer

Abound | New York, NY, Remote
terraform, swift, docker, jenkins, AWS

Abound is hiring a Remote Senior Site Reliability Engineer

About Abound

At Abound, our mission is to help local businesses connect, grow, and delight customers.

We believe the best stores will always have exciting new products with unique stories. From maker stories to purposeful ingredients, we only select brands and products that spark a deeper connection with our customers.

With Abound, there’s so much more in store for you. Discover an exciting career opportunity at a rapidly growing rocketship startup (we grew 30x in less than 2 years!) that will give you the environment and the freedom to excel.

The Impact You Will Make

You will be a founding member of the SRE team at Abound and will have the opportunity to shape how we build and execute this function going forward. This position will work alongside fellow engineers to provide highly reliable environments, build and deploy pipelines, and a top notch developer experience. We are in the midst of a hyper-growth phase at our company, so it is a fantastic time to join us!

What You’ll Do:

  • Collaborate with others to create, from scratch, our infrastructure as code as we migrate from Heroku to AWS (Terraform, Docker, K8s, EKS, etc.)
  • Enhance and maintain our CI and deployment pipeline to allow for easy and swift production deployments (and occasional rollbacks)
  • Build the monitoring and alerting infrastructure required to proactively identify production incidents and assist with root cause analysis
  • Lead and evangelize the evolution of our infrastructure and tooling

What You Have:

  • A friendly, collaborative and passionate demeanor
  • 5+ years overall professional experience as an engineer
  • 3+ years experience as an SRE, creating and operating modern cloud based infrastructure
  • Experience migrating from a PaaS provider to self-hosted AWS a huge plus
  • Experience with CI/CD tooling such as CircleCI, Jenkins, or GitHub Actions
  • Strong working knowledge of monitoring & alerting tools such as DataDog

What You’ll Get From Us:

You’ll get top-notch benefits including a competitive base salary, equity in an early-stage, fast growing company, healthcare, PTO, home office equipment and wellness benefits. Importantly, you’ll also get an environment where there is an emphasis on individual growth, experimentation and ownership. You’ll find a team that is singularly focused on creating a great experience for our customers and partners alike. You will not find egos here, and need not apply if you have one. We seek kind humans first and foremost because we believe that the single greatest impactor of day to day happiness is those you surround yourself with.

+30d

Site Reliability Engineer, CI & Infrastructure

Square | Portland, OR, USA, Remote
Design, mobile, ansible, AWS

Square is hiring a Remote Site Reliability Engineer, CI & Infrastructure

Company Description

Square builds common business tools in unconventional ways so more people can start, run, and grow their businesses. When Square started, it was difficult and expensive (or just plain impossible) for some businesses to take credit cards. Square made credit card payments possible for all by turning a mobile phone into a credit card reader. Since then Square has been building an entire business toolkit of both hardware and software products including Square Capital, Square Terminal, Square Payroll, and more. We’re working to find new and better ways to help businesses succeed on their own terms—and we’re looking for people like you to help shape tomorrow at Square.

Job Description

Square is committed to high standards and innovation, and our hardware is no exception. Square's hardware devices form the tangible connection between Square and the millions of small businesses who rely on our services. We rigorously test our hardware devices, which requires a reliable and robust CI system. As an SRE for the continuous integration infrastructure team, you will drive the application reliability and scalability of our CI systems and pipelines, in support of developer velocity and productivity. This is a unique role that applies strong CS and software engineering principles to the field of infrastructure, offering challenges in areas including scaling, distributed systems, and concurrency. This role will report into our Engineering Manager for Hardware CI & Infrastructure.
 
Role Location: Your direct team will be based in the Pacific time zone. This role is open to all people who can be effective partners with this team, including sharing typical working hours.

You Will:

  • Build scalable infrastructure to manage CI systems (both on-prem and AWS) and applications, with a focus on software engineering and CS principles.
  • Build tools and automation that enhance developer velocity and productivity.
  • Minimize the risk of reliability failures related to durability, availability, and performance.
  • Collaborate across multiple teams including Devices Software Engineering, Client Platform Engineering, Corp Systems Engineering, IT Support, Production Platform Engineering, and Information Security.
  • Design and manage our SLIs and SLOs.
  • Build dashboards and programmatic alerting to maximize visibility into system health and status.
  • Help with capacity planning for our hybrid cloud infrastructure (on-prem and AWS).
  • Perform periodic on-call duty to handle system outages.

Qualifications

You have:

  • BS or higher in Computer Science or equivalent technical experience.
  • 6+ years of industry experience developing and troubleshooting large-scale infrastructure.
  • The ability to independently write programs in a programming language of your choice.
  • Similarly strong Bash and shell scripting skills.
  • Knowledge of network routing, TCP/IP Protocols, DHCP configurations, and Git.
  • Strong understanding of Infrastructure as Code.
  • Experience with Configuration management and Infrastructure-as-a-Service tools such as Ansible, Chef, Puppet, etc.
  • Experience building and scaling continuous integration systems and pipelines.
  • Experience with deploy and management systems like VMware, Foreman, MaaS, or other open source tools.
  • Nice to have: experience with Jenkins/CI or other test infrastructure tools.

Additional Information

We’re working to build a more inclusive economy where our customers have equal access to opportunity, and we strive to live by these same values in building our workplace. Square is a proud equal opportunity employer. We work hard to evaluate all employees and job applicants consistently, without regard to race, color, religion, gender, national origin, age, disability, pregnancy, gender expression or identity, sexual orientation, citizenship, or any other legally protected class. 

We believe in being fair, and are committed to an inclusive interview experience, including providing reasonable accommodations to disabled applicants throughout the recruitment process. We encourage applicants to share any needed accommodations with their recruiter, who will treat these requests as confidentially as possible. Want to learn more about what we’re doing to build a workplace that is fair and square? Check out our I+D page

Additionally, we consider qualified applicants with criminal histories for employment on our team, and always assess candidates on an individualized basis.

Perks

We want you to be well and thrive. Our global benefits package includes:

  • Healthcare coverage
  • Retirement Plans
  • Employee Stock Purchase Program
  • Wellness perks
  • Paid parental leave
  • Paid time off
  • Learning and Development resources

Square, Inc. (NYSE: SQ) builds tools to empower businesses and individuals to participate in the economy. Sellers use Square to reach buyers online and in person, manage their business, and access financing. Individuals use Cash App to spend, send, store, and invest money. And TIDAL is a global music and entertainment platform that expands Square's purpose of economic empowerment to artists. Square, Inc. has offices in the United States, Canada, Japan, Australia, Ireland, Spain, Norway, and the UK.

+30d

Site Reliability Engineer (AWS Cloud)

IntelliPro Group Inc. | San Francisco, CA
ansible, azure, python, AWS

IntelliPro Group Inc. is hiring a Remote Site Reliability Engineer (AWS Cloud)

Title: Site Reliability Engineer
Location: Remote currently

Job Qualification:
What we're looking for:
- Experience in AWS is preferred, but another cloud provider such as Google Cloud or Azure could work
- Proficient in Python; Go is a plus
- Experience with at least one of the modern configuration management tools such as Puppet, Chef, Ansible, or Salt
- Strong knowledge of Linux/Unix/BSD internals
- Experience developing and debugging networking protocols (HTTP, SSL, and TCP)

This is a contract position for our client. As such, the contractor who fills this role will be employed via our agency partner IntelliPro Group.
- All interviews will be scheduled and/or conducted by the client manager.
- When a finalist has been selected, IntelliPro Group will extend the offer and provide assignment details including duration, benefits options, and onboarding details.

Millions of people across the world come to Pinterest to find new ideas every day. It’s where they get inspiration, dream about new possibilities and plan for what matters most. Our mission is to help those people find their inspiration and create a life they love. As a Pinterest contractor, you’ll be challenged to take on work that upholds this mission and pushes Pinterest forward. You’ll grow as a person and leader in your field, all the while helping users make their lives better in the positive corner of the internet.

What you'll do:
- Handle operational requests around configuration and networking changes
- Write automation for operational workloads
- Provide developer and customer support for complex troubleshooting issues related to systems, applications, or networking

+30d

Site Reliability Engineer (Remote)

Certain | Chapel Hill, NC, Remote
marketo, terraform, mariadb, sql, salesforce, oracle, Design, mongodb, azure, docker, linux, python, AWS, Node.js

Certain is hiring a Remote Site Reliability Engineer (Remote)

About Certain, Inc:

Certain is the leading end-to-end enterprise event experience platform provider to the Fortune 1000. Our SaaS, cloud solution empowers marketers to deliver truly engaging digital and in-person attendee experiences, capturing rich insights and buying signals that lead to greater sales and marketing results. Headquartered in San Francisco, with offices in North America, Europe and the Pacific Rim, Certain partners with hundreds of enterprise and event management companies across tens of thousands of events with millions of attendees to deliver the best customer experience through live events. Certain is in the midst of rapid growth as event automation is one of the few dark spots in the marketing tech stack. While 30 – 40% of the marketing budget goes to hosting events, marketers lack a single platform to deliver seamless digital and in-person event experiences at scale while also extracting overall event ROI. Certain is changing the game as the first event experience platform that aggregates over 300 event data points that are seamlessly integrated with Eloqua, Marketo and Salesforce to deliver a highly personalized experience before, during and after live events. Certain has secured some of the largest global brands as clients such as Oracle, Microsoft and Red Hat – who won an Ops Stars Marketing Ops Team of the Year award based on their implementation of Certain in 2019. Now is the time to be part of a team that will establish Certain as the event experience leader.

In this role, you will strive to improve the performance, scalability, reliability, and security of our solutions. You should enjoy the fast pace of a small and rapidly growing company where the continuous evolution of our products and services is the norm. You'll be expected to contribute to and iterate on our configuration and infrastructure management and our service deployment framework. You should also have knowledge of monitoring and logging solutions, as well as tech to perform automation across any number of systems and containers.

Responsibilities

  • Work as a member of a global technology operations team administering 24/7 compute environments.
  • Perform management, monitoring, tuning, and troubleshooting of Linux servers in cooperation with members of the IT operations team.
  • Ensure that applications and services are highly available, reliable, and performant through world-class automation, monitoring, alerting, and self-healing capabilities.
  • Analyze system metrics and logs to ensure maximum uptime and delivery.
  • Design and maintain core infrastructure systems running in Amazon Web Services and Microsoft Azure.
  • Build and maintain configuration and infrastructure management, service deployment frameworks, infrastructure as code, and utility software.
  • Participate in a 24/7 on-call rotation.
  • Partner with development team to improve reliability and availability.

Skills and Experience

  • Minimum of 7 years working in a production operations role in support of large scale, distributed Linux and containerized infrastructure.
  • Experience running Apache, Tomcat, Kafka, Datadog, Salt, Docker, Terraform, Hadoop, AWS, CloudFormation, AWS IAM, AWS Lambda and ELK.
  • Experience with Nginx, Node.js, Glue, Athena, ECS and EKS, MS SQL, MariaDB, MongoDB is a plus.
  • An understanding of computing fundamentals including, but not limited to, virtualization, containers, storage, security, database, and networking.
  • Experience running infrastructure operations in Amazon Web Services or Microsoft Azure, with a focus on high availability, resilience, and performance, is paramount.
  • Programming experience for automation of systems; we use Python, Bash, and Java.
  • Experience with configuration management, service deployment, and infrastructure as code provisioning using Salt, Terraform or similar tools.
  • Experience with the software development lifecycle, continuous integration and deployment (CI/CD), and building and maintaining software pipelines.
  • Security conscious in all aspects of delivering operations.
  • Exceptional problem solving, critical thinking, and analytical skills.
  • Excellent written and verbal communication skills as well as teamwork and interpersonal skills.
  • AWS or relevant certifications a plus




+30d

Senior Site Reliability Engineer (Prisma Cloud) - Can be remote

Palo Alto Networks | Santa Clara, CA, USA, Remote
agile, terraform, Design, ansible, scrum, java, kubernetes, python, AWS

Palo Alto Networks is hiring a Remote Senior Site Reliability Engineer (Prisma Cloud) - Can be remote

Company Description

Our Mission

At Palo Alto Networks® everything starts and ends with our mission:

Being the cybersecurity partner of choice, protecting our digital way of life.

We have the vision of a world where each day is safer and more secure than the one before. These aren’t easy goals to accomplish – but we’re not here for easy. We’re here for better. We are a company built on the foundation of challenging and disrupting the way things are done, and we’re looking for innovators who are as committed to shaping the future of cybersecurity as we are.

Disruption is at the core of our technology and of our way of working. FLEXWORK, our approach to how we work, is designed to meet the needs of our employees now and in the future. We’re changing the nature of work: from benefits to learning, location to leadership, we’ve rethought and recreated every aspect of the employee experience at Palo Alto Networks. And because it FLEXes around each individual employee based on their individual choices, employees are empowered to push boundaries and help us all evolve, together.

Job Description

Your Career

The Prisma Cloud Security team is responsible for building products that protect data, workloads, and infrastructure for some of the largest enterprise customers in the world. We help our customers in their journey to the public cloud by ensuring they have best-in-class protection. The public cloud market has been growing at a very rapid rate for the last few years. As more and more enterprises leverage public cloud, there is an insatiable demand for securing workloads in public cloud. With the recent acquisition of two leading companies in this space, RedLock and Evident.io, Palo Alto Networks is the market leader in this space.

Your Impact

As you build your career at Palo Alto Networks you will be involved in a number of different projects and initiatives, being an integral part of the team. You will:

  • Work with development teams to ensure that applications have scalability and reliability built-in from day one -  Agile is second nature to you and you’re excited to work in scrum teams and represent the SRE perspective

  • Design and enhance software architecture to improve scalability, service reliability, cost, and performance -  You’ve helped create services that are critical to their customers’ success

  • Deploy automation for provisioning and operating infrastructure at large scale -  You are experienced in Infrastructure as Code concepts and have put them into production

  • Partner with teams to improve CI/CD processes and technology -  Helping teams in delivering value early is what you strive for

  • Mentor members of the staff on large scale cloud deployments -  You’re an expert in deploying in the cloud and can bring a teaching mindset to help others benefit from your experience

  • Drive the adoption of observability practices and a data-driven mindset -  You love metrics, graphs, and gaining a deep understanding of why things happen in a system, helping others gain visibility into the things they build

  • Participate in the occasional on-call rotation supporting the infrastructure owned by the SRE team -  Finding ways to reduce the time to resolution and improve the reliability of services is key to running a trusted platform

Qualifications

Your Experience

  • Strong sense of architecture and design for fault tolerance, scale-out approaches, and stability. Practiced in the AWS Well Architected Framework or similar 

  • Demonstrated experience in building tools and automation in Python or Java for large production environments

  • Experience working with microservice architectures running on Kubernetes and containers

  • Expert knowledge of Unix/Linux (shell/tools/kernel/networking/storage)

  • Tools-first mindset. You build tools for yourself and others to increase efficiency and reduce churn

  • Experience with Configuration Management and Infrastructure as Code: Terraform, Ansible, Chef, Puppet, etc.

  • Experience with public cloud (AWS) at medium to large scale

  • Demonstrated experience in designing and building large-scale metrics and monitoring systems is a plus

  • Organized, focused on building, improving, resolving and delivering

  • Exceptional communicator in and across teams, taking the lead

Additional Information

The Team

The DevOps team within Prisma Cloud is responsible for the scaling and support of our multi-region, multi-cloud application. We’re a group of software engineers, site reliability engineers, and security experts that own the deployment architecture and strive to improve our infrastructure through deep partnership with other teams across the organization.

Our Commitment

We’re trailblazers that dream big, take risks, and challenge cybersecurity’s status quo. It’s simple: we can’t accomplish our mission without diverse teams innovating, together.

We are committed to providing reasonable accommodations for all qualified individuals with a disability. If you require assistance or accommodation due to a disability or special need, please contact us at [email protected]

Palo Alto Networks is an equal opportunity employer. We celebrate diversity in our workplace, and all qualified applicants will receive consideration for employment without regard to age, ancestry, color, family or medical care leave, gender identity or expression, genetic information, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran status, race, religion, sex (including pregnancy), sexual orientation, or other legally protected characteristics.

Disclosure required by sb19-085 (8-5-20) of the minimum compensation (includes on-target earnings = base + on target incentives for sales roles) for this role to be located in the state of Colorado. If hired in Colorado, this position starts at $129,200/yr. Depending on the position offered, restricted stock units and incentive or bonus pay may be provided as part of this compensation package. Additional benefits may be found here.

All your information will be kept confidential according to EEO guidelines.

 

See more jobs at Palo Alto Networks

Apply for this job

+30d

Site Reliability Engineer (SRE)

NXTThing RPO, LLC
Toronto, ON, Canada, Remote
agile, jira, Design, azure, scrum, kubernetes, AWS

NXTThing RPO, LLC is hiring a Remote Site Reliability Engineer (SRE)

Company Description

 

What makes us Qlik

Qlik helps enterprises around the world move faster, work smarter, and lead the way forward with an end-to-end solution for getting value out of data. A Gartner Magic Quadrant Leader for 11 years in a row! Our platform is the only one on the market that allows for open-ended, curiosity-driven exploration, giving everyone – at any skill level – the ability to make real discoveries that lead to real outcomes and transformative changes. We are a values-driven organization, operating in over 100 countries with 45,000 customers around the world. If you think we are interesting, please read on – we may be looking for you!

 

About Qlik

  • Competitive Benefits package
  • Flexible working environment
  • Giving back is a part of our culture – we give you a day to change the world. In addition, we encourage our employees to participate in our numerous Corporate Responsibility Employee Programs
  • Learn about our Corporate Responsibility Program by visiting Qlik.org
  • Check out our careers in R&D here.
  • Check out our company page on LinkedIn!
  • Follow us on Instagram @lifeatqlik and @Qlik

Job Description

The Qlik Site Reliability Engineering team (SRE) is committed to ensuring that our large-scale distributed systems are scalable, monitored, automated and performing optimally, 24x7x365. We own our production environment - from the initial design phases to ensuring continuous high availability. Bring your passion to automate and join our growing team, and dig deep into performance, scalability, capacity, and reliability problems.

Responsibilities include:

  • Develop and support cloud infrastructure implementations and be directly involved in the software deployment process.
  • Use monitoring tools to find problems and resolve and/or escalate to development.
  • Implement and deploy new applications and enhancements to existing applications, software, and operating systems.
  • Work with development teams to establish SLOs and strive to ensure proper incident response when error budgets are depleted.
  • Plan and perform Operating System and software upgrades.
  • Provide general assistance for Technical Support.
  • Assist in the development and implementation of disaster recovery plans, and research emerging technologies in support of systems development efforts, recommending those that will increase cost effectiveness and systems flexibility.
  • Create and maintain documentation as it relates to system configuration, mapping, processes, and service records.

Qualifications

  Qualifications include:

  • 2+ years of experience with deploying and supporting a SaaS offering.
  • 1+ years of experience deploying and supporting Kubernetes clusters in a public, scalable SaaS offering.
  • 1+ years of experience as a Site Reliability Engineer or 3+ years in a DevOps environment.
  • 3+ years development experience (Golang / NodeJs / bash / etc.).
  • 1+ years of experience with each of the following:
    • Cloud Infrastructure (Amazon Web Services (AWS) / Google Cloud Platform (GCP) / Azure / etc.)
    • Bug tracking (JIRA / Bugzilla / YouTrack / etc.)
    • Continuous Deployment (Concourse / GitHub Actions / Spinnaker / etc.)
    • Source control (GitHub / GitLab / etc.)
    • Metrics & Tracing (Grafana / Prometheus / Jaeger / OpenTelemetry / OpenTracing / etc.)
  • Familiar and comfortable with agile development techniques – Scrum certified preferred.
  • Self-starter with the ability to work independently on projects.
  • Proactive and strong ability to learn new things with limited guidance.
  • Demonstrated ability to work effectively within a team and with cross-functional technical and business teams.
  • A curious attitude that is interested in knowing why things work the way they do and using that information to improve and enhance.

 

Location

This role is located in Ottawa, but we will consider candidates willing to work remotely in the EST time zone.

Additional Information

Emburse provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability or genetics. In addition to federal law requirements, Emburse complies with applicable state and local laws governing nondiscrimination in employment in every location where the company has facilities. This policy applies to all terms and conditions of employment.

See more jobs at NXTThing RPO, LLC

Apply for this job

+30d

Site Reliability Engineering team (SRE)

NXTThing RPO, LLC
Ottawa, ON, Canada, Remote
agile, jira, Design, azure, scrum, kubernetes, AWS

NXTThing RPO, LLC is hiring a Remote Site Reliability Engineering team (SRE)

Company Description

 

What makes us Qlik

Qlik helps enterprises around the world move faster, work smarter, and lead the way forward with an end-to-end solution for getting value out of data. A Gartner Magic Quadrant Leader for 11 years in a row! Our platform is the only one on the market that allows for open-ended, curiosity-driven exploration, giving everyone – at any skill level – the ability to make real discoveries that lead to real outcomes and transformative changes. We are a values-driven organization, operating in over 100 countries with 45,000 customers around the world. If you think we are interesting, please read on – we may be looking for you!

 

About Qlik

  • Competitive Benefits package
  • Flexible working environment
  • Giving back is a part of our culture – we give you a day to change the world. In addition, we encourage our employees to participate in our numerous Corporate Responsibility Employee Programs
  • Learn about our Corporate Responsibility Program by visiting Qlik.org
  • Check out our careers in R&D here.
  • Check out our company page on LinkedIn!
  • Follow us on Instagram @lifeatqlik and @Qlik

Job Description

The Qlik Site Reliability Engineering team (SRE) is committed to ensuring that our large-scale distributed systems are scalable, monitored, automated and performing optimally, 24x7x365. We own our production environment - from the initial design phases to ensuring continuous high availability. Bring your passion to automate and join our growing team, and dig deep into performance, scalability, capacity, and reliability problems.

Responsibilities include:

  • Develop and support cloud infrastructure implementations and be directly involved in the software deployment process.
  • Use monitoring tools to find problems and resolve and/or escalate to development.
  • Implement and deploy new applications and enhancements to existing applications, software, and operating systems.
  • Work with development teams to establish SLOs and strive to ensure proper incident response when error budgets are depleted.
  • Plan and perform Operating System and software upgrades.
  • Provide general assistance for Technical Support.
  • Assist in the development and implementation of disaster recovery plans, and research emerging technologies in support of systems development efforts, recommending those that will increase cost effectiveness and systems flexibility.
  • Create and maintain documentation as it relates to system configuration, mapping, processes, and service records.

Qualifications

  Qualifications include:

  • 2+ years of experience with deploying and supporting a SaaS offering.
  • 1+ years of experience deploying and supporting Kubernetes clusters in a public, scalable SaaS offering.
  • 1+ years of experience as a Site Reliability Engineer or 3+ years in a DevOps environment.
  • 3+ years development experience (Golang / NodeJs / bash / etc.).
  • 1+ years of experience with each of the following:
    • Cloud Infrastructure (Amazon Web Services (AWS) / Google Cloud Platform (GCP) / Azure / etc.)
    • Bug tracking (JIRA / Bugzilla / YouTrack / etc.)
    • Continuous Deployment (Concourse / GitHub Actions / Spinnaker / etc.)
    • Source control (GitHub / GitLab / etc.)
    • Metrics & Tracing (Grafana / Prometheus / Jaeger / OpenTelemetry / OpenTracing / etc.)
  • Familiar and comfortable with agile development techniques – Scrum certified preferred.
  • Self-starter with the ability to work independently on projects.
  • Proactive and strong ability to learn new things with limited guidance.
  • Demonstrated ability to work effectively within a team and with cross-functional technical and business teams.
  • A curious attitude that is interested in knowing why things work the way they do and using that information to improve and enhance.

 

Location

This role is located in Ottawa, but we will consider candidates willing to work remotely in the EST time zone.

Additional Information

Emburse provides equal employment opportunities (EEO) to all employees and applicants for employment without regard to race, color, religion, sex, national origin, age, disability or genetics. In addition to federal law requirements, Emburse complies with applicable state and local laws governing nondiscrimination in employment in every location where the company has facilities. This policy applies to all terms and conditions of employment.

See more jobs at NXTThing RPO, LLC

Apply for this job

+30d

Site Reliability Engineer

Certain
Chapel Hill, NC, Remote
marketo, terraform, mariadb, sql, salesforce, oracle, Design, mongodb, azure, docker, linux, python, AWS, Node.js

Certain is hiring a Remote Site Reliability Engineer

About Certain, Inc:

Certain is the leading end-to-end enterprise event experience platform provider to the Fortune 1000. Our SaaS, cloud solution empowers marketers to deliver truly engaging digital and in-person attendee experiences, capturing rich insights and buying signals that lead to greater sales and marketing results. Headquartered in San Francisco, with offices in North America, Europe and the Pacific Rim, Certain partners with hundreds of enterprise and event management companies across tens of thousands of events with millions of attendees to deliver the best customer experience through live events.

Certain is in the midst of rapid growth as event automation is one of the few dark spots in the marketing tech stack. While 30 – 40% of the marketing budget goes to hosting events, marketers lack a single platform to deliver seamless digital and in-person event experiences at scale while also extracting overall event ROI. Certain is changing the game as the first event experience platform that aggregates over 300 event data points that are seamlessly integrated with Eloqua, Marketo and Salesforce to deliver a highly personalized experience before, during and after live events.

Certain has secured some of the largest global brands as clients such as Oracle, Microsoft and Red Hat – who won an Ops Stars Marketing Ops Team of the Year award based on their implementation of Certain in 2019. Now is the time to be part of a team that will establish Certain as the event experience leader.

In this role, you will strive to improve the performance, scalability, reliability, and security of our solutions. You should enjoy the fast pace of a small and rapidly growing company where the continuous evolution of our products and services is the norm. You'll be expected to contribute to and iterate on our configuration and infrastructure management and our service deployment framework. You should also have knowledge of monitoring and logging solutions, as well as tech to perform automation across any number of systems and containers.

Responsibilities

  • Work as a member of a global technology operations team administering 24/7 compute environments.
  • Perform management, monitoring, tuning, and troubleshooting of Linux servers in cooperation with members of the IT operations team.
  • Ensure that applications and services are highly available, reliable, and performant through world-class automation, monitoring, alerting, and self-healing capabilities.
  • Analyze system metrics and logs to ensure maximum uptime and delivery.
  • Design and maintain core infrastructure systems running in Amazon Web Services and Microsoft Azure.
  • Build and maintain configuration and infrastructure management, service deployment frameworks, infrastructure as code, and utility software.
  • Participate in a 24/7 on-call rotation.
  • Partner with development team to improve reliability and availability.

Skills and Experience

  • Minimum of 7 years working in a production operations role in support of large scale, distributed Linux and containerized infrastructure.
  • Experience running Apache, Tomcat, Kafka, Datadog, Salt, Docker, Terraform, Hadoop, AWS, CloudFormation, AWS IAM, AWS Lambda, and ELK.
  • Experience with Nginx, Node.js, Glue, Athena, ECS and EKS, MS SQL, MariaDB, MongoDB is a plus.
  • An understanding of computing fundamentals including, but not limited to, virtualization, containers, storage, security, database, and networking.
  • Experience running infrastructure operations in Amazon Web Services or Microsoft Azure, with a focus on high availability, resilience, and performance, is paramount.
  • Programming experience for automation of systems; we use Python, Bash, and Java.
  • Experience with configuration management, service deployment, and infrastructure as code provisioning using Salt, Terraform, or similar tools.
  • Experience with the software development lifecycle, continuous integration and deployment (CI/CD), and building and maintaining software pipelines.
  • Security conscious in all aspects of delivering operations.
  • Exceptional problem solving, critical thinking, and analytical skills.
  • Excellent written and verbal communication skills as well as teamwork and interpersonal skills.
  • AWS or relevant certifications a plus




See more jobs at Certain

Apply for this job

+30d

(Senior) Cloud Site Reliability Engineer (m/f/x) onsite or remote (in Germany)

Scalable Capital GmbH
Berliner Str., Berlin, Germany, Remote
terraform, B2B, jenkins, AWS

Scalable Capital GmbH is hiring a Remote (Senior) Cloud Site Reliability Engineer (m/f/x) onsite or remote (in Germany)

Company Description

Scalable Capital was founded in 2014, entering the FinTech industry with the aim of democratizing investment management and brokerage. Our mission is to use modern technology to make investments both easier and more affordable. Today, Scalable Capital is Europe's largest digital asset manager with assets valued over 6 billion Euros under management, a neo-broker for independent decision makers, and a European leader in providing B2B digital asset management platform solutions. Visit our finance blog or tune in for our podcast to find out what our Expert Team has to say.

If you are looking for scalability in your professional career, join us in re-defining how investors think about wealth creation and in making first class investment services available to everyone.

Work with our team onsite or remote from anywhere in Germany.

Job Description

Scalable Capital was built in the cloud from day one. Our services currently run on various AWS services like ECS, Fargate and Lambda and are distributed across multiple accounts. We embrace a DevOps culture where the development teams manage their CI/CD pipelines and cloud infrastructure for their services themselves. Our Site Reliability Engineering Team focuses on shared infrastructure to enable the development teams to deploy and operate services in the cloud productively and securely.

 

  • Continuously improve our cloud setup. This includes recurring analysis of our AWS infrastructure as well as migrations between services
  • Research and integrate tooling to improve our processes
  • Mentor and enable our teams to further foster our DevOps culture
  • Be responsible for improving infrastructure and deployment automation using tools like Terraform, Jenkins, GitHub Actions
  • Support our business and software development teams to monitor and operate our AWS cloud platform including logs, metrics, tracing, and security

Qualifications

  • Experience with at least one public cloud provider and infrastructure as code
  • Working knowledge in at least one general purpose programming language
  • A degree in a relevant field of study (e.g. computer science, engineering, sciences) or work experience in a role that typically requires a university degree
  • A passion for automating and improving processes
  • Full professional proficiency in either English or German and the ability to communicate concisely in an international English-speaking environment
  • Previous experience in networking, audit-logging, and access tracing is an advantage

Additional Information

  • Be part of one of the fastest-growing and most visible Fintech startups in Europe, creating innovative services that have a substantial impact on the lives of our customers
  • The ability to work with an international, diverse, inclusive, and ever-growing team that loves creating the best products for our clients
  • Enjoy an office in a great location in the middle of Munich or in Prenzlauer Berg, one of the hippest neighborhoods of Berlin, or choose to work remotely
  • Learn and grow by joining our in-house knowledge sharing sessions and spending your individual Education Budget 
  • Work productively with the latest hardware and tools
  • Say goodbye to order commissions and say hello to your complimentary subscription of Scalable Capital's PRIME Broker
  • Benefit from an attractive compensation package
  • Learn about and experience German culture firsthand by joining our free German language classes

See more jobs at Scalable Capital GmbH

Apply for this job

+30d

Site Reliability Engineer

LastMinute Group
Madrid, Spain, Remote
terraform, Design, mobile, ansible, azure, docker, kubernetes, ubuntu, linux, python, AWS

LastMinute Group is hiring a Remote Site Reliability Engineer

Company Description

Launched in 1998, this pioneering British-born brand has specialised in creating amazing experiences and unforgettable memories - from hotels, city breaks and holidays to theatre, entertainment and spa days. Experts in brightening up online travel, lastminute.com is among the worldwide leaders in the field, helping hundreds of thousands of customers every year find, and do, "whatever makes them pink".

lastminute.com is part of lm group, a publicly traded multinational group among the worldwide leaders in the online travel industry. Every month, across all its websites and mobile apps (in 17 languages and 40 countries), the Group reaches 60 million unique users who search for and book their travel and leisure experiences. More than 1,200 people enjoy working with us and contribute to providing our audience with a comprehensive and inspiring offering of travel-related products and services.

Job Description

*Please note that this is a fully remote working position/on-site*

To support and participate in company-wide Continuous Deployment introductions and SRE projects, we are looking for a Site Reliability Engineer with certified experience as an SRE for our Technology department.

“Hope is not a strategy. Engineering solutions to design, build, and maintain efficient large-scale systems is a true strategy, and a good one.”

Key Responsibilities 

  • As Site Reliability Engineers, we are responsible for the availability, performance, monitoring, and incident response of the platform and services running on multiple environments.
  • Improve infrastructure automation, automate repetitive tasks, and build a scalable infrastructure
  • Improve and evolve the self-service capabilities for developers and other stakeholders
  • Collaborate closely with architects, developers, and database administrators to handle the reliability and scalability of the infrastructure.
  • Work closely with the Infrastructure team to define and implement solutions necessary for the success of the development teams.
  • Participate in periodic on-call duties

Qualifications

Essential

  • 3+ years of experience in DevOps
  • Strong experience with Linux operating systems (Ubuntu, RHEL) internals and administration
  • Strong knowledge of Docker and orchestration frameworks (Kubernetes preferred, OpenShift, Nomad)
  • Experience working in microservices-based architectures
  • Good understanding of configuration management tools (Ansible), IaC tools (Terraform), and their best practices
  • Good knowledge and hands-on experience using continuous delivery and deployment tools like GitLab CI, Spinnaker, or similar (CircleCI / GoCD / GitHub Actions …)
  • Experience in virtualization technologies (VMware)
  • Good knowledge of languages like Go and Python, and of system scripting languages
  • Good knowledge of major public cloud providers' technologies (AWS, Google Cloud, Azure)
  • Good knowledge of data center management
  • Experience with traditional and modern website architecture
  • Familiarity with centralized logging solutions (Fluentd, Logstash, Splunk)
  • Familiarity with and understanding of change management and incident management processes
  • Familiarity with observability

Desirable

  • Travel domain experience
  • Certifications in one of the fields described above
  • Good understanding of hybrid cloud architecture
  • VMware NSX
  • Sysadmin background

Abilities/qualities

  • Good communication skills, written and verbal     
  • Enthusiasm to learn new technologies
  • Team-oriented attitude and ability to work in multi-location teams