Since 2004, Mandiant has been a trusted partner to security-conscious organizations. Effective security is based on the right combination of expertise, intelligence, and adaptive technology, and the Mandiant Advantage SaaS platform scales decades of frontline experience and industry-leading threat intelligence to deliver a range of dynamic cyber defense solutions. Mandiant’s approach helps organizations develop more effective and efficient cyber security programs and instills confidence in their readiness to defend against and respond to cyber threats.
Reporting to the IT Operations Director in our Enterprise Technology Services team, the Production Operations and Site Reliability Engineering leader will head a globally distributed, high-performing team focused on elevating application and service performance and availability in support of our organization’s fast-evolving enterprise technology needs.
In this role, you will reduce risk to service availability for employees and customers by partnering with Engineering and Operations teams to proactively shift to AI-driven telemetry tooling, and by leading a team of professionals focused on improving transaction resilience.
The ideal candidate will have a broad background spanning both applications and infrastructure. They will have direct experience in multiple coding languages and will have performed code reviews in their previous work. They will have worked as a site reliability engineer before moving into management. They will have a strong sense of urgency with respect to outages and will delegate appropriately to ensure 24/7 coverage for their function. This breadth of experience will be leveraged to mature Mandiant’s SRE teams and processes.
What You Will Do:
· Identify opportunities for improving telemetry, observability, service availability and transaction resilience
· Identify and build self-healing capabilities leveraging APIs, scripting, and coding
· Streamline and optimize tooling and process from risk detection to remediation
· Drive accountability for risk reduction and corrective actions following Post-mortem and RCA reviews
· Own and be accountable for Major Incident and Problem Management processes and execution
· Partner with sustaining engineering teams (both internal and external) to improve service performance and stability
· Partner actively with mergers and acquisitions integration teams to ensure that service availability objectives are met
· Drive architectural reviews for critical services, raising the bar for service availability and transaction resilience
· Track Product Engineering roadmaps to identify forward-looking telemetry and automation opportunities
· Build and sustain enterprise-level service offerings such as Logging as a Service, Telemetry as a Service, and Service Availability reporting and dashboard publication
· Establish and govern standards for New Service Introduction
· Manage vendor relationships with Telemetry Tooling solution providers
· Populate, govern, and maintain the CMDB
· Staff, motivate, and develop a globally diverse, high-performing team
Qualifications:
· Demonstrated management experience leading professionals from technology disciplines including Monitoring, Cloud, Storage, Network, Virtualization and Server technologies
· DevOps leadership experience in a fast-paced mid-sized or large-scale enterprise
· Experience working with a Code Repository such as GitHub and related best practices
· ITSM Expert, or demonstrated success leading IT Service Management programs and initiatives
· Experience with CI/CD pipeline delivery
· Experience supporting Infrastructure-As-Code, including conducting code reviews
· Passionate about automation, with demonstrated success identifying and executing on automation opportunities
· Astute understanding of Data Center technologies and hybrid cloud architecture
· Demonstrated knowledge of Azure and AWS architecture and support of hosted services
· Use of modern DevOps tools such as Terraform, Ansible, Docker, Jenkins, Chef or Puppet
· Scripting in languages such as PowerShell, PHP, Python, Shell or Java
· Knowledge of operating system principles and architecture, particularly for Windows and Linux/Unix
· Experience with Enterprise deployments of SQL, Certificate Authorities, AD OU Structure, Azure AD, and group policies
· Strong analytical skills for solving complex cross-functional problems
· Solid understanding of business value streams and service dependency mapping
· Positive attitude with a strong focus on creativity and active solutioning
· Excellent communication skills (both written and verbal)
· Strong business acumen, with demonstrated success driving organizational change management
· Customer-oriented mindset, challenging the status quo and identifying business value opportunities that can be accelerated by telemetry-driven insight
As a U.S. federal contractor, Mandiant has adopted a COVID-19 Vaccination Policy to comply with our obligations under applicable laws and requirements. This position is covered under Mandiant’s COVID-19 Vaccination Policy and therefore proof of vaccination against COVID-19 will be required as a condition of hire.
At Mandiant we are committed to our #OneTeam approach combining diversity, collaboration, and excellence. All qualified applicants will receive consideration for employment without regard to race, sex, color, religion, sexual orientation, gender identity, national origin, protected veteran status, or on the basis of disability.
This is a regionally based role; the successful candidate must be located on the East Coast.