
Site Reliability Engineer


Job Number: 22278

Location: Acton Support Centre (123)

Contract: Full Time / Permanent


The Site Reliability Engineering team is a group of highly technical engineers who are tasked with maintaining and developing the reliability, scalability and performance of the platform and infrastructure. The SRE is empowered to drive technical resolutions across the technology stack from application through to infrastructure and all stops in between.


Key responsibilities:


  • Site Reliability Engineers will define operational requirements and standards for software delivered by the Hub and may often be involved in the development of working software alongside product squads.
  • SREs will constantly monitor a range of parameters, such as availability, latency and capacity, to protect live systems, and will reduce the risk of incidents by evolving Hub development standards and continually improving efficiency, reliability and performance.
  • Scale and evolve systems and services to ensure reliability, resiliency, performance and security
  • Solve problems analytically, devising practical solutions in a time-critical production environment
  • Support system design, platform development and deployment reviews
  • Monitor and debug issues across the stack (infrastructure, network, platform, services)
  • Work closely with squads to design & deploy software
  • Conduct blameless post-mortems and implement technical resolutions for high severity incidents
  • Create run books and manage game days to ensure stability and resilience
  • Provide 24x7 on-call support as required
  • Help drive a DevOps culture throughout the organisation


Key Skills & Experience:


  • Wide breadth of knowledge across cloud systems including Azure and AWS
  • Excellent site and system monitoring skills, with deep knowledge of SLIs/SLOs, and the ability to spot risks, prioritise them efficiently and resolve them before they turn into incidents
  • Strong analytical and incident investigation/troubleshooting skills, with the ability to quickly identify the root cause of infrastructure incidents and resolve them
  • Strategic focus with the ability to create and develop software requirements, refining these after incidents to prevent the same type of incident from recurring
  • Excellent interpersonal and relationship building skills. Ability to coach developers on best practices for developing software that is secure and reliable
  • Strong understanding of both development and operations, with a view of how to balance speedy, agile deliveries with stable and secure software and infrastructure development
  • Strong understanding of automation and the desire to create automated solutions
  • In-depth knowledge of scaling software
  • Positive attitude, with a focus on taking proactive steps to ensure that similar incidents don’t occur in the future, without any emphasis on assigning blame
  • Good knowledge of Azure and AWS cloud systems
  • Knowledge of Windows, Docker and cloud technologies is essential
  • Networking experience
  • Experience of SQL, NoSQL & database administration
  • Strong experience of monitoring performance/availability through applications such as New Relic, AppDynamics, CloudWatch or similar
  • Experience in cloud engineering principles
  • Experience of release management
  • Understanding of CI/CD principles
  • Experience with development/scripting languages such as PowerShell, Bash, Go, Python, C# or Node.js