Senior Site Reliability Engineer (SRE)

The*** · United Kingdom · Remote
Remuneration
43,603-112,893 GBP/yearly
Location
United Kingdom · Remote
British Summer Time (UTC+1)
Visa sponsorship
Sponsors visa

Job summary

About us: We are an international IT and Fulfillment Services company with offices in the US, UK, and Portugal. The*** works with organisations to create digital and technology platforms that drive transformation, develop capabilities, and build businesses around the world.

Responsibilities

  • In this role, you will be responsible for bringing order, stability, and robust resource hygiene to a complex, high-stakes core banking environment deployed on Azure Kubernetes Service (AKS).
  • You will lead the transition from a reactive firefighting posture to a proactive, durable operational baseline, ensuring that complex data flows and multi-tiered applications run reliably.
  • Kubernetes Resource Governance & Architecture
      • Enforce Pod-Level Hygiene: Mandate and configure explicit CPU, memory, and ephemeral storage requests and limits across all microservices to eliminate resource over-commitment and node-level evictions.
      • Autoscaling Safeguards: Manage Horizontal Pod Autoscaler (HPA) configurations carefully, ensuring that high-risk, single-replica overrides (e.g., min=max=1) are banned on critical web UI and front-office tiers.
      • Cluster Readiness Management: Implement and maintain robust application readiness probes to gate traffic after a restart, ensuring clients never hit uninitialized or unready application pods.
  • Database & Application Performance Tuning
      • Deadlock Mitigation: Analyze, trace, and remediate persistent database performance bottlenecks, focusing on Azure SQL Hyperscale deadlocks, aborted transactions, and high data I/O contention.
  • Advanced Observability & Alert Engineering
      • De-noise the Paging Layer: Take ownership of a noisy alerting system that generates hundreds of threshold alerts daily, implementing aggressive deduplication and alert-noise reduction.
  • Incident Response & Root Cause Analysis (RCA)
      • Telemetry-Driven Diagnostics: Utilize deep-dive APM

Degrees

Associate Degree

Industry

Automotive, Banking, Media

Company size

Enterprise