Senior Site Reliability Engineer (SRE)
The*** · United Kingdom · Remote
Remuneration
43,603-112,893 GBP per year
Location
United Kingdom · Remote
British Summer Time (UTC+1)
Visa sponsorship
Sponsors visa
Job summary
About us: We are an international IT and Fulfillment Services company with offices in the US, UK, and Portugal. The*** works with organisations to create digital and technology platforms that drive transformation, develop capabilities, and build businesses around the world.
Responsibilities
- In this role, you will be responsible for bringing order, stability, and robust resource hygiene to a complex, high-stakes core banking environment deployed on Azure Kubernetes Service (AKS).
- You will lead the transition from a reactive firefighting posture to a proactive, durable operational baseline, ensuring that complex data flows and multi-tiered applications run reliably.
- Kubernetes Resource Governance & Architecture
  - Enforce Pod-Level Hygiene: Mandate and configure explicit CPU, memory, and ephemeral storage requests and limits across all microservices to eliminate resource over-commitment and node-level evictions.
  - Autoscaling Safeguards: Manage Horizontal Pod Autoscaler (HPA) configurations carefully, ensuring that high-risk, single-replica overrides (e.g., min=max=1) are banned on critical web UI and front-office tiers.
  - Cluster Readiness Management: Implement and maintain robust application readiness probes to properly gate traffic post-restart, ensuring clients never hit uninitialized or not-yet-ready application pods.
- Database & Application Performance Tuning
  - Deadlock Mitigation: Analyze, trace, and remediate persistent database performance bottlenecks, specifically focusing on Azure SQL Hyperscale deadlocks, aborted transactions, and high data I/O contention.
- Advanced Observability & Alert Engineering
  - De-noise the Paging Layer: Take ownership of an inherently noisy alerting system that generates hundreds of threshold alerts daily, implementing aggressive deduplication and alert-noise reduction.
- Incident Response & Root Cause Analysis (RCA)
  - Telemetry-Driven Diagnostics: Utilize deep-dive APM
Degrees
Associate Degree
Industry
Automotive, Banking, Media
Company size
Enterprise