Senior Site Reliability Engineer (SRE)

The*** · United Kingdom · Remote

Remuneration

43,603-112,893 GBP per year

Location

United Kingdom · Remote

British Summer Time (UTC+1)

Visa sponsorship

Sponsors visa

Job summary

About us: We are an international IT and Fulfillment Services company with offices in the US, UK, and Portugal. The*** works with organisations to create digital and technology platforms that drive transformation, develop capabilities, and build businesses around the world.

Responsibilities

In this role, you will be responsible for bringing order, stability, and robust resource hygiene to a complex, high-stakes core banking environment deployed on Azure Kubernetes Service (AKS).
You will lead the transition from a reactive firefighting posture to a proactive, durable operational baseline—ensuring that complex data flows and multi-tiered applications run reliably.
Kubernetes Resource Governance & Architecture
Enforce Pod-Level Hygiene: Mandate and configure explicit CPU, memory, and ephemeral storage requests and limits across all microservices to completely eliminate resource over-commitment and node-level evictions.
Autoscaling Safeguards: Manage Horizontal Pod Autoscaler (HPA) configurations carefully, ensuring that high-risk, single-replica overrides (e.g., min=max=1) are banned on critical web UI and front-office tiers.
Cluster Readiness Management: Implement and maintain robust application readiness probes that gate traffic after a restart, ensuring clients never reach uninitialized or unready application pods.
Database & Application Performance Tuning
Deadlock Mitigation: Analyze, trace, and remediate persistent database performance bottlenecks, specifically focusing on Azure SQL Hyperscale deadlocks, aborted transactions, and high data I/O contention.
Advanced Observability & Alert Engineering
De-noise the Paging Layer: Take ownership of an inherently noisy alerting system generating hundreds of daily threshold alerts, implementing aggressive deduplication and alert-noise reduction.
Incident Response & Root Cause Analysis (RCA)
Telemetry-Driven Diagnostics: Utilize deep-dive APM

Degrees

Associate Degree

Industry

Automotive · Banking · Media

Company size

Enterprise