Sr. SRE Platform Software Engineer
Bit*** · San Jose, CA, United States
San Jose, CA, United States · Remote
Remuneration
Not specified
Location
San Jose, CA, United States
Visa sponsorship
Sponsors visa
Job summary
About Bit***
Bitdeer is a world-leading technology company for AI and Bitcoin mining infrastructure. Bitdeer is committed to providing comprehensive Bitcoin mining solutions for its customers and building AI computational infrastructure to support the AI revolution.
Qualifications
- Software Engineering Experience: 7+ years of production software engineering experience, including 2 or more years operating what you built (real on-call experience, not just shipping code).
- Programming Languages: Production-depth mastery of at least one systems-grade language: Go (preferred), Rust, or Java. Proficiency in Python for tooling and SDK work.
- Distributed Systems Fundamentals: Strong grasp of at-least-once vs. exactly-once trade-offs, idempotency, back-pressure, leader election, consistent hashing, gossip, and fan-out. Ability to evaluate CRDT vs. Raft vs. Paxos and select the right tool for the job.
- Multi-Region Observability Stack: Experience at production scale with Prometheus, VictoriaMetrics, Mimir, Thanos, Loki, Elasticsearch, Tempo, Jaeger, or OpenTelemetry. Must have built or substantively contributed to the ingest, query, or storage paths of these systems.
- GitOps & CI/CD: Hands-on experience with Argo, Flux, Helm, Kustomize, Cosign signing, signed-bundle promotion, and blast-radius-aware rollouts.
- Kubernetes Operator Pattern: Proven experience writing a controller or CRD handling real production traffic, with a deep understanding of watch-cache mechanics, leader election, and reconcile loops.
Responsibilities
You will own 1-2 of these:
- Collection & Storage: collection-agent, customer-sdk-gateway, metrics-store, logs-store, traces-store, profiles-store, analytics-lake, enrichment-service, collection-monitor.
- Alert, Correlation & SLO: alert-engine-framework, alert-correlation, slo-framework, default M-series alert rules.
- Topology, Cluster-Health & Cluster Platform Services: topology-service, cluster-health-rollup, OSS-SRE-tool collection plugins for K8s, Slurm, Ray, Volcano, Kueue, and KubeRay.
- Fault-Prediction: prediction-engine-framework and built-in predictors (GPU, Link, Disk, XPA, Straggler, SDC, Stranded GPU).
- Remediation, Workflow, Inspection & Jobs: remediation-actuator, orchestration-substrate (workflow engine), inspection-orchestrator, job-scheduler, NCCL-baseline inspection probe.
- Hardware Lifecycle & DC Ops: hardware-lifecycle, dc-operations, boot-provisioning, rolling-upgrade, bare-metal-bmc-service, auto-discovery, ZTP D0–D5 pipeline, IPMI bare-metal management.
- Identity, Secrets, Tenant-Config & CMDB: iam-service, secrets-service, tenant-sre-config, cmdb-cache, schema registry.
- Customer-Bridge, Ticketing & SRE Platform Portal: customer-bridge, customer-ticketing, sre-operation-system, Customer Console BFF, SRE Console BFF.
- Backup, DR & Meta-Monitor: backup-orchestrator, meta-monitor, external-watcher integration (Datadog or equivalent).
- CI/CD, GitOps, Plugin Framework & SRE Image Registry: cicd-pipeline, gitops-sync, plugin-registry, sre-image-registry.
- Self-Improving Agent: agent-control-plane, agent-discovery, agent-codegen, agent-sandbox, per-Region LLM gateway.
Degrees
Associate, Bachelor
Work schedule
On-call, Rotation, Shift
Industry
Automotive, Logistics, Media
Company size
SMB
Security clearance
Secret