Site Reliability Engineer

Spe*** · San Francisco, CA, United States
San Francisco, CA, United States · Remote
Remuneration
Not specified
Location
San Francisco, CA, United States
Visa sponsorship
Sponsors visa

Job summary

Company Background: Spe***'s mission is to help automate the physical world. Today, we build video sensors with state-of-the-art AI agents that answer any question, anywhere in their environments. Our systems can automatically detect and reason about any physical activity captured on camera, from security incidents (e.g.

Benefits

We offer both long-range wireless (1 km range) and wired sensor variants to suit

Qualifications

  • Observability Owner — Fleet Visibility
  • Design and implement observability (logging, metrics, alerting) across edge devices and cloud infrastructure (AWS).
  • Surface and close telemetry gaps; build fleet-wide visibility that enables data-driven reliability decisions.
  • Develop runbooks, incident response procedures, and participate in on-call rotations.
  • Strong Linux systems administration — comfortable working over SSH in production, not just dev environments.
  • Experience with edge or on-prem hardware alongside cloud infrastructure.
  • Solid networking fundamentals: DNS, firewalls, VPNs, subnets, secure remote access.
  • Scripting or programming in Python, Go, or Bash for operational tooling.
  • Familiarity with containerization (Docker, Kubernetes a plus).
  • Embedded systems experience — reading firmware logs, understanding hardware-software boundaries, and reasoning about what's happening below the OS is a meaningful edge in this role.
  • Deeper cloud experience (AWS infrastructure, IAM, networking, observability tooling) is a strong plus for owning the cloud side of the fleet.
  • Rust or C experience — we have firmware in both; being able to read and reason about low-level code accelerates triage significantly.

Responsibilities

  • You'll drive reliability across our sensor fleet — triaging issues in the field, building the systems that prevent them from recurring, and owning the observability that keeps us ahead of problems as we scale.
  • Reactive — Triage & Recovery
  • Debug and triage issues across a live fleet of diverse Linux-based sensor nodes and edge appliances deployed at customer sites.
  • SSH into field hardware to diagnose, patch, and recover systems — often with limited remote access and incomplete information.
  • Own site bring-ups end to end; be the person who gets things back online.
  • Systems Builder — Close the Loop
  • Build and maintain fleet management systems: OTA update pipelines, device health tracking, remote diagnostics, and lifecycle tooling.
  • Identify repeat fires and eliminate them — build tooling, pre-deployment checks, and root cause processes that prevent recurrence.
  • Automate toil relentlessly: if you're doing something twice, you should be scripting it.
  • Collaborate with embedded systems and platform teams to define reliability and deployment

Degrees

Associate

Work schedule

On-call, Rotation, Shift

Industry

Automotive, Energy, Oil and gas