Principal Software Engineer, GPU Firmware and GPU System Software — CSP Engagements
NVI*** · Austin, TX, United States
Austin, TX, United States · 272,000-431,250 USD/yearly · Onsite
Remuneration
272,000-431,250 USD/yearly
Location
Austin, TX, United States
Visa sponsorship
Sponsors visa
Job summary
We're looking for a Principal Software Engineer to join our CSP Engagements team as the technical focal point for GPU firmware and GPU system software, working directly with engineering teams of key CSP / hyperscale customers to ensure they can reliably manage, update, and operate NVI*** GPU firmware at fleet scale.
Benefits
Applications for this job will be accepted at least until June 30, 2026. This posting is for an existing vacancy. NVIDIA uses AI
Qualifications
- … (e.g., multi-tenancy isolation, secure boot, attestation), and performance, and champion those priorities into NVI***'s GPU firmware/software feature roadmap and delivery plan
- Drive GPU firmware update orchestration for large-scale deployments — multi-GPU update sequencing, rollback strategy, failure handling, and validation across hundreds of GPUs per rack
- Identify cross-CSP GPU SW/FW issue patterns — common update failures, recovery gaps, and configuration problems — and drive documentation, tooling, and test strategy improvements
What we need to see:
- 15+ years of experience in GPU system software, GPU firmware, or accelerator platform engineering
- BS or MS in Computer Science, Electrical Engineering, or related field (or equivalent experience)
- Deep understanding of GPU architecture internals: streaming multiprocessors, GEMM execution, compute kernels, memory hierarchy, and how firmware/driver decisions impact GPU compute performance
- Understanding of multi-GPU fabric architectures (NVLink or similar) and how firmware coordinates across multiple GPUs in a rack-scale system
- Understanding of GPU firmware architecture: VBIOS, GPU microcontroller firmware, InfoROM, and their interaction with the GPU driver stack
- Experience with firmware update lifecycle management at scale: multi-device update sequencing, A/B updates, rollback, staged rollout, emergency recovery
- Understanding of GPU error handling and recovery flows — how firmware-level errors propagate through the driver stack to application-visible failures
- Experience with GPU health monitoring and telemetry: Xid errors, thermal events, power events, ECC counters, and their significance for firmware/software teams
Responsibilities
What you'll be doing:
Degrees
Associate, Bachelor
Industry
Automotive, Energy, Media
Company size
SMB