AI Infrastructure Engineer
Bri*** · Monroe, NJ
Monroe, NJ · Hybrid
Remuneration
Not specified
Location
Monroe, NJ
Visa sponsorship
Sponsors visa
Job summary
Bri*** is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications.
Benefits
Employment Terms & Visa Policy: This is a 100% remote, full-time, direct W2 position with Bright Vision
Qualifications
- Bachelor’s or Master’s degree in Computer Science or a related field
- Six or more years of experience in infrastructure, platform, or HPC engineering
- Hands-on experience operating GPU clusters or large-scale ML training infrastructure
- Strong proficiency in Python and at least one systems language such as Go or C++
- Deep understanding of distributed training, accelerator architectures, and collective communication
- Experience with Kubernetes, Slurm, Ray, or similar scheduling systems for ML workloads
- Strong understanding of Linux internals, networking, and high-performance storage
- Experience with at least one major cloud provider’s ML infrastructure offerings
- Strong software engineering practices including testing, CI/CD, and code review
- Excellent communication and cross-functional collaboration
- Experience operating InfiniBand or RDMA networking at scale
- Contributions to open-source ML infrastructure projects
Responsibilities
- Design and operate GPU and accelerator infrastructure for training and inference, spanning on-prem clusters, cloud-managed services, and hybrid configurations
- Build scheduling, queueing, and resource-sharing systems that maximize accelerator utilization across many teams
- Integrate frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train into a unified platform offering
- Operate high-performance storage systems and data pipelines that keep accelerators fed with training data at near line rate
- Design networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth collective communication
- Build observability for AI workloads including utilization, throughput, training stability, and failure-mode analytics
- Implement checkpointing, restart, and fault-tolerance patterns for long-running training jobs at scale
- Drive cost optimization across compute, storage, and networking through scheduling, spot capacity, and right-sizing
- Develop developer tooling and paved-road workflows that let researchers launch experiments safely and efficiently
- Partner with research and applied ML teams to plan capacity for upcoming training runs
- Implement security controls, isolation, and access management for multi-tenant AI infrastructure
- Drive automation across cluster provisioning, lifecycle management, and configuration enforcement
Skills
Communication
Degrees
Associate degree
Industry
Automotive, Energy, Media, Public sector
Company size
SMB