AI Infrastructure Engineer

Bri*** · Monroe, NJ
Monroe, NJ · Hybrid
Remuneration: Not specified
Location: Monroe, NJ
Visa sponsorship: Sponsors visa

Job summary

Bri*** is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications.

Benefits

Employment Terms & Visa Policy: This is a 100% remote, full-time, direct W2 position with Bright Vision.

Qualifications

  • Bachelor’s or Master’s degree in Computer Science or a related field
  • Six or more years of experience in infrastructure, platform, or HPC engineering
  • Hands-on experience operating GPU clusters or large-scale ML training infrastructure
  • Strong proficiency in Python and at least one systems language such as Go or C++
  • Deep understanding of distributed training, accelerator architectures, and collective communication
  • Experience with Kubernetes, Slurm, Ray, or similar scheduling systems for ML workloads
  • Strong understanding of Linux internals, networking, and high-performance storage
  • Experience with at least one major cloud provider’s ML infrastructure offerings
  • Strong software engineering practices including testing, CI/CD, and code review
  • Excellent communication and cross-functional collaboration
  • Experience operating InfiniBand or RDMA networking at scale
  • Contributions to open-source ML infrastructure projects

Responsibilities

  • Design and operate GPU and accelerator infrastructure for training and inference, spanning on-prem clusters, cloud-managed services, and hybrid configurations
  • Build scheduling, queueing, and resource-sharing systems that maximize accelerator utilization across many teams
  • Integrate frameworks such as PyTorch, JAX, DeepSpeed, FSDP, Megatron-LM, and Ray Train into a unified platform offering
  • Operate high-performance storage systems and data pipelines that keep accelerators fed with training data at near-line-rate
  • Design networking architectures supporting RDMA, InfiniBand, NCCL, and high-bandwidth collective communication
  • Build observability for AI workloads including utilization, throughput, training stability, and failure-mode analytics
  • Implement checkpointing, restart, and fault-tolerance patterns for long-running training jobs at scale
  • Drive cost optimization across compute, storage, and networking through scheduling, spot capacity, and right-sizing
  • Develop developer tooling and paved-road workflows that let researchers launch experiments safely and efficiently
  • Partner with research and applied ML teams to plan capacity for upcoming training runs
  • Implement security controls, isolation, and access management for multi-tenant AI infrastructure
  • Drive automation across cluster provisioning, lifecycle management, and configuration enforcement

Skills

Communication

Degrees

Associate Degree

Industry

Automotive, Energy, Media, Public sector

Company size

SMB