AI Performance Engineer
Bri*** · New York, NY
New York, NY · Remote
Remuneration
Not specified
Location
New York, NY
Visa sponsorship
Sponsors visa
Job summary
Bri*** is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications.
Benefits
Employment Terms & Visa Policy
This is a 100% remote, full-time, direct W2 position with Bright Vision
Qualifications
- Ability to raise the bar through code review, design review, and mentorship of more junior engineers.
- The successful candidate brings strong engineering discipline, a clear communication style, and a track record of shipping meaningful work that holds up well in production.
- Bachelor's or master's degree in Computer Science, Computer Engineering, or a related field.
- Six or more years of experience in performance engineering, ML systems, or HPC.
- Strong proficiency in Python and C++.
- Hands-on experience optimizing deep learning workloads on modern GPUs.
- Deep understanding of distributed training and inference techniques.
- Experience with profiling
- Experience optimizing LLM inference at production scale.
- Contributions to vLLM, TensorRT-LLM, DeepSpeed, or similar projects.
- Familiarity with custom kernel authoring in Triton or CUTLASS.
- Experience with FinOps for AI workloads.
Responsibilities
- Work closely with cross-functional partners (product, design, engineering, operations, and business stakeholders) to translate ambiguous requirements into well-engineered solutions.
- Profile and optimize end-to-end AI training and inference pipelines for throughput, latency, and cost.
- Identify and eliminate bottlenecks across data loading, model compute, communication, and memory.
- Implement and tune quantization, sparsity, and pruning strategies to reduce model footprint and accelerate inference.
- Optimize distributed training using tensor parallelism, pipeline parallelism, FSDP, and ZeRO-style sharding.
- Tune attention implementations using Flash Attention, paged attention, and related techniques.
- Implement KV cache optimization, continuous batching, and speculative decoding for LLM serving.
- Drive compiler-level optimizations using Triton, XLA, Torch Inductor, or TVM, working with the broader ML framework community to land improvements that translate into measurable end-to-end performance gains.
- Optimize data pipelines, sharding strategies, and storage access patterns for high-throughput training.
- Build and maintain rigorous benchmark suites and regression frameworks across workloads.
- Collaborate with ML and platform engineering teams to embed best practices in standard pipelines.
- Drive cost-efficiency improvements through model architecture, hardware selection, and scheduling strategies.
Skills
Communication
Degrees
Associate, Bachelor, Degree, Master
Industry
Automotive, Energy, Media, Public-sector
Company size
SMB