AI Performance Engineer

Bri*** · New York, NY
New York, NY · Remote
Remuneration: Not specified
Location: New York, NY
Visa sponsorship: Sponsors visa

Job summary

Bri*** is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications.

Benefits

Employment Terms & Visa Policy: This is a 100% remote, full-time, direct W2 position with Bright Vision

Qualifications

  • Ability to turn ambiguity into well-engineered solutions and to raise the bar through code review, design review, and mentorship of more junior engineers.
  • The successful candidate brings strong engineering discipline, a clear communication style, and a track record of shipping meaningful work that holds up well in production.
  • Bachelor's or master's degree in Computer Science, Computer Engineering, or a related field.
  • Six or more years of experience in performance engineering, ML systems, or HPC.
  • Strong proficiency in Python and C++.
  • Hands-on experience optimizing deep learning workloads on modern GPUs.
  • Deep understanding of distributed training and inference techniques.
  • Experience with profiling
  • Experience optimizing LLM inference at production scale.
  • Contributions to vLLM, TensorRT-LLM, DeepSpeed, or similar projects.
  • Familiarity with custom kernel authoring in Triton or CUTLASS.
  • Experience with FinOps for AI workloads.

Responsibilities

  • Work closely with cross-functional partners (product, design, engineering, operations, and business stakeholders) to translate ambiguity into well-engineered solutions.
  • Profile and optimize end-to-end AI training and inference pipelines for throughput, latency, and cost.
  • Identify and eliminate bottlenecks across data loading, model compute, communication, and memory.
  • Implement and tune quantization, sparsity, and pruning strategies to reduce model footprint and accelerate inference.
  • Optimize distributed training using tensor parallelism, pipeline parallelism, FSDP, and ZeRO-style sharding.
  • Tune attention implementations using Flash Attention, paged attention, and related techniques.
  • Implement KV cache optimization, continuous batching, and speculative decoding for LLM serving.
  • Drive compiler-level optimizations using Triton, XLA, Torch Inductor, or TVM, working with the broader ML framework community to land improvements that translate into measurable end-to-end performance gains.
  • Optimize data pipelines, sharding strategies, and storage access patterns for high-throughput training.
  • Build and maintain rigorous benchmark suites and regression frameworks across workloads.
  • Collaborate with ML and platform engineering teams to embed best practices in standard pipelines.
  • Drive cost-efficiency improvements through model architecture, hardware selection, and scheduling strategies.

Skills

Communication

Degrees

Associate · Bachelor · Degree · Master

Industry

Automotive · Energy · Media · Public sector

Company size

SMB