AI Performance Engineer

Bri*** · New York, NY

New York, NY · Remote

Remuneration

Not specified

Location

New York, NY

Visa sponsorship

Sponsors visa

Job summary

Bri*** is a forward-thinking software development company dedicated to building innovative solutions that help businesses automate and optimize their operations. We leverage cutting-edge technologies to create scalable, secure, and user-friendly applications.

Benefits

Employment Terms & Visa Policy: This is a 100% remote, full-time, direct W2 position with Bright Vision

Qualifications

The successful candidate brings strong engineering discipline, a clear communication style, and a track record of shipping meaningful work that holds up well in production.
Bachelor's or master's degree in Computer Science, Computer Engineering, or a related field.
Six or more years of experience in performance engineering, ML systems, or HPC.
Strong proficiency in Python and C++.
Hands-on experience optimizing deep learning workloads on modern GPUs.
Deep understanding of distributed training and inference techniques.
Experience with profiling
Experience optimizing LLM inference at production scale.
Contributions to vLLM, TensorRT-LLM, DeepSpeed, or similar projects.
Familiarity with custom kernel authoring in Triton or CUTLASS.
Experience with FinOps for AI workloads.

Responsibilities

In this role you will work closely with cross-functional partners (product, design, engineering, operations, and business stakeholders) to translate ambiguous [...] into well-engineered solutions, and will be expected to raise the bar through code review, design review, and mentorship of more junior engineers.
Profile and optimize end-to-end AI training and inference pipelines for throughput, latency, and cost.
Identify and eliminate bottlenecks across data loading, model compute, communication, and memory.
Implement and tune quantization, sparsity, and pruning strategies to reduce model footprint and accelerate inference.
Optimize distributed training using tensor parallelism, pipeline parallelism, FSDP, and ZeRO-style sharding.
Tune attention implementations using Flash Attention, paged attention, and related techniques.
Implement KV cache optimization, continuous batching, and speculative decoding for LLM serving.
Drive compiler-level optimizations using Triton, XLA, Torch Inductor, or TVM, working with the broader ML framework community to land improvements that translate into measurable end-to-end performance gains.
Optimize data pipelines, sharding strategies, and storage access patterns for high-throughput training.
Build and maintain rigorous benchmark suites and regression frameworks across workloads.
Collaborate with ML and platform engineering teams to embed best practices in standard pipelines.
Drive cost-efficiency improvements through model architecture, hardware selection, and scheduling strategies.

Skills

Communication

Degrees

Associate · Bachelor · Degree · Master

Industry

Automotive · Energy · Media · Public-sector

Company size

SMB