
GPU kernel engineers from Triton/CUDA backgrounds, US-based, junior to mid-level

completed · 17 qualified · 1 run · Apr 27, 5:44 PM
gpu-kernel-engineers-from-tritoncuda-backgrounds-us-based-ju-1777311875
Parsed: NVIDIA · Junior · IC · US

    Qualified Candidates (16)

Anjiang Wei

high hireability

PhD student @ Stanford University

San Francisco, US

    • Stanford PhD student (Alex Aiken lab) with direct GPU kernel research: co-authored 'Astra' (multi-agent GPU kernel performance optimization), 'AccelOpt' (AI accelerator kernel optimization), 'Equivalence Checking of ML GPU Kernels', and AMOS (tensor computation mapping on spatial accelerators, 83 citations)
    • Focus is on LLM-driven GPU kernel optimization — adjacent to hands-on Triton/CUDA but deeply informed by accelerator internals
    • US-based (San Francisco)
    • Hireability: HIGH — likely a 6th+-year PhD at Stanford (first papers ~2020), in the prime finishing window; active publication output through 2025/2026 and no competing signals
William Brandon

high hireability

PhD student @ MIT CSAIL

Previously: Research Assistant @ MIT Media Lab

Cambridge, US

    • Custom CUDA GPU kernel engineer — authored FLUTE (CUTLASS-based kernels fusing dequantization+matmul, 2-4x faster GEMM for quantized LLMs, EMNLP 2024)
    • Also works on multi-GPU LLM inference systems: tensor parallelism, KV cache optimization, speculative decoding
    • PhD student at MIT CSAIL, Cambridge MA, with Jonathan Ragan-Kelley (Halide co-inventor) as collaborator
    • Hireability: HIGH — ~4th-5th-year PhD (Berkeley CS/Math, Fall 2020 → MIT CSAIL), likely in the final stretch and prime for an industry transition
Boyuan Feng

medium hireability

SWE @ PyTorch

Previously: Researcher @ Meta

    • SWE at PyTorch (Meta) working on GPU kernels: co-authored FlexAttention (fused attention kernel compiler, 2024/2025); active CUDA/Inductor/CUDA-graphs contributor (99 PRs to pytorch/pytorch, latest March 2026); UCSB CS PhD with GNNAdvisor/TC-GNN/APNN-TC (CUDA) papers (h-index 15)
    • Maintains tritonbench and CUTLASS forks
    • US-based (UCSB/California)
    • Hireability: MEDIUM — stable Meta/PyTorch SWE role with very recent activity; PhD bio still listed, suggesting a possible near-graduation transition window, but no explicit open-to-work signals
Connor Holmes

medium hireability

Researcher @ OpenAI

Previously: Researcher @ Microsoft

San Francisco, US

    • GPU systems researcher with a direct CUDA/kernel background — personal grnn CUDA repo (low-latency RNN GPU inference), TurboMoE (kernel fusion for MoE training), DeepSpeed-FastGen (GPU inference throughput)
    • Was on the DeepSpeed team at Microsoft before joining OpenAI ~2024 (credited on the Sora paper). 'Researcher' title (not Senior), h-index 14, SF-based — a solid junior-to-mid fit
    • Hireability: MEDIUM — ~2 years at OpenAI, within the typical transition window, but no explicit open-to-work signals found
David Pruitt

medium hireability

Developer @ NVIDIA

Previously: Instructor @ The University of Texas at El Paso

Austin, US

    • Developer at NVIDIA in Austin, US
    • Listed expertise includes 'Computer Architecture, CPUs, GPUs, Memory, Cache', with a 2022 paper on 'Decoupling Cache and Core Speed on Power, Throughput, and Energy' — suggests hardware-level systems knowledge relevant to GPU kernel work
    • Recent 2024 papers on weather forecasting ML (Huge Ensembles at NVIDIA)
    • No direct evidence of Triton/CUDA kernel contributions
    • Moderate match
    • Hireability: MEDIUM — no pipeline signals of recent job change; no GitHub; 'Developer' title at NVIDIA, tenure unknown but likely within transition window
Gabriele Oliaro

medium hireability

CS PhD Student @ Snowflake AI Research

Previously: Research Scientist Intern @ Snowflake

    • Strong ML systems + GPU kernel background at CMU: wrote a fused Softmax/ArgMax CUDA kernel (own repo goliaro/softmax-argmax-fused-kernel), forked FlashInfer (GPU kernel library for LLM serving), and published Korch (optimal GPU kernel orchestration for tensor programs, ASPLOS 2024) and SpecInfer (LLM serving acceleration, 415 citations). h-index 10
    • US-based (Pittsburgh, PA)
    • Hireability: MEDIUM — 4th-year PhD, expected 2027, with ~1-2 years remaining; actively interning at Snowflake, showing industry orientation, but likely finishing the PhD before going full-time
Haocheng Xi

medium hireability

MLSys Researcher @ University of California, Berkeley

Previously: Research Intern @ NVIDIA

Berkeley, US

    • 2nd-year PhD student at UC Berkeley (Yao Class BS, Tsinghua) focusing on GPU kernel optimization for efficient ML: INT4/INT8/FP8 training quantization, sparse attention acceleration (SpargeAttn, 50 citations), and hands-on CUDA work (pinned repos: `how-to-optim-algorithm-in-cuda`, `cuda-tensorcore-hgemm`)
    • NVIDIA internship May–Aug 2025 with MIT Han Lab. h-index 7, 13+ papers at NeurIPS/ICML/ICLR
    • Based in Berkeley, CA
    • Hireability: MEDIUM — 2nd year PhD (started 2024), too early for graduation-driven transition, but strong industry interest shown via NVIDIA internship and 15 website updates in 2025 including cv_updates
Zhihao Zhang

medium hireability

Ph.D. student @ Carnegie Mellon University

Previously: MS student @ Carnegie Mellon University

Pittsburgh, US

    • PhD student at CMU Catalyst building LLM inference systems; pinned FlashInfer (CUDA kernel library for LLM serving) and TVM repos show direct CUDA/kernel background
    • First-author papers on speculative inference (SpecInfer, ASPLOS/OSDI 2024) and sparse attention kernels (TidalDecode, ICLR 2025)
    • US-based in Pittsburgh
    • Hireability: MEDIUM-HIGH — ~5-6 years into PhD at CMU (graduation likely imminent in 2026); LinkedIn profile completely wiped in Jan 2026 (headline, position, all experience/education emptied), suggesting active career transition
Zhuoming Chen

medium hireability

Ph.D. student @ Carnegie Mellon University

Previously: Research Intern @ Meta

New York, US

    • Strong LLM inference systems researcher — first author on SpecInfer (381 citations, ASPLOS 2024), Sequoia, MagicDec, and MagicPIG; h-index 9
    • Advised by Beidi Chen (efficient transformers) and Zhihao Jia (ML systems) at CMU
    • Work focuses on speculative decoding and efficient attention algorithms rather than direct CUDA/Triton kernel authorship — adjacent to GPU kernel engineering but not squarely in it
    • US-based
    • Hireability: MEDIUM — 3rd-year PhD (started 2023), recent Meta FAIR internship under Leon Bottou, likely 1-2 years from graduation; CV updated Feb 2026 but no explicit open-to-work signals
Zihao Ye

medium hireability

Engineer @ NVIDIA

Previously: Intern @ NVIDIA

San Francisco, US

    • Creator and lead engineer of FlashInfer — a CUDA GPU attention kernel library for LLM inference (used in vLLM, SGLang, MLC-Engine; 77 citations)
    • Also core contributor to Apache TVM and MLC-LLM (ML compilation with GPU kernels)
    • Papers: SparseTIR (sparse GPU compilation), TensorIR (tensorized program optimization), FlashInfer
    • Engineer at NVIDIA, US-based (Seattle)
    • Hireability: MEDIUM — GitHub bio 'Sad to be employed.' hints at mild discontent, but no explicit job-seeking signals; no recent website/LinkedIn activity changes; tenure at NVIDIA unclear
Ali Hassani

low hireability

Research Scientist @ NVIDIA

Previously: Graduate Research Assistant @ Georgia Institute of Technology

Atlanta, US

    • NVIDIA Research Scientist (Deep Imagination Research Group) building CUDA/C++ kernels for NATTEN (sparse attention library)
    • PhD Georgia Tech expected 2026, h-index 11
    • Key work: 'Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level' (threadblock-level GPU kernel optimization) and 'Generalized Neighborhood Attention at the Speed of Light' (2025)
    • Active C++ CUDA development, recent Flash Attn 3 fork
    • Atlanta, US
    • Hireability: LOW — joined NVIDIA full-time Oct 2025 (~6 months ago, still in settling-in window)
Da Yan

low hireability

Member of Technical Staff @ Anthropic

Previously: Independent Contractor @ OpenAI

New York, US

    • An exact match for the query: GPU assembler author (turingas for Volta/Turing/Ampere), CUDA kernel work (CUDA-Winograd), and a PhD from HKUST on optimizing DNN kernels plus a GPU compiler backend in LLVM
    • Research expertise listed as 'GPU performance optimizing, GPU compiler'
    • Bio: 'AI compute & compilers.' Now MTS at Anthropic in New York
    • Hireability: LOW — currently at Anthropic (stable, prestigious), no open-to-work signals, no LinkedIn or website activity indicating job search; well within tenure window but no mobility indicators
Lucas Liebenwein

low hireability

Tech Lead, Deep Learning Inference @ NVIDIA

Previously: Chief Architect @ OmniML

New York, US

    • Tech Lead for Deep Learning Inference at NVIDIA, leading TensorRT-LLM AutoDeploy (compiler-driven PyTorch→CUDA inference-optimized graphs)
    • Strong CUDA/GPU inference background; PhD MIT 2021, h-index 13, previously Founding Engineer at OmniML (acquired by NVIDIA)
    • Based in New York, US
    • Seniority is above the query target — 5 years of post-PhD experience; was an Engineering Manager at NVIDIA Feb 2023–May 2025 before transitioning to the Tech Lead IC track
    • Hireability: LOW — only ~11 months into new Tech Lead role (started May 2025), no active signals of seeking opportunities, no LinkedIn/website activity detected
Mark Saroufim

low hireability

Software Engineer @ Meta

Previously: ML Engineer @ Graphcore

San Francisco, US

    • Co-founder of GPU MODE (gpu-mode/lectures, 6k stars; gpu-mode/kernelbot kernel competition platform) and PyTorch maintainer at Meta; co-authored KernelBot and TorchAO papers (2025); Bay Area, US
    • Seniority is mid-level+ (SWE at Meta) but may exceed junior/mid target
    • Hireability: LOW — recently co-founded the Core Automation lab and is unlikely to be open to new roles
Songlin Yang

low hireability

Member of Technical Staff @ Thinking Machines Lab

San Francisco, US

    • Author of FLA (fla-org/flash-linear-attention), a Triton-based library for hardware-efficient linear attention kernels, with 277-citation paper on Gated Linear Attention and direct Triton/CUDA kernel contributions
    • PhD MIT, h-index 15
    • MTS at Thinking Machines Lab (Tri Dao's lab) in SF
    • Hireability: LOW — just moved from MIT PhD to Thinking Machines Lab ~Jan 2026, only ~3-4 months into new role
Ziheng Jiang

low hireability

AI Researcher @ Meta

Previously: Principal Research Scientist @ ByteDance

Seattle, US

    • Apache TVM PMC member and core contributor (ML compiler for GPU kernel optimization, 2.5k citations)
    • Co-authored Flux (communication overlap via GPU kernel fusion, 2024) and COMET (fine-grained MoE computation-communication overlap, 2025)
    • PhD from UW, Seattle-based
    • Strong ML systems + GPU kernel background, though TVM-compiler-centric rather than raw Triton/CUDA
    • Was Principal Research Scientist at ByteDance Seed before Meta
    • Hireability: LOW — pipeline signals show very recent company switch to Meta (captured Feb 2026), likely < 6 months in new role
