
GPU kernel engineers from Triton/CUDA backgrounds, US-based, junior to mid-level

completed · 17 qualified · 1 run · Apr 27, 5:44 PM
gpu-kernel-engineers-from-tritoncuda-backgrounds-us-based-ju-1777311875
Parsed: NVIDIA · Junior · IC · US

    Qualified Candidates (16)

Anjiang Wei

high hireability

PhD student @ Stanford University

San Francisco, US

    • Stanford PhD student (Alex Aiken lab) with direct GPU kernel research: co-authored 'Astra' (multi-agent GPU kernel performance optimization), 'AccelOpt' (AI accelerator kernel optimization), 'Equivalence Checking of ML GPU Kernels', and AMOS (tensor computation mapping on spatial accelerators, 83 citations)
    • Focus is on LLM-driven GPU kernel optimization — adjacent to hands-on Triton/CUDA but deeply informed by accelerator internals
    • US-based (San Francisco)
    • Hireability: HIGH — likely a 6th+-year PhD at Stanford (first papers ~2020), in the prime finishing window; active publication output through 2025/2026 and no competing signals
William Brandon

high hireability

PhD student @ MIT CSAIL

Previously: Research Assistant @ MIT Media Lab

Cambridge, US

    • Custom CUDA GPU kernel engineer — authored FLUTE (CUTLASS-based kernels fusing dequantization+matmul, 2-4x faster GEMM for quantized LLMs, EMNLP 2024)
    • Also works on multi-GPU LLM inference systems: tensor parallelism, KV cache optimization, speculative decoding
    • PhD student at MIT CSAIL, Cambridge MA, with Jonathan Ragan-Kelley (Halide co-inventor) as collaborator
    • Hireability: HIGH — ~4th-5th-year PhD (Berkeley CS/Math, Fall 2020 → MIT CSAIL), likely in the final stretch and prime for an industry transition
Boyuan Feng

medium hireability

SWE @ PyTorch

Previously: Researcher @ Meta

    • SWE at PyTorch (Meta) working on GPU kernels: co-authored FlexAttention (fused attention kernel compiler, 2024/2025); active CUDA/Inductor/CUDA-graphs contributor (99 PRs to pytorch/pytorch, latest March 2026); UCSB CS PhD with GNNAdvisor/TC-GNN/APNN-TC (CUDA) papers (h-index 15)
    • Maintains tritonbench and CUTLASS forks
    • US-based (UCSB/California)
    • Hireability: MEDIUM — stable Meta/PyTorch SWE role with very recent activity; PhD bio still listed, suggesting a possible near-graduation transition window, but no explicit open-to-work signals
Connor Holmes

medium hireability

Researcher @ OpenAI

Previously: Researcher @ Microsoft

San Francisco, US

    • GPU systems researcher with a direct CUDA/kernel background — personal grnn CUDA repo (low-latency RNN GPU inference), TurboMoE (kernel fusion for MoE training), DeepSpeed-FastGen (GPU inference throughput)
    • Was on the DeepSpeed team at Microsoft before joining OpenAI ~2024 (credited on the Sora paper). 'Researcher' title (not Senior), h-index 14, SF-based — a solid junior-to-mid fit
    • Hireability: MEDIUM — ~2 years at OpenAI, within the typical transition window, but no explicit open-to-work signals found
David Pruitt

medium hireability

Developer @ NVIDIA

Previously: Instructor @ The University of Texas at El Paso

Austin, US

    • Developer at NVIDIA in Austin, US
    • Listed expertise includes 'Computer Architecture, CPUs, GPUs, Memory, Cache', with a 2022 paper on 'Decoupling Cache and Core Speed on Power, Throughput, and Energy' — suggests hardware-level systems knowledge relevant to GPU kernel work
    • Recent 2024 papers on weather forecasting ML (Huge Ensembles at NVIDIA)
    • No direct evidence of Triton/CUDA kernel contributions
    • Moderate match
    • Hireability: MEDIUM — no pipeline signals of recent job change; no GitHub; 'Developer' title at NVIDIA, tenure unknown but likely within transition window
Gabriele Oliaro

medium hireability

CS PhD Student @ Snowflake AI Research

Previously: Research Scientist Intern @ Snowflake

    • Strong ML systems + GPU kernel background at CMU: wrote a fused Softmax/ArgMax CUDA kernel (own repo goliaro/softmax-argmax-fused-kernel), forked FlashInfer (GPU kernel library for LLM serving), and published Korch (optimal GPU kernel orchestration for tensor programs, ASPLOS 2024) and SpecInfer (LLM serving acceleration, 415 citations). h-index 10
    • US-based (Pittsburgh, PA)
    • Hireability: MEDIUM — 4th-year PhD, expected 2027, with ~1-2 years remaining; actively interning at Snowflake, showing industry orientation, but likely finishing the PhD before going full-time
Haocheng Xi

medium hireability

MLSys Researcher @ University of California, Berkeley

Previously: Research Intern @ NVIDIA

Berkeley, US

    • 2nd-year PhD student at UC Berkeley (Yao Class BS, Tsinghua) focusing on GPU kernel optimization for efficient ML: INT4/INT8/FP8 training quantization, sparse attention acceleration (SpargeAttn, 50 citations), and hands-on CUDA work (pinned repos: `how-to-optim-algorithm-in-cuda`, `cuda-tensorcore-hgemm`)
    • NVIDIA internship May–Aug 2025 with MIT Han Lab. h-index 7, 13+ papers at NeurIPS/ICML/ICLR
    • Based in Berkeley, CA
    • Hireability: MEDIUM — 2nd year PhD (started 2024), too early for graduation-driven transition, but strong industry interest shown via NVIDIA internship and 15 website updates in 2025 including cv_updates
Zhihao Zhang

medium hireability

Ph.D. student @ Carnegie Mellon University

Previously: MS student @ Carnegie Mellon University

Pittsburgh, US

    • PhD student at CMU Catalyst building LLM inference systems; pinned FlashInfer (CUDA kernel library for LLM serving) and TVM repos show direct CUDA/kernel background
    • First-author papers on speculative inference (SpecInfer, ASPLOS/OSDI 2024) and sparse attention kernels (TidalDecode, ICLR 2025)
    • US-based in Pittsburgh
    • Hireability: MEDIUM-HIGH — ~5-6 years into PhD at CMU (graduation likely imminent in 2026); LinkedIn profile completely wiped in Jan 2026 (headline, position, all experience/education emptied), suggesting active career transition
Zhuoming Chen

medium hireability

Ph.D. student @ Carnegie Mellon University

Previously: Research Intern @ Meta

New York, US

    • Strong LLM inference systems researcher — first author on SpecInfer (381 citations, ASPLOS 2024), Sequoia, MagicDec, and MagicPIG; h-index 9
    • Advised by Beidi Chen (efficient transformers) and Zhihao Jia (ML systems) at CMU
    • Work focuses on speculative decoding and efficient attention algorithms rather than direct CUDA/Triton kernel authorship — adjacent to GPU kernel engineering but not squarely in it
    • US-based
    • Hireability: MEDIUM — 3rd-year PhD (started 2023), recent Meta FAIR internship under Leon Bottou, likely 1-2 years from graduation; CV updated Feb 2026 but no explicit open-to-work signals
Zihao Ye

medium hireability

Engineer @ NVIDIA

Previously: Intern @ NVIDIA

San Francisco, US

    • Creator and lead engineer of FlashInfer — a CUDA GPU attention kernel library for LLM inference (used in vLLM, SGLang, MLC-Engine; 77 citations)
    • Also core contributor to Apache TVM and MLC-LLM (ML compilation with GPU kernels)
    • Papers: SparseTIR (sparse GPU compilation), TensorIR (tensorized program optimization), FlashInfer
    • Engineer at NVIDIA, US-based (Seattle)
    • Hireability: MEDIUM — GitHub bio 'Sad to be employed.' hints at mild discontent, but no explicit job-seeking signals; no recent website/LinkedIn activity changes; tenure at NVIDIA unclear
Ali Hassani

low hireability

Research Scientist @ NVIDIA

Previously: Graduate Research Assistant @ Georgia Institute of Technology

Atlanta, US

    • NVIDIA Research Scientist (Deep Imagination Research Group) building CUDA/C++ kernels for NATTEN (sparse attention library)
    • PhD Georgia Tech expected 2026, h-index 11
    • Key work: 'Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level' (threadblock-level GPU kernel optimization) and 'Generalized Neighborhood Attention at the Speed of Light' (2025)
    • Active C++ CUDA development, recent Flash Attn 3 fork
    • Atlanta, US
    • Hireability: LOW — joined NVIDIA full-time Oct 2025 (~6 months ago, still in settling-in window)
Da Yan

low hireability

Member of Technical Staff @ Anthropic

Previously: Independent Contractor @ OpenAI

New York, US

    • An exact match for the query: GPU assembler author (turingas for Volta/Turing/Ampere), CUDA kernel work (CUDA-Winograd), and a PhD from HKUST on optimizing DNN kernels plus a GPU compiler backend in LLVM
    • Research expertise listed as 'GPU performance optimizing, GPU compiler'
    • Bio: 'AI compute & compilers.' Now MTS at Anthropic in New York
    • Hireability: LOW — currently at Anthropic (stable, prestigious), no open-to-work signals, no LinkedIn or website activity indicating job search; well within tenure window but no mobility indicators
Lucas Liebenwein

low hireability

Tech Lead, Deep Learning Inference @ NVIDIA

Previously: Chief Architect @ OmniML

New York, US

    • Tech Lead for Deep Learning Inference at NVIDIA, leading TensorRT-LLM AutoDeploy (compiler-driven PyTorch→CUDA inference-optimized graphs)
    • Strong CUDA/GPU inference background; PhD MIT 2021, h-index 13, previously Founding Engineer at OmniML (acquired by NVIDIA)
    • Based in New York, US
    • Seniority is above the query target — 5 years of post-PhD experience; was an Engineering Manager at NVIDIA Feb 2023–May 2025 before transitioning to the Tech Lead IC track
    • Hireability: LOW — only ~11 months into new Tech Lead role (started May 2025), no active signals of seeking opportunities, no LinkedIn/website activity detected
Mark Saroufim

low hireability

Software Engineer @ Meta

Previously: ML Engineer @ Graphcore

San Francisco, US

    • Co-founder of GPU MODE (gpu-mode/lectures, 6k stars; gpu-mode/kernelbot kernel competition platform) and PyTorch maintainer at Meta; co-authored KernelBot and TorchAO papers (2025); Bay Area, US
    • Seniority is mid-level+ (SWE at Meta) but may exceed junior/mid target
    • Hireability: LOW — recently co-founded the Core Automation lab and is unlikely to be open to new roles
Songlin Yang

low hireability

Member of Technical Staff @ Thinking Machines Lab

San Francisco, US

    • Author of FLA (fla-org/flash-linear-attention), a Triton-based library for hardware-efficient linear attention kernels, with 277-citation paper on Gated Linear Attention and direct Triton/CUDA kernel contributions
    • PhD MIT, h-index 15
    • MTS at Thinking Machines Lab (Tri Dao's lab) in SF
    • Hireability: LOW — just moved from MIT PhD to Thinking Machines Lab ~Jan 2026, only ~3-4 months into new role
Ziheng Jiang

low hireability

AI Researcher @ Meta

Previously: Principal Research Scientist @ ByteDance

Seattle, US

    • Apache TVM PMC member and core contributor (ML compiler for GPU kernel optimization, 2.5k citations)
    • Co-authored Flux (communication overlap via GPU kernel fusion, 2024) and COMET (fine-grained MoE computation-communication overlap, 2025)
    • PhD from UW, Seattle-based
    • Strong ML systems + GPU kernel background, though TVM-compiler-centric rather than raw Triton/CUDA
    • Was Principal Research Scientist at ByteDance Seed before Meta
    • Hireability: LOW — pipeline signals show very recent company switch to Meta (captured Feb 2026), likely < 6 months in new role
