Junior GPU kernel engineers in the US with CUDA and Triton experience

completed · 51 qualified · 1 run · Apr 19, 3:24 PM · junior-gpu-kernel-engineers-in-the-us-with-cuda-and-triton-e
Parsed: 2 topics · Junior · Engineer · US

    Qualified Candidates (51)

    Aaryan Singhal

    high hireability
    • Stanford CS student in Palo Alto; owns CUDA repo targeting H100 kernel optimization and CUDABenchmarks, plus 321 commits to ThunderKittens (TK) and TKConvs
    • Hands-on GPU kernel work evidenced across multiple CUDA repos
    • Hireability: HIGH — active Stanford student with H100 CUDA kernel experience and substantial ThunderKittens contribution depth

    Alex Hu

    high hireability
    • MIT student (alexhu@mit.edu), Cambridge MA (US)
    • Owns cudalol CUDA kernels repo and jax_quant_llm, plus ThunderKittens contributor; student status with real CUDA work confirmed
    • Hireability: HIGH — MIT student with dedicated CUDA kernel repo and ThunderKittens + quantization work

    Austin Liu

    high hireability
    • Austin Liu — UCI MS student (2024 grad), Liger-Kernel contributor (18 commits, FP8 kernels in Triton)
    • Active GPU kernel practitioner
    • Junior-level, Orange County CA

    Brian K. Ryu

    high hireability
    • Brian K. Ryu: NVIDIA SWE (~2 yrs), FlashInfer MoE Blackwell CUDA kernels (GEMM + attention on H100/B200)
    • Direct CUDA kernel engineering at NVIDIA on cutting-edge hardware
    • Strong junior, US-based

    enbao

    high hireability
    • Stanford student (@Stanford bio, enbao.me) — US-based via Stanford affiliation
    • Owns kernels repo (custom SGEMM/SGEMV in CUDA) and tk-old-blackwell (Blackwell TK fork, CUDA), plus ThunderKittens contributor
    • Direct evidence of low-level CUDA kernel authorship at cutting-edge (Blackwell)
    • Hireability: HIGH — Stanford student writing SGEMM/SGEMV from scratch and porting TK to Blackwell is a strong junior CUDA kernel signal

    Kyle Wang

    high hireability
    • Kyle Wang — AMD GPU SW engineer, 73 Triton PRs targeting GFX1250 (Strix Halo chip), low-level kernel contributor. ~2-3 yrs exp
    • Strong CUDA/Triton depth via AMD ecosystem, US-based

    Matthew Bonanni

    high hireability
    • Stanford PhD completed 2025, now MLE at Red Hat in Boston MA — bio explicitly lists HPC, C++, CUDA, LLM inference. 165 PRs to vllm-project including FLASH_ATTN_MLA_SPARSE backend (FA4 DSA on Blackwell), FlashMLA CUDA fork, MLA sparse CUTLASS kernel work
    • Personal LLM.cu repo confirms standalone kernel authoring
    • New-grad (PhD 2025) with no senior title, US-based
    • Hireability: HIGH — PhD new grad with hands-on CUDA kernel work at vLLM maintainer level

    Shivam Sahni

    high hireability
    • Shivam Sahni — MS UCSD 2024, Together AI SWE
    • Liger-Kernel top contributor (40+ commits, RoPE/activation kernels in Triton)
    • Strong junior with industry traction, US-based

    Steven Shimizu

    high hireability
    • Steven Shimizu — US-based (Pacific timezone), Liger-Kernel contributor (23 commits, FP8/RoPE kernels in Triton)
    • Junior practitioner with strong hands-on Triton evidence

    Stuart Sul

    high hireability
    • ThunderKittens rank-2 contributor (524 commits) shipping megakernels, ring attention, and MXFP8 training kernels; owns gpu-experiments and deltaspill (CUDA register spill debugger)
    • Stanford PhD student + ML Researcher at Cursor, Palo Alto CA, no senior title
    • Hireability: HIGH — rare combination of production-grade CUDA megakernel work and active PhD-level research at a top AI lab

    Thien Tran

    high hireability
    • Owns gn-kernels (CUDA, benchmarked on H200/B200/RTX5090) with CUTLASS INT4/FP8 matmul and Triton attention kernels, plus gpu-mode-kernels repo and deep GPU architecture blog on TMA swizzling and tcgen05
    • No company, no senior title; independent practitioner
    • NOT US-based (Singapore) — strong enough to qualify despite location
    • Hireability: HIGH — rare combination of production-grade CUDA+Triton kernel authorship with benchmarks on latest Blackwell/Hopper hardware

    Wenxuan Tan

    high hireability
    • Wenxuan Tan — UW Madison, flash-attention FA3 contributor (CUDA internals, pingpong scheduling)
    • Strong CUDA depth on transformer kernels
    • Junior, US-based

    Aditya K Kamath

    medium hireability
    • UW PhD student (computer systems/architecture), Seattle WA, 5 PRs to FlashInfer on GPU kernel memory hierarchy
    • Owns Legion-gcs and InvariantBitPacking in CUDA
    • No senior title
    • Hireability: MEDIUM — solid systems-focused CUDA work and PhD research pedigree, smaller contribution footprint (5 PRs) but architecture background is directly applicable

    Alex Zhang

    medium hireability
    • Alex Zhang — MIT PhD student (1st yr), GPU MODE community contributor (KernelBench, LeetGPU)
    • CUDA/Triton practitioner writing benchmark kernels
    • Junior, Cambridge MA

    Andre Slavescu

    medium hireability
    • University of Waterloo CS student — Canada-based, not US
    • Strong CUDA kernel portfolio: meTile GPU eDSL, intra-kernel-profiler (CUDA), Liger-Kernel contributor, and mHC.cu (DeepSeek manifold CUDA with 5-15x H100 speedup)
    • Impressive originality for a student but location is Canada
    • Hireability: MEDIUM — exceptional CUDA kernel breadth but non-US location may limit fit

    Aniruddha Nrusimha

    medium hireability

    PhD candidate @ MIT

    Previously: Undergrad student @ University of California Berkeley

    Boston, US

    • Aniruddha Nrusimha — MIT PhD student (2nd yr), quantization-aware pretraining (qat-pretrain repo with CUDA kernel work for quantized ops)
    • Junior-level, US-based Cambridge MA

    Chun-Mao Lai

    medium hireability
    • Software Engineer - Systems Infrastructure at LinkedIn in Sunnyvale CA (US); Liger-Kernel contributor with TransformerEngine and SGLang forks indicating Triton kernel integration in production infra
    • Account created 2020, CSE 234 (Winter 2025 grad ML systems course) fork signals recent new grad/student
    • US-based, junior seniority
    • Hireability: MEDIUM — US-based junior with Triton production exposure, modest contributions (16 commits) at LinkedIn infra

    Dylan Lim

    medium hireability
    • Stanford CS student, Palo Alto (US confirmed); 102 ThunderKittens commits and forked pplx-kernels (Perplexity GPU kernel library)
    • Junior/student status clear but fewer original CUDA repos compared to peers
    • Hireability: MEDIUM — TK contribution and pplx-kernels fork are positive signals but thinner original kernel authorship

    Lequn Chen

    medium hireability
    • Research Engineer at Perplexity AI (@ppl-ai) in Seattle WA (US); FlashInfer contributor 41 commits and CUTLASS CUDA fork — real kernel-level systems work
    • Title is Research Engineer (not Senior/Staff), account from 2012 suggests mid-level rather than strict junior
    • Hireability: MEDIUM — strong GPU systems background and US location, but tenure signals push toward mid-level

    Max Podkorytov

    medium hireability
    • Max Podkorytov — AMD GPU open-source contributor, ROCm/HIP kernel work (hipBLASLt, Composable Kernel). ~3 yrs exp, Seattle WA
    • AMD/ROCm kernel experience should transfer to CUDA

    Mayank Mishra

    medium hireability

    Graduate Student Researcher @ University of California, Berkeley

    Previously: Research Engineer-II @ MIT-IBM Watson AI Lab

    Berkeley, US

    • Mayank Mishra — UCB visiting PhD, accelerated-model-architectures (Triton Flash Attention kernels)
    • ML engineer with Triton kernel writing
    • Junior, US Bay Area

    Raayan Dhar

    medium hireability
    • FlashInfer CUDA contributor writing FP8 MoE per-channel quantization, BF16 GEMM backends with CUTLASS+cuDNN, and RoPE kernel extensions
    • CUTLASS fork active on GitHub
    • No employer, no senior title
    • US location unconfirmed (GitHub shows org-mode as location)
    • Hireability: MEDIUM — strong low-level CUDA kernel authorship in production inference repos, but US location unverified

    Ruihang Lai

    medium hireability
    • Ruihang Lai — CMU 4th-yr PhD, Apache TVM PMC member, Triton SwiGLU/flash-attn kernel contributor
    • Strong compiler+kernel background
    • Junior (still in PhD), Pittsburgh PA

    Shanli Xing

    medium hireability
    • Shanli Xing — CMU incoming PhD, FlashInfer lead contributor (CUDA kernel library for LLM serving)
    • Strong CUDA/Triton practitioner
    • Junior, Pittsburgh PA

    Tcc0403

    medium hireability
    • Liger-Kernel maintainer with 62 commits (rank 2), CuTe DSL forge fork, hands-on Triton kernel work
    • Account created December 2020, no senior title — early-career/maintainer-level
    • NOT US-based (Taipei, Taiwan)
    • Hireability: MEDIUM — strong Triton/CuTe kernel work but Taiwan-based

    Vaibhav Jindal

    medium hireability
    • Liger-Kernel contributor rank 3 (44 commits) with Triton kernel optimization for LLM training at LinkedIn, SF Bay Area
    • Title is Software Engineer (no senior signal), Liger-Kernel fork with Triton kernel PRs confirmed
    • Limited owned standalone CUDA/Triton repos beyond Liger-Kernel
    • Hireability: MEDIUM — solid Triton contribution record but no standalone kernel repos beyond Liger-Kernel PRs

    Wentao Ye

    medium hireability
    • Owns cuda_basic_tutorial (CUDA language confirmed), authored custom fast all2all CUDA kernel for vLLM, contributed nvfp4 quantization CUDA fixes; Boston MA, no employer, no senior title
    • Hireability: MEDIUM — hands-on CUDA kernel authorship across vLLM and personal repos, scope is more educational than production megakernel work

    Yilong Zhao

    medium hireability
    • Yilong Zhao — UCB PhD student, FlashInfer contributor + Atom (low-bit attention CUDA kernels)
    • GPU kernel practitioner
    • Junior, Berkeley CA

    yyihuang

    medium hireability
    • Pittsburgh PA (US), bio reads "GPU architect", no employer listed. 272 PRs to flashinfer-ai/flashinfer-bench covering fused MoE FP8 kernel definitions, TRT-LLM speculative decoding, GQA paged decode/prefill for B200
    • Forked DeepGEMM (FP8 CUDA) and Cute-Learning (CuTe CUDA examples)
    • Account created 2019
    • "GPU architect" title is ambiguous and could indicate a chip-design background
    • Hireability: MEDIUM — strong FlashInfer kernel contributions and US-based, but architect title and no employer warrant a closer look

    Zain Merchant

    medium hireability
    • Zain Merchant — USC student, Liger-Kernel contributor (9 commits, Triton kernel work)
    • Still in school but actively contributing GPU kernels
    • US-based

    Connor Holmes

    low hireability

    Researcher @ OpenAI

    Previously: Researcher @ Microsoft

    San Francisco, US


    Hanshi Sun

    low hireability

    Research Scientist @ ByteDance

    Previously: Teaching Assistant @ Carnegie Mellon University

    Bellevue, US

    • Hanshi Sun — ByteDance SWE, Triton-distributed contributor (parallel attention kernels)
    • Noted as currently China-based, though the profile lists Bellevue, US; US location unclear
    • Borderline junior

    Jiangyun Zhu

    low hireability
    • Current intern at Inferact (Beijing) fusing RoPE+KV cache kernels for MLA in vLLM, owns fa-fwd implementing Flash-Attention-3 forward kernel from scratch
    • Account created June 2021, clearly junior/student
    • NOT US-based (Beijing, China)
    • Hireability: LOW — technically impressive intern but China-based with no US signal

    Adnan Akhundov

    No note

    Aidan Do

    No note

    Cameron Shinn

    No note

    Dan Zimmerman

    No note

    David Berard

    No note

    Haozheng Fan

    No note

    Jez Ng

    No note

    Kyle Sayers

    No note

    Luka Govedic

    No note

    Markus Hoehnerbach

    No note

    Micah Williamson

    No note

    Michael Melesse

    No note

    Pengzhan Zhao

    No note

    Ted Zadouri

    No note

    Varun Sundar Rabindranath

    No note

    Yidi Wu

    No note

    Yi Qian

    No note

    Zheng Yan

    No note

    Runs

    #1 · completed · 51 qualified / 86 found · Apr 19, 3:25 PM