Stanford CS student in Palo Alto; owns CUDA repo targeting H100 kernel optimization and CUDABenchmarks, plus 321 commits to ThunderKittens (TK) and TKConvs
Hands-on GPU kernel work evidenced across multiple CUDA repos
Hireability: HIGH — active Stanford student with H100 CUDA kernel experience and substantial ThunderKittens contribution depth
AH
Alex Hu
high hireability
MIT student (alexhu@mit.edu), Cambridge MA (US)
Owns cudalol CUDA kernels repo and jax_quant_llm, plus ThunderKittens contributor; student status with real CUDA work confirmed
Hireability: HIGH — MIT student with dedicated CUDA kernel repo and ThunderKittens + quantization work
AL
Austin Liu
high hireability
Austin Liu — UCI MS student (2024 grad), Liger-Kernel contributor (18 commits, FP8 kernels in Triton)
Active GPU kernel practitioner
Junior-level, Orange County CA
BR
Brian K. Ryu
high hireability
Brian K. Ryu — NVIDIA SWE (~2 yrs), FlashInfer MoE Blackwell CUDA kernels (GEMM + attention on H100/B200)
Direct CUDA kernel engineering at NVIDIA on cutting-edge hardware
Strong junior, US-based
EN
enbao
high hireability
Stanford student (@Stanford bio, enbao.me) — US-based via Stanford affiliation
Owns kernels repo (custom SGEMM/SGEMV in CUDA) and tk-old-blackwell (Blackwell TK fork, CUDA), plus ThunderKittens contributor
Direct evidence of low-level CUDA kernel authorship at cutting-edge (Blackwell)
Hireability: HIGH — Stanford student writing SGEMM/SGEMV from scratch and porting TK to Blackwell is a strong junior CUDA kernel signal
Strong CUDA/Triton depth via AMD ecosystem, US-based
MB
Matthew Bonanni
high hireability
Stanford PhD completed 2025, now MLE at Red Hat in Boston MA — bio explicitly lists HPC, C++, CUDA, LLM inference. 165 PRs to vllm-project including FLASH_ATTN_MLA_SPARSE backend (FA4 DSA on Blackwell), FlashMLA CUDA fork, MLA sparse CUTLASS kernel work
Personal LLM.cu repo confirms standalone kernel authoring
New-grad (PhD 2025) with no senior title, US-based
Hireability: HIGH — PhD new grad with hands-on CUDA kernel work at vLLM maintainer level
SS
Shivam Sahni
high hireability
Shivam Sahni — MS UCSD 2024, Together AI SWE
Liger-Kernel top contributor (40+ commits, RoPE/activation kernels in Triton)
Strong junior with industry traction, US-based
SS
Steven Shimizu
high hireability
Steven Shimizu — US-based (Pacific timezone), Liger-Kernel contributor (23 commits, FP8/RoPE kernels in Triton)
Junior practitioner with strong hands-on Triton evidence
SS
Stuart Sul
high hireability
ThunderKittens rank-2 contributor (524 commits) shipping megakernels, ring attention, and MXFP8 training kernels; owns gpu-experiments and deltaspill (CUDA register spill debugger)
Stanford PhD student + ML Researcher at Cursor, Palo Alto CA, no senior title
Hireability: HIGH — rare combination of production-grade CUDA megakernel work and active PhD-level research at a top AI lab
TT
Thien Tran
high hireability
Owns gn-kernels (CUDA, benchmarked on H200/B200/RTX5090) with CUTLASS INT4/FP8 matmul and Triton attention kernels, plus gpu-mode-kernels repo and deep GPU architecture blog on TMA swizzling and tcgen05
No company, no senior title; independent practitioner
NOT US-based (Singapore) — strong enough to qualify despite location
Hireability: HIGH — rare combination of production-grade CUDA+Triton kernel authorship with benchmarks on latest Blackwell/Hopper hardware
UW PhD student (computer systems/architecture), Seattle WA, 5 PRs to FlashInfer on GPU kernel memory hierarchy
Owns Legion-gcs and InvariantBitPacking in CUDA
No senior title
Hireability: MEDIUM — solid systems-focused CUDA work and PhD research pedigree, smaller contribution footprint (5 PRs) but architecture background is directly applicable
AZ
Alex Zhang
medium hireability
Alex Zhang — MIT PhD student (1st yr), GPU MODE community contributor (KernelBench, LeetGPU)
University of Waterloo CS student — Canada-based, not US
Strong CUDA kernel portfolio: meTile GPU eDSL, intra-kernel-profiler (CUDA), Liger-Kernel contributor, and mHC.cu (DeepSeek manifold CUDA with 5-15x H100 speedup)
Impressive originality for a student but location is Canada
Hireability: MEDIUM — exceptional CUDA kernel breadth but non-US location may limit fit
AN
Aniruddha Nrusimha
medium hireability
PhD candidate @ MIT
Previously: Undergrad student @ University of California Berkeley
Boston, US
Aniruddha Nrusimha — MIT PhD student (2nd yr), quantization-aware pretraining (qat-pretrain repo with CUDA kernel work for quantized ops)
Junior-level, US-based Cambridge MA
CL
Chun-Mao Lai
medium hireability
Software Engineer - Systems Infrastructure at LinkedIn in Sunnyvale CA (US); Liger-Kernel contributor with TransformerEngine and SGLang forks indicating Triton kernel integration in production infra
Account created 2020, CSE 234 (Winter 2025 grad ML systems course) fork signals recent new grad/student
US-based, junior seniority
Hireability: MEDIUM — US-based junior in LinkedIn infra with production Triton exposure, but modest contributions (16 commits)
Junior/student status clear but fewer original CUDA repos compared to peers
Hireability: MEDIUM — TK contribution and pplx-kernels fork are positive signals but thinner original kernel authorship
LC
Lequn Chen
medium hireability
Research Engineer at Perplexity AI (@ppl-ai) in Seattle WA (US); FlashInfer contributor 41 commits and CUTLASS CUDA fork — real kernel-level systems work
Title is Research Engineer (not Senior/Staff), account from 2012 suggests mid-level rather than strict junior
Hireability: MEDIUM — strong GPU systems background and US location, but tenure signals push toward mid-level
MP
Max Podkorytov
medium hireability
Max Podkorytov — AMD GPU open-source contributor, ROCm/HIP kernel work (hipBLASLt, Composable Kernel). ~3 yrs exp, Seattle WA
CUDA transferable from AMD/ROCm background
MM
Mayank Mishra
medium hireability
Graduate Student Researcher @ University of California, Berkeley
Previously: Research Engineer-II @ MIT-IBM Watson AI Lab
Hireability: MEDIUM — solid Triton contribution record but no standalone kernel repos beyond Liger-Kernel PRs
WY
Wentao Ye
medium hireability
Owns cuda_basic_tutorial (CUDA language confirmed), authored custom fast all2all CUDA kernel for vLLM, contributed nvfp4 quantization CUDA fixes; Boston MA, no employer, no senior title
Hireability: MEDIUM — hands-on CUDA kernel authorship across vLLM and personal repos, but scope is more educational than production megakernel work
YZ
Yilong Zhao
medium hireability
Yilong Zhao — UCB PhD student, FlashInfer contributor + Atom (low-bit attention CUDA kernels)
GPU kernel practitioner
Junior, Berkeley CA
YY
yyihuang
medium hireability
Pittsburgh PA (US), bio reads "GPU architect", no employer listed. 272 PRs to flashinfer-ai/flashinfer-bench covering fused MoE FP8 kernel definitions, TRT-LLM speculative decoding, GQA paged decode/prefill for B200
Forked DeepGEMM (FP8 CUDA) and Cute-Learning (CuTe CUDA examples)
Account created 2019
GPU architect title is ambiguous — could indicate chip-design background
Hireability: MEDIUM — strong FlashInfer kernel contributions and US-based, but architect title and no employer warrant a closer look
Still in school but actively contributing GPU kernels
US-based
CH
Connor Holmes
low hireability
Researcher @ OpenAI
Previously: Researcher @ Microsoft
San Francisco, US
HS
Hanshi Sun
low hireability
Research Scientist @ ByteDance
Previously: Teaching Assistant @ Carnegie Mellon University
Bellevue, US
Hanshi Sun — ByteDance SWE, Triton-distributed contributor (parallel attention kernels)
China-based currently
US location unclear
Borderline junior
JZ
Jiangyun Zhu
low hireability
Current intern at Inferact (Beijing) fusing RoPE+KV cache kernels for MLA in vLLM, owns fa-fwd implementing Flash-Attention-3 forward kernel from scratch
Account created June 2021, clearly junior/student
NOT US-based (Beijing, China)
Hireability: LOW — technically impressive intern but China-based with no US signal