Junior GPU kernel engineers in the US with CUDA and Triton experience

completed · 51 qualified · 1 run · Apr 19, 3:24 PM · junior-gpu-kernel-engineers-in-the-us-with-cuda-and-triton-e

Parsed: 2 topics · Junior · Engineer · US

Qualified Candidates (51)

Aaryan Singhal

high hireability

Stanford CS student in Palo Alto; owns CUDA repo targeting H100 kernel optimization and CUDABenchmarks, plus 321 commits to ThunderKittens (TK) and TKConvs
Hands-on GPU kernel work evidenced across multiple CUDA repos
Hireability: HIGH — active Stanford student with H100 CUDA kernel experience and substantial ThunderKittens contribution depth

Alex Hu

high hireability

MIT student (alexhu@mit.edu), Cambridge MA (US)
Owns cudalol CUDA kernels repo and jax_quant_llm, plus ThunderKittens contributor; student status with real CUDA work confirmed
Hireability: HIGH — MIT student with dedicated CUDA kernel repo and ThunderKittens + quantization work

Austin Liu

high hireability

Austin Liu — UCI MS student (2024 grad), Liger-Kernel contributor (18 commits, FP8 kernels in Triton)
Active GPU kernel practitioner
Junior-level, Orange County CA

Brian K. Ryu

high hireability

Brian K. Ryu, NVIDIA SWE (~2 yrs): FlashInfer MoE Blackwell CUDA kernels (GEMM + attention on H100/B200)
Direct CUDA kernel engineering at NVIDIA on cutting-edge hardware
Strong junior, US-based

enbao

high hireability

Stanford student (@Stanford bio, enbao.me) — US-based via Stanford affiliation
Owns kernels repo (custom SGEMM/SGEMV in CUDA) and tk-old-blackwell (Blackwell TK fork, CUDA), plus ThunderKittens contributor
Direct evidence of low-level CUDA kernel authorship at cutting-edge (Blackwell)
Hireability: HIGH — Stanford student writing SGEMM/SGEMV from scratch and porting TK to Blackwell is a strong junior CUDA kernel signal

Kyle Wang

high hireability

Kyle Wang — AMD GPU SW engineer, 73 Triton PRs targeting GFX1250 (Strix Halo chip), low-level kernel contributor. ~2-3 yrs exp
Strong CUDA/Triton depth via AMD ecosystem, US-based

Matthew Bonanni

high hireability

Stanford PhD completed 2025, now MLE at Red Hat in Boston MA — bio explicitly lists HPC, C++, CUDA, LLM inference. 165 PRs to vllm-project including FLASH_ATTN_MLA_SPARSE backend (FA4 DSA on Blackwell), FlashMLA CUDA fork, MLA sparse CUTLASS kernel work
Personal LLM.cu repo confirms standalone kernel authoring
New-grad (PhD 2025) with no senior title, US-based
Hireability: HIGH — PhD new grad with hands-on CUDA kernel work at vLLM maintainer level

Shivam Sahni

high hireability

Shivam Sahni — MS UCSD 2024, Together AI SWE
Liger-Kernel top contributor (40+ commits, RoPE/activation kernels in Triton)
Strong junior with industry traction, US-based

Steven Shimizu

high hireability

Steven Shimizu — US-based (Pacific timezone), Liger-Kernel contributor (23 commits, FP8/RoPE kernels in Triton)
Junior practitioner with strong hands-on Triton evidence

Stuart Sul

high hireability

ThunderKittens rank-2 contributor (524 commits) shipping megakernels, ring attention, and MXFP8 training kernels; owns gpu-experiments and deltaspill (CUDA register spill debugger)
Stanford PhD student + ML Researcher at Cursor, Palo Alto CA, no senior title
Hireability: HIGH — rare combination of production-grade CUDA megakernel work and active PhD-level research at a top AI lab

Thien Tran

high hireability

Owns gn-kernels (CUDA, benchmarked on H200/B200/RTX5090) with CUTLASS INT4/FP8 matmul and Triton attention kernels, plus gpu-mode-kernels repo and deep GPU architecture blog on TMA swizzling and tcgen05
No company, no senior title; independent practitioner
NOT US-based (Singapore) — strong enough to qualify despite location
Hireability: HIGH — rare combination of production-grade CUDA+Triton kernel authorship with benchmarks on latest Blackwell/Hopper hardware

Wenxuan Tan

high hireability

Wenxuan Tan — UW Madison, flash-attention FA3 contributor (CUDA internals, pingpong scheduling)
Strong CUDA depth on transformer kernels
Junior, US-based

Aditya K Kamath

medium hireability

UW PhD student (computer systems/architecture), Seattle WA, 5 PRs to FlashInfer on GPU kernel memory hierarchy
Owns Legion-gcs and InvariantBitPacking in CUDA
No senior title
Hireability: MEDIUM — solid systems-focused CUDA work and PhD research pedigree, smaller contribution footprint (5 PRs) but architecture background is directly applicable

Alex Zhang

medium hireability

Alex Zhang — MIT PhD student (1st yr), GPU MODE community contributor (KernelBench, LeetGPU)
CUDA/Triton practitioner writing benchmark kernels
Junior, Cambridge MA

Andre Slavescu

medium hireability

University of Waterloo CS student — Canada-based, not US
Strong CUDA kernel portfolio: meTile GPU eDSL, intra-kernel-profiler (CUDA), Liger-Kernel contributor, and mHC.cu (DeepSeek manifold CUDA with 5-15x H100 speedup)
Impressive originality for a student but location is Canada
Hireability: MEDIUM — exceptional CUDA kernel breadth but non-US location may limit fit

Aniruddha Nrusimha

medium hireability

PhD candidate @ MIT

Previously: Undergrad student @ University of California Berkeley

Boston, US

Aniruddha Nrusimha — MIT PhD student (2nd yr), quantization-aware pretraining (qat-pretrain repo with CUDA kernel work for quantized ops)
Junior-level, US-based Cambridge MA

Chun-Mao Lai

medium hireability

Software Engineer - Systems Infrastructure at LinkedIn in Sunnyvale CA (US); Liger-Kernel contributor with TransformerEngine and SGLang forks indicating Triton kernel integration in production infra
Account created 2020, CSE 234 (Winter 2025 grad ML systems course) fork signals recent new grad/student
US-based, junior seniority
Hireability: MEDIUM — US-based junior with Triton production exposure, modest contributions (16 commits) at LinkedIn infra

Dylan Lim

medium hireability

Stanford CS student, Palo Alto (US confirmed); 102 ThunderKittens commits and forked pplx-kernels (Perplexity GPU kernel library)
Junior/student status clear but fewer original CUDA repos compared to peers
Hireability: MEDIUM — TK contribution and pplx-kernels fork are positive signals but thinner original kernel authorship

Lequn Chen

medium hireability

Research Engineer at Perplexity AI (@ppl-ai) in Seattle WA (US); FlashInfer contributor 41 commits and CUTLASS CUDA fork — real kernel-level systems work
Title is Research Engineer (not Senior/Staff), account from 2012 suggests mid-level rather than strict junior
Hireability: MEDIUM — strong GPU systems background and US location, but tenure signals push toward mid-level

Max Podkorytov

medium hireability

Max Podkorytov — AMD GPU open-source contributor, ROCm/HIP kernel work (hipBLASLt, Composable Kernel). ~3 yrs exp, Seattle WA
CUDA transferable from AMD/ROCm background

Mayank Mishra

medium hireability

Graduate Student Researcher @ University of California, Berkeley

Previously: Research Engineer-II @ MIT-IBM Watson AI Lab

Berkeley, US

Mayank Mishra — UCB visiting PhD, accelerated-model-architectures (Triton Flash Attention kernels)
ML engineer with Triton kernel writing
Junior, US Bay Area

Raayan Dhar

medium hireability

FlashInfer CUDA contributor writing FP8 MoE per-channel quantization, BF16 GEMM backends with CUTLASS+cuDNN, and RoPE kernel extensions
CUTLASS fork active on GitHub
No employer, no senior title
US location unconfirmed (GitHub shows org-mode as location)
Hireability: MEDIUM — strong low-level CUDA kernel authorship in production inference repos, but US location unverified

Ruihang Lai

medium hireability

Ruihang Lai — CMU 4th-yr PhD, Apache TVM PMC member, Triton SwiGLU/flash-attn kernel contributor
Strong compiler+kernel background
Junior (still in PhD), Pittsburgh PA

Shanli Xing

medium hireability

Shanli Xing — CMU incoming PhD, FlashInfer lead contributor (CUDA kernel library for LLM serving)
Strong CUDA/Triton practitioner
Junior, Pittsburgh PA

Tcc0403

medium hireability

Liger-Kernel maintainer with 62 commits (rank 2), CuTe DSL forge fork, hands-on Triton kernel work
Account created December 2020, no senior title — early-career/maintainer-level
NOT US-based (Taipei, Taiwan)
Hireability: MEDIUM — strong Triton/CuTe kernel work but Taiwan-based

Vaibhav Jindal

medium hireability

Liger-Kernel contributor rank 3 (44 commits) with Triton kernel optimization for LLM training at LinkedIn, SF Bay Area
Title is Software Engineer (no senior signal), Liger-Kernel fork with Triton kernel PRs confirmed
Limited owned standalone CUDA/Triton repos beyond Liger-Kernel
Hireability: MEDIUM — solid Triton contribution record but no standalone kernel repos beyond Liger-Kernel PRs

Wentao Ye

medium hireability

Owns cuda_basic_tutorial (CUDA language confirmed), authored custom fast all2all CUDA kernel for vLLM, contributed nvfp4 quantization CUDA fixes; Boston MA, no employer, no senior title
Hireability: MEDIUM — hands-on CUDA kernel authorship across vLLM and personal repos, scope is more educational than production megakernel work

Yilong Zhao

medium hireability

Yilong Zhao — UCB PhD student, FlashInfer contributor + Atom (low-bit attention CUDA kernels)
GPU kernel practitioner
Junior, Berkeley CA

yyihuang

medium hireability

Pittsburgh PA (US), bio GPU architect, no employer listed. 272 PRs to flashinfer-ai/flashinfer-bench covering fused MoE FP8 kernel definitions, TRT-LLM speculative decoding, GQA paged decode/prefill for B200
Forked DeepGEMM (FP8 CUDA) and Cute-Learning (CuTe CUDA examples)
Account created 2019
GPU architect title is ambiguous — could indicate chip-design background
Hireability: MEDIUM — strong FlashInfer kernel contributions and US-based, but architect title and no employer warrant a closer look

Zain Merchant

medium hireability

Zain Merchant — USC student, Liger-Kernel contributor (9 commits, Triton kernel work)
Still in school but actively contributing GPU kernels
US-based

Connor Holmes

low hireability

Researcher @ OpenAI

Previously: Researcher @ Microsoft

San Francisco, US

Connor Holmes test

Hanshi Sun

low hireability

Research Scientist @ ByteDance

Previously: Teaching Assistant @ Carnegie Mellon University

Bellevue, US

Hanshi Sun — ByteDance SWE, Triton-distributed contributor (parallel attention kernels)
China-based currently
US location unclear
Borderline junior

Jiangyun Zhu

low hireability

Current intern at Inferact (Beijing) fusing RoPE+KV cache kernels for MLA in vLLM, owns fa-fwd implementing Flash-Attention-3 forward kernel from scratch
Account created June 2021, clearly junior/student
NOT US-based (Beijing, China)
Hireability: LOW — technically impressive intern but China-based with no US signal

Adnan Akhundov

No note

Aidan Do

No note

Cameron Shinn

No note

Dan Zimmerman

No note

David Berard

No note

Haozheng Fan

No note

Jez Ng

No note

Kyle Sayers

No note

Luka Govedic

No note

Markus Hoehnerbach

No note

Micah Williamson

No note

Michael Melesse

No note

Pengzhan Zhao

No note

Ted Zadouri

No note

Varun Sundar Rabindranath

No note

Yidi Wu

No note

Yi Qian

No note

Zheng Yan

No note

Runs

#1 · completed · 51 qualified / 86 found · Apr 19, 3:25 PM