
Junior GPU kernel engineers in the US with CUDA/Triton experience

completed · 12 qualified · 1 run · Apr 21, 10:05 PM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi-1776809147
Parsed 3 topics · Junior · Engineer · US
Generating seed nodes: 0 proposed · explored 0 queries · 0/0 done
    3. Expanding nodes: queued
    4. Qualifying candidates: queued

    Qualified Candidates (12)


    Zhihao Zhang

    high hireability

Ph.D. student @ Carnegie Mellon University

    Previously: MS student @ Carnegie Mellon University

    Pittsburgh, US

    • PhD student at CMU Catalyst (Zhihao Jia's group) working directly on GPU kernels — pinned FlashInfer fork (CUDA kernel library for LLM serving), OSDI 2026 paper on Mirage Persistent Kernel (tensor program compiler/runtime), ASPLOS 2024 SpecInfer
    • Hands-on CUDA codebase contributor, US-based (Pittsburgh)
    • Hireability: HIGH — LinkedIn profile completely emptied Jan 2026 (all positions/education/experience cleared — strong graduation signal), OSDI 2026 paper indicates final-year PhD output, website states 'open to collaboration'

    Aditya Tomar

    medium hireability

Undergraduate Student @ UC Berkeley

    Previously: Researcher @ PSSG

    US

    • Strong GPU/HPC background: Gordon Bell Prize finalist for scalable LLM training on GPU supercomputers (SC'24), upcoming NVIDIA internship on Megatron-LM (May–Aug 2026), BAIR researcher under Keutzer/Gholami on LLM inference kernels
    • No explicit CUDA/Triton keyword but HPC systems work strongly implies kernel-level GPU expertise
    • US-based (Berkeley)
    • Hireability: MEDIUM — starting NVIDIA internship next month (May 2026), 3rd year undergrad, FTE not available until ~2027

    Anjiang Wei

    medium hireability

PhD student @ Stanford University

    San Francisco, US

    • LLM-for-GPU-kernels researcher at Stanford
    • Published on GPU kernel optimization (Astra multi-agent system, AccelOpt) and tensor mapping to spatial accelerators (AMOS, ISCA 2022, 83 citations)
    • Primary angle is using LLMs as kernel optimizers/code generators rather than direct CUDA/Triton authorship, but demonstrates deep GPU kernel knowledge
    • PhD student in SF/California, US
    • Hireability: MEDIUM — likely 5-6 years into PhD (papers from 2020), approaching graduation window, but still actively teaching (CS224N Winter 2025) with no explicit job search signals

    Gabriele Oliaro

    medium hireability

CS PhD Student @ Snowflake AI Research

    Previously: Research Scientist Intern @ Snowflake

    • Implemented CUDA kernels in FlexFlow (merged MoE Experts kernel PR, CUDA graph fusion, single-stream kernel PRs); personal softmax-argmax fused CUDA kernel repo; forked FlashInfer (CUDA kernel library for LLM serving) and KernelBench; paper on GPU kernel orchestration (Korch)
    • CMU PhD in ML Systems with parallel computing focus, h-index 10
    • US-based (Pittsburgh, PA)
    • Hireability: MEDIUM — 4th-year PhD at CMU, expected 2027 graduation, currently interning at Snowflake AI Research; not explicitly on the job market but approaching final phase
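For context on the "softmax-argmax fused kernel" pattern this candidate has a repo for: the fusion avoids launching two separate kernels by reusing the row-max pass (needed anyway for numerically stable softmax) to produce the argmax. A minimal NumPy sketch of the semantics — not the candidate's CUDA implementation, and the function name here is illustrative:

```python
import numpy as np

def fused_softmax_argmax(logits: np.ndarray):
    """Reference semantics for a fused softmax+argmax kernel.

    In a fused GPU kernel, one reduction pass over each row yields
    both the max value (the exp() stabilizer) and its index (the
    argmax, since softmax is monotonic), so the row is not re-read
    by a second kernel launch.
    """
    idx = logits.argmax(axis=-1)                      # argmax of logits == argmax of softmax
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stabilize before exp()
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs, idx

probs, idx = fused_softmax_argmax(np.array([[1.0, 3.0, 2.0]]))
```

Because softmax preserves ordering, the fused kernel never needs the normalized probabilities to locate the argmax — a common micro-optimization in LLM sampling paths.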

    Haocheng Xi

    medium hireability

MLsys Researcher @ University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    • 2nd-year PhD student at UC Berkeley (MLsys/Efficient ML group)
    • Strong CUDA kernel work: pinned repos include how-to-optim-algorithm-in-cuda and cuda-tensorcore-hgemm; published FP8/INT8 training kernels (COAT ICLR 2025, Jetfire ICML 2024) and sparse attention CUDA kernels (SpargeAttn)
    • Two NVIDIA internships (Feb-Aug 2024, May-Aug 2025)
    • Based in Berkeley, CA (US)
    • Hireability: MEDIUM — early 2nd-year PhD (started 2024), not nearing graduation; likely open to summer internships. No explicit open-to-work signals; most recent website activity was Jan 2026

    Luca Manolache

    medium hireability

Head Teaching Assistant @ UC Berkeley

    Previously: Teaching Assistant @ UC Berkeley

    San Francisco, US

    • Berkeley EECS junior researching ML systems and model efficiency
    • NeurIPS 2025 paper on Multipole Attention with custom kernel implementations achieving 4.5x speedup for sparse attention
    • Forked NCCL (multi-GPU comm library), has distributed-sparse-attention repo, contributes to SkyPilot
    • Head TA for OS (CS162)
    • Based in Berkeley/SF, US
    • Hireability: MEDIUM — junior undergrad (~2 years from graduation), likely open to internships; no explicit job-seeking signals

    Simon Guo

    medium hireability

PhD Student @ Stanford University

    Previously: Machine Learning Research Intern @ Cohere

    San Francisco, US

    • Stanford CS PhD student (Scaling Intelligence Lab, Prof. Azalia Mirhoseini) doing direct GPU kernel research — co-authored KernelBench ('Can LLMs Write Efficient GPU Kernels?', 23 citations, ICML 2025) and Kevin ('Multi-Turn RL for Generating CUDA Kernels', 2025)
    • Prior NVIDIA DRIVE internship and Apple GPU design internship
    • Palo Alto, CA
    • Hireability: MEDIUM — likely year 3-4 of PhD (started ~2022-23), no explicit job-seeking signals, but within prime transition window; research focus is highly industry-aligned

    Zhuoming Chen

    medium hireability

Ph.D. student @ Carnegie Mellon University

    Previously: Research Intern @ Meta

    New York, US

    • LLM inference systems researcher at CMU (Beidi Chen + Zhihao Jia labs) — speculative decoding (Sequoia, TriForce, MagicPIG ICLR 2025 Spotlight), efficient attention (MagicDec ICLR 2025)
    • GPU work primarily at the systems level using FlashInfer/PyTorch; no direct custom CUDA/Triton kernel code found in repos, but forks flash-linear-attention and flex-block-attn show familiarity with kernel-level libraries
    • H-index 9, Meta FAIR intern 2025
    • Based in New York, US
    • Hireability: MEDIUM — ~3 years into PhD (2023 start), CV update 67 days ago suggests career motion, likely 1-2 more years before graduation

    Alex L Zhang

    low hireability

Researcher @ Sakana AI

    Previously: Member of Technical Staff @ VantAI

    Tokyo, JP

    • Co-author of KernelBench (ICML 2025, evaluating LLMs for writing efficient GPU kernels) and KernelBot GPU code competition platform
    • Core team member of GPU MODE leaderboard (hosted NVIDIA/AMD/$100k competitions)
    • Triton FlashAttention2 custom mask implementation pinned on GitHub
    • Princeton CS '24, now first-year PhD at MIT CSAIL, based in the US
    • Hireability: LOW — just started PhD (~1.5 years in), far from completion; may be open to summer research internships but unlikely for full-time roles

    Hiva Mohammadzadeh

    low hireability

Machine Learning Engineer | Data Scientist @ IntuigenceAI

    Previously: Machine Learning Engineer @ Algoverse

    San Francisco, US

    • Co-authored KVQuant (NeurIPS 2024, 322 citations) which implemented custom CUDA kernels for KV cache quantization achieving ~1.7x speedups
    • Also co-authored Squeezed Attention and SPEED (speculative decoding), all focused on GPU-level LLM inference optimization
    • MS CS (AI, Systems) Stanford; BAIR researcher at UC Berkeley
    • Currently ML Engineer at IntuigenceAI in SF
    • Hireability: LOW — ~15 months in current role at IntuigenceAI, only minor LinkedIn headline rebrand detected (no open-to-work signals)

    William Hu

    low hireability

Member of Technical Staff @ Modal

    Previously: GPU Compiler Engineer @ Qualcomm

    San Francisco, US

    • Co-authored KernelBench (ICML 2025, 23 citations) — open-source benchmark for evaluating LLMs writing efficient GPU kernels across 250 CUDA/PyTorch ML workloads
    • Research expertise in HPC, Compilers, AI; MTS at Modal (GPU cloud infra), SF-based, no PhD
    • Hireability: LOW — only ~5 months tenure at Modal; personal site now lists Anthropic as current role (possible recent company move), no open-to-work signals detected

    Xiuyu Li

    low hireability

PhD candidate @ Berkeley AI Research (BAIR) at UC Berkeley

    Previously: Research Consultant @ Together AI

    San Francisco, US

    • Co-first author of TorchSparse (MLSys'22, MICRO'23), a high-performance CUDA library for sparse convolution on GPUs — directly on-target for CUDA kernel engineering
    • Fresh PhD from Berkeley BAIR (h-index 19), also contributed to SqueezeLLM (3-bit LLM quantization with 2.3x GPU speedup), SVDQuant, SparseLoRA, and SGLang serving framework
    • Hireability: LOW — recently joined xAI as Member of Technical Staff (LinkedIn transition detected ~Feb 2026, website confirms); likely still in first few months of new role

    Runs

    #1 · completed · 0 qualified / 0 found · Apr 21, 10:05 PM