
Junior CUDA / GPU kernel engineers in the US

completed · 9 qualified · 1 run · Apr 27, 7:20 PM · junior-cuda-gpu-kernel-engineers-in-the-us
Pipeline
    1. Parsed query
    2. Generating seed nodes — 0 proposed; explored 0 queries; 0/0 done
    3. Expanding nodes — queued
    4. Qualifying candidates — queued

    Qualified Candidates (8)


    William Hu

    high hireability

    Member of Technical Staff@Modal

    Previously: GPU Compiler Engineer @ Qualcomm

    San Francisco, US

    • Stanford MSCS student finishing degree, working as MTS intern at Modal on the Flash team (Flash Attention CUDA kernels)
    • First-authored KernelBench (ICML 2025, 23 citations) — a benchmark for LLM-written GPU kernels
    • Pinned repos include HipKittens (AMD GPU kernels, C++) and ThunderKittens fork (CUDA tile primitives)
    • Prior GPU compiler experience at Qualcomm
    • BS Math-CS from UCSD
    • SF-based
    • Hireability: HIGH — website still says 'MTS intern', MSCS likely finishing spring 2026, prime transition window

    Daiyaan Arfeen

    medium hireability

    PhD Student@Carnegie Mellon University

    Previously: Deep Learning Architecture Intern @ NVIDIA

    San Francisco, US

    • ML systems researcher at CMU PDL with strong GPU-adjacent work — PipeFill (GPU bubble utilization in LLM training, 2025), GraphPipe (DNN graph pipeline parallelism, 2024), SpecInfer (speculative inference on FlexFlow, 2024)
    • Focus is systems-level distributed training and inference rather than CUDA kernel engineering specifically
    • PhD student in US (h-index 5)
    • Hireability: MEDIUM — first paper from 2018 suggests ~5-6 years into the CMU PhD, likely approaching the graduation window; no explicit job-market signals surfaced by the pipeline

    Gabriele Oliaro

    medium hireability

    CS PhD Student@Snowflake AI Research

    Previously: Research Scientist Intern @ Snowflake

    • PhD-level GPU kernel work: published 'Optimal Kernel Orchestration for Tensor Programs with Korch' (ASPLOS 2024) and 'Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel' (MLSys 2026); also core contributor to FlexFlow (C++ distributed DNN training)
    • Harvard BS, Tsinghua MS, CMU PhD (4th year)
    • Based in Pittsburgh, PA (US)
    • Hireability: MEDIUM — expected graduation 2027 (~1 yr away), currently interning at Snowflake AI Research; prime recruiting window for a 2027 PhD, but not yet in final semester

    Helya Hosseini

    medium hireability

    Research Assistant and Teaching Assistant@University of Maryland

    Previously: Logic Design Teaching Assistant @ University of Tehran

    US

    • PhD student at UMD directly working on GPU kernel co-design — published 'Coruscant: Co-Designing GPU Kernel and Sparse Tensor Core to Advocate Unstructured Sparsity in Efficient LLM Inference' at MICRO 2025, plus 'Acamar' (MICRO 2024) on hardware accelerators
    • Research expertise explicitly in Computer Architecture and GPU optimization for sparse LLM inference
    • US-based (College Park, MD)
    • Hireability: MEDIUM — active PhD student with 2 consecutive MICRO papers, likely 2-4 years into program; no explicit job-seeking signals but within internship/transition window

    Rajat Vadiraj Dwaraknath

    medium hireability

    PhD Student@ICME, Stanford University

    Previously: Quantitative Research Intern @ Jump Trading

    San Francisco, US

    • Wrote custom CUDA kernels for FlashSketch (arXiv 2602.06071, 2026), a GPU-accelerated sparse sketching system with 1.7x speedup over SOTA — direct GPU kernel co-design work addressing irregular memory access patterns on GPU
    • ICME PhD student at Stanford in SF; also forked GPU MODE reference-kernels leaderboard
    • Hireability: MEDIUM — publication timeline (2021–2026) suggests ~5th year PhD, prime transition window, but no explicit open-to-work signals found

    Zhihao Zhang

    medium hireability

    Ph.D. student@Carnegie Mellon University

    Previously: MS student @ Carnegie Mellon University

    Pittsburgh, US

    • PhD student at CMU Catalyst (advised by Zhihao Jia) focused on GPU kernel systems for LLM serving
    • OSDI 2026 paper on 'Mirage Persistent Kernel' (compiler and runtime for mega-kernelizing tensor programs) is direct CUDA/GPU kernel work
    • Pinned GitHub repo: FlashInfer (CUDA kernel library for LLM serving)
    • US-based in Pittsburgh
    • Hireability: MEDIUM — still an active PhD student (2026 OSDI publication); LinkedIn profile went completely empty in Jan 2026 (possibly set to private), while his website says 'open to collaboration'; unclear how far along he is, but a CMU ML systems PhD typically runs 5-6 years

    Zhuoming Chen

    medium hireability

    Ph.D. student@Carnegie Mellon University

    Previously: Research Intern @ Meta

    New York, US

    • PhD student at CMU (Robotics Institute, advised by GPU-systems researchers Beidi Chen and Zhihao Jia) focused on GPU-efficient LLM inference — speculative decoding systems (SpecInfer 381 citations, Sequoia, TriForce, MagicDec) requiring deep CUDA/GPU optimization
    • H-index 9, strong OS/systems background from Tsinghua
    • Meta FAIR internship 2025
    • Located in Pittsburgh, PA
    • Hireability: MEDIUM — 3rd-year PhD (started 2023); CV updated 73 days ago shows career activity, but likely 2+ years from completion; research group is GPU-systems focused, a strong signal for kernel engineering fit

    Simon Guo

    low hireability

    PhD Student@Stanford University

    Previously: Machine Learning Research Intern @ Cohere

    San Francisco, US

    • Lead author on KernelBench (GPU kernels benchmark, 23 citations) and Kevin (multi-turn RL for CUDA kernel generation)
    • GPU design internships at Apple and NVIDIA DRIVE; active researcher at Stanford Scaling Intelligence Lab in Palo Alto
    • Directly on-target for the CUDA/GPU kernel query
    • Hireability: LOW — ~18 months into PhD program, early stage with no job search signals detected

    Runs

    #1 — completed · 0 qualified / 0 found · Apr 27, 7:20 PM