
Junior CUDA / GPU kernel engineers in the US

completed · 9 qualified · 1 run · Apr 27, 7:20 PM · junior-cuda-gpu-kernel-engineers-in-the-us
Pipeline
    1. Parsed query
    2. Generating seed nodes — 0 proposed; explored 0 queries; 0/0 done
    3. Expanding nodes — queued
    4. Qualifying candidates — queued

    Qualified Candidates (8)


    William Hu

    high hireability

    Member of Technical Staff@Modal

    Previously: GPU Compiler Engineer @ Qualcomm

    San Francisco, US

    • Stanford MSCS student finishing degree, working as MTS intern at Modal on the Flash team (Flash Attention CUDA kernels)
    • First-authored KernelBench (ICML 2025, 23 citations) — a benchmark for LLM-written GPU kernels
    • Pinned repos include HipKittens (AMD GPU kernels, C++) and ThunderKittens fork (CUDA tile primitives)
    • Prior GPU compiler experience at Qualcomm
    • BS Math-CS from UCSD
    • SF-based
    • Hireability: HIGH — website still says 'MTS intern', MSCS likely finishing spring 2026, prime transition window

    Daiyaan Arfeen

    medium hireability

    PhD Student@Carnegie Mellon University

    Previously: Deep Learning Architecture Intern @ NVIDIA

    San Francisco, US

    • ML systems researcher at CMU PDL with strong GPU-adjacent work — PipeFill (GPU bubble utilization in LLM training, 2025), GraphPipe (DNN graph pipeline parallelism, 2024), SpecInfer (speculative inference on FlexFlow, 2024)
    • Focus is systems-level distributed training and inference rather than CUDA kernel engineering specifically
    • PhD student in US (h-index 5)
    • Hireability: MEDIUM — first paper from 2018 suggests ~5-6 years into the CMU PhD, likely approaching the graduation window; no explicit job-market signals surfaced by the pipeline

    Gabriele Oliaro

    medium hireability

    CS PhD Student@Snowflake AI Research

    Previously: Research Scientist Intern @ Snowflake

    • PhD-level GPU kernel work: published 'Optimal Kernel Orchestration for Tensor Programs with Korch' (ASPLOS 2024) and 'Event Tensor: A Unified Abstraction for Compiling Dynamic Megakernel' (MLSys 2026); also core contributor to FlexFlow (C++ distributed DNN training)
    • Harvard BS, Tsinghua MS, CMU PhD (4th year)
    • Based in Pittsburgh, PA (US)
    • Hireability: MEDIUM — expected graduation 2027 (~1 yr away), currently interning at Snowflake AI Research; prime recruiting window for a 2027 PhD, but not yet in final semester

    Helya Hosseini

    medium hireability

    Research Assistant and Teaching Assistant@University of Maryland

    Previously: Logic Design Teaching Assistant @ University of Tehran

    US

    • PhD student at UMD directly working on GPU kernel co-design — published 'Coruscant: Co-Designing GPU Kernel and Sparse Tensor Core to Advocate Unstructured Sparsity in Efficient LLM Inference' at MICRO 2025, plus 'Acamar' (MICRO 2024) on hardware accelerators
    • Research expertise explicitly in Computer Architecture and GPU optimization for sparse LLM inference
    • US-based (College Park, MD)
    • Hireability: MEDIUM — active PhD student with 2 consecutive MICRO papers, likely 2-4 years into program; no explicit job-seeking signals but within internship/transition window

    Rajat Vadiraj Dwaraknath

    medium hireability

    PhD Student@ICME, Stanford University

    Previously: Quantitative Research Intern @ Jump Trading

    San Francisco, US

    • Wrote custom CUDA kernels for FlashSketch (arXiv 2602.06071, 2026), a GPU-accelerated sparse sketching system with 1.7x speedup over SOTA — direct GPU kernel co-design work addressing irregular memory access patterns on GPU
    • ICME PhD student at Stanford in SF; also forked GPU MODE reference-kernels leaderboard
    • Hireability: MEDIUM — publication timeline (2021–2026) suggests ~5th year PhD, prime transition window, but no explicit open-to-work signals found

    Zhihao Zhang

    medium hireability

    Ph.D. student@Carnegie Mellon University

    Previously: MS student @ Carnegie Mellon University

    Pittsburgh, US

    • PhD student at CMU Catalyst (advised by Zhihao Jia) focused on GPU kernel systems for LLM serving
    • OSDI 2026 paper on 'Mirage Persistent Kernel' (compiler and runtime for mega-kernelizing tensor programs) is direct CUDA/GPU kernel work
    • Pinned GitHub repo: FlashInfer (CUDA kernel library for LLM serving)
    • US-based in Pittsburgh
    • Hireability: MEDIUM — still an active PhD student (2026 OSDI publication); LinkedIn profile went completely empty in Jan 2026 (possibly set to private), while his website says 'open to collaboration'; unclear how far along he is, but a CMU ML systems PhD typically runs 5-6 years

    Zhuoming Chen

    medium hireability

    Ph.D. student@Carnegie Mellon University

    Previously: Research Intern @ Meta

    New York, US

    • PhD student at CMU (Robotics Institute, advised by GPU-systems researchers Beidi Chen and Zhihao Jia) focused on GPU-efficient LLM inference — speculative decoding systems (SpecInfer 381 citations, Sequoia, TriForce, MagicDec) requiring deep CUDA/GPU optimization
    • H-index 9, strong OS/systems background from Tsinghua
    • Meta FAIR internship 2025
    • Located in Pittsburgh, PA
    • Hireability: MEDIUM — 3rd-year PhD (started 2023); CV updated 73 days ago shows career activity, but likely 2+ years from completion; research group is GPU-systems focused, a strong signal for kernel engineering fit

    Simon Guo

    low hireability

    PhD Student@Stanford University

    Previously: Machine Learning Research Intern @ Cohere

    San Francisco, US

    • Lead author on KernelBench (GPU kernels benchmark, 23 citations) and Kevin (multi-turn RL for CUDA kernel generation)
    • GPU design internships at Apple and NVIDIA DRIVE; active researcher at Stanford Scaling Intelligence Lab in Palo Alto
    • Directly on-target for the CUDA/GPU kernel query
    • Hireability: LOW — ~18 months into PhD program, early stage with no job search signals detected

    Runs

    #1 — completed · 0 qualified / 0 found · Apr 27, 7:20 PM