
Junior GPU kernel engineers in the US with CUDA/Triton experience

completed · 12 qualified · 1 run · Apr 21, 10:05 PM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi-1776809147
Parsed 3 topics · Junior · Engineer · US
Generating seed nodes: 0 proposed · explored 0 queries · 0/0 done
    3. Expanding nodes: queued
    4. Qualifying candidates: queued

    Qualified Candidates (12)


    Zhihao Zhang

    high hireability

Ph.D. student @ Carnegie Mellon University

    Previously: MS student @ Carnegie Mellon University

    Pittsburgh, US

    • PhD student at CMU Catalyst (Zhihao Jia's group) working directly on GPU kernels — pinned FlashInfer fork (CUDA kernel library for LLM serving), OSDI 2026 paper on Mirage Persistent Kernel (tensor program compiler/runtime), ASPLOS 2024 SpecInfer
    • Hands-on CUDA codebase contributor, US-based (Pittsburgh)
    • Hireability: HIGH — LinkedIn profile completely emptied Jan 2026 (all positions/education/experience cleared — strong graduation signal), OSDI 2026 paper indicates final-year PhD output, website states 'open to collaboration'

    Aditya Tomar

    medium hireability

Undergraduate Student @ UC Berkeley

    Previously: Researcher @ PSSG

    US

    • Strong GPU/HPC background: Gordon Bell Prize finalist for scalable LLM training on GPU supercomputers (SC'24), upcoming NVIDIA internship on Megatron-LM (May–Aug 2026), BAIR researcher under Keutzer/Gholami on LLM inference kernels
    • No explicit CUDA/Triton keyword but HPC systems work strongly implies kernel-level GPU expertise
    • US-based (Berkeley)
    • Hireability: MEDIUM — starting NVIDIA internship next month (May 2026), 3rd year undergrad, FTE not available until ~2027

    Anjiang Wei

    medium hireability

PhD student @ Stanford University

    San Francisco, US

    • LLM-for-GPU-kernels researcher at Stanford
    • Published on GPU kernel optimization (Astra multi-agent system, AccelOpt) and tensor mapping to spatial accelerators (AMOS, ISCA 2022, 83 citations)
    • Primary angle is using LLMs as kernel optimizers/code generators rather than direct CUDA/Triton authorship, but demonstrates deep GPU kernel knowledge
    • PhD student in SF/California, US
    • Hireability: MEDIUM — likely 5-6 years into PhD (papers from 2020), approaching graduation window, but still actively teaching (CS224N Winter 2025) with no explicit job search signals

    Gabriele Oliaro

    medium hireability

CS PhD Student @ Snowflake AI Research

    Previously: Research Scientist Intern @ Snowflake

    • Implemented CUDA kernels in FlexFlow (merged MoE Experts kernel PR, CUDA graph fusion, single-stream kernel PRs); personal softmax-argmax fused CUDA kernel repo; forked FlashInfer (CUDA kernel library for LLM serving) and KernelBench; paper on GPU kernel orchestration (Korch)
    • CMU PhD in ML Systems with parallel computing focus, h-index 10
    • US-based (Pittsburgh, PA)
    • Hireability: MEDIUM — 4th-year PhD at CMU, expected 2027 graduation, currently interning at Snowflake AI Research; not explicitly on the job market but approaching final phase
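For context on the "softmax-argmax fused kernel" pattern this candidate has a repo for: the fusion avoids launching two separate kernels by reusing the row-max pass (needed anyway for numerically stable softmax) to produce the argmax. A minimal NumPy sketch of the semantics — not the candidate's CUDA implementation, and the function name here is illustrative:

```python
import numpy as np

def fused_softmax_argmax(logits: np.ndarray):
    """Reference semantics for a fused softmax+argmax kernel.

    In a fused GPU kernel, one reduction pass over each row yields
    both the max value (the exp() stabilizer) and its index (the
    argmax, since softmax is monotonic), so the row is not re-read
    by a second kernel launch.
    """
    idx = logits.argmax(axis=-1)                      # argmax of logits == argmax of softmax
    shifted = logits - logits.max(axis=-1, keepdims=True)  # stabilize before exp()
    exp = np.exp(shifted)
    probs = exp / exp.sum(axis=-1, keepdims=True)
    return probs, idx

probs, idx = fused_softmax_argmax(np.array([[1.0, 3.0, 2.0]]))
```

Because softmax preserves ordering, the fused kernel never needs the normalized probabilities to locate the argmax — a common micro-optimization in LLM sampling paths.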

    Haocheng Xi

    medium hireability

MLsys Researcher @ University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    • 2nd-year PhD student at UC Berkeley (MLsys/Efficient ML group)
    • Strong CUDA kernel work: pinned repos include how-to-optim-algorithm-in-cuda and cuda-tensorcore-hgemm; published FP8/INT8 training kernels (COAT ICLR 2025, Jetfire ICML 2024) and sparse attention CUDA kernels (SpargeAttn)
    • Two NVIDIA internships (Feb-Aug 2024, May-Aug 2025)
    • Based in Berkeley, CA (US)
    • Hireability: MEDIUM — early 2nd-year PhD (started 2024), not nearing graduation; likely open to summer internships. No explicit open-to-work signals; most recent website activity was Jan 2026

    Luca Manolache

    medium hireability

Head Teaching Assistant @ UC Berkeley

    Previously: Teaching Assistant @ UC Berkeley

    San Francisco, US

    • Berkeley EECS junior researching ML systems and model efficiency
    • NeurIPS 2025 paper on Multipole Attention with custom kernel implementations achieving 4.5x speedup for sparse attention
    • Forked NCCL (multi-GPU comm library), has distributed-sparse-attention repo, contributes to SkyPilot
    • Head TA for OS (CS162)
    • Based in Berkeley/SF, US
    • Hireability: MEDIUM — junior undergrad (~2 years from graduation), likely open to internships; no explicit job-seeking signals

    Simon Guo

    medium hireability

PhD Student @ Stanford University

    Previously: Machine Learning Research Intern @ Cohere

    San Francisco, US

    • Stanford CS PhD student (Scaling Intelligence Lab, Prof. Azalia Mirhoseini) doing direct GPU kernel research — co-authored KernelBench ('Can LLMs Write Efficient GPU Kernels?', 23 citations, ICML 2025) and Kevin ('Multi-Turn RL for Generating CUDA Kernels', 2025)
    • Prior NVIDIA DRIVE internship and Apple GPU design internship
    • Palo Alto, CA
    • Hireability: MEDIUM — likely year 3-4 of PhD (started ~2022-23), no explicit job-seeking signals, but within prime transition window; research focus is highly industry-aligned

    Zhuoming Chen

    medium hireability

Ph.D. student @ Carnegie Mellon University

    Previously: Research Intern @ Meta

    New York, US

    • LLM inference systems researcher at CMU (Beidi Chen + Zhihao Jia labs) — speculative decoding (Sequoia, TriForce, MagicPIG ICLR 2025 Spotlight), efficient attention (MagicDec ICLR 2025)
    • GPU work primarily at the systems level using FlashInfer/PyTorch; no direct custom CUDA/Triton kernel code found in repos, but forks flash-linear-attention and flex-block-attn show familiarity with kernel-level libraries
    • H-index 9, Meta FAIR intern 2025
    • Based in New York, US
    • Hireability: MEDIUM — ~3 years into PhD (2023 start), CV update 67 days ago suggests career motion, likely 1-2 more years before graduation

    Alex L Zhang

    low hireability

Researcher @ Sakana AI

    Previously: Member of Technical Staff @ VantAI

    Tokyo, JP

    • Co-author of KernelBench (ICML 2025, evaluating LLMs for writing efficient GPU kernels) and KernelBot GPU code competition platform
    • Core team member of GPU MODE leaderboard (hosted NVIDIA/AMD/$100k competitions)
    • Triton FlashAttention2 custom mask implementation pinned on GitHub
    • Princeton CS '24, now first-year PhD at MIT CSAIL, based in the US
    • Hireability: LOW — just started PhD (~1.5 years in), far from completion; may be open to summer research internships but unlikely for full-time roles

    Hiva Mohammadzadeh

    low hireability

Machine Learning Engineer | Data Scientist @ IntuigenceAI

    Previously: Machine Learning Engineer @ Algoverse

    San Francisco, US

    • Co-authored KVQuant (NeurIPS 2024, 322 citations) which implemented custom CUDA kernels for KV cache quantization achieving ~1.7x speedups
    • Also co-authored Squeezed Attention and SPEED (speculative decoding), all focused on GPU-level LLM inference optimization
    • MS CS (AI, Systems) Stanford; BAIR researcher at UC Berkeley
    • Currently ML Engineer at IntuigenceAI in SF
    • Hireability: LOW — ~15 months in current role at IntuigenceAI, only minor LinkedIn headline rebrand detected (no open-to-work signals)

    William Hu

    low hireability

Member of Technical Staff @ Modal

    Previously: GPU Compiler Engineer @ Qualcomm

    San Francisco, US

    • Co-authored KernelBench (ICML 2025, 23 citations) — open-source benchmark for evaluating LLMs writing efficient GPU kernels across 250 CUDA/PyTorch ML workloads
    • Research expertise in HPC, Compilers, AI; MTS at Modal (GPU cloud infra), SF-based, no PhD
    • Hireability: LOW — only ~5 months tenure at Modal; personal site now lists Anthropic as current role (possible recent company move), no open-to-work signals detected

    Xiuyu Li

    low hireability

PhD candidate @ Berkeley AI Research (BAIR) at UC Berkeley

    Previously: Research Consultant @ Together AI

    San Francisco, US

    • Co-first author of TorchSparse (MLSys'22, MICRO'23), a high-performance CUDA library for sparse convolution on GPUs — directly on-target for CUDA kernel engineering
    • Fresh PhD from Berkeley BAIR (h-index 19), also contributed to SqueezeLLM (3-bit LLM quantization with 2.3x GPU speedup), SVDQuant, SparseLoRA, and SGLang serving framework
    • Hireability: LOW — recently joined xAI as Member of Technical Staff (LinkedIn transition detected ~Feb 2026, website confirms); likely still in first few months of new role

    Runs

    #1 · completed · 0 qualified / 0 found · Apr 21, 10:05 PM