Junior GPU kernel engineers in the US with CUDA/Triton experience

Completed · 7 qualified · 1 run · Apr 21, 6:29 PM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi-1776796187
Parsed 3 topics · Junior · Engineer · United States
Pipeline status:
    1. Generating seed nodes — 0 proposed
    2. Exploring queries — 0 explored, 0/0 done
    3. Expanding nodes — queued
    4. Qualifying candidates — queued

    Qualified Candidates (7)

    Luca Manolache

    high hireability

    Head Teaching Assistant @ UC Berkeley

    Previously: Teaching Assistant @ UC Berkeley

    San Francisco, US

    • Junior EECS student at UC Berkeley with direct CUDA kernel writing experience (custom GEMM repo in CUDA/C) and a fork of NVIDIA NCCL (multi-GPU collective comms)
    • ML systems and model efficiency researcher with SkyPilot contributions and a 2025 paper on efficient attention mechanisms
    • Based in SF/Berkeley, US
    • Hireability: HIGH — bio says 'Junior studying EECS', but active GitHub commits through Dec 2025 suggest he is likely finishing his senior year or graduating in spring 2026, a prime hiring window
    Boyuan Feng

    medium hireability

    SWE @ PyTorch

    Previously: Researcher @ Meta

    • PhD student at UCSB + SWE at PyTorch; co-authored FlexAttention (2025, 51 citations — fused GPU attention kernel), TC-GNN (GPU tensor cores, 81 citations), APNN-TC (CUDA code, Ampere tensor cores), GNNAdvisor (GPU GNN acceleration, 210 citations)
    • GitHub forks include tritonbench and cutlass (CUDA Templates)
    • US-based (Santa Barbara, CA)
    • Hireability: MEDIUM — advanced PhD student (likely year 6-7 at UCSB, nearing completion); current SWE role at PyTorch may be an internship/co-op; likely entering the full-time job market soon
    Haocheng Xi

    medium hireability

    MLsys Researcher @ University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    • Strong CUDA/ML-systems focus at UC Berkeley: pinned repos include how-to-optim-algorithm-in-cuda and cuda-tensorcore-hgemm, papers on INT4/INT8/FP8 training kernels (Training Transformers with 4-bit Integers, Jetfire) and SpargeAttn sparse attention
    • NVIDIA internship May-Aug 2025
    • Fits the junior profile
    • Hireability: MEDIUM — 2nd year PhD student (started 2024), not yet in final-year transition window, but NVIDIA internship and 'Welcome to contact me!' signal openness to industry engagement
    Hiva Mohammadzadeh

    medium hireability

    Machine Learning Engineer | Data Scientist @ IntuigenceAI

    Previously: Machine Learning Engineer @ Algoverse

    San Francisco, US

    • Co-authored KVQuant (NeurIPS 2024, 322 citations), which explicitly developed custom CUDA kernels for sub-4-bit KV cache quantization achieving 1.7x speedup on A100s
    • Also co-authored Squeezed Attention (ACL 2025) and SPEED (NeurIPS Workshop) — all efficient LLM inference at BAIR
    • MS CS (AI/Systems) at Stanford, no PhD
    • Based in SF Bay Area
    • Hireability: MEDIUM — just started Stanford MSCS (2025-2027), likely available for summer 2026 internship; still ML Engineer at IntuigenceAI startup with no explicit open-to-work signal
    Mark Saroufim

    medium hireability

    Software Engineer @ Meta

    Previously: ML Engineer @ Graphcore

    San Francisco, US

    • Co-founder of GPU MODE (the top GPU kernel education community with 100+ lectures), PyTorch engineer at Meta working on pytorch/ao (quantization kernels) and pytorch/helion (ML kernel DSL)
    • CUDA-obsessed GitHub bio, recent PRs in flash-linear-attention and BackendBench
    • Based in Bay Area, US
    • Papers: KernelBot (2025), TorchAO (2025)
    • Not strictly junior, but titled Software Engineer (not senior/staff)
    • Hireability: MEDIUM — active at Meta (commits Apr 2026), LinkedIn headline recently added MSL but no job-seeking signals
    Anne Ouyang

    low hireability

    Founder @ Standard Kernel

    Previously: Deep Learning Engineer @ NVIDIA

    San Francisco, US

    • Founded Standard Kernel (GPU kernel startup) and Stanford PhD student (on leave)
    • Former cuDNN engineer at NVIDIA with direct CUDA kernel experience; first author of KernelBench ('Can LLMs Write Efficient GPU Kernels?', 2025)
    • Based in SF
    • MIT BS+MEng
    • Hireability: LOW — explicitly building her own startup and personal website states 'not actively looking'; exceptional technical match but unlikely to be recruitable near-term
    Xiuyu Li

    low hireability

    PhD candidate @ Berkeley AI Research (BAIR) at UC Berkeley

    Previously: Research Consultant @ Together AI

    San Francisco, US

    • TorchSparse++ co-author — GPU sparse convolution framework with CUDA kernels (MICRO'23, MLSys'22, 67 citations)
    • Also contributes to SGLang (high-performance LLM serving) and SqueezeLLM quantization work
    • PhD at Berkeley (efficient deep learning), h-index 19, Bay Area
    • Hireability: LOW — LinkedIn pipeline signals show recently joined xAI as Member of Technical Staff (~2-3 months ago), brand new hire

    Runs

    #1 · completed · 0 qualified / 0 found · Apr 21, 6:29 PM