Junior GPU kernel engineers in the US with CUDA/Triton experience

Completed · 7 qualified · 1 run · Apr 21, 6:29 PM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi-1776796187
Parsed 3 topics · Junior · Engineer · United States
Pipeline status:
    1. Generating seed nodes — 0 proposed
    2. Exploring queries — 0 explored, 0/0 done
    3. Expanding nodes — queued
    4. Qualifying candidates — queued

    Qualified Candidates (7)

    Luca Manolache

    high hireability

    Head Teaching Assistant @ UC Berkeley

    Previously: Teaching Assistant @ UC Berkeley

    San Francisco, US

    • Junior EECS student at UC Berkeley with direct CUDA kernel writing experience (custom GEMM repo in CUDA/C) and a fork of NVIDIA NCCL (multi-GPU collective comms)
    • ML systems and model efficiency researcher with SkyPilot contributions and a 2025 paper on efficient attention mechanisms
    • Based in SF/Berkeley, US
    • Hireability: HIGH — bio says 'Junior studying EECS', but active GitHub commits through Dec 2025 suggest he is likely finishing his senior year or graduating in spring 2026, a prime hiring window
    Boyuan Feng

    medium hireability

    SWE @ PyTorch

    Previously: Researcher @ Meta

    • PhD student at UCSB + SWE at PyTorch; co-authored FlexAttention (2025, 51 citations — fused GPU attention kernel), TC-GNN (GPU tensor cores, 81 citations), APNN-TC (CUDA code, Ampere tensor cores), GNNAdvisor (GPU GNN acceleration, 210 citations)
    • GitHub forks include tritonbench and cutlass (CUDA Templates)
    • US-based (Santa Barbara, CA)
    • Hireability: MEDIUM — advanced PhD student (likely year 6-7 at UCSB, nearing completion); current SWE role at PyTorch may be an internship/co-op; likely entering the full-time job market soon
    Haocheng Xi

    medium hireability

    MLsys Researcher @ University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    • Strong CUDA/ML-systems focus at UC Berkeley: pinned repos include how-to-optim-algorithm-in-cuda and cuda-tensorcore-hgemm, papers on INT4/INT8/FP8 training kernels (Training Transformers with 4-bit Integers, Jetfire) and SpargeAttn sparse attention
    • NVIDIA internship May-Aug 2025
    • Fits the junior profile
    • Hireability: MEDIUM — 2nd year PhD student (started 2024), not yet in final-year transition window, but NVIDIA internship and 'Welcome to contact me!' signal openness to industry engagement
    Hiva Mohammadzadeh

    medium hireability

    Machine Learning Engineer | Data Scientist @ IntuigenceAI

    Previously: Machine Learning Engineer @ Algoverse

    San Francisco, US

    • Co-authored KVQuant (NeurIPS 2024, 322 citations), which explicitly developed custom CUDA kernels for sub-4-bit KV cache quantization achieving 1.7x speedup on A100s
    • Also co-authored Squeezed Attention (ACL 2025) and SPEED (NeurIPS Workshop) — all efficient LLM inference at BAIR
    • MS CS (AI/Systems) at Stanford, no PhD
    • Based in SF Bay Area
    • Hireability: MEDIUM — just started Stanford MSCS (2025-2027), likely available for summer 2026 internship; still ML Engineer at IntuigenceAI startup with no explicit open-to-work signal
    Mark Saroufim

    medium hireability

    Software Engineer @ Meta

    Previously: ML Engineer @ Graphcore

    San Francisco, US

    • Co-founder of GPU MODE (the top GPU kernel education community with 100+ lectures), PyTorch engineer at Meta working on pytorch/ao (quantization kernels) and pytorch/helion (ML kernel DSL)
    • CUDA-obsessed GitHub bio, recent PRs in flash-linear-attention and BackendBench
    • Based in Bay Area, US
    • Papers: KernelBot (2025), TorchAO (2025)
    • Not strictly junior, but titled Software Engineer (not senior/staff)
    • Hireability: MEDIUM — active at Meta (commits Apr 2026), LinkedIn headline recently added MSL but no job-seeking signals
    Anne Ouyang

    low hireability

    Founder @ Standard Kernel

    Previously: Deep Learning Engineer @ NVIDIA

    San Francisco, US

    • Founded Standard Kernel (GPU kernel startup) and Stanford PhD student (on leave)
    • Former cuDNN engineer at NVIDIA with direct CUDA kernel experience; first author of KernelBench ('Can LLMs Write Efficient GPU Kernels?', 2025)
    • Based in SF
    • MIT BS+MEng
    • Hireability: LOW — explicitly building her own startup and personal website states 'not actively looking'; exceptional technical match but unlikely to be recruitable near-term
    Xiuyu Li

    low hireability

    PhD candidate @ Berkeley AI Research (BAIR) at UC Berkeley

    Previously: Research Consultant @ Together AI

    San Francisco, US

    • TorchSparse++ co-author — GPU sparse convolution framework with CUDA kernels (MICRO'23, MLSys'22, 67 citations)
    • Also contributes to SGLang (high-performance LLM serving) and SqueezeLLM quantization work
    • PhD at Berkeley (efficient deep learning), h-index 19, Bay Area
    • Hireability: LOW — LinkedIn pipeline signals show recently joined xAI as Member of Technical Staff (~2-3 months ago), brand new hire

    Runs

    #1 · completed · 0 qualified / 0 found · Apr 21, 6:29 PM