
Junior GPU kernel engineers in the US with CUDA/Triton experience

completed · 4 qualified · 1 run · Apr 21, 9:38 PM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi-1776807525
Parsed 3 topics · Junior · Engineer · United States
    1. Generating seed nodes · 0 proposed
    2. Explored 0 queries · 0/0 done
    3. Expanding nodes · queued
    4. Qualifying candidates · queued

    Qualified Candidates (4)


    Haocheng Xi

    medium hireability

    MLSys Researcher @ University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    • Strong CUDA/ML-systems PhD student at UC Berkeley (2nd year, BS Tsinghua Yao Class)
    • Pinned repos include cuda-tensorcore-hgemm and how-to-optim-algorithm-in-cuda (both CUDA); papers on SpargeAttn (sparse attention GPU kernels), FP8 training (COAT, Jetfire), and INT4 quantization
    • NVIDIA Research Intern May–Aug 2025 on efficient training/inference and FP8 workflows
    • Berkeley, CA (US)
    • Hireability: MEDIUM — active 2nd-year PhD student, unlikely to leave for full-time before ~2028; strong internship candidate per NVIDIA precedent. Website very actively updated (2026-04-20) with new publications, no explicit open-to-work signal

    Hiva Mohammadzadeh

    medium hireability

    Machine Learning Engineer | Data Scientist @ IntuigenceAI

    Previously: Machine Learning Engineer @ Algoverse

    San Francisco, US

    • Co-author on KVQuant (NeurIPS 2024, 322 citations) which develops custom CUDA kernels for KV cache quantization (~1.7x speedup over fp16 matmul), and Squeezed Attention (ACL 2025) implementing sparse FlashAttention kernels (4x+ speedups)
    • Berkeley EECS grad (2023), now in Stanford MS CS (AI, Systems) 2025-2027; ML Engineer at IntuigenceAI startup in SF
    • Hireability: MEDIUM — currently mid-program at Stanford MS (2025-2027) and employed at startup, no open-to-work signals

    Luca Manolache

    medium hireability

    Head Teaching Assistant @ UC Berkeley

    Previously: Teaching Assistant @ UC Berkeley

    San Francisco, US

    • UC Berkeley EECS junior researching ML systems and model efficiency; co-authored NeurIPS 2025 paper 'Multipole Attention for Efficient Long Context Reasoning' which implements attention kernels achieving 4.5× speedup — direct GPU kernel experience
    • Forked NCCL and has a distributed-sparse-attention repo
    • Based in Berkeley/SF, US
    • Hireability: MEDIUM — current undergrad junior (likely graduating 2027), no explicit job-seeking signals, but prime candidate for summer 2026 internship

    Xiuyu Li

    low hireability

    PhD candidate @ Berkeley AI Research (BAIR) at UC Berkeley

    Previously: Research Consultant @ Together AI

    San Francisco, US

    • Berkeley BAIR PhD (Prof. Keutzer), strong CUDA/GPU kernel background — lead contributor to TorchSparse and TorchSparse++ (CUDA sparse convolution frameworks, MICRO'23/MLSys'22); also published on Q-Diffusion and SqueezeLLM quantization. h-index 19
    • Bay Area, US
    • Now MTS at xAI working on coding RL and infra
    • Hireability: LOW — pipeline shows a recent move from Together AI (Research Consultant) to xAI (MTS); profile scraped Feb 2026, so likely only 3–5 months into the new role

    Runs

    #1 · completed · 0 qualified / 0 found · Apr 21, 9:38 PM