
Junior GPU kernel engineers in the US with CUDA/Triton experience

Completed · 7 qualified · 1 run · Apr 21, 5:05 PM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi-1776791114
Parsed 3 topics: Junior · Engineer · United States

    Qualified Candidates (7)

    Aditya Tomar

    medium hireability

    Undergraduate Student @ UC Berkeley

    Previously: Researcher @ PSSG

    US

    • 3rd-year EECS undergrad at UC Berkeley doing GPU systems research at BAIR under Kurt Keutzer — QuantSpec (ICML 2025) and XQuant for LLM inference optimization, SC 2024 Gordon Bell Prize Finalist for scalable GPU supercomputer training
    • Core work is GPU systems/HPC rather than explicit CUDA/Triton kernel writing, but strong adjacent GPU exposure and upcoming NVIDIA Applied DL Research internship (May–Aug 2026)
    • Hireability: MEDIUM — still undergrad (graduating ~2026/2027), committed to NVIDIA internship through Aug 2026; best window is summer 2027 internship or post-graduation full-time

    Daiyaan Arfeen

    medium hireability

    PhD student @ Carnegie Mellon University

    Previously: Deep Learning Architecture Intern @ NVIDIA

    San Francisco, US

    • GPU systems PhD at CMU PDL — papers on GPU utilization (PipeFill, MLSys 2025), LLM serving acceleration (SpecInfer, ASPLOS 2024, 386 citations), and ML cluster scheduling (Sia, SOSP 2023)
    • Work is GPU systems-level rather than explicit CUDA/Triton kernel writing, but demonstrates deep GPU architecture knowledge
    • US-based
    • Hireability: MEDIUM — long publication record spanning 2018-2025 (likely 5-7+ years into the PhD), possibly nearing graduation, but no active job-search signals from pipeline or website

    Gabriele Oliaro

    medium hireability

    CS PhD Student @ Snowflake AI Research

    Previously: Research Scientist Intern @ Snowflake

    • ML systems PhD at CMU with GPU kernel relevance — Korch paper (ASPLOS '24) on optimal kernel orchestration for tensor programs, and FlexFlow C++/CUDA distributed training framework (contributor)
    • Co-authored SpecInfer (415 citations) for LLM inference acceleration on GPUs; h-index 10
    • Based in Pittsburgh, PA / Snowflake internship in San Mateo, CA (US)
    • Hireability: MEDIUM — 4th year PhD, expected graduation 2027, currently on Snowflake AI Research internship (~2 years); approaching final stretch but not yet in the prime transition window

    Haocheng Xi

    medium hireability

    MLSys Researcher @ University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    • Strong GPU kernel engineer — CUDA repos pinned (how-to-optim-algorithm-in-cuda, cuda-tensorcore-hgemm), multiple NVIDIA internships on FP8/INT8 training (COAT at ICLR 2025), PhD at UC Berkeley on ML sys/Efficient ML
    • Based in Berkeley, CA
    • Hireability: MEDIUM — 2nd-year PhD student (started 2024), very active on GitHub (commit 2026-04-20), two NVIDIA internships show industry engagement, but likely 2+ years from full-time market; strong intern candidate now

    Hiva Mohammadzadeh

    low hireability

    Machine Learning Engineer | Data Scientist @ IntuigenceAI

    Previously: Machine Learning Engineer @ Algoverse

    San Francisco, US

    • Co-authored KVQuant (NeurIPS 2024, 322 citations) which developed custom CUDA kernels for KV cache quantization achieving ~1.7x speedups on LLaMA-7B; also co-authored Squeezed Attention and SPEED papers on LLM inference acceleration
    • BS EECS UC Berkeley, currently enrolled in Stanford MSCS AI/Systems (2025-2027) while working as ML Engineer at IntuigenceAI in SF
    • Hireability: LOW — only 6-8 months into a 2-year Stanford MS program (graduates 2027); no open-to-work signals; simultaneously working part-time at IntuigenceAI

    Xiuyu Li

    low hireability

    PhD candidate @ Berkeley AI Research (BAIR) at UC Berkeley

    Previously: Research Consultant @ Together AI

    San Francisco, US

    • Strong CUDA/GPU kernel background via TorchSparse and TorchSparse++ (sparse convolution on GPUs, 144+67 citations, MICRO'23/MLSys'22)
    • PhD from UC Berkeley BAIR, now MTS at xAI focused on coding RL/infra — more senior than 'junior' label but has direct GPU kernel CUDA experience
    • US (Bay Area) ✓
    • Hireability: LOW — pipeline signals show she just transitioned from Research Consultant at Together AI → MTS at xAI (scraped 2026-02-05), ~2.5 months into new role

    Zhihao Zhang

    low hireability

    Ph.D. student @ Carnegie Mellon University

    Previously: MS student @ Carnegie Mellon University

    Pittsburgh, US

    • Active CUDA kernel contributor to mirage-project/mirage: implemented Blackwell (SM100) linear kernels with TMA+Epilogue pipelines, MoE kernels with expert-balanced GeMM, and PTX sync optimizations (Oct-Dec 2025)
    • Also authored CUDA kernel impl for TidalDecode sparse attention paper (2024)
    • PhD at CMU Catalyst under Zhihao Jia, Pittsburgh US
    • Hireability: LOW — appears to have just joined Lithos AI startup (lithos-ai/motus PRs merged April 11-21 2026, <2 weeks ago), very likely a new hire

    Runs

    #1 · completed · 0 qualified / 0 found · Apr 21, 5:05 PM