
Junior GPU kernel engineers in the US with CUDA/Triton experience

Completed · 43 qualified · 1 run · Apr 19, 4:41 PM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi-1776616906
Parsed 3 topics · Junior · Engineer · US
Generating seed nodes: 0 proposed
Explored 0 queries: 0/0 done
3. Expanding nodes: queued
4. Qualifying candidates: queued

    Qualified Candidates (43)


    Abhishek Tyagi

    high hireability

    PhD student@University of Rochester

    Previously: Senior Engineer @ Samsung

    Rochester, US

    • PhD student at University of Rochester with explicit CUDA and parallel computing expertise
    • Former NVIDIA intern (Resiliency and Safety Architecture)
    • LinkedIn profile recently wiped (scraped Jan 2026), suggesting a possible job search
    • Strong fit for junior GPU kernel role
    • Hireability: HIGH — CUDA background, NVIDIA experience, and apparent openness to new opportunities

    Aleksa Gordic

    high hireability
    • Ex-DeepMind/Microsoft, SF-based, self-described 'Tensor Core maximalist'
    • GitHub has fast.cu (fastest CUDA kernels from scratch), llm.c CUDA fork, awesomeMLSys
    • Strong direct GPU kernel writing signal
    • US-based
    • Mid-level seniority (GitHub account from 2017; ex-DeepMind/Microsoft, but no senior title)
    • Hireability: HIGH — actively writing fast CUDA kernels from scratch, Tensor Core focus directly matches the role

    Arjun Parthasarathy

    high hireability

    Incoming Quantitative Research Intern@Jump Trading

    Previously: Research Assistant @ Columbia Engineering

    Chicago, US

    • CS + Math at Columbia (Research Assistant), incoming Quant Research Intern at Jump Trading
    • GitHub shows TKPuzzles (CUDA GPU tile kernel puzzles) and ThunderKittens fork (CUDA tile primitives)
    • Multiple PRs to HazyResearch/ThunderKittens including FFTConv CUDA kernel and 256 kernel
    • Location: Palo Alto, CA
    • Hireability: HIGH — undergrad/junior, strong direct GPU kernel coding in CUDA

    David Berard

    high hireability
    • Anthropic engineer, San Francisco CA
    • GitHub has outperforming-cublas (CUDA kernel), GPU Mode leaderboard participation, triton-cpu fork, tritonbench fork, FBGEMM
    • Directly writing CUDA kernels and competing in GPU Mode challenges
    • US-based, mid-level seniority (account from 2013, but no senior/staff title)
    • Hireability: HIGH — actively writing and benchmarking CUDA/Triton kernels, GPU MODE leaderboard competitor, strong direct kernel experience

    dePaul Miller

    high hireability
    • DL Software Performance at NVIDIA, PhD student at Lehigh University
    • GitHub repos: SlabHash (warp-oriented GPU hash table in CUDA), MVGpuBTree (GPU B-Tree), triton fork, cutlass fork, flashinfer
    • Writing GPU data structure kernels from scratch
    • US-based, junior (current PhD student)
    • Hireability: HIGH — PhD student writing real CUDA GPU kernels (hash tables, B-trees), active in Triton/CUTLASS, classic junior GPU kernel engineer profile

    Hayden Prairie

    high hireability

    Kernels Research Intern@Together

    Previously: Research Assistant @ University of Texas at Austin

    San Diego, US

    • Kernels Research Intern at Together AI, starting PhD at UCSD
    • Triton einops-style library (tlib), ThunderKittens CUDA fork, committed to gpu-kernel-dev repo with NCU sweeps and RMS norm
    • Location: San Diego, CA
    • Hireability: HIGH — intern/junior with active Triton kernel code

    Hongzheng Chen

    high hireability

    Ph.D. Candidate@Cornell University

    Previously: Undergrad student @ SUN YAT-SEN UNIVERSITY

    Ithaca, US

    • Ph.D. Candidate at Cornell University, Ithaca NY; 'Compiler for accelerators' bio, 520 GitHub followers
    • KernelBench fork (GPU kernel benchmarking), sglang fork
    • Primary work is in cornell-zhang/allo (HLS/MLIR accelerator compiler) and now also Student Researcher at Google
    • Strong compiler+GPU kernel systems expertise
    • Hireability: HIGH — junior PhD with compiler+GPU kernel systems focus, active in accelerator tooling

    Kyle Wang

    high hireability
    • Santa Clara CA; bio: 'Triton, MLIR, LLVM. Previously working on deep learning training and inference systems.'
    • GitHub: triton fork (MLIR language), Triton-distributed, iris (Triton RMA library), torch-mlir, llvm-project
    • Direct Triton kernel compiler contributor
    • US-based
    • Account created 2017, mid-level
    • Hireability: HIGH — Triton/MLIR/LLVM specialist, actively working on Triton compiler and distributed kernel systems, perfect match for junior GPU kernel engineer

    Lain

    high hireability
    • Lain/Siyuan Feng (IwakuraRein): NVIDIA engineer (siyuanf@nvidia.com), 7 CUTLASS contributions
    • US (Santa Clara, CA)

    Lei Mao

    high hireability
    • Lei Mao (leimao): Meta engineer (Silicon Valley CA), 5 CUTLASS contributions
    • Bio: 'AI/ML/CS, C++, CUDA, Python'
    • Blog has extensive CUDA tutorials
    • Strong evidence of kernel work; 4.9K followers
    • US

    Maksim Levental

    high hireability
    • Maksim Levental (makslevental): Apple engineer, 44 Triton contributions, compilers/DSLs/accelerator architectures focus
    • Strong GPU kernel + compiler signal
    • US (Cupertino, CA)

    Manish Gupta

    high hireability
    • Manish Gupta (manishucsd): Magic.dev US, CUTLASS contributor with UCSD background
    • Blog at cseweb.ucsd.edu suggests PhD-level
    • Repos include iree, llvm-project, scaling book - serious GPU/compiler engineer
    • High CUDA/Triton signal

    Markus Hoehnerbach

    high hireability
    • Markus Hoehnerbach (v0i0): Meta engineer with 7 flash-attention contributions
    • Research background in HPC/GPU kernels
    • PhD-level researcher doing applied kernel work
    • Strong CUDA/kernel signal
    • US (California)

    Pradeep Ramani

    high hireability
    • Pradeep Ramani (IonThruster): NVIDIA CUTLASS/CUDA/GPGPU engineer, Santa Clara CA
    • Bio: 'CUTLASS / CUDA / GPGPU / NVIDIA'
    • CUTLASS contributor
    • Core GPU kernel engineer
    • Seniority unclear but bio and 3 CUTLASS contributions suggest active dev

    Puyan Lotfi

    high hireability
    • Puyan Lotfi (plotfi): Meta AIML GPU Compiler Engineer, SF CA, 20 Triton contributions
    • Direct GPU compiler/kernel work
    • US
    • Strong signal

    Ravi Ghadia

    high hireability

    GEMM Kernels intern@AMD

    Previously: Research Intern @ Together AI

    Austin, US

    • GEMM Kernels Intern at AMD, expertise in GEMM kernel optimization and CUDA
    • GitHub has ROCgdb, flash-attention fork, and a merged FA3 tensor size fix PR on Dao-AILab/flash-attention
    • Now at NVIDIA
    • Location: Bengaluru (verify US)
    • Hireability: HIGH — intern seniority, direct GEMM/CUDA kernel work; confirm US location

    Shreya Gaur

    high hireability
    • Shreya Gaur: NVIDIA Deep Learning Performance Engineer, CUTLASS contributor (cutlassdev repo, gemm_sample)
    • Direct kernel work evidence - CUTLASS dev and GEMM samples
    • US (Santa Clara)
    • Strong signal

    William Brandon

    high hireability

    PhD student@MIT CSAIL

    Previously: Research Assistant @ MIT Media Lab

    Cambridge, US

    • PhD student at MIT CSAIL, Cambridge MA
    • Research expertise: GPU programming, ML systems, compilers
    • striped_attention repo (46 stars, GPU attention kernel); telerun job queue system
    • Prev UC Berkeley CS/Math
    • Hireability: HIGH — junior PhD with GPU programming focus and public GPU kernel code

    Yang Chen

    high hireability
    • Yang Chen (chenyang78): Meta engineer (Bellevue WA), CUTLASS contributor, tilelang and flashinfer repos. 'GPU/CPU/Accelerators kernels' domain - direct kernel work
    • Utah CS background
    • Strong signal

    Yashas Samaga B L

    high hireability

    Student Researcher@Ai2

    Previously: Pre-Doctoral Researcher @ DeepMind

    Seattle, US

    • Student Researcher at Ai2, Seattle
    • GitHub shows ConvolutionBuildingBlocks (CUDA, 28 stars) using CUTLASS for GEMM/Winograd convolutions, plus many OpenCV CUDA DNN PRs including eltwise broadcasting support
    • Strong GPU kernel track record
    • Hireability: HIGH — junior/student with proven CUDA kernel code

    Ying Zhang

    high hireability
    • Ying Zhang (ipiszy): xAI engineer, Palo Alto CA, 17 flash-attention contributions
    • Repos: cutlass-tests (Cuda), flash-attention
    • Direct CUDA kernel work
    • Strong signal

    Yujia Zhai

    high hireability
    • Yujia Zhai (yzhaiustc): NVIDIA engineer (Santa Clara), 18 CUTLASS contributions, ftblas repo (high-perf BLAS with fault tolerance), yzhaiustc.github.io personal site
    • NVIDIA GPU kernel engineer
    • Strong signal

    Aditya Tomar

    medium hireability

    Undergraduate Student@UC Berkeley

    Previously: Researcher @ PSSG

    US

    • Undergraduate student at UC Berkeley with GPU programming, CUDA, and parallel computing expertise
    • Former NVIDIA Research Intern, currently at BAIR
    • Very early career stage (undergrad)
    • Published papers (h-index 2)
    • Hireability: MEDIUM — strong CUDA fit and relevant internships but still undergrad, timing of availability uncertain

    Adnan Akhundov

    medium hireability
    • ML Compilers SWE at Meta (SF Bay Area), working on PyTorch, Triton, and AITemplate
    • Active Triton contributor (forked), CUTLASS contributor, AITemplate contributor — direct GPU kernel/compiler experience
    • US-based
    • Mid-level (GitHub account 2017, not Principal/Staff title)
    • Hireability: MEDIUM — solid Triton/GPU compiler background but seniority slightly uncertain; role is 'SWE' not senior

    Aidan Do

    medium hireability
    • SWE at Fireworks AI (SF Bay Area), NVIDIA/Meta OSS contributor, ex-Canva
    • GitHub repos include ThunderKittens (tile primitives for GPU kernels), CUTLASS, FlashInfer, TensorRT-LLM
    • Direct GPU kernel exposure
    • US-based, junior-to-mid (account 2018)
    • Hireability: MEDIUM — GPU kernel exposure via ThunderKittens/CUTLASS/FlashInfer, contributor profile

    Baixi Sun

    medium hireability

    Associate Instructor@Indiana University

    Previously: Doctoral Researcher Intern @ ByteDance

    Bloomington, US

    • PhD student at Indiana University (HPC for AI Researcher)
    • GitHub shows COMPSO_Compressor (CUDA language, GPU error-bounded lossy compressor). cuSZp GPU compressor work
    • LinkedIn shows recently left Indiana University position
    • Hireability: MEDIUM — junior PhD with actual CUDA GPU compute code, but focus is HPC/compression rather than ML kernel optimization

    bingyizh233

    medium hireability
    • NVIDIA engineer, Santa Clara
    • GitHub (created 2020) shows triton and tritonbench forks — active Triton work
    • US-based
    • Account is relatively new (2020), suggesting junior-to-mid level
    • Hireability: MEDIUM — Triton kernel exposure at NVIDIA, but limited public signal beyond forks; direct seniority hard to confirm

    Chamika Sudusinghe

    medium hireability

    Research Scholar@University of Illinois Urbana-Champaign (UIUC)

    Previously: TAB Representative, IEEE Young Professionals @ IEEE

    Urbana, US

    • Research Scholar at ADAPT@Illinois (UIUC), PhD student
    • Research: Computer Architecture, Compilers, ML, Performance Optimization
    • GitHub shows triton-cxgnn (Triton fork for GNN kernels) and SparseTIR CUDA kernel compatibility PR
    • Active GPU/compiler systems work
    • Hireability: MEDIUM — junior researcher with Triton kernel work, but focus is sparse GNN rather than dense ML kernels

    Cliff Burdick

    medium hireability
    • NVIDIA SWE, San Diego CA
    • Bio says 'CUDA/C++ optimizations'
    • GitHub forks include CCCL (CUDA Core Compute Libraries), jitify (CUDA NVRTC), CUDA Library Samples — direct GPU kernel experience
    • US-based
    • Mid-level (account 2017, no senior title)
    • Hireability: MEDIUM — genuine CUDA optimization focus at NVIDIA, though the public portfolio is mostly forks

    Dan Zimmerman

    medium hireability
    • US-based
    • GitHub shows triton fork, helion (ML kernel DSL), FBGEMM, ROCm blogs — direct GPU kernel/compiler work
    • Account created in 2009, but no title information to indicate seniority
    • Company unknown
    • Hireability: MEDIUM — Triton/ROCm/FBGEMM signals are strong for GPU kernel work, old account age and unclear employer make seniority uncertain

    Gary Geng

    medium hireability
    • NVIDIA engineer, Santa Clara CA
    • GitHub (account created 2024-03-04) shows tritonbench and xla-triton forks — very new account suggesting new grad/junior hire working on Triton at NVIDIA
    • US-based
    • Hireability: MEDIUM — fresh NVIDIA Triton engineer (brand new account, likely new grad), directly relevant but limited public track record

    HamidReza Imani

    medium hireability

    Research Assistant@The George Washington University

    Previously: Software Engineer Intern @ Modular

    Washington, US

    • Research Assistant at GWU
    • Research: HPC, Distributed ML, GPU Programming
    • US location confirmed
    • No public GPU kernel repos but expertise description explicitly includes GPU programming
    • Hireability: MEDIUM — junior with stated GPU programming expertise, limited public code to verify

    Hongtao Yu

    medium hireability
    • Compiler engineer at Facebook/Meta, Seattle WA (US-based)
    • GitHub has CUTracer (CUDA kernel instruction tracer), triton fork, tritonbench, FBGEMM
    • US-based
    • Mid-level seniority (account 2015, generic title)
    • Hireability: MEDIUM — compiler engineer with Triton/CUDA tracing experience, but focus is compiler infrastructure rather than writing end-user GPU kernels

    Jake Hyun

    medium hireability

    PhD Student@Cornell University

    Previously: Undergraduate Research Intern @ Seoul National University

    New York, US

    • PhD Student at Cornell University, New York
    • Advanced Compilers coursework (CS 6120)
    • Sole contributor to ML-HW-SYS/a3-2026-leaderboard (GPU kernel optimization competition, 3665 contributions)
    • Also contributed to NVFP4-RaZeR inference artifact
    • Compiler + GPU systems focus
    • Hireability: MEDIUM — junior PhD with GPU kernel optimization work but limited direct CUDA/Triton public code

    Jez Ng

    medium hireability
    • SF Compute, San Francisco CA
    • GitHub: triton-cpu fork, AITemplate (CUDA/HIP codegen), Mojo path tracer, LLVM fork
    • Systems/compiler engineer with GPU codegen exposure
    • US-based
    • Mid-level apparent seniority (account 2010 but no senior title)
    • Hireability: MEDIUM — compiler/systems background with GPU codegen (AITemplate, triton-cpu), primary focus is compiler infrastructure rather than direct CUDA/Triton kernel writing

    Kirthi Shankar Sivamani

    medium hireability
    • Deep Learning at NVIDIA, Palo Alto CA
    • GitHub: cudnn-frontend, flash-attention, Megatron-LM, lightning-thunder forks
    • Account created 2018 — mid-level
    • Bio says 'Deep Learning at NVIDIA'
    • Repos show GPU compute stack exposure (cuDNN, flash-attention) but focus appears more on training frameworks than kernel authoring
    • US-based
    • Hireability: MEDIUM — NVIDIA DL engineer with GPU compute exposure, but profile suggests framework-level rather than direct kernel writing

    Lixun Zhang

    medium hireability
    • Lixun Zhang (zhanglx13): AMD Triton contributor (58 contributions, Austin TX)
    • Has sglang fork, Triton-focused work
    • Active AMD GPU engineer with strong Triton signal
    • US

    Michael Melesse

    medium hireability
    • Michael Melesse (micmelesse): AMD engineer (New York), 16 Triton contributions
    • Repos: aiter (AI Tensor Engine for ROCm), vllm
    • Active AMD GPU kernel contributor
    • US

    Pengzhan Zhao

    medium hireability
    • Pengzhan Zhao (borontion): AMD engineer (SF Bay Area), 45 Triton contributions
    • GitHub account from 2017, active AMD Triton work
    • Linux kernel fork also shows low-level systems interest
    • US

    Scott Cheng

    medium hireability

    PhD student@Penn State CSE

    US

    • PhD student at Penn State CSE, 'Open to Work'
    • GitHub shows cutlass fork and PRs to facebookincubator/AITemplate (nvcc host compiler support, LTO options) and NVIDIA/cutlass (CUDA 13 flags fix)
    • Compiler-adjacent GPU systems work
    • Hireability: MEDIUM — junior PhD with real CUDA/CUTLASS contributions but mostly build/tooling level rather than kernel authoring

    Yasa Baig

    medium hireability

    PhD student@Stanford University

    San Francisco, US

    • PhD student at Stanford University
    • Research: HPC, GPUs, Scientific Computing, Biophysics
    • US location confirmed
    • GPU expertise directly listed
    • Stanford PhD student = junior
    • Research focus on GPU-accelerated scientific computing
    • Hireability: MEDIUM — junior PhD with stated GPU expertise, limited public code verification

    Ye Charlotte Qi

    medium hireability
    • Ye Charlotte Qi (yeqcharlotte): Meta MSL Inference Infra SWE, Menlo Park, CA
    • Flash-attention contributor with PyTorch/GPU inference stack work
    • Role suggests GPU kernel proximity
    • US, junior-mid level

    Yi Qian

    medium hireability
    • Yi Qian (yiqian1): AMD engineer in Frisco TX, 45 Triton contributions, works on LLVM for gfx950 (GPU compiler)
    • Active Triton contributor
    • US, mid-level

    Runs

#1 · completed · 43 qualified / 115 found · Apr 19, 4:41 PM