
Junior GPU kernel engineers in the US with CUDA/Triton experience

Completed · 43 qualified · 1 run · Apr 19, 4:41 PM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi-1776616906
Parsed 3 topics · Junior · Engineer · US
Generating seed nodes: 0 proposed
Explored 0 queries: 0/0 done
3. Expanding nodes: queued
4. Qualifying candidates: queued

    Qualified Candidates (43)


    Abhishek Tyagi

    high hireability

    PhD student@University of Rochester

    Previously: Senior Engineer @ Samsung

    Rochester, US

    • PhD student at University of Rochester with explicit CUDA and parallel computing expertise
    • Former NVIDIA intern (Resiliency and Safety Architecture)
    • LinkedIn profile recently wiped (scraped Jan 2026), suggesting a possible job search
    • Strong fit for junior GPU kernel role
    • Hireability: HIGH — CUDA background, NVIDIA experience, and apparent openness to new opportunities

    Aleksa Gordic

    high hireability
    • Ex-DeepMind/Microsoft, SF-based, self-described 'Tensor Core maximalist'
    • GitHub has fast.cu (fastest CUDA kernels from scratch), llm.c CUDA fork, awesomeMLSys
    • Strong direct GPU kernel writing signal
    • US-based
    • Mid-level seniority (GitHub account from 2017; ex-DeepMind/Microsoft, but no senior title)
    • Hireability: HIGH — actively writing fast CUDA kernels from scratch, Tensor Core focus directly matches the role

    Arjun Parthasarathy

    high hireability

    Incoming Quantitative Research Intern@Jump Trading

    Previously: Research Assistant @ Columbia Engineering

    Chicago, US

    • CS + Math at Columbia (Research Assistant), incoming Quant Research Intern at Jump Trading
    • GitHub shows TKPuzzles (CUDA GPU tile kernel puzzles) and ThunderKittens fork (CUDA tile primitives)
    • Multiple PRs to HazyResearch/ThunderKittens including FFTConv CUDA kernel and 256 kernel
    • Location: Palo Alto, CA
    • Hireability: HIGH — undergrad/junior, strong direct GPU kernel coding in CUDA

    David Berard

    high hireability
    • Anthropic engineer, San Francisco CA
    • GitHub has outperforming-cublas (CUDA kernel), GPU Mode leaderboard participation, triton-cpu fork, tritonbench fork, FBGEMM
    • Directly writing CUDA kernels and competing in GPU Mode challenges
    • US-based, mid-level seniority (account from 2013, but no senior/staff title)
    • Hireability: HIGH — actively writing and benchmarking CUDA/Triton kernels, GPU MODE leaderboard competitor, strong direct kernel experience

    dePaul Miller

    high hireability
    • DL Software Performance at NVIDIA, PhD student at Lehigh University
    • GitHub repos: SlabHash (warp-oriented GPU hash table in CUDA), MVGpuBTree (GPU B-Tree), triton fork, cutlass fork, flashinfer
    • Writing GPU data structure kernels from scratch
    • US-based, junior (current PhD student)
    • Hireability: HIGH — PhD student writing real CUDA GPU kernels (hash tables, B-trees), active in Triton/CUTLASS, classic junior GPU kernel engineer profile

    Hayden Prairie

    high hireability

    Kernels Research Intern@Together

    Previously: Research Assistant @ University of Texas at Austin

    San Diego, US

    • Kernels Research Intern at Together AI, starting PhD at UCSD
    • Triton einops-style library (tlib), ThunderKittens CUDA fork, committed to gpu-kernel-dev repo with NCU sweeps and RMS norm
    • Location: San Diego, CA
    • Hireability: HIGH — intern/junior with active Triton kernel code

    Hongzheng Chen

    high hireability

    Ph.D. Candidate@Cornell University

    Previously: Undergrad student @ SUN YAT-SEN UNIVERSITY

    Ithaca, US

    • Ph.D. Candidate at Cornell University, Ithaca NY; 'Compiler for accelerators' bio, 520 GitHub followers
    • KernelBench fork (GPU kernel benchmarking), sglang fork
    • Primary work is in cornell-zhang/allo (HLS/MLIR accelerator compiler) and now also Student Researcher at Google
    • Strong compiler+GPU kernel systems expertise
    • Hireability: HIGH — junior PhD with compiler+GPU kernel systems focus, active in accelerator tooling

    Kyle Wang

    high hireability
    • Santa Clara CA; bio: 'Triton, MLIR, LLVM. Previously working on deep learning training and inference systems.'
    • GitHub: triton fork (MLIR language), Triton-distributed, iris (Triton RMA library), torch-mlir, llvm-project
    • Direct Triton kernel compiler contributor
    • US-based
    • Account created 2017, mid-level
    • Hireability: HIGH — Triton/MLIR/LLVM specialist, actively working on Triton compiler and distributed kernel systems, perfect match for junior GPU kernel engineer

    Lain

    high hireability
    • Lain/Siyuan Feng (IwakuraRein): NVIDIA engineer (siyuanf@nvidia.com), 7 CUTLASS contributions
    • US (Santa Clara, CA)

    Lei Mao

    high hireability
    • Lei Mao (leimao): Meta engineer (Silicon Valley CA), 5 CUTLASS contributions
    • Bio: 'AI/ML/CS, C++, CUDA, Python'
    • Blog has extensive CUDA tutorials
    • Strong evidence of kernel work; 4.9K followers
    • US

    Maksim Levental

    high hireability
    • Maksim Levental (makslevental): Apple engineer, 44 Triton contributions, compilers/DSLs/accelerator architectures focus
    • Strong GPU kernel + compiler signal
    • US (Cupertino, CA)

    Manish Gupta

    high hireability
    • Manish Gupta (manishucsd): Magic.dev US, CUTLASS contributor with UCSD background
    • Blog at cseweb.ucsd.edu suggests PhD-level
    • Repos include iree, llvm-project, scaling book - serious GPU/compiler engineer
    • High CUDA/Triton signal

    Markus Hoehnerbach

    high hireability
    • Markus Hoehnerbach (v0i0): Meta engineer with 7 flash-attention contributions
    • Research background in HPC/GPU kernels
    • PhD-level researcher doing applied kernel work
    • Strong CUDA/kernel signal
    • US (California)

    Pradeep Ramani

    high hireability
    • Pradeep Ramani (IonThruster): NVIDIA CUTLASS/CUDA/GPGPU engineer, Santa Clara CA
    • Bio: 'CUTLASS / CUDA / GPGPU / NVIDIA'
    • CUTLASS contributor
    • Core GPU kernel engineer
    • Seniority unclear but bio and 3 CUTLASS contributions suggest active dev

    Puyan Lotfi

    high hireability
    • Puyan Lotfi (plotfi): Meta AIML GPU Compiler Engineer, SF CA, 20 Triton contributions
    • Direct GPU compiler/kernel work
    • US
    • Strong signal

    Ravi Ghadia

    high hireability

    GEMM Kernels intern@AMD

    Previously: Research Intern @ Together AI

    Austin, US

    • GEMM Kernels Intern at AMD, expertise in GEMM kernel optimization and CUDA
    • GitHub has ROCgdb, flash-attention fork, and a merged FA3 tensor size fix PR on Dao-AILab/flash-attention
    • Now at NVIDIA
    • Location: Bengaluru (verify US)
    • Hireability: HIGH — intern seniority, direct GEMM/CUDA kernel work; confirm US location

    Shreya Gaur

    high hireability
    • Shreya Gaur: NVIDIA Deep Learning Performance Engineer, CUTLASS contributor (cutlassdev repo, gemm_sample)
    • Direct kernel work evidence - CUTLASS dev and GEMM samples
    • US (Santa Clara)
    • Strong signal

    William Brandon

    high hireability

    PhD student@MIT CSAIL

    Previously: Research Assistant @ MIT Media Lab

    Cambridge, US

    • PhD student at MIT CSAIL, Cambridge MA
    • Research expertise: GPU programming, ML systems, compilers
    • striped_attention repo (46 stars, GPU attention kernel); telerun job queue system
    • Prev UC Berkeley CS/Math
    • Hireability: HIGH — junior PhD with GPU programming focus and public GPU kernel code

    Yang Chen

    high hireability
    • Yang Chen (chenyang78): Meta engineer (Bellevue WA), CUTLASS contributor, tilelang and flashinfer repos. 'GPU/CPU/Accelerators kernels' domain - direct kernel work
    • Utah CS background
    • Strong signal

    Yashas Samaga B L

    high hireability

    Student Researcher@Ai2

    Previously: Pre-Doctoral Researcher @ DeepMind

    Seattle, US

    • Student Researcher at Ai2, Seattle
    • GitHub shows ConvolutionBuildingBlocks (CUDA, 28 stars) using CUTLASS for GEMM/Winograd convolutions, plus many OpenCV CUDA DNN PRs including eltwise broadcasting support
    • Strong GPU kernel track record
    • Hireability: HIGH — junior/student with proven CUDA kernel code

    Ying Zhang

    high hireability
    • Ying Zhang (ipiszy): xAI engineer, Palo Alto CA, 17 flash-attention contributions
    • Repos: cutlass-tests (Cuda), flash-attention
    • Direct CUDA kernel work
    • Strong signal

    Yujia Zhai

    high hireability
    • Yujia Zhai (yzhaiustc): NVIDIA engineer (Santa Clara), 18 CUTLASS contributions, ftblas repo (high-perf BLAS with fault tolerance), yzhaiustc.github.io personal site
    • NVIDIA GPU kernel engineer
    • Strong signal

    Aditya Tomar

    medium hireability

    Undergraduate Student@UC Berkeley

    Previously: Researcher @ PSSG

    US

    • Undergraduate student at UC Berkeley with GPU programming, CUDA, and parallel computing expertise
    • Former NVIDIA Research Intern, currently at BAIR
    • Very early career stage (undergrad)
    • Published papers (h-index 2)
    • Hireability: MEDIUM — strong CUDA fit and relevant internships but still undergrad, timing of availability uncertain

    Adnan Akhundov

    medium hireability
    • ML Compilers SWE at Meta (SF Bay Area), working on PyTorch, Triton, and AITemplate
    • Active Triton contributor (forked), CUTLASS contributor, AITemplate contributor — direct GPU kernel/compiler experience
    • US-based
    • Mid-level (GitHub account 2017, not Principal/Staff title)
    • Hireability: MEDIUM — solid Triton/GPU compiler background but seniority slightly uncertain; role is 'SWE' not senior

    Aidan Do

    medium hireability
    • SWE at Fireworks AI (SF Bay Area), NVIDIA/Meta OSS contributor, ex-Canva
    • GitHub repos include ThunderKittens (tile primitives for GPU kernels), CUTLASS, FlashInfer, TensorRT-LLM
    • Direct GPU kernel exposure
    • US-based, junior-to-mid (account 2018)
    • Hireability: MEDIUM — GPU kernel exposure via ThunderKittens/CUTLASS/FlashInfer, contributor profile

    Baixi Sun

    medium hireability

    Associate Instructor@Indiana University

    Previously: Doctoral Researcher Intern @ ByteDance

    Bloomington, US

    • PhD student at Indiana University (HPC for AI Researcher)
    • GitHub shows COMPSO_Compressor (CUDA language, GPU error-bounded lossy compressor). cuSZp GPU compressor work
    • LinkedIn shows recently left Indiana University position
    • Hireability: MEDIUM — junior PhD with actual CUDA GPU compute code, but focus is HPC/compression rather than ML kernel optimization

    bingyizh233

    medium hireability
    • NVIDIA engineer, Santa Clara
    • GitHub (created 2020) shows triton and tritonbench forks — active Triton work
    • US-based
    • Account is relatively new (2020), suggesting junior-to-mid level
    • Hireability: MEDIUM — Triton kernel exposure at NVIDIA, but limited public signal beyond forks; direct seniority hard to confirm

    Chamika Sudusinghe

    medium hireability

    Research Scholar@University of Illinois Urbana-Champaign (UIUC)

    Previously: TAB Representative, IEEE Young Professionals @ IEEE

    Urbana, US

    • Research Scholar at ADAPT@Illinois (UIUC), PhD student
    • Research: Computer Architecture, Compilers, ML, Performance Optimization
    • GitHub shows triton-cxgnn (Triton fork for GNN kernels) and SparseTIR CUDA kernel compatibility PR
    • Active GPU/compiler systems work
    • Hireability: MEDIUM — junior researcher with Triton kernel work, but focus is sparse GNN rather than dense ML kernels

    Cliff Burdick

    medium hireability
    • NVIDIA SWE, San Diego CA
    • Bio says 'CUDA/C++ optimizations'
    • GitHub forks include CCCL (CUDA Core Compute Libraries), jitify (CUDA NVRTC), CUDA Library Samples — direct GPU kernel experience
    • US-based
    • Mid-level (account 2017, no senior title)
    • Hireability: MEDIUM — genuine CUDA optimization focus at NVIDIA, though the public portfolio is mostly forks

    Dan Zimmerman

    medium hireability
    • US-based
    • GitHub shows triton fork, helion (ML kernel DSL), FBGEMM, ROCm blogs — direct GPU kernel/compiler work
    • Account created in 2009, but no title information to indicate seniority
    • Company unknown
    • Hireability: MEDIUM — Triton/ROCm/FBGEMM signals are strong for GPU kernel work, old account age and unclear employer make seniority uncertain

    Gary Geng

    medium hireability
    • NVIDIA engineer, Santa Clara CA
    • GitHub (account created 2024-03-04) shows tritonbench and xla-triton forks — very new account suggesting new grad/junior hire working on Triton at NVIDIA
    • US-based
    • Hireability: MEDIUM — fresh NVIDIA Triton engineer (brand new account, likely new grad), directly relevant but limited public track record

    HamidReza Imani

    medium hireability

    Research Assistant@The George Washington University

    Previously: Software Engineer Intern @ Modular

    Washington, US

    • Research Assistant at GWU
    • Research: HPC, Distributed ML, GPU Programming
    • US location confirmed
    • No public GPU kernel repos but expertise description explicitly includes GPU programming
    • Hireability: MEDIUM — junior with stated GPU programming expertise, limited public code to verify

    Hongtao Yu

    medium hireability
    • Compiler engineer at Facebook/Meta, Seattle WA (US-based)
    • GitHub has CUTracer (CUDA kernel instruction tracer), triton fork, tritonbench, FBGEMM
    • US-based
    • Mid-level seniority (account 2015, generic title)
    • Hireability: MEDIUM — compiler engineer with Triton/CUDA tracing experience, but focus is compiler infrastructure rather than writing end-user GPU kernels

    Jake Hyun

    medium hireability

    PhD Student@Cornell University

    Previously: Undergraduate Research Intern @ Seoul National University

    New York, US

    • PhD Student at Cornell University, New York
    • Advanced Compilers coursework (CS 6120)
    • Sole contributor to ML-HW-SYS/a3-2026-leaderboard (GPU kernel optimization competition, 3665 contributions)
    • Also contributed to NVFP4-RaZeR inference artifact
    • Compiler + GPU systems focus
    • Hireability: MEDIUM — junior PhD with GPU kernel optimization work but limited direct CUDA/Triton public code

    Jez Ng

    medium hireability
    • SF Compute, San Francisco CA
    • GitHub: triton-cpu fork, AITemplate (CUDA/HIP codegen), Mojo path tracer, LLVM fork
    • Systems/compiler engineer with GPU codegen exposure
    • US-based
    • Mid-level apparent seniority (account 2010 but no senior title)
    • Hireability: MEDIUM — compiler/systems background with GPU codegen (AITemplate, triton-cpu), primary focus is compiler infrastructure rather than direct CUDA/Triton kernel writing

    Kirthi Shankar Sivamani

    medium hireability
    • Deep Learning at NVIDIA, Palo Alto CA
    • GitHub: cudnn-frontend, flash-attention, Megatron-LM, lightning-thunder forks
    • Account created 2018 — mid-level
    • Bio says 'Deep Learning at NVIDIA'
    • Repos show GPU compute stack exposure (cuDNN, flash-attention) but focus appears more on training frameworks than kernel authoring
    • US-based
    • Hireability: MEDIUM — NVIDIA DL engineer with GPU compute exposure, but profile suggests framework-level rather than direct kernel writing

    Lixun Zhang

    medium hireability
    • Lixun Zhang (zhanglx13): AMD Triton contributor (58 contributions, Austin TX)
    • Has sglang fork, Triton-focused work
    • Active AMD GPU engineer with strong Triton signal
    • US

    Michael Melesse

    medium hireability
    • Michael Melesse (micmelesse): AMD engineer (New York), 16 Triton contributions
    • Repos: aiter (AI Tensor Engine for ROCm), vllm
    • Active AMD GPU kernel contributor
    • US

    Pengzhan Zhao

    medium hireability
    • Pengzhan Zhao (borontion): AMD engineer (SF Bay Area), 45 Triton contributions
    • GitHub account from 2017, active AMD Triton work
    • Linux kernel fork also shows low-level systems interest
    • US

    Scott Cheng

    medium hireability

    PhD student@Penn State CSE

    US

    • PhD student at Penn State CSE, 'Open to Work'
    • GitHub shows cutlass fork and PRs to facebookincubator/AITemplate (nvcc host compiler support, LTO options) and NVIDIA/cutlass (CUDA 13 flags fix)
    • Compiler-adjacent GPU systems work
    • Hireability: MEDIUM — junior PhD with real CUDA/CUTLASS contributions but mostly build/tooling level rather than kernel authoring

    Yasa Baig

    medium hireability

    PhD student@Stanford University

    San Francisco, US

    • PhD student at Stanford University
    • Research: HPC, GPUs, Scientific Computing, Biophysics
    • US location confirmed
    • GPU expertise directly listed
    • Stanford PhD student = junior
    • Research focus on GPU-accelerated scientific computing
    • Hireability: MEDIUM — junior PhD with stated GPU expertise, limited public code verification

    Ye Charlotte Qi

    medium hireability
    • Ye Charlotte Qi (yeqcharlotte): Meta MSL Inference Infra SWE, Menlo Park, CA
    • Flash-attention contributor with PyTorch/GPU inference stack work
    • Role suggests GPU kernel proximity
    • US, junior-mid level

    Yi Qian

    medium hireability
    • Yi Qian (yiqian1): AMD engineer in Frisco TX, 45 Triton contributions, works on LLVM for gfx950 (GPU compiler)
    • Active Triton contributor
    • US, mid-level

    Runs

#1 · completed · 43 qualified / 115 found · Apr 19, 4:41 PM