
junior kernel engineers us

Completed · 85 qualified · 3 runs · Mar 30, 8:34 AM · junior-kernel-engineers-us

Qualified Candidates (85)

AQ

Abu Qader

high hireability
  • SWE at Baseten since July 2022 (~4 years FTE, Cornell BS 2021, no PhD)
  • Led TRT-LLM Engine Builder and EAGLE 3 speculative decoding; co-authored with Tri Dao on achieving fastest Kimi K2.5 and 60% faster GPT-OSS inference — deep LLM inference optimization, adjacent to kernel engineering
  • SF, US-based
  • Hireability: HIGH — explicit hireable:true set on GitHub (profile updated 2026-04-16), 3.75 years at Baseten entering clear transition window
AG

Aiden Grossman

high hireability
  • UC Davis BS (~2023-2024 recent grad)
  • Google SWE. 200+ LLVM commits: MLGO (ML-guided register allocation), AMD Zen 5+6 microarchitecture tuning, XLA, HEIR (homomorphic encryption IR)
  • Lead author of ComPile (2.4TB LLVM-IR dataset)
  • Presented at LLVM Dev Meeting 2022/2024/2025
  • No PhD
  • US-based. <2yr FTE (started ~2024)
  • Exact match on LLVM/MLIR compiler stack for GPU kernel backends
  • Hireability: HIGH — fresh BS-only FTE at Google, exceptional LLVM/compiler depth for a recent undergrad
AP

Akaash Parthasarathy

high hireability
  • Active ML compiler stack contributor: 14 commits to mlc-ai/mlc-llm, merged PR to mlc-ai/web-llm (April 15, 2026, now listed as co-author), active apache/tvm and tvm-ffi contributor
  • Research focus explicitly on ML compilers, hardware-aware algorithm design, efficient LLM inference, and parallelization schemes; CUDA experience
  • CMU MSML + Georgia Tech BS CS
  • Pittsburgh, PA (US)
  • Hireability: HIGH — MSML program ending ~May 2026, no current FTE role, prime new-grad hiring window
AK

Alex Kranias

high hireability

Research Intern / GPU Kernel Engineer@Apple (intern) / SHI Labs (Georgia Tech)

  • Georgia Tech CS BS (AI/ML concentration), current undergrad at SHI Labs
  • AMD Triton Kernels team intern (Fall 2024) — FlashAttention GPU kernel development on ROCm
  • Apple GPU kernels intern (Summer 2025) — GPU kernels for on-device MoE LLMs and Diffusion Transformers
  • Published triton_vs_cuda repo implementing cuBLAS-performant GEMM kernels
  • Strong multi-internship junior kernel signal
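The "cuBLAS-performant GEMM" work cited above rests on one core idea: blocking the computation so each small tile of operands is reused from fast memory. A minimal pure-Python sketch of that tiling pattern (an illustration only, not taken from the candidate's triton_vs_cuda repo; on a GPU the tiles would live in shared memory):

```python
def tiled_matmul(A, B, tile=2):
    """Blocked matrix multiply over plain lists of lists.

    Mirrors the tiling GPU GEMM kernels use: the three outer loops
    walk tile-sized blocks, so each block of A and B is reused many
    times while it is "hot" (in a kernel, resident in shared memory).
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # accumulate the (i0, j0) output tile's partial product
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = 0.0
                        for kk in range(k0, min(k0 + tile, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] += s
    return C
```

The result is identical for any tile size; only the memory-access pattern changes, which is where the performance comes from.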
BC

Bob Chen

high hireability
  • FlexFlow contributor (flexflow-train) and active in LLM inference systems at CMU: sglang-jax (JAX backend for SGLang, contributor to sgl-project org), vLLM-gsoc, KV compression research
  • Broad coverage of distributed LLM inference and serving stacks — strong systems instincts
  • Current CMU student (no FTE), US-based
  • Hireability: HIGH — current student with no FTE employment, graduating likely 2025-2026
CL

cloud11665

high hireability
  • OpenAI engineer in SF
  • Pinned repos: gpuocelot (PTX dynamic compilation framework in C++) and telegraf_nv_export (ultra-low-overhead NVIDIA GPU telemetry C++ plugin) — both highly kernel/systems-relevant. 8 tinygrad commits
  • No PhD signals, no senior title publicly confirmed
  • PTX compilation and GPU telemetry work is precisely on-target
  • Hireability: HIGH — exceptional GPU kernel and low-level systems work (PTX, CUDA C++), US-based at OpenAI SF. Real name unknown; outreach via GitHub
DW

David Wang

high hireability
  • GPU performance engineer at Modal Labs, ex-NVIDIA HPC Architect
  • Contributed to Mirage (CMU Catalyst Lab) as external contributor
  • BS Texas A&M + MS CS UIUC — no PhD
  • Total FTE estimated <5 years
  • US-based
  • Hireability: HIGH — GPU perf engineering at Modal is exactly on-topic, NVIDIA HPC background strong, BS/MS only. Currently employed so may not be actively seeking
DL

Dylan Lim

high hireability

Research Scientist@Together AI

  • Stanford BS + MS CS (no PhD)
  • Research Scientist at Together AI ('megakernels for everyone')
  • Previously: Hazy Research RA ('helped GPUs meow' — ThunderKittens contributor), Stanford Compilers Group RA (accelerated DNN training across distributed systems)
  • Jump Trading (core strategies)
  • Active GPU kernel and megakernel developer at the premier kernel team (Together AI / Tri Dao group)
  • US (Palo Alto). ~1yr FTE
GM

Geo Min

high hireability
  • AMD engineer in San Jose CA
  • GitHub account created Feb 2025 — near-certain new grad signal
  • Pins: rocm-rx9070-demo (ROCm AMD RX 9070), iree-test-suites (MLIR), iree fork (MLIR ML compiler), TheRock (HIP/ROCm build), rocm-libraries (Assembly)
  • Also 14 commits to ROCm/composable_kernel
  • Reportedly Virginia Tech education (BS/MS, no PhD signals)
  • Expertise: MLIR, IREE, ROCm, HIP, composable_kernel — directly covers MLIR, AMD GPU, compiler stack
  • Hireability: HIGH — exceptional MLIR/ROCm match, strong new-grad signal from Feb 2025 account, US-based AMD San Jose
JC

Jesse Cai

high hireability

Machine Learning Engineer@Meta

Previously: Senior Research Engineer @ Cultivate

San Francisco, US

  • Meta Software Engineer on PyTorch Core Performance team, SF
  • UCLA 2020 graduate (BS, no PhD). 88 commits to pytorch/ao (rank 8)
  • Sparsity and quantization work (semi-structured 2:4 sparsity, TorchAO), PyTorch blog published, PyTorch Conference 2024 speaker
  • Joined Meta ~2021 → ~3-4 years FTE, within range. h-index: 2
  • US-based SF
  • Tensor core utilization and sparse kernels for LLM inference directly match query
  • Hireability: HIGH — PyTorch core contributor with deep sparsity/quantization, UCLA BS, US SF, no PhD
JW

Jiakun Wang

high hireability
  • BS CS CMU (computer systems/hardware focus) + current MS EE at Columbia (System-on-Chip/VLSI concentration)
  • Chip-level hardware design: implemented TPU IC (Verilog), systolic array for TPU, Emperor SoC (RTL + C drivers), RISC-V formal verification
  • Query explicitly allows chip-level kernel work
  • Contributed to mirage (CUDA superoptimizer)
  • AI chip focus per personal website
  • US (New York, Columbia)
  • No FTE history — entirely within <5yr limit
  • Graduating likely 2025/2026
  • Hireability: HIGH — current student at prime graduation transition window, no blocking FTE role
KM

Kenneth Moon

high hireability
  • MIT EECS 6-3 (CS+EE) undergraduate, expected graduation 2025 per MIT ESP profile
  • Contributed to exo-lang/exo (hardware accelerator scheduling compiler)
  • Has personal mkay-attention repo (custom attention kernel)
  • Early-stage GitHub but direct exposure to low-level compiler work for hardware accelerators at MIT
  • Hireability: HIGH — current MIT undergrad graduating 2025, entering job market imminently
KQ

Kevin Qian

high hireability
  • Core exo-lang contributor (81 commits on exo-lang/exo), co-author of Exo 2 (ASPLOS 2025), MEng thesis on ExoBLAS (meta-programming a high-performance BLAS)
  • MIT BS EECS 2023 + MEng 2024, no PhD
  • Work is directly on user-schedulable DSL for GPU/hardware accelerator kernels — highly query-relevant
  • Prior internships at Jane Street, Meta, and D.E. Shaw (not FTE)
  • Based in Cambridge, MA (US)
  • Hireability: HIGH — MEng thesis submitted 2024, appears to have just completed degree with no confirmed full-time role yet; in prime transition window
KA

Kit Ao

high hireability
  • JHU undergrad (CS + ChemBioE) pursuing CMU MS in Computational Data Science
  • Research assistant at CMU Catalyst Group on Mirage (CUDA/C++/OpenMP/MPI/NVIDIA Nsight profiling)
  • Now ML Engineer at Waymo (ML Infra)
  • Not a PhD — confirmed MS on LinkedIn
  • US-based Pittsburgh/Mountain View
  • Hireability: HIGH — Mirage CUDA+Nsight profiling, Waymo MLE role, ML systems bio, LinkedIn available
KL

Kshitij Lakhani

high hireability
  • MS ECE from UC Davis (no PhD)
  • Deep Learning Performance Engineer at NVIDIA Santa Clara. 26 commits to TransformerEngine (rank 13)
  • Personal repos: Intro_to_Parallel_Computing (CUDA scan/reduce/sort) and Cache_Profiling (matrix multiply optimization) — strong CUDA fundamentals
  • Career path: intern → GPU Software Engineer II at Roche → exo → NVIDIA, estimated ~3-4 years FTE
  • US-based Santa Clara
  • Hireability: HIGH — verified MS-only, active CUDA contributor at NVIDIA with kernel optimization work
MG

Mingfei Guo

high hireability

Software Engineer@NVIDIA

  • Stanford MSEE student (PKU BS alumni) actively building GPU kernels
  • Implemented Flash Attention in Slang using NV CoopMat2 tensor cores with double-buffered K/V shared memory and parallel softmax, achieving 1.35-1.65x speedup over PyTorch SDPA
  • Also built 3D Gaussian Splatting from scratch in NVIDIA Warp
  • Based in Palo Alto
  • Hireability: HIGH — Job-Apply-Automation repo shows active job search (10+ commits Nov-Dec 2025), MSEE likely in final year
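The "parallel softmax" in the Flash Attention implementation above relies on the online softmax trick: a single streaming pass that never materializes the full score row. A minimal Python sketch of that rescaling idea (an illustration of the general technique, not the candidate's Slang code):

```python
import math

def online_softmax(scores):
    """Single-pass, numerically stable softmax.

    Keeps only a running max (m) and a running rescaled sum (d),
    the same invariant FlashAttention-style kernels maintain per
    row so softmax can be fused into the attention loop.
    """
    m = float("-inf")  # running max of scores seen so far
    d = 0.0            # running sum of exp(x - m)
    for x in scores:
        m_new = max(m, x)
        # rescale the old sum to the new max before adding the new term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in scores]
```

Because every exponent is taken relative to the running max, the pass stays stable even for scores in the thousands, where a naive `exp(x)` would overflow.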
PJ

Pawan Jayakumar

high hireability
  • Current UCSD Masters student in ML systems and security. 6 commits to pytorch/ao (PyTorch quantization and sparsity)
  • Bio: 'Masters student at UCSD studying ML systems and security' — graduating 2025 or 2026, fits student criteria
  • No PhD
  • US-based San Diego CA. pytorch/ao contributions = hands-on quantization and sparsity at PyTorch core level, directly aligned with GPU kernel optimization for LLM inference
  • Ideal junior profile
  • Hireability: HIGH — MS ML systems student at UCSD, pytorch/ao contributor, US-based, graduating imminently
PY

Peter Yeh

high hireability
  • Mountain View CA, focused on 'Accelerating Generative AI/LLM.' 21 commits to pytorch/ao
  • Pins: pytorch, pytorch/ao, apache/tvm (ML compiler), FasterTransformer (C++ NVIDIA inference), flash-attention, gloo — exceptional technical breadth across ML compilers, GPU kernel inference, and quantization
  • No PhD signals
  • No clear senior titles
  • US-based Mountain View confirmed
  • Hireability: HIGH — outstanding breadth across LLM inference, ML compilers, and GPU kernels; junior-to-mid level; US-based
PN

Phuong Nguyen

high hireability
  • DL Performance Engineer at NVIDIA Santa Clara. 106+ commits to NVIDIA/TransformerEngine with focus on FP8 fused ops, NVFP4 recipes, JAX/CUDA kernel acceleration, and distributed training (FSDP, Shardy)
  • GitHub active since 2023-2024 in TE repo, consistent with <5yr FTE
  • Stanford-educated (BS/MS, no PhD indicators)
  • US-based
  • Strong fit on GPU kernel/CUDA/JAX search criteria
  • Hireability: HIGH — joined NVIDIA ~2023, within 2-3yr transition window; title is DL Performance Engineer (not Senior/Staff)
RG

Ravi Ghadia

high hireability

GEMM Kernels intern@AMD

Previously: Research Intern @ Together AI

Austin, US

  • GEMM Kernels intern at AMD — directly writing GPU kernels
  • Prior GPU Architect at NVIDIA (Bengaluru)
  • Merged PR in Dao-AILab/flash-attention fixing FA3 int32 overflow for 4M+ token seqlen
  • First author of MorphKV (constant-sized KV cache, ICML 2025 poster) and Untied Ulysses (memory-efficient context parallelism, arXiv 2026)
  • MS student at UT Austin (ECE), BTech IIT Kharagpur, no PhD
  • Austin, TX
  • Hireability: HIGH — 2nd-year MS student, likely graduating 2026, prime transition window
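The FA3 int32-overflow fix mentioned above is a classic long-sequence failure mode: flat element offsets outgrow a signed 32-bit index. A small arithmetic sketch (hypothetical shape — 8 heads, head_dim 128 — chosen only to illustrate the threshold):

```python
INT32_MAX = 2**31 - 1  # 2_147_483_647

def row_offset(seq_idx, num_heads, head_dim):
    # Flat element offset of token `seq_idx` in a contiguous
    # [seqlen, num_heads, head_dim] buffer -- the kind of index
    # a kernel computes per thread.
    return seq_idx * num_heads * head_dim

# At 4M tokens the offset exceeds int32 range, so a 32-bit index
# silently wraps; kernels must promote such offsets to 64-bit.
off = row_offset(4_000_000, num_heads=8, head_dim=128)
assert off > INT32_MAX
```

Python integers never overflow, which is exactly why the sketch can show the bound; in CUDA C++ the same expression in `int` would wrap negative with no warning.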
SK

Samurdhi Karunaratne

high hireability
  • Deep Learning Inference SE at NVIDIA TensorRT since January 2022 (~4.25yr FTE)
  • BS Computer Engineering, Univ of Peradeniya (2019), MS ECE UCLA (Dec 2021)
  • Pinned repos: TensorRT (C++), ONNX-TensorRT, PyTorch-TensorRT, ONNX
  • Physics Olympiad gold/silver medalist (IPhO, APhO)
  • Directly relevant to GPU kernel/LLM inference stack
  • No PhD
  • US-based (Santa Clara CA)
  • Hireability: HIGH — recent MS grad, <5yr FTE, deep TensorRT inference expertise
SS

Schwinn Saereesitthipitak

high hireability

Distributed LLM Inference@NVIDIA

Previously: ML for CAD Engineering @ Apple

San Francisco, US

  • Applied AI Software Engineer at NVIDIA Distributed LLM Inference SF (started June 2025)
  • Stanford CS MS (admitted 2022, no PhD)
  • Built Prophet — LLM inference engine optimized for head-of-line blocking (Stanford CS244b)
  • GitHub shows LLM inference engine + Rust cryptography + competitive programming (C++)
  • Zero prior FTE before NVIDIA
  • Strong US-based systems/inference engineer at tier-1 target
  • Hireability: HIGH — <1yr at NVIDIA, Stanford MS graduate, but role is exactly Distributed LLM Inference — prime target
SX

Shanli Xing

high hireability

Undergraduate Researcher@UW / CMU Catalyst

  • UW CSE undergrad (BS CS, graduating 2026 — actively seeking PhD positions Fall 2026)
  • Core contributor to FlashInfer (CUDA kernel library for LLM serving): designed and implemented sorting-free GPU sampling kernels, co-authored FlashInfer-Bench (MLSys 2026)
  • Research @ UW SAMPL (advised by Prof. Luis Ceze) and CMU Catalyst (advised by Prof. Tianqi Chen)
  • Active kernel work at undergraduate level with published systems paper
  • US (Seattle, WA). 0 FTE
SW

Songting Wang

high hireability
  • CMU ECE+CS student (Pittsburgh), contributor to mirage (Mirage Persistent Kernel: Compiling LLMs into a MegaKernel, C++ CUDA superoptimizer)
  • Pinned mirage fork prominently
  • Likely graduating 2025/2026. 1 commit on mirage (shallow but right project at CMU)
  • No visible FTE history — clean within <5yr FTE limit
  • Hireability: HIGH — current student at graduation window, no blocking employment, Pittsburgh-based at CMU
SS

Surya Subramanian

high hireability

cuBLAS Intern@NVIDIA

  • Georgia Tech CS student (BS)
  • NVIDIA cuBLAS intern writing fast matmul CUDA kernels for Blackwell via emulation on low-precision tensor cores
  • Previously Meta PyTorch distributed training + Pinterest ML infra
  • Graduating soon (2025/2026)
  • Very strong kernel signal for a student
TG

Tarushii Goel

high hireability
  • MIT undergrad CS, Class of 2026 (graduating — explicitly allowed by search)
  • McLean VA / MIT
  • Interned at NVIDIA, Modal Labs, and Exafunction (all kernel/inference companies)
  • CUTLASS forks (quantized BLAS), Triton forks, flash-linear-attention fork
  • Writes GEMM kernel + AI compiler deep-dives on blog
  • No FTE yet (student)
  • USACO Platinum competitive programmer
  • Hireability: HIGH — MIT 2026 graduating student, direct CUTLASS+Triton experience, NVIDIA+Modal internships, prime kernel engineering profile
VL

Victor Li

high hireability
  • FlexFlow contributor (flexflow-train, C++); GitHub repos include MiniAPL-release (LLVM-based dense array language compiler, Stanford cs343d course), CS349H project (Stanford ML compilers course), and Mamba SSM
  • Username victorli2002 strongly implies undergrad graduating 2025-2026
  • LLVM coursework directly on-topic for the compiler track of this query
  • No FTE history visible
  • US-based (Stanford CA implied by course repos)
  • Hireability: HIGH — current undergrad with no FTE experience, prime graduation window
WH

William Hu

high hireability

MSCS Student / Intern@Stanford / Modal Labs

  • MSCS student at Stanford (willhu@stanford.edu), focus on systems/DSLs/AI HPC
  • Co-author of HipKittens paper (AMD ROCm GPU kernels, arxiv Nov 2025): 662 commits to HazyResearch/HipKittens
  • Intern at Modal on the 'flash team' (flash attention and fast GPU kernel work)
  • Also contributor to KernelBench (Stanford Scaling Intelligence)
  • Personal site: willhu-jpg.github.io
  • Taking Stanford CS 240lx (advanced systems) Spring 2025. 0 FTE, strong research output at grad school level
WZ

William Zhou

high hireability

Triton Kernel Engineer@AMD

  • UCLA CS undergrad (graduating 2026)
  • Writing Triton kernels @ AMD (GitHub bio: 'writing Triton kernels @ AMD, ACM@UCLA')
  • Active Triton contributor via AMD
  • US (LA)
  • Strong junior kernel signal: recent grad, active kernel work at major GPU company
WC

Willy Chan

high hireability

Student Researcher / Intern@Together AI / Stanford

  • Stanford BS CS student
  • Student Researcher @ Together AI (AI Kernels team)
  • NVIDIA intern (multigpu workloads/libraries)
  • Meta Superintelligence Lab (Data Foundations)
  • Stanford SAIL research: KernelBench DSL extension + scaling laws of Kernel DSLs + multigpu kernels
  • Built NVSHMEM4Py integration for Perplexity's pplx-kernels (MoE communication CUDA kernels)
  • Active contributor to perplexityai/pplx-kernels. 0 FTE, exceptional kernel breadth for an undergrad
YZ

Yiyan Zhai

high hireability
  • CMU undergraduate (CS&ML) working directly with Prof. Tianqi Chen on FlashInfer-Bench (LLM inference kernel benchmarking system — best paper at MLSys 2025) and MLC-LLM
  • Pinned repos: flashinfer-bench and libCacheSim (C++ cache simulator)
  • Contributor to arxiv paper on FlashInfer-Bench (Oct 2025)
  • Strong LLM inference kernel focus — exactly the query target
  • Pittsburgh (US)
  • Undergraduate only, no PhD
  • Hireability: HIGH — CMU undergrad in graduation window (likely May 2026), no FTE experience, actively doing hands-on LLM inference systems research
ZZ

Zhongbo Zhu

high hireability
  • MS CompE from UIUC (no PhD)
  • DevTech @ NVIDIA — focused on customer CUDA performance optimization
  • Amazon intern Summer 2023 indicates very recent grad, likely ~1-2 years FTE
  • Pins: TransformerEngine, Megatron-LM, Megatron-Bridge
  • NVIDIA Technical Blog author on quantized training and CUDA kernels
  • ZJU undergrad + UIUC MS
  • Likely US-based (NVIDIA DevTech teams primarily US)
  • Hireability: HIGH — MS-only, very junior FTE timeline (~1-2yr), active quantization+kernel work at NVIDIA
AS

Aditya Saigal

high hireability
  • Tenstorrent tt-metal contributor
  • US-based (San Francisco). junior (~3 years)
AY

Artem Yerofieiev

high hireability
  • Tenstorrent tt-metal contributor, compiler tooling
  • US-based (LA). junior (~3 years)
BG

Bhavya Gada

high hireability
  • tinygrad compiler/kernel contributor
  • US-based (United States). junior (~3 years)
CL

cloud11665

high hireability
  • tinygrad compiler/kernel contributor
  • US-based (San Francisco). junior (~3 years). high contribution volume
DV

Daniel Vega-Myhre

high hireability
  • PyTorch AO quantization kernel contributor
  • US-based (US). junior (~3 years). high contribution volume
DA

David Zhao Akeley

high hireability
  • MIT exo compiler contributor
  • US-based (US). junior (~3 years)
DN

Douglas Nyberg

high hireability
  • tinygrad compiler/kernel contributor
  • US-based (Lafayette, Indiana). junior (~3 years)
GD

Grace Dinh

high hireability
  • MIT exo compiler contributor
  • US-based (US). junior (~3 years)
JR

James Roberts

high hireability
  • tinygrad compiler/kernel contributor
  • US-based (Seattle). junior (~3 years)
KM

Kenneth Moon

high hireability
  • MIT exo compiler contributor
  • US-based (US). junior (~3 years)
MB

Marcel Bischoff

high hireability
  • tinygrad compiler/kernel contributor
  • US-based (Columbus, OH). junior (~3 years)
NH

Nigel Huang

high hireability
  • 306 merged PRs in tenstorrent/tt-metal (TT-Metalium low-level kernel programming model) plus contributions to tt-umd (user-mode driver) and tt-zephyr-platforms (Zephyr RTOS firmware) — exactly the chip-level firmware/driver/kernel work the query targets
  • Located Santa Clara
  • No PhD signals; firmware/driver role profile consistent with BS/MS
  • Hireability: HIGH — confirmed left Tenstorrent March 20, 2026 (GitHub PR body: 'Today is my last day at Tenstorrent') ~4 weeks ago; likely on the job market now
PL

Philip Lassen

high hireability
  • Groq engineer
  • US-based (US). junior (~3 years)
VW

Vincent Wells

high hireability
  • Tenstorrent tt-mlir contributor
  • US-based (Austin). junior (~3 years)
WF

Wei Feng

high hireability
  • PyTorch AO quantization kernel contributor
  • US-based (US). junior (~3 years)
WM

Wesley Maxey

high hireability
  • NVIDIA contributor
  • US-based (Sunnyvale, CA). junior (~3 years). high contribution volume
WT

Whitney Tsang

high hireability
  • Intel XPU Triton backend compiler work
  • US-based (US). junior (~3 years). high contribution volume
AS

Aaryan Singhal

medium hireability

Software Engineer@ReflectionAI

  • 321 commits to HazyResearch/ThunderKittens + co-author of ThunderKittens paper (ICLR 2025) — CUDA tile primitives targeting H100
  • Own repos: CUDABenchmarks, TKConvs
  • Stanford CS Systems/AI, new grad (~0 FTE)
  • Based in Palo Alto/SF
  • Hireability: MEDIUM — ~1-1.5 years at ReflectionAI (joined post-Oct 2024 paper), startup role within typical transition window, but no explicit open-to-work signals. GitHub bio still shows 'cs @ stanford' with no company set
AL

Abe Leininger

medium hireability
  • Creator of Metal-Puzzles (604★ Apple Metal GPU kernel puzzles repo) and gpu-kernels (CUDA kernel implementations); 6+ merged PRs to ml-explore/mlx (CPU LU factorization, nuclear norm, unsigned dtype fixes in reduce ops) and luminal-ai/luminal (StableHLO support, 3939 additions); also contributed to tinygrad
  • IU BS CS, US-based Austin TX
  • Hireability: MEDIUM — early career (GitHub acct Dec 2021, ~1-2 yrs FTE per prior research), no current company listed on GitHub (previous node expansion noted Anduril Industries but company field now blank); strong GPU kernel learning velocity across Metal and CUDA but unclear current employment status
AD

Aidan Do

medium hireability

Software Engineer, Inference@Fireworks AI

  • CUDA kernel engineer at Fireworks AI (SF Bay Area) — wrote custom bicubic interpolation CUDA kernel (11-21% TTFT improvement for Kimi K2.5), merged PyTorch CUDA upsample_bicubic2d kernel rewrite (4.3-43x speedup), active contributor to NVIDIA/cutlass and FlashInfer
  • BS Software Engineering (Honours), no PhD
  • Hireability: MEDIUM — ~2yr FTE total, currently at Fireworks AI with no job-seeking signals detected; within typical junior transition window
DW

David Wang

medium hireability

GPU Performance Engineer@Modal Labs

  • UIUC education
  • GPU performance engineer @ Modal Labs (Brooklyn, US)
  • Active GPU kernel work: forked Mirage (automatically generating fast GPU kernels without Triton/CUDA). 0-2yr FTE
  • No PhD found
  • Modal is a top GPU cloud company known for kernel-level work
DA

David Zhao Akeley

medium hireability
  • Developer Technology Engineer at NVIDIA Santa Clara (US)
  • UCLA BS CS+Math
  • Contributed to Exo-GPU (PLDI 2026, MIT CSAIL) — CUDA-targeting scheduling compiler achieving >80% of H100 theoretical peak on GEMM
  • Deep low-level systems work (SIMD combine library in C, OpenGL voxel renderer)
  • No PhD — BS only
  • Hireability: MEDIUM — NVIDIA DevTech Engineer with recent exo-gpu collaboration suggesting active research engagement; start date unclear
AY

Angela Yi

medium hireability
  • Meta PyTorch Compiler SWE since 2022, specializing in torch.export and AOTInductor — core ML compiler/codegen directly relevant to LLM inference optimization
  • CMU BS CS (2018-2022), also pursuing CS at Stanford (2024-present, likely MS not PhD). ~3 years FTE, no PhD
  • Contributed to pytorch/executorch (on-device AI acceleration)
  • Spoke at PyTorch Conference 2024 on export features
  • US-based
  • Hireability: MEDIUM — 3 years at Meta, within transition window; Stanford enrollment may indicate career pivot interest
AS

Anton Shepelev

medium hireability
  • Adjacent expertise: NVIDIA System Software Engineer (Tegra group) with LLVM compiler background — pinned repo is llvm-project fork; designed a custom compiler (DJ language → RISC-V → LLVM) with optimization passes (loop unrolling, constant folding, CSE)
  • USF MS+BS CS
  • SF Bay Area, US
  • Primary work is Tegra (embedded/automotive system software) rather than GPU kernels for ML/LLM inference — adjacent to the query
  • LLVM/compiler infrastructure background is highly relevant to MLIR-based kernel work
  • Hireability: MEDIUM — likely 1-3 years at NVIDIA as recent USF grad; Tegra engineers occasionally pivot to GPU compute work
AS

asfiyab-nvidia

medium hireability
  • NVIDIA TensorRT engineer with MS in Computer Engineering from UC San Diego
  • Located in San Diego County, US
  • Work spans TransformerEngine, Megatron-LM, onnx-tensorrt (C++), TensorRT, and NeMo — inference optimization and GPU-accelerated architectures
  • Account created 2022, ~3 years at NVIDIA
  • No PhD
  • MS-only
  • Hireability: MEDIUM — solid MS-level NVIDIA inference engineer with TensorRT/ONNX depth; more inference-framework than raw CUDA kernel writing
AG

Aviral Goel

medium hireability
  • GPU kernel engineer at AMD ROCm working on GEMM optimization and composable_kernel (91 commits, C++ HIP kernels)
  • Personal GitHub shows strong self-directed kernel learning: hip_kernels (HIP GPU kernel examples), flash_attention_triton (Flash Attention from scratch in Triton)
  • MS Computer Science from USC
  • Prior experience was internship at Samsung Semiconductor (Vulkan/GPU layers — excluded from FTE)
  • AMD work account (AviralGoelAMD) created December 9, 2024, confirming very recent hire
  • US-based (Austin, TX)
  • No PhD
  • Hireability: MEDIUM — only ~3 months into AMD role (Dec 2024), likely still settling in; personal GitHub bio still says USC graduate student suggesting very recent graduation
BZ

Baizhou Zhang

medium hireability

Member of Technical Staff@RadixArk (SGLang)

  • MS CS from UC San Diego, BS Intelligence Science from Peking University
  • MTS at RadixArk (SGLang core developer — attention kernels integration, large-scale EP & Speculative Decoding on GB200/GB300)
  • NVIDIA intern on CuDNN Heuristic Team (Blackwell GEMM kernel heuristics)
  • Has cuda-learn-by-practice repo showing hands-on CUDA work
  • Active FlashInfer contributor
  • No PhD
  • US (Palo Alto, CA). ~1yr FTE
CP

Colin Peppler

medium hireability
  • Meta SWE with direct GPU kernel experience: AITemplate contributor (CUDA/HIP codegen framework), GPU Techniques & Algorithms team at Meta working on custom kernels and inference optimization, plus AI Infra RecSys inference optimization
  • Virginia Tech BS CS
  • US-based (Menlo Park)
  • Career: Meta intern → FTE (likely ~2021 start), estimated ~3-4 years FTE
  • BS only, no PhD
  • Hireability: MEDIUM — ~3-4 years at Meta, within transition window; strong GPU kernel skills make this a high-priority reach-out
EA

Emilio Andere

medium hireability

GPU Kernel Engineer@Wafer-AI

  • UChicago Mathematics BS (2021-2025)
  • GPU kernel engineer @ Wafer-AI (YC F25 startup, SF), 'fast gpu kernels @wafer-ai'
  • OSS contributor to tinygrad
  • Active in GPU kernel development at a well-funded YC startup. ~0-1yr FTE, no PhD
EA

Eric Auld

medium hireability

Systems Research, GPU Programming Engineer@Together AI

  • BS Arizona State, MA UCLA, MS NYU (all in mathematics, no CS PhD)
  • Systems Research & GPU Programming Engineer @ Together AI
  • Co-authored GPU MODE Lecture 15 on CUTLASS/CuTe
  • Active CUTLASS contributor
  • Transitioned from math to GPU programming. ~2yr FTE in kernel work
IH

ihar

medium hireability
  • US-based (SF/TN), bio: 'ml, hpc, processors microarchitecture' — directly on-target for GPU kernel/HPC query. tinygrad contributor with possible Tenstorrent affiliation
  • Account 2014 but only 17 followers, sparse public presence — no clear senior indicators
  • No PhD signals found
  • Hireability: MEDIUM — strong topical alignment (HPC, microarchitecture) and US-based, but sparse profile makes degree and seniority hard to confirm; worth outreach
IS

Ilya Sherstyuk

medium hireability
  • SE at NVIDIA (Deep Learning Inference Workflows) since ~2022
  • BS CS Caltech 2022 — clear <5yr FTE (~3-4yr by March 2026)
  • No PhD
  • US-based (Santa Clara CA)
  • NVIDIA TensorRT team working on DL inference workflows
  • Caltech CS + NVIDIA inference = high-quality junior signal
  • Limited public GitHub signal but confirmed via LinkedIn
  • Hireability: MEDIUM — strong educational pedigree (Caltech CS), directly relevant team, but limited public kernel repo evidence
JT

James Thompson

medium hireability
  • Built llm.metal — LLM training in raw C/Metal Shading Language (Apple GPU compute shaders), a direct port of Karpathy llm.c to Metal, demonstrating genuine low-level GPU programming ability
  • Bay Area, CA (confirmed)
  • Bio: ML, computer graphics, GPU programming languages. 1 commit on llm.c
  • CAVEAT: No education or employer found; FTE history unknown — needs follow-up to verify <5yr constraint
  • Hireability: MEDIUM — unclear employment status; Bay Area location and hobby GPU projects suggest field-engaged and potentially available
JW

Jason S. Wang

medium hireability

Machine Learning Scientist@Tesla

Previously: Research Assistant @ Stanford University

San Francisco, US

  • UC Berkeley BS 2022, Stanford MS CS/AI graduating 2025 (just graduated)
  • At NVIDIA now
  • OpenReview confirms 'MS student, Stanford University'
  • Pins: Megatron-LM, NeMo — large-scale training systems. 2 commits to TransformerEngine. <1 year FTE
  • Strong pedigree but kernel depth less clear vs. training systems focus
  • Hireability: MEDIUM — excellent junior fit and seniority, but focus is more training systems than raw CUDA/Triton kernel writing
JJ

Jianan Ji

medium hireability
  • Strong fit — contributes to Mirage megakernel compiler (CMU Catalyst Lab) and SGLang (LLM inference)
  • Confirmed MS student at CMU graduating 2026 per LinkedIn ('CMU 26 | PKU 24'), personal website confirms 'second-year master student at CMU' — not a PhD
  • US-based Pittsburgh PA
  • Hireability: MEDIUM — solid research contributions to both GPU kernel compiler and LLM serving, but limited public open-to-work signals
MF

Michael Feil

medium hireability
  • Model Performance Engineer at Baseten, San Francisco
  • Hands-on Hopper and Blackwell kernel optimizations (candle-flash-attn-v3 repo: custom Flash Attention V3 C++ implementation)
  • Distributed LLM inference: TRT-LLM, NVIDIA Dynamo (2x inference speedup), BEI embeddings runtime
  • Created Infinity production inference engine
  • Prior: MTS at Gradient (1M-context LLM training on 512 NVIDIA GPUs), ML Engineer at Rohde & Schwarz
  • MS Robotics/ML from TU Munich (no PhD)
  • Estimated ~3-4 years FTE (within threshold)
  • San Francisco
  • Hireability: MEDIUM — ~2 years at Baseten, within transition window; no open-to-work signals
NM

Nicolas Macchioni

medium hireability
  • PyTorch Inductor core contributor at Meta (184 PRs in pytorch/pytorch)
  • Builds GPU kernel autotuning, benchmarking, and caching infrastructure -- wrote InductorBenchmarker, Triton template structural caching, reduction autotune data collection pipeline, and contributed L2 cache flush fix to triton-lang/triton
  • Deep understanding of GPU kernel performance (CUDA events, profiling, Triton internals) but work is on the compiler/autotuning layer, not kernel authoring itself
  • Adjacent to GPU kernels -- could transfer
  • No PhD, based in SF
  • Hireability: MEDIUM -- ~1.5-2 years at Meta, no explicit signals of looking to leave
RS

Reuben Stern

medium hireability
  • Core FlashAttention contributor — co-author on FlashAttention-4 paper (arXiv:2603.05451) and PyTorch FlexAttention+FA4 blog. 18+ PRs to Dao-AILab/flash-attention implementing FlexAttention features in CuTe DSL: vectorized score_mods, block sparsity computation kernels, mask_mod support, varlen extensions, PackGQA backward pass for Sm90/Sm100
  • Research Scientist at Colfax International
  • BA Mathematics from Harvard, MM from Peabody — no PhD
  • Non-traditional path (conductor/mathematician) with deep kernel-level GPU work
  • Boston, MA
  • Hireability: MEDIUM — at Colfax (small research consultancy), actively shipping FA4 features with Tri Dao and Jay Shah, likely recruitable
SP

Sahan Paliskara

medium hireability
  • Co-author of KernelLLM (8B model generating Triton kernels from PyTorch) and BackendBench (evaluation suite for LLM/human-written PyTorch backends)
  • Added Triton backend to KernelBench
  • Speaker at PyTorch Conf 2025 on LLMs for GPU kernel dev. 402 PRs on pytorch/pytorch, though primarily backend engine work (torch deploy removal, PyObjectSlot simplification), not hand-written GPU kernels
  • Expertise is in tooling/benchmarking for GPU kernel generation rather than kernel authoring itself — adjacent but strong
  • Princeton BS CS 2021, ~4yr FTE at Meta PyTorch, SF
  • Hireability: MEDIUM — ~4-5 years at Meta, within transition window but no explicit job-seeking signals
SG

Shreya Gaur

medium hireability

Deep Learning Performance Engineer@NVIDIA

  • MS ECE from Purdue University (thesis on GPU kernel optimization for group equivariant CNNs)
  • Deep Learning Performance Engineer at NVIDIA Santa Clara
  • Active custom kernel work: cutlassdev repo (CUTLASS development), gemm_sample (GEMM optimizations), cuda_programming (CUDA class assignments for AlexNet layers on V100 using shared/constant memory)
  • CUTLASS contributor
  • No PhD
  • US (Santa Clara)
SH

Shushi Hong

medium hireability
  • Active contributor to apache/tvm (30+ commits) and mlc-ai/mlc-llm — the two core ML compiler projects for LLM inference
  • Pinned repos are TVM, MLC-LLM, and web-llm, showing deep specialization in ML compilation and on-device LLM deployment
  • Based at CMU Pittsburgh (Catalyst group under Tianqi Chen)
  • No PhD signals; company field shows Carnegie Mellon University
  • Hireability: MEDIUM — still affiliated with CMU but graduation timeline unclear; likely MS student in 2025-2026 window
VP

Vladimir Penkin

medium hireability
  • Intel Austin TX engineer, 13 commits to intel/intel-xpu-backend-for-triton (Triton MLIR GPU compiler)
  • GitHub account 2021, pinned repos: pytorch/pytorch + intel-xpu-backend-for-triton
  • Active on Intel XPU Triton issues through Jan 2026
  • Personal website (penk.in) lists SF, conflicting with the Austin, TX role location
  • Education: Taganrog South University of Radio Engineering (Russia)
  • Generalist background (Ruby/C#/TypeScript) but demonstrated Triton/PyTorch XPU focus at Intel
  • Hireability: MEDIUM — account 2021 suggests ~4yr FTE, within window; Intel role has Triton/XPU focus matching query
VS

Vyom Sharma

medium hireability
  • Senior CS student at University of Minnesota (Minneapolis MN, US)
  • Internship/student-only history, no substantial FTE
  • Pinned repos: llm.c fork (18 commits), RecurrentGemma-CUDA ('advanced kernels'), Kaleidoscope-Extended (LLVM JIT compiler in C++) — CUDA + compiler dual track exactly on-target
  • Account 2022, expected graduation ~2025-2026
  • Hireability: MEDIUM — strong student CUDA+LLVM profile but no published work or industry CUDA contributions; graduation date needs confirmation
WC

Wenxin Cheng

medium hireability

Software Engineer@Meta

  • MS CS from UCLA (2022-2024), BS CE from Beijing Jiaotong University
  • Software Engineer at Meta (Jan 2026-present: Dev Infra Triton; Apr 2024-Dec 2025: AI toolchains with CUDA focus)
  • CUTLASS contributor
  • Triton contributor
  • Also contributed to FBGEMM and vLLM
  • Menlo Park, CA
  • No PhD. ~2yr FTE at Meta in kernel/toolchain work
WH

William Hu

medium hireability

Member of Technical Staff@Modal

Previously: GPU Compiler Engineer @ Qualcomm

San Francisco, US

  • GPU Compiler Engineer at Qualcomm (~1.5yr FTE, Sep 2023-Feb 2025) writing OpenCL/Vulkan GPU compilers using LLVM+C++; now at Modal on the flash team (GPU kernels for AI/HPC) while completing MS CS at Stanford (graduating 2026)
  • Research at Stanford Hazy + Scaling Intelligence labs
  • BS Math-CS UCSD, MS Stanford — no PhD
  • US (Bay Area/Stanford)
  • Clean FTE history: ~1.5yr Qualcomm FTE + Modal intern (student concurrent role)
  • Hireability: MEDIUM — current Stanford student, graduating 2026 is prime transition window; Modal role is student internship
YW

Yuzhong Wang

medium hireability
  • NVIDIA engineer (account created 2024 — very recent hire)
  • Works on NeMo, TransformerEngine, and Megatron-MoE — LLM training infrastructure and GPU kernel optimization; username carries the -nvidia suffix
  • No PhD signals found
  • Likely US-based at NVIDIA primary offices
  • Strong technical stack alignment
  • Hireability: MEDIUM — relevant TransformerEngine/Megatron work but limited public signals to confirm education level; no location explicitly confirmed
ZZ

Zephyr Zhao

medium hireability
  • CMU student (undergrad or new MS) confirmed by andrew.cmu.edu email
  • Maintains a CUDA-Manuscript learning repo; confirmed Mirage paper contributor (listed as Zepeng Zhao in the author list)
  • Account Jan 2025 — very fresh, learning-focused CUDA repos
  • No PhD signals
  • US-based Pittsburgh
  • Hireability: MEDIUM — genuine GPU kernel interest with CMU research exposure, but early-stage contributor with minimal independent track record
ZW

Zhihao Wang

medium hireability
  • MLSys Engineer at ByteDance Seed Infra Compiler team (Urbana, IL, US)
  • UIUC BS Math+CS + MS CS, no PhD
  • Built a superscalar Tomasulo out-of-order RV32M CPU in SystemVerilog from scratch (7K+ lines); contributor to sglang-jax and exo-lang/exo
  • ByteDance Seed Compiler team does GPU compiler/distributed training work
  • Hireability: MEDIUM — recently started full-time at ByteDance post-MS; likely <2 years in role
ZZ

Zhongbo Zhu

medium hireability

DevTech AI Engineer@NVIDIA

  • BS CS from UIUC (joint degree with Zhejiang University), MS ECE from UIUC
  • NVIDIA DevTech AI engineer specializing in CUDA kernel development and FP8 quantized training
  • Previously Amazon intern (Summer 2023). ~1yr FTE in CUDA kernel dev
  • No PhD
  • US (Bay Area / Santa Clara)
DV

Daniel Vega-Myhre

low hireability
  • Deep CUDA/Triton kernel work at Meta PyTorch Core Performance team: FP8/MXFP8 LLM pretraining kernels, async TP, Triton quantization kernels — 228 commits to pytorch/ao
  • Personal pinned repo gemm written in raw CUDA
  • BS Computer Science Boise State University; MS CS at Georgia Institute of Technology (OMSCS, in progress 2024-2027)
  • No PhD
  • US-based at Meta
  • SENIORITY NOTE: ~5.5yr total FTE (Clearwater Analytics Dec 2018-Jan 2021 data engineering + Google Jun 2022-Aug 2024 ML/TPU + Meta Aug 2024-present); kernel-specific FTE is ~3.7yr at Google/Meta
  • Hireability: LOW — at Meta PyTorch Core since Aug 2024 and recently promoted to Senior SWE (Aug 2025), unlikely to be looking
LJ

Li Jiaying

low hireability
  • Led multi-GPU CUDA traffic simulation at UC Berkeley (113× speedup over CPU; C++/CUDA, scaled across 1-8 GPUs with ghost-zone partitioning and inter-GPU communication); 4 commits to mirage (Mirage Persistent Kernel CUDA superoptimizer, C++)
  • CMU MCDS (BS Software Engineering from Tongji University); now FTE SWE at Snowflake
  • Contributes to arrow-rs (Rust)
  • US (Menlo Park)
  • Hireability: LOW — very recently started FTE at Snowflake, likely still settling in
NP

Nikhil Patel

low hireability

Software Engineer@Meta Superintelligence Labs

  • Core GPU kernel engineer at Meta Superintelligence Labs
  • Built the NVFP4 blockscaled GEMM kernel pipeline in PyTorch Inductor (10-PR NVGEMM stack with CuTeDSL/CUTLASS)
  • Triton compiler contributor (FP8 precision fix, TMA block shape fix). 168 merged PRs in pytorch/pytorch, also contributing NVFP4 backend to vLLM
  • MS CSE from UMich, BS CS from USC, no PhD
  • Based in SF
  • Hireability: LOW — recently at Meta Superintelligence Labs (new org, likely <1 year), shipping at high velocity on cutting-edge GPU kernel work
RS

Rishi Sankar

low hireability
  • MTS at Anthropic working on LLM performance/kernels/pretraining
  • Built Flash Attention 2 from scratch in CUDA with 10 optimization iterations (multi-block parallelism, register-based matmul, warp reduction, kernel fusion)
  • Active LeetGPU participant (fp16 GEMM, multi-head attention, convolution kernels)
  • UCLA BS in CS + Applied Math, no PhD. ~2-3 years full-time experience (Two Sigma then Anthropic)
  • Hireability: LOW — likely <2 years at Anthropic in a dream role for GPU kernel engineers

Runs

#3completed17 qualified / 29 foundApr 4, 2:39 AM
#2completed47 qualified / 136 foundMar 30, 8:34 AM
#1completed19 qualified / 19 foundMar 30, 8:34 AM