
junior kernel engineers us

Completed · 85 qualified · 3 runs · Mar 30, 8:34 AM · junior-kernel-engineers-us

Qualified Candidates (85)

AQ

Abu Qader

high hireability
  • SWE at Baseten since July 2022 (~4 years FTE, Cornell BS 2021, no PhD)
  • Led TRT-LLM Engine Builder and EAGLE 3 speculative decoding; co-authored with Tri Dao on achieving fastest Kimi K2.5 and 60% faster GPT-OSS inference — deep LLM inference optimization, adjacent to kernel engineering
  • SF, US-based
  • Hireability: HIGH — explicit hireable:true set on GitHub (profile updated 2026-04-16), 3.75 years at Baseten entering clear transition window
AG

Aiden Grossman

high hireability
  • UC Davis BS (~2023-2024 recent grad)
  • Google SWE. 200+ LLVM commits: MLGO (ML-guided register allocation), AMD Zen 5+6 microarchitecture tuning, XLA, HEIR (homomorphic encryption IR)
  • Lead author of ComPile (2.4TB LLVM-IR dataset)
  • Presented at LLVM Dev Meeting 2022/2024/2025
  • No PhD
  • US-based. <2yr FTE (started ~2024)
  • Exact match on LLVM/MLIR compiler stack for GPU kernel backends
  • Hireability: HIGH — fresh BS-only FTE at Google, exceptional LLVM/compiler depth for a recent undergrad
AP

Akaash Parthasarathy

high hireability
  • Active ML compiler stack contributor: 14 commits to mlc-ai/mlc-llm, merged PR to mlc-ai/web-llm (April 15, 2026, now listed as co-author), active apache/tvm and tvm-ffi contributor
  • Research focus explicitly on ML compilers, hardware-aware algorithm design, efficient LLM inference, and parallelization schemes; CUDA experience
  • CMU MSML + Georgia Tech BS CS
  • Pittsburgh, PA (US)
  • Hireability: HIGH — MSML program ending ~May 2026, no current FTE role, prime new-grad hiring window
AK

Alex Kranias

high hireability

Research Intern / GPU Kernel Engineer@Apple (intern) / SHI Labs (Georgia Tech)

  • Georgia Tech CS BS (AI/ML concentration), current undergrad at SHI Labs
  • AMD Triton Kernels team intern (Fall 2024) — FlashAttention GPU kernel development on ROCm
  • Apple GPU kernels intern (Summer 2025) — GPU kernels for on-device MoE LLMs and Diffusion Transformers
  • Published triton_vs_cuda repo implementing cuBLAS-performant GEMM kernels
  • Strong multi-internship junior kernel signal
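The "cuBLAS-performant GEMM" work cited above rests on one core idea: blocking the computation so each small tile of operands is reused from fast memory. A minimal pure-Python sketch of that tiling pattern (an illustration only, not taken from the candidate's triton_vs_cuda repo; on a GPU the tiles would live in shared memory):

```python
def tiled_matmul(A, B, tile=2):
    """Blocked matrix multiply over plain lists of lists.

    Mirrors the tiling GPU GEMM kernels use: the three outer loops
    walk tile-sized blocks, so each block of A and B is reused many
    times while it is "hot" (in a kernel, resident in shared memory).
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # accumulate the (i0, j0) output tile's partial product
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        s = 0.0
                        for kk in range(k0, min(k0 + tile, k)):
                            s += A[i][kk] * B[kk][j]
                        C[i][j] += s
    return C
```

The result is identical for any tile size; only the memory-access pattern changes, which is where the performance comes from.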
BC

Bob Chen

high hireability
  • FlexFlow contributor (flexflow-train) and active in LLM inference systems at CMU: sglang-jax (JAX backend for SGLang, contributor to sgl-project org), vLLM-gsoc, KV compression research
  • Broad coverage of distributed LLM inference and serving stacks — strong systems instincts
  • Current CMU student (no FTE), US-based
  • Hireability: HIGH — current student with no FTE employment, graduating likely 2025-2026
CL

cloud11665

high hireability
  • OpenAI engineer in SF
  • Pinned repos: gpuocelot (PTX dynamic compilation framework in C++) and telegraf_nv_export (ultra-low-overhead NVIDIA GPU telemetry C++ plugin) — both highly kernel/systems-relevant. 8 tinygrad commits
  • No PhD signals, no senior title publicly confirmed
  • PTX compilation and GPU telemetry work is precisely on-target
  • Hireability: HIGH — exceptional GPU kernel and low-level systems work (PTX, CUDA C++), US-based at OpenAI SF. Real name unknown; outreach via GitHub
DW

David Wang

high hireability
  • GPU performance engineer at Modal Labs, ex-NVIDIA HPC Architect
  • Contributed to Mirage (CMU Catalyst Lab) as external contributor
  • BS Texas A&M + MS CS UIUC — no PhD
  • Total FTE estimated <5 years
  • US-based
  • Hireability: HIGH — GPU perf engineering at Modal is exactly on-topic, NVIDIA HPC background strong, BS/MS only. Currently employed so may not be actively seeking
DL

Dylan Lim

high hireability

Research Scientist@Together AI

  • Stanford BS + MS CS (no PhD)
  • Research Scientist at Together AI ('megakernels for everyone')
  • Previously: Hazy Research RA ('helped GPUs meow' — ThunderKittens contributor), Stanford Compilers Group RA (accelerated DNN training across distributed systems)
  • Jump Trading (core strategies)
  • Active GPU kernel and megakernel developer at the premier kernel team (Together AI / Tri Dao group)
  • US (Palo Alto). ~1yr FTE
GM

Geo Min

high hireability
  • AMD engineer in San Jose CA
  • GitHub account created Feb 2025 — near-certain new grad signal
  • Pins: rocm-rx9070-demo (ROCm AMD RX 9070), iree-test-suites (MLIR), iree fork (MLIR ML compiler), TheRock (HIP/ROCm build), rocm-libraries (Assembly)
  • Also 14 commits to ROCm/composable_kernel
  • Reportedly Virginia Tech education (BS/MS, no PhD signals)
  • Expertise: MLIR, IREE, ROCm, HIP, composable_kernel — directly covers MLIR, AMD GPU, compiler stack
  • Hireability: HIGH — exceptional MLIR/ROCm match, strong new-grad signal from Feb 2025 account, US-based AMD San Jose
JC

Jesse Cai

high hireability

Machine Learning Engineer@Meta

Previously: Senior Research Engineer @ Cultivate

San Francisco, US

  • Meta Software Engineer on PyTorch Core Performance team, SF
  • UCLA 2020 graduate (BS, no PhD). 88 commits to pytorch/ao (rank 8)
  • Sparsity and quantization work (semi-structured 2:4 sparsity, TorchAO), PyTorch blog published, PyTorch Conference 2024 speaker
  • Joined Meta ~2021 → ~3-4 years FTE, within range. h-index: 2
  • US-based SF
  • Tensor core utilization and sparse kernels for LLM inference directly match query
  • Hireability: HIGH — PyTorch core contributor with deep sparsity/quantization, UCLA BS, US SF, no PhD
JW

Jiakun Wang

high hireability
  • BS CS CMU (computer systems/hardware focus) + current MS EE at Columbia (System-on-Chip/VLSI concentration)
  • Chip-level hardware design: implemented TPU IC (Verilog), systolic array for TPU, Emperor SoC (RTL + C drivers), RISC-V formal verification
  • Query explicitly allows chip-level kernel work
  • Contributed to mirage (CUDA superoptimizer)
  • AI chip focus per personal website
  • US (New York, Columbia)
  • No FTE history — entirely within <5yr limit
  • Graduating likely 2025/2026
  • Hireability: HIGH — current student at prime graduation transition window, no blocking FTE role
KM

Kenneth Moon

high hireability
  • MIT EECS 6-3 (CS+EE) undergraduate, expected graduation 2025 per MIT ESP profile
  • Contributed to exo-lang/exo (hardware accelerator scheduling compiler)
  • Has personal mkay-attention repo (custom attention kernel)
  • Early-stage GitHub but direct exposure to low-level compiler work for hardware accelerators at MIT
  • Hireability: HIGH — current MIT undergrad graduating 2025, entering job market imminently
KQ

Kevin Qian

high hireability
  • Core exo-lang contributor (81 commits on exo-lang/exo), co-author of Exo 2 (ASPLOS 2025), MEng thesis on ExoBLAS (meta-programming a high-performance BLAS)
  • MIT BS EECS 2023 + MEng 2024, no PhD
  • Work is directly on user-schedulable DSL for GPU/hardware accelerator kernels — highly query-relevant
  • Prior internships at Jane Street, Meta, and D.E. Shaw (not FTE)
  • Based in Cambridge, MA (US)
  • Hireability: HIGH — MEng thesis submitted 2024, appears to have just completed degree with no confirmed full-time role yet; in prime transition window
KA

Kit Ao

high hireability
  • JHU undergrad (CS + ChemBioE) pursuing CMU MS in Computational Data Science
  • Research assistant at CMU Catalyst Group on Mirage (CUDA/C++/OpenMP/MPI/NVIDIA Nsight profiling)
  • Now ML Engineer at Waymo (ML Infra)
  • Not a PhD — confirmed MS on LinkedIn
  • US-based Pittsburgh/Mountain View
  • Hireability: HIGH — Mirage CUDA+Nsight profiling, Waymo MLE role, ML systems bio, LinkedIn available
KL

Kshitij Lakhani

high hireability
  • MS ECE from UC Davis (no PhD)
  • Deep Learning Performance Engineer at NVIDIA Santa Clara. 26 commits to TransformerEngine (rank 13)
  • Personal repos: Intro_to_Parallel_Computing (CUDA scan/reduce/sort) and Cache_Profiling (matrix multiply optimization) — strong CUDA fundamentals
  • Career path: intern → GPU Software Engineer II at Roche → exo → NVIDIA, estimated ~3-4 years FTE
  • US-based Santa Clara
  • Hireability: HIGH — verified MS-only, active CUDA contributor at NVIDIA with kernel optimization work
MG

Mingfei Guo

high hireability

Software Engineer@NVIDIA

  • Stanford MSEE student (PKU BS alumni) actively building GPU kernels
  • Implemented Flash Attention in Slang using NV CoopMat2 tensor cores with double-buffered K/V shared memory and parallel softmax, achieving 1.35-1.65x speedup over PyTorch SDPA
  • Also built 3D Gaussian Splatting from scratch in NVIDIA Warp
  • Based in Palo Alto
  • Hireability: HIGH — Job-Apply-Automation repo shows active job search (10+ commits Nov-Dec 2025), MSEE likely in final year
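The "parallel softmax" in the Flash Attention implementation above relies on the online softmax trick: a single streaming pass that never materializes the full score row. A minimal Python sketch of that rescaling idea (an illustration of the general technique, not the candidate's Slang code):

```python
import math

def online_softmax(scores):
    """Single-pass, numerically stable softmax.

    Keeps only a running max (m) and a running rescaled sum (d),
    the same invariant FlashAttention-style kernels maintain per
    row so softmax can be fused into the attention loop.
    """
    m = float("-inf")  # running max of scores seen so far
    d = 0.0            # running sum of exp(x - m)
    for x in scores:
        m_new = max(m, x)
        # rescale the old sum to the new max before adding the new term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in scores]
```

Because every exponent is taken relative to the running max, the pass stays stable even for scores in the thousands, where a naive `exp(x)` would overflow.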
PJ

Pawan Jayakumar

high hireability
  • Current UCSD Masters student in ML systems and security. 6 commits to pytorch/ao (PyTorch quantization and sparsity)
  • Bio: 'Masters student at UCSD studying ML systems and security' — graduating 2025 or 2026, fits student criteria
  • No PhD
  • US-based San Diego CA. pytorch/ao contributions = hands-on quantization and sparsity at PyTorch core level, directly aligned with GPU kernel optimization for LLM inference
  • Ideal junior profile
  • Hireability: HIGH — MS ML systems student at UCSD, pytorch/ao contributor, US-based, graduating imminently
PY

Peter Yeh

high hireability
  • Mountain View CA, focused on 'Accelerating Generative AI/LLM.' 21 commits to pytorch/ao
  • Pins: pytorch, pytorch/ao, apache/tvm (ML compiler), FasterTransformer (C++ NVIDIA inference), flash-attention, gloo — exceptional technical breadth across ML compilers, GPU kernel inference, and quantization
  • No PhD signals
  • No clear senior titles
  • US-based Mountain View confirmed
  • Hireability: HIGH — outstanding breadth across LLM inference, ML compilers, and GPU kernels; junior-to-mid level; US-based
PN

Phuong Nguyen

high hireability
  • DL Performance Engineer at NVIDIA Santa Clara. 106+ commits to NVIDIA/TransformerEngine with focus on FP8 fused ops, NVFP4 recipes, JAX/CUDA kernel acceleration, and distributed training (FSDP, Shardy)
  • GitHub active since 2023-2024 in TE repo, consistent with <5yr FTE
  • Stanford-educated (BS/MS, no PhD indicators)
  • US-based
  • Strong fit on GPU kernel/CUDA/JAX search criteria
  • Hireability: HIGH — joined NVIDIA ~2023, within 2-3yr transition window; title is DL Performance Engineer (not Senior/Staff)
RG

Ravi Ghadia

high hireability

GEMM Kernels intern@AMD

Previously: Research Intern @ Together AI

Austin, US

  • GEMM Kernels intern at AMD — directly writing GPU kernels
  • Prior GPU Architect at NVIDIA (Bengaluru)
  • Merged PR in Dao-AILab/flash-attention fixing FA3 int32 overflow for 4M+ token seqlen
  • First author of MorphKV (constant-sized KV cache, ICML 2025 poster) and Untied Ulysses (memory-efficient context parallelism, arXiv 2026)
  • MS student at UT Austin (ECE), BTech IIT Kharagpur, no PhD
  • Austin, TX
  • Hireability: HIGH — 2nd-year MS student, likely graduating 2026, prime transition window
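The FA3 int32-overflow fix mentioned above is a classic long-sequence failure mode: flat element offsets outgrow a signed 32-bit index. A small arithmetic sketch (hypothetical shape — 8 heads, head_dim 128 — chosen only to illustrate the threshold):

```python
INT32_MAX = 2**31 - 1  # 2_147_483_647

def row_offset(seq_idx, num_heads, head_dim):
    # Flat element offset of token `seq_idx` in a contiguous
    # [seqlen, num_heads, head_dim] buffer -- the kind of index
    # a kernel computes per thread.
    return seq_idx * num_heads * head_dim

# At 4M tokens the offset exceeds int32 range, so a 32-bit index
# silently wraps; kernels must promote such offsets to 64-bit.
off = row_offset(4_000_000, num_heads=8, head_dim=128)
assert off > INT32_MAX
```

Python integers never overflow, which is exactly why the sketch can show the bound; in CUDA C++ the same expression in `int` would wrap negative with no warning.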
SK

Samurdhi Karunaratne

high hireability
  • Deep Learning Inference SE at NVIDIA TensorRT since January 2022 (~4.25yr FTE)
  • BS Computer Engineering, Univ of Peradeniya (2019), MS ECE UCLA (Dec 2021)
  • Pinned repos: TensorRT (C++), ONNX-TensorRT, PyTorch-TensorRT, ONNX
  • Physics Olympiad gold/silver medalist (IPhO, APhO)
  • Directly relevant to GPU kernel/LLM inference stack
  • No PhD
  • US-based (Santa Clara CA)
  • Hireability: HIGH — recent MS grad, <5yr FTE, deep TensorRT inference expertise
SS

Schwinn Saereesitthipitak

high hireability

Distributed LLM Inference@NVIDIA

Previously: ML for CAD Engineering @ Apple

San Francisco, US

  • Applied AI Software Engineer at NVIDIA Distributed LLM Inference SF (started June 2025)
  • Stanford CS MS (admitted 2022, no PhD)
  • Built Prophet — LLM inference engine optimized for head-of-line blocking (Stanford CS244b)
  • GitHub shows LLM inference engine + Rust cryptography + competitive programming (C++)
  • Zero prior FTE before NVIDIA
  • Strong US-based systems/inference engineer at tier-1 target
  • Hireability: HIGH — <1yr at NVIDIA, Stanford MS graduate, but role is exactly Distributed LLM Inference — prime target
SX

Shanli Xing

high hireability

Undergraduate Researcher@UW / CMU Catalyst

  • UW CSE undergrad (BS CS, graduating 2026 — actively seeking PhD positions Fall 2026)
  • Core contributor to FlashInfer (CUDA kernel library for LLM serving): designed and implemented sorting-free GPU sampling kernels, co-authored FlashInfer-Bench (MLSys 2026)
  • Research @ UW SAMPL (advised by Prof. Luis Ceze) and CMU Catalyst (advised by Prof. Tianqi Chen)
  • Active kernel work at undergraduate level with published systems paper
  • US (Seattle, WA). 0 FTE
SW

Songting Wang

high hireability
  • CMU ECE+CS student (Pittsburgh), contributor to mirage (Mirage Persistent Kernel: Compiling LLMs into a MegaKernel, C++ CUDA superoptimizer)
  • Pinned mirage fork prominently
  • Likely graduating 2025/2026. 1 commit on mirage (shallow but right project at CMU)
  • No visible FTE history — clean within <5yr FTE limit
  • Hireability: HIGH — current student at graduation window, no blocking employment, Pittsburgh-based at CMU
SS

Surya Subramanian

high hireability

cuBLAS Intern@NVIDIA

  • Georgia Tech CS student (BS)
  • NVIDIA cuBLAS intern writing fast matmul CUDA kernels for Blackwell via emulation on low-precision tensor cores
  • Previously Meta PyTorch distributed training + Pinterest ML infra
  • Graduating soon (2025/2026)
  • Very strong kernel signal for a student
TG

Tarushii Goel

high hireability
  • MIT undergrad CS, Class of 2026 (graduating — explicitly allowed by search)
  • McLean VA / MIT
  • Interned at NVIDIA, Modal Labs, and Exafunction (all kernel/inference companies)
  • CUTLASS forks (quantized BLAS), Triton forks, flash-linear-attention fork
  • Writes GEMM kernel + AI compiler deep-dives on blog
  • No FTE yet (student)
  • USACO Platinum competitive programmer
  • Hireability: HIGH — MIT 2026 graduating student, direct CUTLASS+Triton experience, NVIDIA+Modal internships, prime kernel engineering profile
VL

Victor Li

high hireability
  • FlexFlow contributor (flexflow-train, C++); GitHub repos include MiniAPL-release (LLVM-based dense array language compiler, Stanford cs343d course), CS349H project (Stanford ML compilers course), and Mamba SSM
  • Username victorli2002 strongly implies undergrad graduating 2025-2026
  • LLVM coursework directly on-topic for the compiler track of this query
  • No FTE history visible
  • US-based (Stanford CA implied by course repos)
  • Hireability: HIGH — current undergrad with no FTE experience, prime graduation window
WH

William Hu

high hireability

MSCS Student / Intern@Stanford / Modal Labs

  • MSCS student at Stanford (willhu@stanford.edu), focus on systems/DSLs/AI HPC
  • Co-author of HipKittens paper (AMD ROCm GPU kernels, arxiv Nov 2025): 662 commits to HazyResearch/HipKittens
  • Intern at Modal on the 'flash team' (flash attention and fast GPU kernel work)
  • Also contributor to KernelBench (Stanford Scaling Intelligence)
  • Personal site: willhu-jpg.github.io
  • Taking Stanford CS 240lx (advanced systems) Spring 2025. 0 FTE, strong research output at grad school level
WZ

William Zhou

high hireability

Triton Kernel Engineer@AMD

  • UCLA CS undergrad (graduating 2026)
  • Writing Triton kernels @ AMD (GitHub bio: 'writing Triton kernels @ AMD, ACM@UCLA')
  • Active Triton contributor via AMD
  • US (LA)
  • Strong junior kernel signal: recent grad, active kernel work at major GPU company
WC

Willy Chan

high hireability

Student Researcher / Intern@Together AI / Stanford

  • Stanford BS CS student
  • Student Researcher @ Together AI (AI Kernels team)
  • NVIDIA intern (multigpu workloads/libraries)
  • Meta Superintelligence Lab (Data Foundations)
  • Stanford SAIL research: KernelBench DSL extension + scaling laws of Kernel DSLs + multigpu kernels
  • Built NVSHMEM4Py integration for Perplexity's pplx-kernels (MoE communication CUDA kernels)
  • Active contributor to perplexityai/pplx-kernels. 0 FTE, exceptional kernel breadth for an undergrad
YZ

Yiyan Zhai

high hireability
  • CMU undergraduate (CS&ML) working directly with Prof. Tianqi Chen on FlashInfer-Bench (LLM inference kernel benchmarking system — best paper at MLSys 2025) and MLC-LLM
  • Pinned repos: flashinfer-bench and libCacheSim (C++ cache simulator)
  • Contributor to arxiv paper on FlashInfer-Bench (Oct 2025)
  • Strong LLM inference kernel focus — exactly the query target
  • Pittsburgh (US)
  • Undergraduate only, no PhD
  • Hireability: HIGH — CMU undergrad in graduation window (likely May 2026), no FTE experience, actively doing hands-on LLM inference systems research
ZZ

Zhongbo Zhu

high hireability
  • MS CompE from UIUC (no PhD)
  • DevTech @ NVIDIA — focused on customer CUDA performance optimization
  • Amazon intern Summer 2023 indicates very recent grad, likely ~1-2 years FTE
  • Pins: TransformerEngine, Megatron-LM, Megatron-Bridge
  • NVIDIA Technical Blog author on quantized training and CUDA kernels
  • ZJU undergrad + UIUC MS
  • Likely US-based (NVIDIA DevTech teams primarily US)
  • Hireability: HIGH — MS-only, very junior FTE timeline (~1-2yr), active quantization+kernel work at NVIDIA
AS

Aditya Saigal

high hireability
  • Tenstorrent tt-metal contributor
  • US-based (San Francisco). junior (~3 years)
AY

Artem Yerofieiev

high hireability
  • Tenstorrent tt-metal contributor, compiler tooling
  • US-based (LA). junior (~3 years)
BG

Bhavya Gada

high hireability
  • tinygrad compiler/kernel contributor
  • US-based (United States). junior (~3 years)
CL

cloud11665

high hireability
  • tinygrad compiler/kernel contributor
  • US-based (San Francisco). junior (~3 years). high contribution volume
DV

Daniel Vega-Myhre

high hireability
  • PyTorch AO quantization kernel contributor
  • US-based (US). junior (~3 years). high contribution volume
DA

David Zhao Akeley

high hireability
  • MIT exo compiler contributor
  • US-based (US). junior (~3 years)
DN

Douglas Nyberg

high hireability
  • tinygrad compiler/kernel contributor
  • US-based (Lafayette, Indiana). junior (~3 years)
GD

Grace Dinh

high hireability
  • MIT exo compiler contributor
  • US-based (US). junior (~3 years)
JR

James Roberts

high hireability
  • tinygrad compiler/kernel contributor
  • US-based (Seattle). junior (~3 years)
KM

Kenneth Moon

high hireability
  • MIT exo compiler contributor
  • US-based (US). junior (~3 years)
MB

Marcel Bischoff

high hireability
  • tinygrad compiler/kernel contributor
  • US-based (Columbus, OH). junior (~3 years)
NH

Nigel Huang

high hireability
  • 306 merged PRs in tenstorrent/tt-metal (TT-Metalium low-level kernel programming model) plus contributions to tt-umd (user-mode driver) and tt-zephyr-platforms (Zephyr RTOS firmware) — exactly the chip-level firmware/driver/kernel work the query targets
  • Located Santa Clara
  • No PhD signals; firmware/driver role profile consistent with BS/MS
  • Hireability: HIGH — confirmed left Tenstorrent March 20, 2026 (GitHub PR body: 'Today is my last day at Tenstorrent') ~4 weeks ago; likely on the job market now
PL

Philip Lassen

high hireability
  • Groq engineer
  • US-based (US). junior (~3 years)
VW

Vincent Wells

high hireability
  • Tenstorrent tt-mlir contributor
  • US-based (Austin). junior (~3 years)
WF

Wei Feng

high hireability
  • PyTorch AO quantization kernel contributor
  • US-based (US). junior (~3 years)
WM

Wesley Maxey

high hireability
  • NVIDIA contributor
  • US-based (Sunnyvale, CA). junior (~3 years). high contribution volume
WT

Whitney Tsang

high hireability
  • Intel XPU Triton backend compiler work
  • US-based (US). junior (~3 years). high contribution volume
AS

Aaryan Singhal

medium hireability

Software Engineer@ReflectionAI

  • 321 commits to HazyResearch/ThunderKittens + co-author of ThunderKittens paper (ICLR 2025) — CUDA tile primitives targeting H100
  • Own repos: CUDABenchmarks, TKConvs
  • Stanford CS Systems/AI, new grad (~0 FTE)
  • Based in Palo Alto/SF
  • Hireability: MEDIUM — ~1-1.5 years at ReflectionAI (joined post-Oct 2024 paper), startup role within typical transition window, but no explicit open-to-work signals. GitHub bio still shows 'cs @ stanford' with no company set
AL

Abe Leininger

medium hireability
  • Creator of Metal-Puzzles (604★ Apple Metal GPU kernel puzzles repo) and gpu-kernels (CUDA kernel implementations); 6+ merged PRs to ml-explore/mlx (CPU LU factorization, nuclear norm, unsigned dtype fixes in reduce ops) and luminal-ai/luminal (StableHLO support, 3939 additions); also contributed to tinygrad
  • IU BS CS, US-based Austin TX
  • Hireability: MEDIUM — early career (GitHub acct Dec 2021, ~1-2 yrs FTE per prior research), no current company listed on GitHub (previous node expansion noted Anduril Industries but company field now blank); strong GPU kernel learning velocity across Metal and CUDA but unclear current employment status
AD

Aidan Do

medium hireability

Software Engineer, Inference@Fireworks AI

  • CUDA kernel engineer at Fireworks AI (SF Bay Area) — wrote custom bicubic interpolation CUDA kernel (11-21% TTFT improvement for Kimi K2.5), merged PyTorch CUDA upsample_bicubic2d kernel rewrite (4.3-43x speedup), active contributor to NVIDIA/cutlass and FlashInfer
  • BS Software Engineering (Honours), no PhD
  • Hireability: MEDIUM — ~2yr FTE total, currently at Fireworks AI with no job-seeking signals detected; within typical junior transition window
DW

David Wang

medium hireability

GPU Performance Engineer@Modal Labs

  • UIUC education
  • GPU performance engineer @ Modal Labs (Brooklyn, US)
  • Active GPU kernel work: forked Mirage (automatically generating fast GPU kernels without Triton/CUDA). 0-2yr FTE
  • No PhD found
  • Modal is a top GPU cloud company known for kernel-level work
DA

David Zhao Akeley

medium hireability
  • Developer Technology Engineer at NVIDIA Santa Clara (US)
  • UCLA BS CS+Math
  • Contributed to Exo-GPU (PLDI 2026, MIT CSAIL) — CUDA-targeting scheduling compiler achieving >80% of H100 theoretical peak on GEMM
  • Deep low-level systems work (SIMD combine library in C, OpenGL voxel renderer)
  • No PhD — BS only
  • Hireability: MEDIUM — NVIDIA DevTech Engineer with recent exo-gpu collaboration suggesting active research engagement; start date unclear
AY

Angela Yi

medium hireability
  • Meta PyTorch Compiler SWE since 2022, specializing in torch.export and AOTInductor — core ML compiler/codegen directly relevant to LLM inference optimization
  • CMU BS CS (2018-2022), also pursuing CS at Stanford (2024-present, likely MS not PhD). ~3 years FTE, no PhD
  • Contributed to pytorch/executorch (on-device AI acceleration)
  • Spoke at PyTorch Conference 2024 on export features
  • US-based
  • Hireability: MEDIUM — 3 years at Meta, within transition window; Stanford enrollment may indicate career pivot interest
AS

Anton Shepelev

medium hireability
  • Adjacent expertise: NVIDIA System Software Engineer (Tegra group) with LLVM compiler background — pinned repo is llvm-project fork; designed a custom compiler (DJ language → RISC-V → LLVM) with optimization passes (loop unrolling, constant folding, CSE)
  • USF MS+BS CS
  • SF Bay Area, US
  • Primary work is Tegra (embedded/automotive system software) rather than GPU kernels for ML/LLM inference — adjacent to the query
  • LLVM/compiler infrastructure background is highly relevant to MLIR-based kernel work
  • Hireability: MEDIUM — likely 1-3 years at NVIDIA as recent USF grad; Tegra engineers occasionally pivot to GPU compute work
AS

asfiyab-nvidia

medium hireability
  • NVIDIA TensorRT engineer with MS in Computer Engineering from UC San Diego
  • Located in San Diego County, US
  • Work spans TransformerEngine, Megatron-LM, onnx-tensorrt (C++), TensorRT, and NeMo — inference optimization and GPU-accelerated architectures
  • Account created 2022, ~3 years at NVIDIA
  • No PhD
  • MS-only
  • Hireability: MEDIUM — solid MS-level NVIDIA inference engineer with TensorRT/ONNX depth; more inference-framework than raw CUDA kernel writing
AG

Aviral Goel

medium hireability
  • GPU kernel engineer at AMD ROCm working on GEMM optimization and composable_kernel (91 commits, C++ HIP kernels)
  • Personal GitHub shows strong self-directed kernel learning: hip_kernels (HIP GPU kernel examples), flash_attention_triton (Flash Attention from scratch in Triton)
  • MS Computer Science from USC
  • Prior experience was internship at Samsung Semiconductor (Vulkan/GPU layers — excluded from FTE)
  • AMD work account (AviralGoelAMD) created December 9, 2024, confirming very recent hire
  • US-based (Austin, TX)
  • No PhD
  • Hireability: MEDIUM — only ~3 months into AMD role (Dec 2024), likely still settling in; personal GitHub bio still says USC graduate student suggesting very recent graduation
BZ

Baizhou Zhang

medium hireability

Member of Technical Staff@RadixArk (SGLang)

  • MS CS from UC San Diego, BS Intelligence Science from Peking University
  • MTS at RadixArk (SGLang core developer — attention kernels integration, large-scale EP & Speculative Decoding on GB200/GB300)
  • NVIDIA intern on CuDNN Heuristic Team (Blackwell GEMM kernel heuristics)
  • Has cuda-learn-by-practice repo showing hands-on CUDA work
  • Active FlashInfer contributor
  • No PhD
  • US (Palo Alto, CA). ~1yr FTE
CP

Colin Peppler

medium hireability
  • Meta SWE with direct GPU kernel experience: AITemplate contributor (CUDA/HIP codegen framework), GPU Techniques & Algorithms team at Meta working on custom kernels and inference optimization, plus AI Infra RecSys inference optimization
  • Virginia Tech BS CS
  • US-based (Menlo Park)
  • Career: Meta intern → FTE (likely ~2021 start), estimated ~3-4 years FTE
  • BS only, no PhD
  • Hireability: MEDIUM — ~3-4 years at Meta, within transition window; strong GPU kernel skills make this a high-priority reach-out
EA

Emilio Andere

medium hireability

GPU Kernel Engineer@Wafer-AI

  • UChicago Mathematics BS (2021-2025)
  • GPU kernel engineer @ Wafer-AI (YC F25 startup, SF), 'fast gpu kernels @wafer-ai'
  • OSS contributor to tinygrad
  • Active in GPU kernel development at a well-funded YC startup. ~0-1yr FTE, no PhD
EA

Eric Auld

medium hireability

Systems Research, GPU Programming Engineer@Together AI

  • BS Arizona State, MA UCLA, MS NYU (all in mathematics, no CS PhD)
  • Systems Research & GPU Programming Engineer @ Together AI
  • Co-authored GPU MODE Lecture 15 on CUTLASS/CuTe
  • Active CUTLASS contributor
  • Transitioned from math to GPU programming. ~2yr FTE in kernel work
IH

ihar

medium hireability
  • US-based (SF/TN), bio: 'ml, hpc, processors microarchitecture' — directly on-target for GPU kernel/HPC query. tinygrad contributor with possible Tenstorrent affiliation
  • Account 2014 but only 17 followers, sparse public presence — no clear senior indicators
  • No PhD signals found
  • Hireability: MEDIUM — strong topical alignment (HPC, microarchitecture) and US-based, but sparse profile makes degree and seniority hard to confirm; worth outreach
IS

Ilya Sherstyuk

medium hireability
  • SE at NVIDIA (Deep Learning Inference Workflows) since ~2022
  • BS CS Caltech 2022 — clear <5yr FTE (~3-4yr by March 2026)
  • No PhD
  • US-based (Santa Clara CA)
  • NVIDIA TensorRT team working on DL inference workflows
  • Caltech CS + NVIDIA inference = high-quality junior signal
  • Limited public GitHub signal but confirmed via LinkedIn
  • Hireability: MEDIUM — strong educational pedigree (Caltech CS), directly relevant team, but limited public kernel repo evidence
JT

James Thompson

medium hireability
  • Built llm.metal — LLM training in raw C/Metal Shading Language (Apple GPU compute shaders), a direct port of Karpathy llm.c to Metal, demonstrating genuine low-level GPU programming ability
  • Bay Area, CA (confirmed)
  • Bio: ML, computer graphics, GPU programming languages. 1 commit on llm.c
  • CAVEAT: No education or employer found; FTE history unknown — needs follow-up to verify <5yr constraint
  • Hireability: MEDIUM — unclear employment status; Bay Area location and hobby GPU projects suggest field-engaged and potentially available
JW

Jason S. Wang

medium hireability

Machine Learning Scientist@Tesla

Previously: Research Assistant @ Stanford University

San Francisco, US

  • UC Berkeley BS 2022, Stanford MS CS/AI graduating 2025 (just graduated)
  • At NVIDIA now
  • OpenReview confirms 'MS student, Stanford University'
  • Pins: Megatron-LM, NeMo — large-scale training systems. 2 commits to TransformerEngine. <1 year FTE
  • Strong pedigree but kernel depth less clear vs. training systems focus
  • Hireability: MEDIUM — excellent junior fit and seniority, but focus is more training systems than raw CUDA/Triton kernel writing
JJ

Jianan Ji

medium hireability
  • Strong fit — contributes to Mirage megakernel compiler (CMU Catalyst Lab) and SGLang (LLM inference)
  • Confirmed MS student at CMU graduating 2026 per LinkedIn ('CMU 26 | PKU 24'), personal website confirms 'second-year master student at CMU' — not a PhD
  • US-based Pittsburgh PA
  • Hireability: MEDIUM — solid research contributions to both GPU kernel compiler and LLM serving, but limited public open-to-work signals
MF

Michael Feil

medium hireability
  • Model Performance Engineer at Baseten, San Francisco
  • Hands-on Hopper and Blackwell kernel optimizations (candle-flash-attn-v3 repo: custom Flash Attention V3 C++ implementation)
  • Distributed LLM inference: TRT-LLM, NVIDIA Dynamo (2x inference speedup), BEI embeddings runtime
  • Created Infinity production inference engine
  • Prior: MTS at Gradient (1M-context LLM training on 512 NVIDIA GPUs), ML Engineer at Rohde & Schwarz
  • MS Robotics/ML from TU Munich (no PhD)
  • Estimated ~3-4 years FTE (within threshold)
  • San Francisco
  • Hireability: MEDIUM — ~2 years at Baseten, within transition window; no open-to-work signals
NM

Nicolas Macchioni

medium hireability
  • PyTorch Inductor core contributor at Meta (184 PRs in pytorch/pytorch)
  • Builds GPU kernel autotuning, benchmarking, and caching infrastructure -- wrote InductorBenchmarker, Triton template structural caching, reduction autotune data collection pipeline, and contributed L2 cache flush fix to triton-lang/triton
  • Deep understanding of GPU kernel performance (CUDA events, profiling, Triton internals) but work is on the compiler/autotuning layer, not kernel authoring itself
  • Adjacent to GPU kernels -- could transfer
  • No PhD, based in SF
  • Hireability: MEDIUM -- ~1.5-2 years at Meta, no explicit signals of looking to leave
RS

Reuben Stern

medium hireability
  • Core FlashAttention contributor — co-author on FlashAttention-4 paper (arXiv:2603.05451) and PyTorch FlexAttention+FA4 blog. 18+ PRs to Dao-AILab/flash-attention implementing FlexAttention features in CuTe DSL: vectorized score_mods, block sparsity computation kernels, mask_mod support, varlen extensions, PackGQA backward pass for Sm90/Sm100
  • Research Scientist at Colfax International
  • BA Mathematics from Harvard, MM from Peabody — no PhD
  • Non-traditional path (conductor/mathematician) with deep kernel-level GPU work
  • Boston, MA
  • Hireability: MEDIUM — at Colfax (small research consultancy), actively shipping FA4 features with Tri Dao and Jay Shah, likely recruitable
SP

Sahan Paliskara

medium hireability
  • Co-author of KernelLLM (8B model generating Triton kernels from PyTorch) and BackendBench (evaluation suite for LLM/human-written PyTorch backends)
  • Added Triton backend to KernelBench
  • Speaker at PyTorch Conf 2025 on LLMs for GPU kernel dev. 402 PRs on pytorch/pytorch, though primarily backend engine work (torch deploy removal, PyObjectSlot simplification), not hand-written GPU kernels
  • Expertise is in tooling/benchmarking for GPU kernel generation rather than kernel authoring itself — adjacent but strong
  • Princeton BS CS 2021, ~4yr FTE at Meta PyTorch, SF
  • Hireability: MEDIUM — ~4-5 years at Meta, within transition window but no explicit job-seeking signals
SG

Shreya Gaur

medium hireability

Deep Learning Performance Engineer@NVIDIA

  • MS ECE from Purdue University (thesis on GPU kernel optimization for group equivariant CNNs)
  • Deep Learning Performance Engineer at NVIDIA Santa Clara
  • Active custom kernel work: cutlassdev repo (CUTLASS development), gemm_sample (GEMM optimizations), cuda_programming (CUDA class assignments for AlexNet layers on V100 using shared/constant memory)
  • CUTLASS contributor
  • No PhD
  • US (Santa Clara)
SH

Shushi Hong

medium hireability
  • Active contributor to apache/tvm (30+ commits) and mlc-ai/mlc-llm — the two core ML compiler projects for LLM inference
  • Pinned repos are TVM, MLC-LLM, and web-llm, showing deep specialization in ML compilation and on-device LLM deployment
  • Based at CMU Pittsburgh (Catalyst group under Tianqi Chen)
  • No PhD signals; company field shows Carnegie Mellon University
  • Hireability: MEDIUM — still affiliated with CMU but graduation timeline unclear; likely MS student in 2025-2026 window
VP

Vladimir Penkin

medium hireability
  • Intel Austin TX engineer, 13 commits to intel/intel-xpu-backend-for-triton (Triton MLIR GPU compiler)
  • GitHub account 2021, pinned repos: pytorch/pytorch + intel-xpu-backend-for-triton
  • Active on Intel XPU Triton issues through Jan 2026
  • Personal website (penk.in) lists SF, conflicting with the Austin, TX role location
  • Education: Taganrog South University of Radio Engineering (Russia)
  • Generalist background (Ruby/C#/TypeScript) but demonstrated Triton/PyTorch XPU focus at Intel
  • Hireability: MEDIUM — account 2021 suggests ~4yr FTE, within window; Intel role has Triton/XPU focus matching query
VS

Vyom Sharma

medium hireability
  • Senior CS student at University of Minnesota (Minneapolis MN, US)
  • Internship/student-only history, no substantial FTE
  • Pinned repos: llm.c fork (18 commits), RecurrentGemma-CUDA ('advanced kernels'), Kaleidoscope-Extended (LLVM JIT compiler in C++) — CUDA + compiler dual track exactly on-target
  • Account 2022, expected graduation ~2025-2026
  • Hireability: MEDIUM — strong student CUDA+LLVM profile but no published work or industry CUDA contributions; graduation date needs confirmation
WC

Wenxin Cheng

medium hireability

Software Engineer@Meta

  • MS CS from UCLA (2022-2024), BS CE from Beijing Jiaotong University
  • Software Engineer at Meta (Jan 2026-present: Dev Infra Triton; Apr 2024-Dec 2025: AI toolchains with CUDA focus)
  • CUTLASS contributor
  • Triton contributor
  • Also contributed to FBGEMM and vLLM
  • Menlo Park, CA
  • No PhD. ~2yr FTE at Meta in kernel/toolchain work
WH

William Hu

medium hireability

Member of Technical Staff@Modal

Previously: GPU Compiler Engineer @ Qualcomm

San Francisco, US

  • GPU Compiler Engineer at Qualcomm (~1.5yr FTE, Sep 2023-Feb 2025) writing OpenCL/Vulkan GPU compilers using LLVM+C++; now at Modal on the flash team (GPU kernels for AI/HPC) while completing MS CS at Stanford (graduating 2026)
  • Research at Stanford Hazy + Scaling Intelligence labs
  • BS Math-CS UCSD, MS Stanford — no PhD
  • US (Bay Area/Stanford)
  • Clean FTE history: ~1.5yr Qualcomm FTE + Modal intern (student concurrent role)
  • Hireability: MEDIUM — current Stanford student, graduating 2026 is prime transition window; Modal role is student internship
YW

Yuzhong Wang

medium hireability
  • NVIDIA engineer (account created 2024 — very recent hire)
  • Works on NeMo, TransformerEngine, and Megatron-MoE — LLM training infrastructure and GPU kernel optimization; username carries the -nvidia suffix
  • No PhD signals found
  • Likely US-based at NVIDIA primary offices
  • Strong technical stack alignment
  • Hireability: MEDIUM — relevant TransformerEngine/Megatron work but limited public signals to confirm education level; no location explicitly confirmed
ZZ

Zephyr Zhao

medium hireability
  • CMU student (undergrad or new MS) confirmed by andrew.cmu.edu email
  • Maintains a CUDA-Manuscript learning repo; confirmed Mirage paper contributor (listed as Zepeng Zhao in the author list)
  • Account Jan 2025 — very fresh, learning-focused CUDA repos
  • No PhD signals
  • US-based Pittsburgh
  • Hireability: MEDIUM — genuine GPU kernel interest with CMU research exposure, but early-stage contributor with minimal independent track record
ZW

Zhihao Wang

medium hireability
  • MLSys Engineer at ByteDance Seed Infra Compiler team (Urbana, IL, US)
  • UIUC BS Math+CS + MS CS, no PhD
  • Built a superscalar Tomasulo out-of-order RV32M CPU in SystemVerilog from scratch (7K+ lines); contributor to sglang-jax and exo-lang/exo
  • ByteDance Seed Compiler team does GPU compiler/distributed training work
  • Hireability: MEDIUM — recently started full-time at ByteDance post-MS; likely <2 years in role
ZZ

Zhongbo Zhu

medium hireability

DevTech AI Engineer@NVIDIA

  • BS CS from UIUC (joint degree with Zhejiang University), MS ECE from UIUC
  • NVIDIA DevTech AI engineer specializing in CUDA kernel development and FP8 quantized training
  • Previously Amazon intern (Summer 2023). ~1yr FTE in CUDA kernel dev
  • No PhD
  • US (Bay Area / Santa Clara)
DV

Daniel Vega-Myhre

low hireability
  • Deep CUDA/Triton kernel work at Meta PyTorch Core Performance team: FP8/MXFP8 LLM pretraining kernels, async TP, Triton quantization kernels — 228 commits to pytorch/ao
  • Personal pinned repo gemm written in raw CUDA
  • BS Computer Science Boise State University; MS CS at Georgia Institute of Technology (OMSCS, in progress 2024-2027)
  • No PhD
  • US-based at Meta
  • SENIORITY NOTE: ~5.5yr total FTE (Clearwater Analytics Dec 2018-Jan 2021 data engineering + Google Jun 2022-Aug 2024 ML/TPU + Meta Aug 2024-present); kernel-specific FTE is ~3.7yr at Google/Meta
  • Hireability: LOW — at Meta PyTorch Core since Aug 2024 and recently promoted to Senior SWE (Aug 2025), unlikely to be looking
LJ

Li Jiaying

low hireability
  • Led multi-GPU CUDA traffic simulation at UC Berkeley (113× speedup over CPU; C++/CUDA, scaled across 1-8 GPUs with ghost-zone partitioning and inter-GPU communication); 4 commits to mirage (Mirage Persistent Kernel CUDA superoptimizer, C++)
  • CMU MCDS (BS Software Engineering from Tongji University); now FTE SWE at Snowflake
  • Contributes to arrow-rs (Rust)
  • US (Menlo Park)
  • Hireability: LOW — very recently started FTE at Snowflake, likely still settling in
NP

Nikhil Patel

low hireability

Software Engineer@Meta Superintelligence Labs

  • Core GPU kernel engineer at Meta Superintelligence Labs
  • Built the NVFP4 blockscaled GEMM kernel pipeline in PyTorch Inductor (10-PR NVGEMM stack with CuTeDSL/CUTLASS)
  • Triton compiler contributor (FP8 precision fix, TMA block shape fix). 168 merged PRs in pytorch/pytorch, also contributing NVFP4 backend to vLLM
  • MS CSE from UMich, BS CS from USC, no PhD
  • Based in SF
  • Hireability: LOW — recently at Meta Superintelligence Labs (new org, likely <1 year), shipping at high velocity on cutting-edge GPU kernel work
RS

Rishi Sankar

low hireability
  • MTS at Anthropic working on LLM performance/kernels/pretraining
  • Built Flash Attention 2 from scratch in CUDA with 10 optimization iterations (multi-block parallelism, register-based matmul, warp reduction, kernel fusion)
  • Active LeetGPU participant (fp16 GEMM, multi-head attention, convolution kernels)
  • UCLA BS in CS + Applied Math, no PhD. ~2-3 years full-time experience (Two Sigma then Anthropic)
  • Hireability: LOW — likely <2 years at Anthropic in a dream role for GPU kernel engineers

Runs

#3completed17 qualified / 29 foundApr 4, 2:39 AM
#2completed47 qualified / 136 foundMar 30, 8:34 AM
#1completed19 qualified / 19 foundMar 30, 8:34 AM