
Junior GPU kernel engineers in the US with CUDA/Triton experience + no PhD. With…

completed · 66 qualified · 4 runs · Mar 4, 6:18 AM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi

Qualified Candidates (66)

AP

Aaron Pazdera

high hireability

Researcher@Prime Intellect AI

Previously: Research Engineer @ Prime Intellect

Des Plaines, US

  • BS Computational and Applied Mathematics (UW-Stout ~2021)
  • Researcher at Prime Intellect AI, Des Plaines IL
  • Pinned repos: llama.cpp (LLM inference C/C++), lightning-thunder (PyTorch compiler), prime-rl (async RL training at scale), C PEG parser
  • Direct LLM inference + compiler tooling work
  • FTE start ~2021 = ~4-5yr (borderline but within window)
  • Recent website activity (9 days ago)
  • Hireability: HIGH — hands-on LLM inference (llama.cpp) + PyTorch compiler (lightning-thunder), active frontier AI lab researcher
AS

Aaryan Singhal

high hireability

Software Engineer@ReflectionAI

  • Co-author of ThunderKittens ICLR 2025 paper (arXiv 2410.20399)
  • Active contributor (300+ commits): recent March 2025 work on MHA implementations, benchmarking, and optimizations
  • Personal CUDA repo targets H100 memory bandwidth saturation
  • Built 348k-final-project (optimized Madrona kernels in TK)
  • Stanford CS (BS Systems + MS AI), recently left Stanford
  • Hireability: HIGH — recent graduate, prime hiring window, actively shipping kernel code
AL

Abe Leininger

high hireability
  • Early-career engineer with GPU kernel enthusiasm: creator of Metal-Puzzles (Apple Metal GPU kernel learning project, inspired by GPU-Puzzles), 2 commits on luminal (Rust-based ML compiler generating CUDA/Metal kernels)
  • IU BS CS
  • Career: Genesys → webAI (ML infra) → now Anduril Industries SWE in Colorado. ~1-2 years FTE total
  • GPU work is self-directed (Metal, not CUDA/Triton specifically)
  • Hireability: HIGH — very early career (~1 year FTE), just started at Anduril; genuine GPU kernel curiosity shown by side projects
AG

Aiden Grossman

high hireability

Google

  • UC Davis BS (~2023-2024 recent grad)
  • Google SWE. 200+ LLVM commits: MLGO (ML-guided register allocation), AMD Zen 5+6 microarchitecture tuning, XLA, HEIR (homomorphic encryption IR)
  • Lead author of ComPile (2.4TB LLVM-IR dataset)
  • Presented at LLVM Dev Meeting 2022/2024/2025
  • No PhD
  • US-based. <2yr FTE (started ~2024)
  • Exact match on LLVM/MLIR compiler stack for GPU kernel backends
  • Hireability: HIGH — fresh BS-only FTE at Google, exceptional LLVM/compiler depth for a recent undergrad
AP

Akaash Parthasarathy

high hireability

Carnegie Mellon University

  • CMU MSML (2025-2026, graduating 2026), Georgia Tech BS CS
  • Research focus explicitly on ML compilers, hardware-aware algorithm design, and efficient LLM inference — directly on-target. 14 commits on mlc-ai/mlc-llm; forks TVM and vLLM
  • Website bio confirms: machine learning compilers, efficient training and inference, parallelization schemes, hardware-aware algorithm design
  • Pittsburgh-based (US)
  • BS+MS only, no PhD
  • Hireability: HIGH — MSML graduating May 2026, prime new-grad hiring window with no current FTE role
AK

Alex Kranias

high hireability

Research Intern / GPU Kernel Engineer@Apple (intern) / SHI Labs (Georgia Tech)

  • Georgia Tech CS BS (AI/ML concentration), current undergrad at SHI Labs
  • AMD Triton Kernels team intern (Fall 2024) — FlashAttention GPU kernel development on ROCm
  • Apple GPU kernels intern (Summer 2025) — GPU kernels for on-device MoE LLMs and Diffusion Transformers
  • Published triton_vs_cuda repo implementing cuBLAS-performant GEMM kernels
  • Strong multi-internship junior kernel signal
BC

Bob Chen

high hireability

Carnegie Mellon University

  • FlexFlow contributor (flexflow-train) and active in LLM inference systems at CMU: sglang-jax (JAX backend for SGLang, contributor to sgl-project org), vLLM-gsoc, KV compression research
  • Broad coverage of distributed LLM inference and serving stacks — strong systems instincts
  • Current CMU student (no FTE), US-based
  • Hireability: HIGH — current student with no FTE employment, graduating likely 2025-2026
CL

cloud11665

high hireability

OpenAI

  • OpenAI engineer in SF
  • Pinned repos: gpuocelot (PTX dynamic compilation framework in C++) and telegraf_nv_export (ultra-low-overhead NVIDIA GPU telemetry C++ plugin) — both highly kernel/systems-relevant. 8 tinygrad commits
  • No PhD signals, no senior title publicly confirmed
  • PTX compilation and GPU telemetry work is precisely on-target
  • Hireability: HIGH — exceptional GPU kernel and low-level systems work (PTX, CUDA C++), US-based at OpenAI SF. Real name unknown; outreach via GitHub
DW

David Wang

high hireability

Modal Labs

  • GPU performance engineer at Modal Labs, ex-NVIDIA HPC Architect
  • Contributed to Mirage (CMU Catalyst Lab) as external contributor
  • BS Texas A&M + MS CS UIUC — no PhD
  • Total FTE estimated <5 years
  • US-based
  • Hireability: HIGH — GPU perf engineering at Modal is exactly on-topic, NVIDIA HPC background strong, BS/MS only. Currently employed so may not be actively seeking
DL

Dylan Lim

high hireability

Research Scientist@Together AI

  • Stanford BS + MS CS (no PhD)
  • Research Scientist at Together AI ('megakernels for everyone')
  • Previously: Hazy Research RA ('helped GPUs meow' — ThunderKittens contributor), Stanford Compilers Group RA (accelerated DNN training across distributed systems)
  • Jump Trading (core strategies)
  • Active GPU kernel and megakernel developer at the premier kernel team (Together AI / Tri Dao group)
  • US (Palo Alto). ~1yr FTE
GM

Geo Min

high hireability

AMD

  • AMD engineer in San Jose CA
  • GitHub account created Feb 2025 — near-certain new grad signal
  • Pins: rocm-rx9070-demo (ROCm AMD RX 9070), iree-test-suites (MLIR), iree fork (MLIR ML compiler), TheRock (HIP/ROCm build), rocm-libraries (Assembly)
  • Also 14 commits to ROCm/composable_kernel
  • Reportedly Virginia Tech education (BS/MS, no PhD signals)
  • Expertise: MLIR, IREE, ROCm, HIP, composable_kernel — directly covers MLIR, AMD GPU, compiler stack
  • Hireability: HIGH — exceptional MLIR/ROCm match, strong new-grad signal from Feb 2025 account, US-based AMD San Jose
JC

Jesse Cai

high hireability

Machine Learning Engineer@Meta

Previously: Senior Research Engineer @ Cultivate

San Francisco, US

  • Meta Software Engineer on PyTorch Core Performance team, SF
  • UCLA 2020 graduate (BS, no PhD). 88 commits to pytorch/ao (rank 8)
  • Sparsity and quantization work (semi-structured 2:4 sparsity, TorchAO), PyTorch blog published, PyTorch Conference 2024 speaker
  • Joined Meta ~2021 → ~3-4 years FTE, within range. h-index: 2
  • US-based SF
  • Tensor core utilization and sparse kernels for LLM inference directly match query
  • Hireability: HIGH — PyTorch core contributor with deep sparsity/quantization, UCLA BS, US SF, no PhD
JW

Jiakun Wang

high hireability
  • BS CS CMU (computer systems/hardware focus) + current MS EE at Columbia (System-on-Chip/VLSI concentration)
  • Chip-level hardware design: implemented TPU IC (Verilog), systolic array for TPU, Emperor SoC (RTL + C drivers), RISC-V formal verification
  • Query explicitly allows chip-level kernel work
  • Contributed to mirage (CUDA superoptimizer)
  • AI chip focus per personal website
  • US (New York, Columbia)
  • No FTE history — entirely within <5yr limit
  • Graduating likely 2025/2026
  • Hireability: HIGH — current student at prime graduation transition window, no blocking FTE role
KM

Kenneth Moon

high hireability
  • MIT EECS 6-3 (CS+EE) undergraduate, expected graduation 2025 per MIT ESP profile
  • Contributed to exo-lang/exo (hardware accelerator scheduling compiler)
  • Has personal mkay-attention repo (custom attention kernel)
  • Early-stage GitHub but direct exposure to low-level compiler work for hardware accelerators at MIT
  • Hireability: HIGH — current MIT undergrad graduating 2025, entering job market imminently
KQ

Kevin Qian

high hireability
  • Core exo-lang contributor (81 commits on exo-lang/exo), co-author of Exo 2 (ASPLOS 2025), MEng thesis on ExoBLAS (meta-programming a high-performance BLAS)
  • MIT BS EECS 2023 + MEng 2024, no PhD
  • Work is directly on user-schedulable DSL for GPU/hardware accelerator kernels — highly query-relevant
  • Prior internships at Jane Street, Meta, and D. E. Shaw (not FTE)
  • Based in Cambridge, MA (US)
  • Hireability: HIGH — MEng thesis submitted 2024, appears to have just completed degree with no confirmed full-time role yet; in prime transition window
KA

Kit Ao

high hireability

Carnegie Mellon University

  • JHU undergrad (CS + ChemBioE) pursuing CMU MS in Computational Data Science
  • Research assistant at CMU Catalyst Group on Mirage (CUDA/C++/OpenMP/MPI/NVIDIA Nsight profiling)
  • Now ML Engineer at Waymo (ML Infra)
  • Not a PhD — confirmed MS on LinkedIn
  • US-based Pittsburgh/Mountain View
  • Hireability: HIGH — Mirage CUDA+Nsight profiling, Waymo MLE role, ML systems bio, LinkedIn available
KL

Kshitij Lakhani

high hireability

NVIDIA

  • MS ECE from UC Davis (no PhD)
  • Deep Learning Performance Engineer at NVIDIA Santa Clara. 26 commits to TransformerEngine (rank 13)
  • Personal repos: Intro_to_Parallel_Computing (CUDA scan/reduce/sort) and Cache_Profiling (matrix multiply optimization) — strong CUDA fundamentals
  • Career path: intern → GPU Software Engineer II at Roche → exo → NVIDIA, estimated ~3-4 years FTE
  • US-based Santa Clara
  • Hireability: HIGH — verified MS-only, active CUDA contributor at NVIDIA with kernel optimization work
MG

Mingfei Guo

high hireability

Software Engineer@NVIDIA

  • Stanford MSEE student (PKU BS alumnus) with strong GPU kernel implementation experience
  • Built flash-attention-slang — Flash Attention in Slang using Vulkan + NV CoopMat2 tensor cores (2.1x faster than PyTorch SDPA), slang-torch PyTorch integration with CUDA support, and multiple GPU-accelerated projects with Warp
  • Active recent work (Feb 2026 commits)
  • Based in Palo Alto
  • Hireability: HIGH — Final-year Master's student, likely graduating 2026, in prime transition window
PJ

Pawan Jayakumar

high hireability

UC San Diego

  • Current UCSD Masters student in ML systems and security. 6 commits to pytorch/ao (PyTorch quantization and sparsity)
  • Bio: 'Masters student at UCSD studying ML systems and security' — graduating 2025 or 2026, fits student criteria
  • No PhD
  • US-based San Diego CA. pytorch/ao contributions = hands-on quantization and sparsity at PyTorch core level, directly aligned with GPU kernel optimization for LLM inference
  • Ideal junior profile
  • Hireability: HIGH — MS ML systems student at UCSD, pytorch/ao contributor, US-based, graduating imminently
PY

Peter Yeh

high hireability
  • Mountain View CA, focused on 'Accelerating Generative AI/LLM.' 21 commits to pytorch/ao
  • Pins: pytorch, pytorch/ao, apache/tvm (ML compiler), FasterTransformer (C++ NVIDIA inference), flash-attention, gloo — exceptional technical breadth across ML compilers, GPU kernel inference, and quantization
  • No PhD signals
  • No clear senior titles
  • US-based Mountain View confirmed
  • Hireability: HIGH — outstanding breadth across LLM inference, ML compilers, and GPU kernels; junior-to-mid level; US-based
PN

Phuong Nguyen

high hireability

NVIDIA

  • DL Performance Engineer at NVIDIA Santa Clara. 106+ commits to NVIDIA/TransformerEngine with focus on FP8 fused ops, NVFP4 recipes, JAX/CUDA kernel acceleration, and distributed training (FSDP, Shardy)
  • GitHub active since 2023-2024 in TE repo, consistent with <5yr FTE
  • Stanford-educated (BS/MS, no PhD indicators)
  • US-based
  • Strong fit on GPU kernel/CUDA/JAX search criteria
  • Hireability: HIGH — joined NVIDIA ~2023, within 2-3yr transition window; title is DL Performance Engineer (not Senior/Staff)
SK

Samurdhi Karunaratne

high hireability

NVIDIA

  • Deep Learning Inference SE at NVIDIA TensorRT since January 2022 (~4.25yr FTE)
  • BS Computer Engineering, Univ of Peradeniya (2019), MS ECE UCLA (Dec 2021)
  • Pinned repos: TensorRT (C++), ONNX-TensorRT, PyTorch-TensorRT, ONNX
  • Physics Olympiad gold/silver medalist (IPhO, APhO)
  • Directly relevant to GPU kernel/LLM inference stack
  • No PhD
  • US-based (Santa Clara CA)
  • Hireability: HIGH — recent MS grad, <5yr FTE, deep TensorRT inference expertise
SS

Distributed LLM Inference@NVIDIA

Previously: ML for CAD Engineering @ Apple

San Francisco, US

  • Applied AI Software Engineer at NVIDIA Distributed LLM Inference SF (started June 2025)
  • Stanford CS MS (admitted 2022, no PhD)
  • Built Prophet — LLM inference engine optimized for head-of-line blocking (Stanford CS244b)
  • GitHub shows LLM inference engine + Rust cryptography + competitive programming (C++)
  • Zero prior FTE before NVIDIA
  • Strong US-based systems/inference engineer at tier-1 target
  • Hireability: HIGH — <1yr at NVIDIA, Stanford MS graduate, and the role is exactly Distributed LLM Inference — prime target
SX

Shanli Xing

high hireability

Undergraduate Researcher@UW / CMU Catalyst

  • UW CSE undergrad (BS CS, graduating 2026 — actively seeking PhD positions Fall 2026)
  • Core contributor to FlashInfer (CUDA kernel library for LLM serving): designed and implemented sorting-free GPU sampling kernels, co-authored FlashInfer-Bench (MLSys 2026)
  • Research @ UW SAMPL (advised by Prof. Luis Ceze) and CMU Catalyst (advised by Prof. Tianqi Chen)
  • Active kernel work at undergraduate level with published systems paper
  • US (Seattle, WA). 0 FTE
SW

Songting Wang

high hireability

Carnegie Mellon University

  • CMU ECE+CS student (Pittsburgh), contributor to mirage (Mirage Persistent Kernel: Compiling LLMs into a MegaKernel, C++ CUDA superoptimizer)
  • Pinned mirage fork prominently
  • Likely graduating 2025/2026. 1 commit on mirage (shallow but right project at CMU)
  • No visible FTE history — clean within <5yr FTE limit
  • Hireability: HIGH — current student at graduation window, no blocking employment, Pittsburgh-based at CMU
SS

Surya Subramanian

high hireability

cuBLAS Intern@NVIDIA

  • Georgia Tech CS student (BS)
  • NVIDIA cuBLAS intern writing fast matmul CUDA kernels for Blackwell via emulation on low-precision tensor cores
  • Previously Meta PyTorch distributed training + Pinterest ML infra
  • Graduating soon (2025/2026)
  • Very strong kernel signal for a student
TG

Tarushii Goel

high hireability
  • MIT undergrad CS, Class of 2026 (graduating — explicitly allowed by search)
  • McLean VA / MIT
  • Interned at NVIDIA, Modal Labs, and Exafunction (all kernel/inference companies)
  • CUTLASS forks (quantized BLAS), Triton forks, flash-linear-attention fork
  • Writes GEMM kernel + AI compiler deep-dives on blog
  • No FTE yet (student)
  • USACO Platinum competitive programmer
  • Hireability: HIGH — MIT 2026 graduating student, direct CUTLASS+Triton experience, NVIDIA+Modal internships, prime kernel engineering profile
VL

Victor Li

high hireability
  • FlexFlow contributor (flexflow-train, C++); GitHub repos include MiniAPL-release (LLVM-based dense array language compiler, Stanford cs343d course), CS349H project (Stanford ML compilers course), and Mamba SSM
  • Username victorli2002 strongly implies undergrad graduating 2025-2026
  • LLVM coursework directly on-topic for the compiler track of this query
  • No FTE history visible
  • US-based (Stanford CA implied by course repos)
  • Hireability: HIGH — current undergrad with no FTE experience, prime graduation window
WH

William Hu

high hireability

MSCS Student / Intern@Stanford / Modal Labs

  • MSCS student at Stanford (willhu@stanford.edu), focus on systems/DSLs/AI HPC
  • Co-author of HipKittens paper (AMD ROCm GPU kernels, arXiv Nov 2025): 662 commits to HazyResearch/HipKittens
  • Intern at Modal on the 'flash team' (flash attention and fast GPU kernel work)
  • Also contributor to KernelBench (Stanford Scaling Intelligence)
  • Personal site: willhu-jpg.github.io
  • Taking Stanford CS 240lx (advanced systems) Spring 2025. 0 FTE, strong research output at grad school level
WZ

William Zhou

high hireability

Triton Kernel Engineer@AMD

  • UCLA CS undergrad (graduating 2026)
  • Writing Triton kernels @ AMD (GitHub bio: 'writing Triton kernels @ AMD, ACM@UCLA')
  • Active Triton contributor via AMD
  • US (LA)
  • Strong junior kernel signal: current undergrad doing active kernel work at a major GPU company
WC

Willy Chan

high hireability

Student Researcher / Intern@Together AI / Stanford

  • Stanford BS CS student
  • Student Researcher @ Together AI (AI Kernels team)
  • NVIDIA intern (multigpu workloads/libraries)
  • Meta Superintelligence Lab (Data Foundations)
  • Stanford SAIL research: KernelBench DSL extension + scaling laws of Kernel DSLs + multigpu kernels
  • Built NVSHMEM4Py integration for Perplexity's pplx-kernels (MoE communication CUDA kernels)
  • Active contributor to perplexityai/pplx-kernels. 0 FTE, exceptional kernel breadth for an undergrad
YZ

Yiyan Zhai

high hireability

Carnegie Mellon University

  • CMU undergraduate (CS&ML) working directly with Prof. Tianqi Chen on FlashInfer-Bench (LLM inference kernel benchmarking system — best paper at MLSys 2025) and MLC-LLM
  • Pinned repos: flashinfer-bench and libCacheSim (C++ cache simulator)
  • Contributor to arXiv paper on FlashInfer-Bench (Oct 2025)
  • Strong LLM inference kernel focus — exactly the query target
  • Pittsburgh (US)
  • Undergraduate only, no PhD
  • Hireability: HIGH — CMU undergrad in graduation window (likely May 2026), no FTE experience, actively doing hands-on LLM inference systems research
ZZ

Zhongbo Zhu

high hireability

NVIDIA

  • MS CompE from UIUC (no PhD)
  • DevTech @ NVIDIA — focused on customer CUDA performance optimization
  • Amazon intern Summer 2023 indicates very recent grad, likely ~1-2 years FTE
  • Pins: TransformerEngine, Megatron-LM, Megatron-Bridge
  • NVIDIA Technical Blog author on quantized training and CUDA kernels
  • ZJU undergrad + UIUC MS
  • Likely US-based (NVIDIA DevTech teams primarily US)
  • Hireability: HIGH — MS-only, very junior FTE timeline (~1-2yr), active quantization+kernel work at NVIDIA
AQ

Abu Qader

medium hireability

Baseten

  • Software Engineer at Baseten since July 2022 (~3.5 years FTE)
  • Cornell BS 2021 (no PhD)
  • Led TensorRT-LLM Engine Builder and EAGLE 3 speculative decoding integration; authored engineering posts on production-ready speculative decoding with TRT-LLM
  • Adjacent to kernel work (uses TRT-LLM rather than writing raw CUDA/Triton kernels) but deep LLM inference optimization expertise
  • San Francisco, US-based
  • Hireability: MEDIUM — 3+ years at Baseten, entering typical transition window
AD

Aidan Do

medium hireability

Software Engineer, Inference@Fireworks AI

  • BS Software Engineering (Honours), University of Adelaide
  • Software Engineer at Fireworks AI (SF Bay Area): wrote custom bicubic interpolation CUDA kernel improving TTFT of Kimi K2.5 by 11-21%; contributed to CUTLASS and FlashInfer OSS
  • Also rewrote PyTorch CUDA upsample_bicubic2d kernel — 4.3-43x speedup (merged PR)
  • Active NVIDIA/Meta OSS contributor
  • Previously at Canva (Sydney). <2yr FTE
  • US (Bay Area)
AY

Angela Yi

medium hireability
  • Meta PyTorch Compiler SWE since 2022, specializing in torch.export and AOTInductor — core ML compiler/codegen directly relevant to LLM inference optimization
  • CMU BS CS (2018-2022), also pursuing CS at Stanford (2024-present, likely MS not PhD). ~3 years FTE, no PhD
  • Contributed to pytorch/executorch (on-device AI acceleration)
  • Spoke at PyTorch Conference 2024 on export features
  • US-based
  • Hireability: MEDIUM — 3 years at Meta, within transition window; Stanford enrollment may indicate career pivot interest
AS

Anton Shepelev

medium hireability

NVIDIA

  • Adjacent expertise: NVIDIA System Software Engineer (Tegra group) with LLVM compiler background — pinned repo is llvm-project fork; designed a custom compiler (DJ language → RISC-V → LLVM) with optimization passes (loop unrolling, constant folding, CSE)
  • USF MS+BS CS
  • SF Bay Area, US
  • Primary work is Tegra (embedded/automotive system software) rather than GPU kernels for ML/LLM inference — adjacent to the query
  • LLVM/compiler infrastructure background is highly relevant to MLIR-based kernel work
  • Hireability: MEDIUM — likely 1-3 years at NVIDIA as recent USF grad; Tegra engineers occasionally pivot to GPU compute work
AS

asfiyab-nvidia

medium hireability

NVIDIA

  • NVIDIA TensorRT engineer with MS in Computer Engineering from UC San Diego
  • Located in San Diego County, US
  • Work spans TransformerEngine, Megatron-LM, onnx-tensorrt (C++), TensorRT, and NeMo — inference optimization and GPU-accelerated architectures
  • Account created 2022, ~3 years at NVIDIA
  • No PhD
  • MS-only
  • Hireability: MEDIUM — solid MS-level NVIDIA inference engineer with TensorRT/ONNX depth; more inference-framework than raw CUDA kernel writing
AG

Aviral Goel

medium hireability

AMD

  • GPU kernel engineer at AMD ROCm working on GEMM optimization and composable_kernel (91 commits, C++ HIP kernels)
  • Personal GitHub shows strong self-directed kernel learning: hip_kernels (HIP GPU kernel examples), flash_attention_triton (Flash Attention from scratch in Triton)
  • MS Computer Science from USC
  • Prior experience was an internship at Samsung Semiconductor (Vulkan/GPU layers — excluded from FTE)
  • AMD work account (AviralGoelAMD) created December 9, 2024, confirming very recent hire
  • US-based (Austin, TX)
  • No PhD
  • Hireability: MEDIUM — only ~3 months into AMD role (Dec 2024), likely still settling in; personal GitHub bio still says USC graduate student suggesting very recent graduation
BZ

Baizhou Zhang

medium hireability

Member of Technical Staff@RadixArk (SGLang)

  • MS CS from UC San Diego, BS Intelligence Science from Peking University
  • MTS at RadixArk (SGLang core developer — attention kernels integration, large-scale EP & Speculative Decoding on GB200/GB300)
  • NVIDIA intern on CuDNN Heuristic Team (Blackwell GEMM kernel heuristics)
  • Has cuda-learn-by-practice repo showing hands-on CUDA work
  • Active FlashInfer contributor
  • No PhD
  • US (Palo Alto, CA). ~1yr FTE
CP

Colin Peppler

medium hireability
  • Meta SWE with direct GPU kernel experience: AITemplate contributor (CUDA/HIP codegen framework), GPU Techniques & Algorithms team at Meta working on custom kernels and inference optimization, plus AI Infra RecSys inference optimization
  • Virginia Tech BS CS
  • US-based (Menlo Park)
  • Career: Meta intern → FTE (likely ~2021 start), estimated ~3-4 years FTE
  • BS only, no PhD
  • Hireability: MEDIUM — ~3-4 years at Meta, within transition window; strong GPU kernel skills make this a high-priority reach-out
DW

David Wang

medium hireability

GPU Performance Engineer@Modal Labs

  • UIUC education
  • GPU performance engineer @ Modal Labs (Brooklyn, US)
  • Active GPU kernel work: forked Mirage (automatically generating fast GPU kernels without Triton/CUDA). 0-2yr FTE
  • No PhD found
  • Modal is a top GPU cloud company known for kernel-level work
DA

David Zhao Akeley

medium hireability

MIT CSAIL

  • Developer Technology Engineer at NVIDIA Santa Clara (US)
  • UCLA BS CS+Math
  • Contributed to Exo-GPU (PLDI 2026, MIT CSAIL) — CUDA-targeting scheduling compiler achieving >80% of H100 theoretical peak on GEMM
  • Deep low-level systems work (SIMD combine library in C, OpenGL voxel renderer)
  • No PhD — BS only
  • Hireability: MEDIUM — NVIDIA DevTech Engineer with recent exo-gpu collaboration suggesting active research engagement; start date unclear
EA

Emilio Andere

medium hireability

GPU Kernel Engineer@Wafer-AI

  • UChicago Mathematics BS (2021-2025)
  • GPU kernel engineer @ Wafer-AI (YC F25 startup, SF), 'fast gpu kernels @wafer-ai'
  • OSS contributor to tinygrad
  • Active in GPU kernel development at a well-funded YC startup. ~0-1yr FTE, no PhD
EA

Eric Auld

medium hireability

Systems Research, GPU Programming Engineer@Together AI

  • BS Arizona State, MA UCLA, MS NYU (all in mathematics, no CS PhD)
  • Systems Research & GPU Programming Engineer @ Together AI
  • Co-authored GPU MODE Lecture 15 on CUTLASS/CuTe
  • Active CUTLASS contributor
  • Transitioned from math to GPU programming. ~2yr FTE in kernel work
IH

ihar

medium hireability
  • US-based (SF/TN), bio: 'ml, hpc, processors microarchitecture' — directly on-target for GPU kernel/HPC query. tinygrad contributor with possible Tenstorrent affiliation
  • Account 2014 but only 17 followers, sparse public presence — no clear senior indicators
  • No PhD signals found
  • Hireability: MEDIUM — strong topical alignment (HPC, microarchitecture) and US-based, but sparse profile makes degree and seniority hard to confirm; worth outreach
IS

Ilya Sherstyuk

medium hireability

NVIDIA

  • SE at NVIDIA (Deep Learning Inference Workflows) since ~2022
  • BS CS Caltech 2022 — clear <5yr FTE (~3-4yr by March 2026)
  • No PhD
  • US-based (Santa Clara CA)
  • NVIDIA TensorRT team working on DL inference workflows
  • Caltech CS + NVIDIA inference = high-quality junior signal
  • Limited public GitHub signal but confirmed via LinkedIn
  • Hireability: MEDIUM — strong educational pedigree (Caltech CS), directly relevant team, but limited public kernel repo evidence
JT

James Thompson

medium hireability
  • Built llm.metal — LLM training in raw C/Metal Shading Language (Apple GPU compute shaders), a direct port of Karpathy llm.c to Metal, demonstrating genuine low-level GPU programming ability
  • Bay Area, CA (confirmed)
  • Bio: ML, computer graphics, GPU programming languages. 1 commit on llm.c
  • CAVEAT: No education or employer found; FTE history unknown — needs follow-up to verify <5yr constraint
  • Hireability: MEDIUM — unclear employment status; Bay Area location and hobby GPU projects suggest field-engaged and potentially available
JW

Jason S. Wang

medium hireability

Machine Learning Scientist@Tesla

Previously: Research Assistant @ Stanford University

San Francisco, US

  • UC Berkeley BS 2022, Stanford MS CS/AI graduating 2025 (just graduated)
  • At NVIDIA now
  • OpenReview confirms 'MS student, Stanford University'
  • Pins: Megatron-LM, NeMo — large-scale training systems. 2 commits to TransformerEngine. <1 year FTE
  • Strong pedigree but kernel depth less clear vs. training systems focus
  • Hireability: MEDIUM — excellent junior fit and seniority, but focus is more training systems than raw CUDA/Triton kernel writing
JJ

Jianan Ji

medium hireability

Carnegie Mellon University

  • Strong fit — contributes to Mirage megakernel compiler (CMU Catalyst Lab) and SGLang (LLM inference)
  • Confirmed MS student at CMU graduating 2026 per LinkedIn ('CMU 26 | PKU 24'), personal website confirms 'second-year master student at CMU' — not a PhD
  • US-based Pittsburgh PA
  • Hireability: MEDIUM — solid research contributions to both GPU kernel compiler and LLM serving, but limited public open-to-work signals
MF

Michael Feil

medium hireability

Baseten

  • Model Performance Engineer at Baseten, San Francisco
  • Hands-on Hopper and Blackwell kernel optimizations (candle-flash-attn-v3 repo: custom Flash Attention V3 C++ implementation)
  • Distributed LLM inference: TRT-LLM, NVIDIA Dynamo (2x inference speedup), BEI embeddings runtime
  • Created Infinity production inference engine
  • Prior: MTS at Gradient (1M-context LLM training on 512 NVIDIA GPUs), ML Engineer at Rohde & Schwarz
  • MS Robotics/ML from TU Munich (no PhD)
  • Estimated ~3-4 years FTE (within threshold)
  • San Francisco
  • Hireability: MEDIUM — ~2 years at Baseten, within transition window; no open-to-work signals
NP

Nikhil Patel

medium hireability

Software Engineer@Meta Superintelligence Labs

  • MS CS from University of Michigan
  • SWE at Meta Superintelligence Labs (SF/Menlo Park) working on GPU kernels, ML compilers, dynamic bytecode transformation
  • Authored tritonparse (TritonParse: Compiler Tracer/Visualizer/Reproducer for Triton Kernels) — pinned on GitHub
  • Active Triton and PyTorch contributor
  • Previously interned at Meta PyTorch, Roblox, Amazon, ZEISS
  • No PhD. ~1-2yr FTE
  • US (Menlo Park, CA)
RG

Raghavv Goel

medium hireability

Senior Machine Learning Researcher@Qualcomm

Previously: Robotics Researcher @ Carnegie Mellon University

San Diego, US

  • Senior DL Researcher at Qualcomm AI Research (Feb 2023-present, ~3yr FTE) on Efficient LLM team; briefly on Qualcomm Compiler Optimization team
  • Work: speculative decoding (2.4x speedup), KV cache eviction (KeyDiff) — inference algorithm research adjacent to but not direct CUDA kernel writing
  • BS ECE IIITD Delhi 2020, MS Robotics CMU 2022 (no PhD)
  • San Diego, US. ~3yr FTE total — within <5yr limit
  • Hireability: MEDIUM — within typical transition window; website position update Aug 2025 and recent papers suggest career motion
RL

Ryan Lynch

medium hireability

Autopilot Compiler Engineer@Tesla

  • BS/MS Georgia Tech
  • Tesla Autopilot Compiler Engineer working with MLIR
  • Bio: 'Tesla Autopilot Compiler Engineer. MLIR. Small team.' 2-3yr FTE
  • No PhD
  • US (likely CA)
  • Compiler/MLIR background is directly adjacent to GPU kernel work; Tesla is a serious kernel shop
SG

Shreya Gaur

medium hireability

Deep Learning Performance Engineer@NVIDIA

  • MS ECE from Purdue University (thesis on GPU kernel optimization for group equivariant CNNs)
  • Deep Learning Performance Engineer at NVIDIA Santa Clara
  • Active custom kernel work: cutlassdev repo (CUTLASS development), gemm_sample (GEMM optimizations), cuda_programming (CUDA class assignments for AlexNet layers on V100 using shared/constant memory)
  • CUTLASS contributor
  • No PhD
  • US (Santa Clara)
SH

Shushi Hong

medium hireability

Carnegie Mellon University

  • Active contributor to apache/tvm (30+ commits) and mlc-ai/mlc-llm — the two core ML compiler projects for LLM inference
  • Pinned repos are TVM, MLC-LLM, and web-llm, showing deep specialization in ML compilation and on-device LLM deployment
  • Based at CMU Pittsburgh (Catalyst group under Tianqi Chen)
  • No PhD signals; company field shows Carnegie Mellon University
  • Hireability: MEDIUM — still affiliated with CMU but graduation timeline unclear; likely MS student in 2025-2026 window
VP

Vladimir Penkin

medium hireability

Intel

  • Intel Austin TX engineer, 13 commits to intel/intel-xpu-backend-for-triton (Triton MLIR GPU compiler)
  • GitHub account 2021, pinned repos: pytorch/pytorch + intel-xpu-backend-for-triton
  • Active on Intel XPU Triton issues through Jan 2026
  • Personal website (penk.in) confirms SF location
  • Education: Taganrog South University of Radio Engineering (Russia)
  • Generalist background (Ruby/C#/TypeScript) but demonstrated Triton/PyTorch XPU focus at Intel
  • Hireability: MEDIUM — account 2021 suggests ~4yr FTE, within window; Intel role has Triton/XPU focus matching query
VS

Vyom Sharma

medium hireability
  • Senior CS student at University of Minnesota (Minneapolis MN, US)
  • Internship/student-only history, no substantial FTE
  • Pinned repos: llm.c fork (18 commits), RecurrentGemma-CUDA ('advanced kernels'), Kaleidoscope-Extended (LLVM JIT compiler in C++) — CUDA + compiler dual track exactly on-target
  • Account 2022, expected graduation ~2025-2026
  • Hireability: MEDIUM — strong student CUDA+LLVM profile but no published work or industry CUDA contributions; graduation date needs confirmation
WC

Wenxin Cheng

medium hireability

Software Engineer@Meta

  • MS CS from UCLA (2022-2024), BS CE from Beijing Jiaotong University
  • Software Engineer at Meta (Jan 2026-present: Dev Infra Triton; Apr 2024-Dec 2025: AI toolchains with CUDA focus)
  • Contributor to CUTLASS, Triton, FBGEMM, and vLLM
  • Menlo Park, CA
  • No PhD. ~2yr FTE at Meta in kernel/toolchain work
WH

William Hu

medium hireability

Member of Technical Staff@Modal

Previously: GPU Compiler Engineer @ Qualcomm

San Francisco, US

  • GPU Compiler Engineer at Qualcomm (~1.5yr FTE, Sep 2023-Feb 2025) writing OpenCL/Vulkan GPU compilers using LLVM+C++; now at Modal on the flash team (GPU kernels for AI/HPC) while completing MS CS at Stanford (graduating 2026)
  • Research at Stanford Hazy + Scaling Intelligence labs
  • BS Math-CS UCSD, MS Stanford — no PhD
  • US (Bay Area/Stanford)
  • Clean FTE history: ~1.5yr Qualcomm FTE + Modal internship (concurrent with studies)
  • Hireability: MEDIUM — current Stanford student, graduating 2026 is prime transition window; Modal role is student internship
YW

Yuzhong Wang

medium hireability

NVIDIA

  • NVIDIA engineer (account created 2024 — very recent hire)
  • Works on NeMo, TransformerEngine, and Megatron-MoE (LLM training infrastructure and GPU kernel optimization); GitHub username carries an -nvidia suffix
  • No PhD signals found
  • Likely US-based at NVIDIA primary offices
  • Strong technical stack alignment
  • Hireability: MEDIUM — relevant TransformerEngine/Megatron work but limited public signals to confirm education level; no location explicitly confirmed
ZZ

Zephyr Zhao

medium hireability

Carnegie Mellon University

  • CMU student (undergrad or new MS) confirmed by andrew.cmu.edu email
  • Has a CUDA-Manuscript learning repo and is a confirmed Mirage paper contributor (listed as Zepeng Zhao in the paper's author list)
  • Account Jan 2025 — very fresh, learning-focused CUDA repos
  • No PhD signals
  • US-based Pittsburgh
  • Hireability: MEDIUM — genuine GPU kernel interest with CMU research exposure, but early-stage contributor with minimal independent track record
ZW

Zhihao Wang

medium hireability

Earth

  • MLSys Engineer on the ByteDance Seed Infra Compiler team (Urbana, IL, US)
  • UIUC BS Math+CS + MS CS, no PhD
  • Built SuperScalar Tomasulo OoO RV32M CPU in SystemVerilog from scratch (7K+ lines), contributor to sglang-jax and exo-lang/exo
  • ByteDance Seed Compiler team does GPU compiler/distributed training work
  • Hireability: MEDIUM — recently started full-time at ByteDance post-MS; likely <2 years in role
ZZ

Zhongbo Zhu

medium hireability

DevTech AI Engineer@NVIDIA

  • BS CS from UIUC (joint degree with Zhejiang University), MS ECE from UIUC
  • NVIDIA DevTech AI engineer specializing in CUDA kernel development and FP8 quantized training
  • Previously Amazon intern (Summer 2023). ~1yr FTE in CUDA kernel dev
  • No PhD
  • US (Bay Area / Santa Clara)
DV

Daniel Vega-Myhre

low hireability

Meta

  • Deep CUDA/Triton kernel work at Meta PyTorch Core Performance team: FP8/MXFP8 LLM pretraining kernels, async TP, Triton quantization kernels — 228 commits to pytorch/ao
  • Personal pinned repo gemm written in raw CUDA
  • BS Computer Science Boise State University; MS CS at Georgia Institute of Technology (OMSCS, in progress 2024-2027)
  • No PhD
  • US-based at Meta
  • SENIORITY NOTE: ~5.5yr total FTE (Clearwater Analytics Dec 2018-Jan 2021 data engineering + Google Jun 2022-Aug 2024 ML/TPU + Meta Aug 2024-present); kernel-specific FTE is ~3.7yr at Google/Meta
  • Hireability: LOW — joined Meta PyTorch Core in Aug 2024 and was promoted to Senior SWE in Aug 2025; unlikely to be looking
LJ

Li Jiaying

low hireability

Carnegie Mellon University

  • Led multi-GPU CUDA traffic simulation at UC Berkeley (113X speedup over CPU, C++/CUDA, scaled 1-8 GPUs with ghost-zone partitioning and inter-GPU communication); 4 commits on mirage (Mirage Persistent Kernel CUDA superoptimizer, C++)
  • CMU MCDS student (BS Software Engineering, Tongji University), now FTE SWE at Snowflake
  • Contributes to arrow-rs (Rust)
  • US (Menlo Park)
  • Hireability: LOW — very recently started FTE at Snowflake, likely still settling in

Runs

  • Run #4 (Mar 13, 3:37 PM): failed, 0 qualified / 0 found. Claude Code exited with code -1:
  • Run #3 (Mar 12, 8:40 PM): failed, 0 qualified / 0 found. Claude Code exited with code -1:
  • Run #2 (Mar 12, 8:35 PM): completed, 47 qualified / 47 found
  • Run #1 (Mar 12, 8:35 PM): completed, 19 qualified / 19 found