
Junior GPU kernel engineers in the US with CUDA/Triton experience + no PhD. With…

completed · 66 qualified · 4 runs · Mar 4, 6:18 AM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi

Qualified Candidates (66)

AP

Aaron Pazdera

high hireability

Researcher@Prime Intellect AI

Previously: Research Engineer @ Prime Intellect

Des Plaines, US

  • BS Computational and Applied Mathematics (UW-Stout ~2021)
  • Researcher at Prime Intellect AI, Des Plaines IL
  • Pinned repos: llama.cpp (LLM inference C/C++), lightning-thunder (PyTorch compiler), prime-rl (async RL training at scale), C PEG parser
  • Direct LLM inference + compiler tooling work
  • FTE start ~2021 = ~4-5yr (borderline but within window)
  • Recent website activity (9 days ago)
  • Hireability: HIGH — hands-on LLM inference (llama.cpp) + PyTorch compiler (lightning-thunder), active frontier AI lab researcher
AS

Aaryan Singhal

high hireability

Software Engineer@ReflectionAI

  • Co-author of ThunderKittens ICLR 2025 paper (arXiv 2410.20399)
  • Active contributor (300+ commits): recent March 2025 work on MHA implementations, benchmarking, and optimizations
  • Personal CUDA repo targets H100 memory bandwidth saturation
  • Built 348k-final-project (optimized Madrona kernels in TK)
  • Stanford CS (BS Systems + MS AI), recently left Stanford
  • Hireability: HIGH — recent graduate, prime hiring window, actively shipping kernel code
AL

Abe Leininger

high hireability
  • Early-career engineer with GPU kernel enthusiasm: creator of Metal-Puzzles (Apple Metal GPU kernel learning project, inspired by GPU-Puzzles), 2 commits on luminal (Rust-based ML compiler generating CUDA/Metal kernels)
  • IU BS CS
  • Career: Genesys → webAI (ML infra) → now Anduril Industries SWE in Colorado. ~1-2 years FTE total
  • GPU work is self-directed (Metal, not CUDA/Triton specifically)
  • Hireability: HIGH — very early career (~1 year FTE), just started at Anduril; genuine GPU kernel curiosity shown by side projects
AG

Aiden Grossman

high hireability

Google

  • UC Davis BS (~2023-2024 recent grad)
  • Google SWE. 200+ LLVM commits: MLGO (ML-guided register allocation), AMD Zen 5+6 microarchitecture tuning, XLA, HEIR (homomorphic encryption IR)
  • Lead author of ComPile (2.4TB LLVM-IR dataset)
  • Presented at LLVM Dev Meeting 2022/2024/2025
  • No PhD
  • US-based. <2yr FTE (started ~2024)
  • Exact match on LLVM/MLIR compiler stack for GPU kernel backends
  • Hireability: HIGH — fresh BS-only FTE at Google, exceptional LLVM/compiler depth for a recent undergrad
AP

Akaash Parthasarathy

high hireability

Carnegie Mellon University

  • CMU MSML (2025-2026, graduating 2026), Georgia Tech BS CS
  • Research focus explicitly on ML compilers, hardware-aware algorithm design, and efficient LLM inference — directly on-target. 14 commits on mlc-ai/mlc-llm; forks TVM and vLLM
  • Website bio confirms: machine learning compilers, efficient training and inference, parallelization schemes, hardware-aware algorithm design
  • Pittsburgh-based (US)
  • BS+MS only, no PhD
  • Hireability: HIGH — MSML graduating May 2026, prime new-grad hiring window with no current FTE role
AK

Alex Kranias

high hireability

Research Intern / GPU Kernel Engineer@Apple (intern) / SHI Labs (Georgia Tech)

  • Georgia Tech CS BS (AI/ML concentration), current undergrad at SHI Labs
  • AMD Triton Kernels team intern (Fall 2024) — FlashAttention GPU kernel development on ROCm
  • Apple GPU kernels intern (Summer 2025) — GPU kernels for on-device MoE LLMs and Diffusion Transformers
  • Published triton_vs_cuda repo implementing cuBLAS-performant GEMM kernels
  • Strong multi-internship junior kernel signal
BC

Bob Chen

high hireability

Carnegie Mellon University

  • FlexFlow contributor (flexflow-train) and active in LLM inference systems at CMU: sglang-jax (JAX backend for SGLang, contributor to sgl-project org), vLLM-gsoc, KV compression research
  • Broad coverage of distributed LLM inference and serving stacks — strong systems instincts
  • Current CMU student (no FTE), US-based
  • Hireability: HIGH — current student with no FTE employment, graduating likely 2025-2026
CL

cloud11665

high hireability

OpenAI

  • OpenAI engineer in SF
  • Pinned repos: gpuocelot (PTX dynamic compilation framework in C++) and telegraf_nv_export (ultra-low-overhead NVIDIA GPU telemetry C++ plugin) — both highly kernel/systems-relevant. 8 tinygrad commits
  • No PhD signals, no senior title publicly confirmed
  • PTX compilation and GPU telemetry work is precisely on-target
  • Hireability: HIGH — exceptional GPU kernel and low-level systems work (PTX, CUDA C++), US-based at OpenAI SF. Real name unknown; outreach via GitHub
DW

David Wang

high hireability

Modal Labs

  • GPU performance engineer at Modal Labs, ex-NVIDIA HPC Architect
  • Contributed to Mirage (CMU Catalyst Lab) as external contributor
  • BS Texas A&M + MS CS UIUC — no PhD
  • Total FTE estimated <5 years
  • US-based
  • Hireability: HIGH — GPU perf engineering at Modal is exactly on-topic, NVIDIA HPC background strong, BS/MS only. Currently employed so may not be actively seeking
DL

Dylan Lim

high hireability

Research Scientist@Together AI

  • Stanford BS + MS CS (no PhD)
  • Research Scientist at Together AI ('megakernels for everyone')
  • Previously: Hazy Research RA ('helped GPUs meow' — ThunderKittens contributor), Stanford Compilers Group RA (accelerated DNN training across distributed systems)
  • Jump Trading (core strategies)
  • Active GPU kernel and megakernel developer at the premier kernel team (Together AI / Tri Dao group)
  • US (Palo Alto). ~1yr FTE
GM

Geo Min

high hireability

AMD

  • AMD engineer in San Jose CA
  • GitHub account created Feb 2025 — near-certain new grad signal
  • Pins: rocm-rx9070-demo (ROCm AMD RX 9070), iree-test-suites (MLIR), iree fork (MLIR ML compiler), TheRock (HIP/ROCm build), rocm-libraries (Assembly)
  • Also 14 commits to ROCm/composable_kernel
  • Reportedly Virginia Tech education (BS/MS, no PhD signals)
  • Expertise: MLIR, IREE, ROCm, HIP, composable_kernel — directly covers MLIR, AMD GPU, compiler stack
  • Hireability: HIGH — exceptional MLIR/ROCm match, strong new-grad signal from Feb 2025 account, US-based AMD San Jose
JC

Jesse Cai

high hireability

Machine Learning Engineer@Meta

Previously: Senior Research Engineer @ Cultivate

San Francisco, US

  • Meta Software Engineer on PyTorch Core Performance team, SF
  • UCLA 2020 graduate (BS, no PhD). 88 commits to pytorch/ao (rank 8)
  • Sparsity and quantization work (semi-structured 2:4 sparsity, TorchAO), PyTorch blog published, PyTorch Conference 2024 speaker
  • Joined Meta ~2021 → ~3-4 years FTE, within range. h-index: 2
  • US-based SF
  • Tensor core utilization and sparse kernels for LLM inference directly match query
  • Hireability: HIGH — PyTorch core contributor with deep sparsity/quantization, UCLA BS, US SF, no PhD
JW

Jiakun Wang

high hireability
  • BS CS CMU (computer systems/hardware focus) + current MS EE at Columbia (System-on-Chip/VLSI concentration)
  • Chip-level hardware design: implemented TPU IC (Verilog), systolic array for TPU, Emperor SoC (RTL + C drivers), RISC-V formal verification
  • Query explicitly allows chip-level kernel work
  • Contributed to mirage (CUDA superoptimizer)
  • AI chip focus per personal website
  • US (New York, Columbia)
  • No FTE history — entirely within <5yr limit
  • Graduating likely 2025/2026
  • Hireability: HIGH — current student at prime graduation transition window, no blocking FTE role
KM

Kenneth Moon

high hireability
  • MIT EECS 6-3 (CS+EE) undergraduate, expected graduation 2025 per MIT ESP profile
  • Contributed to exo-lang/exo (hardware accelerator scheduling compiler)
  • Has personal mkay-attention repo (custom attention kernel)
  • Early-stage GitHub but direct exposure to low-level compiler work for hardware accelerators at MIT
  • Hireability: HIGH — current MIT undergrad graduating 2025, entering job market imminently
KQ

Kevin Qian

high hireability
  • Core exo-lang contributor (81 commits on exo-lang/exo), co-author of Exo 2 (ASPLOS 2025), MEng thesis on ExoBLAS (meta-programming a high-performance BLAS)
  • MIT BS EECS 2023 + MEng 2024, no PhD
  • Work is directly on user-schedulable DSL for GPU/hardware accelerator kernels — highly query-relevant
  • Prior internships at Jane Street, Meta, and D. E. Shaw (not FTE)
  • Based in Cambridge, MA (US)
  • Hireability: HIGH — MEng thesis submitted 2024, appears to have just completed degree with no confirmed full-time role yet; in prime transition window
KA

Kit Ao

high hireability

Carnegie Mellon University

  • JHU undergrad (CS + ChemBioE) pursuing CMU MS in Computational Data Science
  • Research assistant at CMU Catalyst Group on Mirage (CUDA/C++/OpenMP/MPI/NVIDIA Nsight profiling)
  • Now ML Engineer at Waymo (ML Infra)
  • Not a PhD — confirmed MS on LinkedIn
  • US-based Pittsburgh/Mountain View
  • Hireability: HIGH — Mirage CUDA+Nsight profiling, Waymo MLE role, ML systems bio, LinkedIn available
KL

Kshitij Lakhani

high hireability

NVIDIA

  • MS ECE from UC Davis (no PhD)
  • Deep Learning Performance Engineer at NVIDIA Santa Clara. 26 commits to TransformerEngine (rank 13)
  • Personal repos: Intro_to_Parallel_Computing (CUDA scan/reduce/sort) and Cache_Profiling (matrix multiply optimization) — strong CUDA fundamentals
  • Career path: intern → GPU Software Engineer II at Roche → exo → NVIDIA, estimated ~3-4 years FTE
  • US-based Santa Clara
  • Hireability: HIGH — verified MS-only, active CUDA contributor at NVIDIA with kernel optimization work
MG

Mingfei Guo

high hireability

Software Engineer@NVIDIA

  • Stanford MSEE student (PKU BS alumnus) with strong GPU kernel implementation experience
  • Built flash-attention-slang — Flash Attention in Slang using Vulkan + NV CoopMat2 tensor cores (2.1x faster than PyTorch SDPA), slang-torch PyTorch integration with CUDA support, and multiple GPU-accelerated projects with Warp
  • Active recent work (Feb 2026 commits)
  • Based in Palo Alto
  • Hireability: HIGH — Final-year Master's student, likely graduating 2026, in prime transition window
PJ

Pawan Jayakumar

high hireability

UC San Diego

  • Current UCSD Masters student in ML systems and security. 6 commits to pytorch/ao (PyTorch quantization and sparsity)
  • Bio: 'Masters student at UCSD studying ML systems and security' — graduating 2025 or 2026, fits student criteria
  • No PhD
  • US-based San Diego CA. pytorch/ao contributions = hands-on quantization and sparsity at PyTorch core level, directly aligned with GPU kernel optimization for LLM inference
  • Ideal junior profile
  • Hireability: HIGH — MS ML systems student at UCSD, pytorch/ao contributor, US-based, graduating imminently
PY

Peter Yeh

high hireability
  • Mountain View CA, focused on 'Accelerating Generative AI/LLM.' 21 commits to pytorch/ao
  • Pins: pytorch, pytorch/ao, apache/tvm (ML compiler), FasterTransformer (C++ NVIDIA inference), flash-attention, gloo — exceptional technical breadth across ML compilers, GPU kernel inference, and quantization
  • No PhD signals
  • No clear senior titles
  • US-based Mountain View confirmed
  • Hireability: HIGH — outstanding breadth across LLM inference, ML compilers, and GPU kernels; junior-to-mid level; US-based
PN

Phuong Nguyen

high hireability

NVIDIA

  • DL Performance Engineer at NVIDIA Santa Clara. 106+ commits to NVIDIA/TransformerEngine with focus on FP8 fused ops, NVFP4 recipes, JAX/CUDA kernel acceleration, and distributed training (FSDP, Shardy)
  • GitHub active since 2023-2024 in TE repo, consistent with <5yr FTE
  • Stanford-educated (BS/MS, no PhD indicators)
  • US-based
  • Strong fit on GPU kernel/CUDA/JAX search criteria
  • Hireability: HIGH — joined NVIDIA ~2023, within 2-3yr transition window; title is DL Performance Engineer (not Senior/Staff)
SK

Samurdhi Karunaratne

high hireability

NVIDIA

  • Deep Learning Inference SE at NVIDIA TensorRT since January 2022 (~4.25yr FTE)
  • BS Computer Engineering, Univ of Peradeniya (2019), MS ECE UCLA (Dec 2021)
  • Pinned repos: TensorRT (C++), ONNX-TensorRT, PyTorch-TensorRT, ONNX
  • Physics Olympiad gold/silver medalist (IPhO, APhO)
  • Directly relevant to GPU kernel/LLM inference stack
  • No PhD
  • US-based (Santa Clara CA)
  • Hireability: HIGH — recent MS grad, <5yr FTE, deep TensorRT inference expertise
SS

Distributed LLM Inference@NVIDIA

Previously: ML for CAD Engineering @ Apple

San Francisco, US

  • Applied AI Software Engineer at NVIDIA Distributed LLM Inference SF (started June 2025)
  • Stanford CS MS (admitted 2022, no PhD)
  • Built Prophet — LLM inference engine optimized for head-of-line blocking (Stanford CS244b)
  • GitHub shows LLM inference engine + Rust cryptography + competitive programming (C++)
  • Zero prior FTE before NVIDIA
  • Strong US-based systems/inference engineer at tier-1 target
  • Hireability: HIGH — <1yr at NVIDIA, Stanford MS graduate, and the role is exactly Distributed LLM Inference — prime target
SX

Shanli Xing

high hireability

Undergraduate Researcher@UW / CMU Catalyst

  • UW CSE undergrad (BS CS, graduating 2026 — actively seeking PhD positions Fall 2026)
  • Core contributor to FlashInfer (CUDA kernel library for LLM serving): designed and implemented sorting-free GPU sampling kernels, co-authored FlashInfer-Bench (MLSys 2026)
  • Research @ UW SAMPL (advised by Prof. Luis Ceze) and CMU Catalyst (advised by Prof. Tianqi Chen)
  • Active kernel work at undergraduate level with published systems paper
  • US (Seattle, WA). 0 FTE
SW

Songting Wang

high hireability

Carnegie Mellon University

  • CMU ECE+CS student (Pittsburgh), contributor to mirage (Mirage Persistent Kernel: Compiling LLMs into a MegaKernel, C++ CUDA superoptimizer)
  • Pinned mirage fork prominently
  • Likely graduating 2025/2026. 1 commit on mirage (shallow but right project at CMU)
  • No visible FTE history — clean within <5yr FTE limit
  • Hireability: HIGH — current student at graduation window, no blocking employment, Pittsburgh-based at CMU
SS

Surya Subramanian

high hireability

cuBLAS Intern@NVIDIA

  • Georgia Tech CS student (BS)
  • NVIDIA cuBLAS intern writing fast matmul CUDA kernels for Blackwell via emulation on low-precision tensor cores
  • Previously Meta PyTorch distributed training + Pinterest ML infra
  • Graduating soon (2025/2026)
  • Very strong kernel signal for a student
TG

Tarushii Goel

high hireability
  • MIT undergrad CS, Class of 2026 (graduating — explicitly allowed by search)
  • McLean VA / MIT
  • Interned at NVIDIA, Modal Labs, and Exafunction (all kernel/inference companies)
  • CUTLASS forks (quantized BLAS), Triton forks, flash-linear-attention fork
  • Writes GEMM kernel + AI compiler deep-dives on blog
  • No FTE yet (student)
  • USACO Platinum competitive programmer
  • Hireability: HIGH — MIT 2026 graduating student, direct CUTLASS+Triton experience, NVIDIA+Modal internships, prime kernel engineering profile
VL

Victor Li

high hireability
  • FlexFlow contributor (flexflow-train, C++); GitHub repos include MiniAPL-release (LLVM-based dense array language compiler, Stanford cs343d course), CS349H project (Stanford ML compilers course), and Mamba SSM
  • Username victorli2002 strongly implies undergrad graduating 2025-2026
  • LLVM coursework directly on-topic for the compiler track of this query
  • No FTE history visible
  • US-based (Stanford CA implied by course repos)
  • Hireability: HIGH — current undergrad with no FTE experience, prime graduation window
WH

William Hu

high hireability

MSCS Student / Intern@Stanford / Modal Labs

  • MSCS student at Stanford (willhu@stanford.edu), focus on systems/DSLs/AI HPC
  • Co-author of HipKittens paper (AMD ROCm GPU kernels, arXiv Nov 2025): 662 commits to HazyResearch/HipKittens
  • Intern at Modal on the 'flash team' (flash attention and fast GPU kernel work)
  • Also contributor to KernelBench (Stanford Scaling Intelligence)
  • Personal site: willhu-jpg.github.io
  • Taking Stanford CS 240lx (advanced systems) Spring 2025. 0 FTE, strong research output at grad school level
WZ

William Zhou

high hireability

Triton Kernel Engineer@AMD

  • UCLA CS undergrad (graduating 2026)
  • Writing Triton kernels @ AMD (GitHub bio: 'writing Triton kernels @ AMD, ACM@UCLA')
  • Active Triton contributor via AMD
  • US (LA)
  • Strong junior kernel signal: current undergrad doing active kernel work at a major GPU company
WC

Willy Chan

high hireability

Student Researcher / Intern@Together AI / Stanford

  • Stanford BS CS student
  • Student Researcher @ Together AI (AI Kernels team)
  • NVIDIA intern (multigpu workloads/libraries)
  • Meta Superintelligence Lab (Data Foundations)
  • Stanford SAIL research: KernelBench DSL extension + scaling laws of Kernel DSLs + multigpu kernels
  • Built NVSHMEM4Py integration for Perplexity's pplx-kernels (MoE communication CUDA kernels)
  • Active contributor to perplexityai/pplx-kernels. 0 FTE, exceptional kernel breadth for an undergrad
YZ

Yiyan Zhai

high hireability

Carnegie Mellon University

  • CMU undergraduate (CS&ML) working directly with Prof. Tianqi Chen on FlashInfer-Bench (LLM inference kernel benchmarking system — best paper at MLSys 2025) and MLC-LLM
  • Pinned repos: flashinfer-bench and libCacheSim (C++ cache simulator)
  • Contributor to arXiv paper on FlashInfer-Bench (Oct 2025)
  • Strong LLM inference kernel focus — exactly the query target
  • Pittsburgh (US)
  • Undergraduate only, no PhD
  • Hireability: HIGH — CMU undergrad in graduation window (likely May 2026), no FTE experience, actively doing hands-on LLM inference systems research
ZZ

Zhongbo Zhu

high hireability

NVIDIA

  • MS CompE from UIUC (no PhD)
  • DevTech @ NVIDIA — focused on customer CUDA performance optimization
  • Amazon intern Summer 2023 indicates very recent grad, likely ~1-2 years FTE
  • Pins: TransformerEngine, Megatron-LM, Megatron-Bridge
  • NVIDIA Technical Blog author on quantized training and CUDA kernels
  • ZJU undergrad + UIUC MS
  • Likely US-based (NVIDIA DevTech teams primarily US)
  • Hireability: HIGH — MS-only, very junior FTE timeline (~1-2yr), active quantization+kernel work at NVIDIA
AQ

Abu Qader

medium hireability

Baseten

  • Software Engineer at Baseten since July 2022 (~3.5 years FTE)
  • Cornell BS 2021 (no PhD)
  • Led TensorRT-LLM Engine Builder and EAGLE 3 speculative decoding integration; authored engineering posts on production-ready speculative decoding with TRT-LLM
  • Adjacent to kernel work (uses TRT-LLM rather than writing raw CUDA/Triton kernels) but deep LLM inference optimization expertise
  • San Francisco, US-based
  • Hireability: MEDIUM — 3+ years at Baseten, entering typical transition window
AD

Aidan Do

medium hireability

Software Engineer, Inference@Fireworks AI

  • BS Software Engineering (Honours), University of Adelaide
  • Software Engineer at Fireworks AI (SF Bay Area): wrote custom bicubic interpolation CUDA kernel improving TTFT of Kimi K2.5 by 11-21%; contributed to CUTLASS and FlashInfer OSS
  • Also rewrote PyTorch CUDA upsample_bicubic2d kernel — 4.3-43x speedup (merged PR)
  • Active NVIDIA/Meta OSS contributor
  • Previously at Canva (Sydney). <2yr FTE
  • US (Bay Area)
AY

Angela Yi

medium hireability
  • Meta PyTorch Compiler SWE since 2022, specializing in torch.export and AOTInductor — core ML compiler/codegen directly relevant to LLM inference optimization
  • CMU BS CS (2018-2022), also pursuing CS at Stanford (2024-present, likely MS not PhD). ~3 years FTE, no PhD
  • Contributed to pytorch/executorch (on-device AI acceleration)
  • Spoke at PyTorch Conference 2024 on export features
  • US-based
  • Hireability: MEDIUM — 3 years at Meta, within transition window; Stanford enrollment may indicate career pivot interest
AS

Anton Shepelev

medium hireability

NVIDIA

  • Adjacent expertise: NVIDIA System Software Engineer (Tegra group) with LLVM compiler background — pinned repo is llvm-project fork; designed a custom compiler (DJ language → RISC-V → LLVM) with optimization passes (loop unrolling, constant folding, CSE)
  • USF MS+BS CS
  • SF Bay Area, US
  • Primary work is Tegra (embedded/automotive system software) rather than GPU kernels for ML/LLM inference — adjacent to the query
  • LLVM/compiler infrastructure background is highly relevant to MLIR-based kernel work
  • Hireability: MEDIUM — likely 1-3 years at NVIDIA as recent USF grad; Tegra engineers occasionally pivot to GPU compute work
AS

asfiyab-nvidia

medium hireability

NVIDIA

  • NVIDIA TensorRT engineer with MS in Computer Engineering from UC San Diego
  • Located in San Diego County, US
  • Work spans TransformerEngine, Megatron-LM, onnx-tensorrt (C++), TensorRT, and NeMo — inference optimization and GPU-accelerated architectures
  • Account created 2022, ~3 years at NVIDIA
  • No PhD
  • MS-only
  • Hireability: MEDIUM — solid MS-level NVIDIA inference engineer with TensorRT/ONNX depth; more inference-framework than raw CUDA kernel writing
AG

Aviral Goel

medium hireability

AMD

  • GPU kernel engineer at AMD ROCm working on GEMM optimization and composable_kernel (91 commits, C++ HIP kernels)
  • Personal GitHub shows strong self-directed kernel learning: hip_kernels (HIP GPU kernel examples), flash_attention_triton (Flash Attention from scratch in Triton)
  • MS Computer Science from USC
  • Prior experience was an internship at Samsung Semiconductor (Vulkan/GPU layers — excluded from FTE)
  • AMD work account (AviralGoelAMD) created December 9, 2024, confirming very recent hire
  • US-based (Austin, TX)
  • No PhD
  • Hireability: MEDIUM — only ~3 months into AMD role (Dec 2024), likely still settling in; personal GitHub bio still says USC graduate student suggesting very recent graduation
BZ

Baizhou Zhang

medium hireability

Member of Technical Staff@RadixArk (SGLang)

  • MS CS from UC San Diego, BS Intelligence Science from Peking University
  • MTS at RadixArk (SGLang core developer — attention kernels integration, large-scale EP & Speculative Decoding on GB200/GB300)
  • NVIDIA intern on CuDNN Heuristic Team (Blackwell GEMM kernel heuristics)
  • Has cuda-learn-by-practice repo showing hands-on CUDA work
  • Active FlashInfer contributor
  • No PhD
  • US (Palo Alto, CA). ~1yr FTE
CP

Colin Peppler

medium hireability
  • Meta SWE with direct GPU kernel experience: AITemplate contributor (CUDA/HIP codegen framework), GPU Techniques & Algorithms team at Meta working on custom kernels and inference optimization, plus AI Infra RecSys inference optimization
  • Virginia Tech BS CS
  • US-based (Menlo Park)
  • Career: Meta intern → FTE (likely ~2021 start), estimated ~3-4 years FTE
  • BS only, no PhD
  • Hireability: MEDIUM — ~3-4 years at Meta, within transition window; strong GPU kernel skills make this a high-priority reach-out
DW

David Wang

medium hireability

GPU Performance Engineer@Modal Labs

  • UIUC education
  • GPU performance engineer @ Modal Labs (Brooklyn, US)
  • Active GPU kernel work: forked Mirage (automatically generating fast GPU kernels without Triton/CUDA). 0-2yr FTE
  • No PhD found
  • Modal is a top GPU cloud company known for kernel-level work
DA

David Zhao Akeley

medium hireability

MIT CSAIL

  • Developer Technology Engineer at NVIDIA Santa Clara (US)
  • UCLA BS CS+Math
  • Contributed to Exo-GPU (PLDI 2026, MIT CSAIL) — CUDA-targeting scheduling compiler achieving >80% of H100 theoretical peak on GEMM
  • Deep low-level systems work (SIMD combine library in C, OpenGL voxel renderer)
  • No PhD — BS only
  • Hireability: MEDIUM — NVIDIA DevTech Engineer with recent exo-gpu collaboration suggesting active research engagement; start date unclear
EA

Emilio Andere

medium hireability

GPU Kernel Engineer@Wafer-AI

  • UChicago Mathematics BS (2021-2025)
  • GPU kernel engineer @ Wafer-AI (YC F25 startup, SF), 'fast gpu kernels @wafer-ai'
  • OSS contributor to tinygrad
  • Active in GPU kernel development at a well-funded YC startup. ~0-1yr FTE, no PhD
EA

Eric Auld

medium hireability

Systems Research, GPU Programming Engineer@Together AI

  • BS Arizona State, MA UCLA, MS NYU (all in mathematics, no CS PhD)
  • Systems Research & GPU Programming Engineer @ Together AI
  • Co-authored GPU MODE Lecture 15 on CUTLASS/CuTe
  • Active CUTLASS contributor
  • Transitioned from math to GPU programming. ~2yr FTE in kernel work
IH

ihar

medium hireability
  • US-based (SF/TN), bio: 'ml, hpc, processors microarchitecture' — directly on-target for GPU kernel/HPC query. tinygrad contributor with possible Tenstorrent affiliation
  • Account 2014 but only 17 followers, sparse public presence — no clear senior indicators
  • No PhD signals found
  • Hireability: MEDIUM — strong topical alignment (HPC, microarchitecture) and US-based, but sparse profile makes degree and seniority hard to confirm; worth outreach
IS

Ilya Sherstyuk

medium hireability

NVIDIA

  • SE at NVIDIA (Deep Learning Inference Workflows) since ~2022
  • BS CS Caltech 2022 — clear <5yr FTE (~3-4yr by March 2026)
  • No PhD
  • US-based (Santa Clara CA)
  • NVIDIA TensorRT team working on DL inference workflows
  • Caltech CS + NVIDIA inference = high-quality junior signal
  • Limited public GitHub signal but confirmed via LinkedIn
  • Hireability: MEDIUM — strong educational pedigree (Caltech CS), directly relevant team, but limited public kernel repo evidence
JT

James Thompson

medium hireability
  • Built llm.metal — LLM training in raw C/Metal Shading Language (Apple GPU compute shaders), a direct port of Karpathy llm.c to Metal, demonstrating genuine low-level GPU programming ability
  • Bay Area, CA (confirmed)
  • Bio: ML, computer graphics, GPU programming languages. 1 commit on llm.c
  • CAVEAT: No education or employer found; FTE history unknown — needs follow-up to verify <5yr constraint
  • Hireability: MEDIUM — unclear employment status; Bay Area location and hobby GPU projects suggest field-engaged and potentially available
JW

Jason S. Wang

medium hireability

Machine Learning Scientist@Tesla

Previously: Research Assistant @ Stanford University

San Francisco, US

  • UC Berkeley BS 2022, Stanford MS CS/AI graduating 2025 (just graduated)
  • At NVIDIA now
  • OpenReview confirms 'MS student, Stanford University'
  • Pins: Megatron-LM, NeMo — large-scale training systems. 2 commits to TransformerEngine. <1 year FTE
  • Strong pedigree but kernel depth less clear vs. training systems focus
  • Hireability: MEDIUM — excellent junior fit and seniority, but focus is more training systems than raw CUDA/Triton kernel writing
JJ

Jianan Ji

medium hireability

Carnegie Mellon University

  • Strong fit — contributes to Mirage megakernel compiler (CMU Catalyst Lab) and SGLang (LLM inference)
  • Confirmed MS student at CMU graduating 2026 per LinkedIn ('CMU 26 | PKU 24'), personal website confirms 'second-year master student at CMU' — not a PhD
  • US-based Pittsburgh PA
  • Hireability: MEDIUM — solid research contributions to both GPU kernel compiler and LLM serving, but limited public open-to-work signals
MF

Michael Feil

medium hireability

Baseten

  • Model Performance Engineer at Baseten, San Francisco
  • Hands-on Hopper and Blackwell kernel optimizations (candle-flash-attn-v3 repo: custom Flash Attention V3 C++ implementation)
  • Distributed LLM inference: TRT-LLM, NVIDIA Dynamo (2x inference speedup), BEI embeddings runtime
  • Created Infinity production inference engine
  • Prior: MTS at Gradient (1M-context LLM training on 512 NVIDIA GPUs), ML Engineer at Rohde & Schwarz
  • MS Robotics/ML from TU Munich (no PhD)
  • Estimated ~3-4 years FTE (within threshold)
  • San Francisco
  • Hireability: MEDIUM — ~2 years at Baseten, within transition window; no open-to-work signals
NP

Nikhil Patel

medium hireability

Software Engineer@Meta Superintelligence Labs

  • MS CS from University of Michigan
  • SWE at Meta Superintelligence Labs (SF/Menlo Park) working on GPU kernels, ML compilers, dynamic bytecode transformation
  • Authored tritonparse (TritonParse: Compiler Tracer/Visualizer/Reproducer for Triton Kernels) — pinned on GitHub
  • Active Triton and PyTorch contributor
  • Previously interned at Meta PyTorch, Roblox, Amazon, ZEISS
  • No PhD. ~1-2yr FTE
  • US (Menlo Park, CA)
RG

Raghavv Goel

medium hireability

Senior Machine Learning Researcher@Qualcomm

Previously: Robotics Researcher @ Carnegie Mellon University

San Diego, US

  • Senior DL Researcher at Qualcomm AI Research (Feb 2023-present, ~3yr FTE) on Efficient LLM team; briefly on Qualcomm Compiler Optimization team
  • Work: speculative decoding (2.4x speedup), KV cache eviction (KeyDiff) — inference algorithm research adjacent to but not direct CUDA kernel writing
  • BS ECE IIITD Delhi 2020, MS Robotics CMU 2022 (no PhD)
  • San Diego, US. ~3yr FTE total — within <5yr limit
  • Hireability: MEDIUM — within typical transition window; website position update Aug 2025 and recent papers suggest career motion
RL

Ryan Lynch

medium hireability

Autopilot Compiler Engineer@Tesla

  • BS/MS Georgia Tech
  • Tesla Autopilot Compiler Engineer working with MLIR
  • Bio: 'Tesla Autopilot Compiler Engineer. MLIR. Small team.' 2-3yr FTE
  • No PhD
  • US (likely CA)
  • Compiler/MLIR background is directly adjacent to GPU kernel work; Tesla is a serious kernel shop
SG

Shreya Gaur

medium hireability

Deep Learning Performance Engineer@NVIDIA

  • MS ECE from Purdue University (thesis on GPU kernel optimization for group equivariant CNNs)
  • Deep Learning Performance Engineer at NVIDIA Santa Clara
  • Active custom kernel work: cutlassdev repo (CUTLASS development), gemm_sample (GEMM optimizations), cuda_programming (CUDA class assignments for AlexNet layers on V100 using shared/constant memory)
  • CUTLASS contributor
  • No PhD
  • US (Santa Clara)
SH

Shushi Hong

medium hireability

Carnegie Mellon University

  • Active contributor to apache/tvm (30+ commits) and mlc-ai/mlc-llm — the two core ML compiler projects for LLM inference
  • Pinned repos are TVM, MLC-LLM, and web-llm, showing deep specialization in ML compilation and on-device LLM deployment
  • Based at CMU Pittsburgh (Catalyst group under Tianqi Chen)
  • No PhD signals; company field shows Carnegie Mellon University
  • Hireability: MEDIUM — still affiliated with CMU but graduation timeline unclear; likely MS student in 2025-2026 window
VP

Vladimir Penkin

medium hireability

Intel

  • Intel Austin TX engineer, 13 commits to intel/intel-xpu-backend-for-triton (Triton MLIR GPU compiler)
  • GitHub account 2021, pinned repos: pytorch/pytorch + intel-xpu-backend-for-triton
  • Active on Intel XPU Triton issues through Jan 2026
  • Personal website (penk.in) confirms SF location
  • Education: Taganrog South University of Radio Engineering (Russia)
  • Generalist background (Ruby/C#/TypeScript) but demonstrated Triton/PyTorch XPU focus at Intel
  • Hireability: MEDIUM — account 2021 suggests ~4yr FTE, within window; Intel role has Triton/XPU focus matching query
VS

Vyom Sharma

medium hireability
  • Senior CS student at University of Minnesota (Minneapolis MN, US)
  • Internship/student-only history, no substantial FTE
  • Pinned repos: llm.c fork (18 commits), RecurrentGemma-CUDA ('advanced kernels'), Kaleidoscope-Extended (LLVM JIT compiler in C++) — CUDA + compiler dual track exactly on-target
  • Account 2022, expected graduation ~2025-2026
  • Hireability: MEDIUM — strong student CUDA+LLVM profile but no published work or industry CUDA contributions; graduation date needs confirmation
WC

Wenxin Cheng

medium hireability

Software Engineer@Meta

  • MS CS from UCLA (2022-2024), BS CE from Beijing Jiaotong University
  • Software Engineer at Meta (Jan 2026-present: Dev Infra Triton; Apr 2024-Dec 2025: AI toolchains with CUDA focus)
  • Contributor to CUTLASS, Triton, FBGEMM, and vLLM
  • Menlo Park, CA
  • No PhD. ~2yr FTE at Meta in kernel/toolchain work
WH

William Hu

medium hireability

Member of Technical Staff@Modal

Previously: GPU Compiler Engineer @ Qualcomm

San Francisco, US

  • GPU Compiler Engineer at Qualcomm (~1.5yr FTE, Sep 2023-Feb 2025) writing OpenCL/Vulkan GPU compilers using LLVM+C++; now at Modal on the flash team (GPU kernels for AI/HPC) while completing MS CS at Stanford (graduating 2026)
  • Research at Stanford Hazy + Scaling Intelligence labs
  • BS Math-CS UCSD, MS Stanford — no PhD
  • US (Bay Area/Stanford)
  • Clean FTE history: ~1.5yr Qualcomm FTE + Modal internship (concurrent with studies)
  • Hireability: MEDIUM — current Stanford student, graduating 2026 is prime transition window; Modal role is student internship
YW

Yuzhong Wang

medium hireability

NVIDIA

  • NVIDIA engineer (account created 2024 — very recent hire)
  • Works on NeMo, TransformerEngine, and Megatron-MoE (LLM training infrastructure and GPU kernel optimization); GitHub username carries an -nvidia suffix
  • No PhD signals found
  • Likely US-based at NVIDIA primary offices
  • Strong technical stack alignment
  • Hireability: MEDIUM — relevant TransformerEngine/Megatron work but limited public signals to confirm education level; no location explicitly confirmed
ZZ

Zephyr Zhao

medium hireability

Carnegie Mellon University

  • CMU student (undergrad or new MS) confirmed by andrew.cmu.edu email
  • Has a CUDA-Manuscript learning repo and is a confirmed Mirage paper contributor (listed as Zepeng Zhao in the paper's author list)
  • Account Jan 2025 — very fresh, learning-focused CUDA repos
  • No PhD signals
  • US-based Pittsburgh
  • Hireability: MEDIUM — genuine GPU kernel interest with CMU research exposure, but early-stage contributor with minimal independent track record
ZW

Zhihao Wang

medium hireability

Earth

  • MLSys Engineer on the ByteDance Seed Infra Compiler team (Urbana, IL, US)
  • UIUC BS Math+CS + MS CS, no PhD
  • Built SuperScalar Tomasulo OoO RV32M CPU in SystemVerilog from scratch (7K+ lines), contributor to sglang-jax and exo-lang/exo
  • ByteDance Seed Compiler team does GPU compiler/distributed training work
  • Hireability: MEDIUM — recently started full-time at ByteDance post-MS; likely <2 years in role
ZZ

Zhongbo Zhu

medium hireability

DevTech AI Engineer@NVIDIA

  • BS CS from UIUC (joint degree with Zhejiang University), MS ECE from UIUC
  • NVIDIA DevTech AI engineer specializing in CUDA kernel development and FP8 quantized training
  • Previously Amazon intern (Summer 2023). ~1yr FTE in CUDA kernel dev
  • No PhD
  • US (Bay Area / Santa Clara)
DV

Daniel Vega-Myhre

low hireability

Meta

  • Deep CUDA/Triton kernel work at Meta PyTorch Core Performance team: FP8/MXFP8 LLM pretraining kernels, async TP, Triton quantization kernels — 228 commits to pytorch/ao
  • Personal pinned repo gemm written in raw CUDA
  • BS Computer Science Boise State University; MS CS at Georgia Institute of Technology (OMSCS, in progress 2024-2027)
  • No PhD
  • US-based at Meta
  • SENIORITY NOTE: ~5.5yr total FTE (Clearwater Analytics Dec 2018-Jan 2021 data engineering + Google Jun 2022-Aug 2024 ML/TPU + Meta Aug 2024-present); kernel-specific FTE is ~3.7yr at Google/Meta
  • Hireability: LOW — joined Meta PyTorch Core in Aug 2024 and was promoted to Senior SWE in Aug 2025; unlikely to be looking
LJ

Li Jiaying

low hireability

Carnegie Mellon University

  • Led multi-GPU CUDA traffic simulation at UC Berkeley (113X speedup over CPU, C++/CUDA, scaled 1-8 GPUs with ghost-zone partitioning and inter-GPU communication); 4 commits on mirage (Mirage Persistent Kernel CUDA superoptimizer, C++)
  • CMU MCDS student (BS Software Engineering, Tongji University), now FTE SWE at Snowflake
  • Contributes to arrow-rs (Rust)
  • US (Menlo Park)
  • Hireability: LOW — very recently started FTE at Snowflake, likely still settling in

Runs

  • Run #4 (Mar 13, 3:37 PM): failed, 0 qualified / 0 found. Claude Code exited with code -1:
  • Run #3 (Mar 12, 8:40 PM): failed, 0 qualified / 0 found. Claude Code exited with code -1:
  • Run #2 (Mar 12, 8:35 PM): completed, 47 qualified / 47 found
  • Run #1 (Mar 12, 8:35 PM): completed, 19 qualified / 19 found