
TPU Kernel Engineer · San Francisco, CA | New York City, NY | Seattle, WA

Run demo-tpu-kernel-engineer-stage2-1776865293 (cancelled) · Apr 22, 1:41 PM
Anthropic · 3 topics · Senior · Engineer · San Francisco, CA | New York City, NY | Seattle, WA

    Qualified Candidates (14)


    Coleman Richard Charles Hooper

    high hireability

    Graduate Student - ML Systems @ University of California, Berkeley

    Previously: Research Intern @ NVIDIA

    San Francisco, US

    • ML systems PhD at UC Berkeley (BAIR + SLICE Lab, advised by Sophia Shao in computer architecture), with high-impact efficient inference work: SqueezeLLM (320 citations, ICML 2024), KVQuant (317 citations, NeurIPS 2024), and ISSCC hardware accelerator chip papers showing low-level hardware depth
    • Full-stack optimization background maps directly to the TPU kernel role — low-precision inference, memory-bandwidth modeling, and accelerator-aware design
    • Based in SF
    • Hireability: HIGH — likely 5-6 years into PhD (website listed '4th year' but EECS tech report filed Dec 2025 indicates near/post-defense), prime industry transition window

    Mayank Mishra

    high hireability

    Graduate Student Researcher @ University of California, Berkeley

    Previously: Research Engineer-II @ MIT-IBM Watson AI Lab

    Berkeley, US

    • Strong ML systems + kernel engineering background: authored 'FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference' (2025), co-authored 'SonicMoE: IO and Tile-aware Optimizations' with Tri Dao, published Flash Attention packing + quantization papers, and maintains 'accelerated-model-architectures' repo (kernel implementations)
    • Heavy Megatron-DeepSpeed and BLOOM/StarCoder contributions
    • Berkeley (SF Bay Area)
    • Hireability: HIGH — PhD student at UC Berkeley, website cv_update + position_update on 2026-02-16 (65 days ago) strongly signals he's on the job market; collaborating with Tri Dao/Ion Stoica at Berkeley

    William Brandon

    high hireability

    PhD Student @ MIT CSAIL

    Previously: Research Assistant @ MIT Media Lab

    Cambridge, US

    • Custom GPU kernel engineer at MIT CSAIL — FlashFormer (fuses full transformer forward pass into a single kernel for low-batch inference, 2025), FLUTE (custom GEMM kernel for lookup-table quantized LLMs achieving 2-4x speedup at low batch sizes, EMNLP 2024), and Ladder-Residual (tensor parallelism with communication-computation overlap for 70B models, 2025)
    • Strong ML systems + kernel depth, working with Jonathan Ragan-Kelley
    • Currently Cambridge, MA (not target city, but PhD students relocate)
    • Hireability: HIGH — ~5-6 years into MIT PhD (started after Berkeley undergrad Fall 2020), prime graduation window

    Aniruddha Nrusimha

    medium hireability

    PhD Candidate @ MIT

    Previously: Undergrad student @ University of California Berkeley

    Boston, US

    • Strong kernel engineering background: FlashFormer (2025) fuses entire transformer forward pass into a single GPU kernel for low-batch LLM inference — directly matches TPU/GPU kernel work
    • Also CUDA QAT pretraining repo and Striped Attention distributed inference work
    • Based in Cambridge, MA (would need to relocate)
    • Hireability: MEDIUM — active PhD candidate at MIT with very recent 2025 publications (FlashFormer revised Dec 2025), no explicit graduation timeline or job market signal, mild openness via 'send me an email' on website

    Dylan Lim

    medium hireability
    • Active GPU kernel engineer — 102 commits to HazyResearch/ThunderKittens (CUDA tile primitives, 6th largest contributor), megakernels and kernel optimization at Together AI (Winter 2025), GPU kernel work at Jump Trading (Summer 2025), and CUDA/CPU kernel development on FlexFlow
    • Stanford CS BS/MS student in Palo Alto (Bay Area)
    • Job accepts GPU/accelerator experience, making his profile directly relevant despite the 'TPU' title
    • Hireability: MEDIUM — currently enrolled student with no explicit graduation announcement; website updates in Jan-Feb 2026 were cosmetic refinements, no 'open to work' signal, but impressive accelerator kernel internship track record suggests final-year transition window

    Jordan Juravsky

    medium hireability

    AI Research Scientist @ Meta

    Previously: Research Scientist Intern @ NVIDIA

    San Francisco, US

    • Direct kernel engineering experience: co-authored Megakernels (fused entire LLM forward pass into single GPU kernel on H100/B200, 78% memory bandwidth utilization, 1.5x over vLLM/SGLang), Hydragen (hardware-aware attention on A100s, 3-30x throughput), and Tokasaurus (async tensor parallelism inference engine)
    • Also built a prototype ASIC compiler at Groq (TPU-like accelerator) and worked on sparse training at Cerebras
    • PhD student at Stanford on leave, now AI Research Scientist at Meta Superintelligence Labs in SF Bay Area
    • Hireability: MEDIUM — no active signals of job searching, currently at Meta on exciting projects; PhD 'on leave' status indicates career flexibility

    Junxiong Wang

    medium hireability

    Research Scientist @ Together AI

    Previously: Researcher @ Together AI

    • Research Scientist at Together AI (post-Cornell PhD) leading adaptive speculative decoding (ATLAS), inference-time training (Aurora), and efficient RL rollout projects
    • Has custom CUDA kernel work (varlen_mamba: selective scan kernels for Mamba SSM, ~14% CUDA codebase) — ML systems focus rather than dedicated kernel engineering, but GPU kernel experience directly relevant given JD accepts GPUs/other accelerators
    • Hireability: MEDIUM — ~1 year into role at Together AI post-PhD (position update signals April-May 2025), within early transition window but no explicit open-to-work signals

    Aart J.C. Bik

    low hireability

    Distinguished Software Engineer @ NVIDIA

    Previously: Staff Software Engineer @ Google

    San Francisco, US

    • Distinguished-level compiler engineer with 17 years at Google building MLIR and the MLIR Sparsifier — the core infrastructure underlying TPU kernel compilation via XLA
    • Co-authored multiple MLIR papers (2022–2024) on sparse tensor computations and modular codegen directly relevant to accelerator kernels
    • Now at NVIDIA working on sparsity, libraries, and compilers for GPUs
    • Located in SF
    • Hireability: LOW — joined NVIDIA in September 2024 (~7 months ago), publicly described himself as 'uncomfortably excited' about the new role and is actively hiring for his own NVIDIA team

    Amir Yazdanbakhsh

    low hireability

    Research Scientist @ DeepMind

    Previously: Research Scientist @ Google

    San Francisco, US

    • Co-founder and co-lead of ML for Computer Architecture team at Google DeepMind; PhD Georgia Tech in computer architecture with Microsoft PhD and Qualcomm Innovation fellowships
    • Published on TPU evaluation (Edge TPU, 202 citations), attention dataflow optimization (FLAT), and neural network quantization (ReLeQ)
    • Co-authored Gemini 2.5 (2025)
    • Strong direct match for TPU kernel engineering — deep hardware accelerator expertise plus ML systems background
    • Hireability: LOW — ~6 years at Google, team lead, just shipped Gemini 2.5; no open-to-work signals detected anywhere

    Daniel Y Fu

    low hireability

    VP, Kernels @ Together AI

    Previously: Distinguished Research Scientist @ Together AI

    San Francisco, US

    • Co-author of FlashAttention and FlashFFTConv (Tensor Core convolutions); lead on ThunderKittens CUDA kernel framework; VP Kernels at Together AI
    • Exceptionally deep GPU kernel optimization expertise highly transferable to TPU
    • Based in SF
    • Hireability: LOW — GitHub bio explicitly states 'Incoming assistant professor at UCSD,' indicating a committed academic trajectory that makes industry recruitment very unlikely

    Haocheng Xi

    low hireability

    MLSys Researcher @ University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    • Second-year PhD at UC Berkeley (BS from Tsinghua's elite Yao Class), focused on ML systems + efficient ML
    • Direct CUDA kernel work (TensorCore HGEMM optimization, CUDA algorithm optimization repos)
    • Multiple papers on low-precision training/inference (INT4, INT8, FP8, MXFP4) directly relevant to TPU low-precision kernel work
    • NVIDIA Research Intern (May–Aug 2025) on FP8 training + efficient inference with MIT Han Lab
    • Prolific: 4+ first-author papers in 2026; h-index 7
    • Berkeley, CA (Bay Area/near SF)
    • Hireability: LOW — second-year PhD, likely 3+ years until graduation, not yet in the job market

    Sehoon Kim

    low hireability

    Member of Technical Staff @ xAI

    Previously: Machine Learning Engineer @ Narada

    US

    • Expert in LLM inference optimization: SqueezeLLM (dense-and-sparse quantization, ICML 2024), KVQuant (KV cache quantization for 10M-context inference), I-BERT (integer-only quantization, ICML 2021 Oral), BigLittleDecoder (speculative decoding, NeurIPS 2023)
    • Berkeley PhD in CS (2020-2024) under Prof. Keutzer (hardware-aware ML); ECE B.S. from SNU; MLCommons ML and Systems Rising Star 2024
    • Work is primarily algorithmic (PyTorch-level) rather than explicit TPU kernel implementation, but directly maps to the role's low-precision inference and high-throughput sampling projects
    • Now MTS at xAI on Grok in Palo Alto (SF area)
    • Hireability: LOW — ~1.5 years post-PhD at xAI, no open-to-opportunities signals in bio, pipeline, or LinkedIn

    Tim Dettmers

    low hireability

    Assistant Professor @ Carnegie Mellon University

    Previously: Researcher @ Allen Institute for Artificial Intelligence

    • Creator of bitsandbytes (CUDA kernels for 4-bit/8-bit LLM quantization), QLoRA, and LLM.int8() — deep expertise in low-level GPU kernel optimization for ML inference, exactly the low-precision/hardware-efficiency work Anthropic's TPU role targets
    • H-index 18
    • Based in Seattle
    • Hireability: LOW — tenure-track Assistant Professor at CMU whose own website says that 'as a professor' he 'does not write much code anymore', signaling full academic commitment and making an industry move very unlikely

    Yida Wang

    low hireability

    Principal Scientist @ Amazon

    San Francisco, US

    • Principal Scientist at Amazon working on ML accelerator kernels and compilers — MLSys 2025 papers on attention kernel optimization (FastTree) and scalable inference (ScaleFusion), KDD 2024 paper on AI accelerator inference optimization, co-authored Ansor (557 citations, tensor program auto-tuning) and Alpa (523 citations, distributed DL parallelism)
    • TVM contributor
    • Deep expertise in tensor compilation, quantized inference, and hardware-aware kernel optimization maps directly to TPU kernel work
    • SF Bay Area
    • Hireability: LOW — actively posting Amazon job openings, no open-to-work signals, deeply embedded in Amazon's ML accelerator team with no pipeline signals of job market activity
