
TPU Kernel Engineer · San Francisco, CA | New York City, NY | Seattle, WA

Run demo-tpu-kernel-engineer-stage2-1776865293 (cancelled) · Apr 22, 1:41 PM
Anthropic · 3 topics · Senior · Engineer · San Francisco, CA | New York City, NY | Seattle, WA

    Qualified Candidates (14)


    Coleman Richard Charles Hooper

    high hireability

    Graduate Student - ML Systems @ University of California, Berkeley

    Previously: Research Intern @ NVIDIA

    San Francisco, US

    • ML systems PhD at UC Berkeley (BAIR + SLICE Lab, advised by Sophia Shao in computer architecture), with high-impact efficient inference work: SqueezeLLM (320 citations, ICML 2024), KVQuant (317 citations, NeurIPS 2024), and ISSCC hardware accelerator chip papers showing low-level hardware depth
    • Full-stack optimization background maps directly to the TPU kernel role — low-precision inference, memory-bandwidth modeling, and accelerator-aware design
    • Based in SF
    • Hireability: HIGH — likely 5-6 years into PhD (website listed '4th year' but EECS tech report filed Dec 2025 indicates near/post-defense), prime industry transition window

    Mayank Mishra

    high hireability

    Graduate Student Researcher @ University of California, Berkeley

    Previously: Research Engineer-II @ MIT-IBM Watson AI Lab

    Berkeley, US

    • Strong ML systems + kernel engineering background: authored 'FlashFormer: Whole-Model Kernels for Efficient Low-Batch Inference' (2025), co-authored 'SonicMoE: IO and Tile-aware Optimizations' with Tri Dao, published Flash Attention packing + quantization papers, and maintains 'accelerated-model-architectures' repo (kernel implementations)
    • Heavy Megatron-DeepSpeed and BLOOM/StarCoder contributions
    • Berkeley (SF Bay Area)
    • Hireability: HIGH — PhD student at UC Berkeley, website cv_update + position_update on 2026-02-16 (65 days ago) strongly signals he's on the job market; collaborating with Tri Dao/Ion Stoica at Berkeley

    William Brandon

    high hireability

    PhD Student @ MIT CSAIL

    Previously: Research Assistant @ MIT Media Lab

    Cambridge, US

    • Custom GPU kernel engineer at MIT CSAIL — FlashFormer (fuses full transformer forward pass into a single kernel for low-batch inference, 2025), FLUTE (custom GEMM kernel for lookup-table quantized LLMs achieving 2-4x speedup at low batch sizes, EMNLP 2024), and Ladder-Residual (tensor parallelism with communication-computation overlap for 70B models, 2025)
    • Strong ML systems + kernel depth, working with Jonathan Ragan-Kelley
    • Currently Cambridge, MA (not target city, but PhD students relocate)
    • Hireability: HIGH — ~5-6 years into MIT PhD (started after Berkeley undergrad Fall 2020), prime graduation window

    Aniruddha Nrusimha

    medium hireability

    PhD Candidate @ MIT

    Previously: Undergrad student @ University of California Berkeley

    Boston, US

    • Strong kernel engineering background: FlashFormer (2025) fuses entire transformer forward pass into a single GPU kernel for low-batch LLM inference — directly matches TPU/GPU kernel work
    • Also CUDA QAT pretraining repo and Striped Attention distributed inference work
    • Based in Cambridge, MA (would need to relocate)
    • Hireability: MEDIUM — active PhD candidate at MIT with very recent 2025 publications (FlashFormer revised Dec 2025), no explicit graduation timeline or job market signal, mild openness via 'send me an email' on website

    Dylan Lim

    medium hireability
    • Active GPU kernel engineer — 102 commits to HazyResearch/ThunderKittens (CUDA tile primitives, 6th largest contributor), megakernels and kernel optimization at Together AI (Winter 2025), GPU kernel work at Jump Trading (Summer 2025), and CUDA/CPU kernel development on FlexFlow
    • Stanford CS BS/MS student in Palo Alto (Bay Area)
    • Job accepts GPU/accelerator experience, making his profile directly relevant despite the 'TPU' title
    • Hireability: MEDIUM — currently enrolled student with no explicit graduation announcement; website updates in Jan-Feb 2026 were cosmetic refinements, no 'open to work' signal, but impressive accelerator kernel internship track record suggests final-year transition window

    Jordan Juravsky

    medium hireability

    AI Research Scientist @ Meta

    Previously: Research Scientist Intern @ NVIDIA

    San Francisco, US

    • Direct kernel engineering experience: co-authored Megakernels (fused entire LLM forward pass into single GPU kernel on H100/B200, 78% memory bandwidth utilization, 1.5x over vLLM/SGLang), Hydragen (hardware-aware attention on A100s, 3-30x throughput), and Tokasaurus (async tensor parallelism inference engine)
    • Also built a prototype ASIC compiler at Groq (TPU-like accelerator) and worked on sparse training at Cerebras
    • PhD student at Stanford on leave, now AI Research Scientist at Meta Superintelligence Labs in SF Bay Area
    • Hireability: MEDIUM — no active signals of job searching, currently at Meta on exciting projects; PhD 'on leave' status indicates career flexibility

    Junxiong Wang

    medium hireability

    Research Scientist @ Together AI

    Previously: Researcher @ Together AI

    • Research Scientist at Together AI (post-Cornell PhD) leading adaptive speculative decoding (ATLAS), inference-time training (Aurora), and efficient RL rollout projects
    • Has custom CUDA kernel work (varlen_mamba: selective scan kernels for Mamba SSM, ~14% CUDA codebase) — ML systems focus rather than dedicated kernel engineering, but GPU kernel experience directly relevant given JD accepts GPUs/other accelerators
    • Hireability: MEDIUM — ~1 year into role at Together AI post-PhD (position update signals April-May 2025), within early transition window but no explicit open-to-work signals

    Aart J.C. Bik

    low hireability

    Distinguished Software Engineer @ NVIDIA

    Previously: Staff Software Engineer @ Google

    San Francisco, US

    • Distinguished-level compiler engineer with 17 years at Google building MLIR and the MLIR Sparsifier — the core infrastructure underlying TPU kernel compilation via XLA
    • Co-authored multiple MLIR papers (2022–2024) on sparse tensor computations and modular codegen directly relevant to accelerator kernels
    • Now at NVIDIA working on sparsity, libraries, and compilers for GPUs
    • Located in SF
    • Hireability: LOW — joined NVIDIA in September 2024 (~7 months ago), publicly described himself as 'uncomfortably excited' about the new role and is actively hiring for his own NVIDIA team

    Amir Yazdanbakhsh

    low hireability

    Research Scientist @ DeepMind

    Previously: Research Scientist @ Google

    San Francisco, US

    • Co-founder and co-lead of ML for Computer Architecture team at Google DeepMind; PhD Georgia Tech in computer architecture with Microsoft PhD and Qualcomm Innovation fellowships
    • Published on TPU evaluation (Edge TPU, 202 citations), attention dataflow optimization (FLAT), and neural network quantization (ReLeQ)
    • Co-authored Gemini 2.5 (2025)
    • Strong direct match for TPU kernel engineering — deep hardware accelerator expertise plus ML systems background
    • Hireability: LOW — ~6 years at Google, team lead, just shipped Gemini 2.5; no open-to-work signals detected anywhere

    Daniel Y Fu

    low hireability

    VP, Kernels @ Together AI

    Previously: Distinguished Research Scientist @ Together AI

    San Francisco, US

    • Co-author of FlashAttention and FlashFFTConv (Tensor Core convolutions); lead on ThunderKittens CUDA kernel framework; VP Kernels at Together AI
    • Exceptionally deep GPU kernel optimization expertise highly transferable to TPU
    • Based in SF
    • Hireability: LOW — GitHub bio explicitly states 'Incoming assistant professor at UCSD,' indicating a committed academic trajectory that makes industry recruitment very unlikely

    Haocheng Xi

    low hireability

    MLSys Researcher @ University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    • Second-year PhD at UC Berkeley (BS from Tsinghua's elite Yao Class), focused on ML systems + efficient ML
    • Direct CUDA kernel work (TensorCore HGEMM optimization, CUDA algorithm optimization repos)
    • Multiple papers on low-precision training/inference (INT4, INT8, FP8, MXFP4) directly relevant to TPU low-precision kernel work
    • NVIDIA Research Intern (May–Aug 2025) on FP8 training + efficient inference with MIT Han Lab
    • Prolific: 4+ first-author papers in 2026; h-index 7
    • Berkeley, CA (Bay Area/near SF)
    • Hireability: LOW — second-year PhD, likely 3+ years until graduation, not yet in the job market

    Sehoon Kim

    low hireability

    Member of Technical Staff @ xAI

    Previously: Machine Learning Engineer @ Narada

    US

    • Expert in LLM inference optimization: SqueezeLLM (dense-and-sparse quantization, ICML 2024), KVQuant (KV cache quantization for 10M-context inference), I-BERT (integer-only quantization, ICML 2021 Oral), BigLittleDecoder (speculative decoding, NeurIPS 2023)
    • Berkeley PhD in CS (2020-2024) under Prof. Keutzer (hardware-aware ML); ECE B.S. from SNU; MLCommons ML and Systems Rising Star 2024
    • Work is primarily algorithmic (PyTorch-level) rather than explicit TPU kernel implementation, but directly maps to the role's low-precision inference and high-throughput sampling projects
    • Now MTS at xAI on Grok in Palo Alto (SF area)
    • Hireability: LOW — ~1.5 years post-PhD at xAI, no open-to-opportunities signals in bio, pipeline, or LinkedIn

    Tim Dettmers

    low hireability

    Assistant Professor @ Carnegie Mellon University

    Previously: Researcher @ Allen Institute for Artificial Intelligence

    • Creator of bitsandbytes (CUDA kernels for 4-bit/8-bit LLM quantization), QLoRA, and LLM.int8() — deep expertise in low-level GPU kernel optimization for ML inference, exactly the low-precision/hardware-efficiency work Anthropic's TPU role targets
    • H-index 18
    • Based in Seattle
    • Hireability: LOW — tenure-track Assistant Professor at CMU whose own website says that 'as a professor' he 'does not write much code anymore', signaling full academic commitment and making an industry move very unlikely

    Yida Wang

    low hireability

    Principal Scientist @ Amazon

    San Francisco, US

    • Principal Scientist at Amazon working on ML accelerator kernels and compilers — MLSys 2025 papers on attention kernel optimization (FastTree) and scalable inference (ScaleFusion), KDD 2024 paper on AI accelerator inference optimization, co-authored Ansor (557 citations, tensor program auto-tuning) and Alpa (523 citations, distributed DL parallelism)
    • TVM contributor
    • Deep expertise in tensor compilation, quantized inference, and hardware-aware kernel optimization maps directly to TPU kernel work
    • SF Bay Area
    • Hireability: LOW — actively posting Amazon job openings, no open-to-work signals, deeply embedded in Amazon's ML accelerator team with no pipeline signals of job market activity
