
junior CUDA kernel engineer in US, no PhD required

Completed · 22 qualified · 2 runs · Apr 27, 4:37 PM · junior-cuda-kernel-engineer-in-us-no-phd-required-1777307862
Parsed: NVIDIA · 2 topics · Junior · IC · no PhD · US
Pipeline stages:
1. Generating seed nodes — 0 proposed
2. Explored 0 queries — 0/0 done
3. Expanding nodes — queued
4. Qualifying candidates — queued

    Qualified Candidates (19)

Aditya Tomar

medium hireability

Undergraduate Student @ UC Berkeley

Previously: Researcher @ PSSG

US

    • UC Berkeley EECS undergrad (3rd year) doing GPU-focused LLM inference optimization: QuantSpec (speculative decoding + quantized KV cache), XQuant (KV cache rematerialization), and scalable LLM training on GPU supercomputers
    • Upcoming NVIDIA Applied Deep Learning Research Intern on Megatron-LM (May–Aug 2026)
    • Research at BAIR under Keutzer/SqueezeAI lab on efficient deep learning
    • Based in US
    • No PhD
    • Hireability: MEDIUM — still an undergrad (graduating ~2027), about to start NVIDIA internship, not actively job-seeking but strong new-grad pipeline candidate
Ariel Lubonja

medium hireability

PhD Student @ Johns Hopkins University

Previously: Full Stack Developer @ jBoxers

Baltimore, US

    • PhD student at JHU with direct GPU/CUDA kernel optimization experience — published 'Efficient batched CPU/GPU implementation of orthogonal matching pursuit' (2024) and working on 'Scaling OMP to High-Performance CPUs & GPUs'
    • Also has tensor parallelism work on DNNs and experience on the Rockfish HPC supercomputer (#496 globally)
    • Research expertise listed as Parallel Computing, Computer Vision, Graph Embeddings
    • Baltimore, US
    • Hireability: MEDIUM — mid-PhD student (GitHub account created June 2023, ~2nd–3rd year), no explicit job-seeking signals, no LinkedIn changes or website updates in 180 days, but may be open to industry internships or full-time roles
Boyuan Feng

medium hireability

SWE @ PyTorch

Previously: Researcher @ Meta

    • Strong CUDA kernel engineer — PhD at UCSB working on GPU/Tensor Core acceleration; SWE at PyTorch with FlexAttention (MLSys 2025, fused attention kernel generation), TC-GNN (Tensor Core sparse GPU ops, USENIX ATC 2023), APNN-TC (Ampere GPU Tensor Cores), EGEMM-TC; APNN-TC repo in CUDA language
    • Pinned repos include pytorch/pytorch and vllm
    • US-based
    • Hireability: MEDIUM — currently holds dual PhD + PyTorch SWE role with no explicit 'looking' signals, but PhD student nearing multi-year mark is a natural transition window
Gabriele Oliaro

medium hireability

CS PhD Student @ Snowflake AI Research

Previously: Research Scientist Intern @ Snowflake

    • ML systems PhD at CMU (4th year, exp. 2027) with direct CUDA kernel work: wrote a fused softmax/argmax CUDA kernel (own repo), co-authored Korch (kernel orchestration for tensor programs, ASPLOS'24), and forks flashinfer (CUDA kernel library for LLM serving)
    • Based in US (Pittsburgh, PA / San Mateo, CA)
    • Research expertise spans ML systems and parallel computing with GPU kernel exposure
    • Hireability: MEDIUM — 4th year PhD student, graduation expected 2027; currently research-interning at Snowflake AI Research with no open-to-work signals, but approaching mid-program transition window
Haocheng Xi

medium hireability

MLSys Researcher @ University of California, Berkeley

Previously: Research Intern @ NVIDIA

Berkeley, US

    • PhD student at UC Berkeley BAIR Lab (2nd year, started 2024), MLsys researcher specializing in GPU kernel optimization — website explicitly states 'Familiar with CUDA customized kernels'
    • CUDA repos include `how-to-optim-algorithm-in-cuda` and `cuda-tensorcore-hgemm`
    • Published SpargeAttn (sparse attention CUDA kernels), COAT (FP8 training kernels), and INT4 training work
    • NVIDIA internship summer 2025
    • US-based (Berkeley, CA)
    • No PhD yet
    • Hireability: MEDIUM — only ~1.5 years into PhD (class of 2024, ~3-4 years remaining), no explicit job search signals, but NVIDIA internship shows industry openness

    Hongtao Yu

    medium hireability
    • Active Triton/GPU kernel compiler engineer at Meta in Seattle. 56 commits on openai/triton + recent active contributor to facebookexperimental/triton (MXFP8 Flash Attention kernels, Blackwell GPU modulo scheduling)
    • Also contributes to facebookresearch/CUTracer (CUDA dynamic binary instrumentation tool)
    • Bio: 'Compiler engineer'
    • No PhD evident — industry-focused compiler/GPU kernel work
    • Hireability: MEDIUM — actively shipping GPU kernel code at Meta today (commits 2026-04-27); LLVM/profgen commits from late 2023 suggest ~2.5 years tenure; within typical transition window but no job-seeking signals detected
Jung Hwan Heo

medium hireability

Researcher @ University of Southern California

    • Direct CUDA kernel experience: EE451 USC project accelerating multi-head attention in ViTs using CUDA shared-memory kernels (repo is 96% CUDA)
    • ICLR 2024 paper on LLM low-bit weight quantization
    • Research expertise in model compression, pruning, and efficient DL
    • Now at Together AI in SF (GPU inference company)
    • No PhD signals (h-index 3, MS-level work)
    • Hireability: MEDIUM — recently transitioned from USC to Together AI (jheo@together.ai email), likely in role <12 months

    Kyle Wang

    medium hireability
    • 61 commits on openai/triton; GitHub bio explicitly lists Triton, MLIR, LLVM; software engineer at AMD San Jose working on deep learning compilers/runtimes; MS UCLA 2024, no PhD; Santa Clara, CA
    • Hireability: MEDIUM — joined AMD ~2024 post-graduation, ~1-2 years into role, approaching typical transition window but no explicit open-to-work signals

    Lixun Zhang

    medium hireability
    • Active AMD Triton contributor (59 commits) with deep AMD GPU kernel specialization: MFMA instruction optimization, Flash Attention kernel tuning, buffer_load_to_lds synchronization for CDNA3/CDNA4, gfx1250 TDM descriptor work
    • Based in Austin, TX at AMD
    • No PhD indicators; UT Austin compilers TA background (CS375)
    • Most recent commit Apr 23 2026
    • Note: work depth suggests mid-level rather than junior
    • Hireability: MEDIUM — actively committing to Triton as of Apr 2026, currently employed at AMD, no explicit open-to-work signals, tenure unknown

    Maksim Levental

    medium hireability
    • 44 commits on openai/triton + active MLIR/LLVM contributor (llvm/eudsl maintainer, LLVM SparseTensor work merged into llvm-project)
    • At Apple in Cupertino working on GPU compiler infrastructure (DSLs, compilers, accelerator architectures)
    • PhD student at UChicago (Ian Foster) — likely completing or completed — with deep hands-on GPU kernel compiler experience via Triton and IREE
    • Matches junior CUDA kernel engineer profile in US
    • Hireability: MEDIUM — at Apple doing well-aligned work, no open-to-work signals, but estimated 2-3 years in role (within typical transition window); active GitHub commits through April 2026

    Pengzhan Zhao

    medium hireability
    • 45 commits on openai/triton and merged PR in ROCm/triton implementing AMD gfx1250 MXFP FlashAttention kernels (MI350)
    • Bio: 'working GPU compiler and kernels' at AMD, SF Bay Area
    • UCLA systems lab background (2022); no PhD
    • Hireability: MEDIUM — actively employed at AMD with very recent commits (March 2026) but no explicit open-to-work signals; within reasonable transition window
Xin Tong

medium hireability

Software Engineer @ Google

Previously: Deep Learning Engineer @ NVIDIA

San Francisco, US

    • GPU/parallel computing background (research expertise includes GPU and parallel computing; authored 'Parallel Triangle Clipping on GPU' 2014)
    • Software Engineer at Google in SF; source flagged via CUDA signals in DB
    • Primarily visualization/graphics background but GPU engineering is core
    • Hireability: MEDIUM — ~3.7 years at Google (within typical transition window), no open-to-work signals detected
Xueshen Liu

medium hireability

Student Researcher @ Google

Previously: Student Researcher @ Google

Ann Arbor, US

    • 4th-year PhD candidate at U of Michigan with genuine GPU/CUDA work: foundry-org/foundry (C++, CUDA graph materialization for LLM serving cold-start), mm2-gb (GPU-accelerated DNA aligner, ACM BCB 2024 oral), and HeterMoE (heterogeneous GPU training)
    • Research expertise in parallel computing + ML systems
    • US (Ann Arbor)
    • No PhD yet
    • Hireability: MEDIUM — completed Google Student Researcher internship Dec 2025, uploaded CV to website Dec 2025 and updated about page indicating job market preparation; still in PhD program (4th year, likely 1-2 years to graduation)
Zhihao Zhang

medium hireability

Ph.D. Student @ Carnegie Mellon University

Previously: MS Student @ Carnegie Mellon University

Pittsburgh, US

    • PhD student at CMU Catalyst building GPU kernels for LLM systems — co-authored Mirage Persistent Kernel (OSDI 2026, compiler for auto-generating CUDA kernels without Triton/CUDA), SpecInfer (ASPLOS 2024), TidalDecode (ICLR 2025)
    • Forks FlashInfer (CUDA kernel library) and Mirage
    • Based in Pittsburgh, PA
    • Hireability: MEDIUM — likely nearing PhD completion (OSDI 2026 paper accepted), LinkedIn profile went completely empty in Jan 2026 snapshot (possible job-transition prep), website says 'open to collaboration'
Zhongming Yu

medium hireability

Student Researcher @ Google

Previously: Machine Learning Engineer @ Intel

San Francisco, US

    • Active GPU/CUDA kernel researcher with multiple directly relevant papers: TorchSparse++ (sparse convolution on GPUs, 46 citations), GeoT (efficient segment reduction on GPU in C++/CUDA), SpMM heuristics on GPUs (31 citations)
    • PhD student at UCSD CSE (4th year, started 2022), Student Researcher at Google SF since April 2025
    • H-index 10
    • Hireability: MEDIUM — currently at Google as Student Researcher, ~4th year into PhD with no explicit openness signals, but within typical industry transition window for PhD students
Zhuoming Chen

medium hireability

Ph.D. Student @ Carnegie Mellon University

Previously: Research Intern @ Meta

New York, US

    • LLM inference systems researcher at CMU working on GPU memory optimization and speculative decoding (SpecInfer 381 citations, MagicPIG LSH sparse attention, Mini-Sequence Transformer memory-efficient long-sequence training)
    • Forks and works with CUDA kernel libraries (flashinfer, flash-attention)
    • Primary work is algorithmic/systems-level Python rather than authoring low-level CUDA kernels, making this adjacent to the query rather than a direct match
    • PhD student since 2023 (no PhD yet), US-based
    • Hireability: MEDIUM — CV update 73 days ago (Feb 2026) suggests active career motion, ~3 years into PhD, recent Meta FAIR internship (2025)
Da Yan

low hireability

Member of Technical Staff @ Anthropic

Previously: Independent Contractor @ OpenAI

New York, US

    • 44 commits on openai/triton; authored `turingas` (NVIDIA Volta/Turing GPU assembler, 241 stars) and CUDA-Winograd (fast CUDA kernels)
    • Research expertise in GPU performance optimization and GPU compiler
    • MTS at Anthropic (NY), US
    • Hireability: LOW — ~43 months at Anthropic with no open-to-work signals or job-change activity detected
Jiaming Tang

low hireability

Ph.D. Student @ MIT

Previously: Undergraduate Researcher @ SJTU EPCC Lab

Boston, US

    • MIT Han Lab PhD student (2nd year) with strong CUDA kernel background — authored Quest (ICML 2024, CUDA sparse KV-cache kernel for LLM inference) and co-authored AWQ (MLSys 2024 Best Paper, 1503 citations, integrated into vLLM/TensorRT-LLM)
    • Based in Cambridge, MA
    • Hireability: LOW for full-time — early in PhD under Song Han, recently added RA role at Physical Intelligence; likely 3+ years from graduation with no job-seeking signals, but strong internship prospect
Mark Saroufim

low hireability

Software Engineer @ Meta

Previously: ML Engineer @ Graphcore

San Francisco, US

    • Co-founded GPU MODE (CUDA learning community with 100+ lectures) and recently co-founded Core Automation
    • Deep CUDA/GPU kernel expertise: authored KernelBot paper (GPU kernel competition platform), active PyTorch AO contributor, GEMM eval projects, helion ML kernel DSL
    • Bay Area, US; no PhD
    • Overqualified for a junior role but exactly the CUDA profile sought
    • Hireability: LOW — recently co-founded Core Automation startup, actively building his own company

    Runs

    #2 · completed · 0 qualified / 0 found · Apr 27, 4:51 PM
    #1 · cancelled · 0 qualified / 0 found · Apr 27, 4:37 PM