
junior CUDA kernel engineer in US, no PhD required

Completed · 22 qualified · 2 runs · Apr 27, 4:37 PM · junior-cuda-kernel-engineer-in-us-no-phd-required-1777307862
Parsed: NVIDIA · 2 topics · Junior · IC · no PhD · US
Pipeline stages:
1. Generating seed nodes — 0 proposed
2. Explored 0 queries — 0/0 done
3. Expanding nodes — queued
4. Qualifying candidates — queued

    Qualified Candidates (19)

Aditya Tomar

medium hireability

Undergraduate Student @ UC Berkeley

Previously: Researcher @ PSSG

US

    • UC Berkeley EECS undergrad (3rd year) doing GPU-focused LLM inference optimization: QuantSpec (speculative decoding + quantized KV cache), XQuant (KV cache rematerialization), and scalable LLM training on GPU supercomputers
    • Upcoming NVIDIA Applied Deep Learning Research Intern on Megatron-LM (May–Aug 2026)
    • Research at BAIR under Keutzer/SqueezeAI lab on efficient deep learning
    • Based in US
    • No PhD
    • Hireability: MEDIUM — still an undergrad (graduating ~2027), about to start NVIDIA internship, not actively job-seeking but strong new-grad pipeline candidate
Ariel Lubonja

medium hireability

PhD Student @ Johns Hopkins University

Previously: Full Stack Developer @ jBoxers

Baltimore, US

    • PhD student at JHU with direct GPU/CUDA kernel optimization experience — published 'Efficient batched CPU/GPU implementation of orthogonal matching pursuit' (2024) and working on 'Scaling OMP to High-Performance CPUs & GPUs'
    • Also has tensor parallelism work on DNNs and experience on the Rockfish HPC supercomputer (#496 globally)
    • Research expertise listed as Parallel Computing, Computer Vision, Graph Embeddings
    • Baltimore, US
    • Hireability: MEDIUM — mid-PhD student (GitHub account created June 2023, ~2nd–3rd year), no explicit job-seeking signals, no LinkedIn changes or website updates in 180 days, but may be open to industry internships or full-time roles
Boyuan Feng

medium hireability

SWE @ PyTorch

Previously: Researcher @ Meta

    • Strong CUDA kernel engineer — PhD at UCSB working on GPU/Tensor Core acceleration; SWE at PyTorch with FlexAttention (MLSys 2025, fused attention kernel generation), TC-GNN (Tensor Core sparse GPU ops, USENIX ATC 2023), APNN-TC (Ampere GPU Tensor Cores), EGEMM-TC; APNN-TC repo in CUDA language
    • Pinned repos include pytorch/pytorch and vllm
    • US-based
    • Hireability: MEDIUM — currently holds dual PhD + PyTorch SWE role with no explicit 'looking' signals, but PhD student nearing multi-year mark is a natural transition window
Gabriele Oliaro

medium hireability

CS PhD Student @ Snowflake AI Research

Previously: Research Scientist Intern @ Snowflake

    • ML systems PhD at CMU (4th year, exp. 2027) with direct CUDA kernel work: wrote a fused softmax/argmax CUDA kernel (own repo), co-authored Korch (kernel orchestration for tensor programs, ASPLOS'24), and forks flashinfer (CUDA kernel library for LLM serving)
    • Based in US (Pittsburgh, PA / San Mateo, CA)
    • Research expertise spans ML systems and parallel computing with GPU kernel exposure
    • Hireability: MEDIUM — 4th year PhD student, graduation expected 2027; currently research-interning at Snowflake AI Research with no open-to-work signals, but approaching mid-program transition window
Haocheng Xi

medium hireability

MLSys Researcher @ University of California, Berkeley

Previously: Research Intern @ NVIDIA

Berkeley, US

    • PhD student at UC Berkeley BAIR Lab (2nd year, started 2024), MLsys researcher specializing in GPU kernel optimization — website explicitly states 'Familiar with CUDA customized kernels'
    • CUDA repos include `how-to-optim-algorithm-in-cuda` and `cuda-tensorcore-hgemm`
    • Published SpargeAttn (sparse attention CUDA kernels), COAT (FP8 training kernels), and INT4 training work
    • NVIDIA internship summer 2025
    • US-based (Berkeley, CA)
    • No PhD yet
    • Hireability: MEDIUM — only ~1.5 years into PhD (class of 2024, ~3-4 years remaining), no explicit job search signals, but NVIDIA internship shows industry openness

    Hongtao Yu

    medium hireability
    • Active Triton/GPU kernel compiler engineer at Meta in Seattle. 56 commits on openai/triton + recent active contributor to facebookexperimental/triton (MXFP8 Flash Attention kernels, Blackwell GPU modulo scheduling)
    • Also contributes to facebookresearch/CUTracer (CUDA dynamic binary instrumentation tool)
    • Bio: 'Compiler engineer'
    • No PhD evident — industry-focused compiler/GPU kernel work
    • Hireability: MEDIUM — actively shipping GPU kernel code at Meta today (commits 2026-04-27); LLVM/profgen commits from late 2023 suggest ~2.5 years tenure; within typical transition window but no job-seeking signals detected
Jung Hwan Heo

medium hireability

Researcher @ University of Southern California

    • Direct CUDA kernel experience: EE451 USC project accelerating multi-head attention in ViTs using CUDA shared-memory kernels (repo is 96% CUDA)
    • ICLR 2024 paper on LLM low-bit weight quantization
    • Research expertise in model compression, pruning, and efficient DL
    • Now at Together AI in SF (GPU inference company)
    • No PhD signals (h-index 3, MS-level work)
    • Hireability: MEDIUM — recently transitioned from USC to Together AI (jheo@together.ai email), likely in role <12 months

    Kyle Wang

    medium hireability
    • 61 commits on openai/triton; GitHub bio explicitly lists Triton, MLIR, LLVM; software engineer at AMD San Jose working on deep learning compilers/runtimes; MS UCLA 2024, no PhD; Santa Clara, CA
    • Hireability: MEDIUM — joined AMD ~2024 post-graduation, ~1-2 years into role, approaching typical transition window but no explicit open-to-work signals

    Lixun Zhang

    medium hireability
    • Active AMD Triton contributor (59 commits) with deep AMD GPU kernel specialization: MFMA instruction optimization, Flash Attention kernel tuning, buffer_load_to_lds synchronization for CDNA3/CDNA4, gfx1250 TDM descriptor work
    • Based in Austin, TX at AMD
    • No PhD indicators; UT Austin compilers TA background (CS375)
    • Most recent commit Apr 23 2026
    • Note: work depth suggests mid-level rather than junior
    • Hireability: MEDIUM — actively committing to Triton as of Apr 2026, currently employed at AMD, no explicit open-to-work signals, tenure unknown

    Maksim Levental

    medium hireability
    • 44 commits on openai/triton + active MLIR/LLVM contributor (llvm/eudsl maintainer, LLVM SparseTensor work merged into llvm-project)
    • At Apple in Cupertino working on GPU compiler infrastructure (DSLs, compilers, accelerator architectures)
    • PhD student at UChicago (Ian Foster) — likely completing or completed — with deep hands-on GPU kernel compiler experience via Triton and IREE
    • Matches junior CUDA kernel engineer profile in US
    • Hireability: MEDIUM — at Apple doing well-aligned work, no open-to-work signals, but estimated 2-3 years in role (within typical transition window); active GitHub commits through April 2026

    Pengzhan Zhao

    medium hireability
    • 45 commits on openai/triton and merged PR in ROCm/triton implementing AMD gfx1250 MXFP FlashAttention kernels (MI350)
    • Bio: 'working GPU compiler and kernels' at AMD, SF Bay Area
    • UCLA systems lab background (2022); no PhD
    • Hireability: MEDIUM — actively employed at AMD with very recent commits (March 2026) but no explicit open-to-work signals; within reasonable transition window
Xin Tong

medium hireability

Software Engineer @ Google

Previously: Deep Learning Engineer @ NVIDIA

San Francisco, US

    • GPU/parallel computing background (research expertise includes GPU and parallel computing; authored 'Parallel Triangle Clipping on GPU' 2014)
    • Software Engineer at Google in SF; source flagged via CUDA signals in DB
    • Primarily visualization/graphics background but GPU engineering is core
    • Hireability: MEDIUM — ~3.7 years at Google (within typical transition window), no open-to-work signals detected
Xueshen Liu

medium hireability

Student Researcher @ Google

Previously: Student Researcher @ Google

Ann Arbor, US

    • 4th-year PhD candidate at U of Michigan with genuine GPU/CUDA work: foundry-org/foundry (C++, CUDA graph materialization for LLM serving cold-start), mm2-gb (GPU-accelerated DNA aligner, ACM BCB 2024 oral), and HeterMoE (heterogeneous GPU training)
    • Research expertise in parallel computing + ML systems
    • US (Ann Arbor)
    • No PhD yet
    • Hireability: MEDIUM — completed Google Student Researcher internship Dec 2025, uploaded CV to website Dec 2025 and updated about page indicating job market preparation; still in PhD program (4th year, likely 1-2 years to graduation)
Zhihao Zhang

medium hireability

Ph.D. Student @ Carnegie Mellon University

Previously: MS Student @ Carnegie Mellon University

Pittsburgh, US

    • PhD student at CMU Catalyst building GPU kernels for LLM systems — co-authored Mirage Persistent Kernel (OSDI 2026, compiler for auto-generating CUDA kernels without Triton/CUDA), SpecInfer (ASPLOS 2024), TidalDecode (ICLR 2025)
    • Forks FlashInfer (CUDA kernel library) and Mirage
    • Based in Pittsburgh, PA
    • Hireability: MEDIUM — likely nearing PhD completion (OSDI 2026 paper accepted), LinkedIn profile went completely empty in Jan 2026 snapshot (possible job-transition prep), website says 'open to collaboration'
Zhongming Yu

medium hireability

Student Researcher @ Google

Previously: Machine Learning Engineer @ Intel

San Francisco, US

    • Active GPU/CUDA kernel researcher with multiple directly relevant papers: TorchSparse++ (sparse convolution on GPUs, 46 citations), GeoT (efficient segment reduction on GPU in C++/CUDA), SpMM heuristics on GPUs (31 citations)
    • PhD student at UCSD CSE (4th year, started 2022), Student Researcher at Google SF since April 2025
    • H-index 10
    • Hireability: MEDIUM — currently at Google as Student Researcher, ~4th year into PhD with no explicit openness signals, but within typical industry transition window for PhD students
Zhuoming Chen

medium hireability

Ph.D. Student @ Carnegie Mellon University

Previously: Research Intern @ Meta

New York, US

    • LLM inference systems researcher at CMU working on GPU memory optimization and speculative decoding (SpecInfer 381 citations, MagicPIG LSH sparse attention, Mini-Sequence Transformer memory-efficient long-sequence training)
    • Forks and works with CUDA kernel libraries (flashinfer, flash-attention)
    • Primary work is algorithmic/systems-level Python rather than authoring low-level CUDA kernels, making this adjacent to the query rather than a direct match
    • PhD student since 2023 (no PhD yet), US-based
    • Hireability: MEDIUM — CV update 73 days ago (Feb 2026) suggests active career motion, ~3 years into PhD, recent Meta FAIR internship (2025)
Da Yan

low hireability

Member of Technical Staff @ Anthropic

Previously: Independent Contractor @ OpenAI

New York, US

    • 44 commits on openai/triton; authored `turingas` (NVIDIA Volta/Turing GPU assembler, 241 stars) and CUDA-Winograd (fast CUDA kernels)
    • Research expertise in GPU performance optimization and GPU compiler
    • MTS at Anthropic (NY), US
    • Hireability: LOW — ~43 months at Anthropic with no open-to-work signals or job-change activity detected
Jiaming Tang

low hireability

Ph.D. Student @ MIT

Previously: Undergraduate Researcher @ SJTU EPCC Lab

Boston, US

    • MIT Han Lab PhD student (2nd year) with strong CUDA kernel background — authored Quest (ICML 2024, CUDA sparse KV-cache kernel for LLM inference) and co-authored AWQ (MLSys 2024 Best Paper, 1503 citations, integrated into vLLM/TensorRT-LLM)
    • Based in Cambridge, MA
    • Hireability: LOW for full-time — early in PhD under Song Han, recently added RA role at Physical Intelligence; likely 3+ years from graduation with no job-seeking signals, but strong internship prospect
Mark Saroufim

low hireability

Software Engineer @ Meta

Previously: ML Engineer @ Graphcore

San Francisco, US

    • Co-founded GPU MODE (CUDA learning community with 100+ lectures) and recently co-founded Core Automation
    • Deep CUDA/GPU kernel expertise: authored KernelBot paper (GPU kernel competition platform), active PyTorch AO contributor, GEMM eval projects, helion ML kernel DSL
    • Bay Area, US; no PhD
    • Overqualified for a junior role but exactly the CUDA profile sought
    • Hireability: LOW — recently co-founded Core Automation startup, actively building his own company

    Runs

    #2 · completed · 0 qualified / 0 found · Apr 27, 4:51 PM
    #1 · cancelled · 0 qualified / 0 found · Apr 27, 4:37 PM