
Junior GPU kernel engineers in the US with CUDA/Triton experience

completed · 4 qualified · 1 run · Apr 21, 9:38 PM · junior-gpu-kernel-engineers-in-the-us-with-cudatriton-experi-1776807525
Parsed 3 topics · Junior · Engineer · United States
    1. Generating seed nodes · 0 proposed
    2. Explored 0 queries · 0/0 done
    3. Expanding nodes · queued
    4. Qualifying candidates · queued

    Qualified Candidates (4)


    Haocheng Xi

    medium hireability

    MLSys Researcher @ University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    • Strong CUDA/ML-systems PhD student at UC Berkeley (2nd year, BS Tsinghua Yao Class)
    • Pinned repos include cuda-tensorcore-hgemm and how-to-optim-algorithm-in-cuda (both CUDA); papers on SpargeAttn (sparse attention GPU kernels), FP8 training (COAT, Jetfire), and INT4 quantization
    • NVIDIA Research Intern May–Aug 2025 on efficient training/inference and FP8 workflows
    • Berkeley, CA (US)
    • Hireability: MEDIUM — active 2nd-year PhD student, unlikely to leave for full-time before ~2028; strong internship candidate per NVIDIA precedent. Website very actively updated (2026-04-20) with new publications, no explicit open-to-work signal

    Hiva Mohammadzadeh

    medium hireability

    Machine Learning Engineer | Data Scientist @ IntuigenceAI

    Previously: Machine Learning Engineer @ Algoverse

    San Francisco, US

    • Co-author on KVQuant (NeurIPS 2024, 322 citations) which develops custom CUDA kernels for KV cache quantization (~1.7x speedup over fp16 matmul), and Squeezed Attention (ACL 2025) implementing sparse FlashAttention kernels (4x+ speedups)
    • Berkeley EECS grad (2023), now in Stanford MS CS (AI, Systems) 2025-2027; ML Engineer at IntuigenceAI startup in SF
    • Hireability: MEDIUM — currently mid-program at Stanford MS (2025-2027) and employed at startup, no open-to-work signals

    Luca Manolache

    medium hireability

    Head Teaching Assistant @ UC Berkeley

    Previously: Teaching Assistant @ UC Berkeley

    San Francisco, US

    • UC Berkeley EECS junior researching ML systems and model efficiency; co-authored NeurIPS 2025 paper 'Multipole Attention for Efficient Long Context Reasoning' which implements attention kernels achieving 4.5× speedup — direct GPU kernel experience
    • Forked NCCL and has a distributed-sparse-attention repo
    • Based in Berkeley/SF, US
    • Hireability: MEDIUM — current undergrad junior (likely graduating 2027), no explicit job-seeking signals, but prime candidate for summer 2026 internship

    Xiuyu Li

    low hireability

    PhD candidate @ Berkeley AI Research (BAIR) at UC Berkeley

    Previously: Research Consultant @ Together AI

    San Francisco, US

    • Berkeley BAIR PhD (Prof. Keutzer), strong CUDA/GPU kernel background — lead contributor to TorchSparse and TorchSparse++ (CUDA sparse convolution frameworks, MICRO'23/MLSys'22); also published on Q-Diffusion and SqueezeLLM quantization. h-index 19
    • Bay Area, US
    • Now MTS at xAI working on coding RL and infra
    • Hireability: LOW — pipeline shows a recent move from Together AI (Research Consultant) to xAI (MTS); profile scraped Feb 2026, so likely only 3–5 months into the new role

    Runs

    #1 · completed · 0 qualified / 0 found · Apr 21, 9:38 PM