Neuralace · Model Architecture Discovery Researcher

completed · 372 qualified · 1 run · May 7, 1:24 PM · company-name-neuralace-sabi-locations-usa-europe-china-india
Parsed: Neuralace · 4 topics · Researcher · USA, Europe, China, India
Generating seed nodes — 0 proposed
Explored 0 queries — 0/0 done
Expanding nodes — queued
Qualifying candidates — queued

    Qualified Candidates (368)

    CW

    Chengyue Wu

    high hireability

    Research Intern@NVIDIA

    Previously: Research Intern @ DeepSeek AI

    Shenzhen, CN

    39
    KV Cache Optimization72
    Inference-Aware Architecture55
    Weight Compression18
    Weight Streaming Efficiency12
    Strengths
    Fast-dLLM (ICLR 2026): first paper enabling KV cache for diffusion LLMs
    Fast-dLLM v2 (ICLR 2026): block-diffusion inference efficiency follow-up
    Gaps
    No weight compression or microscaling-aware work found
    …click to see all
    CZ

    Cheng Zhang

    high hireability

    Founding Engineer@AI Sequrity Company

    Previously: Research Intern @ Microsoft

    London, GB

    72
    Weight Compression88
    Inference-Aware Architecture82
    KV Cache Optimization72
    Weight Streaming Efficiency45
    Strengths
    LQER (ICML 2024) + QERA (ICLR 2025): LLM weight quantization error reconstruction
    Block-based Quantisation (EMNLP 2023): sub-8-bit block quant, microscaling-relevant
    Gaps
    No explicit MLA / vector-quantized KV cache work found
    …click to see all
    CH

    Coleman Richard Charles Hooper

    high hireability

    Graduate Student - ML Systems@University of California, Berkeley

    Previously: Research Intern @ NVIDIA

    San Francisco, US

    80
    KV Cache Optimization93
    Weight Compression85
    Inference-Aware Architecture72
    Weight Streaming Efficiency68
    Strengths
    KVQuant: first-author, NeurIPS 2024, 317 citations — landmark KV cache quantization
    Squeezed Attention: first-author, ACL 2025 — 3.1x KV budget reduction
    Gaps
    No explicit hardware co-design or chip-architecture papers (more SW-layer inference opt)
    …click to see all
    EL

    Enshu Liu

    high hireability

    MS student@Microsoft Research

    Previously: Intern @ Microsoft

    Beijing, CN

    34
    Weight Compression72
    Weight Streaming Efficiency32
    Inference-Aware Architecture28
    KV Cache Optimization5
    Strengths
    ViDiT-Q (ICLR 2025) — W4A8 quantization, 3x memory reduction for diffusion transformers
    MixDQ (ECCV 2024) — mixed-precision, 3-4x model size compression
    Gaps
    No KV cache work — missing MLA, KV eviction, vector-quantized KV entirely
    …click to see all
    GL

    Gen Li

    high hireability

    PhD student@Clemson University

    US

    37
    Weight Compression73
    Inference-Aware Architecture42
    Weight Streaming Efficiency28
    KV Cache Optimization3
    Strengths
    OWL (ICML 2024, 110 citations) — LLM pruning to high sparsity, directly relevant
    Dynamic Sparsity series (NeurIPS 2023, ICML 2024) — structured channel-level sparsity
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV) — zero evidence
    …click to see all
    HL

    Haokun Lin

    high hireability

    Ph.D. Student@Institute of Automation, Chinese Academy of Sciences

    Previously: Retail Credit Risk Intern @ GAC Auto Finance Co.

    Hong Kong, HK

    60
    Weight Compression80
    KV Cache Optimization65
    Weight Streaming Efficiency55
    Inference-Aware Architecture40
    Strengths
    IntactKV (ACL 2024): KV cache management via pivot token preservation
    DuQuant (NeurIPS 2024 oral): state-of-the-art 4-bit LLM weight quantization
    Gaps
    No explicit hardware chip co-design work (KV constraints from chip packaging perspective)
    …click to see all
    HQ

    Haotong Qin

    high hireability

    Postdoctoral Researcher@ETH Zürich

    Previously: Research Scientist @ ByteDance

    Zurich, CH

    59
    Weight Compression95
    KV Cache Optimization65
    Inference-Aware Architecture55
    Weight Streaming Efficiency20
    Strengths
    BiLLM: 1-bit post-training quantization for LLMs (152 citations, 2024)
    ReCalKV: low-rank KV cache compression (2025) — direct match
    Gaps
    KV cache and hardware-aware work is secondary — primary focus is weight quantization
    …click to see all
    HC

    Hongzheng Chen

    high hireability

    Ph.D. Candidate@Cornell University

    Previously: Undergrad student @ SUN YAT-SEN UNIVERSITY

    Ithaca, US

    29
    Inference-Aware Architecture52
    Weight Compression36
    Weight Streaming Efficiency18
    KV Cache Optimization8
    Strengths
    LLM-FPGA (FCCM'24): 5.7× energy efficiency vs A100 — explicit chip-constrained inference design
    Allo (PLDI'24): accelerator design language — hardware-SW co-design for inference
    Gaps
    No KV cache work — no MLA, KV eviction, or vector-quantized KV
    …click to see all
    HJ

    Huiqiang Jiang

    high hireability

    RSDE@Microsoft

    Previously: Research SDE @ Microsoft

    Shanghai, CN

    55
    KV Cache Optimization93
    Inference-Aware Architecture62
    Weight Compression35
    Weight Streaming Efficiency28
    Strengths
    LLMLingua: 20x KV-cache + prompt compression, EMNLP'23 + ACL'24
    SCBench (ICLR'25): KV cache benchmarking across long-context methods
    Gaps
    No weight compression work (quantization, block-sparse weights) in portfolio
    …click to see all
    HK

    Hyungjun Kim

    high hireability

    Postdoctoral Researcher@Northwestern University

    Previously: Graduate Student @ Seoul National University

    Evanston, US

    54
    Weight Compression83
    Inference-Aware Architecture78
    Weight Streaming Efficiency45
    KV Cache Optimization10
    Strengths
    OWQ: outlier-aware weight quantization for LLM inference (AAAI 2024)
    QUICK: quantization-aware conflict-free kernel for efficient LLM inference
    Gaps
    No KV cache work (MLA, KV eviction, KV quantization) found
    …click to see all
    IM

    Ionut-Vlad Modoranu

    high hireability

    Ph.D. student@Institute of Science and Technology Austria (ISTA)

    Previously: Research Scientist @ Amazon

    Vienna, AT

    33
    Weight Compression80
    Inference-Aware Architecture25
    Weight Streaming Efficiency20
    KV Cache Optimization5
    Strengths
    DASLab (Dan Alistarh) — premier lab for LLM quantization/sparsity
    "Unified Scaling Laws for Compressed Representations" (2025) — direct weight compression
    Gaps
    No KV cache work (MLA, VQ-KV, eviction) found
    …click to see all
    JL

    Junyang Lin

    high hireability

    Research Scientist@Qwen

    Previously: Staff Engineer @ Alibaba

    Beijing, CN

    66
    KV Cache Optimization82
    Weight Compression80
    Inference-Aware Architecture72
    Weight Streaming Efficiency28
    Strengths
    CateKV (ICML 2025): KV eviction/consistency for long-context inference
    Rotated Runtime Smooth (ICLR 2025): training-free INT4 quantization
    Gaps
    No explicit MLA or vector-quantized KV cache work found
    …click to see all
    ML

    Muyang Li

    high hireability

    Doctoral Student@Massachusetts Institute of Technology

    Previously: Research Intern @ NVIDIA

    Boston, US

    52
    Weight Compression88
    Inference-Aware Architecture75
    Weight Streaming Efficiency35
    KV Cache Optimization10
    Strengths
    SVDQuant: 4-bit quantization via SVD low-rank outlier absorption (ICLR 2025 Spotlight)
    deepcompressor: production model compression toolbox, LLMs + diffusion
    Gaps
    No published KV cache work (MLA, KV eviction) — diffusion model focus
    …click to see all
    NF

    Natalia Frumkin

    high hireability

    Research Associate@AMD

    Previously: Research Scientist Intern @ Meta

    Austin, US

    41
    Weight Compression82
    Inference-Aware Architecture50
    KV Cache Optimization15
    Weight Streaming Efficiency15
    Strengths
    Quamba2 (ICML 2025): scalable W4/W8 PTQ for selective SSMs
    Quamba (ICLR 2024): first PTQ recipe for Mamba; 12 citations
    Gaps
    No KV cache work — SSM focus sidesteps transformer KV stack entirely
    …click to see all
    PB

    Payman Behnam

    high hireability

    Student Researcher@Google

    Previously: Graduate Research Assistant @ Georgia Institute of Technology

    Atlanta, US

    67
    KV Cache Optimization90
    Inference-Aware Architecture72
    Weight Compression68
    Weight Streaming Efficiency38
    Strengths
    RocketKV (ICML 2025): KV eviction + sparse attention, 400× compression, 3.7× speedup
    EMPIRIC (2025): systematic KV cache compression gaps for long-context inference
    Gaps
    No MLA or vector-quantized KV cache work specifically
    …click to see all
    RC

    Roberto L. Castro

    high hireability

    Postdoc@Institute of Science and Technology Austria

    Previously: PhD student @ Universidad de La Coruña

    AT

    59
    Weight Compression92
    Inference-Aware Architecture75
    Weight Streaming Efficiency65
    KV Cache Optimization5
    Strengths
    MARLIN: FP16xINT4 inference kernel, ~4x speedup on GPU (60 citations)
    Microscaling FP4 Quantization paper — exact match to query's microscaling mention
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV) found in papers
    …click to see all
    RZ

    Rongzhi Zhang

    high hireability

    Applied Scientist@Amazon

    Previously: Applied Scientist Intern @ Amazon

    San Francisco, US

    31
    KV Cache Optimization80
    Inference-Aware Architecture22
    Weight Compression18
    Weight Streaming Efficiency5
    Strengths
    LoRC: low-rank KV cache compression, NeurIPS Workshop 2024
    Explicit research focus: progressive KV cache compression
    Gaps
    No hardware-aware / inference-chip co-design work found
    …click to see all
    RC

    Ruisi Cai

    high hireability

    Research Intern@NVIDIA

    Previously: Quantitative Researcher Intern @ Citadel Securities

    San Francisco, US

    63
    KV Cache Optimization88
    Weight Compression60
    Inference-Aware Architecture60
    Weight Streaming Efficiency45
    Strengths
    H2O: first-authored KV eviction paper, 658 citations (NeurIPS 2023)
    LoCoCo (ICML 2024): long-context KV compression with convolutions
    Gaps
    No microscaling or sub-bit weight compression work
    …click to see all
    SA

    Saleh Ashkboos

    high hireability

    Research Assistant@ETH Zürich

    Previously: Research Intern @ Apple

    Zurich, CH

    64
    Weight Compression97
    Inference-Aware Architecture70
    Weight Streaming Efficiency62
    KV Cache Optimization25
    Strengths
    GPTQ (1860 citations) — foundational post-training quantization for LLMs
    Microscaling FP4 ICLR 2026 — directly targets MX-format weight compression constraints
    Gaps
    No direct KV cache work (MLA, KV eviction, vector-quantized KV) in visible papers
    …click to see all
    SY

    Shang Yang

    high hireability

    PhD student@MIT EECS

    Previously: Intern @ MIT

    Boston, US

    83
    Weight Compression97
    KV Cache Optimization90
    Inference-Aware Architecture85
    Weight Streaming Efficiency60
    Strengths
    QServe W4A8KV4: KV4 cache + W4 weights co-designed for serving efficiency
    AWQ: MLSys 2024 Best Paper (1498 citations) — landmark activation-aware quantization
    Gaps
    No explicit work on weight streaming power reduction (vs. compression)
    …click to see all
    TR

    Tahseen Rabbani

    high hireability

    Frontier Tech Consultant, Project Management@Scale AI

    Previously: Postdoctoral Research Associate @ Yale University

    Chicago, US

    35
    KV Cache Optimization80
    Weight Compression30
    Inference-Aware Architecture25
    Weight Streaming Efficiency5
    Strengths
    HashEvict (2412.16187): novel pre-attention KV eviction via LSH, 30–70% compression
    LSH-E at NeurIPS 2024 Compression Workshop — original KV eviction algorithm
    Gaps
    No MLA or vector-quantized KV cache work — different approach than query emphasis
    …click to see all
    TG

    Tarushii Goel

    high hireability
    23
    KV Cache Optimization35
    Weight Compression25
    Inference-Aware Architecture25
    Weight Streaming Efficiency5
    Strengths
    Log-Linear Attention (arXiv 2506.04761) — replaces KV cache with log-growing hidden states
    4 commits to fla-org/flash-linear-attention — linear attention kernel contributor
    Gaps
    No direct MLA, vector-quantized KV cache, or KV eviction work
    …click to see all
    VE

    Vage Egiazarian

    high hireability

    Postdoc@ISTA

    Previously: Researcher @ Higher School of Economics

    72
    Weight Compression95
    KV Cache Optimization82
    Inference-Aware Architecture65
    Weight Streaming Efficiency45
    Strengths
    AQLM: 2-bit extreme LLM weight quantization — lead author, 141 citations
    SpQR: near-lossless 3-4 bit quantization enabling 33B models on 24GB GPU
    Gaps
    No explicit hardware co-design or chip architecture work
    …click to see all
    ZH

    Zain Huda

    high hireability
    33
    Weight Compression70
    Weight Streaming Efficiency35
    Inference-Aware Architecture25
    KV Cache Optimization3
    Strengths
    Blockwise FP8 / MX format in pytorch/ao — 11 PRs, direct microscaling weight compression
    Float8BlockwiseLinear for DeepSeek V3: bandwidth & roofline benchmarking
    Gaps
    No KV cache work — MLA, KV eviction, vector-quantized KV absent
    …click to see all
    ZW

    Zhongwei Wan

    high hireability

    Ph.D. Candidate@The Ohio State University

    Columbus, US

    55
    KV Cache Optimization88
    Weight Compression72
    Inference-Aware Architecture42
    Weight Streaming Efficiency18
    Strengths
    LOOK-M (2024, 60 cites): KV cache compression for multimodal inference
    D2O (2025, 40 cites): dynamic KV eviction with 3x throughput gain
    Gaps
    No chip-level hardware co-design work (power/bandwidth constraints)
    …click to see all
    AR

    Abbas Rahimi

    medium hireability

    Research Staff Member@IBM

    Previously: Postdoctoral Researcher @ UC Berkeley

    Zurich, CH

    36
    Inference-Aware Architecture65
    Weight Compression38
    KV Cache Optimization20
    Weight Streaming Efficiency20
    Strengths
    "Efficient scaling of LLMs with MoE and 3D analog in-memory computing" (2025) — chip-constrained LLM design
    NeurIPS 2025 Spotlight: structured sparse SSMs — attention alternative eliminating KV cache
    Gaps
    No direct KV cache optimization work (MLA, vector-quantized KV, KV eviction)
    …click to see all
    AG

    Abhay Gupta

    medium hireability

    Research Scientist, Machine Learning@Databricks

    Previously: Research Scientist, Machine Learning @ Cerebras Systems

    San Francisco, US

    46
    Weight Compression82
    Inference-Aware Architecture55
    Weight Streaming Efficiency40
    KV Cache Optimization5
    Strengths
    SPDF: 75% sparsity on GPT-3 XL, 2.5x FLOP reduction (NeurIPS 2023, 44 citations)
    High-Sparsity Llama: 70% sparsity + quantization → 8.6x CPU speedup
    Gaps
    No KV cache work — MLA, vector-quantized KV, eviction absent from all papers
    …click to see all
    AB

    Abhimanyu Rajeshkumar Bambhaniya

    medium hireability

    Research Intern@Meta

    Previously: Intern @ Google

    San Francisco, US

    61
    Inference-Aware Architecture82
    Weight Compression68
    Weight Streaming Efficiency55
    KV Cache Optimization40
    Strengths
    GenZ-LLM-Analyzer: LLM inference hardware platform analysis tool
    MIST (2025): co-design framework with explicit KV cache reuse modeling
    Gaps
    No dedicated KV eviction, MLA, or vector-quantized KV cache work
    …click to see all
    AM

    Abhinav Mehrotra

    medium hireability

    Head of On-device GenAI@Samsung

    Previously: Principal Research Scientist @ Samsung

    London, GB

    43
    Weight Compression75
    Inference-Aware Architecture55
    Weight Streaming Efficiency22
    KV Cache Optimization18
    Strengths
    FraQAT: fractional-bit QAT (32→4b) for on-device generative models
    NanoFLUX: 12B→2B diffusion compression for mobile deployment (2026)
    Gaps
    No LLM KV cache work — no MLA, KV eviction, or vector-quantized KV cache
    …click to see all
    AT

    Aditya Tomar

    medium hireability

    Undergraduate Student@UC Berkeley

    Previously: Researcher @ PSSG

    US

    39
    KV Cache Optimization82
    Inference-Aware Architecture38
    Weight Compression28
    Weight Streaming Efficiency8
    Strengths
    QuantSpec (ICML 2025) — hierarchical quantized KV cache, self-speculative decoding
    XQuant — KV cache rematerialization breaking LLM inference memory wall
    Gaps
    No weight streaming or bandwidth-reduction work (axis 3 unaddressed)
    …click to see all
    AA

    Akhil Arunkumar

    medium hireability

    Sr. Principal Software Engineer@d-Matrix

    Previously: SoC Performance Architect @ AMD

    San Francisco, US

    49
    KV Cache Optimization85
    Inference-Aware Architecture70
    Weight Streaming Efficiency25
    Weight Compression15
    Strengths
    Keyformer (MLSys 2024): KV eviction via key token selection, 2.1x latency gain
    d-Matrix Gen AI Serving lead — inference ASIC stack, production KV cache management
    Gaps
    No published work on weight compression or quantization
    …click to see all
    AS

    Aleksandar Samardžić

    medium hireability
    50
    Weight Compression78
    Inference-Aware Architecture70
    Weight Streaming Efficiency45
    KV Cache Optimization5
    Strengths
    CuTeDSL MXFP8 3D quantization kernel — 6+ TB/s on B200
    32x32 MX block scaling for weights in MXFP8 (pytorch/ao)
    Gaps
    No KV cache optimization work found (MLA, KV eviction, etc.)
    …click to see all
    AB

    Alexander Borzunov

    medium hireability

    Researcher@OpenAI

    Previously: Researcher @ Yandex

    San Francisco, US

    53
    Weight Compression88
    Inference-Aware Architecture62
    Weight Streaming Efficiency50
    KV Cache Optimization10
    Strengths
    SpQR: 4x compression, <1% perplexity loss at 3-4 bits (368 citations)
    PETALS int8 model sharding — weight streaming at distributed inference scale
    Gaps
    No KV cache optimization work (MLA, eviction, vector quantization) found
    …click to see all
    AP

    Alexandra Peste

    medium hireability

    Applied Scientist@Canva

    Previously: Postdoctoral Researcher @ Institute of Science and Technology Austria

    Vienna, AT

    36
    Weight Compression78
    Weight Streaming Efficiency42
    Inference-Aware Architecture20
    KV Cache Optimization3
    Strengths
    ISTA DASLab PhD (Alistarh group) — top model compression research lab
    "Sparsity in Deep Learning" (JMLR, 1,229 citations) — co-authored compression survey
    Gaps
    No KV cache, MLA, or KV eviction work — significant gap for axis 1
    …click to see all
    AM

    Alexandre Marques

    medium hireability
    23
    Weight Compression65
    Inference-Aware Architecture15
    KV Cache Optimization8
    Weight Streaming Efficiency5
    Strengths
    51 commits on llm-compressor — Neural Magic production quantization pipeline
    QAT and activation equalization work — core compression techniques
    Gaps
    No KV cache optimization work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    AH

    Ali Hatamizadeh

    medium hireability

    Research Scientist@NVIDIA

    Previously: PhD student @ University of California, Los Angeles

    San Francisco, US

    45
    Inference-Aware Architecture78
    KV Cache Optimization60
    Weight Streaming Efficiency35
    Weight Compression5
    Strengths
    Gated Delta Networks (ICLR 2025) — replaces KV cache with fixed recurrent state
    MambaVision (CVPR 2025) — inference-efficient hybrid Mamba-Transformer at NVIDIA
    Gaps
    No direct MLA, vector-quantized KV, or KV eviction work — KV-free via SSM is adjacent
    …click to see all
    AK

    Alind Khare

    medium hireability

    Senior Researcher@Microsoft

    Previously: PhD Student @ Georgia Institute of Technology

    IN

    54
    Weight Compression75
    Weight Streaming Efficiency72
    Inference-Aware Architecture65
    KV Cache Optimization5
    Strengths
    Weight Sharing Paradigm (SIGOPS 2025) — LLM inference weight-streaming directly on-topic
    ∇QDARTS (TMLR 2025) — joint quantization+NAS for weight compression
    Gaps
    No KV cache work — MLA, KV eviction absent from publication record
    …click to see all
    AL

    Alkaid

    medium hireability
    32
    Inference-Aware Architecture78
    KV Cache Optimization40
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    Blackwell sm100 FMHA decode kernel optimization at Meta/FBGEMM
    6 merged PRs to flash-attention v4 (CuTe DSL masking, R2P, TMEM)
    Gaps
    No KV cache compression research (MLA, vector-quantized KV, eviction)
    …click to see all
    AG

    Amir Gholami

    medium hireability

    Postdoc@University of California, Berkeley

    San Francisco, US

    83
    Weight Compression97
    KV Cache Optimization88
    Inference-Aware Architecture83
    Weight Streaming Efficiency65
    Strengths
    KVQuant (2024): KV cache quantization for 10M context — direct hit
    Survey of quantization methods (2022, 1894 citations) — definitive field reference
    Gaps
    No direct weight streaming bandwidth work — coverage is via quantization, not sparsity/streaming architecture
    …click to see all
    AJ

    Amir Jalalirad

    medium hireability

    Staff Engineer@Qualcomm

    Previously: Senior Research Engineer @ HERE Technologies

    Amsterdam, NL

    51
    Weight Streaming Efficiency75
    Inference-Aware Architecture72
    Weight Compression40
    KV Cache Optimization18
    Strengths
    DIP (MLSys 2025): 46% memory, 40% throughput gain via dynamic weight sparsification
    Explicitly targets DRAM bandwidth bottleneck in decode — weight stream constraint
    Gaps
    No transformer KV cache compression (MLA, vector-quantized KV, eviction) work found
    …click to see all
    AN

    Amir Nassereldine

    medium hireability

    PhD Student@University At Buffalo SUNY

    Previously: Summer Intern @ Modular

    Buffalo, US

    19
    Inference-Aware Architecture40
    Weight Compression22
    Weight Streaming Efficiency8
    KV Cache Optimization5
    Strengths
    NVCiM-PT: hardware-software co-design for edge LLM inference (DATE 2025)
    Modular MAX Engine intern — AI inference runtime, hardware efficiency
    Gaps
    No KV cache work — no MLA, KV eviction, or quantized KV papers
    …click to see all
    AY

    Amir Yazdanbakhsh

    medium hireability

    Research Scientist@DeepMind

    Previously: Research Scientist @ Google

    San Francisco, US

    80
    Inference-Aware Architecture92
    Weight Compression90
    KV Cache Optimization70
    Weight Streaming Efficiency68
    Strengths
    SLiM (2025): combined quantization + sparsity for LLM weight compression
    HW-SW co-design primary research identity — 'Beyond Moore's Law' (2025)
    Gaps
    No explicit MLA or KV eviction paper — closest is linear attention replacing KV cache
    …click to see all
    AA

    Ammar Ahmad Awan

    medium hireability

    Principal AI Software Architect@Microsoft

    Previously: Principal Research Manager @ Microsoft

    Richardson, US

    29
    Inference-Aware Architecture72
    Weight Streaming Efficiency20
    Weight Compression12
    KV Cache Optimization12
    Strengths
    Leads LLM inference for Microsoft Maia AI chip — explicit chip-aware work
    DeepSpeed-Inference (577 citations) — inference at unprecedented scale
    Gaps
    No KV cache research (MLA, eviction, vector-quantized KV)
    …click to see all
    AL

    Andrew Li

    medium hireability

    Master's Student@University of California, Berkeley

    Previously: Researcher @ Google

    38
    Inference-Aware Architecture80
    Weight Compression58
    Weight Streaming Efficiency8
    KV Cache Optimization5
    Strengths
    CVPR 2021: designed EfficientNet-X for TPU/GPU — 2x faster than EfficientNet
    FLIQS (2024): quantization NAS covering FP8/INT — weight compression via search
    Gaps
    No published work on KV cache (MLA, KV eviction, vector-quantized KV)
    …click to see all
    AN

    andrewor14

    medium hireability
    49
    Weight Compression88
    Inference-Aware Architecture62
    Weight Streaming Efficiency40
    KV Cache Optimization5
    Strengths
    pytorch/ao: led INT4/INT8/Float8/NF4 v2 tensor architecture (144 commits)
    NVFP4 QAT — NVIDIA FP4 microscaling format, Blackwell inference chip target
    Gaps
    No KV cache work — MLA, vector-quantized KV, KV eviction all absent
    …click to see all
    AF

    Andrew W Fitzgibbon

    medium hireability

    Engineering Fellow@Graphcore

    Previously: Partner Researcher @ Microsoft

    Cambridge, GB

    66
    Weight Compression92
    Inference-Aware Architecture88
    Weight Streaming Efficiency72
    KV Cache Optimization12
    Strengths
    FP8 LLM inference paper — NeurIPS 2023 Oral, 111M–70B parameter models
    Scalify (ICML 2024) — end-to-end scale propagation for low-precision LLMs
    Gaps
    No published work on KV cache: MLA, KV eviction, or vector-quantized KV
    …click to see all
    AG

    Andrey Gromov

    medium hireability

    Research Scientist@Meta

    Previously: Assistant Professor @ University of Maryland, College Park

    34
    Weight Compression80
    Inference-Aware Architecture35
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    PARQ (ICML 2025) — piecewise-affine regularized quantization for LLMs
    Deeper Layers pruning (2024, 153 cit) — structured weight reduction via layer removal
    Gaps
    No KV cache optimization work (no MLA, KV eviction, vector quantization)
    …click to see all
    AL

    Angel Li

    medium hireability
    43
    Weight Compression75
    KV Cache Optimization45
    Inference-Aware Architecture40
    Weight Streaming Efficiency10
    Strengths
    MXTensor (mxfp8/nvfp4) in pytorch/ao — microscaling quantization as in query
    int4, int8 weight quantization for vllm/safetensors serialization
    Gaps
    No evidence of MLA or KV eviction strategies specifically
    …click to see all
    AN

    Aniruddha Nrusimha

    medium hireability

    PhD candidate@MIT

    Previously: Undergrad student @ University of California Berkeley

    Boston, US

    62
    KV Cache Optimization88
    Inference-Aware Architecture72
    Weight Compression58
    Weight Streaming Efficiency30
    Strengths
    Cross-Layer Attention (NeurIPS 2024) — primary-author KV cache reduction paper
    FlashFormer (2025) — whole-model kernels for low-batch inference, hardware-aware
    Gaps
    Weight streaming specifically (power-per-token at low-power mode) — not explicitly addressed
    …click to see all
    AO

    Antonio Orvieto

    medium hireability

    Principal Investigator (PI)@ELLIS Institute Tübingen

    Previously: PhD Researcher @ ETH Zurich

    Tübingen, DE

    43
    KV Cache Optimization72
    Inference-Aware Architecture68
    Weight Streaming Efficiency28
    Weight Compression5
    Strengths
    LRU paper (446 citations) — eliminates KV cache via linear recurrent state
    Griffin/Hawk co-author — hybrid recurrent+attention with O(1) decode memory
    Gaps
    No weight compression work (quantization, MicroScaling, topological regularization)
    …click to see all
    AM

    Avner May

    medium hireability

    Staff Research Scientist@Together

    Previously: Research Scientist @ Google

    New York, US

    36
    Inference-Aware Architecture62
    KV Cache Optimization50
    Weight Compression22
    Weight Streaming Efficiency10
    Strengths
    MagicDec: sparse KV cache to address KV bottleneck at high batch sizes
    Sequoia: hardware-aware speculative decoding — explicit HW modeling
    Gaps
    No MLA, vector-quantized KV, or KV quantization work
    …click to see all
    AC

    Ayan Chakraborty

    medium hireability

    Doctoral Student@EPFL

    Previously: Intern @ Nvidia

    Écublens, CH

    35
    Weight Compression72
    Inference-Aware Architecture42
    Weight Streaming Efficiency20
    KV Cache Optimization5
    Strengths
    Sparsity+quantization interplay paper — LLMs (2024, 15 citations)
    Block FP (Mixed-Mantissa) for DNN accelerators — microscaling-adjacent
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    BB

    Babak Ehteshami Bejnordi

    medium hireability

    Sr. Staff Engineer and Manager@Qualcomm

    Previously: Deep Learning and Computer Vision for Autonomous Driving @ Mapscape

    Amsterdam, NL

    60
    Inference-Aware Architecture72
    KV Cache Optimization70
    Weight Streaming Efficiency58
    Weight Compression38
    Strengths
    KaVa (2025): KV-cache compression via distillation — direct axis match
    Cache-Conditional Experts: 2× speedup on DRAM-constrained mobile MoE inference
    Gaps
    No MLA or vector-quantized KV work — KV research is distillation-based, not attention-level
    …click to see all
    BW

    Bailin Wang

    medium hireability

    MIT CSAIL

    Previously: Researcher @ Apple

    40
    Inference-Aware Architecture78
    KV Cache Optimization68
    Weight Streaming Efficiency8
    Weight Compression7
    Strengths
    GLA Transformers: FlashLinearAttention beats FlashAttention-2, hardware I/O-aware design
    Linear attention eliminates growing KV cache — constant inference memory
    Gaps
    No work on MLA, vector-quantized KV cache, or KV eviction — alternative paradigm (eliminates vs. compresses)
    …click to see all
    BP

    Barun Patra

    medium hireability

    Member of Technical Staff@Microsoft

    Previously: Senior Applied Scientist @ Microsoft

    Seattle, US

    27
    Inference-Aware Architecture75
    KV Cache Optimization20
    Weight Streaming Efficiency8
    Weight Compression5
    Strengths
    S2-Attention (ICLR 2025): Triton kernels, 4.5x inference speedup at 7B scale
    Hardware-aware design: context sharding optimized for memory IO and parallelization
    Gaps
    No KV cache eviction, MLA, or vector-quantized KV work
    …click to see all
    BF

    Benjamin Fineran

    medium hireability
    49
    Weight Compression90
    Weight Streaming Efficiency45
    Inference-Aware Architecture40
    KV Cache Optimization20
    Strengths
    #2 contributor llm-compressor — 293 commits, weight quantization at scale
    Authored HFQuantizer for compressed-tensors (merged HF transformers Sep 2024)
    Gaps
    No visible KV cache eviction, MLA, or vector-quantized KV work
    …click to see all
    BC

    Berlin Chen

    medium hireability
    24
    Inference-Aware Architecture55
    KV Cache Optimization30
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    Mamba-3 co-author with Tri Dao / Albert Gu — top SSM lab
    Constant-memory SSM recurrence eliminates KV cache at inference
    Gaps
    No direct KV eviction, MLA, or vector-quantized KV cache work
    …click to see all
    BA

    Bilge Acun

    medium hireability

    Research Scientist@Meta

    Previously: Research Staff Member @ IBM

    San Francisco, US

    53
    KV Cache Optimization72
    Inference-Aware Architecture68
    Weight Compression48
    Weight Streaming Efficiency22
    Strengths
    CHAI (ICML 2024): KV cache reduction via head attention clustering
    CATransformers (NeurIPS 2025): joint model-hardware NAS for inference
    Gaps
    No direct MLA, GQA, or KV eviction policy work — CHAI is head pruning, not KV quantization
    …click to see all
    BG

    Bofei Gao

    medium hireability

    MS student@Peking University

    CN

    23
    KV Cache Optimization72
    Inference-Aware Architecture10
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    PyramidKV: pyramidal KV eviction reduces cache footprint (155 citations)
    Kimi k1.5 + k2 co-authorship — Moonshot AI inference lab credibility
    Gaps
    No weight compression or topological regularization work
    …click to see all
    BC

    Boju Chen

    medium hireability

    PhD student@Tsinghua University

    Beijing, CN

    31
    KV Cache Optimization65
    Inference-Aware Architecture50
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    MoA: Mixture of Sparse Attention — 47 citations, CoLM'25 acceptance
    Sliding-window KV eviction: 1.2-1.4x memory reduction, 6.6-8.2x decode throughput
    Gaps
    No explicit MLA, vector-quantized KV cache, or KV eviction (policy-based) work
    …click to see all
    BL

    Bo Li

    medium hireability

    AI Compute DevTech Engineer@NVIDIA

    Previously: Semester Project Intern @ Disney Research

    Shanghai, CN

    74
    Weight Compression85
    KV Cache Optimization80
    Inference-Aware Architecture70
    Weight Streaming Efficiency60
    Strengths
    QuaRot: 4-bit quantization of all weights + KV cache (NeurIPS 2024, 286 citations)
    KV cache quantized to 4 bits in QuaRot — direct KV cache size reduction
    Gaps
    No explicit KV eviction or MLA work — QuaRot focuses on quantization
    …click to see all
    BP

    Bo Peng

    medium hireability

    Undergrad student@University of Hong Kong

    Hong Kong, HK

    71
    KV Cache Optimization95
    Inference-Aware Architecture90
    Weight Streaming Efficiency72
    Weight Compression28
    Strengths
    RWKV-LM: KV-free recurrent architecture eliminates KV cache entirely
    Albatross engine: 10,250 TPS on single RTX 5090 (RWKV7 7.2B fp16)
    Gaps
    No microscaling / topological regularization work (weight compression axis)
    …click to see all
    BB

    Boris van Breugel

    medium hireability

    Senior Machine Learning Researcher@Qualcomm

    Previously: PhD Researcher @ University of Cambridge

    Amsterdam, NL

    21
    Weight Compression52
    Inference-Aware Architecture20
    KV Cache Optimization5
    Weight Streaming Efficiency5
    Strengths
    FPTQuant (2025): function-preserving transforms for LLM quantization
    HadaNorm (2025): mean-centered transforms for diffusion transformer quantization
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV) — major axis gap
    …click to see all
    BL

    Boxun Li

    medium hireability

    Principal Researcher@Infinigence-AI

    Previously: Researcher @ Megvii Technology

    Durham, US

    34
    Inference-Aware Architecture65
    Weight Streaming Efficiency35
    Weight Compression20
    KV Cache Optimization15
    Strengths
    Megrez-Omni first author — edge inference model with SW-HW co-design
    Megrez2: cross-layer expert sharing, 3B active / 7.5B stored parameters
    Gaps
    No KV cache work (MLA, vector-quantized KV, eviction strategies)
    …click to see all
    BZ

    Bo Zheng

    medium hireability

    Researcher@Alibaba

    Previously: Researcher @ Alibaba

    26
    Inference-Aware Architecture45
    KV Cache Optimization30
    Weight Streaming Efficiency25
    Weight Compression5
    Strengths
    Gated Attention (arXiv:2505.06708): query-dependent sparse gating, attention-sink-free
    Qwen3 contributor — production MoE+dense LLM deployed at massive scale
    Gaps
    No direct KV eviction, MLA, or vector-quantized KV cache work found
    …click to see all
    BR

    Brian K. Ryu

    medium hireability
    72
    Inference-Aware Architecture82
    Weight Compression80
    KV Cache Optimization65
    Weight Streaming Efficiency62
    Strengths
    100 commits to flashinfer — KV cache / attention kernel library
    FP4 block-scaled kernel (SM120) — 1.20× over CUTLASS, extreme weight compression
    Gaps
    No published papers on MLA or KV eviction specifically
    …click to see all
    BA

    Byung Hoon Ahn

    medium hireability

    Software Engineer@Apple

    Previously: Research Scientist @ Protopia AI

    San Francisco, US

    24
    Inference-Aware Architecture60
    Weight Streaming Efficiency20
    KV Cache Optimization10
    Weight Compression5
    Strengths
    FlexInfer (MLSys 2025): hardware-aware adaptive LLM inference scheduling
    Tandem Processor (ASPLOS 2024): accelerator co-design for emerging NN operators
    Gaps
    No KV cache optimization work (no MLA, vector quantization, or eviction research)
    …click to see all
    CL

    Carlo Luschi

    medium hireability

    VP & Head of Research@Graphcore

    Previously: Director of Research @ Graphcore

    Oxford, GB

    86
    Inference-Aware Architecture92
    Weight Compression90
    KV Cache Optimization82
    Weight Streaming Efficiency80
    Strengths
    SparQ Attention: 8x attention data transfer savings via selective KV fetching
    Leads all research at Graphcore — IPU chip company with explicit HW constraints
    Gaps
    No published work specifically on MLA or KV eviction strategies
    …click to see all
    CY

    Changdi Yang

    medium hireability

    Intern@Snap

    Previously: PhD student @ Northeastern University

    40
    Weight Compression75
    Inference-Aware Architecture42
    Weight Streaming Efficiency35
    KV Cache Optimization8
    Strengths
    EdgeQAT/Squat: sub-8-bit token-adaptive QAT, 2.37x mobile inference speedup
    HyWIA: structured LLM pruning, 50% size reduction, +2.82% accuracy vs LLM-Pruner
    Gaps
    No KV cache, MLA, or KV eviction work found
    …click to see all
    CM

    Changhai Man

    medium hireability

    PhD student@Georgia Institute of Technology

    Atlanta, US

    37
    Inference-Aware Architecture60
    Weight Compression50
    Weight Streaming Efficiency35
    KV Cache Optimization3
    Strengths
    Multi-bit-width systolic accelerator (43 citations) — hardware inference co-design
    RankSearch: auto tensor compression for edge LSTM networks
    Gaps
    No KV cache or attention/transformer inference work found
    …click to see all
    CT

    Chaofan Tao

    medium hireability

    Research Scientist@Huawei

    Previously: Software Engineer @ Meta

    Hong Kong, HK

    53
    KV Cache Optimization80
    Weight Compression75
    Inference-Aware Architecture38
    Weight Streaming Efficiency18
    Strengths
    D2O (2025): dynamic layer/token KV cache compression for long-context LLMs
    UNComp (2024): KV-cache sparsity via uncertainty — two dedicated KV papers
    Gaps
    No hardware-specific chip co-design work; no power/packaging constraint awareness
    …click to see all
    CX

    Chaojun Xiao

    medium hireability

    Post-Doctoral Researcher@Tsinghua University

    Previously: Business Development Intern @ P.E.R.K. Consulting

    Beijing, CN

    52
    KV Cache Optimization80
    Inference-Aware Architecture72
    Weight Streaming Efficiency42
    Weight Compression12
    Strengths
    Locret (2025): trained retaining heads enabling principled KV eviction
    InfLLM (NeurIPS 2024): context memory for long-context KV management
    Gaps
    No weight compression work (quantization, microscaling, topological regularization)
    …click to see all
    CZ

    Chenggang Zhao

    medium hireability

    infra@DeepSeek AI

    ex-NVIDIA, SenseTime

    Hangzhou, CN

    79
    KV Cache Optimization90
    Inference-Aware Architecture85
    Weight Compression75
    Weight Streaming Efficiency65
    Strengths
    DeepSeek-V2 co-author — introduced MLA for KV cache footprint reduction
    DeepGEMM: FP8 GEMM with fine-grained scaling — direct weight compression for inference
    Gaps
    Infra/systems focus — less core architecture research independent of DeepSeek team
    …click to see all
    CL

    Cheng Li

    medium hireability

    Member of Technical Staff@Black Forest Labs

    Previously: Research Engineer @ Databricks

    Bellevue, US

    49
    Weight Compression82
    Inference-Aware Architecture60
    Weight Streaming Efficiency35
    KV Cache Optimization20
    Strengths
    DeepSpeed-Inference (632 citations) — led efficient inference at scale
    INT4 quantization paper (ICML 2023) — direct weight compression evidence
    Gaps
    No MLA, KV eviction, or vector-quantized KV cache papers
    …click to see all
    CL

    Chuanjian Liu

    medium hireability

    Researcher@Huawei

    Previously: Researcher @ Huawei

    32
    Weight Compression78
    Inference-Aware Architecture30
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    Rethinking 1-bit Optimization (2025) — extreme LLM weight compression
    Bi-ViT (2023) — pushes binarization limit for vision transformers
    Gaps
    No KV cache optimization work (no MLA, eviction, or vector-quantized KV)
    …click to see all
    CG

    Cong Guo

    medium hireability

    Postdoctoral Associate@Duke University

    Previously: Research intern @ Shanghai Qi Zhi Institute

    Durham, US

    83
    Weight Compression95
    KV Cache Optimization92
    Inference-Aware Architecture88
    Weight Streaming Efficiency58
    Strengths
    Ecco (ISCA'25): direct KV cache compression paper at flagship architecture venue
    VQ-LLM (2025): vector quantization for LLM inference — matches JD's VQ-MLA mention
    Gaps
    Weight streaming efficiency — sparsity work is adjacent; no paper explicitly on decode stream bandwidth for low-power chips
    …click to see all
    CH

    Connor Holmes

    medium hireability

    Researcher@OpenAI

    Previously: Researcher @ Microsoft

    San Francisco, US

    59
    Weight Compression82
    Inference-Aware Architecture75
    Weight Streaming Efficiency50
    KV Cache Optimization28
    Strengths
    NxMTransformer: semi-structured sparsity ADMM (2021, 30 citations)
    Low-bit NxM sparsity compression (2022) — quantization + pruning
    Gaps
    No MLA, vector-quantized KV cache, or KV eviction papers
    …click to see all
    DH

    Daniel HAZIZA

    medium hireability

    Research Engineer - GPU efficiency@Meta

    Previously: Research Engineer @ Meta

    Paris, FR

    57
    Inference-Aware Architecture82
    KV Cache Optimization75
    Weight Streaming Efficiency52
    Weight Compression18
    Strengths
    Flash-Decoding (2023): KV cache memory bandwidth optimization for long-context inference
    xFormers co-author — production GPU inference efficiency library at Meta FAIR
    Gaps
    No published work specifically on MLA, vector-quantized KV, or KV eviction strategies
    …click to see all
    DV

    Daniel Vega-Myhre

    medium hireability
    47
    Weight Compression80
    Inference-Aware Architecture55
    Weight Streaming Efficiency38
    KV Cache Optimization15
    Strengths
    254 commits to pytorch/ao — #3 contributor, quantization + sparsity
    MXFP8 microscaling quantization (MX = microscaling, chip-aware precision)
    Gaps
    No KV cache optimization work found (MLA, KV eviction, vector-quantized KV)
    …click to see all
    DS

    Daria Soboleva

    medium hireability

    Head Research Scientist@Cerebras Systems

    Previously: Senior Research Scientist @ Cerebras Systems

    San Francisco, US

    36
    Inference-Aware Architecture85
    Weight Compression35
    Weight Streaming Efficiency15
    KV Cache Optimization10
    Strengths
    5+ years at Cerebras designing MoE for wafer-scale inference hardware
    BTLM: 3B params achieves 7B quality at 3× less inference compute
    Gaps
    No KV cache work (MLA, vector-quantized KV, eviction strategies)
    …click to see all
    DC

    David Corvoysier

    medium hireability
    50
    Weight Compression88
    Inference-Aware Architecture55
    KV Cache Optimization30
    Weight Streaming Efficiency25
    Strengths
    huggingface/optimum-quanto: 644 commits, sole primary maintainer
    INT2/INT4/INT8/FP8 — full quantization stack across precisions
    Gaps
    No evidence of MLA, vector-quantized KV, or KV eviction algorithm work
    …click to see all
    DR

    David W. Romero

    medium hireability

    Research Scientist@Cartesia

    Previously: Research Scientist @ NVIDIA

    San Francisco, US

    54
    KV Cache Optimization65
    Inference-Aware Architecture60
    Weight Compression55
    Weight Streaming Efficiency35
    Strengths
    Cartesia RS: designs KV-free SSM/hybrid LM architectures for inference
    'Systems & Algorithms for Convolutional Multi-Hybrid LMs at Scale' (2025) — KV-free at scale
    Gaps
    No direct MLA, vector-quantized KV, or KV eviction work — eliminates cache rather than optimizes it
    …click to see all
    DG

    Daya Guo

    medium hireability

    Associate Professor@Sun Yat-sen University

    Previously: Postdoctoral Fellow @ Clemson University

    Zhuhai, CN

    45
    KV Cache Optimization65
    Inference-Aware Architecture60
    Weight Compression35
    Weight Streaming Efficiency20
    Strengths
    DeepSeek-V2 co-author — introduced MLA (93.3% KV cache compression)
    DeepSeek-V3 co-author — MLA + FP8 quantization + MTP continued
    Gaps
    Primary research focus is code intelligence, not KV cache or inference hardware
    …click to see all
    DN

    Deepak Narayanan

    medium hireability

    Senior Applied Deep Learning Research Scientist@NVIDIA

    Previously: Senior Researcher @ Microsoft

    Seattle, US

    55
    Inference-Aware Architecture88
    KV Cache Optimization65
    Weight Compression38
    Weight Streaming Efficiency28
    Strengths
    "The Case for Co-Designing Model Architectures with Hardware" (ICPP 2024) — exact query match
    Nemotron-H: Mamba layers replace attention, eliminating KV cache at inference (3× speedup)
    Gaps
    No dedicated MLA, vector-quantized KV, or KV eviction work — KV reduction is architectural not algorithmic
    …click to see all
    DS

    Dipika Sikka

    medium hireability
    53
    Weight Compression88
    Weight Streaming Efficiency65
    Inference-Aware Architecture55
    KV Cache Optimization5
    Strengths
    257 commits on vllm-project/llm-compressor — top-4 contributor
    MXFP4 support — microscaling format directly relevant to query's MX axis
    Gaps
    No KV cache work found (MLA, KV eviction, quantized KV) — key axis is blank
    …click to see all
    DW

    Di Wu

    medium hireability

    Director, Deep Learning Algorithm and Software@NVIDIA

    Previously: Co-Founder and CEO @ OmniML (acquired by NVIDIA)

    San Francisco, US

    50
    Weight Compression75
    Inference-Aware Architecture60
    KV Cache Optimization45
    Weight Streaming Efficiency20
    Strengths
    Founded OmniML — model compression startup acquired by NVIDIA
    Leads NVIDIA FP4 quantization and TensorRT Model Optimizer
    Gaps
    Thin research publication record — most papers from 2010-2018 FPGA era
    …click to see all
    DJ

    Donghyeon Joo

    medium hireability

    Research Associate@AMD

    Previously: Research Associate - PhD @ AMD

    College Park, US

    64
    KV Cache Optimization82
    Inference-Aware Architecture80
    Weight Streaming Efficiency52
    Weight Compression40
    Strengths
    MUSTAFAR (NeurIPS 2025): direct KV cache pruning via unstructured sparsity
    CORUSCANT (MICRO 2025): hardware-aware co-design of GPU kernels + sparse tensor cores
    Gaps
    No direct work on MLA, vector-quantized KV, or attention-level cache compression
    …click to see all
    DL

    Donghyun Lee

    medium hireability

    PhD student@University of Southern California

    Previously: Research Scholar @ Yale University

    New Haven, US

    32
    Weight Compression78
    Weight Streaming Efficiency30
    Inference-Aware Architecture15
    KV Cache Optimization3
    Strengths
    GPTAQ (ICML 2025): finetuning-free LLM weight quantization at inference time
    KronQ: Kronecker-factored Hessian — novel structured quantization approach
    Gaps
    No KV cache work — MLA, vector-quantized KV, eviction all absent
    …click to see all
    DG

    Driss Guessous

    medium hireability

    Staff Software Engineer@Meta

    Previously: Senior Machine Learning Engineer @ Meta

    Redondo Beach, US

    69
    Weight Compression88
    Inference-Aware Architecture78
    KV Cache Optimization60
    Weight Streaming Efficiency50
    Strengths
    117 PRs in pytorch/ao — NVFP4, float8, MX microscaling quantization
    FlexAttention co-author (2025) — fused attention kernel programmability
    Gaps
    No direct MLA or KV eviction work found — attention work is kernel-level, not cache strategy
    …click to see all
    EI

    eigen

    medium hireability
    40
    KV Cache Optimization75
    Inference-Aware Architecture50
    Weight Compression25
    Weight Streaming Efficiency10
    Strengths
    51 commits on flashinfer-ai/flashinfer — paged-KV + GQA core focus
    PR #3221: paged-KV indices, ragged indptrs, RoPE cos/sin infrastructure
    Gaps
    No evidence of MLA or KV eviction strategies specifically
    …click to see all
    EK

    Eldar Kurtic

    medium hireability

    Principal Research Scientist@Red Hat

    Previously: Senior Research Engineer @ Red Hat

    Vienna, AT

    54
    Weight Compression90
    Inference-Aware Architecture65
    Weight Streaming Efficiency55
    KV Cache Optimization5
    Strengths
    ZipLM: hardware-aware structured pruning — explicitly co-designs with inference chip constraints
    OBS/second-order pruning (177 cites) — foundational LLM weight compression
    Gaps
    No KV cache, MLA, or KV eviction work — missing axis entirely
    …click to see all
    EI

    Eugenia Iofinova

    medium hireability

    PhD student@Alistarh Group

    Previously: Intern @ Microsoft

    AT

    35
    Weight Compression78
    Weight Streaming Efficiency30
    Inference-Aware Architecture25
    KV Cache Optimization5
    Strengths
    AC/DC: alternating compressed/decompressed training (86 citations)
    Alistarh Group pedigree — SparseGPT/GPTQ originating lab
    Gaps
    No KV cache work (MLA, vector-quantized KV, eviction) found
    …click to see all
    FM

    Fanxu Meng

    medium hireability

    Sr. Technologist@Technip Energies

    Previously: Research Associate @ Houston Advanced Research Center

    Houston, US

    70
    KV Cache Optimization92
    Weight Compression72
    Inference-Aware Architecture72
    Weight Streaming Efficiency45
    Strengths
    TransMLA (NeurIPS 2025 Spotlight): 93% KV cache compression, 10.6x speedup
    TPLA (ASPLOS 2026): MLA tensor-parallel attention for prefill-decode decoupling
    Gaps
    No specific work on weight stream bandwidth reduction during decode
    …click to see all
    FS

    Fei Sun

    medium hireability

    Software Engineer@Meta

    Previously: Research Scientist @ Alibaba Group

    San Francisco, US

    57
    Weight Compression88
    Inference-Aware Architecture85
    Weight Streaming Efficiency45
    KV Cache Optimization8
    Strengths
    CHEX: channel compression for CNNs (2022, 114 citations)
    FBNet: hardware-aware NAS — inference chip-aware design (1807 cit.)
    Gaps
    No KV cache work found (MLA, vector-quantized KV, KV eviction)
    …click to see all
    FI

    Forrest Iandola

    medium hireability

    AI Research Scientist@Meta

    Previously: Head of Perception @ Anduril Industries

    San Francisco, US

    61
    Weight Compression92
    Inference-Aware Architecture78
    KV Cache Optimization40
    Weight Streaming Efficiency35
    Strengths
    SqueezeNet: AlexNet accuracy with 50x fewer params (<0.5MB)
    MobileLLM: block-wise weight-sharing + GQA for on-device LLMs
    Gaps
    No MLA, vector-quantized KV, or KV eviction work found
    …click to see all
    FM

    Funtowicz Morgan

    medium hireability
    40
    Weight Streaming Efficiency85
    Inference-Aware Architecture55
    Weight Compression15
    KV Cache Optimization5
    Strengths
    hmll: loads AI model weights at wire speed via io_uring/mmap
    ionic: CUDA planner pipelining NVMe→GPU weight streaming
    Gaps
    No evidence of KV cache work (MLA, eviction, quantization)
    …click to see all
    FT

    Fuwen Tan

    medium hireability

    R&D@ByteDance

    Previously: Research Scientist @ Samsung

    San Francisco, US

    56
    Weight Compression80
    Weight Streaming Efficiency72
    Inference-Aware Architecture62
    KV Cache Optimization10
    Strengths
    MobileQuant (EMNLP 2024): quantization for on-device LLM inference chips
    Progressive Mixed-Precision Decoding (ICLR 2025): variable-precision decode phase
    Gaps
    No KV cache work: MLA, KV eviction, vector-quantized KV absent from portfolio
    …click to see all
    FM

    fxmarty (Felix Marty)

    medium hireability
    48
    Weight Compression82
    Inference-Aware Architecture50
    KV Cache Optimization38
    Weight Streaming Efficiency22
    Strengths
    AutoGPTQ maintainer — production GPTQ weight quantization at scale
    Marlin FP8 kernels (optimum-quanto #237/#241) — INT4/FP8 inference
    Gaps
    No evidence of MLA, vector-quantized KV cache, or KV eviction strategies
    …click to see all
    GO

    Gabriele Oliaro

    medium hireability

    CS PhD Student@Snowflake AI Research

    Previously: Research Scientist Intern @ Snowflake

    29
    Inference-Aware Architecture55
    Weight Compression35
    Weight Streaming Efficiency15
    KV Cache Optimization12
    Strengths
    Korch (ASPLOS 2024): hardware-aware kernel orchestration for tensor programs
    Quantized Side Tuning (ACL 2024 Outstanding): 4-bit weight quantization for LLMs
    Gaps
    No work on KV cache compression (MLA, vector-quantized KV, eviction)
    …click to see all
    GK

    Geethan Karunaratne

    medium hireability

    Researcher@IBM

    Previously: Postdoctoral Researcher @ IBM

    Zurich, CH

    48
    Inference-Aware Architecture85
    Weight Compression62
    Weight Streaming Efficiency30
    KV Cache Optimization15
    Strengths
    64-core PCM DNN inference chip — co-designed models for in-memory compute (265 citations)
    HERMES-Core 14nm PCM/CMOS chip — 1.59 TOPS/mm², literal inference chip co-design
    Gaps
    No KV cache optimization work (no MLA, KV eviction, or vector-quantized KV papers)
    …click to see all
    GZ

    Genghan Zhang

    medium hireability

    Ph.D. Student in Computer Science@Stanford University

    Previously: Intern @ NVIDIA

    56
    Inference-Aware Architecture75
    Weight Streaming Efficiency65
    KV Cache Optimization55
    Weight Compression30
    Strengths
    CATS (37 citations): activation sparsity reducing LLM inference streams
    AccelOpt (MLSys 2026): kernel optimization agents for AI accelerators
    Gaps
    No direct MLA, vector-quantized KV cache, or KV eviction work
    …click to see all
    GJ

    Geonhwa Jeong

    medium hireability

    Research Scientist@Meta

    Previously: Graduate Research Assistant @ Georgia Institute of Technology

    San Francisco, US

    53
    Inference-Aware Architecture85
    Weight Compression60
    Weight Streaming Efficiency45
    KV Cache Optimization20
    Strengths
    TASDER (MLSys 2025): structured sparse weight approx, 83% EDP improvement
    2:4 activation sparsity for Transformer inference with FP8 (SLLM 2025)
    Gaps
    No MLA, vector-quantized KV, or KV eviction algorithm work
    …click to see all
    GS

    Gobinda Saha

    medium hireability

    AI Research Scientist@Meta

    Previously: Graduate Student Researcher @ Center for Brain-Inspired Computing

    San Francisco, US

    40
    KV Cache Optimization75
    Weight Compression45
    Inference-Aware Architecture30
    Weight Streaming Efficiency10
    Strengths
    Eigen Attention: 40% KV cache reduction via low-rank attention (arXiv 2024)
    Meta Super Intelligence Labs — LLM-focused research role
    Gaps
    No evidence of MLA or KV eviction strategies beyond low-rank attention
    …click to see all
    GS

    Grigory Sizov

    medium hireability
    22
    KV Cache Optimization50
    Inference-Aware Architecture25
    Weight Compression8
    Weight Streaming Efficiency5
    Strengths
    Paged attention in FlashAttention varlen — direct KV cache memory management
    Split-kv + M↔H swap for decoding attention — KV splitting optimization
    Gaps
    No weight compression or quantization research visible
    …click to see all
    GL

    Guangda Liu

    medium hireability

    PhD student@Microsoft Research Asia Alumni

    Previously: Research Intern @ Microsoft

    Shanghai, CN

    31
    KV Cache Optimization85
    Inference-Aware Architecture30
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    ClusterKV: recallable KV compression via semantic clustering (2x latency speedup)
    FreeKV: speculative KV retrieval — 13x faster than SOTA (2025)
    Gaps
    No work on weight compression or topological regularization
    …click to see all
    GL

    Guangda Liu

    medium hireability
    47
    KV Cache Optimization88
    Inference-Aware Architecture55
    Weight Compression25
    Weight Streaming Efficiency20
    Strengths
    FreeKV (arXiv:2505.13109): first-author, 13× speedup over SOTA KV retrieval
    Speculative retrieval + CPU/GPU hybrid KV layout — novel system co-design
    Gaps
    No direct work on weight compression algorithms (quantization/pruning authorship)
    …click to see all
    HY

    Haichuan Yang

    medium hireability

    Staff Software Engineer@DeepMind

    Previously: Research Scientist @ Meta

    San Francisco, US

    44
    Weight Compression82
    Inference-Aware Architecture55
    Weight Streaming Efficiency35
    KV Cache Optimization5
    Strengths
    Sparsity+quantization joint learning — CVPR 2020, 112 citations
    ECC: energy-constrained, platform-independent DNN compression (CVPR 2019)
    Gaps
    No KV cache, MLA, or KV eviction work in publication record
    …click to see all
    HQ

    Haifeng Qian

    medium hireability

    Principal Applied Scientist@NVIDIA

    Previously: Manager and Senior Applied Scientist @ Amazon

    San Francisco, US

    63
    Inference-Aware Architecture85
    KV Cache Optimization82
    Weight Compression50
    Weight Streaming Efficiency35
    Strengths
    Nemotron-H: Mamba layers eliminate KV cache entirely (constant memory per token)
    Bifurcated Attention: directly reduces KV memory IO during high-batch decoding
    Gaps
    No direct work on microscaling, topological regularization, or bits-per-weight constraints
    …click to see all
    HH

    Hailin Hu

    medium hireability

    Researcher@Huawei

    Previously: PhD student @ Tsinghua University

    33
    Weight Compression55
    KV Cache Optimization35
    Inference-Aware Architecture35
    Weight Streaming Efficiency5
    Strengths
    Transformer Compression Survey (2024) — expert knowledge of pruning, quantization, KD
    SDTP token pruning (2025) — KV cache compression compatible, 1.75× inference speedup
    Gaps
    No direct work on MLA or vector-quantized KV cache
    …click to see all
    HC

    Han Cai

    medium hireability

    AI Research Scientist@NVIDIA

    Previously: Research Intern @ NVIDIA

    Boston, US

    71
    Inference-Aware Architecture92
    Weight Compression80
    KV Cache Optimization65
    Weight Streaming Efficiency45
    Strengths
    Jet-Nemotron (NeurIPS 2025): hybrid linear/full-attention LM targeting KV cache reduction
    ProxylessNAS (2.5K cites): direct hardware-aware NAS on real target hardware
    Gaps
    No direct MLA, vector-quantized KV cache, or KV eviction-specific papers
    …click to see all
    HY

    Hanchen Ye

    medium hireability

    ML/HW/SW Co-Design Engineer@ElastixAI

    Previously: ML/HW/SW Co-Design Engineer @ Apple

    Seattle, US

    65
    Inference-Aware Architecture82
    KV Cache Optimization78
    Weight Streaming Efficiency72
    Weight Compression28
    Strengths
    SnapKV (NeurIPS'24, 313 citations) — KV eviction co-author
    StreamTensor (MICRO'25) — tensor streaming in LLM dataflow accelerators
    Gaps
    SnapKV is 6th-author contribution, not primary lead
    …click to see all
    HG

    Han Guo

    medium hireability

    Research Intern@Together AI

    Previously: Research Intern @ IBM

    San Francisco, US

    55
    Weight Compression85
    KV Cache Optimization65
    Inference-Aware Architecture50
    Weight Streaming Efficiency20
    Strengths
    FLUTE repo: CUDA C++ lookup-table quantization for LLMs — hardware-facing weight compression
    LQ-LoRA (ICLR 2024, 84 cites): quantized matrix decomp for efficient LLM finetuning
    Gaps
    No evidence of chip-constraint-aware architecture co-design (power/bandwidth budgets)
    …click to see all
    HW

    Hanrui Wang

    medium hireability

    Researcher@Stealth mode company

    Previously: PhD student @ Massachusetts Institute of Technology

    53
    Weight Compression72
    Inference-Aware Architecture65
    KV Cache Optimization55
    Weight Streaming Efficiency20
    Strengths
    SpAtten: cascade token/head pruning cuts DRAM access 10x (565 citations)
    HAT: hardware-latency-constrained NAS for transformers (370 citations)
    Gaps
    No work on MLA, vector-quantized KV, or modern KV eviction strategies
    …click to see all
    HS

    Hanshi Sun

    medium hireability

    Research Scientist@ByteDance

    Previously: Teaching Assistant @ Carnegie Mellon University

    Bellevue, US

    48
    KV Cache Optimization93
    Inference-Aware Architecture68
    Weight Compression20
    Weight Streaming Efficiency10
    Strengths
    ShadowKV (ICML 2025 Spotlight) — KV cache offloading for long-context inference
    R-KV — KV cache compression for reasoning model acceleration
    Gaps
    No evidence of weight compression or quantization work on model weights
    …click to see all
    HS

    Han Shu

    medium hireability

    Research Engineer@Huawei

    27
    Weight Compression58
    Inference-Aware Architecture25
    Weight Streaming Efficiency20
    KV Cache Optimization5
    Strengths
    ExCP: LLM checkpoint compression 70× via weight-momentum shrinking + quantization
    TinySAM: post-training quantization for edge device inference
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV cache)
    …click to see all
    HZ

    Hansong Zhou

    medium hireability
    39
    Weight Compression80
    Inference-Aware Architecture45
    Weight Streaming Efficiency25
    KV Cache Optimization5
    Strengths
    microsoft/BitNet top contributor — 19 commits, merge access
    Added full model conversion pipeline for BitNet2b_2501 (1852 lines)
    Gaps
    No KV cache, MLA, or attention optimization work found
    …click to see all
    HX

    Haocheng Xi

    medium hireability

    MLsys Researcher@University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    74
    Weight Compression92
    KV Cache Optimization88
    Inference-Aware Architecture72
    Weight Streaming Efficiency42
    Strengths
    QuantSpec: hierarchical quantized KV cache — ICML 2025
    XQuant: KV cache rematerialization for LLM inference (2025)
    Gaps
    No direct work on MLA or KV eviction specifically
    …click to see all
    HB

    Haoli Bai

    medium hireability

    Researcher@Huawei

    Previously: Applied Scientist Intern @ Amazon

    Hong Kong, HK

    62
    KV Cache Optimization92
    Weight Compression88
    Inference-Aware Architecture35
    Weight Streaming Efficiency32
    Strengths
    TreeKV, FreqKV, WeightedKV — 3 KV cache compression papers (2025)
    IntactKV (ACL 2024, 45 cit) — KV-aware quantization for LLMs
    Gaps
    No direct hardware-chip co-design evidence (power/bandwidth/packaging)
    …click to see all
    HY

    Haoran You

    medium hireability

    Research Scientist@Adobe

    Previously: Research Scholar @ SRC Research Scholars Program

    Seattle, US

    82
    Weight Compression90
    Inference-Aware Architecture90
    KV Cache Optimization82
    Weight Streaming Efficiency65
    Strengths
    LaCache (2025) — direct KV caching paper for long-context LLM efficiency
    ShiftAddLLM — multiplication-less reparameterization reduces weight stream compute
    Gaps
    No explicit MLA or vector-quantized KV cache work found
    …click to see all
    HT

    Haotian (Ken) Tang

    medium hireability
    80
    Weight Compression95
    Inference-Aware Architecture82
    KV Cache Optimization78
    Weight Streaming Efficiency65
    Strengths
    AWQ MLSys 2024 Best Paper — canonical activation-aware 4-bit weight quantization
    QServe W4A8KV4: led GPU kernels, 4-bit KV cache (SmoothAttention), 3.5x throughput
    Gaps
    No explicit MLA or KV eviction work — KV focus is quantization, not eviction
    …click to see all
    HW

    Haoxuan Wang

    medium hireability

    Research Intern@Cisco

    Previously: PhD student @ Illinois Institute of Technology

    Chicago, US

    19
    Weight Compression55
    Inference-Aware Architecture10
    KV Cache Optimization5
    Weight Streaming Efficiency5
    Strengths
    PTQ4DiT (NeurIPS 2024, 42 citations) — quantization for diffusion transformers
    QuEST (ICCV 2025) — low-bit diffusion model via selective finetuning
    Gaps
    Zero KV cache work (MLA/KV eviction) — core Neuralace requirement
    …click to see all
    HD

    HDCharles

    medium hireability
    46
    Weight Compression78
    Weight Streaming Efficiency50
    KV Cache Optimization42
    Inference-Aware Architecture12
    Strengths
    Core llm-compressor contributor — GPTQ, AWQ, FP8 quantization in production
    Added KV cache quantization to pytorch/ao (130k ctx, 18.9GB w/ int4+KV quant)
    Gaps
    No MLA, vector-quantized KV, or KV eviction research
    …click to see all
    HD

    HDCharles

    medium hireability
    58
    Weight Compression85
    Weight Streaming Efficiency70
    Inference-Aware Architecture55
    KV Cache Optimization20
    Strengths
    Core llm-compressor contributor — AWQ, GPTQ, FP8, W4A16, W8A16 quantization schemes
    compressed-tensors library contributor (NeuralMagic's compression runtime)
    Gaps
    No published research papers — engineer rather than academic researcher
    …click to see all
    HH

    HDCharles (Charles Hernandez)

    medium hireability
    57
    Weight Compression88
    KV Cache Optimization65
    Inference-Aware Architecture50
    Weight Streaming Efficiency25
    Strengths
    Added KV cache quantization to torchao — direct implementation evidence
    82 commits pytorch/ao + 468 PRs llm-compressor — quantization core contributor
    Gaps
    No evidence of MLA, vector-quantized KV, or KV eviction strategies specifically
    …click to see all
    HC

    Heng Chang

    medium hireability

    Researcher@Tsinghua University

    Previously: Research Intern @ Ant Group

    Beijing, CN

    24
    Weight Compression65
    Inference-Aware Architecture20
    KV Cache Optimization5
    Weight Streaming Efficiency5
    Strengths
    QA-LoRA ICLR 2024 — quantization-aware LoRA, 243 citations
    One QuantLLM for ALL ACL 2025 Oral — unified quantized deployment
    Gaps
    No KV cache optimization work (MLA, eviction strategies absent)
    …click to see all
    HP

    Hongwu Peng

    medium hireability

    Research Scientist/Engineer@Adobe

    Previously: Research Scientist/Engineer Intern @ Adobe

    New York, US

    55
    Weight Compression78
    Inference-Aware Architecture72
    Weight Streaming Efficiency48
    KV Cache Optimization22
    Strengths
    Medusa (429 citations) — speculative decoding reduces per-step HBM weight reads
    AQ2PNN: adaptive quantization for hardware-constrained private inference
    Gaps
    No KV cache-specific papers (MLA, KV eviction, vector-quantized KV cache)
    …click to see all
    HJ

    Hongyi Jin

    medium hireability
    59
    Inference-Aware Architecture75
    KV Cache Optimization65
    Weight Compression55
    Weight Streaming Efficiency40
    Strengths
    KV cache transfer kernel for prefill-decode disaggregation (apache/tvm commit)
    Unified KV cache interface in microserving paper (arXiv:2412.12488)
    Gaps
    No evidence of MLA, vector-quantized KV, or KV eviction strategies specifically
    …click to see all
    HZ

    Howard Zhang

    medium hireability
    28
    KV Cache Optimization42
    Weight Compression40
    Inference-Aware Architecture20
    Weight Streaming Efficiency8
    Strengths
    FP8 QKV quantization in torchao — directly reduces K/V precision at inference
    Per-head fused QKV FP8 kernel with FA3/FA4 backends
    Gaps
    Activation quantization focus, not weight compression or architecture design
    …click to see all
    HC

    Hsin-Pai Cheng

    medium hireability

    Researcher@Qualcomm

    Previously: PhD student @ Duke University

    23
    Inference-Aware Architecture68
    KV Cache Optimization12
    Weight Compression8
    Weight Streaming Efficiency5
    Strengths
    PADRe (ICLR 2025): hardware-friendly Hadamard-product attention on GPU/NPU
    11x–43x GPU/NPU speedup vs standard attention — explicit chip benchmarks
    Gaps
    No KV cache work — no MLA, vector-quantized KV, or KV eviction publications
    …click to see all
    HZ

    Hui-Ling Zhen

    medium hireability

    Senior Staff Research Scientist@Huawei

    Previously: Staff Researcher @ Huawei

    HK

    74
    Weight Compression92
    KV Cache Optimization90
    Inference-Aware Architecture80
    Weight Streaming Efficiency35
    Strengths
    KVTuner: nearly lossless 3.25-bit KV cache, 21% throughput gain
    SVDq: 1.25-bit K-cache, 410x key cache compression ratio
    Gaps
    No explicit weight-streaming topology work (TPS/watt streaming reduction)
    …click to see all
    HZ

    Hui-Ling Zhen

    medium hireability
    81
    KV Cache Optimization93
    Weight Compression92
    Inference-Aware Architecture82
    Weight Streaming Efficiency55
    Strengths
    KVTuner (ICML 2025): sensitivity-aware layer-wise KV cache mixed-precision quantization
    SVDq: 1.25-bit key cache + 410x compression via SVD, directly reducing KV footprint
    Gaps
    No explicit weight-streaming bandwidth papers (closest: MoE sparsity reduces active weights)
    …click to see all
    IF

    Igor Fedorov

    medium hireability

    Staff Research Scientist / Tech Lead / Manager@Meta

    Previously: Senior AI Research Scientist @ Meta

    San Diego, US

    58
    Weight Compression92
    Inference-Aware Architecture82
    Weight Streaming Efficiency50
    KV Cache Optimization8
    Strengths
    SpinQuant (ICLR 2025): LLM quantization with learned rotations
    UDC (NeurIPS 2022): compressible TinyML for NPUs — chip-aware architecture design
    Gaps
    No KV cache work — MLA, vector-quantized KV, or KV eviction absent from profile
    …click to see all
    IB

    Irem Boybat

    medium hireability

    Research Staff Member@IBM

    Previously: Postdoctoral Researcher @ IBM

    Zurich, CH

    48
    Inference-Aware Architecture78
    Weight Compression62
    Weight Streaming Efficiency42
    KV Cache Optimization10
    Strengths
    AnalogNAS (2023): hardware-aware NAS for analog inference constraints
    Efficient scaling LLMs + 3D analog CiM — Nature CS 2025, 25 citations
    Gaps
    No KV cache work (MLA, vector-quantized KV, eviction strategies)
    …click to see all
    JR

    Jeff Rasley

    medium hireability
    30
    Inference-Aware Architecture50
    KV Cache Optimization35
    Weight Streaming Efficiency20
    Weight Compression15
    Strengths
    Shift Parallelism (2025): KV cache invariance as core design property
    DeepSpeed Inference (2022): 7.3x speedup, trillion-param inference at scale
    Gaps
    Systems/runtime layer — not chip-constraint-aware model architecture design
    …click to see all
    JK

    Jerome Ku

    medium hireability
    46
    Weight Compression75
    Inference-Aware Architecture50
    Weight Streaming Efficiency45
    KV Cache Optimization15
    Strengths
    HQQ fused GEMM merged to pytorch/ao — core INT4 weight quantization contributor
    tinygemm INT4 unpacker — packed-weight inference, aware of memory-bound decode
    Gaps
    No KV cache-specific work: no MLA, vector-quantized KV, or KV eviction contributions found
    …click to see all
    JZ

    Jerry Zhang

    medium hireability
    60
    Weight Compression90
    Weight Streaming Efficiency75
    Inference-Aware Architecture60
    KV Cache Optimization15
    Strengths
    torchao lead: 366 commits, 430 PRs — production weight quantization at Meta scale
    NVFP4 + MXFP8 microscaling formats — sub-4-bit, block-scale weight compression
    Gaps
    No KV cache work found — MLA, vector-quantized KV, or eviction not in his portfolio
    …click to see all
    JC

    Jesse Cai

    medium hireability

    Machine Learning Engineer@Meta

    Previously: Senior Research Engineer @ Cultivate

    San Francisco, US

    54
    Weight Compression88
    Weight Streaming Efficiency65
    Inference-Aware Architecture58
    KV Cache Optimization5
    Strengths
    94 commits to pytorch/ao — production INT4/FP8/weight-only quantization
    TorchAO co-author (ICLR 2025) — PyTorch-native model optimization
    Gaps
    No KV cache optimization work (MLA, eviction, vector-quantized KV)
    …click to see all
    J(

    Jiahan Chang (Cyrus)

    medium hireability
    63
    KV Cache Optimization75
    Inference-Aware Architecture72
    Weight Compression65
    Weight Streaming Efficiency40
    Strengths
    concat_mla_k CUDA kernel in FlashInfer — direct MLA KV cache optimization
    Integrated MLA kernel into vLLM for DeepSeek R1 (production scale)
    Gaps
    Location unconfirmed — NVIDIA global, unknown if in USA/Europe/China/India
    …click to see all
    JT

    Jiaming Tang

    medium hireability

    Ph.D. student@MIT

    Previously: Undergraduate researcher @ SJTU EPCC Lab

    Boston, US

    68
    Weight Compression92
    KV Cache Optimization82
    Inference-Aware Architecture68
    Weight Streaming Efficiency28
    Strengths
    Quest (ICML 2024) — query-aware KV eviction, directly on-target
    AWQ (MLSys 2024 Best Paper, 1503 cit) — activation-aware weight quantization
    Gaps
    No MLA or vector-quantized KV work — Quest is eviction-based only
    …click to see all
    JX

    Jing Xiong

    medium hireability

    PhD student@University of Hong Kong

    Previously: MS student @ Sun Yat-Sen University

    Shenzhen, CN

    33
    KV Cache Optimization85
    Inference-Aware Architecture28
    Weight Compression15
    Weight Streaming Efficiency5
    Strengths
    D2O: KV eviction paper, 31 citations (EMNLP 2024)
    ParallelComp: parallel KV compressor, ICML 2025
    Gaps
    No weight compression or weight streaming work
    …click to see all
    JI

    jiqing-feng

    medium hireability
    22
    Weight Compression55
    Inference-Aware Architecture22
    KV Cache Optimization5
    Weight Streaming Efficiency5
    Strengths
    FP8 kernel acceleration for compressed-tensors on XPU and CUDA (Apr 2026)
    int4 weight-only quantization on Intel XPU via TorchAO — hardware-aware impl
    Gaps
    No visible KV cache, MLA, or KV eviction work
    …click to see all
    JF

    Josh Fromm

    medium hireability
    58
    Weight Compression92
    Inference-Aware Architecture78
    Weight Streaming Efficiency55
    KV Cache Optimization5
    Strengths
    ScaleBITS (2026): hardware-aligned mixed-precision quantization for LLMs
    Automated Backend-Aware PTQ (2021): chip-specific quantization targeting
    Gaps
    No evidence of KV cache work (MLA, eviction, vector-quantized KV)
    …click to see all
    JS

    Junru Shao

    medium hireability
    40
    Inference-Aware Architecture65
    Weight Compression45
    KV Cache Optimization30
    Weight Streaming Efficiency20
    Strengths
    MLC-LLM quantization pipeline (q4f16_1/q3f16) — deployment-layer weight compression
    FlashInfer integration into TVM — KV cache-aware attention at compiler level
    Gaps
    Compiler/runtime engineer, not model architecture researcher — designs deployment stacks, not new architectures
    …click to see all
    JG

    Junxian Guo

    medium hireability

    PhD student@Shanghai Jiao Tong University

    Previously: Undergrad student @ Shanghai Jiao Tong University

    Shanghai, CN

    72
    KV Cache Optimization85
    Weight Compression82
    Inference-Aware Architecture80
    Weight Streaming Efficiency40
    Strengths
    DuoAttention (99 citations): KV retrieval/streaming heads for long-context inference
    VQ-LLM: vector quantization augmented LLM inference — VQ-KV directly
    Gaps
    No direct work on MLA or KV eviction specifically — DuoAttention is adjacent
    …click to see all
    KN

    Ka-Hyun Nam

    medium hireability
    49
    Inference-Aware Architecture72
    Weight Compression65
    KV Cache Optimization35
    Weight Streaming Efficiency22
    Strengths
    38 PRs on flashinfer-ai/flashinfer — active core contributor
    MXFP8 BlockScaledMmaOp w/ CUTLASS DSL — microscaling quantization ops
    Gaps
    No direct KV cache algorithm work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    KZ

    Kan Zhu

    medium hireability

    PhD Student@University of Washington

    Previously: Undergrad student @ University of Michigan - Ann Arbor

    Seattle, US

    63
    KV Cache Optimization88
    Weight Compression85
    Inference-Aware Architecture60
    Weight Streaming Efficiency20
    Strengths
    Quest (ICML 2024): KV eviction via query-aware sparsity — core fit
    Atom (MLSys 2024, 245 citations): low-bit quantization, direct weight compression
    Gaps
    No explicit weight streaming / bandwidth-reduction work
    …click to see all
    KB

    Kartikeya Bhardwaj

    medium hireability

    Researcher@Qualcomm

    Previously: Senior Machine Learning Engineer @ Arm

    44
    Inference-Aware Architecture75
    Weight Compression72
    Weight Streaming Efficiency22
    KV Cache Optimization8
    Strengths
    "Oh! We Freeze": 4-bit weight quantization KD for LLMs on edge (ICLR 2024)
    ZiCo: hardware-aware NAS, 116 citations (ICLR 2023)
    Gaps
    No KV cache work: no MLA, vector-quantized KV, or KV eviction papers
    …click to see all
    KS

    Keshav Santhanam

    medium hireability
    19
    Inference-Aware Architecture45
    KV Cache Optimization15
    Weight Compression10
    Weight Streaming Efficiency5
    Strengths
    3470 commits to NVIDIA/Megatron-LM — Mamba EP + MoE inference engineer
    "Cheaply Estimating Inference Efficiency Metrics" (NeurIPS 2023)
    Gaps
    No KV cache optimization work (MLA, vector-quantized KV, eviction strategies)
    …click to see all
    KP

    Kimish Patel

    medium hireability
    55
    KV Cache Optimization75
    Inference-Aware Architecture70
    Weight Compression55
    Weight Streaming Efficiency20
    Strengths
    22-PR stack Apr 2026 on transposed KV cache (1.64x decode speedup at pos=1024)
    Non-flash SDPA path added for better decode SeqLen=1 performance
    Gaps
    KV cache work targets layout efficiency, not size reduction (no MLA or eviction)
    …click to see all
    LC

    Lequn Chen

    medium hireability

    Research Engineer@Perplexity AI

    Previously: PhD student @ University of Washington

    71
    Weight Compression82
    Inference-Aware Architecture75
    KV Cache Optimization72
    Weight Streaming Efficiency55
    Strengths
    Atom (MLSys 2024, 245 citations): low-bit quantization for LLM serving
    41 commits to flashinfer-ai/flashinfer — attention/KV kernel library
    Gaps
    No explicit MLA or vector-quantized KV cache work found
    …click to see all
    LL

    Liangzhen Lai

    medium hireability

    Researcher@Meta

    73
    Inference-Aware Architecture90
    Weight Streaming Efficiency78
    Weight Compression68
    KV Cache Optimization55
    Strengths
    Bit Fusion (747 cit.) — seminal inference chip co-design
    Folding Attention (2024) — attention memory/power for on-device streaming
    Gaps
    No explicit MLA/GQA or KV eviction work for large-scale LLM inference
    …click to see all
    LZ

    Ligeng Zhu

    medium hireability

    Ph.D student@NVIDIA

    Previously: Undergrad student @ Simon Fraser University

    Boston, US

    58
    Inference-Aware Architecture82
    Weight Compression62
    KV Cache Optimization52
    Weight Streaming Efficiency35
    Strengths
    HAT: Hardware-Aware Transformers — 367 citations, co-designs architecture with hardware
    ProxylessNAS: direct search on target hardware for inference efficiency
    Gaps
    No direct MLA, vector-quantized KV, or KV eviction work — KV cache work is adjacent
    …click to see all
    LJ

    Lisa Jin

    medium hireability
    43
    Weight Compression82
    Weight Streaming Efficiency55
    Inference-Aware Architecture28
    KV Cache Optimization5
    Strengths
    PARQ first author — principled QAT for 2-4 bit extreme weight compression
    ParetoQ co-author — scaling laws for extremely low-bit LLM quantization
    Gaps
    No KV cache optimization work (MLA, eviction, quantized KV) found
    …click to see all
    LL

    Lujun Li

    medium hireability

    Researcher@Hong Kong University of Science and Technology

    Previously: Researcher @ Hong Kong Generative AI Research and Development Center

    Hong Kong, HK

    38
    Weight Compression82
    Weight Streaming Efficiency45
    Inference-Aware Architecture20
    KV Cache Optimization3
    Strengths
    STBLLM: sub-1-bit structured binary LLMs with custom CUDA kernel (2024)
    EMQ: training-free mixed-precision quantization — ICCV 2023, 54 citations
    Gaps
    No KV cache work — MLA, KV eviction, vector-quantized KV all absent
    …click to see all
    LC

    Lukas Cavigelli

    medium hireability

    Researcher (Expert/Architect)@Huawei

    Previously: Researcher (Principal Engineer) @ Huawei

    Zurich, CH

    84
    Weight Compression92
    Inference-Aware Architecture90
    KV Cache Optimization82
    Weight Streaming Efficiency70
    Strengths
    TyphoonMLA (2025): MLA kernel — exact match to KV cache search axis
    "Don't be so Stief!" (2026): KV cache low-rank compression on Stiefel manifold
    Gaps
    No explicit topological regularization or Microscaling (MX-format) work
    …click to see all
    LZ

    Luoming Zhang

    medium hireability

    Algorithm Engineer@Qualcomm

    Previously: PhD student @ Zhejiang University

    China

    54
    Weight Compression80
    KV Cache Optimization70
    Weight Streaming Efficiency35
    Inference-Aware Architecture30
    Strengths
    ZipCache: 5× KV cache compression with salient token quantization (arXiv:2405.14256)
    Dual Grained Quantization: A8W4 LLM quant, 3.24× speed, 1.12× memory reduction
    Gaps
    No work on MLA (multi-head latent attention) or KV eviction strategies
    …click to see all
    MG

    Manuel Le Gallo

    medium hireability

    Staff Research Scientist@IBM

    Previously: PhD student @ ETH Zurich

    Zurich, CH

    63
    Inference-Aware Architecture88
    Weight Streaming Efficiency80
    Weight Compression72
    KV Cache Optimization10
    Strengths
    64-core PCM inference chip — eliminates weight streaming via in-memory compute
    9.76 TOPS/W on 14nm chip — direct TPS/watt efficiency evidence
    Gaps
    No KV cache work — no MLA, vector-quantized KV, or KV eviction papers found
    …click to see all
    MF

    Marco Federici

    medium hireability

    Principal@Tracey

    Previously: Executive Manager Demand Forecasting @ nbn

    Sydney, AU

    59
    Weight Streaming Efficiency72
    Inference-Aware Architecture65
    Weight Compression58
    KV Cache Optimization40
    Strengths
    DIP (MLSys 2025): 46% memory reduction, 40% throughput on Phi-3-Medium
    Cache-Aware Masking: targets DRAM bandwidth during LLM decode
    Gaps
    KV cache work is activation-cache hit rate, not MLA/KV eviction specifically
    …click to see all
    MS

    Marc Sun

    medium hireability
    27
    Weight Compression65
    Inference-Aware Architecture22
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    FP8 per-tensor & block quant in transformers — production implementation
    BNB 4-bit/8-bit, GPTQ, TorchAO, compressed-tensors contributor
    Gaps
    No KV cache compression work (MLA, vector-quantize, KV eviction) found
    …click to see all
    MN

    Markus Nagel

    medium hireability

    Research Scientist (Senior Staff Engineer)@Qualcomm

    Previously: Research Scientist (Staff Engineer) @ Qualcomm

    Amsterdam, NL

    74
    Weight Compression95
    Inference-Aware Architecture75
    Weight Streaming Efficiency70
    KV Cache Optimization55
    Strengths
    GPTVQ (2024): vector quantization for LLMs — weight compression at scale
    ADAROUND (2020): foundational adaptive rounding PTQ, widely cited
    Gaps
    No direct MLA / KV eviction work found — cache work is peripheral
    …click to see all
    MB

    Mart Van Baalen

    medium hireability

    Senior Staff Machine Learning Research Engineer/Manager@Qualcomm

    Previously: Staff Machine Learning Research Engineer/Manager @ Qualcomm

    Amsterdam, NL

    62
    Weight Compression92
    Inference-Aware Architecture65
    Weight Streaming Efficiency60
    KV Cache Optimization30
    Strengths
    GPTVQ (2024): state-of-the-art vector quantization for LLM weight compression
    Leech Lattice VQ (2025): outperforms QuIP#, QTIP, PVQ — latest SOTA
    Gaps
    No direct KV cache work (MLA, KV eviction) — cache papers are MoE routing, not KV
    …click to see all
    MD

    Matthew Douglas

    medium hireability
    42
    Weight Compression82
    Weight Streaming Efficiency45
    Inference-Aware Architecture35
    KV Cache Optimization5
    Strengths
    Core bitsandbytes maintainer — 183 commits, 145+ merged PRs
    NF4/Int4/Int8 CUDA/ROCm kernel implementation in production
    Gaps
    No KV cache work (MLA, vector-quantized KV, KV eviction) found
    …click to see all
    MA

    Mehmet Aktukmak

    medium hireability
    36
    Weight Compression52
    Inference-Aware Architecture52
    KV Cache Optimization20
    Weight Streaming Efficiency18
    Strengths
    vLLM-Gaudi: vLLM plugin for Intel Gaudi AI inference chips
    Layer-by-layer SmoothQuant int8 — memory-efficient weight quantization
    Gaps
    No KV cache innovation (MLA, vector-quantized KV, eviction) — only indirect vLLM usage
    …click to see all
    MW

    Michael Wyatt

    medium hireability
    38
    KV Cache Optimization55
    Inference-Aware Architecture45
    Weight Compression35
    Weight Streaming Efficiency15
    Strengths
    #1 contributor to DeepSpeed-MII — blocked KV caching in production
    FP6 quantization config in MII — weight quantization for inference
    Gaps
    Systems engineer, not researcher — no novel KV/compression algorithm contributions
    …click to see all
    MM

    Michele Magno

    medium hireability

    PD Dr.@ETH Zurich

    Previously: Researcher @ ETH Zurich

    Zurich, CH

    50
    Weight Compression85
    Inference-Aware Architecture72
    Weight Streaming Efficiency38
    KV Cache Optimization5
    Strengths
    "Empirical study of Llama3 quantization" (49 cites, 2024) — top LLM weight compression
    BiLLM: Pushing the Limit of Post-Training Quantization — 1-bit extreme compression
    Gaps
    No KV cache optimization work (MLA, eviction, vector quantization) — major gap
    …click to see all
    ML

    Mingzhi Liu

    medium hireability
    70
    Inference-Aware Architecture85
    KV Cache Optimization82
    Weight Streaming Efficiency72
    Weight Compression40
    Strengths
    MLA fp8 kernel fix on gfx950 — direct KV decode path fix (ROCm/aiter #2907)
    mori RDMA KV transfer engine merged into vLLM-omni (PUSH/PULL, 3369 lines)
    Gaps
    AMD-specific (ROCm/MI series) — no custom ASIC inference chip experience
    …click to see all
    ME

    Mostafa Elhoushi

    medium hireability

    Research Scientist@Cerebras Systems

    Previously: Research Engineer, FAIR @ Meta

    Toronto, CA

    80
    Weight Compression92
    Inference-Aware Architecture80
    Weight Streaming Efficiency75
    KV Cache Optimization72
    Strengths
    CHAI (2024): 21.4% KV cache reduction via clustered attention head sharing
    any4 (2025): learned 4-bit LLM weight quantization + tinygemm inference library
    Gaps
    Location Toronto, Canada — not in search regions (USA, Europe, China, India)
    …click to see all
    NL

    Nandor Licker

    medium hireability
    30
    Inference-Aware Architecture50
    KV Cache Optimization45
    Weight Streaming Efficiency15
    Weight Compression8
    Strengths
    23 merged PRs in flashinfer-ai/flashinfer — KV cache/attention kernels
    #2 contributor to pplx-kernels — Perplexity production inference GPU kernels
    Gaps
    Kernel implementer, not architecture researcher — no MLA/KV eviction design work
    …click to see all
    NC

    Nathan Chen

    medium hireability
    33
    Inference-Aware Architecture60
    KV Cache Optimization55
    Weight Streaming Efficiency12
    Weight Compression5
    Strengths
    flash-linear-attention contributor — KV-cache-free linear attention (O(1) memory)
    Co-ALIBI: hardware-aligned Triton kernels, 160 TFLOPS on H100
    Gaps
    No MLA, vector-quantized KV, or explicit KV eviction policy work
    …click to see all
    NP

    Nilesh Prasad Pandey

    medium hireability

    PhD student@University of California, San Diego

    Previously: Applied Scientist Intern @ Amazon

    San Diego, US

    35
    Weight Compression78
    Inference-Aware Architecture40
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    Mixed precision PTQ (2023, 24 cit.) — hardware-aware weight compression
    DPQ-HD (2025) — ultra-low-power compression for inference hardware
    Gaps
    No KV cache work — MLA, vector-quantized KV, KV eviction absent
    …click to see all
    NV

    nv-yunzheq

    medium hireability
    63
    Weight Compression78
    Inference-Aware Architecture78
    Weight Streaming Efficiency50
    KV Cache Optimization45
    Strengths
    NVFP4 MoE kernels (blockscaled FP4) ported from TRT-LLM — production weight compression
    Blackwell SM100/GB200/GB300 architecture-specific CUTE DSL kernels
    Gaps
    No KV cache compression research (MLA work is inference serving, not architecture design)
    …click to see all
    OS

    Oliver Sieberling

    medium hireability

    PhD Student@MIT

    Previously: Teaching Assistant @ ETH Zürich

    Boston, US

    47
    Weight Compression82
    Inference-Aware Architecture65
    Weight Streaming Efficiency35
    KV Cache Optimization5
    Strengths
    EvoPress (ICML 2025): dynamic quantization + sparsity + pruning on Llama/Mistral
    Quartet: FP4 training/inference with Blackwell-optimized CUDA kernels
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    PW

    Paul N. Whatmough

    medium hireability

    Senior Director, AI Research@Qualcomm

    Previously: Director, AI Research @ Qualcomm

    Boston, US

    85
    Inference-Aware Architecture95
    Weight Compression93
    KV Cache Optimization78
    Weight Streaming Efficiency72
    Strengths
    GPTVQ (2024, 48 cites): vector quantization cuts LLM DRAM footprint + bandwidth
    KaVa (2025): KV-Cache compressed distillation — on-point for KV eviction/reduction axis
    Gaps
    No explicit MLA (Multi-head Latent Attention) architecture work
    …click to see all
    PD

    Peijie Dong

    medium hireability

    PhD Candidate in Computer Science@The Hong Kong University of Science and Technology (Guangzhou)

    Previously: Intern @ Alibaba

    Guangzhou, CN

    74
    Weight Compression90
    KV Cache Optimization82
    Weight Streaming Efficiency70
    Inference-Aware Architecture55
    Strengths
    ChunkKV (NeurIPS 2025) — semantic KV cache compression, long-context LLM
    SpInfer (EuroSys 2025 Best Paper) — low-level GPU sparsity for LLM inference
    Gaps
    No chip co-design — GPU inference focus, not custom AI inference chips
    …click to see all
    PZ

    Perkz Zheng

    medium hireability
    63
    KV Cache Optimization88
    Inference-Aware Architecture82
    Weight Compression45
    Weight Streaming Efficiency35
    Strengths
    FlashMLA KV-cache path for DeepSeek V4 prefill — direct MLA implementation (vLLM PR #41836)
    Sparse MLA decode kernel selection, SM100/SM103 Blackwell hardware-aware (FlashInfer #2836)
    Gaps
    No published research papers — pure implementation engineer, not researcher
    …click to see all
    PY

    Peter Yeh

    medium hireability
    46
    Weight Compression72
    Inference-Aware Architecture55
    Weight Streaming Efficiency40
    KV Cache Optimization15
    Strengths
    30 PRs to pytorch/ao — INT4, FP8, MX microscaling quantization on ROCm
    SpinQuant Hadamard matrices — rotation-based weight compression for LLMs
    Gaps
    No direct KV cache work (no MLA, KV eviction, or vector-quantized KV cache found)
    …click to see all
    PM

    Praneeth Medepalli

    medium hireability

    Member of Technical Staff@Zyphra

    Previously: Machine Learning Engineer @ SiMa.ai

    San Francisco, US

    26
    Weight Compression65
    Inference-Aware Architecture30
    KV Cache Optimization5
    Weight Streaming Efficiency5
    Strengths
    MPO-based low-rank factorization+pruning paper — direct weight compression
    Research expertise: model compression, quantization, signal processing
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV) found
    …click to see all
    RT

    Rahul Tuli

    medium hireability
    38
    Weight Compression72
    Weight Streaming Efficiency45
    KV Cache Optimization18
    Inference-Aware Architecture18
    Strengths
    37 PRs to compressed-tensors: AWQ, FP8, sparse 2:4 compression
    63 PRs to llm-compressor: SmoothQuant MoE fix, quantization pipeline
    Gaps
    No published research — engineering contributor, not novel algorithm author
    …click to see all
    RT

    Rajan Troll

    medium hireability

    Member of Technical Staff@OpenAI

    Previously: Chief Technology Officer @ inBalance

    Seattle, US

    49
    Weight Compression62
    Inference-Aware Architecture62
    KV Cache Optimization40
    Weight Streaming Efficiency30
    Strengths
    DB expertise: quantization + efficient DL hardware — core signal
    fast self-attention + long context memory: adjacent KV cache expertise
    Gaps
    No public papers on quantization, KV cache, or hardware-aware inference
    …click to see all
    RH

    Ramyad Hadidi

    medium hireability

    Senior Staff -- ML Computer Architect@d-Matrix

    Previously: Senior Scientist @ Rain AI

    San Francisco, US

    83
    Inference-Aware Architecture92
    Weight Streaming Efficiency82
    KV Cache Optimization80
    Weight Compression78
    Strengths
    Mustafar (NeurIPS'25): KV cache pruning via unstructured sparsity — exact match
    Endor (2024): reduces weight transfer bandwidth for offloaded LLM inference
    Gaps
    KV work is sparsity-based pruning — no published work on MLA or vector-quantized KV
    …click to see all
    RA

    Randy

    medium hireability
    55
    Weight Compression80
    Weight Streaming Efficiency75
    Inference-Aware Architecture60
    KV Cache Optimization5
    Strengths
    2:4 structured sparsity halves weight stream bandwidth — directly cuts TPS power
    FP8 + sparse CUTLASS tensor (Sparse2x4CUTLASSFloat8Tensor) in production at Meta
    Gaps
    No published research — pure practitioner, no papers
    …click to see all
    RZ

    Ritchie Zhao

    medium hireability

    Senior AI and Machine Learning Engineer@NVIDIA

    Previously: Senior Data Science Manager @ Microsoft

    Redmond, US

    83
    Weight Compression96
    KV Cache Optimization88
    Inference-Aware Architecture78
    Weight Streaming Efficiency70
    Strengths
    Shared Microexponents (ISCA 2023) — co-authored the MX spec cited in JD
    Microscaling Data Formats for Deep Learning (2023) — full MX format spec
    Gaps
    No published work specifically on decode-time weight streaming bandwidth reduction
    …click to see all
    RL

    Royson Lee

    medium hireability

    Research Scientist@Samsung

    Previously: Research Engineer @ Samsung

    Cambridge, GB

    56
    Weight Streaming Efficiency78
    Inference-Aware Architecture75
    Weight Compression65
    KV Cache Optimization5
    Strengths
    PMPD (ICLR 2025): precision-lowering during decode cuts weight stream bits
    3.8–8× throughput gain on LLM-optimized NPU hardware
    Gaps
    No KV cache work — no MLA, KV eviction, or vector-quantized KV evidence
    …click to see all
    RL

    Ruihang Lai

    medium hireability

    Ph.D. student@Carnegie Mellon University

    Previously: Research Intern @ OctoAI

    Pittsburgh, US

    51
    KV Cache Optimization82
    Inference-Aware Architecture62
    Weight Streaming Efficiency38
    Weight Compression22
    Strengths
    FlashInfer: KV-cache block-sparse attention engine, MLSys 2025 outstanding paper
    Cascade Inference: memory-bandwidth-efficient batch decoding paper
    Gaps
    No direct work on MLA, vector-quantized KV cache, or KV eviction strategies
    …click to see all
    RG

    Ruihao Gong

    medium hireability

    Beihang University

    Previously: Principal Researcher @ SenseTime

    58
    Weight Compression95
    Inference-Aware Architecture60
    Weight Streaming Efficiency55
    KV Cache Optimization20
    Strengths
    BRECQ (610 citations) — landmark post-training quantization paper
    DSQ (636 citations) — differentiable quantization, full-precision to low-bit bridging
    Gaps
    No published work on MLA, KV eviction, or vector-quantized KV cache
    …click to see all
    RZ

    Rui-Jie Zhu

    medium hireability

    Research Intern@ByteDance

    Previously: Research Intern @ EMD Electronics

    San Francisco, US

    56
    KV Cache Optimization70
    Weight Compression55
    Weight Streaming Efficiency50
    Inference-Aware Architecture50
    Strengths
    RWKV co-author (958 cit.) — KV-cache elimination via linear recurrence
    MatMul-free LM — ternary weights, no matmuls; extreme weight compression
    Gaps
    No explicit work on MLA, vector-quantized KV, or KV eviction strategies
    …click to see all
    RL

    Rui Li

    medium hireability

    Researcher@Samsung AI

    Previously: PhD Student @ University of Edinburgh

    Cambridge, GB

    33
    Inference-Aware Architecture60
    Weight Compression42
    Weight Streaming Efficiency20
    KV Cache Optimization10
    Strengths
    Hardware-Aware Parallel Prompt Decoding — GPU-adaptive, 2.49× speedup (EMNLP 2025)
    Dynamic sparse tree adapts decoding to hardware constraints
    Gaps
    No KV cache-specific work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    RX

    Runxin Xu

    medium hireability

    Researcher@DeepSeek

    Previously: Quant researcher @ Metabit Trading

    Barcelona, ES

    71
    KV Cache Optimization88
    Inference-Aware Architecture80
    Weight Compression62
    Weight Streaming Efficiency55
    Strengths
    DeepSeek-V2: introduced MLA — 5.75x KV cache reduction, core to the query
    DeepSeek-V3 co-author — continued inference-efficient architecture design
    Gaps
    No published work on KV eviction or vector-quantized KV cache
    …click to see all
    RY

    Ruokai Yin

    medium hireability

    PhD student@Yale University

    Previously: Research Intern @ Microsoft

    New Haven, US

    38
    Weight Compression75
    Inference-Aware Architecture50
    Weight Streaming Efficiency20
    KV Cache Optimization5
    Strengths
    GPTAQ (ICML 2025) — quantizes 405B LLMs, direct weight compression evidence
    DuoGPT (NeurIPS 2025) — dual sparsity pruning, training-free LLM compression
    Gaps
    No KV cache optimization work (MLA, vector quant, eviction)
    …click to see all
    RT

    Rush Tabesh

    medium hireability

    Ph.D. Student@Institute of Science and Technology Austria

    Previously: Scientific Researcher @ Institute of Science and Technology Austria

    Vienna, AT

    36
    Weight Compression88
    Weight Streaming Efficiency30
    Inference-Aware Architecture22
    KV Cache Optimization5
    Strengths
    QuEST: 1-bit weights+activations LLM training — extreme compression
    Quartet: native FP4 LLM training proven optimal (2025)
    Gaps
    No KV cache optimization work (MLA, KV eviction) found
    …click to see all
    SJ

    Sam Ade Jacobs

    medium hireability

    Computer Scientist@Microsoft

    37
    KV Cache Optimization72
    Inference-Aware Architecture35
    Weight Compression25
    Weight Streaming Efficiency15
    Strengths
    MAC-Attention (2026): 99% KV access reduction, 60% decode latency cut
    KV reuse scheme — constant compute/bandwidth on cache hits regardless of context length
    Gaps
    No inference chip co-design — all systems work is training-focused
    …click to see all
    SR

    Samyam Rajbhandari

    medium hireability

    AI Systems Lead | Principal Architect@Snowflake

    Previously: Principal Architect @ Microsoft

    Redmond, US

    72
    KV Cache Optimization88
    Weight Compression72
    Weight Streaming Efficiency65
    Inference-Aware Architecture62
    Strengths
    SwiftKV (2025): 62.5% KV cache reduction via AcrossKV knowledge-preserving layer merging
    ZeRO (1983 citations) — partitions weight memory across devices, foundational work
    Gaps
    ZeRO family primarily training-focused; decode-time weight streaming for low-power chips not directly addressed
    …click to see all
    SD

    Saurabh Dash

    medium hireability

    Member of Technical Staff@Cohere

    Previously: Machine Learning Researcher @ Apple

    Toronto, CA

    39
    Weight Compression70
    Inference-Aware Architecture55
    Weight Streaming Efficiency20
    KV Cache Optimization10
    Strengths
    "Intriguing Properties of Quantization at Scale" — NeurIPS 2023 LLM weight quant
    Hessian-driven mixed-precision for ReRAM PIM arrays — hardware co-design
    Gaps
    No KV cache work (MLA, vector-quantized KV, KV eviction) found
    …click to see all
    SS

    Sayeh Sharify

    medium hireability

    Principal Machine Learning Research Scientist@d-Matrix

    Previously: Co-Founder @ Tartan AI

    San Francisco, US

    74
    Weight Compression92
    Inference-Aware Architecture88
    KV Cache Optimization65
    Weight Streaming Efficiency50
    Strengths
    ResQ (2025): 4-bit KV cache + weight + activation quantization with 3× speedup
    Microscaling PTQ (2024): chip-native MX format quantization for inference hardware
    Gaps
    No published work on MLA or vector-quantized KV cache specifically
    …click to see all
    SR

    Scott Roy

    medium hireability

    Data Scientist@Microsoft

    Previously: Researcher @ Meta

    57
    Weight Compression92
    Inference-Aware Architecture75
    Weight Streaming Efficiency55
    KV Cache Optimization5
    Strengths
    ParetoQ: state-of-the-art 1–4 bit LLM quantization (Meta AI, Feb 2025)
    155 PRs to pytorch/ao: HQQ, PARQ, LUT 1–4 bit packing, Int4 weight-only configs
    Gaps
    No KV cache work found (MLA, KV eviction, vector-quantized KV)
    …click to see all
    SK

    Se Jung Kwon

    medium hireability

    Director@NAVER

    Previously: Leader @ NAVER

    Seoul, KR

    71
    Weight Compression92
    KV Cache Optimization80
    Inference-Aware Architecture75
    Weight Streaming Efficiency35
    Strengths
    "No Token Left Behind" (2024, 76 cit.) — KV cache compression via mixed-precision quantization
    LUT-GEMM (2024, 184 cit.) — lookup-table-based quantized matmul for LLM inference hardware
    Gaps
    Location: Seoul, KR — not in requested regions (USA/Europe/China/India)
    …click to see all
    SB

    Shane Bergsma

    medium hireability

    Principal Researcher@Cerebras

    Previously: Researcher @ Huawei

    25
    Weight Compression45
    Inference-Aware Architecture35
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    Principal Researcher, Cerebras Systems — wafer-scale inference chip R&D
    Sparsity 2024: 30-40% inference FLOP reduction via unstructured sparsity
    Gaps
    No KV cache work — no MLA, vector-quantized KV, or KV eviction papers
    …click to see all
    SL

    Shengyu Liu

    medium hireability
    46
    KV Cache Optimization90
    Inference-Aware Architecture65
    Weight Compression20
    Weight Streaming Efficiency10
    Strengths
    13 commits to deepseek-ai/FlashMLA — second-highest contributor
    Co-authored FlashMLA kernel blog (Apr + Oct 2025) — MLA KV cache compression
    Gaps
    No weight compression or microscaling research
    …click to see all
    SJ

    Shiqi Jiang

    medium hireability

    Senior Researcher@Microsoft

    Previously: Senior Research Engineer @ Microsoft

    Beijing, CN

    42
    Inference-Aware Architecture75
    Weight Streaming Efficiency68
    Weight Compression18
    KV Cache Optimization5
    Strengths
    Active-Weight Swapping (DRAM/Flash) paper — direct weight streaming bandwidth work
    NPU inference paper (EuroSys '26) — mobile chip constraints addressed
    Gaps
    No KV cache work — MLA, eviction, or KV quantization absent from profile
    …click to see all
    SY

    Shixing Yu

    medium hireability

    PhD student@Cornell University

    Previously: Research Intern @ Meta

    San Francisco, US

    31
    Weight Compression72
    Inference-Aware Architecture35
    Weight Streaming Efficiency10
    KV Cache Optimization5
    Strengths
    UVC (ICLR 2022, 157 cites) — unified ViT pruning + low-rank + quantization
    HAP (WACV 2022, 77 cites) — Hessian-aware pruning, weight structure
    Gaps
    No KV cache work (MLA, KV eviction) — a key Neuralace axis
    …click to see all
    S(

    Shiyang Weng (LevelDownRefine)

    medium hireability
    12
    Weight Compression28
    Inference-Aware Architecture15
    KV Cache Optimization3
    Weight Streaming Efficiency3
    Strengths
    13 merged PRs in pytorch/ao — int8/fp8 quantization contributor
    x86 inductor fusion passes for quantized ops (DLRMv2)
    Gaps
    No KV cache, MLA, or KV eviction work found
    …click to see all
    SM

    Shuming Ma

    medium hireability

    Senior Researcher@Microsoft

    Previously: Researcher @ Microsoft

    84
    Weight Compression96
    KV Cache Optimization90
    Weight Streaming Efficiency75
    Inference-Aware Architecture75
    Strengths
    BitNet b1.58 — seminal 1-bit/1.58-bit LLM compression (h=44 primary author)
    YOCO (2024) — decoder-decoder arch halving KV cache memory at inference
    Gaps
    No explicit hardware chip co-design work (bitnet.cpp targets CPU, not custom ASIC)
    …click to see all
    SW

    Shu Wang

    medium hireability
    59
    Weight Compression72
    Weight Streaming Efficiency65
    Inference-Aware Architecture62
    KV Cache Optimization35
    Strengths
    NVFP4 masked quantization in FlashInfer — 4-bit weight compression at kernel level
    W4A16 AWQ + ModelOpt FP8 support in SGLang — direct weight compression
    Gaps
    Inference systems implementer, not a model architecture designer
    …click to see all
    SC

    Sijia (Jackson) Chen

    medium hireability
    43
    KV Cache Optimization75
    Inference-Aware Architecture55
    Weight Compression25
    Weight Streaming Efficiency15
    Strengths
    FlashMLA FP16 kernel (deepseek-ai/FlashMLA) — direct MLA KV cache work
    FP8 KV cache quantization in FBGEMM — per-head, decode latency reduction
    Gaps
    No weight compression or topological regularization work
    …click to see all
    SZ

    Si Zheng

    medium hireability

    Machine Learning System Researcher Scientist@ByteDance Seed

    Previously: Research Intern @ DeepSeek AI

    Beijing, CN

    71
    KV Cache Optimization92
    Inference-Aware Architecture80
    Weight Compression65
    Weight Streaming Efficiency48
    Strengths
    ShadowKV (ICML 2025) — KV cache via low-rank keys, up to 6× batch size gain
    ArkVale (NeurIPS 2024) — KV eviction with recallable mechanism
    Gaps
    Weight streaming bandwidth reduction not a primary focus
    …click to see all
    SL

    Stefanos Laskaridis

    medium hireability

    Applied Scientist@Amazon

    Previously: Visiting Researcher @ University of Cambridge

    London, GB

    39
    Weight Compression74
    Inference-Aware Architecture48
    Weight Streaming Efficiency22
    KV Cache Optimization10
    Strengths
    FlexRank ICML'26 spotlight — nested low-rank for adaptive deployment
    Maestro — trainable decomposition uncovering low-rank weight structures
    Gaps
    No published KV cache work (no MLA, KV eviction, or vector-quantized KV papers)
    …click to see all
    SS

    Strahinja Stamenkovic

    medium hireability
    52
    Weight Compression80
    Inference-Aware Architecture72
    Weight Streaming Efficiency35
    KV Cache Optimization20
    Strengths
    bitsandbytes ROCm: blocksize-32 4-bit GEMV kernels from scratch
    kgemm_4bit_inference_naive ROCm optimization: 5.13x vLLM throughput
    Gaps
    No KV cache work (MLA, GQA, KV eviction) in any visible contributions
    …click to see all
    TZ

    Ted Zadouri

    medium hireability
    63
    KV Cache Optimization92
    Inference-Aware Architecture88
    Weight Streaming Efficiency45
    Weight Compression25
    Strengths
    "Hardware-Efficient Attention" (2505.21487): GTA cuts KV cache 50% vs GQA
    GLA matches MLA quality with 2× speed over FlashMLA in speculative decode
    Gaps
    No published work on weight compression or microscaling
    …click to see all
    TC

    Tianle Cai

    medium hireability

    Graduate Research Assistant@Princeton University

    Previously: AI Researcher @ Together AI

    Princeton, US

    78
    KV Cache Optimization97
    Weight Compression85
    Inference-Aware Architecture72
    Weight Streaming Efficiency58
    Strengths
    CommVQ: vector-quantized KV cache, 87.5% reduction at 2-bit (2025)
    SnapKV: KV eviction, 306 citations — core KV compression contribution
    Gaps
    No explicit hardware inference chip co-design (power/packaging constraints)
    …click to see all
    TB

    Tijmen Blankevoort

    medium hireability

    Researcher@Meta

    Previously: Researcher @ Qualcomm

    59
    Weight Compression98
    Inference-Aware Architecture72
    Weight Streaming Efficiency55
    KV Cache Optimization10
    Strengths
    ADAROUND (857 citations) — seminal post-training weight quantization
    ParetoQ (NeurIPS 2025) — extreme 1–4 bit LLM quantization framework
    Gaps
    No KV cache quantization or eviction work found
    …click to see all
    TS

    Ting Song

    medium hireability
    59
    Weight Compression92
    Weight Streaming Efficiency75
    Inference-Aware Architecture65
    KV Cache Optimization5
    Strengths
    Lead maintainer microsoft/BitNet; 14 commits including 916-line GEMM kernel
    Sparse-BitNet (2026): 1.58-bit + N:M sparsity — direct weight compression paper
    Gaps
    No KV cache work (MLA, eviction, quantized KV) — axis 1 blind spot
    …click to see all
    TR

    Triang-jyed-driung

    medium hireability
    29
    Inference-Aware Architecture65
    KV Cache Optimization35
    Weight Streaming Efficiency10
    Weight Compression5
    Strengths
    Albatross: RWKV inference engine — 10K+ token/s on RTX5090 with fp16 + CUDAGraph
    Rapid-Sampling: 1.5–25x faster than FlashInfer via CUDA 128-bit vectorization
    Gaps
    No direct work on MLA, vector-quantized KV cache, or KV eviction strategies
    …click to see all
    VE

    VED

    medium hireability
    23
    Weight Compression52
    Inference-Aware Architecture25
    Weight Streaming Efficiency12
    KV Cache Optimization3
    Strengths
    MXFP4 end-to-end integration in Axolotl AI training stack
    8 merged bitsandbytes PRs — 4-bit/8-bit quantization infrastructure
    Gaps
    No research publications — pure engineering role, not researcher
    …click to see all
    VT

    Vithursan Thangarasa

    medium hireability

    Principal Research Scientist@Cerebras Systems

    Previously: Lead Research Scientist @ Cerebras Systems

    San Francisco, US

    55
    Weight Compression80
    Inference-Aware Architecture65
    Weight Streaming Efficiency60
    KV Cache Optimization15
    Strengths
    SPDF: sparse pre-training for LLMs — weight compression via sparsity (44 citations)
    REAP: MoE pruning compression, one-shot (2025)
    Gaps
    No direct KV cache work (MLA, vector-quantized KV, eviction)
    …click to see all
    WM

    Weikang Meng

    medium hireability

    Ph.D. Student@Harbin Institute of Technology, Shenzhen

    Shenzhen, CN

    23
    KV Cache Optimization55
    Inference-Aware Architecture25
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    STILL: token selection to linearize LLMs — reduces full KV attention scope (Feb 2026)
    PolaFormer ICLR 2025 — polarity-aware linear attention, 26 citations
    Gaps
    No weight compression or quantization work
    WS

    Weixuan Sun

    medium hireability

    Researcher@Tencent

    Previously: PhD student @ Australian National University

    Blacksburg, US

    32
    KV Cache Optimization70
    Inference-Aware Architecture48
    Weight Streaming Efficiency6
    Weight Compression5
    Strengths
    Lightning Attention (ICML 2024): first linear attention with constant memory at any seq length
    HGRN2 (COLM 2024): recurrent state expansion — O(1) KV-like memory during decode
    Gaps
    No work specifically on MLA, vector-quantized KV cache, or KV eviction
    WS

    William Andrew Simon

    medium hireability

    Research Scientist on In-Memory Computing@IBM

    Previously: PhD student @ EPFL - EPF Lausanne

    Zurich, CH

    48
    Inference-Aware Architecture82
    Weight Compression68
    Weight Streaming Efficiency35
    KV Cache Optimization5
    Strengths
    Analog IMC accelerators for transformer LLMs — directly inference-chip-aware (2023)
    MoE + 3D analog in-memory scaling of LLMs — chip-level weight efficiency (2025)
    Gaps
    No evidence of KV cache, MLA, or KV eviction work
    XC

    Xiangxiang Chu

    medium hireability

    Senior Director@Alibaba

    Previously: Senior Technical Manager @ Meituan

    Beijing, CN

    48
    Inference-Aware Architecture78
    Weight Compression75
    Weight Streaming Efficiency30
    KV Cache Optimization10
    Strengths
    FPTQ + Norm Tweaking + Speed Odyssey: three LLM quantization deployment papers
    EfficientRep: hardware-aware CNN design explicitly optimized for inference chips
    Gaps
    No KV cache optimization work (MLA, KV eviction, vector-quantized KV cache)
    XD

    Xin Dong

    medium hireability

    Research Scientist@NVIDIA

    Previously: Research Scientist @ Sony

    76
    Weight Compression88
    Inference-Aware Architecture82
    KV Cache Optimization80
    Weight Streaming Efficiency52
    Strengths
    LaCache (2025): novel ladder-shaped KV cache reducing long-context memory
    Hymba (NVlabs): hybrid Mamba-Transformer inference-efficient architecture
    Gaps
    No explicit work on MLA or vector-quantized KV cache (LaCache is KV eviction style)
    XY

    Xingkai Yu

    medium hireability
    65
    Inference-Aware Architecture82
    KV Cache Optimization78
    Weight Compression50
    Weight Streaming Efficiency48
    Strengths
    DeepSeek-V3 MLA implementation — core KV cache compression technique
    nano-vllm (13.3K stars) — paged KV cache, prefix caching, chunked prefill
    Gaps
    No published academic papers — practitioner/engineer, not researcher
    XW

    Xin Wang

    medium hireability

    Director, Machine Learning@d-Matrix

    Previously: Principal Scientist & Manager, Machine Learning Research @ Cerebras Systems

    San Francisco, US

    69
    Weight Compression85
    Inference-Aware Architecture85
    Weight Streaming Efficiency55
    KV Cache Optimization50
    Strengths
    Flexpoint: hardware-aware adaptive numerical format for DNN inference (363 cites)
    ResQ: mixed-precision LLM quantization, low-rank residuals (2025)
    Gaps
    No direct MLA or vector-quantized KV cache research
    XY

    Xinyu Yang

    medium hireability

    Ph.D. Candidate@Carnegie Mellon University

    Previously: Research Intern @ Stanford University

    62
    KV Cache Optimization88
    Inference-Aware Architecture68
    Weight Compression52
    Weight Streaming Efficiency38
    Strengths
    TriForce (COLM 2024): hierarchical speculative decoding cuts KV cache overhead
    LESS (ICML 2024): recurrence + KV cache compression for efficient inference
    Gaps
    No explicit work on MLA or vector-quantized KV cache specifically
    XS

    XsquirrelC

    medium hireability
    50
    Weight Compression75
    Weight Streaming Efficiency68
    Inference-Aware Architecture52
    KV Cache Optimization3
    Strengths
    Merged PR #379 to microsoft/BitNet — 3,337-line CPU inference optimization
    GEMM kernel config for ternary/1.58-bit weights (gemm-config.h)
    Gaps
    No KV cache work — MLA, eviction, or vector quantization absent
    XL

    Xuan Liao

    medium hireability
    39
    KV Cache Optimization55
    Inference-Aware Architecture45
    Weight Compression35
    Weight Streaming Efficiency20
    Strengths
    INT8/FP8 QSDPA for CPU: 15 commits to pytorch/ao, including full INT8 SDPA path
    sgl-kernel-xpu: flash attention with paged KV cache for Intel BMG580 XPU
    Gaps
    No MLA, vector-quantized KV cache, or KV eviction work specifically
    YF

    Yaosheng Fu

    medium hireability

    Member of the Architecture Research Team@NVIDIA

    Previously: PhD student @ Princeton University

    60
    KV Cache Optimization88
    Inference-Aware Architecture78
    Weight Compression42
    Weight Streaming Efficiency30
    Strengths
    RocketKV (2025): two-stage KV cache compression for long-context LLM inference
    AutoScratch (MLSys 2023): ML-optimized scratch cache management for inference GPUs
    Gaps
    No direct weight streaming bandwidth reduction or weight sharing papers found
    YL

    Yi Liu

    medium hireability

    AI Frameworks Engineer@Intel

    Previously: Assistant Engineer @ National University of Defense Technology

    43
    Weight Compression72
    KV Cache Optimization45
    Inference-Aware Architecture35
    Weight Streaming Efficiency20
    Strengths
    "Optimize Weight Rounding via SGD" (2024, 29 cites) — weight compression paper
    PrefixQuant (2025) — static outlier-aware LLM quantization
    Gaps
    No original KV cache research — blog analysis, not novel contribution
    YZ

    Yilong Zhao

    medium hireability

    Ph.D. student@University of California, Berkeley

    Previously: Research Intern @ ByteDance

    Berkeley, US

    76
    KV Cache Optimization95
    Weight Compression85
    Inference-Aware Architecture75
    Weight Streaming Efficiency50
    Strengths
    DeepSeek-V2 co-author — pioneered MLA for KV cache compression
    Quest (ICML'24): query-aware KV eviction, 174 citations
    Gaps
    Year-1 PhD (2024) — 4-5 years until completion
    YH

    Yingbing Huang

    medium hireability

    Ph.D. Candidate@University of Illinois, Urbana Champaign

    Urbana, US

    41
    KV Cache Optimization80
    Inference-Aware Architecture65
    Weight Compression10
    Weight Streaming Efficiency10
    Strengths
    SnapKV NeurIPS'24 (314 citations) — co-author on KV eviction
    10 commits to FasterDecoding/SnapKV — hands-on implementation
    Gaps
    No work on weight compression or quantization
    YI

    YinHanke

    medium hireability
    26
    Weight Compression50
    Inference-Aware Architecture35
    KV Cache Optimization15
    Weight Streaming Efficiency5
    Strengths
    MNN PR #4336: SmoothQuant + OmniQuant for Qwen3.5 mixed attention export
    Familiar with quantized model export in production inference engine (Alibaba MNN)
    Gaps
    No published research — implementation engineer, not a researcher
    YF

    Yonggan Fu

    medium hireability

    Research Scientist@NVIDIA

    Previously: Research Intern @ NVIDIA

    San Francisco, US

    69
    Inference-Aware Architecture92
    KV Cache Optimization82
    Weight Compression75
    Weight Streaming Efficiency25
    Strengths
    LaCache (2025): novel KV caching scheme for long-context LLMs
    Hymba: hybrid-head small LM, ICLR 2025 Spotlight — inference-optimized arch
    Gaps
    No direct work on MLA or vector-quantized KV cache
    YW

    Yong Wu

    medium hireability
    36
    Inference-Aware Architecture65
    KV Cache Optimization55
    Weight Streaming Efficiency15
    Weight Compression10
    Strengths
    60 commits to flashinfer-ai/flashinfer — KV cache attention kernel library
    Bio: ML Compiler + FlashInfer LLM co-design — hardware-aware framing
    Gaps
    No evidence of MLA, vector-quantized KV, or KV eviction algorithm work
    YB

    Younes Belkada

    medium hireability

    MS student@ENS Paris Saclay

    Previously: Researcher @ Technology Innovation Institute

    Paris, FR

    53
    Weight Compression85
    Inference-Aware Architecture55
    Weight Streaming Efficiency40
    KV Cache Optimization30
    Strengths
    GPT3.int8() co-author — 1556 citations, foundational quantization work
    1.58-bit fine-tuning integration into axolotl via onebitllms (Apr 2026)
    Gaps
    No direct MLA, vector-quantized KV, or KV eviction work
    YK

    Young D. Kwon

    medium hireability

    Research Scientist@Samsung

    Previously: PHD Student @ University of Cambridge

    Cambridge, GB

    34
    Weight Compression62
    Inference-Aware Architecture45
    Weight Streaming Efficiency22
    KV Cache Optimization8
    Strengths
    HierarchicalPrune: position-aware diffusion model compression (AAAI 2026)
    SpecVocab speculative decoding — commercialized on Samsung Galaxy S26
    Gaps
    No KV cache work — MLA, vector-quantized KV, or eviction strategies absent
    YL

    Yuming Lou

    medium hireability
    43
    Weight Compression65
    Inference-Aware Architecture55
    KV Cache Optimization35
    Weight Streaming Efficiency15
    Strengths
    MIT HAN Lab intern — AWQ (MLSys 2024 Best Paper) contributor
    Tsinghua quantization algorithms research under Yu Wang
    Gaps
    No original research on MLA, vector-quantized KV cache, or KV eviction
    YL

    Yun Li

    medium hireability

    Technical Expert@Huawei

    Previously: Senior Algorithm Researcher @ Tencent

    Shanghai, CN

    46
    Weight Compression78
    Weight Streaming Efficiency48
    Inference-Aware Architecture48
    KV Cache Optimization10
    Strengths
    AMS-Quant: novel FP4.25/FP5.33 weight quantization — reduces bits-per-weight directly
    CUDA kernels in AMS-Quant minimize memory access; 2.8–3.2× speedup vs FP16
    Gaps
    No KV cache work — MLA, vector-quantized KV, and KV eviction absent from profile
    YP

    Yuqi Pan

    medium hireability

    PhD student@Institute of Automation, Chinese Academy of Sciences

    Previously: Undergrad student @ Nanjing University

    Beijing, CN

    26
    KV Cache Optimization60
    Inference-Aware Architecture35
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    MetaLA: KV-free linear attention via fixed recurrent state (18 citations)
    Contributor to fla-org/flash-linear-attention — includes MLA implementation
    Gaps
    No weight compression work (quantization, microscaling, topological regularization)
    YZ

    Yuxuan Zhu

    medium hireability

    PhD Student@Rensselaer Polytechnic Institute

    Previously: Graduation thesis/internship (data-driven nozzle failure detection and classification) @ Canon Production Printing

    Troy, US

    32
    KV Cache Optimization82
    Inference-Aware Architecture35
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    SentenceKV (COLM 2025) — sentence-level KV cache compression
    OjaKV (ACL 2026) — online low-rank KV cache compression
    Gaps
    No hardware/chip-level architecture co-design work
    YZ

    Yu Zhang

    medium hireability

    Full-time Researcher@Moonshot AI

    Previously: Research Intern @ Tencent

    CN

    53
    Inference-Aware Architecture88
    KV Cache Optimization75
    Weight Streaming Efficiency30
    Weight Compression20
    Strengths
    Kimi Linear (2025): 75% KV cache reduction via hybrid linear attention (KDA)
    1117 commits to flash-linear-attention — Triton hardware-efficient kernel library
    Gaps
    No direct work on MLA, vector-quantized KV, or KV eviction specifically
    ZC

    Zefan Cai

    medium hireability

    TikTok Shop Account Manager@Pattern

    Previously: Business Analyst @ Xiaohongshu

    Lehi, US

    31
    KV Cache Optimization90
    Inference-Aware Architecture20
    Weight Compression8
    Weight Streaming Efficiency5
    Strengths
    PyramidKV (NeurIPS 2024, 148 citations) — layer-wise KV cache compression
    R-KV (NeurIPS 2025) — KV compression for reasoning models
    Gaps
    No published work on weight compression or weight streaming efficiency
    ZW

    Zeyu WANG

    medium hireability
    47
    KV Cache Optimization85
    Inference-Aware Architecture78
    Weight Streaming Efficiency15
    Weight Compression8
    Strengths
    4 merged PRs to deepseek-ai/FlashMLA — MLA KV cache decoding kernels
    PR #76: full Blackwell SM100 architecture MLA kernel support (+11K lines)
    Gaps
    No weight compression or microscaling work found
    Z(

    zhang (dianzhangchen)

    medium hireability
    51
    KV Cache Optimization88
    Inference-Aware Architecture85
    Weight Compression20
    Weight Streaming Efficiency12
    Strengths
    deepseek-ai/FlashMLA: 2 merged PRs including TMA pipeline optimization
    NVIDIA/cutlass PR #2472: Blackwell MLA forward kernel merged (3323 LOC)
    Gaps
    No evidence of weight streaming / bandwidth reduction work
    ZW

    Zhangyang Wang

    medium hireability

    Senior Research Scientist@Meta

    Previously: Assistant Professor @ University of Texas at Austin

    80
    KV Cache Optimization93
    Weight Compression92
    Inference-Aware Architecture72
    Weight Streaming Efficiency62
    Strengths
    H2O KV eviction — co-author of seminal KV heavy-hitter eviction paper
    Q-Hitter: sparse-quantized KV cache for efficient LLM inference
    Gaps
    No specific MLA or vector-quantized KV cache work found
    ZX

    Zhenda Xie

    medium hireability

    AI Researcher@DeepSeek AI

    Previously: Joint PhD & Fulltime Intern @ Microsoft

    Beijing, CN

    61
    KV Cache Optimization92
    Inference-Aware Architecture78
    Weight Streaming Efficiency38
    Weight Compression36
    Strengths
    MLA in DeepSeek-V2: 93.3% KV cache reduction (core designer)
    NSA paper: hardware-aligned sparse attention for modern inference chips
    Gaps
    No explicit work on vector-quantized KV cache or KV eviction strategies
    ZD

    Zhen Dong

    medium hireability

    Senior/Staff Research Scientist@NVIDIA

    Previously: Founding Member @ Nexusflow

    San Francisco, US

    71
    Weight Compression95
    Inference-Aware Architecture80
    KV Cache Optimization65
    Weight Streaming Efficiency45
    Strengths
    R-KV (2025): KV cache compression for reasoning models — direct search match
    SqueezeLLM: dense-and-sparse weight quantization for LLM decode (323 citations)
    Gaps
    No explicit MLA / multi-head latent attention architecture work found
    ZQ

    Zhen Qin

    medium hireability

    Staff Research Scientist@DeepMind

    Previously: Researcher @ TapTap

    New York, US

    38
    Inference-Aware Architecture65
    KV Cache Optimization58
    Weight Streaming Efficiency18
    Weight Compression10
    Strengths
    Lightning Attention-2: fixed-size recurrent KV state, hardware-efficient kernels
    HGRN2: gated linear RNN — bounded memory at inference, state expansion
    Gaps
    No specific MLA, vector-quantized KV cache, or KV eviction work
    ZY

    Zhihang Yuan

    medium hireability

    Algorithm Researcher@Bytedance

    Previously: Researcher @ Infinigence AI

    Beijing, CN

    80
    Weight Compression96
    KV Cache Optimization85
    Inference-Aware Architecture72
    Weight Streaming Efficiency65
    Strengths
    WKVQuant: joint weight + KV cache quantization (2024)
    SKVQ: sliding-window KV cache quant — eviction-adjacent (2024)
    Gaps
    No direct MLA or attention architecture design (KV work is compression-focused)
    ZL

    Zhiyuan Li

    medium hireability
    60
    KV Cache Optimization85
    Inference-Aware Architecture78
    Weight Streaming Efficiency70
    Weight Compression5
    Strengths
    Kimi Linear co-author: 75% KV reduction, 6× TPOT vs MLA at 1M context
    252 commits to flash-linear-attention — core KDA kernel maintainer
    Gaps
    No weight compression work — quantization/pruning absent from profile
    ZC

    Zhuoming Chen

    medium hireability

    Ph.D. student@Carnegie Mellon University

    Previously: Research Intern @ Meta

    New York, US

    36
    KV Cache Optimization75
    Inference-Aware Architecture42
    Weight Streaming Efficiency18
    Weight Compression8
    Strengths
    MagicPIG: LSH-based KV cache eviction for efficient LLM generation
    TriForce: KV-cache hierarchical draft — directly manages KV at inference
    Gaps
    No weight compression work (quantization, pruning, topoLM-style approaches)
    ZY

    Zihao Ye

    medium hireability

    Engineer@NVIDIA

    Previously: Intern @ NVIDIA

    San Francisco, US

    74
    KV Cache Optimization95
    Weight Compression80
    Inference-Aware Architecture75
    Weight Streaming Efficiency45
    Strengths
    FlashInfer creator (948 commits): paged KV cache, MLA, eviction kernels — exact match
    MagicPIG paper: LSH-based KV approximation for efficient generation
    Gaps
    Kernel/systems engineer focus — not model architecture designer per se
    ZZ

    ZZK

    medium hireability
    50
    KV Cache Optimization80
    Inference-Aware Architecture50
    Weight Compression38
    Weight Streaming Efficiency30
    Strengths
    FlashMLA PR #162 — direct KV cache compression kernel work
    FlashInfer PR #3224 — MoE kernel memory access optimization
    Gaps
    No LinkedIn; limited hireability signal
    AS

    Abu Sebastian

    low hireability

    Manager, AI Compute Frontiers (DRSM)@IBM

    Previously: Principal Research Staff Member @ IBM

    Zurich, CH

    58
    Inference-Aware Architecture85
    Weight Streaming Efficiency75
    Weight Compression65
    KV Cache Optimization8
    Strengths
    NAS for in-memory computing accelerators — hardware-aware architecture co-design
    Efficient LLM scaling with MoE + 3D analog in-memory (Nature Comp Sci 2025)
    Gaps
    No KV cache optimization work (MLA, vector-quantized KV, KV eviction)
    AH

    Ahmed Hasssan

    low hireability

    MTS Software Development Engineer@AMD

    Previously: Graduate Student Research Assistant @ Cornell Tech

    Pueblo, US

    50
    Weight Compression82
    Inference-Aware Architecture78
    Weight Streaming Efficiency30
    KV Cache Optimization10
    Strengths
    CiM inference chip co-design — 4-year PhD focus at Cornell Seo lab
    Torch2Chip: DNN compression + hardware accelerator deployment toolkit
    Gaps
    No KV cache work: no evidence of MLA, vector-quantized KV, or eviction strategies
    AJ

    AJAY KUMAR JAISWAL

    low hireability

    Researcher@Apple

    Previously: PHD Scholar @ The University of Texas at Austin

    Seattle, US

    33
    Weight Compression85
    Weight Streaming Efficiency30
    Inference-Aware Architecture15
    KV Cache Optimization0
    Strengths
    OWL: outlier-weighted layerwise sparsity for LLMs (107 citations, ICML 2024)
    WeLore: low-rank weight compression from gradient stabilization (ICML 2025)
    Gaps
    No KV cache optimization work (MLA, vector quantization, eviction) found
    AY

    Alex Yang

    low hireability
    48
    KV Cache Optimization65
    Weight Compression55
    Inference-Aware Architecture55
    Weight Streaming Efficiency15
    Strengths
    Core FlashInfer maintainer — MLA KV cache kernel contributor
    TRT-LLM paged attention kernel work (KV eviction/management)
    Gaps
    No published research — inference kernel engineer, not architecture researcher
    AW

    Alvin Wan

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Senior Research Scientist @ Apple

    San Francisco, US

    49
    Weight Compression80
    Inference-Aware Architecture65
    Weight Streaming Efficiency45
    KV Cache Optimization5
    Strengths
    'The Super Weight in LLMs' (2025) — outlier-aware quantization block sizing
    UPSCALE channel pruning — 2x inference speedup via structured sparsity
    Gaps
    No KV cache work — MLA, vector-quantized KV, or KV eviction absent
    AP

    Amar Phanishayee

    low hireability

    Sr. Principal Researcher@Microsoft

    Previously: PhD student @ Carnegie Mellon University

    49
    Inference-Aware Architecture72
    Weight Compression60
    KV Cache Optimization45
    Weight Streaming Efficiency20
    Strengths
    DéjàVu (ICML 2024): KV-cache streaming for LLM serving — 45 citations
    Block FP activation compression patents (2020, 2024, 2025) — MX/microscaling-adjacent
    Gaps
    KV work is fault-tolerant streaming, not MLA/vector-quantized KV/eviction
    AM

    Amirkeivan Mohtashami

    low hireability

    Research Scientist@DeepMind

    Previously: Research Scientist @ Google

    Zurich, CH

    65
    Weight Compression88
    KV Cache Optimization82
    Inference-Aware Architecture58
    Weight Streaming Efficiency32
    Strengths
    QuaRot: 4-bit outlier-free LLM inference, NeurIPS 2024 (340 citations)
    Landmark Attention: KV eviction for infinite context, NeurIPS 2023 (223 citations)
    Gaps
    No hardware-aware architecture design for specific inference chips
    AP

    Andrei Panferov

    low hireability

    PhD Student Researcher, ML@ISTA (Institute of Science and Technology Austria)

    Previously: Senior ML Engineer @ Wildberries

    Vienna, AT

    46
    Weight Compression95
    Inference-Aware Architecture55
    Weight Streaming Efficiency30
    KV Cache Optimization5
    Strengths
    AQLM: extreme additive quantization, 141 citations
    Quartet II: NVFP4 pre-training for NVIDIA Blackwell — inference-chip aware
    Gaps
    No KV cache optimization work (MLA, KV eviction) visible
    AG

    Anerudhan Gopal

    low hireability
    41
    Inference-Aware Architecture72
    KV Cache Optimization65
    Weight Compression20
    Weight Streaming Efficiency8
    Strengths
    Ragged KV Cache cuDNN backend wrapper for FlashInfer — direct KV cache work
    FP8 Q+KV attention via cuDNN — quantized KV cache at inference time
    Gaps
    No evidence of MLA, KV eviction algorithms, or vector-quantized KV cache
    AG

    Ankit Gupta

    low hireability

    Research Scientist@IBM

    Previously: Research Scientist @ IBM

    Boston, US

    25
    KV Cache Optimization60
    Inference-Aware Architecture28
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    DSS (NeurIPS 2022, 511 cit) — SSMs eliminate KV cache entirely
    Gated State Spaces (ICLR 2023, 409 cit) — KV-free LM architecture
    Gaps
    No weight compression or quantization work (microscaling, topological regularization)
    AR

    Ankit Singh Rawat

    low hireability

    Senior Staff Research Scientist@DeepMind

    Previously: Staff Research Scientist @ DeepMind

    New York, US

    35
    KV Cache Optimization68
    Inference-Aware Architecture30
    Weight Compression22
    Weight Streaming Efficiency20
    Strengths
    Low-Rank Bottleneck paper (2020) — foundational insight underlying MLA KV compression
    GLA analysis (ICLR 2025) — directly addresses O(1) KV cache via gated linear attention
    Gaps
    No hardware-aware or chip co-design work — inference efficiency is algorithm-focused only
    AS

    Anshumali Shrivastava

    low hireability

    Founder and Board Chairman@ThirdAI

    Previously: CEO @ ThirdAI

    73
    KV Cache Optimization92
    Weight Compression90
    Inference-Aware Architecture58
    Weight Streaming Efficiency52
    Strengths
    Scissorhands (291 cites) — KV cache eviction, directly on-target
    "KV Cache is 1 Bit Per Channel" — extreme KV quantization at inference
    Gaps
    No published chip co-design work (power/bandwidth/packaging constraints)
    AB

    Artem Bolshakov

    low hireability

    Researcher@QualComm

    Previously: PhD student @ University of Toronto

    US

    26
    Weight Compression55
    Inference-Aware Architecture30
    Weight Streaming Efficiency20
    KV Cache Optimization0
    Strengths
    GPTVQ: VQ quantization for LLMs, SotA model size vs accuracy trade-off
    GPTVQ targets DRAM + latency reduction on ARM CPU and Nvidia GPU inference
    Gaps
    h-index of 1; single ML paper as 1 of 9 authors — individual contribution unclear
    AQ

    Aurick Qiao

    low hireability

    Member of Technical Staff@Thinking Machines Lab

    Previously: AI Researcher @ Snowflake

    Seattle, US

    44
    KV Cache Optimization82
    Inference-Aware Architecture55
    Weight Compression32
    Weight Streaming Efficiency5
    Strengths
    TALE (TACL 2025): low-rank KV cache approximation with reconstruction elimination
    SwiftKV: skips later-layer KV prefill — 25-50% prefill FLOP reduction
    Gaps
    No work on MLA, vector-quantized KV, or KV eviction strategies
    BO

    Barlas Oguz

    low hireability

    Research Scientist@Meta

    Previously: Senior Data Scientist @ Microsoft

    San Francisco, US

    52
    Weight Compression82
    KV Cache Optimization60
    Inference-Aware Architecture35
    Weight Streaming Efficiency30
    Strengths
    LLM-QAT: 4-bit QAT for LLMs with explicit KV cache quantization (414 citations)
    BiT: fully binarized transformers — extreme compression (94 citations)
    Gaps
    No explicit hardware inference chip co-design (power/bandwidth constraints)
    BC

    Beidi Chen

    low hireability

    Assistant Professor of Electrical and Computer Engineering@Carnegie Mellon University

    Previously: Researcher @ Meta

    Pittsburgh, US

    81
    KV Cache Optimization97
    Inference-Aware Architecture82
    Weight Compression74
    Weight Streaming Efficiency72
    Strengths
    H2O: canonical KV eviction oracle — 689 citations, NeurIPS 2023
    StreamingLLM: attention-sink infinite-context KV — 998 citations
    Gaps
    CMU assistant professor — technically triggers 'no professors' constraint
    BR

    Bita Rouhani

    low hireability

    Distinguished Engineer@NVIDIA

    Previously: Partner Group Manager @ Microsoft

    Seattle, US

    85
    Weight Compression95
    Inference-Aware Architecture90
    KV Cache Optimization82
    Weight Streaming Efficiency72
    Strengths
    OCP MX spec co-author (arXiv:2310.10537) — exact Neuralace-cited standard
    Key, Value, Compress (2025) — comprehensive KV cache compression coverage
    Gaps
    No published work specifically on MLA or KV eviction strategies
    BD

    Boyu Diao

    low hireability

    Senior Research Engineer@Institute of Computing Technology, Chinese Academy of Sciences

    Previously: Assistant Professor @ Institute of Computing Technology, Chinese Academy of Sciences

    Beijing, CN

    34
    Weight Compression75
    Inference-Aware Architecture38
    Weight Streaming Efficiency18
    KV Cache Optimization5
    Strengths
    MPQ-DM (AAAI'25): extremely low-bit (2-4 bit) mixed-precision quantization
    Q-VDiT (ICML'25): W3A6 quant for video DiT, 1.9× SOTA improvement
    Gaps
    No KV cache work (MLA, vector-quantized KV, KV eviction)
    BZ

    Bo Zhang

    low hireability

    VLM Tech Lead@RobotEra

    Previously: Algorithm Strategist @ Meituan

    Beijing, CN

    44
    Weight Compression85
    Inference-Aware Architecture55
    Weight Streaming Efficiency30
    KV Cache Optimization5
    Strengths
    FPTQ + Integer Scale + Norm Tweaking — 3 LLM post-training quantization papers
    MobileVLM: explicitly hardware-constrained on-device VLM (2023-2024)
    Gaps
    No KV cache work — no MLA, KV eviction, or cache compression papers
    BK

    Byeongwook Kim

    low hireability

    Leader@NAVER

    Previously: Technical Leader @ NAVER

    Gyeonggi, KR

    65
    Weight Compression90
    KV Cache Optimization75
    Inference-Aware Architecture65
    Weight Streaming Efficiency30
    Strengths
    "No Token Left Behind": MiKV KV eviction + mixed-precision quantization (2024)
    LUT-GEMM: lookup-table weight quantization, 184 citations
    Gaps
    South Korea location — not in specified geographies (USA/Europe/China/India)
    CW

    Carole-Jean Wu

    low hireability

    Director of AI Research@Meta

    Previously: Professor with tenure @ Arizona State University

    55
    Inference-Aware Architecture82
    KV Cache Optimization65
    Weight Compression45
    Weight Streaming Efficiency28
    Strengths
    CHAI: clustered attention heads cut KV cache footprint at LLM inference time (2024)
    LayerSkip: early exit + self-speculative decoding for LLM efficiency (153 citations)
    Gaps
    No direct work on MLA, vector-quantized KV cache, or KV eviction strategies
    CB

    Charlie Blake

    low hireability

    AI research engineer@Graphcore

    Previously: MS student @ University of Oxford

    78
    Inference-Aware Architecture82
    Weight Compression78
    KV Cache Optimization78
    Weight Streaming Efficiency72
    Strengths
    SparQ Attention (ICML 2024) — sparse KV retrieval cuts bandwidth ~8x
    8-bit FP inference (NeurIPS 2023 Oral) — published weight compression work
    Gaps
    Now at OpenAI MTS — very high comp/mission retention, hard to recruit
    CL

    Cheng Luo

    low hireability

    researcher@TikTok

    Previously: postdoctoral researcher @ Caltech

    43
    KV Cache Optimization80
    Inference-Aware Architecture38
    Weight Compression32
    Weight Streaming Efficiency20
    Strengths
    R-KV: redundancy-aware KV cache compression for reasoning models (2025)
    HeadInfer: head-wise KV offloading for memory-efficient inference (2025)
    Gaps
    No evidence of inference chip co-design or power/bandwidth-aware model architecture
    CZ

    Chen Zhang

    low hireability

    Assistant Professor@Shanghai Jiao Tong University

    Previously: Chip Architect @ Alibaba

    Shanghai, CN

    81
    Inference-Aware Architecture90
    Weight Compression88
    Weight Streaming Efficiency75
    KV Cache Optimization72
    Strengths
    H2-LLM (ISCA 2025) — hardware-dataflow co-exploration for LLM inference on custom chips
    OliVe (ISCA 2023, 168 cit.) — hardware-friendly outlier-victim pair quantization
    Gaps
    Tenure-track assistant professor — "No professors" flag in search query
    CC

    Chi-Chih Chang

    low hireability

    Ph.D. Student@Cornell University

    Previously: Remote Intern @ University of Washington

    66
    KV Cache Optimization95
    Weight Compression78
    Inference-Aware Architecture55
    Weight Streaming Efficiency35
    Strengths
    Palu (ICLR 2025): low-rank KV-cache compression — first-author
    xKV (2025): cross-layer SVD KV-cache sharing — first-author
    Gaps
    No MLA or vector-quantized KV cache work — focuses on SVD/low-rank projection
    CR

    Chong Ruan

    low hireability

    Researcher@DeepSeek

    Previously: MS student @ Peking University

    69
    KV Cache Optimization90
    Inference-Aware Architecture82
    Weight Compression55
    Weight Streaming Efficiency48
    Strengths
    DeepSeek-V2: MLA reduces KV cache 93.3%, throughput 5.76x
    DeepSeek-V3 Technical Report: FP8 training + continued MLA architecture
    Gaps
    No direct work on topological weight regularization or block-sparsity for compression
    CP

    Christian Puhrsch

    low hireability

    Researcher@Meta

    Previously: MS student @ New York University

    37
    Weight Compression78
    Weight Streaming Efficiency38
    Inference-Aware Architecture28
    KV Cache Optimization5
    Strengths
    TorchAO author — INT4/INT8/FP8/MXFP quantization, 2:4 sparsity (ICML 2025)
    73 PRs to pytorch/ao; 70+ commits — core team contributor
    Gaps
    No KV cache work — no MLA, vector-quantized KV, or KV eviction signals
    CL

    Christos Louizos

    low hireability

    PhD Candidate@University of Amsterdam

    Previously: Research Intern @ Qualcomm

    Amsterdam, NL

    38
    Weight Compression90
    Inference-Aware Architecture30
    Weight Streaming Efficiency25
    KV Cache Optimization5
    Strengths
    ADAROUND (ICML 2020, 857 cit) — landmark post-training quantization
    Bayesian Compression for DL (NeurIPS 2017, 620 cit) — weight compression identity
    Gaps
    No KV cache work — no MLA, eviction, or vector-quantized KV cache evidence
    CS

    Christos Sourmpis

    low hireability

    Research Scientist@Huawei

    Previously: Research And Development Engineer @ SynSense

    Zurich, CH

    38
    KV Cache Optimization75
    Inference-Aware Architecture45
    Weight Streaming Efficiency20
    Weight Compression10
    Strengths
    "When Perplexity Lies": 75% KV cache reduction via hybrid SSM distillation (2026)
    AllMem: sliding-window + TTT memory hybrid, 128K context efficiency (2026)
    Gaps
    No explicit chip-constraint co-design (power/bandwidth/packaging)
    DD

    Damai Dai

    low hireability

    Researcher@DeepSeek AI

    Previously: PhD student @ Peking University

    54
    KV Cache Optimization95
    Inference-Aware Architecture85
    Weight Streaming Efficiency20
    Weight Compression15
    Strengths
    MLA in DeepSeek-V2: 93.3% KV cache reduction — landmark inference innovation
    NSA (2025): hardware-aligned sparse attention, proven speedups on 64k-token decode
    Gaps
    No direct weight streaming / weight bandwidth reduction work found
    DB

    Davis Blalock

    low hireability

    Research Scientist@DeepMind

    Previously: Research Scientist @ Databricks

    San Francisco, US

    41
    Weight Compression80
    Inference-Aware Architecture32
    KV Cache Optimization28
    Weight Streaming Efficiency22
    Strengths
    'What is the State of Neural Network Pruning?' — landmark survey, 34+ citations
    'Multiplying Matrices Without Multiplying' (2021) — matrix approximation via quantization
    Gaps
    No direct KV cache paper (MLA, KV eviction, KV quantization) — adjacent via VQ
    DY

    Dejian Yang

    low hireability

    Researcher@DeepSeek AI

    Previously: Researcher @ Microsoft

    52
    KV Cache Optimization95
    Inference-Aware Architecture82
    Weight Compression18
    Weight Streaming Efficiency12
    Strengths
    MLA originator — 93.3% KV cache reduction in DeepSeek-V2
    MLA adopted in DeepSeek-V3 (671B) and V3.2 — production-proven at scale
    Gaps
    No published weight compression work (microscaling, quantization-aware topology)
    DC

    Deli Chen

    low hireability

    Researcher@DeepSeek AI

    Previously: Research Intern, WeChat AI @ Tencent

    Beijing, CN

    49
    KV Cache Optimization92
    Inference-Aware Architecture80
    Weight Streaming Efficiency15
    Weight Compression10
    Strengths
    DeepSeek-V2 co-author — invented MLA (93.3% KV cache reduction)
    DeepSeek-V3 co-author — MLA adopted in 671B production model
    Gaps
    No published work on weight compression or quantization
    DG

    Denis A Gudovskiy

    low hireability

    Senior Deep Learning Researcher@Panasonic

    Previously: Senior Wireless Engineer @ Intel

    San Francisco, US

    38
    Weight Compression70
    Inference-Aware Architecture55
    Weight Streaming Efficiency20
    KV Cache Optimization8
    Strengths
    ShiftCNN: multiplierless low-precision CNN inference (74 citations, 2017)
    DNN Feature Map Compression: bandwidth-reduction via GF(2) (ECCV 2018)
    Gaps
    No KV cache work (MLA, vector quantized KV, eviction) — core gap for axis 1
    DC

    Dhruv Choudhary

    low hireability

    Senior Staff Research Engineer@Meta

    Previously: Senior Tech Lead Manager @ Meta

    San Francisco, US

    53
    Weight Compression88
    Inference-Aware Architecture72
    Weight Streaming Efficiency42
    KV Cache Optimization8
    Strengths
    SpinQuant: LLM quantization via rotations — 201 citations (2025)
    Microscaling MX formats co-author — hardware-aware data format spec
    Gaps
    No KV cache optimization work — MLA, KV eviction, KV quantization absent
    DN

    Dimin Niu

    low hireability

    Research Scientist@Alibaba

    Previously: Senior / Staff Engineer @ Samsung

    San Francisco, US

    51
    Inference-Aware Architecture90
    Weight Streaming Efficiency65
    Weight Compression30
    KV Cache Optimization20
    Strengths
    H-LLM (ISCA 2025): hardware-dataflow co-design for hybrid-bonding LLM inference
    HD-MoE: 3D near-memory processing reduces weight bandwidth for MoE decode
    Gaps
    No KV cache papers (no MLA, KV eviction, vector-quantized KV work found)
    EF

    Elias Frantar

    low hireability

    Member of Technical Staff@OpenAI

    Previously: PHD Candidate @ Institute of Science and Technology Austria

    San Francisco, US

    69
    Weight Compression97
    Weight Streaming Efficiency85
    Inference-Aware Architecture82
    KV Cache Optimization12
    Strengths
    GPTQ (ICLR 2023) — foundational post-training quantization for LLMs
    MARLIN: FP16xINT4 kernel, ~4x decode speedup — directly addresses weight streaming
    Gaps
    No KV cache work (MLA, vector-quantized KV, KV eviction) in publications
    EC

    Eric Chung

    low hireability

    VP of AI Computing@NVIDIA

    Previously: GM & Partner Group Engineering Manager @ Microsoft

    Seattle, US

    66
    Weight Compression95
    Inference-Aware Architecture95
    Weight Streaming Efficiency60
    KV Cache Optimization15
    Strengths
    Microscaling data formats (MX) — number formats co-designed for inference chip constraints
    Shared microexponents (2023): extreme narrow-precision, fewer bits per weight
    Gaps
    No dedicated KV cache research (MLA, eviction strategies, vector-quantized KV)
    FY

    Fan Yang

    low hireability

    Sr. Principal Research Manager@Microsoft

    Previously: Principal Research Manager @ Microsoft

    CN

    91
    Weight Compression95
    Inference-Aware Architecture95
    KV Cache Optimization92
    Weight Streaming Efficiency80
    Strengths
    RetrievalAttention + SeerAttention: KV eviction via sparse/vector retrieval (63 + 48 citations)
    WaferLLM (OSDI 2025): wafer-scale inference-aware LLM architecture
    Gaps
    Hireability low — entrenched senior MSRA role, no market signals
    GJ

    Gaurav Jain

    low hireability

    ML Systems@xAI

    Previously: Technical Director, Software @ d-Matrix

    San Francisco, US

    50
    KV Cache Optimization92
    Inference-Aware Architecture78
    Weight Compression15
    Weight Streaming Efficiency15
    Strengths
    Keyformer (MLSys 2024, 110 citations) — KV eviction, 2.1x latency reduction
    MorphKV (ICML 2025) — constant-sized KV cache, 52.9% memory savings
    Gaps
    No published weight compression or weight streaming work
    GX

    Guangxuan Xiao

    low hireability

    Member of Technical Staff@Thinking Machines Lab

    Previously: Research Intern @ NVIDIA

    USA

    82
    Weight Compression95
    KV Cache Optimization95
    Inference-Aware Architecture82
    Weight Streaming Efficiency55
    Strengths
    StreamingLLM: KV eviction via attention sinks — 1035 citations (ICLR 2024)
    DuoAttention: head-type KV cache reduction — 99 citations (ICLR 2025)
    Gaps
    Joined Thinking Machines Lab mid-2025 — only ~6–12 months in role, low hire window
    GH

    Guyue Huang

    low hireability

    Deep Learning Architect@NVIDIA

    Previously: Community Associate @ International

    US

    52
    Inference-Aware Architecture80
    Weight Streaming Efficiency68
    Weight Compression50
    KV Cache Optimization8
    Strengths
    Shfl-BW (DAC'22): tensor-core-aware weight pruning for inference acceleration
    RM-STC (MICRO'23): GPU sparse tensor core, energy-efficient sparse acceleration
    Gaps
    No KV cache work (MLA, vector-quantized KV, KV eviction) found
    HK

    Han-Byul Kim

    low hireability

    ML Research Engineer@Apple

    Previously: Research Intern @ Apple

    Seattle, US

    53
    Weight Compression75
    KV Cache Optimization60
    Inference-Aware Architecture50
    Weight Streaming Efficiency28
    Strengths
    EpiCache (2025): KV cache management for long conversational QA
    BASQ (ECCV 2022): sub-4-bit quantization via branch-wise NAS
    Gaps
    No evidence of MLA or vector-quantized KV cache techniques specifically
    HH

    Haofeng Huang

    low hireability

    Research Intern@Alibaba

    Previously: Infra Team Member @ ShengShu Technology

    Beijing, CN

    77
    KV Cache Optimization88
    Weight Compression83
    Inference-Aware Architecture80
    Weight Streaming Efficiency55
    Strengths
    SageAttention series: INT4/FP4 quantized attention (ICLR+ICML+NeurIPS 2025 Spotlight)
    SageAttention3: Microscaling FP4 — directly matches search's microscaling constraint
    Gaps
    Incoming PhD at IIIS under Prof. Yao starting Fall 2026 — committed to academic path
    HC

    Hao Mark Chen

    low hireability

    PhD Student@Imperial College London

    Previously: ML Research Intern @ Samsung

    London, GB

    50
    Weight Compression70
    Inference-Aware Architecture65
    Weight Streaming Efficiency55
    KV Cache Optimization10
    Strengths
    Progressive Mixed-Precision Decoding: INT2/3 on NPUs, 3.8–8× throughput
    Hardware-Aware Parallel Prompt Decoding: adaptive sparse tree per GPU arch
    Gaps
    No KV cache work (MLA, eviction, vector-quantized KV) found
    HP

    Hayden Prairie

    low hireability

    Kernels Research Intern@Together

    Previously: Research Assistant @ University of Texas at Austin

    San Diego, US

    48
    Weight Compression75
    Inference-Aware Architecture65
    Weight Streaming Efficiency35
    KV Cache Optimization15
    Strengths
    "Search Your Block Floating Point Scales!" MLSys 2026 — BFP/microscaling quantization
    Parcae ICLR 2026 — looped models, inference-efficient weight reuse architecture
    Gaps
    No direct KV cache work (MLA, eviction) — SSM approach replaces rather than compresses KV
    HH

    Helya Hosseini

    low hireability

    Research Assistant and Teaching Assistant@University of Maryland

    Previously: Logic Design Teaching Assistant @ University of Tehran

    US

    41
    KV Cache Optimization75
    Inference-Aware Architecture60
    Weight Compression20
    Weight Streaming Efficiency10
    Strengths
    MUSTAFAR (NeurIPS 2025): 70% KV cache sparsity, 2.23x throughput gain
    Custom bitmap-sparse attention kernel for compressed KV cache decode
    Gaps
    No weight compression work (focus is KV cache, not weights)
    HM

    Hesham Mostafa

    low hireability

    Researcher@Intel

    50
    Inference-Aware Architecture82
    Weight Compression78
    Weight Streaming Efficiency30
    KV Cache Optimization8
    Strengths
    Technical Lead ML at d-Matrix — CIM inference chip co-design role
    MF-QAT (2025): elastic inference via multi-format QAT (MXINT/MXFP)
    Gaps
    No KV cache work found (MLA, vector-quantized KV, KV eviction)
    HC

    Hung-Yueh Chiang

    low hireability

    Ph.D. Candidate@The University of Texas at Austin

    Previously: Machine Learning Engineer @ XYZ Robotics

    Austin, US

    68
    Weight Compression90
    KV Cache Optimization75
    Inference-Aware Architecture65
    Weight Streaming Efficiency40
    Strengths
    UniQL (ICLR 2026): unified quantization + low-rank compression for edge LLMs
    Quamba2 (ICML 2025): scalable PTQ framework for SSMs
    Gaps
    Just started NVIDIA April 2026 — only ~1 month tenure, low hireability
    JL

    James Liu

    low hireability

    Member of Technical Staff@Anthropic

    Previously: Research Scientist @ Together AI

    San Francisco, US

    44
    Weight Compression78
    Inference-Aware Architecture50
    Weight Streaming Efficiency42
    KV Cache Optimization5
    Strengths
    BitDelta (NeurIPS 2024): 1-bit delta quantization, >10x GPU memory reduction
    TEAL (ICLR 2025 Spotlight): 40-50% activation sparsity, 1.8x wall-clock speedup
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV) found
    JP

    Jeff Pool

    low hireability

    Senior Manager@NVIDIA

    Previously: Manager - Architecture @ NVIDIA

    US

    61
    Weight Compression93
    Inference-Aware Architecture78
    Weight Streaming Efficiency65
    KV Cache Optimization8
    Strengths
    "Learning both Weights & Connections" (2015) — 9,558 citations, seminal pruning
    MaskLLM (2024) — learnable N:M sparsity for LLMs
    Gaps
    No published work on KV cache, MLA, or attention efficiency
    JZ

    Jian Zhang

    low hireability

    Director and Distinguished Scientist@Nvidia

    Previously: Co-Founder, CTO, VP Engineering @ Nvidia

    San Francisco, US

    54
    Weight Compression82
    Inference-Aware Architecture72
    Weight Streaming Efficiency45
    KV Cache Optimization15
    Strengths
    NVFP4 pre-training on Nemotron 3 Super — chip-native weight format
    LatentMoE: accuracy per FLOP and per parameter — inference-aware design
    Gaps
    No KV cache papers — MLA, vector-quantized KV, or KV eviction work absent
    JL

    Jiashi Li

    low hireability
    48
    KV Cache Optimization92
    Inference-Aware Architecture68
    Weight Compression18
    Weight Streaming Efficiency12
    Strengths
    deepseek-ai/FlashMLA top maintainer — MLA CUDA kernels for V3/V3.2-Exp
    FP8 KV cache quantization with per-token scale — novel quantization scheme
    Gaps
    No evidence of weight streaming / bandwidth reduction work
    JL

    Ji Lin

    low hireability

    Research Scientist@Meta

    Previously: Member of Technical Staff @ OpenAI

    San Francisco, US

    60
    Weight Compression93
    Inference-Aware Architecture78
    Weight Streaming Efficiency52
    KV Cache Optimization15
    Strengths
    AWQ: MLSys 2024 Best Paper — activation-aware weight quantization
    SmoothQuant (1502 citations) — LLM PTQ reducing per-channel variance
    Gaps
    No published work on KV cache (MLA, KV eviction, vector-quantized KV)
    JZ

    Jimmy Zhou

    low hireability
    58
    Inference-Aware Architecture78
    Weight Compression70
    KV Cache Optimization55
    Weight Streaming Efficiency30
    Strengths
    FMHAv2 paged KV cache integration (PR#2841, PR#2446) — direct KV cache work
    MXINT4 / NVFP4 / W4A8 MoE quantization in FlashInfer production kernels
    Gaps
    Engineer, not researcher — implements formats (MXINT4, FP4) vs. inventing compression methods
    JH

    Joel Hestness

    low hireability

    Research Scientist@Cerebras

    Previously: Co-founder, Board Member @ 3 Day Startup

    San Francisco, US

    39
    Inference-Aware Architecture80
    Weight Compression48
    Weight Streaming Efficiency18
    KV Cache Optimization8
    Strengths
    Cerebras-GPT: compute-optimal LLMs designed for wafer-scale hardware (118 citations)
    CompleteP (2025): hardware-aware model shapes for efficient inference on CS-3
    Gaps
    No KV cache work — MLA, KV eviction, or vector-quantized KV not in profile
    JF

    Jonathan Frankle

    low hireability

    Chief AI Scientist@Databricks

    Previously: Chief Scientist @ MosaicML

    New York, US

    38
    Weight Compression80
    Weight Streaming Efficiency35
    Inference-Aware Architecture30
    KV Cache Optimization5
    Strengths
    Lottery Ticket Hypothesis — foundational weight sparsity/compression paper (9K+ citations)
    15 structured/magnitude pruning papers — specialist depth in weight compression
    Gaps
    No KV cache work (MLA, eviction, quantization) found
    JD

    Jordan Dotzel

    low hireability

    Student Researcher@Google

    Previously: Software Engineer @ Datto

    San Francisco, US

    45
    Weight Compression88
    Inference-Aware Architecture55
    Weight Streaming Efficiency30
    KV Cache Optimization8
    Strengths
    FLIQS: AutoML 2024 Best Paper, mixed-precision LLM quantization
    Learning from Students (ICML 2024): t-distribution LLM weight formats
    Gaps
    No KV cache, MLA, or KV eviction work found
    KY

    Kaichao You

    low hireability

    Core maintainer@vllm-project

    Beijing, CN

    35
    KV Cache Optimization75
    Inference-Aware Architecture50
    Weight Streaming Efficiency10
    Weight Compression5
    Strengths
    vLLM core maintainer — PagedAttention KV cache at production scale
    Jenga (SOSP 2025): memory mgmt for heterogeneous LLM inference
    Gaps
    No work on weight compression or microscaling
    KM

    Kaoutar El Maghraoui

    low hireability

    Principal Research Scientist and Manager@IBM

    Previously: Principal Research Staff Member @ IBM

    New York, US

    62
    Inference-Aware Architecture82
    Weight Compression72
    KV Cache Optimization65
    Weight Streaming Efficiency28
    Strengths
    2025 paper: dynamic KV cache placement for LLM inference in heterogeneous memory
    2025 paper: paged+flex attention for long-context inference efficiency
    Gaps
    KV cache work is placement/paging — no evidence of MLA, vector-quantized KV, or eviction policies
    KK

    Kurt Keutzer

    low hireability

    Co-Founder and Strategic Advisor@SigIQ.ai

    Previously: Chief Strategy Officer (CSO) @ Nexusflow

    San Francisco, US

    83
    Weight Compression95
    KV Cache Optimization90
    Inference-Aware Architecture85
    Weight Streaming Efficiency60
    Strengths
    KVQuant (296 citations) — KV cache quantization to 10M context at inference
    AI and Memory Wall (376 citations) — maps weight streaming as LLM decode bottleneck
    Gaps
    No explicit MLA / multi-head latent attention work found
    LW

    Laura Wang

    low hireability
    26
    Inference-Aware Architecture75
    Weight Streaming Efficiency20
    Weight Compression5
    KV Cache Optimization5
    Strengths
    KernelFalcon lead author — 124 PRs, 100% correctness on KernelBench L1-L3
    CuTeDSL RMSNorm/LayerNorm kernels for Blackwell SM100/SM103 inference chips
    Gaps
    No KV cache work — MLA, KV eviction, or vector-quantized KV cache not in evidence
    LZ

    Lianmin Zheng

    low hireability

    Member of Technical Staff@xAI

    Previously: Applied Scientist Intern @ Amazon

    San Francisco, US

    71
    KV Cache Optimization88
    Weight Streaming Efficiency78
    Inference-Aware Architecture76
    Weight Compression42
    Strengths
    H2O (NeurIPS 2023): KV eviction oracle, widely cited
    FlexGen: weight I/O scheduling for memory-constrained GPU inference
    Gaps
    No published work on MLA or vector-quantized KV cache specifically
    LZ

    Li Lyna Zhang

    low hireability

    Partner Architect, Core AI@Microsoft

    Previously: Senior Staff Research Scientist/Senior Manager @ Google

    San Francisco, US

    58
    Weight Compression88
    Inference-Aware Architecture82
    Weight Streaming Efficiency50
    KV Cache Optimization12
    Strengths
    VPTQ: 2-bit vector quantization, 1.6–1.8× LLM throughput (EMNLP 2024)
    SpaceEvo: hardware-friendly INT8 inference NAS — lead author
    Gaps
    No KV cache work — no MLA, KV eviction, or vector-quantized KV papers
    LX

    Lin Xiao

    low hireability

    Research Scientist@Meta

    Previously: Senior Principal Researcher @ Microsoft

    Seattle, US

    22
    Weight Compression72
    Inference-Aware Architecture10
    KV Cache Optimization3
    Weight Streaming Efficiency3
    Strengths
    PARQ (ICML 2025): principled LLM weight quantization with optimization guarantees
    BiT (NeurIPS 2022): 1-bit binarized transformer — extreme compression benchmark
    Gaps
    No KV cache, MLA, or KV eviction work
    MC

    Manuel Candales

    low hireability
    51
    Weight Compression75
    Inference-Aware Architecture68
    Weight Streaming Efficiency42
    KV Cache Optimization20
    Strengths
    2/3/4-bit Metal quantized linear kernels in pytorch/ao — weight compression core
    GEMV qmv_fast kernels for batch=1 decode — decode-mode bandwidth reduction
    Gaps
    No MLA, vector-quantized KV, or KV eviction work
    MM

    Mayank Mishra

    low hireability

    Graduate Student Researcher@University of California, Berkeley

    Previously: Research Engineer-II @ MIT-IBM Watson AI Lab

    Berkeley, US

    55
    KV Cache Optimization85
    Inference-Aware Architecture78
    Weight Compression35
    Weight Streaming Efficiency20
    Strengths
    Cross-Layer Attention (NeurIPS 2024) — first-authored KV cache reduction, 71 cites
    FlashFormer (2025) — whole-model kernels for efficient low-batch inference
    Gaps
    No explicit power/bandwidth chip co-design or hardware constraint modeling
    MG

    Michael Goin

    low hireability
    62
    Weight Compression88
    Weight Streaming Efficiency65
    KV Cache Optimization50
    Inference-Aware Architecture45
    Strengths
    llm-compressor lead — quantization + sparsity for LLM deployment
    neuralmagic/compressed-tensors 11 commits — structured compression formats
    Gaps
    No direct papers on MLA, vector-quantized KV cache, or KV eviction
    MP

    Michael Poli

    low hireability

    Co-founder@Radical Numerics

    Previously: Founding Scientist @ Liquid AI

    San Francisco, US

    59
    KV Cache Optimization82
    Inference-Aware Architecture75
    Weight Compression48
    Weight Streaming Efficiency32
    Strengths
    Hyena + StripedHyena co-author — eliminates KV cache entirely
    vortex: inference framework for multi-hybrid architectures
    Gaps
    No explicit hardware-aware power/bandwidth optimization work
    MT

    Mingxing Tan

    low hireability

    Research Director@Waymo

    Previously: Research Scientist / TLM @ Google

    San Francisco, US

    33
    Inference-Aware Architecture90
    Weight Compression20
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    EfficientNet + MnasNet: foundational hardware-constrained NAS (31K + 4K citations)
    EfficientNet-EdgeTPU: architecture co-designed for TPU inference accelerator
    Gaps
    No KV cache work — no MLA, vector-quantized KV, or eviction strategy research
    MA

    Mohamed S. Abdelfattah

    low hireability

    Co-Founder and Chief Science Officer@Mako

    Previously: Principal Scientist @ Samsung

    New York, US

    86
    KV Cache Optimization95
    Inference-Aware Architecture95
    Weight Compression90
    Weight Streaming Efficiency65
    Strengths
    xKV (2025): cross-layer SVD KV-cache compression — directly on-target
    Palu (2025): low-rank projection KV-cache compression
    Gaps
    Co-founder of Makora — a competing inference chip startup, direct conflict
    MS

    Mohammad Shoeybi

    low hireability

    Senior Director of Applied Research@NVIDIA

    Previously: Senior Research Engineer - Tech Lead @ DeepMind

    San Francisco, US

    61
    Weight Compression82
    Inference-Aware Architecture78
    Weight Streaming Efficiency62
    KV Cache Optimization22
    Strengths
    FP8 Formats for Deep Learning (2022, 283 cit.) — microscaling weight precision
    NVFP4 pretraining (2025) — 4-bit floating point, fewer bits streamed per weight
    Gaps
    No direct KV cache optimization papers (MLA, vector-quantized KV, KV eviction)
    …click to see all
    OR

    Olatunji Ruwase

    low hireability
    67
    Weight Compression88
    Weight Streaming Efficiency88
    Inference-Aware Architecture78
    KV Cache Optimization15
    Strengths
    FP6-LLM: Tensor Core co-design for FP6 inference (USENIX ATC 2024)
    ZeroQuant(4+2): FP4/FP6 extreme LLM compression strategy
    Gaps
    No KV cache work — no MLA, KV eviction, or vector-quantized KV papers
    …click to see all
    PM

    Pavlo Molchanov

    low hireability

    Director of Research@NVIDIA

    Previously: Distinguished Scientist and Manager @ NVIDIA

    San Francisco, US

    76
    Weight Compression95
    Inference-Aware Architecture88
    Weight Streaming Efficiency65
    KV Cache Optimization55
    Strengths
    'Importance Estimation for Pruning' (1379 cites) — foundational weight compression
    HALP + Structural Pruning via Latency-Saliency: hardware-latency-aware pruning
    Gaps
    No published work on MLA or vector-quantized KV cache specifically
    …click to see all
    PT

    Po-An Tsai

    low hireability

    Senior Research Scientist@NVIDIA

    Previously: Research Scientist @ NVIDIA

    US

    75
    Inference-Aware Architecture85
    KV Cache Optimization80
    Weight Compression75
    Weight Streaming Efficiency60
    Strengths
    RocketKV (ICML 2025): two-stage KV cache compression — direct axis hit
    ISCA-53 2026: data movement forecasting for MoE LLM serving — weight streaming
    Gaps
    Very recently promoted to Principal (April 2026) — unlikely to be looking
    …click to see all
    PP

    Priyadarshini Panda

    low hireability

    Visiting Faculty@DeepMind

    Previously: Assistant Professor @ Yale University

    Los Angeles, US

    67
    Inference-Aware Architecture85
    Weight Compression82
    Weight Streaming Efficiency62
    KV Cache Optimization40
    Strengths
    MEADOW (2025): memory-efficient dataflow/data packing for low-power edge LLMs
    TesseraQ (2025): ultra low-bit LLM PTQ — block reconstruction for extreme compression
    Gaps
    No direct MLA, vector-quantized KV cache, or KV eviction work — KV angle is hardware noise mitigation
    …click to see all
    QS

    qsang-nv

    low hireability
    51
    KV Cache Optimization88
    Inference-Aware Architecture82
    Weight Compression20
    Weight Streaming Efficiency12
    Strengths
    XQA MLA backend in FlashInfer — direct multi-head latent attention KV kernel
    FP8 KV cache + tensor scale for XQA — quantized KV compression
    Gaps
    No evidence of weight compression / extreme quantization work
    …click to see all
    RP

    Raghu Prabhakar

    low hireability

    Engineering@SambaNova Systems

    Previously: Software Engineer @ NVIDIA

    San Francisco, US

    57
    Inference-Aware Architecture92
    Weight Streaming Efficiency82
    Weight Compression42
    KV Cache Optimization10
    Strengths
    'SambaNova SN40L: Scaling the AI Memory Wall' — 32 citations, weight streaming focus
    ISSCC 2025 SN40L: 5nm chip, 3-tier memory hierarchy for inference
    Gaps
    No published work on KV cache optimization (MLA, KV eviction, vector-quantized KV)
    …click to see all
    RK

    Raghuraman Krishnamoorthi

    low hireability

    Technical Lead Manager@Meta

    Previously: Software Engineer @ Meta

    San Francisco, US

    56
    Weight Compression93
    Inference-Aware Architecture62
    KV Cache Optimization45
    Weight Streaming Efficiency22
    Strengths
    "Quantizing deep convolutional networks" whitepaper — 1489 citations, field-defining
    Leads Meta torch.ao — production-scale quantization framework for PyTorch
    Gaps
    KV eviction, MLA, vector-quantized KV cache — not in portfolio
    …click to see all
    SK

    Sanjiv Kumar

    low hireability

    VP, Google Fellow@DeepMind

    New York, US

    54
    Weight Compression80
    Inference-Aware Architecture65
    Weight Streaming Efficiency50
    KV Cache Optimization20
    Strengths
    Weighted quantization patent (2025) — direct weight compression work
    Spark Transformer: sparsity in FFN and attention (NeurIPS 2025)
    Gaps
    No direct KV cache eviction, MLA, or vector-quantized KV cache work found
    …click to see all
    SR

    Scott Roy

    low hireability
    53
    Weight Compression85
    Inference-Aware Architecture65
    Weight Streaming Efficiency45
    KV Cache Optimization15
    Strengths
    140+ commits to pytorch/ao — HQQ, PARQ, INT4/INT8, LUT quantization
    Improved HQQ scale-only (Apr 2026) — per-group max-error fallback
    Gaps
    No KV cache research: no MLA, vector-quantized KV, or eviction work found
    …click to see all
    SZ

    Sebastian Zhao

    low hireability

    Research Assistant@Berkeley Artificial Intelligence Research

    Previously: ML Research Intern @ Berkeley Artificial Intelligence Research

    Berkeley, US

    46
    KV Cache Optimization72
    Inference-Aware Architecture55
    Weight Compression42
    Weight Streaming Efficiency15
    Strengths
    Multipole Attention (NeurIPS 2025): directly targets KV cache pressure
    Custom Triton/CUDA kernels for attention — practical inference experience
    Gaps
    No dedicated MLA or vector-quantized KV cache paper
    …click to see all
    SK

    Sehoon Kim

    low hireability

    Member of Technical Staff@xAI

    Previously: Machine Learning Engineer @ Narada

    US

    86
    Weight Compression95
    KV Cache Optimization95
    Inference-Aware Architecture80
    Weight Streaming Efficiency72
    Strengths
    KVQuant (317 citations): KV cache quantization to sub-2-bit for long context inference
    SqueezeLLM (320 citations): dense-sparse LLM weight quantization reducing memory bandwidth
    Gaps
    ~14 months at xAI — low hireability; no open-to-work signals
    …click to see all
    SS

    Sheng Shen

    low hireability

    Member of Technical Staff@xAI

    Previously: Research Scientist @ Meta

    San Francisco, US

    46
    Weight Compression90
    Weight Streaming Efficiency50
    Inference-Aware Architecture35
    KV Cache Optimization10
    Strengths
    SqueezeLLM (ICML 2023): dense-sparse quant enabling 6GB LLM serving, in vLLM
    Q-BERT (AAAI 2020): Hessian-based ultra-low precision quantization
    Gaps
    No KV cache optimization work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    SC

    Shijie Cao

    low hireability

    Senior Researcher@Microsoft

    Previously: Senior Researcher @ Microsoft Research Asia

    86
    Inference-Aware Architecture92
    Weight Compression90
    KV Cache Optimization88
    Weight Streaming Efficiency72
    Strengths
    BitDecoding (HPCA 2026): low-bit KV cache + Tensor Core hardware co-design
    T-MAC (EuroSys 2025): LUT-based NPU inference, direct chip co-design
    Gaps
    Just joined Xiaomi MiMo Feb 2026 — only ~3 months in new role, low hireability
    …click to see all
    SL

    Shiwei Liu

    low hireability

    PI@ELLIS Institute Tübingen

    Previously: Royal Society Newton International Fellow @ University of Oxford

    Tübingen, DE

    61
    Weight Compression92
    KV Cache Optimization72
    Weight Streaming Efficiency50
    Inference-Aware Architecture30
    Strengths
    OWL (107 citations): pruning LLMs to extreme sparsity — defines weight compression research
    Q-hitter: sparse-quantized KV cache eviction — direct KV cache optimization
    Gaps
    No inference-chip co-design — not designing models for specific power/bandwidth constraints
    …click to see all
    SF

    Siyuan Fu (Lain)

    low hireability
    46
    Weight Compression68
    KV Cache Optimization42
    Weight Streaming Efficiency40
    Inference-Aware Architecture32
    Strengths
    MXFP4 / NVFP4 block-scale MoE kernels — production FP4 weight compression at NVIDIA
    FP8 MLA quant in vllm (PR #29795 merged) — direct KV cache compute optimization
    Gaps
    Individual-contributor kernel engineer, not an architecture researcher — no model design work found
    …click to see all
    SH

    Song Han

    low hireability

    Researcher@NVIDIA

    Previously: Assistant Professor @ MIT

    90
    Weight Compression99
    Inference-Aware Architecture95
    KV Cache Optimization92
    Weight Streaming Efficiency75
    Strengths
    AWQ: MLSys 2024 Best Paper — defines activation-aware weight quantization
    SmoothQuant: 1487 citations — gold standard post-training quantization
    Gaps
    Tenured MIT professor — rarely leaves; high retention likelihood
    …click to see all
    SY

    Songlin Yang

    low hireability

    Member of Technical Staff@Thinking Machines Lab

    Previously: Member of Technical Staff @ Thinking Machines Lab

    San Francisco, US

    61
    KV Cache Optimization88
    Inference-Aware Architecture80
    Weight Streaming Efficiency65
    Weight Compression10
    Strengths
    GLA (277 citations): replaces KV cache with O(1) recurrent state
    FLA Triton library: hardware-efficient linear attention CUDA kernels
    Gaps
    No work on weight compression, quantization, or microscaling
    …click to see all
    SV

    Stylianos Venieris

    low hireability

    Head of Distributed AI Group / Senior Research Scientist@Samsung

    Previously: Researcher @ Samsung

    Cambridge, GB

    58
    Inference-Aware Architecture80
    Weight Compression70
    Weight Streaming Efficiency65
    KV Cache Optimization15
    Strengths
    Hardware-Aware Parallel Prompt Decoding (2025) — sparse-tree decoding co-designed with hardware constraints for LLM inference
    Progressive Mixed-Precision Decoding (2025) — phase-aware quantization for LLM decode
    Gaps
    No KV cache optimization work (MLA, KV eviction, vector-quantized KV) found
    …click to see all
    SR

    Supriya Rao

    low hireability
    39
    Weight Compression72
    Inference-Aware Architecture50
    Weight Streaming Efficiency30
    KV Cache Optimization5
    Strengths
    TorchAO paper (2025): end-to-end quantization for inference serving
    2:4 activation sparsity paper: hardware-structured sparsity for inference
    Gaps
    No KV cache work (MLA, vector quantization, KV eviction) found
    …click to see all
    TC

    Tianlong Chen

    low hireability

    Chief AI Scientist@hireEZ

    Previously: Postdoctoral Researcher @ MIT

    Austin, US

    64
    Weight Compression82
    KV Cache Optimization65
    Weight Streaming Efficiency60
    Inference-Aware Architecture50
    Strengths
    FIER (2025): KV cache retrieval for long-context LLM inference — direct axis hit
    MC-SMoE ICLR'24 Spotlight — merge+compress MoE weight compression
    Gaps
    No explicit hardware-chip co-design (ASIC/inference chip constraint modeling)
    …click to see all
    TC

    Tianqi Chen

    low hireability

    Researcher@NVIDIA

    Previously: CTO @ OctoML

    79
    Inference-Aware Architecture95
    Weight Compression85
    KV Cache Optimization72
    Weight Streaming Efficiency65
    Strengths
    FlashInfer: block-sparse KV-cache format, 29-69% inter-token latency reduction
    Apache TVM creator — definitive hardware-aware inference compiler (2812 cites)
    Gaps
    KV cache work is framework-level — not MLA, vector-quantized KV, or KV eviction research specifically
    …click to see all
    TZ

    Tianyi Zhang

    low hireability

    AI Research Scientist@Workato

    Previously: Co-Founder & CTO @ xMAD.ai

    San Francisco, US

    71
    Weight Compression92
    KV Cache Optimization85
    Inference-Aware Architecture70
    Weight Streaming Efficiency35
    Strengths
    "KV Cache is 1 Bit Per Channel" (NeurIPS 2024) — direct KV quantization work
    DFloat11: 70%-size lossless LLM compression for GPU inference (NeurIPS '25)
    Gaps
    No explicit weight streaming/bandwidth reduction work
    …click to see all
    TD

    Tri Dao

    low hireability

    Assistant Professor@Princeton University

    Previously: PhD Student @ Stanford University

    Princeton, US

    73
    Inference-Aware Architecture98
    KV Cache Optimization97
    Weight Streaming Efficiency55
    Weight Compression42
    Strengths
    FlashAttention 1/2/3 — defining KV cache IO-aware attention kernels
    Mamba (6K+ citations) — O(1) inference KV cache via selective state spaces
    Gaps
    Dual Assistant Professor and chief-scientist roles make recruiting extremely difficult
    …click to see all
    US

    Utkarsh Saxena

    low hireability

    Member of Technical Staff@AMD

    Previously: Graduate Research Assistant @ Purdue University

    San Francisco, US

    64
    KV Cache Optimization90
    Weight Compression80
    Inference-Aware Architecture70
    Weight Streaming Efficiency15
    Strengths
    KVLinC (2025): 2.55× KV cache inference speedup vs FlashAttention
    Eigen Attention (2024): 40% KV cache reduction via low-rank attention
    Gaps
    No direct weight streaming / decode bandwidth reduction work
    …click to see all
    VK

    Vasiliy Kuznetsov

    low hireability
    56
    Weight Compression92
    Weight Streaming Efficiency65
    Inference-Aware Architecture62
    KV Cache Optimization5
    Strengths
    #2 torchao contributor (333 commits) — weight quantization core
    NVFP4 + GPTQ for MoE — microscaling inference chip formats
    Gaps
    No KV cache optimization work (MLA, eviction, vector-quant KV)
    …click to see all
    WZ

    Wenqian Zhao

    low hireability

    Researcher@Huawei

    Previously: PhD student @ The Chinese University of Hong Kong

    47
    Weight Compression78
    Inference-Aware Architecture60
    Weight Streaming Efficiency45
    KV Cache Optimization5
    Strengths
    BiE: hardware-friendly block floating-point for LLM quantization (2024)
    HAPE: hardware-aware LLM pruning for on-device inference (2025)
    Gaps
    No KV cache work — MLA, KV eviction, or vector-quantized KV cache absent
    …click to see all
    WL

    Wuwei Lin

    low hireability

    Researcher@OpenAI

    Previously: Researcher @ NVIDIA

    43
    Inference-Aware Architecture82
    KV Cache Optimization68
    Weight Compression18
    Weight Streaming Efficiency5
    Strengths
    FlashInfer (MLSys 2025 Outstanding Paper) — KV block-sparse attention for LLM serving
    KV cache composable formats — directly targets KV memory footprint
    Gaps
    No weight streaming or bandwidth-reduction research
    …click to see all
    XM

    Xiangxi Mo

    low hireability

    PhD student@Berkeley Sky Computing Lab

    Previously: Model Serving System @ Anyscale

    Berkeley, US

    44
    KV Cache Optimization92
    Inference-Aware Architecture70
    Weight Streaming Efficiency10
    Weight Compression5
    Strengths
    PagedAttention (vLLM): invented KV cache paging — gold standard
    JENGA (2025): heterogeneous KV cache memory management
    Gaps
    Active startup founder (Inferact, $150M raised Jan 2026) — hard to recruit
    …click to see all
    XY

    Xianzhi Yu

    low hireability

    Researcher@Huawei

    Previously: Intern @ Sugon

    Beijing, CN

    70
    KV Cache Optimization85
    Inference-Aware Architecture80
    Weight Compression78
    Weight Streaming Efficiency35
    Strengths
    SVDq (2025): 410x KV key cache compression at 1.25 bits
    FlatQuant (ICLR 2025): flatness-aware quantization, 22 citations
    Gaps
    No direct evidence on weight streaming bandwidth reduction
    …click to see all
    XH

    Xiaodong (Vincent) Huang

    low hireability
    43
    Weight Compression60
    Inference-Aware Architecture50
    Weight Streaming Efficiency40
    KV Cache Optimization20
    Strengths
    mm_fp4 implementation (cuDNN + CUTLASS) in FlashInfer — 4-bit inference
    FP8 BMM optimization with cluster shapes for low-precision GEMM
    Gaps
    No published research papers — engineer, not architecture researcher
    …click to see all
    XL

    Xing Li

    low hireability

    Researcher@Huawei

    CN

    41
    KV Cache Optimization88
    Weight Compression42
    Inference-Aware Architecture28
    Weight Streaming Efficiency5
    Strengths
    KVTuner ICML 2025: layer-wise mixed-precision KV cache quantization
    SVDq: 1.25-bit, 410x KV cache compression — extreme compression research
    Gaps
    No evidence of hardware-aware architecture co-design (chip constraints, power budgets)
    …click to see all
    XL

    Xiuyu Li

    low hireability

    PhD candidate@Berkeley AI Research (BAIR) at UC Berkeley

    Previously: Research Consultant @ Together AI

    San Francisco, US

    52
    Weight Compression88
    Inference-Aware Architecture48
    Weight Streaming Efficiency42
    KV Cache Optimization28
    Strengths
    SqueezeLLM (321 cit): dense-sparse LLM quantization, core identity
    SVDQuant: 4-bit models via low-rank SVD + outlier absorption
    Gaps
    No direct KV cache architecture work (MLA, eviction strategies)
    …click to see all
    XZ

    Xiyou Zhou

    low hireability
    70
    Inference-Aware Architecture82
    Weight Compression72
    KV Cache Optimization65
    Weight Streaming Efficiency60
    Strengths
    Apple Intelligence 3B model: KV-cache sharing + 2-bit QAT for Apple silicon
    Parallel Track Transformers: 16x sync reduction, 15-30% TTFT gain (Feb 2026)
    Gaps
    No deep MLA, vector-quantized KV cache, or KV eviction work found
    …click to see all
    YC

    Yanan Cao

    low hireability
    34
    Weight Compression60
    Inference-Aware Architecture45
    Weight Streaming Efficiency25
    KV Cache Optimization5
    Strengths
    pytorch/ao quantization & sparsity — 7 merged PRs
    fp8 scaled_mm kernel with per-platform configs (H100/B200)
    Gaps
    No KV cache optimization work (MLA, eviction) found
    …click to see all
    YL

    Yejing Lai

    low hireability
    60
    Weight Compression80
    Inference-Aware Architecture70
    Weight Streaming Efficiency55
    KV Cache Optimization35
    Strengths
    MXFP4 block quant kernel on Intel BMG GPU (vllm-xpu-kernels #194, 2026)
    MXFP8/fp8 block quant kernel on BMG — microscaling directly relevant
    Gaps
    No published research; implementation engineer, not architect
    …click to see all
    YS

    Yikang Shen

    low hireability

    Member of Technical Staff@xAI

    Previously: Staff Research Scientist @ IBM

    San Francisco, US

    56
    KV Cache Optimization82
    Inference-Aware Architecture80
    Weight Streaming Efficiency48
    Weight Compression15
    Strengths
    GLA (2024, 262 cit): replaces KV cache with fixed-size recurrent state — KV-free inference
    FlashFormer (2025): whole-model kernel fusion for memory-bandwidth-limited inference
    Gaps
    No weight compression work (microscaling, topological regularization, quantization research)
    …click to see all
    YZ

    Yineng Zhang

    low hireability

    Principal AI Researcher@Together AI

    Previously: Lead Software Engineer @ Baseten

    San Francisco, US

    55
    KV Cache Optimization78
    Inference-Aware Architecture68
    Weight Compression55
    Weight Streaming Efficiency20
    Strengths
    Mooncake: KVCache-centric disaggregated serving (ACM ToS 2025)
    FlashInfer 34 commits — top KV cache attention engine
    Gaps
    Software/serving-layer focus, not hardware chip co-design
    …click to see all
    YS

    Ying Sheng

    low hireability

    Co-Founder (CEO)@RadixArk

    Previously: Member of Technical Staff @ xAI

    San Francisco, US

    71
    KV Cache Optimization92
    Inference-Aware Architecture83
    Weight Streaming Efficiency78
    Weight Compression32
    Strengths
    H2O: heavy-hitter oracle for KV cache eviction — NeurIPS-published KV eviction research
    Double Sparsity: sparse attention cutting KV cache at post-training
    Gaps
    No dedicated weight compression work (MicroScaling, topological regularization)
    …click to see all
    YL

    Yingyan Celine Lin

    low hireability

    Visiting Professor@NVIDIA

    Previously: Assistant Professor @ Rice University

    Atlanta, US

    78
    Inference-Aware Architecture92
    KV Cache Optimization80
    Weight Compression78
    Weight Streaming Efficiency62
    Strengths
    LaCache (2025): direct KV eviction research for long-context LLMs
    Hymba (2025): hybrid-head arch cutting KV cache via SSM heads (NVIDIA)
    Gaps
    Tenured Associate Professor at Georgia Tech — low probability of full-time departure
    …click to see all
    YT

    Yuandong Tian

    low hireability

    Co-Founder@Stealth AI Startup

    Previously: Research Director @ Meta

    San Francisco, US

    79
    KV Cache Optimization95
    Weight Compression82
    Inference-Aware Architecture75
    Weight Streaming Efficiency65
    Strengths
    H2O (658 citations) — co-invented KV eviction for generative inference
    StreamingLLM (982 citations) — attention sinks for unbounded KV streaming
    Gaps
    Currently co-founding stealth startup — low availability for hire
    …click to see all
    YL

    Yuhong Li

    low hireability

    Engineer@xAI

    Previously: Foundation Models Team @ Apple

    New York, US

    53
    KV Cache Optimization92
    Inference-Aware Architecture65
    Weight Compression35
    Weight Streaming Efficiency18
    Strengths
    SnapKV: core author (18 commits, 314 citations) — canonical KV eviction paper
    Medusa: inference acceleration via multiple decoding heads (435 citations)
    Gaps
    No direct weight compression or microscaling/topological regularization work
    …click to see all
    YX

    Yuhui Xu

    low hireability

    Research Scientist@Google

    Previously: Research Scientist @ Salesforce

    AU

    55
    Weight Compression82
    KV Cache Optimization72
    Inference-Aware Architecture42
    Weight Streaming Efficiency22
    Strengths
    ThinK: query-driven KV cache pruning (2024, 38 citations)
    QA-LoRA: quantization-aware LoRA for LLMs (2023, 245 citations)
    Gaps
    No MLA, KV eviction, or vector-quantized KV cache work found
    …click to see all
    YL

    Yujun Lin

    low hireability

    AI Research Scientist@NVIDIA

    Previously: Research Assistant @ Massachusetts Institute of Technology

    Boston, US

    84
    Weight Compression95
    KV Cache Optimization90
    Inference-Aware Architecture87
    Weight Streaming Efficiency65
    Strengths
    QServe W4A8KV4 co-author — 4-bit KV cache quantization for LLM inference
    LServe: sparse attention serving reduces active KV (KV eviction-like)
    Gaps
    No explicit weight-streaming-bandwidth work — addressed implicitly via W4 quantization
    …click to see all
    YW

    Yunhe Wang

    low hireability

    Head of Huawei Applied AI Lab / Senior Researcher@Huawei

    Previously: PhD student @ Peking University

    60
    Weight Compression95
    Inference-Aware Architecture80
    Weight Streaming Efficiency50
    KV Cache Optimization15
    Strengths
    Pangu Ultra (2025): LLM architecture designed for Ascend NPU inference constraints
    GhostNet CVPR2020: cheap-op weight reuse reduces memory bandwidth significantly
    Gaps
    No KV cache / MLA / KV-eviction papers found
    …click to see all
    ZW

    Zekun Wang

    low hireability

    Researcher@Alibaba

    Previously: PhD student @ Harbin Institute of Technology

    US

    38
    Weight Compression65
    Inference-Aware Architecture40
    KV Cache Optimization30
    Weight Streaming Efficiency15
    Strengths
    NeurIPS 2025 Best Paper (Gated Attention) — sparsity, attention-sink-free design
    CFSP: activation-aware structured pruning for LLMs (weight compression)
    Gaps
    No direct MLA, vector-quantized KV, or KV eviction work found
    …click to see all
    ZJ

    Ziheng Jiang

    low hireability

    AI Researcher@Meta

    Previously: Principal Research Scientist @ ByteDance

    Seattle, US

    41
    Inference-Aware Architecture88
    Weight Streaming Efficiency45
    Weight Compression20
    KV Cache Optimization10
    Strengths
    TVM co-author (2511 citations) — gold standard hardware-aware ML compilation
    VTA + hardware-SW blueprint: explicit inference chip co-design work
    Gaps
    No KV cache work found (MLA, vector-quantized KV, eviction strategies)
    …click to see all

    Runs

    #1 · completed · 0 qualified / 0 found · May 7, 1:24 PM