Neuralace · Model Architecture Discovery Researcher

completed · 372 qualified · 1 run · May 7, 1:24 PM · company-name-neuralace-sabi-locations-usa-europe-china-india
Parsed: Neuralace · 4 topics · Researcher · USA, Europe, China, India
Generating seed nodes — 0 proposed
Explored 0 queries — 0/0 done
Expanding nodes — queued
Qualifying candidates — queued

    Qualified Candidates (368)

    CW

    Chengyue Wu

    high hireability

    Research Intern@NVIDIA

    Previously: Research Intern @ DeepSeek AI

    Shenzhen, CN

    39
    KV Cache Optimization72
    Inference-Aware Architecture55
    Weight Compression18
    Weight Streaming Efficiency12
    Strengths
    Fast-dLLM (ICLR 2026): first paper enabling KV cache for diffusion LLMs
    Fast-dLLM v2 (ICLR 2026): block-diffusion inference efficiency follow-up
    Gaps
    No weight compression or microscaling-aware work found
    …click to see all
    CZ

    Cheng Zhang

    high hireability

    Founding Engineer@AI Sequrity Company

    Previously: Research Intern @ Microsoft

    London, GB

    72
    Weight Compression88
    Inference-Aware Architecture82
    KV Cache Optimization72
    Weight Streaming Efficiency45
    Strengths
    LQER (ICML 2024) + QERA (ICLR 2025): LLM weight quantization error reconstruction
    Block-based Quantisation (EMNLP 2023): sub-8-bit block quant, microscaling-relevant
    Gaps
    No explicit MLA / vector-quantized KV cache work found
    …click to see all
    CH

    Coleman Richard Charles Hooper

    high hireability

    Graduate Student - ML Systems@University of California, Berkeley

    Previously: Research Intern @ NVIDIA

    San Francisco, US

    80
    KV Cache Optimization93
    Weight Compression85
    Inference-Aware Architecture72
    Weight Streaming Efficiency68
    Strengths
    KVQuant: first-author, NeurIPS 2024, 317 citations — landmark KV cache quantization
    Squeezed Attention: first-author, ACL 2025 — 3.1x KV budget reduction
    Gaps
    No explicit hardware co-design or chip-architecture papers (more SW-layer inference opt)
    …click to see all
    EL

    Enshu Liu

    high hireability

    MS student@Microsoft Research

    Previously: Intern @ Microsoft

    Beijing, CN

    34
    Weight Compression72
    Weight Streaming Efficiency32
    Inference-Aware Architecture28
    KV Cache Optimization5
    Strengths
    ViDiT-Q (ICLR 2025) — W4A8 quantization, 3x memory reduction for diffusion transformers
    MixDQ (ECCV 2024) — mixed-precision, 3-4x model size compression
    Gaps
    No KV cache work — missing MLA, KV eviction, vector-quantized KV entirely
    …click to see all
    GL

    Gen Li

    high hireability

    PhD student@Clemson University

    US

    37
    Weight Compression73
    Inference-Aware Architecture42
    Weight Streaming Efficiency28
    KV Cache Optimization3
    Strengths
    OWL (ICML 2024, 110 citations) — LLM pruning to high sparsity, directly relevant
    Dynamic Sparsity series (NeurIPS 2023, ICML 2024) — structured channel-level sparsity
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV) — zero evidence
    …click to see all
    HL

    Haokun Lin

    high hireability

    Ph.D. Student@Institute of Automation, Chinese Academy of Sciences

    Previously: Retail Credit Risk Intern @ GAC Auto Finance Co.

    Hong Kong, HK

    60
    Weight Compression80
    KV Cache Optimization65
    Weight Streaming Efficiency55
    Inference-Aware Architecture40
    Strengths
    IntactKV (ACL 2024): KV cache management via pivot token preservation
    DuQuant (NeurIPS 2024 oral): state-of-the-art 4-bit LLM weight quantization
    Gaps
    No explicit hardware chip co-design work (KV constraints from chip packaging perspective)
    …click to see all
    HQ

    Haotong Qin

    high hireability

    Postdoctoral Researcher@ETH Zürich

    Previously: Research Scientist @ ByteDance

    Zurich, CH

    59
    Weight Compression95
    KV Cache Optimization65
    Inference-Aware Architecture55
    Weight Streaming Efficiency20
    Strengths
    BiLLM: 1-bit post-training quantization for LLMs (152 citations, 2024)
    ReCalKV: low-rank KV cache compression (2025) — direct match
    Gaps
    KV cache and hardware-aware work is secondary — primary focus is weight quantization
    …click to see all
    HC

    Hongzheng Chen

    high hireability

    Ph.D. Candidate@Cornell University

    Previously: Undergrad student @ SUN YAT-SEN UNIVERSITY

    Ithaca, US

    29
    Inference-Aware Architecture52
    Weight Compression36
    Weight Streaming Efficiency18
    KV Cache Optimization8
    Strengths
    LLM-FPGA (FCCM'24): 5.7× energy efficiency vs A100 — explicit chip-constrained inference design
    Allo (PLDI'24): accelerator design language — hardware-SW co-design for inference
    Gaps
    No KV cache work — no MLA, KV eviction, or vector-quantized KV
    …click to see all
    HJ

    Huiqiang Jiang

    high hireability

    RSDE@Microsoft

    Previously: Research SDE @ Microsoft

    Shanghai, CN

    55
    KV Cache Optimization93
    Inference-Aware Architecture62
    Weight Compression35
    Weight Streaming Efficiency28
    Strengths
    LLMLingua: 20x KV-cache + prompt compression, EMNLP'23 + ACL'24
    SCBench (ICLR'25): KV cache benchmarking across long-context methods
    Gaps
    No weight compression work (quantization, block-sparse weights) in portfolio
    …click to see all
    HK

    Hyungjun Kim

    high hireability

    Postdoctoral Researcher@Northwestern University

    Previously: Graduate Student @ Seoul National University

    Evanston, US

    54
    Weight Compression83
    Inference-Aware Architecture78
    Weight Streaming Efficiency45
    KV Cache Optimization10
    Strengths
    OWQ: outlier-aware weight quantization for LLM inference (AAAI 2024)
    QUICK: quantization-aware conflict-free kernel for efficient LLM inference
    Gaps
    No KV cache work (MLA, KV eviction, KV quantization) found
    …click to see all
    IM

    Ionut-Vlad Modoranu

    high hireability

    Ph.D. student@Institute of Science and Technology Austria (ISTA)

    Previously: Research Scientist @ Amazon

    Vienna, AT

    33
    Weight Compression80
    Inference-Aware Architecture25
    Weight Streaming Efficiency20
    KV Cache Optimization5
    Strengths
    DASLab (Dan Alistarh) — premier lab for LLM quantization/sparsity
    "Unified Scaling Laws for Compressed Representations" (2025) — direct weight compression
    Gaps
    No KV cache work (MLA, VQ-KV, eviction) found
    …click to see all
    JL

    Junyang Lin

    high hireability

    Research Scientist@Qwen

    Previously: Staff Engineer @ Alibaba

    Beijing, CN

    66
    KV Cache Optimization82
    Weight Compression80
    Inference-Aware Architecture72
    Weight Streaming Efficiency28
    Strengths
    CateKV (ICML 2025): KV eviction/consistency for long-context inference
    Rotated Runtime Smooth (ICLR 2025): training-free INT4 quantization
    Gaps
    No explicit MLA or vector-quantized KV cache work found
    …click to see all
    ML

    Muyang Li

    high hireability

    Doctoral Student@Massachusetts Institute of Technology

    Previously: Research Intern @ NVIDIA

    Boston, US

    52
    Weight Compression88
    Inference-Aware Architecture75
    Weight Streaming Efficiency35
    KV Cache Optimization10
    Strengths
    SVDQuant: 4-bit quantization via SVD low-rank outlier absorption (ICLR 2025 Spotlight)
    deepcompressor: production model compression toolbox, LLMs + diffusion
    Gaps
    No published KV cache work (MLA, KV eviction) — diffusion model focus
    …click to see all
    NF

    Natalia Frumkin

    high hireability

    Research Associate@AMD

    Previously: Research Scientist Intern @ Meta

    Austin, US

    41
    Weight Compression82
    Inference-Aware Architecture50
    KV Cache Optimization15
    Weight Streaming Efficiency15
    Strengths
    Quamba2 (ICML 2025): scalable W4/W8 PTQ for selective SSMs
    Quamba (ICLR 2024): first PTQ recipe for Mamba; 12 citations
    Gaps
    No KV cache work — SSM focus sidesteps transformer KV stack entirely
    …click to see all
    PB

    Payman Behnam

    high hireability

    Student Researcher@Google

    Previously: Graduate Research Assistant @ Georgia Institute of Technology

    Atlanta, US

    67
    KV Cache Optimization90
    Inference-Aware Architecture72
    Weight Compression68
    Weight Streaming Efficiency38
    Strengths
    RocketKV (ICML 2025): KV eviction + sparse attention, 400× compression, 3.7× speedup
    EMPIRIC (2025): systematic KV cache compression gaps for long-context inference
    Gaps
    No MLA or vector-quantized KV cache work specifically
    …click to see all
    RC

    Roberto L. Castro

    high hireability

    Postdoc@Institute of Science and Technology Austria

    Previously: PhD student @ Universidad de La Coruña

    AT

    59
    Weight Compression92
    Inference-Aware Architecture75
    Weight Streaming Efficiency65
    KV Cache Optimization5
    Strengths
    MARLIN: FP16xINT4 inference kernel, ~4x speedup on GPU (60 citations)
    Microscaling FP4 Quantization paper — exact match to query's microscaling mention
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV) found in papers
    …click to see all
    RZ

    Rongzhi Zhang

    high hireability

    Applied Scientist@Amazon

    Previously: Applied Scientist Intern @ Amazon

    San Francisco, US

    31
    KV Cache Optimization80
    Inference-Aware Architecture22
    Weight Compression18
    Weight Streaming Efficiency5
    Strengths
    LoRC: low-rank KV cache compression, NeurIPS Workshop 2024
    Explicit research focus: progressive KV cache compression
    Gaps
    No hardware-aware / inference-chip co-design work found
    …click to see all
    RC

    Ruisi Cai

    high hireability

    Research Intern@NVIDIA

    Previously: Quantitative Researcher Intern @ Citadel Securities

    San Francisco, US

    63
    KV Cache Optimization88
    Weight Compression60
    Inference-Aware Architecture60
    Weight Streaming Efficiency45
    Strengths
    H2O: first-authored KV eviction paper, 658 citations (NeurIPS 2023)
    LoCoCo (ICML 2024): long-context KV compression with convolutions
    Gaps
    No microscaling or sub-bit weight compression work
    …click to see all
    SA

    Saleh Ashkboos

    high hireability

    Research Assistant@ETH Zürich

    Previously: Research Intern @ Apple

    Zurich, CH

    64
    Weight Compression97
    Inference-Aware Architecture70
    Weight Streaming Efficiency62
    KV Cache Optimization25
    Strengths
    GPTQ (1860 citations) — foundational post-training quantization for LLMs
    Microscaling FP4 ICLR 2026 — directly targets MX-format weight compression constraints
    Gaps
    No direct KV cache work (MLA, KV eviction, vector-quantized KV) in visible papers
    …click to see all
    SY

    Shang Yang

    high hireability

    PhD student@MIT EECS

    Previously: Intern @ MIT

    Boston, US

    83
    Weight Compression97
    KV Cache Optimization90
    Inference-Aware Architecture85
    Weight Streaming Efficiency60
    Strengths
    QServe W4A8KV4: KV4 cache + W4 weights co-designed for serving efficiency
    AWQ: MLSys 2024 Best Paper (1498 citations) — landmark activation-aware quantization
    Gaps
    No explicit work on weight streaming power reduction (vs. compression)
    …click to see all
    TR

    Tahseen Rabbani

    high hireability

    Frontier Tech Consultant, Project Management@Scale AI

    Previously: Postdoctoral Research Associate @ Yale University

    Chicago, US

    35
    KV Cache Optimization80
    Weight Compression30
    Inference-Aware Architecture25
    Weight Streaming Efficiency5
    Strengths
    HashEvict (2412.16187): novel pre-attention KV eviction via LSH, 30–70% compression
    LSH-E at NeurIPS 2024 Compression Workshop — original KV eviction algorithm
    Gaps
    No MLA or vector-quantized KV cache work — different approach than query emphasis
    …click to see all
    TG

    Tarushii Goel

    high hireability
    23
    KV Cache Optimization35
    Weight Compression25
    Inference-Aware Architecture25
    Weight Streaming Efficiency5
    Strengths
    Log-Linear Attention (arXiv 2506.04761) — replaces KV cache with log-growing hidden states
    4 commits to fla-org/flash-linear-attention — linear attention kernel contributor
    Gaps
    No direct MLA, vector-quantized KV cache, or KV eviction work
    …click to see all
    VE

    Vage Egiazarian

    high hireability

    Postdoc@ISTA

    Previously: Researcher @ Higher School of Economics

    72
    Weight Compression95
    KV Cache Optimization82
    Inference-Aware Architecture65
    Weight Streaming Efficiency45
    Strengths
    AQLM: 2-bit extreme LLM weight quantization — lead author, 141 citations
    SpQR: near-lossless 3-4 bit quantization enabling 33B models on 24GB GPU
    Gaps
    No explicit hardware co-design or chip architecture work
    …click to see all
    ZH

    Zain Huda

    high hireability
    33
    Weight Compression70
    Weight Streaming Efficiency35
    Inference-Aware Architecture25
    KV Cache Optimization3
    Strengths
    Blockwise FP8 / MX format in pytorch/ao — 11 PRs, direct microscaling weight compression
    Float8BlockwiseLinear for DeepSeek V3: bandwidth & roofline benchmarking
    Gaps
    No KV cache work — MLA, KV eviction, vector-quantized KV absent
    …click to see all
    ZW

    Zhongwei Wan

    high hireability

    Ph.D. Candidate@The Ohio State University

    Columbus, US

    55
    KV Cache Optimization88
    Weight Compression72
    Inference-Aware Architecture42
    Weight Streaming Efficiency18
    Strengths
    LOOK-M (2024, 60 cites): KV cache compression for multimodal inference
    D2O (2025, 40 cites): dynamic KV eviction with 3x throughput gain
    Gaps
    No chip-level hardware co-design work (power/bandwidth constraints)
    …click to see all
    AR

    Abbas Rahimi

    medium hireability

    Research Staff Member@IBM

    Previously: Postdoctoral Researcher @ UC Berkeley

    Zurich, CH

    36
    Inference-Aware Architecture65
    Weight Compression38
    KV Cache Optimization20
    Weight Streaming Efficiency20
    Strengths
    "Efficient scaling of LLMs with MoE and 3D analog in-memory computing" (2025) — chip-constrained LLM design
    NeurIPS 2025 Spotlight: structured sparse SSMs — attention alternative eliminating KV cache
    Gaps
    No direct KV cache optimization work (MLA, vector-quantized KV, KV eviction)
    …click to see all
    AG

    Abhay Gupta

    medium hireability

    Research Scientist, Machine Learning@Databricks

    Previously: Research Scientist, Machine Learning @ Cerebras Systems

    San Francisco, US

    46
    Weight Compression82
    Inference-Aware Architecture55
    Weight Streaming Efficiency40
    KV Cache Optimization5
    Strengths
    SPDF: 75% sparsity on GPT-3 XL, 2.5x FLOP reduction (NeurIPS 2023, 44 citations)
    High-Sparsity Llama: 70% sparsity + quantization → 8.6x CPU speedup
    Gaps
    No KV cache work — MLA, vector-quantized KV, eviction absent from all papers
    …click to see all
    AB

    Abhimanyu Rajeshkumar Bambhaniya

    medium hireability

    Research Intern@Meta

    Previously: Intern @ Google

    San Francisco, US

    61
    Inference-Aware Architecture82
    Weight Compression68
    Weight Streaming Efficiency55
    KV Cache Optimization40
    Strengths
    GenZ-LLM-Analyzer: LLM inference hardware platform analysis tool
    MIST (2025): co-design framework with explicit KV cache reuse modeling
    Gaps
    No dedicated KV eviction, MLA, or vector-quantized KV cache work
    …click to see all
    AM

    Abhinav Mehrotra

    medium hireability

    Head of On-device GenAI@Samsung

    Previously: Principal Research Scientist @ Samsung

    London, GB

    43
    Weight Compression75
    Inference-Aware Architecture55
    Weight Streaming Efficiency22
    KV Cache Optimization18
    Strengths
    FraQAT: fractional-bit QAT (32→4b) for on-device generative models
    NanoFLUX: 12B→2B diffusion compression for mobile deployment (2026)
    Gaps
    No LLM KV cache work — no MLA, KV eviction, or vector-quantized KV cache
    …click to see all
    AT

    Aditya Tomar

    medium hireability

    Undergraduate Student@UC Berkeley

    Previously: Researcher @ PSSG

    US

    39
    KV Cache Optimization82
    Inference-Aware Architecture38
    Weight Compression28
    Weight Streaming Efficiency8
    Strengths
    QuantSpec (ICML 2025) — hierarchical quantized KV cache, self-speculative decoding
    XQuant — KV cache rematerialization breaking LLM inference memory wall
    Gaps
    No weight streaming or bandwidth-reduction work (axis 3 unaddressed)
    …click to see all
    AA

    Akhil Arunkumar

    medium hireability

    Sr. Principal Software Engineer@d-Matrix

    Previously: SoC Performance Architect @ AMD

    San Francisco, US

    49
    KV Cache Optimization85
    Inference-Aware Architecture70
    Weight Streaming Efficiency25
    Weight Compression15
    Strengths
    Keyformer (MLSys 2024): KV eviction via key token selection, 2.1x latency gain
    d-Matrix Gen AI Serving lead — inference ASIC stack, production KV cache management
    Gaps
    No published work on weight compression or quantization
    …click to see all
    AS

    Aleksandar Samardžić

    medium hireability
    50
    Weight Compression78
    Inference-Aware Architecture70
    Weight Streaming Efficiency45
    KV Cache Optimization5
    Strengths
    CuTeDSL MXFP8 3D quantization kernel — 6+ TB/s on B200
    32x32 MX block scaling for weights in MXFP8 (pytorch/ao)
    Gaps
    No KV cache optimization work found (MLA, KV eviction, etc.)
    …click to see all
    AB

    Alexander Borzunov

    medium hireability

    Researcher@OpenAI

    Previously: Researcher @ Yandex

    San Francisco, US

    53
    Weight Compression88
    Inference-Aware Architecture62
    Weight Streaming Efficiency50
    KV Cache Optimization10
    Strengths
    SpQR: 4x compression, <1% perplexity loss at 3-4 bits (368 citations)
    PETALS int8 model sharding — weight streaming at distributed inference scale
    Gaps
    No KV cache optimization work (MLA, eviction, vector quantization) found
    …click to see all
    AP

    Alexandra Peste

    medium hireability

    Applied Scientist@Canva

    Previously: Postdoctoral Researcher @ Institute of Science and Technology Austria

    Vienna, AT

    36
    Weight Compression78
    Weight Streaming Efficiency42
    Inference-Aware Architecture20
    KV Cache Optimization3
    Strengths
    ISTA DASLab PhD (Alistarh group) — top model compression research lab
    "Sparsity in Deep Learning" (JMLR, 1,229 citations) — co-authored compression survey
    Gaps
    No KV cache, MLA, or KV eviction work — significant gap for axis 1
    …click to see all
    AM

    Alexandre Marques

    medium hireability
    23
    Weight Compression65
    Inference-Aware Architecture15
    KV Cache Optimization8
    Weight Streaming Efficiency5
    Strengths
    51 commits on llm-compressor — Neural Magic production quantization pipeline
    QAT and activation equalization work — core compression techniques
    Gaps
    No KV cache optimization work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    AH

    Ali Hatamizadeh

    medium hireability

    Research Scientist@NVIDIA

    Previously: PhD student @ University of California, Los Angeles

    San Francisco, US

    45
    Inference-Aware Architecture78
    KV Cache Optimization60
    Weight Streaming Efficiency35
    Weight Compression5
    Strengths
    Gated Delta Networks (ICLR 2025) — replaces KV cache with fixed recurrent state
    MambaVision (CVPR 2025) — inference-efficient hybrid Mamba-Transformer at NVIDIA
    Gaps
    No direct MLA, vector-quantized KV, or KV eviction work — KV-free via SSM is adjacent
    …click to see all
    AK

    Alind Khare

    medium hireability

    Senior Researcher@Microsoft

    Previously: PhD Student @ Georgia Institute of Technology

    IN

    54
    Weight Compression75
    Weight Streaming Efficiency72
    Inference-Aware Architecture65
    KV Cache Optimization5
    Strengths
    Weight Sharing Paradigm (SIGOPS 2025) — LLM inference weight-streaming directly on-topic
    ∇QDARTS (TMLR 2025) — joint quantization+NAS for weight compression
    Gaps
    No KV cache work — MLA, KV eviction absent from publication record
    …click to see all
    AL

    Alkaid

    medium hireability
    32
    Inference-Aware Architecture78
    KV Cache Optimization40
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    Blackwell sm100 FMHA decode kernel optimization at Meta/FBGEMM
    6 merged PRs to flash-attention v4 (CuTe DSL masking, R2P, TMEM)
    Gaps
    No KV cache compression research (MLA, vector-quantized KV, eviction)
    …click to see all
    AG

    Amir Gholami

    medium hireability

    Postdoc@University of California, Berkeley

    San Francisco, US

    83
    Weight Compression97
    KV Cache Optimization88
    Inference-Aware Architecture83
    Weight Streaming Efficiency65
    Strengths
    KVQuant (2024): KV cache quantization for 10M context — direct hit
    Survey of quantization methods (2022, 1894 citations) — definitive field reference
    Gaps
    No direct weight streaming bandwidth work — coverage is via quantization, not sparsity/streaming architecture
    …click to see all
    AJ

    Amir Jalalirad

    medium hireability

    Staff Engineer@Qualcomm

    Previously: Senior Research Engineer @ HERE Technologies

    Amsterdam, NL

    51
    Weight Streaming Efficiency75
    Inference-Aware Architecture72
    Weight Compression40
    KV Cache Optimization18
    Strengths
    DIP (MLSys 2025): 46% memory, 40% throughput gain via dynamic weight sparsification
    Explicitly targets DRAM bandwidth bottleneck in decode — weight stream constraint
    Gaps
    No transformer KV cache compression (MLA, vector-quantized KV, eviction) work found
    …click to see all
    AN

    Amir Nassereldine

    medium hireability

    PhD Student@University At Buffalo SUNY

    Previously: Summer Intern @ Modular

    Buffalo, US

    19
    Inference-Aware Architecture40
    Weight Compression22
    Weight Streaming Efficiency8
    KV Cache Optimization5
    Strengths
    NVCiM-PT: hardware-software co-design for edge LLM inference (DATE 2025)
    Modular MAX Engine intern — AI inference runtime, hardware efficiency
    Gaps
    No KV cache work — no MLA, KV eviction, or quantized KV papers
    …click to see all
    AY

    Amir Yazdanbakhsh

    medium hireability

    Research Scientist@DeepMind

    Previously: Research Scientist @ Google

    San Francisco, US

    80
    Inference-Aware Architecture92
    Weight Compression90
    KV Cache Optimization70
    Weight Streaming Efficiency68
    Strengths
    SLiM (2025): combined quantization + sparsity for LLM weight compression
    HW-SW co-design primary research identity — 'Beyond Moore's Law' (2025)
    Gaps
    No explicit MLA or KV eviction paper — closest is linear attention replacing KV cache
    …click to see all
    AA

    Ammar Ahmad Awan

    medium hireability

    Principal AI Software Architect@Microsoft

    Previously: Principal Research Manager @ Microsoft

    Richardson, US

    29
    Inference-Aware Architecture72
    Weight Streaming Efficiency20
    Weight Compression12
    KV Cache Optimization12
    Strengths
    Leads LLM inference for Microsoft Maia AI chip — explicit chip-aware work
    DeepSpeed-Inference (577 citations) — inference at unprecedented scale
    Gaps
    No KV cache research (MLA, eviction, vector-quantized KV)
    …click to see all
    AL

    Andrew Li

    medium hireability

    Master's Student@University of California, Berkeley

    Previously: Researcher @ Google

    38
    Inference-Aware Architecture80
    Weight Compression58
    Weight Streaming Efficiency8
    KV Cache Optimization5
    Strengths
    CVPR 2021: designed EfficientNet-X for TPU/GPU — 2x faster than EfficientNet
    FLIQS (2024): quantization NAS covering FP8/INT — weight compression via search
    Gaps
    No published work on KV cache (MLA, KV eviction, vector-quantized KV)
    …click to see all
    AN

    andrewor14

    medium hireability
    49
    Weight Compression88
    Inference-Aware Architecture62
    Weight Streaming Efficiency40
    KV Cache Optimization5
    Strengths
    pytorch/ao: led INT4/INT8/Float8/NF4 v2 tensor architecture (144 commits)
    NVFP4 QAT — NVIDIA FP4 microscaling format, Blackwell inference chip target
    Gaps
    No KV cache work — MLA, vector-quantized KV, KV eviction all absent
    …click to see all
    AF

    Andrew W Fitzgibbon

    medium hireability

    Engineering Fellow@Graphcore

    Previously: Partner Researcher @ Microsoft

    Cambridge, GB

    66
    Weight Compression92
    Inference-Aware Architecture88
    Weight Streaming Efficiency72
    KV Cache Optimization12
    Strengths
    FP8 LLM inference paper — NeurIPS 2023 Oral, 111M–70B parameter models
    Scalify (ICML 2024) — end-to-end scale propagation for low-precision LLMs
    Gaps
    No published work on KV cache: MLA, KV eviction, or vector-quantized KV
    …click to see all
    AG

    Andrey Gromov

    medium hireability

    Research Scientist@Meta

    Previously: Assistant Professor @ University of Maryland, College Park

    34
    Weight Compression80
    Inference-Aware Architecture35
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    PARQ (ICML 2025) — piecewise-affine regularized quantization for LLMs
    Deeper Layers pruning (2024, 153 cit) — structured weight reduction via layer removal
    Gaps
    No KV cache optimization work (no MLA, KV eviction, vector quantization)
    …click to see all
    AL

    Angel Li

    medium hireability
    43
    Weight Compression75
    KV Cache Optimization45
    Inference-Aware Architecture40
    Weight Streaming Efficiency10
    Strengths
    MXTensor (mxfp8/nvfp4) in pytorch/ao — microscaling quantization as in query
    int4, int8 weight quantization for vllm/safetensors serialization
    Gaps
    No evidence of MLA or KV eviction strategies specifically
    …click to see all
    AN

    Aniruddha Nrusimha

    medium hireability

    PhD candidate@MIT

    Previously: Undergrad student @ University of California Berkeley

    Boston, US

    62
    KV Cache Optimization88
    Inference-Aware Architecture72
    Weight Compression58
    Weight Streaming Efficiency30
    Strengths
    Cross-Layer Attention (NeurIPS 2024) — primary-author KV cache reduction paper
    FlashFormer (2025) — whole-model kernels for low-batch inference, hardware-aware
    Gaps
    Weight streaming specifically (power-per-token at low-power mode) — not explicitly addressed
    …click to see all
    AO

    Antonio Orvieto

    medium hireability

    Principal Investigator (PI)@ELLIS Institute Tübingen

    Previously: PhD Researcher @ ETH Zurich

    Tübingen, DE

    43
    KV Cache Optimization72
    Inference-Aware Architecture68
    Weight Streaming Efficiency28
    Weight Compression5
    Strengths
    LRU paper (446 citations) — eliminates KV cache via linear recurrent state
    Griffin/Hawk co-author — hybrid recurrent+attention with O(1) decode memory
    Gaps
    No weight compression work (quantization, MicroScaling, topological regularization)
    …click to see all
    AM

    Avner May

    medium hireability

    Staff Research Scientist@Together

    Previously: Research Scientist @ Google

    New York, US

    36
    Inference-Aware Architecture62
    KV Cache Optimization50
    Weight Compression22
    Weight Streaming Efficiency10
    Strengths
    MagicDec: sparse KV cache to address KV bottleneck at high batch sizes
    Sequoia: hardware-aware speculative decoding — explicit HW modeling
    Gaps
    No MLA, vector-quantized KV, or KV quantization work
    …click to see all
    AC

    Ayan Chakraborty

    medium hireability

    Doctoral Student@EPFL

    Previously: Intern @ Nvidia

    Écublens, CH

    35
    Weight Compression72
    Inference-Aware Architecture42
    Weight Streaming Efficiency20
    KV Cache Optimization5
    Strengths
    Sparsity+quantization interplay paper — LLMs (2024, 15 citations)
    Block FP (Mixed-Mantissa) for DNN accelerators — microscaling-adjacent
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    BB

    Babak Ehteshami Bejnordi

    medium hireability

    Sr. Staff Engineer and Manager@Qualcomm

    Previously: Deep Learning and Computer Vision for Autonomous Driving @ Mapscape

    Amsterdam, NL

    60
    Inference-Aware Architecture72
    KV Cache Optimization70
    Weight Streaming Efficiency58
    Weight Compression38
    Strengths
    KaVa (2025): KV-cache compression via distillation — direct axis match
    Cache-Conditional Experts: 2× speedup on DRAM-constrained mobile MoE inference
    Gaps
    No MLA or vector-quantized KV work — KV research is distillation-based, not attention-level
    …click to see all
    BW

    Bailin Wang

    medium hireability

    MIT CSAIL

    Previously: Researcher @ Apple

    40
    Inference-Aware Architecture78
    KV Cache Optimization68
    Weight Streaming Efficiency8
    Weight Compression7
    Strengths
    GLA Transformers: FlashLinearAttention beats FlashAttention-2, hardware I/O-aware design
    Linear attention eliminates growing KV cache — constant inference memory
    Gaps
    No work on MLA, vector-quantized KV cache, or KV eviction — alternative paradigm (eliminates vs. compresses)
    …click to see all
    BP

    Barun Patra

    medium hireability

    Member of Technical Staff@Microsoft

    Previously: Senior Applied Scientist @ Microsoft

    Seattle, US

    27
    Inference-Aware Architecture75
    KV Cache Optimization20
    Weight Streaming Efficiency8
    Weight Compression5
    Strengths
    S2-Attention (ICLR 2025): Triton kernels, 4.5x inference speedup at 7B scale
    Hardware-aware design: context sharding optimized for memory IO and parallelization
    Gaps
    No KV cache eviction, MLA, or vector-quantized KV work
    …click to see all
    BF

    Benjamin Fineran

    medium hireability
    49
    Weight Compression90
    Weight Streaming Efficiency45
    Inference-Aware Architecture40
    KV Cache Optimization20
    Strengths
    #2 contributor llm-compressor — 293 commits, weight quantization at scale
    Authored HFQuantizer for compressed-tensors (merged HF transformers Sep 2024)
    Gaps
    No visible KV cache eviction, MLA, or vector-quantized KV work
    …click to see all
    BC

    Berlin Chen

    medium hireability
    24
    Inference-Aware Architecture55
    KV Cache Optimization30
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    Mamba-3 co-author with Tri Dao / Albert Gu — top SSM lab
    Constant-memory SSM recurrence eliminates KV cache at inference
    Gaps
    No direct KV eviction, MLA, or vector-quantized KV cache work
    …click to see all
    BA

    Bilge Acun

    medium hireability

    Research Scientist@Meta

    Previously: Research Staff Member @ IBM

    San Francisco, US

    53
    KV Cache Optimization72
    Inference-Aware Architecture68
    Weight Compression48
    Weight Streaming Efficiency22
    Strengths
    CHAI (ICML 2024): KV cache reduction via head attention clustering
    CATransformers (NeurIPS 2025): joint model-hardware NAS for inference
    Gaps
    No direct MLA, GQA, or KV eviction policy work — CHAI is head pruning, not KV quantization
    …click to see all
    BG

    Bofei Gao

    medium hireability

    MS student@Peking University

    CN

    23
    KV Cache Optimization72
    Inference-Aware Architecture10
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    PyramidKV: pyramidal KV eviction reduces cache footprint (155 citations)
    Kimi k1.5 + k2 co-authorship — Moonshot AI inference lab credibility
    Gaps
    No weight compression or topological regularization work
    …click to see all
    BC

    Boju Chen

    medium hireability

    PhD student@Tsinghua University

    Beijing, CN

    31
    KV Cache Optimization65
    Inference-Aware Architecture50
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    MoA: Mixture of Sparse Attention — 47 citations, CoLM'25 acceptance
    Sliding-window KV eviction: 1.2-1.4x memory reduction, 6.6-8.2x decode throughput
    Gaps
    No explicit MLA, vector-quantized KV cache, or KV eviction (policy-based) work
    …click to see all
    BL

    Bo Li

    medium hireability

    AI Compute DevTech Engineer@NVIDIA

    Previously: Semester Project Intern @ Disney Research

    Shanghai, CN

    74
    Weight Compression85
    KV Cache Optimization80
    Inference-Aware Architecture70
    Weight Streaming Efficiency60
    Strengths
    QuaRot: 4-bit quantization of all weights + KV cache (NeurIPS 2024, 286 citations)
    KV cache quantized to 4 bits in QuaRot — direct KV cache size reduction
    Gaps
    No explicit KV eviction or MLA work — QuaRot focuses on quantization
    …click to see all
    BP

    Bo Peng

    medium hireability

    Undergrad student@University of Hong Kong

    Hong Kong, HK

    71
    KV Cache Optimization95
    Inference-Aware Architecture90
    Weight Streaming Efficiency72
    Weight Compression28
    Strengths
    RWKV-LM: KV-free recurrent architecture eliminates KV cache entirely
    Albatross engine: 10,250 TPS on single RTX 5090 (RWKV7 7.2B fp16)
    Gaps
    No microscaling / topological regularization work (weight compression axis)
    …click to see all
    BB

    Boris van Breugel

    medium hireability

    Senior Machine Learning Researcher@Qualcomm

    Previously: PhD Researcher @ University of Cambridge

    Amsterdam, NL

    21
    Weight Compression52
    Inference-Aware Architecture20
    KV Cache Optimization5
    Weight Streaming Efficiency5
    Strengths
    FPTQuant (2025): function-preserving transforms for LLM quantization
    HadaNorm (2025): mean-centered transforms for diffusion transformer quantization
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV) — major axis gap
    …click to see all
    BL

    Boxun Li

    medium hireability

    Principal Researcher@Infinigence-AI

    Previously: Researcher @ Megvii Technology

    Durham, US

    34
    Inference-Aware Architecture65
    Weight Streaming Efficiency35
    Weight Compression20
    KV Cache Optimization15
    Strengths
    Megrez-Omni first author — edge inference model with SW-HW co-design
    Megrez2: cross-layer expert sharing, 3B active / 7.5B stored parameters
    Gaps
    No KV cache work (MLA, vector-quantized KV, eviction strategies)
    …click to see all
    BZ

    Bo Zheng

    medium hireability

    Researcher@Alibaba

    Previously: Researcher @ Alibaba

    26
    Inference-Aware Architecture45
    KV Cache Optimization30
    Weight Streaming Efficiency25
    Weight Compression5
    Strengths
    Gated Attention (arXiv:2505.06708): query-dependent sparse gating, attention-sink-free
    Qwen3 contributor — production MoE+dense LLM deployed at massive scale
    Gaps
    No direct KV eviction, MLA, or vector-quantized KV cache work found
    …click to see all
    BR

    Brian K. Ryu

    medium hireability
    72
    Inference-Aware Architecture82
    Weight Compression80
    KV Cache Optimization65
    Weight Streaming Efficiency62
    Strengths
    100 commits to flashinfer — KV cache / attention kernel library
    FP4 block-scaled kernel (SM120) — 1.20× over CUTLASS, extreme weight compression
    Gaps
    No published papers on MLA or KV eviction specifically
    …click to see all
    BA

    Byung Hoon Ahn

    medium hireability

    Software Engineer@Apple

    Previously: Research Scientist @ Protopia AI

    San Francisco, US

    24
    Inference-Aware Architecture60
    Weight Streaming Efficiency20
    KV Cache Optimization10
    Weight Compression5
    Strengths
    FlexInfer (MLSys 2025): hardware-aware adaptive LLM inference scheduling
    Tandem Processor (ASPLOS 2024): accelerator co-design for emerging NN operators
    Gaps
    No KV cache optimization work (no MLA, vector quantization, or eviction research)
    …click to see all
    CL

    Carlo Luschi

    medium hireability

    VP & Head of Research@Graphcore

    Previously: Director of Research @ Graphcore

    Oxford, GB

    86
    Inference-Aware Architecture92
    Weight Compression90
    KV Cache Optimization82
    Weight Streaming Efficiency80
    Strengths
    SparQ Attention: 8x attention data transfer savings via selective KV fetching
    Leads all research at Graphcore — IPU chip company with explicit HW constraints
    Gaps
    No published work specifically on MLA or KV eviction strategies
    …click to see all
    CY

    Changdi Yang

    medium hireability

    Intern@Snap

    Previously: PhD student @ Northeastern University

    40
    Weight Compression75
    Inference-Aware Architecture42
    Weight Streaming Efficiency35
    KV Cache Optimization8
    Strengths
    EdgeQAT/Squat: sub-8-bit token-adaptive QAT, 2.37x mobile inference speedup
    HyWIA: structured LLM pruning, 50% size reduction, +2.82% accuracy vs LLM-Pruner
    Gaps
    No KV cache, MLA, or KV eviction work found
    …click to see all
    CM

    Changhai Man

    medium hireability

    PhD student@Georgia Institute of Technology

    Atlanta, US

    37
    Inference-Aware Architecture60
    Weight Compression50
    Weight Streaming Efficiency35
    KV Cache Optimization3
    Strengths
    Multi-bit-width systolic accelerator (43 citations) — hardware inference co-design
    RankSearch: auto tensor compression for edge LSTM networks
    Gaps
    No KV cache or attention/transformer inference work found
    …click to see all
    CT

    Chaofan Tao

    medium hireability

    Research Scientist@Huawei

    Previously: Software Engineer @ Meta

    Hong Kong, HK

    53
    KV Cache Optimization80
    Weight Compression75
    Inference-Aware Architecture38
    Weight Streaming Efficiency18
    Strengths
    D2O (2025): dynamic layer/token KV cache compression for long-context LLMs
    UNComp (2024): KV-cache sparsity via uncertainty — two dedicated KV papers
    Gaps
    No hardware-specific chip co-design work; no power/packaging constraint awareness
    …click to see all
    CX

    Chaojun Xiao

    medium hireability

    Post-Doctoral Researcher@Tsinghua University

    Previously: Business Development Intern @ P.E.R.K. Consulting

    Beijing, CN

    52
    KV Cache Optimization80
    Inference-Aware Architecture72
    Weight Streaming Efficiency42
    Weight Compression12
    Strengths
    Locret (2025): trained retaining heads enabling principled KV eviction
    InfLLM (NeurIPS 2024): context memory for long-context KV management
    Gaps
    No weight compression work (quantization, microscaling, topological regularization)
    …click to see all
    CZ

    Chenggang Zhao

    medium hireability

    infra@DeepSeek AI

    ex-NVIDIA, SenseTime

    Hangzhou, CN

    79
    KV Cache Optimization90
    Inference-Aware Architecture85
    Weight Compression75
    Weight Streaming Efficiency65
    Strengths
    DeepSeek-V2 co-author — introduced MLA for KV cache footprint reduction
    DeepGEMM: FP8 GEMM with fine-grained scaling — direct weight compression for inference
    Gaps
    Infra/systems focus — less core architecture research independent of DeepSeek team
    …click to see all
    CL

    Cheng Li

    medium hireability

    Member of Technical Staff@Black Forest Labs

    Previously: Research Engineer @ Databricks

    Bellevue, US

    49
    Weight Compression82
    Inference-Aware Architecture60
    Weight Streaming Efficiency35
    KV Cache Optimization20
    Strengths
    DeepSpeed-Inference (632 citations) — led efficient inference at scale
    INT4 quantization paper (ICML 2023) — direct weight compression evidence
    Gaps
    No MLA, KV eviction, or vector-quantized KV cache papers
    …click to see all
    CL

    Chuanjian Liu

    medium hireability

    Researcher@Huawei

    Previously: Researcher @ Huawei

    32
    Weight Compression78
    Inference-Aware Architecture30
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    Rethinking 1-bit Optimization (2025) — extreme LLM weight compression
    Bi-ViT (2023) — pushes binarization limit for vision transformers
    Gaps
    No KV cache optimization work (no MLA, eviction, or vector-quantized KV)
    …click to see all
    CG

    Cong Guo

    medium hireability

    Postdoctoral Associate@Duke University

    Previously: Research intern @ Shanghai Qi Zhi Institute

    Durham, US

    83
    Weight Compression95
    KV Cache Optimization92
    Inference-Aware Architecture88
    Weight Streaming Efficiency58
    Strengths
    Ecco (ISCA'25): direct KV cache compression paper at flagship architecture venue
    VQ-LLM (2025): vector quantization for LLM inference — matches JD's VQ-MLA mention
    Gaps
    Weight streaming efficiency — sparsity work is adjacent; no paper explicitly on decode stream bandwidth for low-power chips
    …click to see all
    CH

    Connor Holmes

    medium hireability

    Researcher@OpenAI

    Previously: Researcher @ Microsoft

    San Francisco, US

    59
    Weight Compression82
    Inference-Aware Architecture75
    Weight Streaming Efficiency50
    KV Cache Optimization28
    Strengths
    NxMTransformer: semi-structured sparsity ADMM (2021, 30 citations)
    Low-bit NxM sparsity compression (2022) — quantization + pruning
    Gaps
    No MLA, vector-quantized KV cache, or KV eviction papers
    …click to see all
    DH

    Daniel HAZIZA

    medium hireability

    Research Engineer - GPU efficiency@Meta

    Previously: Research Engineer @ Meta

    Paris, FR

    57
    Inference-Aware Architecture82
    KV Cache Optimization75
    Weight Streaming Efficiency52
    Weight Compression18
    Strengths
    Flash-Decoding (2023): KV cache memory bandwidth optimization for long-context inference
    xFormers co-author — production GPU inference efficiency library at Meta FAIR
    Gaps
    No published work specifically on MLA, vector-quantized KV, or KV eviction strategies
    …click to see all
    DV

    Daniel Vega-Myhre

    medium hireability
    47
    Weight Compression80
    Inference-Aware Architecture55
    Weight Streaming Efficiency38
    KV Cache Optimization15
    Strengths
    254 commits to pytorch/ao — #3 contributor, quantization + sparsity
    MXFP8 microscaling quantization (MX = microscaling, chip-aware precision)
    Gaps
    No KV cache optimization work found (MLA, KV eviction, vector-quantized KV)
    …click to see all
    DS

    Daria Soboleva

    medium hireability

    Head Research Scientist@Cerebras Systems

    Previously: Senior Research Scientist @ Cerebras Systems

    San Francisco, US

    36
    Inference-Aware Architecture85
    Weight Compression35
    Weight Streaming Efficiency15
    KV Cache Optimization10
    Strengths
    5+ years at Cerebras designing MoE for wafer-scale inference hardware
    BTLM: 3B params achieves 7B quality at 3× less inference compute
    Gaps
    No KV cache work (MLA, vector-quantized KV, eviction strategies)
    …click to see all
    DC

    David Corvoysier

    medium hireability
    50
    Weight Compression88
    Inference-Aware Architecture55
    KV Cache Optimization30
    Weight Streaming Efficiency25
    Strengths
    huggingface/optimum-quanto: 644 commits, sole primary maintainer
    INT2/INT4/INT8/FP8 — full quantization stack across precisions
    Gaps
    No evidence of MLA, vector-quantized KV, or KV eviction algorithm work
    …click to see all
    DR

    David W. Romero

    medium hireability

    Research Scientist@Cartesia

    Previously: Research Scientist @ NVIDIA

    San Francisco, US

    54
    KV Cache Optimization65
    Inference-Aware Architecture60
    Weight Compression55
    Weight Streaming Efficiency35
    Strengths
    Cartesia RS: designs KV-free SSM/hybrid LM architectures for inference
    'Systems & Algorithms for Convolutional Multi-Hybrid LMs at Scale' (2025) — KV-free at scale
    Gaps
    No direct MLA, vector-quantized KV, or KV eviction work — eliminates cache rather than optimizes it
    …click to see all
    DG

    Daya Guo

    medium hireability

    Associate Professor@Sun Yat-sen University

    Previously: Postdoctoral Fellow @ Clemson University

    Zhuhai, CN

    45
    KV Cache Optimization65
    Inference-Aware Architecture60
    Weight Compression35
    Weight Streaming Efficiency20
    Strengths
    DeepSeek-V2 co-author — introduced MLA (93.3% KV cache compression)
    DeepSeek-V3 co-author — MLA + FP8 quantization + MTP continued
    Gaps
    Primary research focus is code intelligence, not KV cache or inference hardware
    …click to see all
    DN

    Deepak Narayanan

    medium hireability

    Senior Applied Deep Learning Research Scientist@NVIDIA

    Previously: Senior Researcher @ Microsoft

    Seattle, US

    55
    Inference-Aware Architecture88
    KV Cache Optimization65
    Weight Compression38
    Weight Streaming Efficiency28
    Strengths
    "The Case for Co-Designing Model Architectures with Hardware" (ICPP 2024) — exact query match
    Nemotron-H: Mamba layers replace attention, eliminating KV cache at inference (3× speedup)
    Gaps
    No dedicated MLA, vector-quantized KV, or KV eviction work — KV reduction is architectural not algorithmic
    …click to see all
    DS

    Dipika Sikka

    medium hireability
    53
    Weight Compression88
    Weight Streaming Efficiency65
    Inference-Aware Architecture55
    KV Cache Optimization5
    Strengths
    257 commits on vllm-project/llm-compressor — top-4 contributor
    MXFP4 support — microscaling format directly relevant to query's MX axis
    Gaps
    No KV cache work found (MLA, KV eviction, quantized KV) — key axis is blank
    …click to see all
    DW

    Di Wu

    medium hireability

    Director, Deep Learning Algorithm and Software@NVIDIA

    Previously: Co-Founder and CEO @ OmniML (acquired by NVIDIA)

    San Francisco, US

    50
    Weight Compression75
    Inference-Aware Architecture60
    KV Cache Optimization45
    Weight Streaming Efficiency20
    Strengths
    Founded OmniML — model compression startup acquired by NVIDIA
    Leads NVIDIA FP4 quantization and TensorRT Model Optimizer
    Gaps
    Thin research publication record — most papers from 2010-2018 FPGA era
    …click to see all
    DJ

    Donghyeon Joo

    medium hireability

    Research Associate@AMD

    Previously: Research Associate - PhD @ AMD

    College Park, US

    64
    KV Cache Optimization82
    Inference-Aware Architecture80
    Weight Streaming Efficiency52
    Weight Compression40
    Strengths
    MUSTAFAR (NeurIPS 2025): direct KV cache pruning via unstructured sparsity
    CORUSCANT (MICRO 2025): hardware-aware co-design of GPU kernels + sparse tensor cores
    Gaps
    No direct work on MLA, vector-quantized KV, or attention-level cache compression
    …click to see all
    DL

    Donghyun Lee

    medium hireability

    PhD student@University of Southern California

    Previously: Research Scholar @ Yale University

    New Haven, US

    32
    Weight Compression78
    Weight Streaming Efficiency30
    Inference-Aware Architecture15
    KV Cache Optimization3
    Strengths
    GPTAQ (ICML 2025): finetuning-free LLM weight quantization at inference time
    KronQ: Kronecker-factored Hessian — novel structured quantization approach
    Gaps
    No KV cache work — MLA, vector-quantized KV, eviction all absent
    …click to see all
    DG

    Driss Guessous

    medium hireability

    Staff Software Engineer@Meta

    Previously: Senior Machine Learning Engineer @ Meta

    Redondo Beach, US

    69
    Weight Compression88
    Inference-Aware Architecture78
    KV Cache Optimization60
    Weight Streaming Efficiency50
    Strengths
    117 PRs in pytorch/ao — NVFP4, float8, MX microscaling quantization
    FlexAttention co-author (2025) — fused attention kernel programmability
    Gaps
    No direct MLA or KV eviction work found — attention work is kernel-level, not cache strategy
    …click to see all
    EI

    eigen

    medium hireability
    40
    KV Cache Optimization75
    Inference-Aware Architecture50
    Weight Compression25
    Weight Streaming Efficiency10
    Strengths
    51 commits on flashinfer-ai/flashinfer — paged-KV + GQA core focus
    PR #3221: paged-KV indices, ragged indptrs, RoPE cos/sin infrastructure
    Gaps
    No evidence of MLA or KV eviction strategies specifically
    …click to see all
    EK

    Eldar Kurtic

    medium hireability

    Principal Research Scientist@Red Hat

    Previously: Senior Research Engineer @ Red Hat

    Vienna, AT

    54
    Weight Compression90
    Inference-Aware Architecture65
    Weight Streaming Efficiency55
    KV Cache Optimization5
    Strengths
    ZipLM: hardware-aware structured pruning — explicitly co-designs with inference chip constraints
    OBS/second-order pruning (177 cites) — foundational LLM weight compression
    Gaps
    No KV cache, MLA, or KV eviction work — missing axis entirely
    …click to see all
    EI

    Eugenia Iofinova

    medium hireability

    PhD student@Alistarh Group

    Previously: Intern @ Microsoft

    AT

    35
    Weight Compression78
    Weight Streaming Efficiency30
    Inference-Aware Architecture25
    KV Cache Optimization5
    Strengths
    AC/DC: alternating compressed/decompressed training (86 citations)
    Alistarh Group pedigree — SparseGPT/GPTQ originating lab
    Gaps
    No KV cache work (MLA, vector-quantized KV, eviction) found
    …click to see all
    FM

    Fanxu Meng

    medium hireability

    Sr. Technologist@Technip Energies

    Previously: Research Associate @ Houston Advanced Research Center

    Houston, US

    70
    KV Cache Optimization92
    Weight Compression72
    Inference-Aware Architecture72
    Weight Streaming Efficiency45
    Strengths
    TransMLA (NeurIPS 2025 Spotlight): 93% KV cache compression, 10.6x speedup
    TPLA (ASPLOS 2026): MLA tensor-parallel attention for prefill-decode decoupling
    Gaps
    No specific work on weight stream bandwidth reduction during decode
    …click to see all
    FS

    Fei Sun

    medium hireability

    Software Engineer@Meta

    Previously: Research Scientist @ Alibaba Group

    San Francisco, US

    57
    Weight Compression88
    Inference-Aware Architecture85
    Weight Streaming Efficiency45
    KV Cache Optimization8
    Strengths
    CHEX: channel compression for CNNs (2022, 114 citations)
    FBNet: hardware-aware NAS — inference chip-aware design (1807 cit.)
    Gaps
    No KV cache work found (MLA, vector-quantized KV, KV eviction)
    …click to see all
    FI

    Forrest Iandola

    medium hireability

    AI Research Scientist@Meta

    Previously: Head of Perception @ Anduril Industries

    San Francisco, US

    61
    Weight Compression92
    Inference-Aware Architecture78
    KV Cache Optimization40
    Weight Streaming Efficiency35
    Strengths
    SqueezeNet: AlexNet accuracy with 50x fewer params (<0.5MB)
    MobileLLM: block-wise weight-sharing + GQA for on-device LLMs
    Gaps
    No MLA, vector-quantized KV, or KV eviction work found
    …click to see all
    FM

    Funtowicz Morgan

    medium hireability
    40
    Weight Streaming Efficiency85
    Inference-Aware Architecture55
    Weight Compression15
    KV Cache Optimization5
    Strengths
    hmll: loads AI model weights at wire speed via io_uring/mmap
    ionic: CUDA planner pipelining NVMe→GPU weight streaming
    Gaps
    No evidence of KV cache work (MLA, eviction, quantization)
    …click to see all
    FT

    Fuwen Tan

    medium hireability

    R&D@ByteDance

    Previously: Research Scientist @ Samsung

    San Francisco, US

    56
    Weight Compression80
    Weight Streaming Efficiency72
    Inference-Aware Architecture62
    KV Cache Optimization10
    Strengths
    MobileQuant (EMNLP 2024): quantization for on-device LLM inference chips
    Progressive Mixed-Precision Decoding (ICLR 2025): variable-precision decode phase
    Gaps
    No KV cache work: MLA, KV eviction, vector-quantized KV absent from portfolio
    …click to see all
    FM

    fxmarty (Felix Marty)

    medium hireability
    48
    Weight Compression82
    Inference-Aware Architecture50
    KV Cache Optimization38
    Weight Streaming Efficiency22
    Strengths
    AutoGPTQ maintainer — production GPTQ weight quantization at scale
    Marlin FP8 kernels (optimum-quanto #237/#241) — INT4/FP8 inference
    Gaps
    No evidence of MLA, vector-quantized KV cache, or KV eviction strategies
    …click to see all
    GO

    Gabriele Oliaro

    medium hireability

    CS PhD Student@Snowflake AI Research

    Previously: Research Scientist Intern @ Snowflake

    29
    Inference-Aware Architecture55
    Weight Compression35
    Weight Streaming Efficiency15
    KV Cache Optimization12
    Strengths
    Korch (ASPLOS 2024): hardware-aware kernel orchestration for tensor programs
    Quantized Side Tuning (ACL 2024 Outstanding): 4-bit weight quantization for LLMs
    Gaps
    No work on KV cache compression (MLA, vector-quantized KV, eviction)
    …click to see all
    GK

    Geethan Karunaratne

    medium hireability

    Researcher@IBM

    Previously: Postdoctoral Researcher @ IBM

    Zurich, CH

    48
    Inference-Aware Architecture85
    Weight Compression62
    Weight Streaming Efficiency30
    KV Cache Optimization15
    Strengths
    64-core PCM DNN inference chip — co-designed models for in-memory compute (265 citations)
    HERMES-Core 14nm PCM/CMOS chip — 1.59 TOPS/mm², literal inference chip co-design
    Gaps
    No KV cache optimization work (no MLA, KV eviction, or vector-quantized KV papers)
    …click to see all
    GZ

    Genghan Zhang

    medium hireability

    Ph.D. Student in Computer Science@Stanford University

    Previously: Intern @ NVIDIA

    56
    Inference-Aware Architecture75
    Weight Streaming Efficiency65
    KV Cache Optimization55
    Weight Compression30
    Strengths
    CATS (37 citations): activation sparsity reducing LLM inference streams
    AccelOpt (MLSys 2026): kernel optimization agents for AI accelerators
    Gaps
    No direct MLA, vector-quantized KV cache, or KV eviction work
    …click to see all
    GJ

    Geonhwa Jeong

    medium hireability

    Research Scientist@Meta

    Previously: Graduate Research Assistant @ Georgia Institute of Technology

    San Francisco, US

    53
    Inference-Aware Architecture85
    Weight Compression60
    Weight Streaming Efficiency45
    KV Cache Optimization20
    Strengths
    TASDER (MLSys 2025): structured sparse weight approx, 83% EDP improvement
    2:4 activation sparsity for Transformer inference with FP8 (SLLM 2025)
    Gaps
    No MLA, vector-quantized KV, or KV eviction algorithm work
    …click to see all
    GS

    Gobinda Saha

    medium hireability

    AI Research Scientist@Meta

    Previously: Graduate Student Researcher @ Center for Brain-Inspired Computing

    San Francisco, US

    40
    KV Cache Optimization75
    Weight Compression45
    Inference-Aware Architecture30
    Weight Streaming Efficiency10
    Strengths
    Eigen Attention: 40% KV cache reduction via low-rank attention (arXiv 2024)
    Meta Super Intelligence Labs — LLM-focused research role
    Gaps
    No evidence of MLA or KV eviction strategies beyond low-rank attention
    …click to see all
    GS

    Grigory Sizov

    medium hireability
    22
    KV Cache Optimization50
    Inference-Aware Architecture25
    Weight Compression8
    Weight Streaming Efficiency5
    Strengths
    Paged attention in FlashAttention varlen — direct KV cache memory management
    Split-kv + M↔H swap for decoding attention — KV splitting optimization
    Gaps
    No weight compression or quantization research visible
    …click to see all
    GL

    Guangda Liu

    medium hireability

    PhD student@Microsoft Research Asia Alumni

    Previously: Research Intern @ Microsoft

    Shanghai, CN

    31
    KV Cache Optimization85
    Inference-Aware Architecture30
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    ClusterKV: recallable KV compression via semantic clustering (2x latency speedup)
    FreeKV: speculative KV retrieval — 13x faster than SOTA (2025)
    Gaps
    No work on weight compression or topological regularization
    …click to see all
    GL

    Guangda Liu

    medium hireability
    47
    KV Cache Optimization88
    Inference-Aware Architecture55
    Weight Compression25
    Weight Streaming Efficiency20
    Strengths
    FreeKV (arXiv:2505.13109): first-author, 13× speedup over SOTA KV retrieval
    Speculative retrieval + CPU/GPU hybrid KV layout — novel system co-design
    Gaps
    No direct work on weight compression algorithms (quantization/pruning authorship)
    …click to see all
    HY

    Haichuan Yang

    medium hireability

    Staff Software Engineer@DeepMind

    Previously: Research Scientist @ Meta

    San Francisco, US

    44
    Weight Compression82
    Inference-Aware Architecture55
    Weight Streaming Efficiency35
    KV Cache Optimization5
    Strengths
    Sparsity+quantization joint learning — CVPR 2020, 112 citations
    ECC: energy-constrained, platform-independent DNN compression (CVPR 2019)
    Gaps
    No KV cache, MLA, or KV eviction work in publication record
    …click to see all
    HQ

    Haifeng Qian

    medium hireability

    Principal Applied Scientist@NVIDIA

    Previously: Manager and Senior Applied Scientist @ Amazon

    San Francisco, US

    63
    Inference-Aware Architecture85
    KV Cache Optimization82
    Weight Compression50
    Weight Streaming Efficiency35
    Strengths
    Nemotron-H: Mamba layers eliminate KV cache entirely (constant memory per token)
    Bifurcated Attention: directly reduces KV memory IO during high-batch decoding
    Gaps
    No direct work on microscaling, topological regularization, or bits-per-weight constraints
    …click to see all
    HH

    Hailin Hu

    medium hireability

    Researcher@Huawei

    Previously: PhD student @ Tsinghua University

    33
    Weight Compression55
    KV Cache Optimization35
    Inference-Aware Architecture35
    Weight Streaming Efficiency5
    Strengths
    Transformer Compression Survey (2024) — expert knowledge of pruning, quantization, KD
    SDTP token pruning (2025) — KV cache compression compatible, 1.75× inference speedup
    Gaps
    No direct work on MLA or vector-quantized KV cache
    …click to see all
    HC

    Han Cai

    medium hireability

    AI Research Scientist@NVIDIA

    Previously: Research Intern @ NVIDIA

    Boston, US

    71
    Inference-Aware Architecture92
    Weight Compression80
    KV Cache Optimization65
    Weight Streaming Efficiency45
    Strengths
    Jet-Nemotron (NeurIPS 2025): hybrid linear/full-attention LM targeting KV cache reduction
    ProxylessNAS (2.5K cites): direct hardware-aware NAS on real target hardware
    Gaps
    No direct MLA, vector-quantized KV cache, or KV eviction-specific papers
    …click to see all
    HY

    Hanchen Ye

    medium hireability

    ML/HW/SW Co-Design Engineer@ElastixAI

    Previously: ML/HW/SW Co-Design Engineer @ Apple

    Seattle, US

    65
    Inference-Aware Architecture82
    KV Cache Optimization78
    Weight Streaming Efficiency72
    Weight Compression28
    Strengths
    SnapKV (NeurIPS'24, 313 citations) — KV eviction co-author
    StreamTensor (MICRO'25) — tensor streaming in LLM dataflow accelerators
    Gaps
    SnapKV is 6th-author contribution, not primary lead
    …click to see all
    HG

    Han Guo

    medium hireability

    Research Intern@Together AI

    Previously: Research Intern @ IBM

    San Francisco, US

    55
    Weight Compression85
    KV Cache Optimization65
    Inference-Aware Architecture50
    Weight Streaming Efficiency20
    Strengths
    FLUTE repo: CUDA C++ lookup-table quantization for LLMs — hardware-facing weight compression
    LQ-LoRA (ICLR 2024, 84 cites): quantized matrix decomp for efficient LLM finetuning
    Gaps
    No evidence of chip-constraint-aware architecture co-design (power/bandwidth budgets)
    …click to see all
    HW

    Hanrui Wang

    medium hireability

    Researcher@Stealth mode company

    Previously: PhD student @ Massachusetts Institute of Technology

    53
    Weight Compression72
    Inference-Aware Architecture65
    KV Cache Optimization55
    Weight Streaming Efficiency20
    Strengths
    SpAtten: cascade token/head pruning cuts DRAM access 10x (565 citations)
    HAT: hardware-latency-constrained NAS for transformers (370 citations)
    Gaps
    No work on MLA, vector-quantized KV, or modern KV eviction strategies
    …click to see all
    HS

    Hanshi Sun

    medium hireability

    Research Scientist@ByteDance

    Previously: Teaching Assistant @ Carnegie Mellon University

    Bellevue, US

    48
    KV Cache Optimization93
    Inference-Aware Architecture68
    Weight Compression20
    Weight Streaming Efficiency10
    Strengths
    ShadowKV (ICML 2025 Spotlight) — KV cache offloading for long-context inference
    R-KV — KV cache compression for reasoning model acceleration
    Gaps
    No evidence of weight compression or quantization work on model weights
    …click to see all
    HS

    Han Shu

    medium hireability

    Research Engineer@Huawei

    27
    Weight Compression58
    Inference-Aware Architecture25
    Weight Streaming Efficiency20
    KV Cache Optimization5
    Strengths
    ExCP: LLM checkpoint compression 70× via weight-momentum shrinking + quantization
    TinySAM: post-training quantization for edge device inference
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV cache)
    …click to see all
    HZ

    Hansong Zhou

    medium hireability
    39
    Weight Compression80
    Inference-Aware Architecture45
    Weight Streaming Efficiency25
    KV Cache Optimization5
    Strengths
    microsoft/BitNet top contributor — 19 commits, merge access
    Added full model conversion pipeline for BitNet2b_2501 (1852 lines)
    Gaps
    No KV cache, MLA, or attention optimization work found
    …click to see all
    HX

    Haocheng Xi

    medium hireability

    MLsys Researcher@University of California, Berkeley

    Previously: Research Intern @ Nvidia

    Berkeley, US

    74
    Weight Compression92
    KV Cache Optimization88
    Inference-Aware Architecture72
    Weight Streaming Efficiency42
    Strengths
    QuantSpec: hierarchical quantized KV cache — ICML 2025
    XQuant: KV cache rematerialization for LLM inference (2025)
    Gaps
    No direct work on MLA or KV eviction specifically
    …click to see all
    HB

    Haoli Bai

    medium hireability

    Researcher@Huawei

    Previously: Applied Scientist Intern @ Amazon

    Hong Kong, HK

    62
    KV Cache Optimization92
    Weight Compression88
    Inference-Aware Architecture35
    Weight Streaming Efficiency32
    Strengths
    TreeKV, FreqKV, WeightedKV — 3 KV cache compression papers (2025)
    IntactKV (ACL 2024, 45 cit) — KV-aware quantization for LLMs
    Gaps
    No direct hardware-chip co-design evidence (power/bandwidth/packaging)
    …click to see all
    HY

    Haoran You

    medium hireability

    Research Scientist@Adobe

    Previously: Research Scholar @ SRC Research Scholars Program

    Seattle, US

    82
    Weight Compression90
    Inference-Aware Architecture90
    KV Cache Optimization82
    Weight Streaming Efficiency65
    Strengths
    LaCache (2025) — direct KV caching paper for long-context LLM efficiency
    ShiftAddLLM — multiplication-less reparameterization reduces weight stream compute
    Gaps
    No explicit MLA or vector-quantized KV cache work found
    …click to see all
    HT

    Haotian (Ken) Tang

    medium hireability
    80
    Weight Compression95
    Inference-Aware Architecture82
    KV Cache Optimization78
    Weight Streaming Efficiency65
    Strengths
    AWQ MLSys 2024 Best Paper — canonical activation-aware 4-bit weight quantization
    QServe W4A8KV4: led GPU kernels, 4-bit KV cache (SmoothAttention), 3.5x throughput
    Gaps
    No explicit MLA or KV eviction work — KV focus is quantization, not eviction
    …click to see all
    HW

    Haoxuan Wang

    medium hireability

    Research Intern@Cisco

    Previously: PhD student @ Illinois Institute of Technology

    Chicago, US

    19
    Weight Compression55
    Inference-Aware Architecture10
    KV Cache Optimization5
    Weight Streaming Efficiency5
    Strengths
    PTQ4DiT (NeurIPS 2024, 42 citations) — quantization for diffusion transformers
    QuEST (ICCV 2025) — low-bit diffusion model via selective finetuning
    Gaps
    Zero KV cache work (MLA/KV eviction) — core Neuralace requirement
    …click to see all
    HD

    HDCharles

    medium hireability
    46
    Weight Compression78
    Weight Streaming Efficiency50
    KV Cache Optimization42
    Inference-Aware Architecture12
    Strengths
    Core llm-compressor contributor — GPTQ, AWQ, FP8 quantization in production
    Added KV cache quantization to pytorch/ao (130k ctx, 18.9GB w/ int4+KV quant)
    Gaps
    No MLA, vector-quantized KV, or KV eviction research
    …click to see all
    HD

    HDCharles

    medium hireability
    58
    Weight Compression85
    Weight Streaming Efficiency70
    Inference-Aware Architecture55
    KV Cache Optimization20
    Strengths
    Core llm-compressor contributor — AWQ, GPTQ, FP8, W4A16, W8A16 quantization schemes
    compressed-tensors library contributor (NeuralMagic's compression runtime)
    Gaps
    No published research papers — engineer rather than academic researcher
    …click to see all
    HH

    HDCharles (Charles Hernandez)

    medium hireability
    57
    Weight Compression88
    KV Cache Optimization65
    Inference-Aware Architecture50
    Weight Streaming Efficiency25
    Strengths
    Added KV cache quantization to torchao — direct implementation evidence
    82 commits pytorch/ao + 468 PRs llm-compressor — quantization core contributor
    Gaps
    No evidence of MLA, vector-quantized KV, or KV eviction strategies specifically
    …click to see all
    HC

    Heng Chang

    medium hireability

    Researcher@Tsinghua University

    Previously: Research Intern @ Ant Group

    Beijing, CN

    24
    Weight Compression65
    Inference-Aware Architecture20
    KV Cache Optimization5
    Weight Streaming Efficiency5
    Strengths
    QA-LoRA ICLR 2024 — quantization-aware LoRA, 243 citations
    One QuantLLM for ALL ACL 2025 Oral — unified quantized deployment
    Gaps
    No KV cache optimization work (MLA, eviction strategies absent)
    …click to see all
    HP

    Hongwu Peng

    medium hireability

    Research Scientist/Engineer@Adobe

    Previously: Research Scientist/Engineer Intern @ Adobe

    New York, US

    55
    Weight Compression78
    Inference-Aware Architecture72
    Weight Streaming Efficiency48
    KV Cache Optimization22
    Strengths
    Medusa (429 citations) — speculative decoding reduces per-step HBM weight reads
    AQ2PNN: adaptive quantization for hardware-constrained private inference
    Gaps
    No KV cache-specific papers (MLA, KV eviction, vector-quantized KV cache)
    …click to see all
    HJ

    Hongyi Jin

    medium hireability
    59
    Inference-Aware Architecture75
    KV Cache Optimization65
    Weight Compression55
    Weight Streaming Efficiency40
    Strengths
    KV cache transfer kernel for prefill-decode disaggregation (apache/tvm commit)
    Unified KV cache interface in microserving paper (arXiv:2412.12488)
    Gaps
    No evidence of MLA, vector-quantized KV, or KV eviction strategies specifically
    …click to see all
    HZ

    Howard Zhang

    medium hireability
    28
    KV Cache Optimization42
    Weight Compression40
    Inference-Aware Architecture20
    Weight Streaming Efficiency8
    Strengths
    FP8 QKV quantization in torchao — directly reduces K/V precision at inference
    Per-head fused QKV FP8 kernel with FA3/FA4 backends
    Gaps
    Activation quantization focus, not weight compression or architecture design
    …click to see all
    HC

    Hsin-Pai Cheng

    medium hireability

    Researcher@Qualcomm

    Previously: PhD student @ Duke University

    23
    Inference-Aware Architecture68
    KV Cache Optimization12
    Weight Compression8
    Weight Streaming Efficiency5
    Strengths
    PADRe (ICLR 2025): hardware-friendly Hadamard-product attention on GPU/NPU
    11x–43x GPU/NPU speedup vs standard attention — explicit chip benchmarks
    Gaps
    No KV cache work — no MLA, vector-quantized KV, or KV eviction publications
    …click to see all
    HZ

    Hui-Ling Zhen

    medium hireability

    Senior Staff Research Scientist@Huawei

    Previously: Staff Researcher @ Huawei

    HK

    74
    Weight Compression92
    KV Cache Optimization90
    Inference-Aware Architecture80
    Weight Streaming Efficiency35
    Strengths
    KVTuner: nearly lossless 3.25-bit KV cache, 21% throughput gain
    SVDq: 1.25-bit K-cache, 410x key cache compression ratio
    Gaps
    No explicit weight-streaming topology work (TPS/watt streaming reduction)
    …click to see all
    HZ

    Hui-Ling Zhen

    medium hireability
    81
    KV Cache Optimization93
    Weight Compression92
    Inference-Aware Architecture82
    Weight Streaming Efficiency55
    Strengths
    KVTuner (ICML 2025): sensitivity-aware layer-wise KV cache mixed-precision quantization
    SVDq: 1.25-bit key cache + 410x compression via SVD, directly reducing KV footprint
    Gaps
    No explicit weight-streaming bandwidth papers (closest: MoE sparsity reduces active weights)
    …click to see all
    IF

    Igor Fedorov

    medium hireability

    Staff Research Scientist / Tech Lead / Manager@Meta

    Previously: Senior AI Research Scientist @ Meta

    San Diego, US

    58
    Weight Compression92
    Inference-Aware Architecture82
    Weight Streaming Efficiency50
    KV Cache Optimization8
    Strengths
    SpinQuant (ICLR 2025): LLM quantization with learned rotations
    UDC (NeurIPS 2022): compressible TinyML for NPUs — chip-aware architecture design
    Gaps
    No KV cache work — MLA, vector-quantized KV, or KV eviction absent from profile
    …click to see all
    IB

    Irem Boybat

    medium hireability

    Research Staff Member@IBM

    Previously: Postdoctoral Researcher @ IBM

    Zurich, CH

    48
    Inference-Aware Architecture78
    Weight Compression62
    Weight Streaming Efficiency42
    KV Cache Optimization10
    Strengths
    AnalogNAS (2023): hardware-aware NAS for analog inference constraints
    Efficient scaling LLMs + 3D analog CiM — Nature CS 2025, 25 citations
    Gaps
    No KV cache work (MLA, vector-quantized KV, eviction strategies)
    …click to see all
    JR

    Jeff Rasley

    medium hireability
    30
    Inference-Aware Architecture50
    KV Cache Optimization35
    Weight Streaming Efficiency20
    Weight Compression15
    Strengths
    Shift Parallelism (2025): KV cache invariance as core design property
    DeepSpeed Inference (2022): 7.3x speedup, trillion-param inference at scale
    Gaps
    Systems/runtime layer — not chip-constraint-aware model architecture design
    …click to see all
    JK

    Jerome Ku

    medium hireability
    46
    Weight Compression75
    Inference-Aware Architecture50
    Weight Streaming Efficiency45
    KV Cache Optimization15
    Strengths
    HQQ fused GEMM merged to pytorch/ao — core INT4 weight quantization contributor
    tinygemm INT4 unpacker — packed-weight inference, aware of memory-bound decode
    Gaps
    No KV cache-specific work: no MLA, vector-quantized KV, or KV eviction contributions found
    …click to see all
    JZ

    Jerry Zhang

    medium hireability
    60
    Weight Compression90
    Weight Streaming Efficiency75
    Inference-Aware Architecture60
    KV Cache Optimization15
    Strengths
    torchao lead: 366 commits, 430 PRs — production weight quantization at Meta scale
    NVFP4 + MXFP8 microscaling formats — sub-4-bit, block-scale weight compression
    Gaps
    No KV cache work found — MLA, vector-quantized KV, or eviction not in his portfolio
    …click to see all
    JC

    Jesse Cai

    medium hireability

    Machine Learning Engineer@Meta

    Previously: Senior Research Engineer @ Cultivate

    San Francisco, US

    54
    Weight Compression88
    Weight Streaming Efficiency65
    Inference-Aware Architecture58
    KV Cache Optimization5
    Strengths
    94 commits to pytorch/ao — production INT4/FP8/weight-only quantization
    TorchAO co-author (ICLR 2025) — PyTorch-native model optimization
    Gaps
    No KV cache optimization work (MLA, eviction, vector-quantized KV)
    …click to see all
    J(

    Jiahan Chang (Cyrus)

    medium hireability
    63
    KV Cache Optimization75
    Inference-Aware Architecture72
    Weight Compression65
    Weight Streaming Efficiency40
    Strengths
    concat_mla_k CUDA kernel in FlashInfer — direct MLA KV cache optimization
    Integrated MLA kernel into vLLM for DeepSeek R1 (production scale)
    Gaps
    Location unconfirmed — NVIDIA global, unknown if in USA/Europe/China/India
    …click to see all
    JT

    Jiaming Tang

    medium hireability

    Ph.D. student@MIT

    Previously: Undergraduate researcher @ SJTU EPCC Lab

    Boston, US

    68
    Weight Compression92
    KV Cache Optimization82
    Inference-Aware Architecture68
    Weight Streaming Efficiency28
    Strengths
    Quest (ICML 2024) — query-aware KV eviction, directly on-target
    AWQ (MLSys 2024 Best Paper, 1503 cit) — activation-aware weight quantization
    Gaps
    No MLA or vector-quantized KV work — Quest is eviction-based only
    …click to see all
    JX

    Jing Xiong

    medium hireability

    PhD student@University of Hong Kong

    Previously: MS student @ Sun Yat-Sen University

    Shenzhen, CN

    33
    KV Cache Optimization85
    Inference-Aware Architecture28
    Weight Compression15
    Weight Streaming Efficiency5
    Strengths
    D2O: KV eviction paper, 31 citations (EMNLP 2024)
    ParallelComp: parallel KV compressor, ICML 2025
    Gaps
    No weight compression or weight streaming work
    …click to see all
    JI

    jiqing-feng

    medium hireability
    22
    Weight Compression55
    Inference-Aware Architecture22
    KV Cache Optimization5
    Weight Streaming Efficiency5
    Strengths
    FP8 kernel acceleration for compressed-tensors on XPU and CUDA (Apr 2026)
    int4 weight-only quantization on Intel XPU via TorchAO — hardware-aware impl
    Gaps
    No visible KV cache, MLA, or KV eviction work
    …click to see all
    JF

    Josh Fromm

    medium hireability
    58
    Weight Compression92
    Inference-Aware Architecture78
    Weight Streaming Efficiency55
    KV Cache Optimization5
    Strengths
    ScaleBITS (2026): hardware-aligned mixed-precision quantization for LLMs
    Automated Backend-Aware PTQ (2021): chip-specific quantization targeting
    Gaps
    No evidence of KV cache work (MLA, eviction, vector-quantized KV)
    …click to see all
    JS

    Junru Shao

    medium hireability
    40
    Inference-Aware Architecture65
    Weight Compression45
    KV Cache Optimization30
    Weight Streaming Efficiency20
    Strengths
    MLC-LLM quantization pipeline (q4f16_1/q3f16) — deployment-layer weight compression
    FlashInfer integration into TVM — KV cache-aware attention at compiler level
    Gaps
    Compiler/runtime engineer, not model architecture researcher — designs deployment stacks, not new architectures
    …click to see all
    JG

    Junxian Guo

    medium hireability

    PhD student@Shanghai Jiao Tong University

    Previously: Undergrad student @ Shanghai Jiao Tong University

    Shanghai, CN

    72
    KV Cache Optimization85
    Weight Compression82
    Inference-Aware Architecture80
    Weight Streaming Efficiency40
    Strengths
    DuoAttention (99 citations): KV retrieval/streaming heads for long-context inference
    VQ-LLM: vector quantization augmented LLM inference — VQ-KV directly
    Gaps
    No direct work on MLA or KV eviction specifically — DuoAttention is adjacent
    …click to see all
    KN

    Ka-Hyun Nam

    medium hireability
    49
    Inference-Aware Architecture72
    Weight Compression65
    KV Cache Optimization35
    Weight Streaming Efficiency22
    Strengths
    38 PRs on flashinfer-ai/flashinfer — active core contributor
    MXFP8 BlockScaledMmaOp w/ CUTLASS DSL — microscaling quantization ops
    Gaps
    No direct KV cache algorithm work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    KZ

    Kan Zhu

    medium hireability

    PhD Student@University of Washington

    Previously: Undergrad student @ University of Michigan - Ann Arbor

    Seattle, US

    63
    KV Cache Optimization88
    Weight Compression85
    Inference-Aware Architecture60
    Weight Streaming Efficiency20
    Strengths
    Quest (ICML 2024): KV eviction via query-aware sparsity — core fit
    Atom (MLSys 2024, 245 citations): low-bit quantization, direct weight compression
    Gaps
    No explicit weight streaming / bandwidth-reduction work
    …click to see all
    KB

    Kartikeya Bhardwaj

    medium hireability

    Researcher@Qualcomm

    Previously: Senior Machine Learning Engineer @ Arm

    44
    Inference-Aware Architecture75
    Weight Compression72
    Weight Streaming Efficiency22
    KV Cache Optimization8
    Strengths
    "Oh! We Freeze": 4-bit weight quantization KD for LLMs on edge (ICLR 2024)
    ZiCo: hardware-aware NAS, 116 citations (ICLR 2023)
    Gaps
    No KV cache work: no MLA, vector-quantized KV, or KV eviction papers
    …click to see all
    KS

    Keshav Santhanam

    medium hireability
    19
    Inference-Aware Architecture45
    KV Cache Optimization15
    Weight Compression10
    Weight Streaming Efficiency5
    Strengths
    3470 commits to NVIDIA/Megatron-LM — Mamba EP + MoE inference engineer
    "Cheaply Estimating Inference Efficiency Metrics" (NeurIPS 2023)
    Gaps
    No KV cache optimization work (MLA, vector-quantized KV, eviction strategies)
    …click to see all
    KP

    Kimish Patel

    medium hireability
    55
    KV Cache Optimization75
    Inference-Aware Architecture70
    Weight Compression55
    Weight Streaming Efficiency20
    Strengths
    22-PR stack Apr 2026 on transposed KV cache (1.64x decode speedup at pos=1024)
    Non-flash SDPA path added for better decode SeqLen=1 performance
    Gaps
    KV cache work targets layout efficiency, not size reduction (no MLA or eviction)
    …click to see all
    LC

    Lequn Chen

    medium hireability

    Research Engineer@Perplexity AI

    Previously: PhD student @ University of Washington

    71
    Weight Compression82
    Inference-Aware Architecture75
    KV Cache Optimization72
    Weight Streaming Efficiency55
    Strengths
    Atom (MLSys 2024, 245 citations): low-bit quantization for LLM serving
    41 commits to flashinfer-ai/flashinfer — attention/KV kernel library
    Gaps
    No explicit MLA or vector-quantized KV cache work found
    …click to see all
    LL

    Liangzhen Lai

    medium hireability

    Researcher@Meta

    73
    Inference-Aware Architecture90
    Weight Streaming Efficiency78
    Weight Compression68
    KV Cache Optimization55
    Strengths
    Bit Fusion (747 cit.) — seminal inference chip co-design
    Folding Attention (2024) — attention memory/power for on-device streaming
    Gaps
    No explicit MLA/GQA or KV eviction work for large-scale LLM inference
    …click to see all
    LZ

    Ligeng Zhu

    medium hireability

    Ph.D student@NVIDIA

    Previously: Undergrad student @ Simon Fraser University

    Boston, US

    58
    Inference-Aware Architecture82
    Weight Compression62
    KV Cache Optimization52
    Weight Streaming Efficiency35
    Strengths
    HAT: Hardware-Aware Transformers — 367 citations, co-designs architecture with hardware
    ProxylessNAS: direct search on target hardware for inference efficiency
    Gaps
    No direct MLA, vector-quantized KV, or KV eviction work — KV cache work is adjacent
    …click to see all
    LJ

    Lisa Jin

    medium hireability
    43
    Weight Compression82
    Weight Streaming Efficiency55
    Inference-Aware Architecture28
    KV Cache Optimization5
    Strengths
    PARQ first author — principled QAT for 2-4 bit extreme weight compression
    ParetoQ co-author — scaling laws for extremely low-bit LLM quantization
    Gaps
    No KV cache optimization work (MLA, eviction, quantized KV) found
    …click to see all
    LL

    Lujun Li

    medium hireability

    Researcher@Hong Kong University of Science and Technology

    Previously: Researcher @ Hong Kong Generative AI Research and Development Center

    Hong Kong, HK

    38
    Weight Compression82
    Weight Streaming Efficiency45
    Inference-Aware Architecture20
    KV Cache Optimization3
    Strengths
    STBLLM: sub-1-bit structured binary LLMs with custom CUDA kernel (2024)
    EMQ: training-free mixed-precision quantization — ICCV 2023, 54 citations
    Gaps
    No KV cache work — MLA, KV eviction, vector-quantized KV all absent
    …click to see all
    LC

    Lukas Cavigelli

    medium hireability

    Researcher (Expert/Architect)@Huawei

    Previously: Researcher (Principal Engineer) @ Huawei

    Zurich, CH

    84
    Weight Compression92
    Inference-Aware Architecture90
    KV Cache Optimization82
    Weight Streaming Efficiency70
    Strengths
    TyphoonMLA (2025): MLA kernel — exact match to KV cache search axis
    "Don't be so Stief!" (2026): KV cache low-rank compression on Stiefel manifold
    Gaps
    No explicit topological regularization or Microscaling (MX-format) work
    …click to see all
    LZ

    Luoming Zhang

    medium hireability

    Algorithm Engineer@Qualcomm

    Previously: PhD student @ Zhejiang University

    China

    54
    Weight Compression80
    KV Cache Optimization70
    Weight Streaming Efficiency35
    Inference-Aware Architecture30
    Strengths
    ZipCache: 5× KV cache compression with salient token quantization (arXiv:2405.14256)
    Dual Grained Quantization: A8W4 LLM quant, 3.24× speed, 1.12× memory reduction
    Gaps
    No work on MLA (multi-head latent attention) or KV eviction strategies
    …click to see all
    MG

    Manuel Le Gallo

    medium hireability

    Staff Research Scientist@IBM

    Previously: PhD student @ ETH Zurich

    Zurich, CH

    63
    Inference-Aware Architecture88
    Weight Streaming Efficiency80
    Weight Compression72
    KV Cache Optimization10
    Strengths
    64-core PCM inference chip — eliminates weight streaming via in-memory compute
    9.76 TOPS/W on 14nm chip — direct TPS/watt efficiency evidence
    Gaps
    No KV cache work — no MLA, vector-quantized KV, or KV eviction papers found
    …click to see all
    MF

    Marco Federici

    medium hireability

    Principal@Tracey

    Previously: Executive Manager Demand Forecasting @ nbn

    Sydney, AU

    59
    Weight Streaming Efficiency72
    Inference-Aware Architecture65
    Weight Compression58
    KV Cache Optimization40
    Strengths
    DIP (MLSys 2025): 46% memory reduction, 40% throughput on Phi-3-Medium
    Cache-Aware Masking: targets DRAM bandwidth during LLM decode
    Gaps
    KV cache work is activation-cache hit rate, not MLA/KV eviction specifically
    …click to see all
    MS

    Marc Sun

    medium hireability
    27
    Weight Compression65
    Inference-Aware Architecture22
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    FP8 per-tensor & block quant in transformers — production implementation
    BNB 4-bit/8-bit, GPTQ, TorchAO, compressed-tensors contributor
    Gaps
    No KV cache compression work (MLA, vector-quantize, KV eviction) found
    …click to see all
    MN

    Markus Nagel

    medium hireability

    Research Scientist (Senior Staff Engineer)@Qualcomm

    Previously: Research Scientist (Staff Engineer) @ Qualcomm

    Amsterdam, NL

    74
    Weight Compression95
    Inference-Aware Architecture75
    Weight Streaming Efficiency70
    KV Cache Optimization55
    Strengths
    GPTVQ (2024): vector quantization for LLMs — weight compression at scale
    ADAROUND (2020): foundational adaptive rounding PTQ, widely cited
    Gaps
    No direct MLA / KV eviction work found — cache work is peripheral
    …click to see all
    MB

    Mart Van Baalen

    medium hireability

    Senior Staff Machine Learning Research Engineer/Manager@Qualcomm

    Previously: Staff Machine Learning Research Engineer/Manager @ Qualcomm

    Amsterdam, NL

    62
    Weight Compression92
    Inference-Aware Architecture65
    Weight Streaming Efficiency60
    KV Cache Optimization30
    Strengths
    GPTVQ (2024): state-of-the-art vector quantization for LLM weight compression
    Leech Lattice VQ (2025): outperforms QuIP#, QTIP, PVQ — latest SOTA
    Gaps
    No direct KV cache work (MLA, KV eviction) — cache papers are MoE routing, not KV
    …click to see all
    MD

    Matthew Douglas

    medium hireability
    42
    Weight Compression82
    Weight Streaming Efficiency45
    Inference-Aware Architecture35
    KV Cache Optimization5
    Strengths
    Core bitsandbytes maintainer — 183 commits, 145+ merged PRs
    NF4/Int4/Int8 CUDA/ROCm kernel implementation in production
    Gaps
    No KV cache work (MLA, vector-quantized KV, KV eviction) found
    …click to see all
    MA

    Mehmet Aktukmak

    medium hireability
    36
    Weight Compression52
    Inference-Aware Architecture52
    KV Cache Optimization20
    Weight Streaming Efficiency18
    Strengths
    vLLM-Gaudi: vLLM plugin for Intel Gaudi AI inference chips
    Layer-by-layer SmoothQuant int8 — memory-efficient weight quantization
    Gaps
    No KV cache innovation (MLA, vector-quantized KV, eviction) — only indirect vLLM usage
    …click to see all
    MW

    Michael Wyatt

    medium hireability
    38
    KV Cache Optimization55
    Inference-Aware Architecture45
    Weight Compression35
    Weight Streaming Efficiency15
    Strengths
    #1 contributor to DeepSpeed-MII — blocked KV caching in production
    FP6 quantization config in MII — weight quantization for inference
    Gaps
    Systems engineer, not researcher — no novel KV/compression algorithm contributions
    …click to see all
    MM

    Michele Magno

    medium hireability

    PD Dr.@ETH Zurich

    Previously: Researcher @ ETH Zurich

    Zurich, CH

    50
    Weight Compression85
    Inference-Aware Architecture72
    Weight Streaming Efficiency38
    KV Cache Optimization5
    Strengths
    "Empirical study of Llama3 quantization" (49 cites, 2024) — top LLM weight compression
    BiLLM: Pushing the Limit of Post-Training Quantization — 1-bit extreme compression
    Gaps
    No KV cache optimization work (MLA, eviction, vector quantization) — major gap
    …click to see all
    ML

    Mingzhi Liu

    medium hireability
    70
    Inference-Aware Architecture85
    KV Cache Optimization82
    Weight Streaming Efficiency72
    Weight Compression40
    Strengths
    MLA fp8 kernel fix on gfx950 — direct KV decode path fix (ROCm/aiter #2907)
    mori RDMA KV transfer engine merged into vLLM-omni (PUSH/PULL, 3369 lines)
    Gaps
    AMD-specific (ROCm/MI series) — no custom ASIC inference chip experience
    …click to see all
    ME

    Mostafa Elhoushi

    medium hireability

    Research Scientist@Cerebras Systems

    Previously: Research Engineer, FAIR @ Meta

    Toronto, CA

    80
    Weight Compression92
    Inference-Aware Architecture80
    Weight Streaming Efficiency75
    KV Cache Optimization72
    Strengths
    CHAI (2024): 21.4% KV cache reduction via clustered attention head sharing
    any4 (2025): learned 4-bit LLM weight quantization + tinygemm inference library
    Gaps
    Location Toronto, Canada — not in search regions (USA, Europe, China, India)
    …click to see all
    NL

    Nandor Licker

    medium hireability
    30
    Inference-Aware Architecture50
    KV Cache Optimization45
    Weight Streaming Efficiency15
    Weight Compression8
    Strengths
    23 merged PRs in flashinfer-ai/flashinfer — KV cache/attention kernels
    #2 contributor to pplx-kernels — Perplexity production inference GPU kernels
    Gaps
    Kernel implementer, not architecture researcher — no MLA/KV eviction design work
    …click to see all
    NC

    Nathan Chen

    medium hireability
    33
    Inference-Aware Architecture60
    KV Cache Optimization55
    Weight Streaming Efficiency12
    Weight Compression5
    Strengths
    flash-linear-attention contributor — KV-cache-free linear attention (O(1) memory)
    Co-ALIBI: hardware-aligned Triton kernels, 160 TFLOPS on H100
    Gaps
    No MLA, vector-quantized KV, or explicit KV eviction policy work
    …click to see all
    NP

    Nilesh Prasad Pandey

    medium hireability

    PhD student@University of California, San Diego

    Previously: Applied Scientist Intern @ Amazon

    San Diego, US

    35
    Weight Compression78
    Inference-Aware Architecture40
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    Mixed precision PTQ (2023, 24 cit.) — hardware-aware weight compression
    DPQ-HD (2025) — ultra-low-power compression for inference hardware
    Gaps
    No KV cache work — MLA, vector-quantized KV, KV eviction absent
    …click to see all
    NV

    nv-yunzheq

    medium hireability
    63
    Weight Compression78
    Inference-Aware Architecture78
    Weight Streaming Efficiency50
    KV Cache Optimization45
    Strengths
    NVFP4 MoE kernels (blockscaled FP4) ported from TRT-LLM — production weight compression
    Blackwell SM100/GB200/GB300 architecture-specific CUTE DSL kernels
    Gaps
    No KV cache compression research (MLA work is inference serving, not architecture design)
    …click to see all
    OS

    Oliver Sieberling

    medium hireability

    PhD Student@MIT

    Previously: Teaching Assistant @ ETH Zürich

    Boston, US

    47
    Weight Compression82
    Inference-Aware Architecture65
    Weight Streaming Efficiency35
    KV Cache Optimization5
    Strengths
    EvoPress (ICML 2025): dynamic quantization + sparsity + pruning on Llama/Mistral
    Quartet: FP4 training/inference with Blackwell-optimized CUDA kernels
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    PW

    Paul N. Whatmough

    medium hireability

    Senior Director, AI Research@Qualcomm

    Previously: Director, AI Research @ Qualcomm

    Boston, US

    85
    Inference-Aware Architecture95
    Weight Compression93
    KV Cache Optimization78
    Weight Streaming Efficiency72
    Strengths
    GPTVQ (2024, 48 cites): vector quantization cuts LLM DRAM footprint + bandwidth
    KaVa (2025): KV-Cache compressed distillation — on-point for KV eviction/reduction axis
    Gaps
    No explicit MLA (Multi-head Latent Attention) architecture work
    …click to see all
    PD

    Peijie Dong

    medium hireability

    PhD Candidate in Computer Science@The Hong Kong University of Science and Technology (Guangzhou)

    Previously: Intern @ Alibaba

    Guangzhou, CN

    74
    Weight Compression90
    KV Cache Optimization82
    Weight Streaming Efficiency70
    Inference-Aware Architecture55
    Strengths
    ChunkKV (NeurIPS 2025) — semantic KV cache compression, long-context LLM
    SpInfer (EuroSys 2025 Best Paper) — low-level GPU sparsity for LLM inference
    Gaps
    No chip co-design — GPU inference focus, not custom AI inference chips
    …click to see all
    PZ

    Perkz Zheng

    medium hireability
    63
    KV Cache Optimization88
    Inference-Aware Architecture82
    Weight Compression45
    Weight Streaming Efficiency35
    Strengths
    FlashMLA KV-cache path for DeepSeek V4 prefill — direct MLA implementation (vLLM PR #41836)
    Sparse MLA decode kernel selection, SM100/SM103 Blackwell hardware-aware (FlashInfer #2836)
    Gaps
    No published research papers — pure implementation engineer, not researcher
    …click to see all
    PY

    Peter Yeh

    medium hireability
    46
    Weight Compression72
    Inference-Aware Architecture55
    Weight Streaming Efficiency40
    KV Cache Optimization15
    Strengths
    30 PRs to pytorch/ao — INT4, FP8, MX microscaling quantization on ROCm
    SpinQuant Hadamard matrices — rotation-based weight compression for LLMs
    Gaps
    No direct KV cache work (no MLA, KV eviction, or vector-quantized KV cache found)
    …click to see all
    PM

    Praneeth Medepalli

    medium hireability

    Member of Technical Staff@Zyphra

    Previously: Machine Learning Engineer @ SiMa.ai

    San Francisco, US

    26
    Weight Compression65
    Inference-Aware Architecture30
    KV Cache Optimization5
    Weight Streaming Efficiency5
    Strengths
    MPO-based low-rank factorization+pruning paper — direct weight compression
    Research expertise: model compression, quantization, signal processing
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV) found
    …click to see all
    RT

    Rahul Tuli

    medium hireability
    38
    Weight Compression72
    Weight Streaming Efficiency45
    KV Cache Optimization18
    Inference-Aware Architecture18
    Strengths
    37 PRs to compressed-tensors: AWQ, FP8, sparse 2:4 compression
    63 PRs to llm-compressor: SmoothQuant MoE fix, quantization pipeline
    Gaps
    No published research — engineering contributor, not novel algorithm author
    …click to see all
    RT

    Rajan Troll

    medium hireability

    Member of Technical Staff@OpenAI

    Previously: Chief Technology Officer @ inBalance

    Seattle, US

    49
    Weight Compression62
    Inference-Aware Architecture62
    KV Cache Optimization40
    Weight Streaming Efficiency30
    Strengths
    DB expertise: quantization + efficient DL hardware — core signal
    fast self-attention + long context memory: adjacent KV cache expertise
    Gaps
    No public papers on quantization, KV cache, or hardware-aware inference
    …click to see all
    RH

    Ramyad Hadidi

    medium hireability

    Senior Staff -- ML Computer Architect@d-Matrix

    Previously: Senior Scientist @ Rain AI

    San Francisco, US

    83
    Inference-Aware Architecture92
    Weight Streaming Efficiency82
    KV Cache Optimization80
    Weight Compression78
    Strengths
    Mustafar (NeurIPS'25): KV cache pruning via unstructured sparsity — exact match
    Endor (2024): reduces weight transfer bandwidth for offloaded LLM inference
    Gaps
    KV work is sparsity-based pruning — no published work on MLA or vector-quantized KV
    …click to see all
    RA

    Randy

    medium hireability
    55
    Weight Compression80
    Weight Streaming Efficiency75
    Inference-Aware Architecture60
    KV Cache Optimization5
    Strengths
    2:4 structured sparsity halves weight stream bandwidth — directly cuts TPS power
    FP8 + sparse CUTLASS tensor (Sparse2x4CUTLASSFloat8Tensor) in production at Meta
    Gaps
    No published research — pure practitioner, no papers
    …click to see all
    RZ

    Ritchie Zhao

    medium hireability

    Senior AI and Machine Learning Engineer@NVIDIA

    Previously: Senior Data Science Manager @ Microsoft

    Redmond, US

    83
    Weight Compression96
    KV Cache Optimization88
    Inference-Aware Architecture78
    Weight Streaming Efficiency70
    Strengths
    Shared Microexponents (ISCA 2023) — co-authored the MX spec cited in JD
    Microscaling Data Formats for Deep Learning (2023) — full MX format spec
    Gaps
    No published work specifically on decode-time weight streaming bandwidth reduction
    …click to see all
    RL

    Royson Lee

    medium hireability

    Research Scientist@Samsung

    Previously: Research Engineer @ Samsung

    Cambridge, GB

    56
    Weight Streaming Efficiency78
    Inference-Aware Architecture75
    Weight Compression65
    KV Cache Optimization5
    Strengths
    PMPD (ICLR 2025): precision-lowering during decode cuts weight stream bits
    3.8–8× throughput gain on LLM-optimized NPU hardware
    Gaps
    No KV cache work — no MLA, KV eviction, or vector-quantized KV evidence
    …click to see all
    RL

    Ruihang Lai

    medium hireability

    Ph.D. student@Carnegie Mellon University

    Previously: Research Intern @ OctoAI

    Pittsburgh, US

    51
    KV Cache Optimization82
    Inference-Aware Architecture62
    Weight Streaming Efficiency38
    Weight Compression22
    Strengths
    FlashInfer: KV-cache block-sparse attention engine, MLSys 2025 outstanding paper
    Cascade Inference: memory-bandwidth-efficient batch decoding paper
    Gaps
    No direct work on MLA, vector-quantized KV cache, or KV eviction strategies
    …click to see all
    RG

    Ruihao Gong

    medium hireability

    Beihang University

    Previously: Principal Researcher @ SenseTime

    58
    Weight Compression95
    Inference-Aware Architecture60
    Weight Streaming Efficiency55
    KV Cache Optimization20
    Strengths
    BRECQ (610 citations) — landmark post-training quantization paper
    DSQ (636 citations) — differentiable quantization, full-precision to low-bit bridging
    Gaps
    No published work on MLA, KV eviction, or vector-quantized KV cache
    …click to see all
    RZ

    Rui-Jie Zhu

    medium hireability

    Research Intern@ByteDance

    Previously: Research Intern @ EMD Electronics

    San Francisco, US

    56
    KV Cache Optimization70
    Weight Compression55
    Weight Streaming Efficiency50
    Inference-Aware Architecture50
    Strengths
    RWKV co-author (958 cit.) — KV-cache elimination via linear recurrence
    MatMul-free LM — ternary weights, no matmuls; extreme weight compression
    Gaps
    No explicit work on MLA, vector-quantized KV, or KV eviction strategies
    …click to see all
    RL

    Rui Li

    medium hireability

    Researcher@Samsung AI

    Previously: PhD Student @ University of Edinburgh

    Cambridge, GB

    33
    Inference-Aware Architecture60
    Weight Compression42
    Weight Streaming Efficiency20
    KV Cache Optimization10
    Strengths
    Hardware-Aware Parallel Prompt Decoding — GPU-adaptive, 2.49× speedup (EMNLP 2025)
    Dynamic sparse tree adapts decoding to hardware constraints
    Gaps
    No KV cache-specific work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    RX

    Runxin Xu

    medium hireability

    Researcher@DeepSeek

    Previously: Quant researcher @ Metabit Trading

    Barcelona, ES

    71
    KV Cache Optimization88
    Inference-Aware Architecture80
    Weight Compression62
    Weight Streaming Efficiency55
    Strengths
    DeepSeek-V2: introduced MLA — 5.75x KV cache reduction, core to the query
    DeepSeek-V3 co-author — continued inference-efficient architecture design
    Gaps
    No published work on KV eviction or vector-quantized KV cache
    …click to see all
    RY

    Ruokai Yin

    medium hireability

    PhD student@Yale University

    Previously: Research Intern @ Microsoft

    New Haven, US

    38
    Weight Compression75
    Inference-Aware Architecture50
    Weight Streaming Efficiency20
    KV Cache Optimization5
    Strengths
    GPTAQ (ICML 2025) — quantizes 405B LLMs, direct weight compression evidence
    DuoGPT (NeurIPS 2025) — dual sparsity pruning, training-free LLM compression
    Gaps
    No KV cache optimization work (MLA, vector quant, eviction)
    …click to see all
    RT

    Rush Tabesh

    medium hireability

    Ph.D. Student@Institute of Science and Technology Austria

    Previously: Scientific Researcher @ Institute of Science and Technology Austria

    Vienna, AT

    36
    Weight Compression88
    Weight Streaming Efficiency30
    Inference-Aware Architecture22
    KV Cache Optimization5
    Strengths
    QuEST: 1-bit weights+activations LLM training — extreme compression
    Quartet: native FP4 LLM training proven optimal (2025)
    Gaps
    No KV cache optimization work (MLA, KV eviction) found
    …click to see all
    SJ

    Sam Ade Jacobs

    medium hireability

    Computer Scientist@Microsoft

    37
    KV Cache Optimization72
    Inference-Aware Architecture35
    Weight Compression25
    Weight Streaming Efficiency15
    Strengths
    MAC-Attention (2026): 99% KV access reduction, 60% decode latency cut
    KV reuse scheme — constant compute/bandwidth on cache hits regardless of context length
    Gaps
    No inference chip co-design — all systems work is training-focused
    …click to see all
    SR

    Samyam Rajbhandari

    medium hireability

    AI Systems Lead | Principal Architect@Snowflake

    Previously: Principal Architect @ Microsoft

    Redmond, US

    72
    KV Cache Optimization88
    Weight Compression72
    Weight Streaming Efficiency65
    Inference-Aware Architecture62
    Strengths
    SwiftKV (2025): 62.5% KV cache reduction via AcrossKV knowledge-preserving layer merging
    ZeRO (1983 citations) — partitions weight memory across devices, foundational work
    Gaps
    ZeRO family primarily training-focused; decode-time weight streaming for low-power chips not directly addressed
    …click to see all
    SD

    Saurabh Dash

    medium hireability

    Member of Technical Staff@Cohere

    Previously: Machine Learning Researcher @ Apple

    Toronto, CA

    39
    Weight Compression70
    Inference-Aware Architecture55
    Weight Streaming Efficiency20
    KV Cache Optimization10
    Strengths
    "Intriguing Properties of Quantization at Scale" — NeurIPS 2023 LLM weight quant
    Hessian-driven mixed-precision for ReRAM PIM arrays — hardware co-design
    Gaps
    No KV cache work (MLA, vector-quantized KV, KV eviction) found
    …click to see all
    SS

    Sayeh Sharify

    medium hireability

    Principal Machine Learning Research Scientist@d-Matrix

    Previously: Co-Founder @ Tartan AI

    San Francisco, US

    74
    Weight Compression92
    Inference-Aware Architecture88
    KV Cache Optimization65
    Weight Streaming Efficiency50
    Strengths
    ResQ (2025): 4-bit KV cache + weight + activation quantization with 3× speedup
    Microscaling PTQ (2024): chip-native MX format quantization for inference hardware
    Gaps
    No published work on MLA or vector-quantized KV cache specifically
    …click to see all
    SR

    Scott Roy

    medium hireability

    Data Scientist@Microsoft

    Previously: Researcher @ Meta

    57
    Weight Compression92
    Inference-Aware Architecture75
    Weight Streaming Efficiency55
    KV Cache Optimization5
    Strengths
    ParetoQ: state-of-the-art 1–4 bit LLM quantization (Meta AI, Feb 2025)
    155 PRs to pytorch/ao: HQQ, PARQ, LUT 1–4 bit packing, Int4 weight-only configs
    Gaps
    No KV cache work found (MLA, KV eviction, vector-quantized KV)
    …click to see all
    SK

    Se Jung Kwon

    medium hireability

    Director@NAVER

    Previously: Leader @ NAVER

    Seoul, KR

    71
    Weight Compression92
    KV Cache Optimization80
    Inference-Aware Architecture75
    Weight Streaming Efficiency35
    Strengths
    "No Token Left Behind" (2024, 76 cit.) — KV cache compression via mixed-precision quantization
    LUT-GEMM (2024, 184 cit.) — lookup-table-based quantized matmul for LLM inference hardware
    Gaps
    Location: Seoul, KR — not in requested regions (USA/Europe/China/India)
    …click to see all
    SB

    Shane Bergsma

    medium hireability

    Principal Researcher@Cerebras

    Previously: Researcher @ Huawei

    25
    Weight Compression45
    Inference-Aware Architecture35
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    Principal Researcher, Cerebras Systems — wafer-scale inference chip R&D
    Sparsity 2024: 30-40% inference FLOP reduction via unstructured sparsity
    Gaps
    No KV cache work — no MLA, vector-quantized KV, or KV eviction papers
    …click to see all
    SL

    Shengyu Liu

    medium hireability
    46
    KV Cache Optimization90
    Inference-Aware Architecture65
    Weight Compression20
    Weight Streaming Efficiency10
    Strengths
    13 commits to deepseek-ai/FlashMLA — second-highest contributor
    Co-authored FlashMLA kernel blog (Apr + Oct 2025) — MLA KV cache compression
    Gaps
    No weight compression or microscaling research
    …click to see all
    SJ

    Shiqi Jiang

    medium hireability

    Senior Researcher@Microsoft

    Previously: Senior Research Engineer @ Microsoft

    Beijing, CN

    42
    Inference-Aware Architecture75
    Weight Streaming Efficiency68
    Weight Compression18
    KV Cache Optimization5
    Strengths
    Active-Weight Swapping (DRAM/Flash) paper — direct weight streaming bandwidth work
    NPU inference paper (EuroSys '26) — mobile chip constraints addressed
    Gaps
    No KV cache work — MLA, eviction, or KV quantization absent from profile
    …click to see all
    SY

    Shixing Yu

    medium hireability

    PhD student@Cornell University

    Previously: Research Intern @ Meta

    San Francisco, US

    31
    Weight Compression72
    Inference-Aware Architecture35
    Weight Streaming Efficiency10
    KV Cache Optimization5
    Strengths
    UVC (ICLR 2022, 157 cites) — unified ViT pruning + low-rank + quantization
    HAP (WACV 2022, 77 cites) — Hessian-aware pruning, weight structure
    Gaps
    No KV cache work (MLA, KV eviction) — a key Neuralace axis
    …click to see all
    S(

    Shiyang Weng (LevelDownRefine)

    medium hireability
    12
    Weight Compression28
    Inference-Aware Architecture15
    KV Cache Optimization3
    Weight Streaming Efficiency3
    Strengths
    13 merged PRs in pytorch/ao — int8/fp8 quantization contributor
    x86 inductor fusion passes for quantized ops (DLRMv2)
    Gaps
    No KV cache, MLA, or KV eviction work found
    …click to see all
    SM

    Shuming Ma

    medium hireability

    Senior Researcher@Microsoft

    Previously: Researcher @ Microsoft

    84
    Weight Compression96
    KV Cache Optimization90
    Weight Streaming Efficiency75
    Inference-Aware Architecture75
    Strengths
    BitNet b1.58 — seminal 1-bit/1.58-bit LLM compression (h=44 primary author)
    YOCO (2024) — decoder-decoder arch halving KV cache memory at inference
    Gaps
    No explicit hardware chip co-design work (bitnet.cpp targets CPU, not custom ASIC)
    …click to see all
    SW

    Shu Wang

    medium hireability
    59
    Weight Compression72
    Weight Streaming Efficiency65
    Inference-Aware Architecture62
    KV Cache Optimization35
    Strengths
    NVFP4 masked quantization in FlashInfer — 4-bit weight compression at kernel level
    W4A16 AWQ + ModelOpt FP8 support in SGLang — direct weight compression
    Gaps
    Inference systems implementer, not a model architecture designer
    …click to see all
    SC

    Sijia (Jackson) Chen

    medium hireability
    43
    KV Cache Optimization75
    Inference-Aware Architecture55
    Weight Compression25
    Weight Streaming Efficiency15
    Strengths
    FlashMLA FP16 kernel (deepseek-ai/FlashMLA) — direct MLA KV cache work
    FP8 KV cache quantization in FBGEMM — per-head, decode latency reduction
    Gaps
    No weight compression or topological regularization work
    …click to see all
    SZ

    Si Zheng

    medium hireability

    Machine Learning System Researcher Scientist@ByteDance Seed

    Previously: Research Intern @ DeepSeek AI

    Beijing, CN

    71
    KV Cache Optimization92
    Inference-Aware Architecture80
    Weight Compression65
    Weight Streaming Efficiency48
    Strengths
    ShadowKV (ICML 2025) — KV cache via low-rank keys, up to 6× batch size gain
    ArkVale (NeurIPS 2024) — KV eviction with recallable mechanism
    Gaps
    Weight streaming bandwidth reduction not a primary focus
    …click to see all
    SL

    Stefanos Laskaridis

    medium hireability

    Applied Scientist@Amazon

    Previously: Visiting Researcher @ University of Cambridge

    London, GB

    39
    Weight Compression74
    Inference-Aware Architecture48
    Weight Streaming Efficiency22
    KV Cache Optimization10
    Strengths
    FlexRank ICML'26 spotlight — nested low-rank for adaptive deployment
    Maestro — trainable decomposition uncovering low-rank weight structures
    Gaps
    No published KV cache work (no MLA, KV eviction, or vector-quantized KV papers)
    …click to see all
    SS

    Strahinja Stamenkovic

    medium hireability
    52
    Weight Compression80
    Inference-Aware Architecture72
    Weight Streaming Efficiency35
    KV Cache Optimization20
    Strengths
    bitsandbytes ROCm: blocksize-32 4-bit GEMV kernels from scratch
    kgemm_4bit_inference_naive ROCm optimization: 5.13x vLLM throughput
    Gaps
    No KV cache work (MLA, GQA, KV eviction) in any visible contributions
    …click to see all
    TZ

    Ted Zadouri

    medium hireability
    63
    KV Cache Optimization92
    Inference-Aware Architecture88
    Weight Streaming Efficiency45
    Weight Compression25
    Strengths
    "Hardware-Efficient Attention" (2505.21487): GTA cuts KV cache 50% vs GQA
    GLA matches MLA quality with 2× speed over FlashMLA in speculative decode
    Gaps
    No published work on weight compression or microscaling
    …click to see all
    TC

    Tianle Cai

    medium hireability

    Graduate Research Assistant@Princeton University

    Previously: AI Researcher @ Together AI

    Princeton, US

    78
    KV Cache Optimization97
    Weight Compression85
    Inference-Aware Architecture72
    Weight Streaming Efficiency58
    Strengths
    CommVQ: vector-quantized KV cache, 87.5% reduction at 2-bit (2025)
    SnapKV: KV eviction, 306 citations — core KV compression contribution
    Gaps
    No explicit hardware inference chip co-design (power/packaging constraints)
    …click to see all
    TB

    Tijmen Blankevoort

    medium hireability

    Researcher@Meta

    Previously: Researcher @ Qualcomm

    59
    Weight Compression98
    Inference-Aware Architecture72
    Weight Streaming Efficiency55
    KV Cache Optimization10
    Strengths
    ADAROUND (857 citations) — seminal post-training weight quantization
    ParetoQ (NeurIPS 2025) — extreme 1–4 bit LLM quantization framework
    Gaps
    No KV cache quantization or eviction work found
    …click to see all
    TS

    Ting Song

    medium hireability
    59
    Weight Compression92
    Weight Streaming Efficiency75
    Inference-Aware Architecture65
    KV Cache Optimization5
    Strengths
    Lead maintainer microsoft/BitNet; 14 commits including 916-line GEMM kernel
    Sparse-BitNet (2026): 1.58-bit + N:M sparsity — direct weight compression paper
    Gaps
    No KV cache work (MLA, eviction, quantized KV) — axis 1 blind spot
    …click to see all
    TR

    Triang-jyed-driung

    medium hireability
    29
    Inference-Aware Architecture65
    KV Cache Optimization35
    Weight Streaming Efficiency10
    Weight Compression5
    Strengths
    Albatross: RWKV inference engine — 10K+ token/s on RTX5090 with fp16 + CUDAGraph
    Rapid-Sampling: 1.5–25x faster than FlashInfer via CUDA 128-bit vectorization
    Gaps
    No direct work on MLA, vector-quantized KV cache, or KV eviction strategies
    …click to see all
    VE

    VED

    medium hireability
    23
    Weight Compression52
    Inference-Aware Architecture25
    Weight Streaming Efficiency12
    KV Cache Optimization3
    Strengths
    MXFP4 end-to-end integration in Axolotl AI training stack
    8 merged bitsandbytes PRs — 4-bit/8-bit quantization infrastructure
    Gaps
    No research publications — pure engineering role, not researcher
    …click to see all
    VT

    Vithursan Thangarasa

    medium hireability

    Principal Research Scientist@Cerebras Systems

    Previously: Lead Research Scientist @ Cerebras Systems

    San Francisco, US

    55
    Weight Compression80
    Inference-Aware Architecture65
    Weight Streaming Efficiency60
    KV Cache Optimization15
    Strengths
    SPDF: sparse pre-training for LLMs — weight compression via sparsity (44 citations)
    REAP: MoE pruning compression, one-shot (2025)
    Gaps
    No direct KV cache work (MLA, vector-quantized KV, eviction)
    …click to see all
    WM

    Weikang Meng

    medium hireability

    Ph.D. Student@Harbin Institute of Technology, Shenzhen

    Shenzhen, CN

    23
    KV Cache Optimization55
    Inference-Aware Architecture25
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    STILL: token selection to linearize LLMs — reduces full KV attention scope (Feb 2026)
    PolaFormer ICLR 2025 — polarity-aware linear attention, 26 citations
    Gaps
    No weight compression or quantization work
    WS

    Weixuan Sun

    medium hireability

    Researcher@Tencent

    Previously: PhD student @ Australian National University

    Blacksburg, US

    32
    KV Cache Optimization70
    Inference-Aware Architecture48
    Weight Streaming Efficiency6
    Weight Compression5
    Strengths
    Lightning Attention (ICML 2024): first linear attention with constant memory at any seq length
    HGRN2 (COLM 2024): recurrent state expansion — O(1) KV-like memory during decode
    Gaps
    No work specifically on MLA, vector-quantized KV cache, or KV eviction
    WS

    William Andrew Simon

    medium hireability

    Research Scientist on In-Memory Computing@IBM

    Previously: PhD student @ EPFL - EPF Lausanne

    Zurich, CH

    48
    Inference-Aware Architecture82
    Weight Compression68
    Weight Streaming Efficiency35
    KV Cache Optimization5
    Strengths
    Analog IMC accelerators for transformer LLMs — directly inference-chip-aware (2023)
    MoE + 3D analog in-memory scaling of LLMs — chip-level weight efficiency (2025)
    Gaps
    No evidence of KV cache, MLA, or KV eviction work
    XC

    Xiangxiang Chu

    medium hireability

    Senior Director@Alibaba

    Previously: Senior Technical Manager @ Meituan

    Beijing, CN

    48
    Inference-Aware Architecture78
    Weight Compression75
    Weight Streaming Efficiency30
    KV Cache Optimization10
    Strengths
    FPTQ + Norm Tweaking + Speed Odyssey: three LLM quantization deployment papers
    EfficientRep: hardware-aware CNN design explicitly optimized for inference chips
    Gaps
    No KV cache optimization work (MLA, KV eviction, vector-quantized KV cache)
    XD

    Xin Dong

    medium hireability

    Research Scientist@NVIDIA

    Previously: Research Scientist @ Sony

    76
    Weight Compression88
    Inference-Aware Architecture82
    KV Cache Optimization80
    Weight Streaming Efficiency52
    Strengths
    LaCache (2025): novel ladder-shaped KV cache reducing long-context memory
    Hymba (NVlabs): hybrid Mamba-Transformer inference-efficient architecture
    Gaps
    No explicit work on MLA or vector-quantized KV cache (LaCache is KV eviction style)
    XY

    Xingkai Yu

    medium hireability
    65
    Inference-Aware Architecture82
    KV Cache Optimization78
    Weight Compression50
    Weight Streaming Efficiency48
    Strengths
    DeepSeek-V3 MLA implementation — core KV cache compression technique
    nano-vllm (13.3K stars) — paged KV cache, prefix caching, chunked prefill
    Gaps
    No published academic papers — practitioner/engineer, not researcher
    XW

    Xin Wang

    medium hireability

    Director, Machine Learning@d-Matrix

    Previously: Principal Scientist & Manager, Machine Learning Research @ Cerebras Systems

    San Francisco, US

    69
    Weight Compression85
    Inference-Aware Architecture85
    Weight Streaming Efficiency55
    KV Cache Optimization50
    Strengths
    Flexpoint: hardware-aware adaptive numerical format for DNN inference (363 cites)
    ResQ: mixed-precision LLM quantization, low-rank residuals (2025)
    Gaps
    No direct MLA or vector-quantized KV cache research
    XY

    Xinyu Yang

    medium hireability

    Ph.D. Candidate@Carnegie Mellon University

    Previously: Research Intern @ Stanford University

    62
    KV Cache Optimization88
    Inference-Aware Architecture68
    Weight Compression52
    Weight Streaming Efficiency38
    Strengths
    TriForce (COLM 2024): hierarchical speculative decoding cuts KV cache overhead
    LESS (ICML 2024): recurrence + KV cache compression for efficient inference
    Gaps
    No explicit work on MLA or vector-quantized KV cache specifically
    XS

    XsquirrelC

    medium hireability
    50
    Weight Compression75
    Weight Streaming Efficiency68
    Inference-Aware Architecture52
    KV Cache Optimization3
    Strengths
    Merged PR #379 to microsoft/BitNet — 3,337-line CPU inference optimization
    GEMM kernel config for ternary/1.58-bit weights (gemm-config.h)
    Gaps
    No KV cache work — MLA, eviction, or vector quantization absent
    XL

    Xuan Liao

    medium hireability
    39
    KV Cache Optimization55
    Inference-Aware Architecture45
    Weight Compression35
    Weight Streaming Efficiency20
    Strengths
    INT8/FP8 QSDPA for CPU: 15 commits to pytorch/ao, including full INT8 SDPA path
    sgl-kernel-xpu: flash attention with paged KV cache for Intel BMG580 XPU
    Gaps
    No MLA, vector-quantized KV cache, or KV eviction work specifically
    YF

    Yaosheng Fu

    medium hireability

    Member of the Architecture Research Team@NVIDIA

    Previously: PhD student @ Princeton University

    60
    KV Cache Optimization88
    Inference-Aware Architecture78
    Weight Compression42
    Weight Streaming Efficiency30
    Strengths
    RocketKV (2025): two-stage KV cache compression for long-context LLM inference
    AutoScratch (MLSys 2023): ML-optimized scratch cache management for inference GPUs
    Gaps
    No direct weight streaming bandwidth reduction or weight sharing papers found
    YL

    Yi Liu

    medium hireability

    AI Frameworks Engineer@Intel

    Previously: Assistant Engineer @ National University of Defense Technology

    43
    Weight Compression72
    KV Cache Optimization45
    Inference-Aware Architecture35
    Weight Streaming Efficiency20
    Strengths
    "Optimize Weight Rounding via SGD" (2024, 29 cites) — weight compression paper
    PrefixQuant (2025) — static outlier-aware LLM quantization
    Gaps
    No original KV cache research — blog analysis, not novel contribution
    YZ

    Yilong Zhao

    medium hireability

    Ph.D. student@University of California, Berkeley

    Previously: Research Intern @ ByteDance

    Berkeley, US

    76
    KV Cache Optimization95
    Weight Compression85
    Inference-Aware Architecture75
    Weight Streaming Efficiency50
    Strengths
    DeepSeek-V2 co-author — pioneered MLA for KV cache compression
    Quest (ICML'24): query-aware KV eviction, 174 citations
    Gaps
    Year-1 PhD (2024) — 4-5 years until completion
    YH

    Yingbing Huang

    medium hireability

    Ph.D. Candidate@University of Illinois, Urbana Champaign

    Urbana, US

    41
    KV Cache Optimization80
    Inference-Aware Architecture65
    Weight Compression10
    Weight Streaming Efficiency10
    Strengths
    SnapKV NeurIPS'24 (314 citations) — co-author on KV eviction
    10 commits to FasterDecoding/SnapKV — hands-on implementation
    Gaps
    No work on weight compression or quantization
    YI

    YinHanke

    medium hireability
    26
    Weight Compression50
    Inference-Aware Architecture35
    KV Cache Optimization15
    Weight Streaming Efficiency5
    Strengths
    MNN PR #4336: SmoothQuant + OmniQuant for Qwen3.5 mixed attention export
    Familiar with quantized model export in production inference engine (Alibaba MNN)
    Gaps
    No published research — implementation engineer, not a researcher
    YF

    Yonggan Fu

    medium hireability

    Research Scientist@NVIDIA

    Previously: Research Intern @ NVIDIA

    San Francisco, US

    69
    Inference-Aware Architecture92
    KV Cache Optimization82
    Weight Compression75
    Weight Streaming Efficiency25
    Strengths
    LaCache (2025): novel KV caching scheme for long-context LLMs
    Hymba: hybrid-head small LM, ICLR 2025 Spotlight — inference-optimized arch
    Gaps
    No direct work on MLA or vector-quantized KV cache
    YW

    Yong Wu

    medium hireability
    36
    Inference-Aware Architecture65
    KV Cache Optimization55
    Weight Streaming Efficiency15
    Weight Compression10
    Strengths
    60 commits to flashinfer-ai/flashinfer — KV cache attention kernel library
    Bio: ML Compiler + FlashInfer LLM co-design — hardware-aware framing
    Gaps
    No evidence of MLA, vector-quantized KV, or KV eviction algorithm work
    YB

    Younes Belkada

    medium hireability

    MS student@ENS Paris Saclay

    Previously: Researcher @ Technology Innovation Institute

    Paris, FR

    53
    Weight Compression85
    Inference-Aware Architecture55
    Weight Streaming Efficiency40
    KV Cache Optimization30
    Strengths
    GPT3.int8() co-author — 1556 citations, foundational quantization work
    1.58-bit fine-tuning integration into axolotl via onebitllms (Apr 2026)
    Gaps
    No direct MLA, vector-quantized KV, or KV eviction work
    YK

    Young D. Kwon

    medium hireability

    Research Scientist@Samsung

    Previously: PHD Student @ University of Cambridge

    Cambridge, GB

    34
    Weight Compression62
    Inference-Aware Architecture45
    Weight Streaming Efficiency22
    KV Cache Optimization8
    Strengths
    HierarchicalPrune: position-aware diffusion model compression (AAAI 2026)
    SpecVocab speculative decoding — commercialized on Samsung Galaxy S26
    Gaps
    No KV cache work — MLA, vector-quantized KV, or eviction strategies absent
    YL

    Yuming Lou

    medium hireability
    43
    Weight Compression65
    Inference-Aware Architecture55
    KV Cache Optimization35
    Weight Streaming Efficiency15
    Strengths
    MIT HAN Lab intern — AWQ (MLSys 2024 Best Paper) contributor
    Tsinghua quantization algorithms research under Yu Wang
    Gaps
    No original research on MLA, vector-quantized KV cache, or KV eviction
    YL

    Yun Li

    medium hireability

    Technical Expert@Huawei

    Previously: Senior Algorithm Researcher @ Tencent

    Shanghai, CN

    46
    Weight Compression78
    Weight Streaming Efficiency48
    Inference-Aware Architecture48
    KV Cache Optimization10
    Strengths
    AMS-Quant: novel FP4.25/FP5.33 weight quantization — reduces bits-per-weight directly
    CUDA kernels in AMS-Quant minimize memory access; 2.8–3.2× speedup vs FP16
    Gaps
    No KV cache work — MLA, vector-quantized KV, and KV eviction absent from profile
    YP

    Yuqi Pan

    medium hireability

    PhD student@Institute of Automation, Chinese Academy of Sciences

    Previously: Undergrad student @ Nanjing University

    Beijing, CN

    26
    KV Cache Optimization60
    Inference-Aware Architecture35
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    MetaLA: KV-free linear attention via fixed recurrent state (18 citations)
    Contributor to fla-org/flash-linear-attention — includes MLA implementation
    Gaps
    No weight compression work (quantization, microscaling, topological regularization)
    YZ

    Yuxuan Zhu

    medium hireability

    PhD Student@Rensselaer Polytechnic Institute

    Previously: Graduation thesis/internship (data-driven nozzle failure detection and classification) @ Canon Production Printing

    Troy, US

    32
    KV Cache Optimization82
    Inference-Aware Architecture35
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    SentenceKV (COLM 2025) — sentence-level KV cache compression
    OjaKV (ACL 2026) — online low-rank KV cache compression
    Gaps
    No hardware/chip-level architecture co-design work
    YZ

    Yu Zhang

    medium hireability

    Full-time Researcher@Moonshot AI

    Previously: Research Intern @ Tencent

    CN

    53
    Inference-Aware Architecture88
    KV Cache Optimization75
    Weight Streaming Efficiency30
    Weight Compression20
    Strengths
    Kimi Linear (2025): 75% KV cache reduction via hybrid linear attention (KDA)
    1117 commits to flash-linear-attention — Triton hardware-efficient kernel library
    Gaps
    No direct work on MLA, vector-quantized KV, or KV eviction specifically
    ZC

    Zefan Cai

    medium hireability

    TikTok Shop Account Manager@Pattern

    Previously: Business Analyst @ Xiaohongshu

    Lehi, US

    31
    KV Cache Optimization90
    Inference-Aware Architecture20
    Weight Compression8
    Weight Streaming Efficiency5
    Strengths
    PyramidKV (NeurIPS 2024, 148 citations) — layer-wise KV cache compression
    R-KV (NeurIPS 2025) — KV compression for reasoning models
    Gaps
    No published work on weight compression or weight streaming efficiency
    ZW

    Zeyu WANG

    medium hireability
    47
    KV Cache Optimization85
    Inference-Aware Architecture78
    Weight Streaming Efficiency15
    Weight Compression8
    Strengths
    4 merged PRs to deepseek-ai/FlashMLA — MLA KV cache decoding kernels
    PR #76: full Blackwell SM100 architecture MLA kernel support (+11K lines)
    Gaps
    No weight compression or microscaling work found
    Z(

    zhang (dianzhangchen)

    medium hireability
    51
    KV Cache Optimization88
    Inference-Aware Architecture85
    Weight Compression20
    Weight Streaming Efficiency12
    Strengths
    deepseek-ai/FlashMLA: 2 merged PRs including TMA pipeline optimization
    NVIDIA/cutlass PR #2472: Blackwell MLA forward kernel merged (3323 LOC)
    Gaps
    No evidence of weight streaming / bandwidth reduction work
    ZW

    Zhangyang Wang

    medium hireability

    Senior Research Scientist@Meta

    Previously: Assistant Professor @ University of Texas at Austin

    80
    KV Cache Optimization93
    Weight Compression92
    Inference-Aware Architecture72
    Weight Streaming Efficiency62
    Strengths
    H2O KV eviction — co-author of seminal KV heavy-hitter eviction paper
    Q-Hitter: sparse-quantized KV cache for efficient LLM inference
    Gaps
    No specific MLA or vector-quantized KV cache work found
    ZX

    Zhenda Xie

    medium hireability

    AI Researcher@DeepSeek AI

    Previously: Joint PhD & Fulltime Intern @ Microsoft

    Beijing, CN

    61
    KV Cache Optimization92
    Inference-Aware Architecture78
    Weight Streaming Efficiency38
    Weight Compression36
    Strengths
    MLA in DeepSeek-V2: 93.3% KV cache reduction (core designer)
    NSA paper: hardware-aligned sparse attention for modern inference chips
    Gaps
    No explicit work on vector-quantized KV cache or KV eviction strategies
    ZD

    Zhen Dong

    medium hireability

    Senior/Staff Research Scientist@NVIDIA

    Previously: Founding Member @ Nexusflow

    San Francisco, US

    71
    Weight Compression95
    Inference-Aware Architecture80
    KV Cache Optimization65
    Weight Streaming Efficiency45
    Strengths
    R-KV (2025): KV cache compression for reasoning models — direct search match
    SqueezeLLM: dense-and-sparse weight quantization for LLM decode (323 citations)
    Gaps
    No explicit MLA / multi-head latent attention architecture work found
    ZQ

    Zhen Qin

    medium hireability

    Staff Research Scientist@DeepMind

    Previously: Researcher @ TapTap

    New York, US

    38
    Inference-Aware Architecture65
    KV Cache Optimization58
    Weight Streaming Efficiency18
    Weight Compression10
    Strengths
    Lightning Attention-2: fixed-size recurrent KV state, hardware-efficient kernels
    HGRN2: gated linear RNN — bounded memory at inference, state expansion
    Gaps
    No specific MLA, vector-quantized KV cache, or KV eviction work
    ZY

    Zhihang Yuan

    medium hireability

    Algorithm Researcher@Bytedance

    Previously: Researcher @ Infinigence AI

    Beijing, CN

    80
    Weight Compression96
    KV Cache Optimization85
    Inference-Aware Architecture72
    Weight Streaming Efficiency65
    Strengths
    WKVQuant: joint weight + KV cache quantization (2024)
    SKVQ: sliding-window KV cache quant — eviction-adjacent (2024)
    Gaps
    No direct MLA or attention architecture design (KV work is compression-focused)
    ZL

    Zhiyuan Li

    medium hireability
    60
    KV Cache Optimization85
    Inference-Aware Architecture78
    Weight Streaming Efficiency70
    Weight Compression5
    Strengths
    Kimi Linear co-author: 75% KV reduction, 6× TPOT vs MLA at 1M context
    252 commits to flash-linear-attention — core KDA kernel maintainer
    Gaps
    No weight compression work — quantization/pruning absent from profile
    ZC

    Zhuoming Chen

    medium hireability

    Ph.D. student@Carnegie Mellon University

    Previously: Research Intern @ Meta

    New York, US

    36
    KV Cache Optimization75
    Inference-Aware Architecture42
    Weight Streaming Efficiency18
    Weight Compression8
    Strengths
    MagicPIG: LSH-based KV cache eviction for efficient LLM generation
    TriForce: KV-cache hierarchical draft — directly manages KV at inference
    Gaps
    No weight compression work (quantization, pruning, topoLM-style approaches)
    ZY

    Zihao Ye

    medium hireability

    Engineer@NVIDIA

    Previously: Intern @ NVIDIA

    San Francisco, US

    74
    KV Cache Optimization95
    Weight Compression80
    Inference-Aware Architecture75
    Weight Streaming Efficiency45
    Strengths
    FlashInfer creator (948 commits): paged KV cache, MLA, eviction kernels — exact match
    MagicPIG paper: LSH-based KV approximation for efficient generation
    Gaps
    Kernel/systems engineer focus — not model architecture designer per se
    ZZ

    ZZK

    medium hireability
    50
    KV Cache Optimization80
    Inference-Aware Architecture50
    Weight Compression38
    Weight Streaming Efficiency30
    Strengths
    FlashMLA PR #162 — direct KV cache compression kernel work
    FlashInfer PR #3224 — MoE kernel memory access optimization
    Gaps
    No LinkedIn; limited hireability signal
    AS

    Abu Sebastian

    low hireability

    Manager, AI Compute Frontiers (DRSM)@IBM

    Previously: Principal Research Staff Member @ IBM

    Zurich, CH

    58
    Inference-Aware Architecture85
    Weight Streaming Efficiency75
    Weight Compression65
    KV Cache Optimization8
    Strengths
    NAS for in-memory computing accelerators — hardware-aware architecture co-design
    Efficient LLM scaling with MoE + 3D analog in-memory (Nature Comp Sci 2025)
    Gaps
    No KV cache optimization work (MLA, vector-quantized KV, KV eviction)
    AH

    Ahmed Hasssan

    low hireability

    MTS Software Development Engineer@AMD

    Previously: Graduate Student Research Assistant @ Cornell Tech

    Pueblo, US

    50
    Weight Compression82
    Inference-Aware Architecture78
    Weight Streaming Efficiency30
    KV Cache Optimization10
    Strengths
    CiM inference chip co-design — 4-year PhD focus at Cornell Seo lab
    Torch2Chip: DNN compression + hardware accelerator deployment toolkit
    Gaps
    No KV cache work: no evidence of MLA, vector-quantized KV, or eviction strategies
    AJ

    AJAY KUMAR JAISWAL

    low hireability

    Researcher@Apple

    Previously: PHD Scholar @ The University of Texas at Austin

    Seattle, US

    33
    Weight Compression85
    Weight Streaming Efficiency30
    Inference-Aware Architecture15
    KV Cache Optimization0
    Strengths
    OWL: outlier-weighted layerwise sparsity for LLMs (107 citations, ICML 2024)
    WeLore: low-rank weight compression from gradient stabilization (ICML 2025)
    Gaps
    No KV cache optimization work (MLA, vector quantization, eviction) found
    AY

    Alex Yang

    low hireability
    48
    KV Cache Optimization65
    Weight Compression55
    Inference-Aware Architecture55
    Weight Streaming Efficiency15
    Strengths
    Core FlashInfer maintainer — MLA KV cache kernel contributor
    TRT-LLM paged attention kernel work (KV eviction/management)
    Gaps
    No published research — inference kernel engineer, not architecture researcher
    AW

    Alvin Wan

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Senior Research Scientist @ Apple

    San Francisco, US

    49
    Weight Compression80
    Inference-Aware Architecture65
    Weight Streaming Efficiency45
    KV Cache Optimization5
    Strengths
    'The Super Weight in LLMs' (2025) — outlier-aware quantization block sizing
    UPSCALE channel pruning — 2x inference speedup via structured sparsity
    Gaps
    No KV cache work — MLA, vector-quantized KV, or KV eviction absent
    AP

    Amar Phanishayee

    low hireability

    Sr. Principal Researcher@Microsoft

    Previously: PhD student @ Carnegie Mellon University

    49
    Inference-Aware Architecture72
    Weight Compression60
    KV Cache Optimization45
    Weight Streaming Efficiency20
    Strengths
    DéjàVu (ICML 2024): KV-cache streaming for LLM serving — 45 citations
    Block FP activation compression patents (2020, 2024, 2025) — MX/microscaling-adjacent
    Gaps
    KV work is fault-tolerant streaming, not MLA/vector-quantized KV/eviction
    AM

    Amirkeivan Mohtashami

    low hireability

    Research Scientist@DeepMind

    Previously: Research Scientist @ Google

    Zurich, CH

    65
    Weight Compression88
    KV Cache Optimization82
    Inference-Aware Architecture58
    Weight Streaming Efficiency32
    Strengths
    QuaRot: 4-bit outlier-free LLM inference, NeurIPS 2024 (340 citations)
    Landmark Attention: KV eviction for infinite context, NeurIPS 2023 (223 citations)
    Gaps
    No hardware-aware architecture design for specific inference chips
    AP

    Andrei Panferov

    low hireability

    PhD Student Researcher, ML@ISTA (Institute of Science and Technology Austria)

    Previously: Senior ML Engineer @ Wildberries

    Vienna, AT

    46
    Weight Compression95
    Inference-Aware Architecture55
    Weight Streaming Efficiency30
    KV Cache Optimization5
    Strengths
    AQLM: extreme additive quantization, 141 citations
    Quartet II: NVFP4 pre-training for NVIDIA Blackwell — inference-chip aware
    Gaps
    No KV cache optimization work (MLA, KV eviction) visible
    AG

    Anerudhan Gopal

    low hireability
    41
    Inference-Aware Architecture72
    KV Cache Optimization65
    Weight Compression20
    Weight Streaming Efficiency8
    Strengths
    Ragged KV Cache cuDNN backend wrapper for FlashInfer — direct KV cache work
    FP8 Q+KV attention via cuDNN — quantized KV cache at inference time
    Gaps
    No evidence of MLA, KV eviction algorithms, or vector-quantized KV cache
    AG

    Ankit Gupta

    low hireability

    Research Scientist@IBM

    Previously: Research Scientist @ IBM

    Boston, US

    25
    KV Cache Optimization60
    Inference-Aware Architecture28
    Weight Compression5
    Weight Streaming Efficiency5
    Strengths
    DSS (NeurIPS 2022, 511 cit) — SSMs eliminate KV cache entirely
    Gated State Spaces (ICLR 2023, 409 cit) — KV-free LM architecture
    Gaps
    No weight compression or quantization work (microscaling, topological regularization)
    AR

    Ankit Singh Rawat

    low hireability

    Senior Staff Research Scientist@DeepMind

    Previously: Staff Research Scientist @ DeepMind

    New York, US

    35
    KV Cache Optimization68
    Inference-Aware Architecture30
    Weight Compression22
    Weight Streaming Efficiency20
    Strengths
    Low-Rank Bottleneck paper (2020) — foundational insight underlying MLA KV compression
    GLA analysis (ICLR 2025) — directly addresses O(1) KV cache via gated linear attention
    Gaps
    No hardware-aware or chip co-design work — inference efficiency is algorithm-focused only
    AS

    Anshumali Shrivastava

    low hireability

    Founder and Board Chairman@ThirdAI

    Previously: CEO @ ThirdAI

    73
    KV Cache Optimization92
    Weight Compression90
    Inference-Aware Architecture58
    Weight Streaming Efficiency52
    Strengths
    Scissorhands (291 cites) — KV cache eviction, directly on-target
    "KV Cache is 1 Bit Per Channel" — extreme KV quantization at inference
    Gaps
    No published chip co-design work (power/bandwidth/packaging constraints)
    AB

    Artem Bolshakov

    low hireability

    Researcher@QualComm

    Previously: PhD student @ University of Toronto

    US

    26
    Weight Compression55
    Inference-Aware Architecture30
    Weight Streaming Efficiency20
    KV Cache Optimization0
    Strengths
    GPTVQ: VQ quantization for LLMs, SotA model size vs accuracy trade-off
    GPTVQ targets DRAM + latency reduction on ARM CPU and Nvidia GPU inference
    Gaps
    h-index of 1; single ML paper as 1 of 9 authors — individual contribution unclear
    AQ

    Aurick Qiao

    low hireability

    Member of Technical Staff@Thinking Machines Lab

    Previously: AI Researcher @ Snowflake

    Seattle, US

    44
    KV Cache Optimization82
    Inference-Aware Architecture55
    Weight Compression32
    Weight Streaming Efficiency5
    Strengths
    TALE (TACL 2025): low-rank KV cache approximation with reconstruction elimination
    SwiftKV: skips later-layer KV prefill — 25-50% prefill FLOP reduction
    Gaps
    No work on MLA, vector-quantized KV, or KV eviction strategies
    BO

    Barlas Oguz

    low hireability

    Research Scientist@Meta

    Previously: Senior Data Scientist @ Microsoft

    San Francisco, US

    52
    Weight Compression82
    KV Cache Optimization60
    Inference-Aware Architecture35
    Weight Streaming Efficiency30
    Strengths
    LLM-QAT: 4-bit QAT for LLMs with explicit KV cache quantization (414 citations)
    BiT: fully binarized transformers — extreme compression (94 citations)
    Gaps
    No explicit hardware inference chip co-design (power/bandwidth constraints)
    BC

    Beidi Chen

    low hireability

    Assistant Professor of Electrical and Computer Engineering@Carnegie Mellon University

    Previously: Researcher @ Meta

    Pittsburgh, US

    81
    KV Cache Optimization97
    Inference-Aware Architecture82
    Weight Compression74
    Weight Streaming Efficiency72
    Strengths
    H2O: canonical KV eviction oracle — 689 citations, NeurIPS 2023
    StreamingLLM: attention-sink infinite-context KV — 998 citations
    Gaps
    CMU assistant professor — technically triggers 'no professors' constraint
    BR

    Bita Rouhani

    low hireability

    Distinguished Engineer@NVIDIA

    Previously: Partner Group Manager @ Microsoft

    Seattle, US

    85
    Weight Compression95
    Inference-Aware Architecture90
    KV Cache Optimization82
    Weight Streaming Efficiency72
    Strengths
    OCP MX spec co-author (arXiv:2310.10537) — exact Neuralace-cited standard
    Key, Value, Compress (2025) — comprehensive KV cache compression coverage
    Gaps
    No published work specifically on MLA or KV eviction strategies
    BD

    Boyu Diao

    low hireability

    Senior Research Engineer@Institute of Computing Technology, Chinese Academy of Sciences

    Previously: Assistant Professor @ Institute of Computing Technology, Chinese Academy of Sciences

    Beijing, CN

    34
    Weight Compression75
    Inference-Aware Architecture38
    Weight Streaming Efficiency18
    KV Cache Optimization5
    Strengths
    MPQ-DM (AAAI'25): extremely low-bit (2-4 bit) mixed-precision quantization
    Q-VDiT (ICML'25): W3A6 quant for video DiT, 1.9× SOTA improvement
    Gaps
    No KV cache work (MLA, vector-quantized KV, KV eviction)
    BZ

    Bo Zhang

    low hireability

    VLM Tech Lead@RobotEra

    Previously: Algorithm Strategist @ Meituan

    Beijing, CN

    44
    Weight Compression85
    Inference-Aware Architecture55
    Weight Streaming Efficiency30
    KV Cache Optimization5
    Strengths
    FPTQ + Integer Scale + Norm Tweaking — 3 LLM post-training quantization papers
    MobileVLM: explicitly hardware-constrained on-device VLM (2023-2024)
    Gaps
    No KV cache work — no MLA, KV eviction, or cache compression papers
    BK

    Byeongwook Kim

    low hireability

    Leader@NAVER

    Previously: Technical Leader @ NAVER

    Gyeonggi, KR

    65
    Weight Compression90
    KV Cache Optimization75
    Inference-Aware Architecture65
    Weight Streaming Efficiency30
    Strengths
    "No Token Left Behind": MiKV KV eviction + mixed-precision quantization (2024)
    LUT-GEMM: lookup-table weight quantization, 184 citations
    Gaps
    South Korea location — not in specified geographies (USA/Europe/China/India)
    CW

    Carole-Jean Wu

    low hireability

    Director of AI Research@Meta

    Previously: Professor with tenure @ Arizona State University

    55
    Inference-Aware Architecture82
    KV Cache Optimization65
    Weight Compression45
    Weight Streaming Efficiency28
    Strengths
    CHAI: clustered attention heads cut KV cache footprint at LLM inference time (2024)
    LayerSkip: early exit + self-speculative decoding for LLM efficiency (153 citations)
    Gaps
    No direct work on MLA, vector-quantized KV cache, or KV eviction strategies
    CB

    Charlie Blake

    low hireability

    AI research engineer@Graphcore

    Previously: MS student @ University of Oxford

    78
    Inference-Aware Architecture82
    Weight Compression78
    KV Cache Optimization78
    Weight Streaming Efficiency72
    Strengths
    SparQ Attention (ICML 2024) — sparse KV retrieval cuts bandwidth ~8x
    8-bit FP inference (NeurIPS 2023 Oral) — published weight compression work
    Gaps
    Now at OpenAI MTS — very high comp/mission retention, hard to recruit
    CL

    Cheng Luo

    low hireability

    researcher@TikTok

    Previously: postdoctoral researcher @ Caltech

    43
    KV Cache Optimization80
    Inference-Aware Architecture38
    Weight Compression32
    Weight Streaming Efficiency20
    Strengths
    R-KV: redundancy-aware KV cache compression for reasoning models (2025)
    HeadInfer: head-wise KV offloading for memory-efficient inference (2025)
    Gaps
    No evidence of inference chip co-design or power/bandwidth-aware model architecture
    CZ

    Chen Zhang

    low hireability

    Assistant Professor@Shanghai Jiao Tong University

    Previously: Chip Architect @ Alibaba

    Shanghai, CN

    81
    Inference-Aware Architecture90
    Weight Compression88
    Weight Streaming Efficiency75
    KV Cache Optimization72
    Strengths
    H2-LLM (ISCA 2025) — hardware-dataflow co-exploration for LLM inference on custom chips
    OliVe (ISCA 2023, 168 cit.) — hardware-friendly outlier-victim pair quantization
    Gaps
    Tenure-track assistant professor — "No professors" flag in search query
    CC

    Chi-Chih Chang

    low hireability

    Ph.D. Student@Cornell University

    Previously: Remote Intern @ University of Washington

    66
    KV Cache Optimization95
    Weight Compression78
    Inference-Aware Architecture55
    Weight Streaming Efficiency35
    Strengths
    Palu (ICLR 2025): low-rank KV-cache compression — first-author
    xKV (2025): cross-layer SVD KV-cache sharing — first-author
    Gaps
    No MLA or vector-quantized KV cache work — focuses on SVD/low-rank projection
    CR

    Chong Ruan

    low hireability

    Researcher@DeepSeek

    Previously: MS student @ Peking University

    69
    KV Cache Optimization90
    Inference-Aware Architecture82
    Weight Compression55
    Weight Streaming Efficiency48
    Strengths
    DeepSeek-V2: MLA reduces KV cache 93.3%, throughput 5.76x
    DeepSeek-V3 Technical Report: FP8 training + continued MLA architecture
    Gaps
    No direct work on topological weight regularization or block-sparsity for compression
    CP

    Christian Puhrsch

    low hireability

    Researcher@Meta

    Previously: MS student @ New York University

    37
    Weight Compression78
    Weight Streaming Efficiency38
    Inference-Aware Architecture28
    KV Cache Optimization5
    Strengths
    TorchAO author — INT4/INT8/FP8/MXFP quantization, 2:4 sparsity (ICML 2025)
    73 PRs to pytorch/ao; 70+ commits — core team contributor
    Gaps
    No KV cache work — no MLA, vector-quantized KV, or KV eviction signals
    CL

    Christos Louizos

    low hireability

    PhD Candidate@University of Amsterdam

    Previously: Research Intern @ Qualcomm

    Amsterdam, NL

    38
    Weight Compression90
    Inference-Aware Architecture30
    Weight Streaming Efficiency25
    KV Cache Optimization5
    Strengths
    ADAROUND (ICML 2020, 857 cit) — landmark post-training quantization
    Bayesian Compression for DL (NeurIPS 2017, 620 cit) — weight compression identity
    Gaps
    No KV cache work — no MLA, eviction, or vector-quantized KV cache evidence
    CS

    Christos Sourmpis

    low hireability

    Research Scientist@Huawei

    Previously: Research And Development Engineer @ SynSense

    Zurich, CH

    38
    KV Cache Optimization75
    Inference-Aware Architecture45
    Weight Streaming Efficiency20
    Weight Compression10
    Strengths
    "When Perplexity Lies": 75% KV cache reduction via hybrid SSM distillation (2026)
    AllMem: sliding-window + TTT memory hybrid, 128K context efficiency (2026)
    Gaps
    No explicit chip-constraint co-design (power/bandwidth/packaging)
    DD

    Damai Dai

    low hireability

    Researcher@DeepSeek AI

    Previously: PhD student @ Peking University

    54
    KV Cache Optimization95
    Inference-Aware Architecture85
    Weight Streaming Efficiency20
    Weight Compression15
    Strengths
    MLA in DeepSeek-V2: 93.3% KV cache reduction — landmark inference innovation
    NSA (2025): hardware-aligned sparse attention, proven speedups on 64k-token decode
    Gaps
    No direct weight streaming / weight bandwidth reduction work found
    DB

    Davis Blalock

    low hireability

    Research Scientist@DeepMind

    Previously: Research Scientist @ Databricks

    San Francisco, US

    41
    Weight Compression80
    Inference-Aware Architecture32
    KV Cache Optimization28
    Weight Streaming Efficiency22
    Strengths
    'What is the State of Neural Network Pruning?' — landmark survey, 34+ citations
    'Multiplying Matrices Without Multiplying' (2021) — matrix approximation via quantization
    Gaps
    No direct KV cache paper (MLA, KV eviction, KV quantization) — adjacent via VQ
    DY

    Dejian Yang

    low hireability

    Researcher@DeepSeek AI

    Previously: Researcher @ Microsoft

    52
    KV Cache Optimization95
    Inference-Aware Architecture82
    Weight Compression18
    Weight Streaming Efficiency12
    Strengths
    MLA originator — 93.3% KV cache reduction in DeepSeek-V2
    MLA adopted in DeepSeek-V3 (671B) and V3.2 — production-proven at scale
    Gaps
    No published weight compression work (microscaling, quantization-aware topology)
    DC

    Deli Chen

    low hireability

    Researcher@DeepSeek AI

    Previously: Research Intern, WeChat AI @ Tencent

    Beijing, CN

    49
    KV Cache Optimization92
    Inference-Aware Architecture80
    Weight Streaming Efficiency15
    Weight Compression10
    Strengths
    DeepSeek-V2 co-author — invented MLA (93.3% KV cache reduction)
    DeepSeek-V3 co-author — MLA adopted in 671B production model
    Gaps
    No published work on weight compression or quantization
    DG

    Denis A Gudovskiy

    low hireability

    Senior Deep Learning Researcher@Panasonic

    Previously: Senior Wireless Engineer @ Intel

    San Francisco, US

    38
    Weight Compression70
    Inference-Aware Architecture55
    Weight Streaming Efficiency20
    KV Cache Optimization8
    Strengths
    ShiftCNN: multiplierless low-precision CNN inference (74 citations, 2017)
    DNN Feature Map Compression: bandwidth-reduction via GF(2) (ECCV 2018)
    Gaps
    No KV cache work (MLA, vector quantized KV, eviction) — core gap for axis 1
    DC

    Dhruv Choudhary

    low hireability

    Senior Staff Research Engineer@Meta

    Previously: Senior Tech Lead Manager @ Meta

    San Francisco, US

    53
    Weight Compression88
    Inference-Aware Architecture72
    Weight Streaming Efficiency42
    KV Cache Optimization8
    Strengths
    SpinQuant: LLM quantization via rotations — 201 citations (2025)
    Microscaling MX formats co-author — hardware-aware data format spec
    Gaps
    No KV cache optimization work — MLA, KV eviction, KV quantization absent
    DN

    Dimin Niu

    low hireability

    Research Scientist@Alibaba

    Previously: Senior / Staff Engineer @ Samsung

    San Francisco, US

    51
    Inference-Aware Architecture90
    Weight Streaming Efficiency65
    Weight Compression30
    KV Cache Optimization20
    Strengths
    H-LLM (ISCA 2025): hardware-dataflow co-design for hybrid-bonding LLM inference
    HD-MoE: 3D near-memory processing reduces weight bandwidth for MoE decode
    Gaps
    No KV cache papers (no MLA, KV eviction, vector-quantized KV work found)
    EF

    Elias Frantar

    low hireability

    Member of Technical Staff@OpenAI

    Previously: PHD Candidate @ Institute of Science and Technology Austria

    San Francisco, US

    69
    Weight Compression97
    Weight Streaming Efficiency85
    Inference-Aware Architecture82
    KV Cache Optimization12
    Strengths
    GPTQ (ICLR 2023) — foundational post-training quantization for LLMs
    MARLIN: FP16xINT4 kernel, ~4x decode speedup — directly addresses weight streaming
    Gaps
    No KV cache work (MLA, vector-quantized KV, KV eviction) in publications
    EC

    Eric Chung

    low hireability

    VP of AI Computing@NVIDIA

    Previously: GM & Partner Group Engineering Manager @ Microsoft

    Seattle, US

    66
    Weight Compression95
    Inference-Aware Architecture95
    Weight Streaming Efficiency60
    KV Cache Optimization15
    Strengths
    Microscaling data formats (MX) — number formats co-designed for inference chip constraints
    Shared microexponents (2023): extreme narrow-precision, fewer bits per weight
    Gaps
    No dedicated KV cache research (MLA, eviction strategies, vector-quantized KV)
    FY

    Fan Yang

    low hireability

    Sr. Principal Research Manager@Microsoft

    Previously: Principal Research Manager @ Microsoft

    CN

    91
    Weight Compression95
    Inference-Aware Architecture95
    KV Cache Optimization92
    Weight Streaming Efficiency80
    Strengths
    RetrievalAttention + SeerAttention: KV eviction via sparse/vector retrieval (63 + 48 citations)
    WaferLLM (OSDI 2025): wafer-scale inference-aware LLM architecture
    Gaps
    Hireability low — entrenched senior MSRA role, no market signals
    GJ

    Gaurav Jain

    low hireability

    ML Systems@xAI

    Previously: Technical Director, Software @ d-Matrix

    San Francisco, US

    50
    KV Cache Optimization92
    Inference-Aware Architecture78
    Weight Compression15
    Weight Streaming Efficiency15
    Strengths
    Keyformer (MLSys 2024, 110 citations) — KV eviction, 2.1x latency reduction
    MorphKV (ICML 2025) — constant-sized KV cache, 52.9% memory savings
    Gaps
    No published weight compression or weight streaming work
    GX

    Guangxuan Xiao

    low hireability

    Member of Technical Staff@Thinking Machines Lab

    Previously: Research Intern @ NVIDIA

    USA

    82
    Weight Compression95
    KV Cache Optimization95
    Inference-Aware Architecture82
    Weight Streaming Efficiency55
    Strengths
    StreamingLLM: KV eviction via attention sinks — 1035 citations (ICLR 2024)
    DuoAttention: head-type KV cache reduction — 99 citations (ICLR 2025)
    Gaps
    Joined Thinking Machines Lab mid-2025 — only ~6–12 months in role, low hire window
    GH

    Guyue Huang

    low hireability

    Deep Learning Architect@NVIDIA

    Previously: Community Associate @ International

    US

    52
    Inference-Aware Architecture80
    Weight Streaming Efficiency68
    Weight Compression50
    KV Cache Optimization8
    Strengths
    Shfl-BW (DAC'22): tensor-core-aware weight pruning for inference acceleration
    RM-STC (MICRO'23): GPU sparse tensor core, energy-efficient sparse acceleration
    Gaps
    No KV cache work (MLA, vector-quantized KV, KV eviction) found
    HK

    Han-Byul Kim

    low hireability

    ML Research Engineer@Apple

    Previously: Research Intern @ Apple

    Seattle, US

    53
    Weight Compression75
    KV Cache Optimization60
    Inference-Aware Architecture50
    Weight Streaming Efficiency28
    Strengths
    EpiCache (2025): KV cache management for long conversational QA
    BASQ (ECCV 2022): sub-4-bit quantization via branch-wise NAS
    Gaps
    No evidence of MLA or vector-quantized KV cache techniques specifically
    HH

    Haofeng Huang

    low hireability

    Research Intern@Alibaba

    Previously: Infra Team Member @ ShengShu Technology

    Beijing, CN

    77
    KV Cache Optimization88
    Weight Compression83
    Inference-Aware Architecture80
    Weight Streaming Efficiency55
    Strengths
    SageAttention series: INT4/FP4 quantized attention (ICLR+ICML+NeurIPS 2025 Spotlight)
    SageAttention3: Microscaling FP4 — directly matches search's microscaling constraint
    Gaps
    Incoming PhD at IIIS under Prof. Yao starting Fall 2026 — committed to academic path
    HC

    Hao Mark Chen

    low hireability

    PhD Student@Imperial College London

    Previously: ML Research Intern @ Samsung

    London, GB

    50
    Weight Compression70
    Inference-Aware Architecture65
    Weight Streaming Efficiency55
    KV Cache Optimization10
    Strengths
    Progressive Mixed-Precision Decoding: INT2/3 on NPUs, 3.8–8× throughput
    Hardware-Aware Parallel Prompt Decoding: adaptive sparse tree per GPU arch
    Gaps
    No KV cache work (MLA, eviction, vector-quantized KV) found
    HP

    Hayden Prairie

    low hireability

    Kernels Research Intern@Together

    Previously: Research Assistant @ University of Texas at Austin

    San Diego, US

    48
    Weight Compression75
    Inference-Aware Architecture65
    Weight Streaming Efficiency35
    KV Cache Optimization15
    Strengths
    "Search Your Block Floating Point Scales!" MLSys 2026 — BFP/microscaling quantization
    Parcae ICLR 2026 — looped models, inference-efficient weight reuse architecture
    Gaps
    No direct KV cache work (MLA, eviction) — SSM approach replaces rather than compresses KV
    HH

    Helya Hosseini

    low hireability

    Research Assistant and Teaching Assistant@University of Maryland

    Previously: Logic Design Teaching Assistant @ University of Tehran

    US

    41
    KV Cache Optimization75
    Inference-Aware Architecture60
    Weight Compression20
    Weight Streaming Efficiency10
    Strengths
    MUSTAFAR (NeurIPS 2025): 70% KV cache sparsity, 2.23x throughput gain
    Custom bitmap-sparse attention kernel for compressed KV cache decode
    Gaps
    No weight compression work (focus is KV cache, not weights)
    HM

    Hesham Mostafa

    low hireability

    Researcher@Intel

    50
    Inference-Aware Architecture82
    Weight Compression78
    Weight Streaming Efficiency30
    KV Cache Optimization8
    Strengths
    Technical Lead ML at d-Matrix — CIM inference chip co-design role
    MF-QAT (2025): elastic inference via multi-format QAT (MXINT/MXFP)
    Gaps
    No KV cache work found (MLA, vector-quantized KV, KV eviction)
    HC

    Hung-Yueh Chiang

    low hireability

    Ph.D. Candidate@The University of Texas at Austin

    Previously: Machine Learning Engineer @ XYZ Robotics

    Austin, US

    68
    Weight Compression90
    KV Cache Optimization75
    Inference-Aware Architecture65
    Weight Streaming Efficiency40
    Strengths
    UniQL (ICLR 2026): unified quantization + low-rank compression for edge LLMs
    Quamba2 (ICML 2025): scalable PTQ framework for SSMs
    Gaps
    Just started NVIDIA April 2026 — only ~1 month tenure, low hireability
    JL

    James Liu

    low hireability

    Member of Technical Staff@Anthropic

    Previously: Research Scientist @ Together AI

    San Francisco, US

    44
    Weight Compression78
    Inference-Aware Architecture50
    Weight Streaming Efficiency42
    KV Cache Optimization5
    Strengths
    BitDelta (NeurIPS 2024): 1-bit delta quantization, >10x GPU memory reduction
    TEAL (ICLR 2025 Spotlight): 40-50% activation sparsity, 1.8x wall-clock speedup
    Gaps
    No KV cache work (MLA, KV eviction, vector-quantized KV) found
    JP

    Jeff Pool

    low hireability

    Senior Manager@NVIDIA

    Previously: Manager - Architecture @ NVIDIA

    US

    61
    Weight Compression93
    Inference-Aware Architecture78
    Weight Streaming Efficiency65
    KV Cache Optimization8
    Strengths
    "Learning both Weights & Connections" (2015) — 9,558 citations, seminal pruning
    MaskLLM (2024) — learnable N:M sparsity for LLMs
    Gaps
    No published work on KV cache, MLA, or attention efficiency
    JZ

    Jian Zhang

    low hireability

    Director and Distinguished Scientist@Nvidia

    Previously: Co-Founder, CTO, VP Engineering @ Nvidia

    San Francisco, US

    54
    Weight Compression82
    Inference-Aware Architecture72
    Weight Streaming Efficiency45
    KV Cache Optimization15
    Strengths
    NVFP4 pre-training on Nemotron 3 Super — chip-native weight format
    LatentMoE: accuracy per FLOP and per parameter — inference-aware design
    Gaps
    No KV cache papers — MLA, vector-quantized KV, or KV eviction work absent
    JL

    Jiashi Li

    low hireability
    48
    KV Cache Optimization92
    Inference-Aware Architecture68
    Weight Compression18
    Weight Streaming Efficiency12
    Strengths
    deepseek-ai/FlashMLA top maintainer — MLA CUDA kernels for V3/V3.2-Exp
    FP8 KV cache quantization with per-token scale — novel quantization scheme
    Gaps
    No evidence of weight streaming / bandwidth reduction work
    JL

    Ji Lin

    low hireability

    Research Scientist@Meta

    Previously: Member of Technical Staff @ OpenAI

    San Francisco, US

    60
    Weight Compression93
    Inference-Aware Architecture78
    Weight Streaming Efficiency52
    KV Cache Optimization15
    Strengths
    AWQ: MLSys 2024 Best Paper — activation-aware weight quantization
    SmoothQuant (1502 citations) — LLM PTQ reducing per-channel variance
    Gaps
    No published work on KV cache (MLA, KV eviction, vector-quantized KV)
    JZ

    Jimmy Zhou

    low hireability
    58
    Inference-Aware Architecture78
    Weight Compression70
    KV Cache Optimization55
    Weight Streaming Efficiency30
    Strengths
    FMHAv2 paged KV cache integration (PR#2841, PR#2446) — direct KV cache work
    MXINT4 / NVFP4 / W4A8 MoE quantization in FlashInfer production kernels
    Gaps
    Engineer, not researcher — implements formats (MXINT4, FP4) vs. inventing compression methods
    JH

    Joel Hestness

    low hireability

    Research Scientist@Cerebras

    Previously: Co-founder, Board Member @ 3 Day Startup

    San Francisco, US

    39
    Inference-Aware Architecture80
    Weight Compression48
    Weight Streaming Efficiency18
    KV Cache Optimization8
    Strengths
    Cerebras-GPT: compute-optimal LLMs designed for wafer-scale hardware (118 citations)
    CompleteP (2025): hardware-aware model shapes for efficient inference on CS-3
    Gaps
    No KV cache work — MLA, KV eviction, or vector-quantized KV not in profile
    JF

    Jonathan Frankle

    low hireability

    Chief AI Scientist@Databricks

    Previously: Chief Scientist @ MosaicML

    New York, US

    38
    Weight Compression80
    Weight Streaming Efficiency35
    Inference-Aware Architecture30
    KV Cache Optimization5
    Strengths
    Lottery Ticket Hypothesis — foundational weight sparsity/compression paper (9K+ citations)
    15 structured/magnitude pruning papers — specialist depth in weight compression
    Gaps
    No KV cache work (MLA, eviction, quantization) found
    JD

    Jordan Dotzel

    low hireability

    Student Researcher@Google

    Previously: Software Engineer @ Datto

    San Francisco, US

    45
    Weight Compression88
    Inference-Aware Architecture55
    Weight Streaming Efficiency30
    KV Cache Optimization8
    Strengths
    FLIQS: AutoML 2024 Best Paper, mixed-precision LLM quantization
    Learning from Students (ICML 2024): t-distribution LLM weight formats
    Gaps
    No KV cache, MLA, or KV eviction work found
    KY

    Kaichao You

    low hireability

    Core maintainer@vllm-project

    Beijing, CN

    35
    KV Cache Optimization75
    Inference-Aware Architecture50
    Weight Streaming Efficiency10
    Weight Compression5
    Strengths
    vLLM core maintainer — PagedAttention KV cache at production scale
    Jenga (SOSP 2025): memory mgmt for heterogeneous LLM inference
    Gaps
    No work on weight compression or microscaling
    KM

    Kaoutar El Maghraoui

    low hireability

    Principal Research Scientist and Manager@IBM

    Previously: Principal Research Staff Member @ IBM

    New York, US

    62
    Inference-Aware Architecture82
    Weight Compression72
    KV Cache Optimization65
    Weight Streaming Efficiency28
    Strengths
    2025 paper: dynamic KV cache placement for LLM inference in heterogeneous memory
    2025 paper: paged+flex attention for long-context inference efficiency
    Gaps
    KV cache work is placement/paging — no evidence of MLA, vector-quantized KV, or eviction policies
    KK

    Kurt Keutzer

    low hireability

    Co-Founder and Strategic Advisor@SigIQ.ai

    Previously: Chief Strategy Officer (CSO) @ Nexusflow

    San Francisco, US

    83
    Weight Compression95
    KV Cache Optimization90
    Inference-Aware Architecture85
    Weight Streaming Efficiency60
    Strengths
    KVQuant (296 citations) — KV cache quantization to 10M context at inference
    AI and Memory Wall (376 citations) — maps weight streaming as LLM decode bottleneck
    Gaps
    No explicit MLA / multi-head latent attention work found
    LW

    Laura Wang

    low hireability
    26
    Inference-Aware Architecture75
    Weight Streaming Efficiency20
    Weight Compression5
    KV Cache Optimization5
    Strengths
    KernelFalcon lead author — 124 PRs, 100% correctness on KernelBench L1-L3
    CuTeDSL RMSNorm/LayerNorm kernels for Blackwell SM100/SM103 inference chips
    Gaps
    No KV cache work — MLA, KV eviction, or vector-quantized KV cache not in evidence
    LZ

    Lianmin Zheng

    low hireability

    Member of Technical Staff@xAI

    Previously: Applied Scientist Intern @ Amazon

    San Francisco, US

    71
    KV Cache Optimization88
    Weight Streaming Efficiency78
    Inference-Aware Architecture76
    Weight Compression42
    Strengths
    H2O (NeurIPS 2023): KV eviction oracle, widely cited
    FlexGen: weight I/O scheduling for memory-constrained GPU inference
    Gaps
    No published work on MLA or vector-quantized KV cache specifically
    LZ

    Li Lyna Zhang

    low hireability

    Partner Architect, Core AI@Microsoft

    Previously: Senior Staff Research Scientist/Senior Manager @ Google

    San Francisco, US

    58
    Weight Compression88
    Inference-Aware Architecture82
    Weight Streaming Efficiency50
    KV Cache Optimization12
    Strengths
    VPTQ: 2-bit vector quantization, 1.6–1.8× LLM throughput (EMNLP 2024)
    SpaceEvo: hardware-friendly INT8 inference NAS — lead author
    Gaps
    No KV cache work — no MLA, KV eviction, or vector-quantized KV papers
    LX

    Lin Xiao

    low hireability

    Research Scientist@Meta

    Previously: Senior Principal Researcher @ Microsoft

    Seattle, US

    22
    Weight Compression72
    Inference-Aware Architecture10
    KV Cache Optimization3
    Weight Streaming Efficiency3
    Strengths
    PARQ (ICML 2025): principled LLM weight quantization with optimization guarantees
    BiT (NeurIPS 2022): 1-bit binarized transformer — extreme compression benchmark
    Gaps
    No KV cache, MLA, or KV eviction work
    MC

    Manuel Candales

    low hireability
    51
    Weight Compression75
    Inference-Aware Architecture68
    Weight Streaming Efficiency42
    KV Cache Optimization20
    Strengths
    2/3/4-bit Metal quantized linear kernels in pytorch/ao — weight compression core
    GEMV qmv_fast kernels for batch=1 decode — decode-mode bandwidth reduction
    Gaps
    No MLA, vector-quantized KV, or KV eviction work
    MM

    Mayank Mishra

    low hireability

    Graduate Student Researcher@University of California, Berkeley

    Previously: Research Engineer-II @ MIT-IBM Watson AI Lab

    Berkeley, US

    55
    KV Cache Optimization85
    Inference-Aware Architecture78
    Weight Compression35
    Weight Streaming Efficiency20
    Strengths
    Cross-Layer Attention (NeurIPS 2024) — first-authored KV cache reduction, 71 cites
    FlashFormer (2025) — whole-model kernels for efficient low-batch inference
    Gaps
    No explicit power/bandwidth chip co-design or hardware constraint modeling
    MG

    Michael Goin

    low hireability
    62
    Weight Compression88
    Weight Streaming Efficiency65
    KV Cache Optimization50
    Inference-Aware Architecture45
    Strengths
    llm-compressor lead — quantization + sparsity for LLM deployment
    neuralmagic/compressed-tensors 11 commits — structured compression formats
    Gaps
    No direct papers on MLA, vector-quantized KV cache, or KV eviction
    MP

    Michael Poli

    low hireability

    Co-founder@Radical Numerics

    Previously: Founding Scientist @ Liquid AI

    San Francisco, US

    59
    KV Cache Optimization82
    Inference-Aware Architecture75
    Weight Compression48
    Weight Streaming Efficiency32
    Strengths
    Hyena + StripedHyena co-author — eliminates KV cache entirely
    vortex: inference framework for multi-hybrid architectures
    Gaps
    No explicit hardware-aware power/bandwidth optimization work
    MT

    Mingxing Tan

    low hireability

    Research Director@Waymo

    Previously: Research Scientist / TLM @ Google

    San Francisco, US

    33
    Inference-Aware Architecture90
    Weight Compression20
    Weight Streaming Efficiency15
    KV Cache Optimization5
    Strengths
    EfficientNet + MnasNet: foundational hardware-constrained NAS (31K + 4K citations)
    EfficientNet-EdgeTPU: architecture co-designed for TPU inference accelerator
    Gaps
    No KV cache work — no MLA, vector-quantized KV, or eviction strategy research
    MA

    Mohamed S. Abdelfattah

    low hireability

    Co-Founder and Chief Science Officer@Mako

    Previously: Principal Scientist @ Samsung

    New York, US

    86
    KV Cache Optimization95
    Inference-Aware Architecture95
    Weight Compression90
    Weight Streaming Efficiency65
    Strengths
    xKV (2025): cross-layer SVD KV-cache compression — directly on-target
    Palu (2025): low-rank projection KV-cache compression
    Gaps
    Co-founder of Makora — a competing inference chip startup, direct conflict
    MS

    Mohammad Shoeybi

    low hireability

    Senior Director of Applied Research@NVIDIA

    Previously: Senior Research Engineer - Tech Lead @ DeepMind

    San Francisco, US

    61
    Weight Compression82
    Inference-Aware Architecture78
    Weight Streaming Efficiency62
    KV Cache Optimization22
    Strengths
    FP8 Formats for Deep Learning (2022, 283 cit.) — microscaling weight precision
    NVFP4 pretraining (2025) — 4-bit floating point, fewer bits streamed per weight
    Gaps
    No direct KV cache optimization papers (MLA, vector-quantized KV, KV eviction)
    …click to see all
    OR

    Olatunji Ruwase

    low hireability
    67
    Weight Compression88
    Weight Streaming Efficiency88
    Inference-Aware Architecture78
    KV Cache Optimization15
    Strengths
    FP6-LLM: Tensor Core co-design for FP6 inference (USENIX ATC 2024)
    ZeroQuant(4+2): FP4/FP6 extreme LLM compression strategy
    Gaps
    No KV cache work — no MLA, KV eviction, or vector-quantized KV papers
    …click to see all
    PM

    Pavlo Molchanov

    low hireability

    Director of Research@NVIDIA

    Previously: Distinguished Scientist and Manager @ NVIDIA

    San Francisco, US

    76
    Weight Compression95
    Inference-Aware Architecture88
    Weight Streaming Efficiency65
    KV Cache Optimization55
    Strengths
    'Importance Estimation for Pruning' (1379 cites) — foundational weight compression
    HALP + Structural Pruning via Latency-Saliency: hardware-latency-aware pruning
    Gaps
    No published work on MLA or vector-quantized KV cache specifically
    …click to see all
    PT

    Po-An Tsai

    low hireability

    Senior Research Scientist@NVIDIA

    Previously: Research Scientist @ NVIDIA

    US

    75
    Inference-Aware Architecture85
    KV Cache Optimization80
    Weight Compression75
    Weight Streaming Efficiency60
    Strengths
    RocketKV (ICML 2025): two-stage KV cache compression — direct axis hit
    ISCA-53 2026: data movement forecasting for MoE LLM serving — weight streaming
    Gaps
    Very recently promoted to Principal (April 2026) — unlikely to be looking
    …click to see all
    PP

    Priyadarshini Panda

    low hireability

    Visiting Faculty@DeepMind

    Previously: Assistant Professor @ Yale University

    Los Angeles, US

    67
    Inference-Aware Architecture85
    Weight Compression82
    Weight Streaming Efficiency62
    KV Cache Optimization40
    Strengths
    MEADOW (2025): memory-efficient dataflow/data packing for low-power edge LLMs
    TesseraQ (2025): ultra low-bit LLM PTQ — block reconstruction for extreme compression
    Gaps
    No direct MLA, vector-quantized KV cache, or KV eviction work — KV angle is hardware noise mitigation
    …click to see all
    QS

    qsang-nv

    low hireability
    51
    KV Cache Optimization88
    Inference-Aware Architecture82
    Weight Compression20
    Weight Streaming Efficiency12
    Strengths
    XQA MLA backend in FlashInfer — direct multi-head latent attention KV kernel
    FP8 KV cache + tensor scale for XQA — quantized KV compression
    Gaps
    No evidence of weight compression / extreme quantization work
    …click to see all
    RP

    Raghu Prabhakar

    low hireability

    Engineering@SambaNova Systems

    Previously: Software Engineer @ NVIDIA

    San Francisco, US

    57
    Inference-Aware Architecture92
    Weight Streaming Efficiency82
    Weight Compression42
    KV Cache Optimization10
    Strengths
    'SambaNova SN40L: Scaling the AI Memory Wall' — 32 citations, weight streaming focus
    ISSCC 2025 SN40L: 5nm chip, 3-tier memory hierarchy for inference
    Gaps
    No published work on KV cache optimization (MLA, KV eviction, vector-quantized KV)
    …click to see all
    RK

    Raghuraman Krishnamoorthi

    low hireability

    Technical Lead Manager@Meta

    Previously: Software Engineer @ Meta

    San Francisco, US

    56
    Weight Compression93
    Inference-Aware Architecture62
    KV Cache Optimization45
    Weight Streaming Efficiency22
    Strengths
    "Quantizing deep convolutional networks" whitepaper — 1489 citations, field-defining
    Leads Meta torch.ao — production-scale quantization framework for PyTorch
    Gaps
    KV eviction, MLA, vector-quantized KV cache — not in portfolio
    …click to see all
    SK

    Sanjiv Kumar

    low hireability

    VP, Google Fellow@DeepMind

    New York, US

    54
    Weight Compression80
    Inference-Aware Architecture65
    Weight Streaming Efficiency50
    KV Cache Optimization20
    Strengths
    Weighted quantization patent (2025) — direct weight compression work
    Spark Transformer: sparsity in FFN and attention (NeurIPS 2025)
    Gaps
    No direct KV cache eviction, MLA, or vector-quantized KV cache work found
    …click to see all
    SR

    Scott Roy

    low hireability
    53
    Weight Compression85
    Inference-Aware Architecture65
    Weight Streaming Efficiency45
    KV Cache Optimization15
    Strengths
    140+ commits to pytorch/ao — HQQ, PARQ, INT4/INT8, LUT quantization
    Improved HQQ scale-only (Apr 2026) — per-group max-error fallback
    Gaps
    No KV cache research: no MLA, vector-quantized KV, or eviction work found
    …click to see all
    SZ

    Sebastian Zhao

    low hireability

    Research Assistant@Berkeley Artificial Intelligence Research

    Previously: ML Research Intern @ Berkeley Artificial Intelligence Research

    Berkeley, US

    46
    KV Cache Optimization72
    Inference-Aware Architecture55
    Weight Compression42
    Weight Streaming Efficiency15
    Strengths
    Multipole Attention (NeurIPS 2025): directly targets KV cache pressure
    Custom Triton/CUDA kernels for attention — practical inference experience
    Gaps
    No dedicated MLA or vector-quantized KV cache paper
    …click to see all
    SK

    Sehoon Kim

    low hireability

    Member of Technical Staff@xAI

    Previously: Machine Learning Engineer @ Narada

    US

    86
    Weight Compression95
    KV Cache Optimization95
    Inference-Aware Architecture80
    Weight Streaming Efficiency72
    Strengths
    KVQuant (317 citations): KV cache quantization to sub-2-bit for long context inference
    SqueezeLLM (320 citations): dense-sparse LLM weight quantization reducing memory bandwidth
    Gaps
    ~14 months at xAI — low hireability; no open-to-work signals
    …click to see all
    SS

    Sheng Shen

    low hireability

    Member of Technical Staff@xAI

    Previously: Research Scientist @ Meta

    San Francisco, US

    46
    Weight Compression90
    Weight Streaming Efficiency50
    Inference-Aware Architecture35
    KV Cache Optimization10
    Strengths
    SqueezeLLM (ICML 2023): dense-sparse quant enabling 6GB LLM serving, in vLLM
    Q-BERT (AAAI 2020): Hessian-based ultra-low precision quantization
    Gaps
    No KV cache optimization work (MLA, KV eviction, vector-quantized KV)
    …click to see all
    SC

    Shijie Cao

    low hireability

    Senior Researcher@Microsoft

    Previously: Senior Researcher @ Microsoft Research Asia

    86
    Inference-Aware Architecture92
    Weight Compression90
    KV Cache Optimization88
    Weight Streaming Efficiency72
    Strengths
    BitDecoding (HPCA 2026): low-bit KV cache + Tensor Core hardware co-design
    T-MAC (EuroSys 2025): LUT-based NPU inference, direct chip co-design
    Gaps
    Just joined Xiaomi MiMo Feb 2026 — only ~3 months in new role, low hireability
    …click to see all
    SL

    Shiwei Liu

    low hireability

    PI@ELLIS Institute Tübingen

    Previously: Royal Society Newton International Fellow @ University of Oxford

    Tübingen, DE

    61
    Weight Compression92
    KV Cache Optimization72
    Weight Streaming Efficiency50
    Inference-Aware Architecture30
    Strengths
    OWL (107 citations): pruning LLMs to extreme sparsity — defines weight compression research
    Q-hitter: sparse-quantized KV cache eviction — direct KV cache optimization
    Gaps
    No inference-chip co-design — not designing models for specific power/bandwidth constraints
    …click to see all
    SF

    Siyuan Fu (Lain)

    low hireability
    46
    Weight Compression68
    KV Cache Optimization42
    Weight Streaming Efficiency40
    Inference-Aware Architecture32
    Strengths
    MXFP4 / NVFP4 block-scale MoE kernels — production FP4 weight compression at NVIDIA
    FP8 MLA quant in vllm (PR #29795 merged) — direct KV cache compute optimization
    Gaps
    Individual-contributor kernel engineer, not an architecture researcher — no model design work found
    …click to see all
    SH

    Song Han

    low hireability

    Researcher@NVIDIA

    Previously: Assistant Professor @ MIT

    90
    Weight Compression99
    Inference-Aware Architecture95
    KV Cache Optimization92
    Weight Streaming Efficiency75
    Strengths
    AWQ: MLSys 2024 Best Paper — defines activation-aware weight quantization
    SmoothQuant: 1487 citations — gold standard post-training quantization
    Gaps
    Tenured MIT professor — rarely leaves; high retention likelihood
    …click to see all
    SY

    Songlin Yang

    low hireability

    Member of Technical Staff@Thinking Machines Lab

    Previously: Member of Technical Staff @ Thinking Machines Lab

    San Francisco, US

    61
    KV Cache Optimization88
    Inference-Aware Architecture80
    Weight Streaming Efficiency65
    Weight Compression10
    Strengths
    GLA (277 citations): replaces KV cache with O(1) recurrent state
    FLA Triton library: hardware-efficient linear attention CUDA kernels
    Gaps
    No work on weight compression, quantization, or microscaling
    …click to see all
    SV

    Stylianos Venieris

    low hireability

    Head of Distributed AI Group / Senior Research Scientist@Samsung

    Previously: Researcher @ Samsung

    Cambridge, GB

    58
    Inference-Aware Architecture80
    Weight Compression70
    Weight Streaming Efficiency65
    KV Cache Optimization15
    Strengths
    Hardware-Aware Parallel Prompt Decoding (2025) — sparse-tree decoding co-designed with hardware constraints for LLM inference
    Progressive Mixed-Precision Decoding (2025) — phase-aware quantization for LLM decode
    Gaps
    No KV cache optimization work (MLA, KV eviction, vector-quantized KV) found
    …click to see all
    SR

    Supriya Rao

    low hireability
    39
    Weight Compression72
    Inference-Aware Architecture50
    Weight Streaming Efficiency30
    KV Cache Optimization5
    Strengths
    TorchAO paper (2025): end-to-end quantization for inference serving
    2:4 activation sparsity paper: hardware-structured sparsity for inference
    Gaps
    No KV cache work (MLA, vector quantization, KV eviction) found
    …click to see all
    TC

    Tianlong Chen

    low hireability

    Chief AI Scientist@hireEZ

    Previously: Postdoctoral Researcher @ MIT

    Austin, US

    64
    Weight Compression82
    KV Cache Optimization65
    Weight Streaming Efficiency60
    Inference-Aware Architecture50
    Strengths
    FIER (2025): KV cache retrieval for long-context LLM inference — direct axis hit
    MC-SMoE ICLR'24 Spotlight — merge+compress MoE weight compression
    Gaps
    No explicit hardware-chip co-design (ASIC/inference chip constraint modeling)
    …click to see all
    TC

    Tianqi Chen

    low hireability

    Researcher@NVIDIA

    Previously: CTO @ OctoML

    79
    Inference-Aware Architecture95
    Weight Compression85
    KV Cache Optimization72
    Weight Streaming Efficiency65
    Strengths
    FlashInfer: block-sparse KV-cache format, 29-69% inter-token latency reduction
    Apache TVM creator — definitive hardware-aware inference compiler (2812 cites)
    Gaps
    KV cache work is framework-level — not MLA, vector-quantized KV, or KV eviction research specifically
    …click to see all
    TZ

    Tianyi Zhang

    low hireability

    AI Research Scientist@Workato

    Previously: Co-Founder & CTO @ xMAD.ai

    San Francisco, US

    71
    Weight Compression92
    KV Cache Optimization85
    Inference-Aware Architecture70
    Weight Streaming Efficiency35
    Strengths
    "KV Cache is 1 Bit Per Channel" (NeurIPS 2024) — direct KV quantization work
    DFloat11: 70%-size lossless LLM compression for GPU inference (NeurIPS '25)
    Gaps
    No explicit weight streaming/bandwidth reduction work
    …click to see all
    TD

    Tri Dao

    low hireability

    Assistant Professor@Princeton University

    Previously: PhD Student @ Stanford University

    Princeton, US

    73
    Inference-Aware Architecture98
    KV Cache Optimization97
    Weight Streaming Efficiency55
    Weight Compression42
    Strengths
    FlashAttention 1/2/3 — defining KV cache IO-aware attention kernels
    Mamba (6K+ citations) — O(1) inference KV cache via selective state spaces
    Gaps
    Dual Assistant Professor and chief-scientist roles make recruiting extremely difficult
    …click to see all
    US

    Utkarsh Saxena

    low hireability

    Member of Technical Staff@AMD

    Previously: Graduate Research Assistant @ Purdue University

    San Francisco, US

    64
    KV Cache Optimization90
    Weight Compression80
    Inference-Aware Architecture70
    Weight Streaming Efficiency15
    Strengths
    KVLinC (2025): 2.55× KV cache inference speedup vs FlashAttention
    Eigen Attention (2024): 40% KV cache reduction via low-rank attention
    Gaps
    No direct weight streaming / decode bandwidth reduction work
    …click to see all
    VK

    Vasiliy Kuznetsov

    low hireability
    56
    Weight Compression92
    Weight Streaming Efficiency65
    Inference-Aware Architecture62
    KV Cache Optimization5
    Strengths
    #2 torchao contributor (333 commits) — weight quantization core
    NVFP4 + GPTQ for MoE — microscaling inference chip formats
    Gaps
    No KV cache optimization work (MLA, eviction, vector-quant KV)
    …click to see all
    WZ

    Wenqian Zhao

    low hireability

    Researcher@Huawei

    Previously: PhD student @ The Chinese University of Hong Kong

    47
    Weight Compression78
    Inference-Aware Architecture60
    Weight Streaming Efficiency45
    KV Cache Optimization5
    Strengths
    BiE: hardware-friendly block floating-point for LLM quantization (2024)
    HAPE: hardware-aware LLM pruning for on-device inference (2025)
    Gaps
    No KV cache work — MLA, KV eviction, or vector-quantized KV cache absent
    …click to see all
    WL

    Wuwei Lin

    low hireability

    Researcher@OpenAI

    Previously: Researcher @ NVIDIA

    43
    Inference-Aware Architecture82
    KV Cache Optimization68
    Weight Compression18
    Weight Streaming Efficiency5
    Strengths
    FlashInfer (MLSys 2025 Outstanding Paper) — KV block-sparse attention for LLM serving
    KV cache composable formats — directly targets KV memory footprint
    Gaps
    No weight streaming or bandwidth-reduction research
    …click to see all
    XM

    Xiangxi Mo

    low hireability

    PhD student@Berkeley Sky Computing Lab

    Previously: Model Serving System @ Anyscale

    Berkeley, US

    44
    KV Cache Optimization92
    Inference-Aware Architecture70
    Weight Streaming Efficiency10
    Weight Compression5
    Strengths
    PagedAttention (vLLM): invented KV cache paging — gold standard
    JENGA (2025): heterogeneous KV cache memory management
    Gaps
    Active startup founder (Inferact, $150M raised Jan 2026) — hard to recruit
    …click to see all
    XY

    Xianzhi Yu

    low hireability

    Researcher@Huawei

    Previously: Intern @ Sugon

    Beijing, CN

    70
    KV Cache Optimization85
    Inference-Aware Architecture80
    Weight Compression78
    Weight Streaming Efficiency35
    Strengths
    SVDq (2025): 410x KV key cache compression at 1.25 bits
    FlatQuant (ICLR 2025): flatness-aware quantization, 22 citations
    Gaps
    No direct evidence on weight streaming bandwidth reduction
    …click to see all
    XH

    Xiaodong (Vincent) Huang

    low hireability
    43
    Weight Compression60
    Inference-Aware Architecture50
    Weight Streaming Efficiency40
    KV Cache Optimization20
    Strengths
    mm_fp4 implementation (cuDNN + CUTLASS) in FlashInfer — 4-bit inference
    FP8 BMM optimization with cluster shapes for low-precision GEMM
    Gaps
    No published research papers — engineer, not architecture researcher
    …click to see all
    XL

    Xing Li

    low hireability

    Researcher@Huawei

    CN

    41
    KV Cache Optimization88
    Weight Compression42
    Inference-Aware Architecture28
    Weight Streaming Efficiency5
    Strengths
    KVTuner ICML 2025: layer-wise mixed-precision KV cache quantization
    SVDq: 1.25-bit, 410x KV cache compression — extreme compression research
    Gaps
    No evidence of hardware-aware architecture co-design (chip constraints, power budgets)
    …click to see all
    XL

    Xiuyu Li

    low hireability

    PhD candidate@Berkeley AI Research (BAIR) at UC Berkeley

    Previously: Research Consultant @ Together AI

    San Francisco, US

    52
    Weight Compression88
    Inference-Aware Architecture48
    Weight Streaming Efficiency42
    KV Cache Optimization28
    Strengths
    SqueezeLLM (321 cit): dense-sparse LLM quantization, core identity
    SVDQuant: 4-bit models via low-rank SVD + outlier absorption
    Gaps
    No direct KV cache architecture work (MLA, eviction strategies)
    …click to see all
    XZ

    Xiyou Zhou

    low hireability
    70
    Inference-Aware Architecture82
    Weight Compression72
    KV Cache Optimization65
    Weight Streaming Efficiency60
    Strengths
    Apple Intelligence 3B model: KV-cache sharing + 2-bit QAT for Apple silicon
    Parallel Track Transformers: 16x sync reduction, 15-30% TTFT gain (Feb 2026)
    Gaps
    No deep MLA, vector-quantized KV cache, or KV eviction work found
    …click to see all
    YC

    Yanan Cao

    low hireability
    34
    Weight Compression60
    Inference-Aware Architecture45
    Weight Streaming Efficiency25
    KV Cache Optimization5
    Strengths
    pytorch/ao quantization & sparsity — 7 merged PRs
    fp8 scaled_mm kernel with per-platform configs (H100/B200)
    Gaps
    No KV cache optimization work (MLA, eviction) found
    …click to see all
    YL

    Yejing Lai

    low hireability
    60
    Weight Compression80
    Inference-Aware Architecture70
    Weight Streaming Efficiency55
    KV Cache Optimization35
    Strengths
    MXFP4 block quant kernel on Intel BMG GPU (vllm-xpu-kernels #194, 2026)
    MXFP8/fp8 block quant kernel on BMG — microscaling directly relevant
    Gaps
    No published research; implementation engineer, not architect
    …click to see all
    YS

    Yikang Shen

    low hireability

    Member of Technical Staff@xAI

    Previously: Staff Research Scientist @ IBM

    San Francisco, US

    56
    KV Cache Optimization82
    Inference-Aware Architecture80
    Weight Streaming Efficiency48
    Weight Compression15
    Strengths
    GLA (2024, 262 cit): replaces KV cache with fixed-size recurrent state — KV-free inference
    FlashFormer (2025): whole-model kernel fusion for memory-bandwidth-limited inference
    Gaps
    No weight compression work (microscaling, topological regularization, quantization research)
    …click to see all
    YZ

    Yineng Zhang

    low hireability

    Principal AI Researcher@Together AI

    Previously: Lead Software Engineer @ Baseten

    San Francisco, US

    55
    KV Cache Optimization78
    Inference-Aware Architecture68
    Weight Compression55
    Weight Streaming Efficiency20
    Strengths
    Mooncake: KVCache-centric disaggregated serving (ACM ToS 2025)
    FlashInfer 34 commits — top KV cache attention engine
    Gaps
    Software/serving-layer focus, not hardware chip co-design
    …click to see all
    YS

    Ying Sheng

    low hireability

    Co-Founder (CEO)@RadixArk

    Previously: Member of Technical Staff @ xAI

    San Francisco, US

    71
    KV Cache Optimization92
    Inference-Aware Architecture83
    Weight Streaming Efficiency78
    Weight Compression32
    Strengths
    H2O: heavy-hitter oracle for KV cache eviction — NeurIPS-published KV eviction research
    Double Sparsity: sparse attention cutting KV cache at post-training
    Gaps
    No dedicated weight compression work (MicroScaling, topological regularization)
    …click to see all
    YL

    Yingyan Celine Lin

    low hireability

    Visiting Professor@NVIDIA

    Previously: Assistant Professor @ Rice University

    Atlanta, US

    78
    Inference-Aware Architecture92
    KV Cache Optimization80
    Weight Compression78
    Weight Streaming Efficiency62
    Strengths
    LaCache (2025): direct KV eviction research for long-context LLMs
    Hymba (2025): hybrid-head arch cutting KV cache via SSM heads (NVIDIA)
    Gaps
    Tenured Associate Professor at Georgia Tech — low probability of full-time departure
    …click to see all
    YT

    Yuandong Tian

    low hireability

    Co-Founder@Stealth AI Startup

    Previously: Research Director @ Meta

    San Francisco, US

    79
    KV Cache Optimization95
    Weight Compression82
    Inference-Aware Architecture75
    Weight Streaming Efficiency65
    Strengths
    H2O (658 citations) — co-invented KV eviction for generative inference
    StreamingLLM (982 citations) — attention sinks for unbounded KV streaming
    Gaps
    Currently co-founding stealth startup — low availability for hire
    …click to see all
    YL

    Yuhong Li

    low hireability

    Engineer@xAI

    Previously: Foundation Models Team @ Apple

    New York, US

    53
    KV Cache Optimization92
    Inference-Aware Architecture65
    Weight Compression35
    Weight Streaming Efficiency18
    Strengths
    SnapKV: core author (18 commits, 314 citations) — canonical KV eviction paper
    Medusa: inference acceleration via multiple decoding heads (435 citations)
    Gaps
    No direct weight compression or microscaling/topological regularization work
    …click to see all
    YX

    Yuhui Xu

    low hireability

    Research Scientist@Google

    Previously: Research Scientist @ Salesforce

    AU

    55
    Weight Compression82
    KV Cache Optimization72
    Inference-Aware Architecture42
    Weight Streaming Efficiency22
    Strengths
    ThinK: query-driven KV cache pruning (2024, 38 citations)
    QA-LoRA: quantization-aware LoRA for LLMs (2023, 245 citations)
    Gaps
    No MLA, KV eviction, or vector-quantized KV cache work found
    …click to see all
    YL

    Yujun Lin

    low hireability

    AI Research Scientist@NVIDIA

    Previously: Research Assistant @ Massachusetts Institute of Technology

    Boston, US

    84
    Weight Compression95
    KV Cache Optimization90
    Inference-Aware Architecture87
    Weight Streaming Efficiency65
    Strengths
    QServe W4A8KV4 co-author — 4-bit KV cache quantization for LLM inference
    LServe: sparse attention serving reduces active KV (KV eviction-like)
    Gaps
    No explicit weight-streaming-bandwidth work — addressed implicitly via W4 quantization
    …click to see all
    YW

    Yunhe Wang

    low hireability

    Head of Huawei Applied AI Lab / Senior Researcher@Huawei

    Previously: PhD student @ Peking University

    60
    Weight Compression95
    Inference-Aware Architecture80
    Weight Streaming Efficiency50
    KV Cache Optimization15
    Strengths
    Pangu Ultra (2025): LLM architecture designed for Ascend NPU inference constraints
    GhostNet CVPR2020: cheap-op weight reuse reduces memory bandwidth significantly
    Gaps
    No KV cache / MLA / KV-eviction papers found
    …click to see all
    ZW

    Zekun Wang

    low hireability

    Researcher@Alibaba

    Previously: PhD student @ Harbin Institute of Technology

    US

    38
    Weight Compression65
    Inference-Aware Architecture40
    KV Cache Optimization30
    Weight Streaming Efficiency15
    Strengths
    NeurIPS 2025 Best Paper (Gated Attention) — sparsity, attention-sink-free design
    CFSP: activation-aware structured pruning for LLMs (weight compression)
    Gaps
    No direct MLA, vector-quantized KV, or KV eviction work found
    …click to see all
    ZJ

    Ziheng Jiang

    low hireability

    AI Researcher@Meta

    Previously: Principal Research Scientist @ ByteDance

    Seattle, US

    41
    Inference-Aware Architecture88
    Weight Streaming Efficiency45
    Weight Compression20
    KV Cache Optimization10
    Strengths
    TVM co-author (2511 citations) — gold standard hardware-aware ML compilation
    VTA + hardware-SW blueprint: explicit inference chip co-design work
    Gaps
    No KV cache work found (MLA, vector-quantized KV, eviction strategies)
    …click to see all

    Runs

    #1 · completed · 0 qualified / 0 found · May 7, 1:24 PM