
Neuralace · Post-Training Researcher (LLM)

Completed · 370 qualified · 1 run · May 7, 1:30 PM · company-name-neuralace-sabi-locations-usa-europe-china-india-1778160600
Parsed: Neuralace · 5 topics · Researcher · USA, Europe, China, India

    Qualified Candidates (357)

Aakriti Agrawal (high hireability)
Research Assistant @ University of Maryland
Previously: Research Internship @ Capital One
Hyattsville, US
Overall: 24
Evals & Reward Models: 50 · RLHF / RLVR: 35 · Synthetic Data & Self-Play: 20 · Personality & Non-Verifiable Rewards: 10 · LLM Creativity: 5
Strengths:
  EnsemW2S: weak-to-strong generalization via LLM ensembles
  Easy2Hard-Bench: LLM eval difficulty labeling (NeurIPS 2024)
Gaps:
  No direct RLHF/RLVR training runs (PPO, DPO, GRPO) — only adjacent alignment work

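For context on the "PPO, DPO, GRPO" gap flagged across several cards: DPO trains the policy directly from preference pairs, with no separate reward model or RL loop. A minimal illustrative sketch of the per-pair DPO loss on summed sequence log-probabilities (pure Python; `beta` and the log-prob values below are made-up assumptions, not from any candidate's work):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # Implicit reward margin: how much the policy favors the chosen
    # response over the rejected one, relative to the reference.
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # Negative log-sigmoid of the margin (binary logistic loss).
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# The loss shrinks as the policy puts more mass on the chosen response.
easy = dpo_loss(-10.0, -20.0, -15.0, -15.0)  # policy already prefers chosen
hard = dpo_loss(-20.0, -10.0, -15.0, -15.0)  # policy prefers rejected
```

At a zero margin the loss is log 2; it decreases monotonically as the policy's preference for the chosen response grows.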
Ameya Prabhu (high hireability)
Postdoctoral Researcher @ University of Tuebingen
Previously: Machine Learning Intern @ Intel
Tübingen, DE
Overall: 38
Evals & Reward Models: 78 · RLHF / RLVR: 68 · Synthetic Data & Self-Play: 28 · Personality & Non-Verifiable Rewards: 10 · LLM Creativity: 5
Strengths:
  verl-tool: RLVR framework fork for diverse tool use (pinned repo)
  LinkedIn headline calls out 'RL Post-training' as primary focus
Gaps:
  No published work on personality, tone, or non-verifiable reward modeling

Andrew Lee (high hireability)
Postdoc @ Harvard University
Previously: Research Scientist Intern (FAIR) @ Meta
Ann Arbor, US
Overall: 45
RLHF / RLVR: 72 · Personality & Non-Verifiable Rewards: 62 · LLM Creativity: 35 · Evals & Reward Models: 32 · Synthetic Data & Self-Play: 22
Strengths:
  Pairwise Cringe Loss (2023, 119 cit.) — preference optimization for LLMs
  ICML 2024 Oral: mechanistic understanding of DPO alignment
Gaps:
  No tool-call, agentic, or RLVR work

Boshi Wang (high hireability)
PhD Student @ The Ohio State University
Previously: Research Intern @ Microsoft
Overall: 21
Evals & Reward Models: 55 · Synthetic Data & Self-Play: 25 · RLHF / RLVR: 15 · LLM Creativity: 5 · Personality & Non-Verifiable Rewards: 5
Strengths:
  Mind2Web (692 cit.) — pioneering web-agent eval benchmark
  Tool Learning via Simulated Trial and Error (ACL 2024) — tool-call training angle
Gaps:
  No RLHF, DPO, PPO, or RLVR post-training work

Carel van Niekerk (high hireability)
Postdoctoral Researcher @ Heinrich-Heine University
Previously: Doctoral Candidate and Research Scientist @ Heinrich-Heine University
Düsseldorf, DE
Overall: 57
Synthetic Data & Self-Play: 75 · Personality & Non-Verifiable Rewards: 72 · RLHF / RLVR: 68 · Evals & Reward Models: 50 · LLM Creativity: 20
Strengths:
  "Post-training LLMs via RL from Self-Feedback" (2025) — direct RLVR post-training paper
  RLSF (2024) — RL from self-feedback for reasoning
Gaps:
  No published work on tool use, code gen, or agentic tasks

Chi Han (high hireability)
Graduate Student @ University of Illinois Urbana-Champaign
Urbana, US
Overall: 27
Personality & Non-Verifiable Rewards: 60 · RLHF / RLVR: 20 · Evals & Reward Models: 20 · Synthetic Data & Self-Play: 20 · LLM Creativity: 15
Strengths:
  LM-Steer (ACL 2024 Outstanding) — embedding-based LLM behavior/personality steering
  Tool Learning with Foundation Models — 438 citations, core tool-call work
Gaps:
  No direct RLHF/PPO/DPO/GRPO post-training pipeline work

Hang Yan (high hireability)
Postdoc @ Chinese University of Hong Kong
Previously: PhD student @ Fudan University
Hong Kong, HK
Overall: 58
RLHF / RLVR: 90 · Synthetic Data & Self-Play: 82 · Evals & Reward Models: 80 · Personality & Non-Verifiable Rewards: 30 · LLM Creativity: 10
Strengths:
  "Secrets of RLHF" Parts I & II — direct PPO + reward modeling work
  SynthRL (ICLR 2026) — RLVR + verifiable data synthesis
Gaps:
  No explicit personality/creativity/non-verifiable reward work

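"PPO + reward modeling" pipelines of the kind cited above typically fit the reward model with a Bradley-Terry pairwise loss before any policy optimization happens. A minimal illustrative sketch (pure Python; the scalar scores stand in for a reward model's outputs and are assumptions):

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise reward-model loss: -log P(chosen beats rejected),
    where P is the sigmoid of the reward gap."""
    gap = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

def batch_loss(pairs):
    """Mean Bradley-Terry loss over (chosen, rejected) score pairs."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)
```

Minimizing this loss pushes the reward model to score human-preferred responses above rejected ones; the trained model then supplies the scalar reward that PPO maximizes.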
Hanlin Zhu (high hireability)
Ph.D. candidate @ University of California, Berkeley
San Francisco, US
Overall: 58
RLHF / RLVR: 80 · Personality & Non-Verifiable Rewards: 72 · Evals & Reward Models: 68 · LLM Creativity: 35 · Synthetic Data & Self-Play: 35
Strengths:
  Starling-7B: RLAIF for helpfulness/harmlessness (COLM 2024 Oral, 176 citations)
  Personalized alignment eval for open-ended text generation (EMNLP 2024)
Gaps:
  Limited synthetic data / self-play pipeline work — no evidence of large-scale SFT data gen

Hantao Lou (high hireability)
Member of Technical Staff, Manager @ Anthropic
Previously: Research Fellow @ Machine Intelligence Research Institute
San Francisco, US
Overall: 25
RLHF / RLVR: 55 · Evals & Reward Models: 35 · Personality & Non-Verifiable Rewards: 20 · Synthetic Data & Self-Play: 10 · LLM Creativity: 5
Strengths:
  align-anything: full PPO/DPO RLHF framework for multimodal LLMs
  Aligner (2024, 126 citations) — post-training alignment via correction
Gaps:
  Core identity is interpretability/theoretical alignment, not applied post-training

Han Zhou (high hireability)
AI Scientist Intern @ Mistral
Previously: Student Researcher @ Google
London, GB
Overall: 57
Evals & Reward Models: 75 · Personality & Non-Verifiable Rewards: 72 · RLHF / RLVR: 62 · Synthetic Data & Self-Play: 55 · LLM Creativity: 20
Strengths:
  ZEPO (EMNLP 2024): preference elicitation for human-aligned LLM judgments
  PairS (COLM 2024): pairwise preference LLM evaluator — direct reward model work
Gaps:
  No explicit PPO/DPO/GRPO at scale — policy optimization work is agent-focused

Hao Zhu (high hireability)
Postdoctoral Scholar @ Stanford University
Previously: PhD Student @ Carnegie Mellon University
US
Overall: 66
Personality & Non-Verifiable Rewards: 88 · Synthetic Data & Self-Play: 80 · Evals & Reward Models: 78 · RLHF / RLVR: 62 · LLM Creativity: 22
Strengths:
  SOTOPIA-RL: reward design for social/personality behaviors — core non-verifiable reward work
  SOTOPIA-S4: large-scale persona-conditioned synthetic conversation data generation
Gaps:
  No demonstrated work on standard RLHF/PPO/DPO at LLM scale (consumer post-training)

Hongru Wang (high hireability)
Research Associate @ University of Edinburgh
Previously: Research Intern @ ByteDance
Edinburgh, GB
Overall: 45
RLHF / RLVR: 80 · Evals & Reward Models: 75 · Synthetic Data & Self-Play: 35 · Personality & Non-Verifiable Rewards: 30 · LLM Creativity: 5
Strengths:
  ToolRL: RLVR for tool learning via GRPO (78 citations, 2025)
  RM-R1: reward modeling as reasoning (ICLR 2026, 49 citations)
Gaps:
  No personality/tone/creative reward modeling evidence

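The GRPO method named above replaces a learned value critic with group-relative baselines: several rollouts are sampled per prompt and each rollout's reward is z-scored against its own group. A minimal illustrative sketch of that advantage computation (pure Python; the reward values are made-up assumptions):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one prompt's G rollouts:
    z-score each rollout's reward against the group mean/std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one prompt with binary (e.g. verifier) rewards:
# only the passing rollouts receive positive advantage.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is computed within the group, the advantages always sum to (approximately) zero, so no separate critic network is needed.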
Huayu Chen (high hireability)
PhD Candidate @ Tsinghua University
Previously: Research Intern @ Nvidia
Beijing, CN
Overall: 42
RLHF / RLVR: 88 · Evals & Reward Models: 78 · Synthetic Data & Self-Play: 30 · Personality & Non-Verifiable Rewards: 8 · LLM Creativity: 5
Strengths:
  NCA (NeurIPS 2024): explicit reward modeling for LLM alignment
  PRIME paper (159 citations): process reward model for reasoning RL
Gaps:
  No work on personality, tone, humor, or non-verifiable reward modeling

Jin Peng Zhou (high hireability)
Chief of Staff, Cornell University Student Assembly @ Cornell University
Ithaca, US
Overall: 47
RLHF / RLVR: 90 · Evals & Reward Models: 82 · Personality & Non-Verifiable Rewards: 45 · LLM Creativity: 10 · Synthetic Data & Self-Play: 10
Strengths:
  Q# (NeurIPS 2025): distributional RL theory applied to LM post-training
  RLHF for personalization — consumer post-training alignment work
Gaps:
  No synthetic data generation or self-play pipeline experience found

Junyang Lin (high hireability)
Research Scientist @ Qwen
Previously: Staff Engineer @ Alibaba
Beijing, CN
Overall: 70
RLHF / RLVR: 95 · Evals & Reward Models: 88 · Synthetic Data & Self-Play: 80 · Personality & Non-Verifiable Rewards: 72 · LLM Creativity: 15
Strengths:
  Core Qwen team: co-led Qwen2.5-Math and Qwen3 post-training
  Process Reward Models paper (2025) — leads RLVR reward modeling research
Gaps:
  Limited explicit work on creative writing or personality/humor modeling

Shengding Hu (high hireability)
Intern @ DeepSeek
Previously: Intern @ ByteDance
Overall: 47
Evals & Reward Models: 72 · RLHF / RLVR: 65 · Synthetic Data & Self-Play: 65 · Personality & Non-Verifiable Rewards: 20 · LLM Creativity: 15
Strengths:
  DeepSeek intern — current focus on RL scaling (O1 paradigm, GRPO environment)
  MiniCPM: 35 commits, co-author — scalable post-training strategies for small LLMs
Gaps:
  No explicit published RLHF/DPO/PPO/reward modeling papers yet

Siyu Yuan (high hireability)
Research Intern @ Moonshot AI
Previously: Research Intern @ ByteDance
Overall: 70
Personality & Non-Verifiable Rewards: 88 · RLHF / RLVR: 72 · LLM Creativity: 72 · Evals & Reward Models: 65 · Synthetic Data & Self-Play: 55
Strengths:
  InCharacter (164 citations) — personality fidelity eval for role-playing agents
  Moonshot AI RL Team intern — contributed to Seed1.5-Thinking (RL post-training)
Gaps:
  No first-author PPO/DPO/GRPO paper — RL work is via team contributions rather than original RL methods

Souradip Chakraborty (high hireability)
PhD Student Research Intern @ Google
Previously: AI Research Intern @ Chase
Seattle, US
Overall: 58
RLHF / RLVR: 90 · Evals & Reward Models: 75 · Personality & Non-Verifiable Rewards: 65 · Synthetic Data & Self-Play: 50 · LLM Creativity: 12
Strengths:
  MaxMin-RLHF (133 cit.) — alignment with diverse human preferences
  PARL: unified RLHF framework — NeurIPS-level RLHF theory
Gaps:
  No direct work on personality/humor/creativity-specific rewards

Xiangru Tang (high hireability)
Ph.D. Candidate @ Yale University
Previously: Assistant Professor @ Yale University
New Haven, US
Overall: 26
Evals & Reward Models: 62 · Synthetic Data & Self-Play: 38 · RLHF / RLVR: 20 · LLM Creativity: 5 · Personality & Non-Verifiable Rewards: 5
Strengths:
  ToolLLM (950 citations): LLMs mastering 16,000+ APIs
  OpenHands: generalist coding-agent platform (292 citations)
Gaps:
  No RLHF/RLVR/DPO training methodology publications

Yuanhao Yue (high hireability)
Overall: 37
Synthetic Data & Self-Play: 65 · Evals & Reward Models: 62 · RLHF / RLVR: 45 · Personality & Non-Verifiable Rewards: 8 · LLM Creativity: 5
Strengths:
  Post-training for LLMs is stated research focus (KD, data synthesis, evals)
  17 commits on QwenLM/Qwen3 — direct contributor to target model family
Gaps:
  No published work on RLHF/RLVR or preference optimization (PPO/DPO/GRPO)

Zhenhailong Wang (high hireability)
Research Assistant @ BLENDER Lab
Previously: Applied Scientist Intern @ Amazon
Champaign, US
Overall: 40
RLHF / RLVR: 60 · Personality & Non-Verifiable Rewards: 55 · Evals & Reward Models: 45 · Synthetic Data & Self-Play: 30 · LLM Creativity: 10
Strengths:
  PAPO: policy optimization for multimodal reasoning (ICLR 2026)
  Multimodal Policy Internalization for Conversational Agents (ICLR 2026)
Gaps:
  No dedicated DPO/GRPO or text-only LLM preference optimization work

Aakanksha Chowdhery (medium hireability)
Member of Technical Staff @ ReflectionAI
Previously: Senior Staff Research Scientist @ Meta
San Francisco, US
Overall: 37
RLHF / RLVR: 65 · Evals & Reward Models: 58 · Synthetic Data & Self-Play: 35 · Personality & Non-Verifiable Rewards: 15 · LLM Creativity: 10
Strengths:
  Scaling instruction-finetuned LMs (FLAN) — 5,380 citations, post-training at scale
  RL for agentic LLMs at ReflectionAI — directly relevant to RLVR / knowledge post-training
Gaps:
  No published work on personality alignment, non-verifiable rewards, or creative writing

Abhinav Chinta (medium hireability)
Research Assistant @ Stanford University
Previously: Researcher @ University of Illinois Urbana-Champaign
San Francisco, US
Overall: 39
Personality & Non-Verifiable Rewards: 68 · Evals & Reward Models: 42 · RLHF / RLVR: 35 · LLM Creativity: 28 · Synthetic Data & Self-Play: 22
Strengths:
  Unsupervised Human Preference Learning (EMNLP 2024) — on-target for personality/non-verifiable rewards
  Preference-agent approach: a small model steers a large LLM toward individual preferences
Gaps:
  No direct RL work — no PPO, DPO, GRPO, or reward model training

Abhinav Rastogi (medium hireability)
Research Scientist @ Mistral AI
Previously: Staff Research Scientist & Tech Lead Manager @ DeepMind
San Francisco, US
Overall: 59
RLHF / RLVR: 92 · Synthetic Data & Self-Play: 74 · Evals & Reward Models: 72 · Personality & Non-Verifiable Rewards: 45 · LLM Creativity: 10
Strengths:
  RLAIF vs RLHF (ICML 2024) — 922-citation landmark RLHF/RLAIF paper
  Robust Multi-Objective Online DPO alignment (AAAI 2025)
Gaps:
  No work on personality adherence, tone, humor, or creative writing

Achal Dave (medium hireability)
Member of Technical Staff @ Anthropic
Previously: Research Scientist @ Toyota Research Institute
San Francisco, US
Overall: 49
RLHF / RLVR: 75 · Synthetic Data & Self-Play: 70 · Evals & Reward Models: 45 · Personality & Non-Verifiable Rewards: 40 · LLM Creativity: 15
Strengths:
  RLAIF patent 'Scaling RL With AI Feedback' — direct RLHF/RLAIF evidence
  Post-training geometry paper (arXiv 2025) — Anthropic post-training research
Gaps:
  No published work on personality adherence, humor, or creative NLG

Aditi Chaudhary (medium hireability)
Research Scientist @ DeepMind
Previously: Graduate Research Assistant @ Carnegie Mellon University
San Francisco, US
Overall: 28
Evals & Reward Models: 50 · Synthetic Data & Self-Play: 30 · RLHF / RLVR: 25 · Personality & Non-Verifiable Rewards: 25 · LLM Creativity: 10
Strengths:
  Gemini 2.5 contributor — DeepMind post-training/eval team experience
  DB-confirmed expertise: LLM post-training, instruction fine-tuning
Gaps:
  No published RLHF/DPO/GRPO/reward-modeling papers

Aishwarya Padmakumar (medium hireability)
Senior Dialogue Scientist @ NVIDIA
Previously: Senior Applied Scientist @ Amazon
San Francisco, US
Overall: 55
RLHF / RLVR: 75 · Evals & Reward Models: 62 · Personality & Non-Verifiable Rewards: 60 · Synthetic Data & Self-Play: 50 · LLM Creativity: 30
Strengths:
  NVIDIA role explicitly covers RLHF for LLMs (raw data signal)
  Data-Efficient Alignment with RLHF (2023) — direct RLHF alignment work
Gaps:
  No RLVR or verifiable-reward RL work (tool use, code, agentic tasks)

Akbir Khan (medium hireability)
Member of Technical Staff @ Anthropic
Previously: Research Analyst @ Cooperative AI Foundation
San Francisco, US
Overall: 52
RLHF / RLVR: 75 · Evals & Reward Models: 62 · Personality & Non-Verifiable Rewards: 62 · Synthetic Data & Self-Play: 45 · LLM Creativity: 15
Strengths:
  'Language Models Learn to Mislead Humans via RLHF' — direct RLHF post-training work
  Best Paper at ICML 2024 for LLM debate / scalable oversight research
Gaps:
  No RLVR / tool-use or agentic post-training work found

Albert Q. Jiang (medium hireability)
Research Scientist @ Mistral AI
Previously: Intern @ Meta
London, GB
Overall: 33
RLHF / RLVR: 62 · Evals & Reward Models: 60 · Synthetic Data & Self-Play: 35 · LLM Creativity: 5 · Personality & Non-Verifiable Rewards: 5
Strengths:
  Devstral: fine-tuning LMs for coding-agent applications (Mistral, 2025)
  Magistral: RLVR-based reasoning model at Mistral AI (2025)
Gaps:
  No consumer post-training work — personality, humor, creativity absent

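RLVR work like that cited above scores completions with a programmatic verifier rather than a learned reward model. A minimal illustrative sketch of such a verifiable reward for math-style answers (pure Python; the `\boxed{}` answer convention and the numeric tolerance are assumptions for illustration):

```python
import re

def verifiable_reward(completion, reference_answer, tol=1e-6):
    """Binary reward: 1.0 if the last \\boxed{...} value in the
    completion matches the reference answer numerically, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no final answer emitted
    try:
        return 1.0 if abs(float(matches[-1]) - float(reference_answer)) <= tol else 0.0
    except ValueError:
        return 0.0  # non-numeric answer
```

Because the check is deterministic, rewards of this kind are immune to reward-model hacking, which is the usual argument for RLVR on math, code, and tool-use tasks.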
Albert Webson (medium hireability)
Senior Research Scientist @ DeepMind
Previously: Research Scientist @ DeepMind
Overall: 56
RLHF / RLVR: 90 · Synthetic Data & Self-Play: 65 · Evals & Reward Models: 60 · Personality & Non-Verifiable Rewards: 45 · LLM Creativity: 20
Strengths:
  Primary RL lead for Gemini — production RLHF at massive scale
  Flan-T5 co-author (5.3K citations) — foundational post-training work
Gaps:
  No published work on personality/non-verifiable reward modeling or creativity

Alekh Agarwal (medium hireability)
Staff Research Scientist @ Google
Previously: Principal Research Manager @ Microsoft
Seattle, US
Overall: 48
RLHF / RLVR: 90 · Evals & Reward Models: 88 · Personality & Non-Verifiable Rewards: 45 · Synthetic Data & Self-Play: 10 · LLM Creativity: 5
Strengths:
  "Minimaximalist Approach to RLHF" (NeurIPS 2024) — core RLHF algorithm design
  "Rewarding Progress: Scaling Process Verifiers" (2025) — RLVR for reasoning
Gaps:
  No work on personality, humor, sarcasm, or creative writing

Alexander M. Rush (medium hireability)
Research Scientist @ Cursor
Previously: Researcher @ Hugging Face
New York, US
Overall: 47
RLHF / RLVR: 85 · Evals & Reward Models: 60 · Synthetic Data & Self-Play: 45 · Personality & Non-Verifiable Rewards: 30 · LLM Creativity: 15
Strengths:
  Multi-Turn Code Gen (ICML 2025 Spotlight): RLVR for multi-step tool/code tasks
  Zephyr (773 citations): DPO-based LM alignment, widely adopted recipe
Gaps:
  Primarily code/tool-use focused — limited consumer personality or creativity work

Alexander Spangher (medium hireability)
Postdoctoral Researcher @ Stanford University
Previously: Data Scientist and Journalist @ New York Times
San Francisco, US
Overall: 28
LLM Creativity: 72 · Evals & Reward Models: 32 · Personality & Non-Verifiable Rewards: 22 · RLHF / RLVR: 8 · Synthetic Data & Self-Play: 5
Strengths:
  EMNLP 2024 Outstanding Paper: human-level narrative generation evaluation
  ICML 2024 Spotlight: classifier-free guidance for topic-controlled LLM output
Gaps:
  No RLHF/RLVR/DPO/PPO work — no post-training methodology experience

Aliaksei Severyn (medium hireability)
Research Scientist @ Google
Overall: 54
RLHF / RLVR: 90 · Evals & Reward Models: 80 · Synthetic Data & Self-Play: 75 · Personality & Non-Verifiable Rewards: 15 · LLM Creativity: 10
Strengths:
  BOND (ICLR 2025) — Best-of-N distillation, direct LLM alignment
  West-of-N — synthetic preference generation for reward modeling
Gaps:
  No published work on personality, humor, creativity, or non-verifiable reward shaping

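Best-of-N distillation approaches like BOND train the policy to imitate a Best-of-N sampler. The teacher behavior being distilled is simple to sketch (pure Python; `generate` and `reward` are hypothetical stand-ins for a sampler and a reward model, supplied by the caller):

```python
def best_of_n(prompt, generate, reward, n=4):
    """Draw n candidate responses and return the highest-reward one.

    generate(prompt, i) -> candidate string for seed i
    reward(prompt, response) -> scalar score
    """
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

# Toy example: the stand-in 'reward' simply prefers longer responses.
pick = best_of_n(
    "hi",
    generate=lambda p, i: p * (i + 1),  # "hi", "hihi", "hihihi", ...
    reward=lambda p, c: len(c),
)
```

Running Best-of-N at inference time is expensive (n forward passes per query); distillation bakes the same reward-weighted selection behavior into a single forward pass.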
Alon Benhaim (medium hireability)
Senior Applied Scientist @ Microsoft
Previously: Applied Scientist 2 @ Microsoft
Seattle, US
Overall: 41
RLHF / RLVR: 80 · Evals & Reward Models: 50 · Personality & Non-Verifiable Rewards: 45 · Synthetic Data & Self-Play: 20 · LLM Creativity: 10
Strengths:
  '*PO' paper: empirical DPO/RLHF analysis + LN-DPO — core preference optimization
  POROver (2025): preference optimization for safety/overrefusal — non-verifiable reward alignment
Gaps:
  No work on personality/creativity/humor or subjective reward modeling

Aman Madaan (medium hireability)
AI Researcher and Engineer @ xAI
Previously: Graduate Research Assistant @ Carnegie Mellon University
San Francisco, US
Overall: 49
Synthetic Data & Self-Play: 65 · RLHF / RLVR: 60 · Evals & Reward Models: 55 · Personality & Non-Verifiable Rewards: 35 · LLM Creativity: 30
Strengths:
  Self-Refine (NeurIPS 2023) — foundational iterative-feedback-loop paper
  AutoMix — routes queries across model sizes, an exact match for the JD's 'call a bigger model' goal
Gaps:
  Core PPO/DPO/GRPO post-training work not directly evidenced — mainly inference-time methods

Angela Fan (medium hireability)
Meta
Previously: Research Scientist @ Meta
Overall: 52
LLM Creativity: 85 · Personality & Non-Verifiable Rewards: 70 · RLHF / RLVR: 55 · Evals & Reward Models: 30 · Synthetic Data & Self-Play: 20
Strengths:
  LLaMA 2 co-author — direct RLHF post-training for chat alignment
  Hierarchical Neural Story Generation (ACL 2018) — seminal creative-writing work
Gaps:
  No standalone reward modeling or preference optimization (DPO/PPO) papers

An Yan (medium hireability)
Research Scientist @ Salesforce
Previously: Research Intern @ Microsoft
San Diego, US
Overall: 32
Synthetic Data & Self-Play: 68 · Evals & Reward Models: 48 · RLHF / RLVR: 25 · LLM Creativity: 10 · Personality & Non-Verifiable Rewards: 8
Strengths:
  List Items One by One (COLM 2024) — synthetic data recipe for multimodal post-training
  MTA-Agent — synthetic RL data pipeline for multimodal search agents
Gaps:
  No PPO/DPO/GRPO or explicit reward modeling work

An Yang (medium hireability)
Researcher @ Alibaba
Previously: MS student @ Peking University
Overall: 69
RLHF / RLVR: 95 · Evals & Reward Models: 80 · Synthetic Data & Self-Play: 72 · Personality & Non-Verifiable Rewards: 68 · LLM Creativity: 30
Strengths:
  QwQ-32B — co-authored core RLVR reasoning paper (409 citations, 2025)
  WorldPM — reward/preference modeling at scale (2025)
Gaps:
  Limited creative writing / personality adherence work — focused on RLVR

Aohan Zeng (medium hireability)
Overall: 53
Evals & Reward Models: 82 · RLHF / RLVR: 78 · Synthetic Data & Self-Play: 52 · Personality & Non-Verifiable Rewards: 35 · LLM Creativity: 20
Strengths:
  ChatGLM-RLHF paper — applied RLHF on a production LLM at ZhipuAI
  "Does RLHF Scale?" (2025) — explicit RLHF scaling research
Gaps:
  No explicit personality/tone/humor reward modeling work

Aohan Zeng (medium hireability)
PhD student @ Tsinghua University
Beijing, CN
Overall: 50
RLHF / RLVR: 88 · Evals & Reward Models: 82 · Synthetic Data & Self-Play: 38 · Personality & Non-Verifiable Rewards: 30 · LLM Creativity: 10
Strengths:
  ChatGLM-RLHF (2024) — implemented RLHF for production LLM alignment
  Does RLHF Scale? (2025) — systematic RLHF scaling experiments
Gaps:
  No focused work on personality, humor, or creative writing (non-verifiable rewards)

Baolin Peng (medium hireability)
Principal Researcher @ Microsoft
Previously: Senior Researcher @ Tencent
Seattle, US
Overall: 62
RLHF / RLVR: 88 · Evals & Reward Models: 78 · Synthetic Data & Self-Play: 68 · Personality & Non-Verifiable Rewards: 55 · LLM Creativity: 20
Strengths:
  "Advantage Modeling for RLHF" (2025) — direct RLHF post-training work
  Nash Policy Optimization (2025) — general preference alignment at LLM scale
Gaps:
  Essentially no work on LLM creativity (roleplay, humor, cultural zeitgeist)

Baosong Yang (medium hireability)
Algorithm Expert @ Alibaba
Previously: Postgraduate Intern @ Tencent
Hangzhou, CN
Overall: 40
Evals & Reward Models: 55 · Personality & Non-Verifiable Rewards: 50 · RLHF / RLVR: 40 · Synthetic Data & Self-Play: 35 · LLM Creativity: 20
Strengths:
  Qwen3 co-author — first-hand knowledge of the target model's training pipeline
  Qwen2 + Qwen2.5-Omni contributor — spans text and multimodal post-training
Gaps:
  No standalone RLHF/RLVR/DPO papers — individual post-training role within Qwen not isolated

Barna Pásztor (medium hireability)
Doctoral Fellow @ ETH AI Center
Previously: Core contributor and post-training team lead @ Swiss AI Initiative
Zürich, CH
Overall: 36
RLHF / RLVR: 75 · Evals & Reward Models: 45 · Personality & Non-Verifiable Rewards: 32 · Synthetic Data & Self-Play: 20 · LLM Creativity: 10
Strengths:
  Led the post-training team for Apertus-70B — direct open-source LLM RLHF experience
  Stackelberg RLHF paper — preference optimization as a sequential game (EWRL 2025)
Gaps:
  Primarily theoretical (game-theoretic) rather than applied RLHF engineering

Barun Patra (medium hireability)
Member of Technical Staff @ Microsoft
Previously: Senior Applied Scientist @ Microsoft
Seattle, US
Overall: 41
RLHF / RLVR: 75 · Evals & Reward Models: 50 · Synthetic Data & Self-Play: 45 · Personality & Non-Verifiable Rewards: 30 · LLM Creativity: 5
Strengths:
  'A Practical Analysis of Human Alignment with *PO' — proposes LN-DPO (NAACL 2025)
  Phi-3 co-author — small-model post-training directly matches the Qwen post-training context
Gaps:
  No work on personality, tone, creativity, or humor alignment

Behnam Neyshabur (medium hireability)
Member of Technical Staff @ Anthropic
Previously: Senior Staff Research Scientist & Team Lead @ DeepMind
San Francisco, US
Overall: 49
Synthetic Data & Self-Play: 85 · RLHF / RLVR: 72 · Evals & Reward Models: 70 · Personality & Non-Verifiable Rewards: 15 · LLM Creativity: 5
Strengths:
  "Beyond Human Data" (TMLR 2024) — self-training/self-play data pipeline for LLMs
  Co-led DeepMind's Blueshift team, which fed into Gemini post-training in production
Gaps:
  No explicit reward modeling or RLHF-specific papers published

Bertie Vidgen (medium hireability)
AI Research @ Mercor
Previously: Data + Evaluation @ Contextual AI
US
Overall: 48
Evals & Reward Models: 82 · Personality & Non-Verifiable Rewards: 80 · RLHF / RLVR: 35 · Synthetic Data & Self-Play: 32 · LLM Creativity: 10
Strengths:
  PRISM dataset: individualized human feedback for subjective LLM alignment (186 citations)
  "Socioaffective alignment" (2025) — emotions, personality, AI-human relationship design
Gaps:
  No technical RLHF/PPO/DPO/GRPO training implementation evidence

Bhrij Patel (medium hireability)
Incoming Research Intern @ AG2
Previously: Machine Learning Research Intern @ Qualcomm
San Francisco, US
Overall: 20
Evals & Reward Models: 55 · RLHF / RLVR: 20 · Personality & Non-Verifiable Rewards: 15 · LLM Creativity: 5 · Synthetic Data & Self-Play: 5
Strengths:
  ACL 2026: lightweight function calling — direct tool-call alignment
  EMNLP 2025: API learning from demonstrations for tool-based agents
Gaps:
  No direct RLHF/RLVR post-training experience; RL work is theoretical (average-reward)

Bill Yuchen Lin (medium hireability)
Member of Technical Staff @ xAI
Previously: Research Scientist @ Allen Institute for AI
San Francisco, US
Overall: 66
Evals & Reward Models: 92 · Synthetic Data & Self-Play: 88 · RLHF / RLVR: 80 · Personality & Non-Verifiable Rewards: 50 · LLM Creativity: 20
Strengths:
  RewardBench (417 citations) — the defining paper for reward model evaluation
  Magpie (2025) — alignment data synthesis from scratch, SFT pipeline
Gaps:
  No published work on personality, humor, sarcasm, or creative-writing post-training

Bin Liang (medium hireability)
Postdoctoral Fellow @ The Chinese University of Hong Kong
Previously: PhD student @ Harbin Institute of Technology
CN
Overall: 27
Evals & Reward Models: 55 · Personality & Non-Verifiable Rewards: 40 · RLHF / RLVR: 15 · Synthetic Data & Self-Play: 15 · LLM Creativity: 8
Strengths:
  CoreEval (ACL 2025): builds contamination-resilient LLM eval datasets
  Multi-persona Framework (ACL 2025): persona-conditioned quality scoring
Gaps:
  No direct RLHF/PPO/DPO/GRPO post-training work

Bin Wang (medium hireability)
Principal Researcher @ Xiaomi
Previously: Full Professor @ Institute of Information Engineering, Chinese Academy of Sciences
Overall: 47
Synthetic Data & Self-Play: 78 · RLHF / RLVR: 65 · Evals & Reward Models: 62 · Personality & Non-Verifiable Rewards: 20 · LLM Creativity: 10
Strengths:
  TaP (2025): taxonomy-guided automated preference data generation framework
  MobileIPL (2025): iterative DPO preference learning for agentic thinking
Gaps:
  No evidence of personality, humor, or creative-writing post-training work

Binyuan Hui (medium hireability)
Senior Staff Algorithm Engineer @ Alibaba
Previously: Staff Algorithm Engineer @ Alibaba
Beijing, CN
Overall: 64
RLHF / RLVR: 85 · Evals & Reward Models: 80 · Synthetic Data & Self-Play: 70 · Personality & Non-Verifiable Rewards: 65 · LLM Creativity: 20
Strengths:
  Core Qwen team: co-authored Qwen, Qwen2.5, Qwen2.5-Coder, and Qwen3
  WorldPM (2025): scaling human preference modeling for reward models
Gaps:
  No creative writing / personality / humor-specific published work

Bobak Shahriari (medium hireability)
Researcher @ DeepMind
Previously: Research Scientist @ DeepMind
Overall: 52
RLHF / RLVR: 85 · Personality & Non-Verifiable Rewards: 80 · Evals & Reward Models: 70 · Synthetic Data & Self-Play: 15 · LLM Creativity: 10
Strengths:
  BOND (ICLR 2025) — LLM alignment via Best-of-N distillation
  'Capturing individual human preferences with reward features' (2025) — personalized reward modeling
Gaps:
  No evident synthetic data generation or self-play pipeline work

Bofei Gao (medium hireability)
MS student @ Peking University
CN
Overall: 48
RLHF / RLVR: 80 · Evals & Reward Models: 75 · Synthetic Data & Self-Play: 45 · Personality & Non-Verifiable Rewards: 35 · LLM Creativity: 5
Strengths:
  Preference learning survey — comprehensive DPO/PPO/GRPO coverage
  MATH-Minos: natural-language-feedback math verifier (reward model for reasoning)
Gaps:
  Work focused on verifiable rewards (math/code) — limited personality/conversation post-training

Bowen Li (medium hireability)
Shanghai AI Lab
Previously: Researcher @ Shanghai AI Lab
Overall: 34
Synthetic Data & Self-Play: 65 · Evals & Reward Models: 50 · RLHF / RLVR: 45 · LLM Creativity: 5 · Personality & Non-Verifiable Rewards: 5
Strengths:
  EvoSyn: evolutionary synthetic data generation framework for RLVR (Oct 2025)
  TESSY: teacher-student SFT data synthesis, +11% code-gen gains (2026)
Gaps:
  No explicit reward model or RLHF/PPO/DPO methodology papers

    BT

    Bowen Tan

    medium hireability

    AI Research Scientist@Meta

    Previously: Machine Learning Researcher @ Apple

    US

    45
    Synthetic Data & Self-Play65
    RLHF / RLVR55
    Evals & Reward Models50
    LLM Creativity30
    Personality & Non-Verifiable Rewards25
    Strengths
    Efficient Soft Q-Learning for Text Generation — RL for generation (70 citations, 2022)
    Learning Data Manipulation for Augmentation — NeurIPS 2019, 148 citations
    Gaps
    No direct PPO/DPO/GRPO post-training work on modern instruction-tuned LLMs
    BY

    Bowen Yu

    medium hireability

    Algorithm Expert@Alibaba

    Previously: PhD student @ Chinese Academy of Sciences

    Beijing, CN

    81
    RLHF / RLVR: 92
    Evals & Reward Models: 88
    Synthetic Data & Self-Play: 80
    Personality & Non-Verifiable Rewards: 78
    LLM Creativity: 65
    Strengths
    Leads Qwen-Instruct post-training — exact match for search query
    'Preference Ranking Optimization for Human Alignment' (AAAI 2024, 326 cit.)
    Gaps
    No explicit consumer/emotional intelligence (sarcasm, humor) papers found
    CX

    Can Xu

    medium hireability

    Software Engineer@Microsoft

    Previously: Software Engineer @ JPMorgan Chase & Co.

    New York, US

    61
    Synthetic Data & Self-Play: 97
    RLHF / RLVR: 75
    Evals & Reward Models: 68
    Personality & Non-Verifiable Rewards: 50
    LLM Creativity: 15
    Strengths
    Arena Learning: self-play chatbot arena as data flywheel (NeurIPS 2024)
    Evol-Instruct creator — canonical synthetic instruction data pipeline
    Gaps
    No direct work on personality, humor, or tone-based non-verifiable rewards
    CA

    casinca

    medium hireability
    26
    RLHF / RLVR: 83
    Evals & Reward Models: 20
    Synthetic Data & Self-Play: 15
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    15 merged TRL PRs: GRPO variants (VESPO, SAPO, OPSM), DPO norm, async rollout
    VESPO implementation in grpo_trainer.py — paper-to-code contribution
    Gaps
    No evidence of personality/creative RLHF or non-verifiable reward work
    CG

    Chang Gao

    medium hireability

    Researcher@Alibaba

    Previously: Research Intern @ Z.ai

    Beijing, CN

    42
    RLHF / RLVR: 92
    Evals & Reward Models: 55
    Synthetic Data & Self-Play: 30
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 5
    Strengths
    Qwen3 co-author — direct experience fine-tuning the exact model
    GSPO paper: novel GRPO/RLVR algorithm (2025, 94 citations)
    Gaps
    No work on personality, creativity, or non-verifiable reward modeling
    CD

    Chenghao Deng

    medium hireability

    Research Intern@TikTok

    Previously: Undergraduate Intern @ Penn State University

    San Francisco, US

    16
    Evals & Reward Models: 40
    RLHF / RLVR: 20
    Synthetic Data & Self-Play: 10
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    EnsemW2S: token-level ensemble for weak-to-strong LLM alignment (NeurIPS 2024)
    Easy2Hard-Bench: difficulty-graded eval benchmark for LLMs (NeurIPS 2024)
    Gaps
    No direct RLHF/RLVR or reward modeling pipeline work
    CW

    Chenglong Wang

    medium hireability

    PhD candidate@Northeastern University (Shenyang, China)

    Shenyang, CN

    45
    Evals & Reward Models: 85
    RLHF / RLVR: 75
    Synthetic Data & Self-Play: 40
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 5
    Strengths
    GRAM (ICML 2025): generative foundation reward model — core RM contribution
    MRMBench (AAAI 2026): multi-dimensional reward model eval framework
    Gaps
    No personality, tone, humor, or non-verifiable reward work
    CQ

    Cheng Qian

    medium hireability

    Professor@University of Illinois Urbana-Champaign

    Previously: Associate Professor @ University of Illinois Urbana-Champaign

    Urbana-Champaign, US

    60
    RLHF / RLVR: 85
    Evals & Reward Models: 80
    LLM Creativity: 60
    Personality & Non-Verifiable Rewards: 50
    Synthetic Data & Self-Play: 25
    Strengths
    ToolRL (NeurIPS 2025): RLVR reward shaping for tool use — exact JD priority match
    RM-R1 (2025): reward modeling as reasoning — eval/reward model expertise
    Gaps
    2nd-year PhD — not in typical graduation/industry transition window yet
    CL

    Chengqi Lyu

    medium hireability

    Researcher@Shanghai AI Laboratory

    Previously: Researcher @ SenseTime

    50
    RLHF / RLVR: 82
    Evals & Reward Models: 78
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 22
    LLM Creativity: 8
    Strengths
    OREAL (2025): outcome reward RL, 7B → 94% MATH-500 — core RLVR work
    CompassVerifier: verifier for LLM eval + outcome reward signals (2025)
    Gaps
    No work on personality, tone, humor, or non-verifiable reward shaping
    CH

    Chengsong Huang

    medium hireability

    Ph.D. Student in Computer Science@Washington University in St. Louis

    Previously: Research Intern @ Tencent

    St. Louis, US

    50
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 78
    Evals & Reward Models: 72
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 8
    Strengths
    'Taming Overconfidence in LLMs': reward calibration in RLHF (ICLR 2025)
    R-Zero: self-evolving LLM from zero data — direct self-play approach (ICLR 2026)
    Gaps
    No work on personality, tone, or non-verifiable creative rewards
    CW

    Chengyu Wang

    medium hireability

    Algorithm Expert@Alibaba

    Previously: PhD student @ East China Normal University

    Hangzhou, CN

    34
    Synthetic Data & Self-Play: 75
    RLHF / RLVR: 50
    Evals & Reward Models: 30
    Personality & Non-Verifiable Rewards: 10
    LLM Creativity: 5
    Strengths
    AgenticQwen (ACL 2026): industrial tool-use training for small Qwen — direct hit
    Mock Worlds, Real Skills (ACL 2026): rubric-based rewards + synthetic task environments
    Gaps
    Primary focus is knowledge distillation, not RLHF/DPO/PPO/GRPO post-training
    CZ

    Chen Zhu

    medium hireability

    Research Scientist@Meta

    Previously: Member of Technical Staff @ xAI

    San Francisco, US

    50
    RLHF / RLVR: 95
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 25
    LLM Creativity: 15
    Strengths
    ODIN (ICML 2024): disentangled reward model, prevents reward hacking
    Perfect Blend/CGPO (2025): multi-objective RLHF; outperforms PPO/DPO on chat+code+math
    Gaps
    No explicit personality, tone, or creativity reward modeling work
    CC

    Chi Chen

    medium hireability

    Researcher@Tsinghua University

    Previously: PhD student @ Tsinghua University

    CN

    31
    Evals & Reward Models: 60
    RLHF / RLVR: 55
    Synthetic Data & Self-Play: 30
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    AgentCPM-GUI: GRPO-based RL fine-tuning for GUI agents (SOTA 5 benchmarks)
    MiniCPM-V co-author: efficient small MLLM matching GPT-4V on edge devices
    Gaps
    No consumer post-training work (personality, tone, humor, non-verifiable rewards)
    CN

    Chirag Nagpal

    medium hireability

    AI Research Scientist@Meta

    Previously: Research Scientist @ Google

    San Francisco, US

    44
    RLHF / RLVR: 88
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 28
    Synthetic Data & Self-Play: 12
    LLM Creativity: 5
    Strengths
    "Helping or Herding?" — reward model ensemble robustness, reward hacking (118 cites)
    "Rewarding Progress" — process verifiers for LLM reasoning, RLVR-adjacent (155 cites)
    Gaps
    No work on personality, humor, sarcasm, or non-verifiable subjective reward design
    CZ

    Chong Zhang

    medium hireability

    PhD student@MiroMind AI; Fudan University

    22
    RLHF / RLVR: 62
    Evals & Reward Models: 28
    Synthetic Data & Self-Play: 12
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Co-first author '100 days after DeepSeek-R1' — RLVR/SFT survey (2025)
    MiroMind-M1 co-author: RLVR with multi-stage policy optimization on Qwen
    Gaps
    No consumer post-training work (personality, humor, creativity, non-verifiable rewards)
    CN

    Christoforos Nalmpantis

    medium hireability

    AI Researcher@Prima Mente

    Previously: Postdoctoral Researcher @ Meta

    London, GB

    33
    RLHF / RLVR: 82
    Evals & Reward Models: 35
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 12
    LLM Creativity: 10
    Strengths
    'Teaching LLMs to Reason with RL' (2024, 150 citations) — core post-training RL
    'Understanding RLHF on LLM Generalisation' (2023, 272 citations) — RLHF depth
    Gaps
    No evidence of personality or non-verifiable reward modeling work
    CD

    Christoph Dann

    medium hireability

    Research Scientist@Google

    Previously: Research Intern @ Google

    Pittsburgh, US

    39
    RLHF / RLVR: 92
    Evals & Reward Models: 62
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 8
    LLM Creativity: 5
    Strengths
    Minimaximalist Approach to RLHF (2024) — 124 citations, theoretical RLHF foundations
    P3O: pessimistic preference policy optimization — robust alignment (2024)
    Gaps
    No work on personality, creativity, or non-verifiable reward design for LLMs
    CZ

    Chujie Zheng

    medium hireability

    Researcher@Alibaba Group

    Previously: Research Intern @ 01.AI

    Beijing, CN

    60
    RLHF / RLVR: 92
    Evals & Reward Models: 88
    Personality & Non-Verifiable Rewards: 55
    Synthetic Data & Self-Play: 35
    LLM Creativity: 28
    Strengths
    GSPO (Group Sequence Policy Optimization) — GRPO variant, core RL post-training work
    Qwen3 co-author — direct experience on the exact model Neuralace will post-train
    Gaps
    Limited explicit work on personality post-training or non-verifiable reward optimization
    CP

    Clara Pohland

    medium hireability
    23
    RLHF / RLVR: 68
    Evals & Reward Models: 22
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Synthetic Data & Self-Play: 5
    Strengths
    BCOTrainer: created standalone trainer in huggingface/trl
    10 merged TRL PRs — BCO, KTO, MoE load balancing
    Gaps
    No synthetic data generation or self-play data pipeline work
    CW

    Cunxiang Wang

    medium hireability

    Tech Leader@ZhipuAI

    Previously: Research Intern @ Amazon

    Hangzhou, CN

    56
    Evals & Reward Models: 90
    Synthetic Data & Self-Play: 82
    RLHF / RLVR: 78
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    SPaR (2025): self-play + tree-search refinement for instruction-following data gen
    RLAR (2026): agentic multi-task RL reward system — direct RLVR hit
    Gaps
    No personality/creativity/humor-focused work — consumer post-training axis underserved
    DL

    Dacheng Li

    medium hireability

    Research Assistant@Sailing Lab

    Previously: Research Assistant @ Machine Learning, Perception, and Cognition Lab

    San Francisco, US

    46
    RLHF / RLVR: 82
    Evals & Reward Models: 78
    Synthetic Data & Self-Play: 38
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 8
    Strengths
    Sky-T1: RLVR reasoning model at O1-level within $450 academic budget
    SkyRL: full-stack modular RL library for LLM post-training
    Gaps
    Consumer post-training largely absent — no personality/creativity/roleplay work
    DD

    Damai Dai

    medium hireability

    Researcher@DeepSeek AI

    Previously: PhD student @ Peking University

    46
    RLHF / RLVR: 95
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 50
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DeepSeek-R1 co-author — defining RLVR paper for LLM reasoning (5,561 cites)
    Math-Shepherd — process reward model for step-level RL verification
    Gaps
    No personality, tone, or non-verifiable reward work
    DC

    Daniele Calandriello

    medium hireability

    Researcher@DeepMind

    Previously: Postdoc @ Università degli Studi di Genova, Istituto Italiano di Tecnologia

    50
    RLHF / RLVR: 95
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 50
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 5
    Strengths
    Nash RLHF (186 citations) — game-theoretic alternative to standard RLHF
    General paradigm for learning from human preferences (740 citations, seminal)
    Gaps
    No work on non-verifiable rewards (personality, humor, creativity post-training)
    DW

    Danqing Wang

    medium hireability

    PhD student@Meta AI (FAIR)

    Previously: Research Scientist Intern @ Meta

    Pittsburgh, US

    57
    Personality & Non-Verifiable Rewards: 75
    Evals & Reward Models: 72
    RLHF / RLVR: 68
    LLM Creativity: 52
    Synthetic Data & Self-Play: 20
    Strengths
    "Learning Personalized Alignment" (EMNLP 2024) — reward modeling for open-ended text
    Meta AI internship on personalized LLM alignment under Yuandong Tian
    Gaps
    No published DPO/PPO/GRPO work — alignment exposure more eval-focused than RL optimizer
    DC

    Daoyuan Chen

    medium hireability

    Senior Algorithm Engineer@Alibaba

    Previously: Senior Algorithm Engineer on Computer Vision @ Huawei

    Beijing, CN

    50
    RLHF / RLVR: 80
    Synthetic Data & Self-Play: 78
    Evals & Reward Models: 60
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    Trinity-RFT: general-purpose RLFT framework for LLMs — core maintainer
    Data-Juicer 2.0 (NeurIPS 2025 Spotlight) — cloud-scale SFT data pipeline
    Gaps
    No evidence of personality/tone/humor reward modeling (non-verifiable rewards)
    DW

    David Wadden

    medium hireability

    Research Scientist@DeepMind

    Previously: Research Scientist @ Allen Institute for AI

    Seattle, US

    34
    RLHF / RLVR: 65
    Evals & Reward Models: 45
    Synthetic Data & Self-Play: 30
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    Tulu 2: DPO + RLHF instruction tuning at scale (292 citations)
    Gemini post-training RS at DeepMind — production-scale RL
    Gaps
    No evidence of personality, humor, or creativity-focused reward modeling
    DG

    Daya Guo

    medium hireability

    Associate Professor@Sun Yat-sen University

    Previously: Postdoctoral Fellow @ Clemson University

    Zhuhai, CN

    56
    RLHF / RLVR: 95
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 70
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 10
    Strengths
    DeepSeek-R1 co-author — defines the RLVR paradigm (4,805 citations)
    DeepSeekMath co-author — RL for verifiable reasoning at scale
    Gaps
    No work on personality, tone, humor, or non-verifiable reward modeling
    DY

    Da Yan

    medium hireability

    Member Of Technical Staff@Anthropic

    Previously: Independent Contractor @ OpenAI

    New York, US

    27
    Personality & Non-Verifiable Rewards: 55
    RLHF / RLVR: 45
    Evals & Reward Models: 25
    LLM Creativity: 5
    Synthetic Data & Self-Play: 5
    Strengths
    Sycophancy in LLMs (582 citations, 2024) — core RLHF behavior research
    Embedded in Anthropic post-training team with Askell, Perez, Korbak
    Gaps
    Core expertise is GPU compute/compilers, not post-training
    DL

    Dayiheng Liu

    medium hireability

    Researcher@Alibaba

    Previously: Intern @ Microsoft

    Hangzhou, CN

    72
    RLHF / RLVR: 85
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 70
    Synthetic Data & Self-Play: 65
    LLM Creativity: 55
    Strengths
    WorldPM: Scaling Human Preference Modeling (2025) — direct preference RM work
    Core Qwen team (Qwen–Qwen3, QwQ-32B) — production post-training at scale
    Gaps
    Hangzhou, China — relocation barrier for US/Europe positions
    DC

    Deng Cai

    medium hireability

    Research Scientist@ByteDance

    Previously: Senior Researcher @ Tencent

    CN

    47
    Personality & Non-Verifiable Rewards: 68
    Synthetic Data & Self-Play: 65
    LLM Creativity: 45
    Evals & Reward Models: 38
    RLHF / RLVR: 20
    Strengths
    Harry Potter alignment paper — character personality via SFT (EMNLP 2023)
    'Let LLMs Find Data to Train Themselves' — self-curation synthetic data (2025)
    Gaps
    No explicit RLHF/PPO/DPO/GRPO reward modeling papers found
    DH

    Devamanyu Hazarika

    medium hireability

    Research Scientist@Meta

    Previously: Senior Applied Scientist @ Amazon

    San Francisco, US

    52
    Personality & Non-Verifiable Rewards: 85
    RLHF / RLVR: 70
    Evals & Reward Models: 50
    Synthetic Data & Self-Play: 35
    LLM Creativity: 20
    Strengths
    "Do LLMs Recognize Your Preferences?" ICLR 2025 Oral — LLM personalization
    Co-led Amazon AGI model alignment team; core dev Amazon Nova
    Gaps
    No RLVR / tool-call or agentic post-training evidence found
    DY

    Dian Yu

    medium hireability

    Senior Researcher@Tencent

    Previously: Research intern @ Bosch

    Seattle, US

    67
    Synthetic Data & Self-Play: 88
    RLHF / RLVR: 82
    Evals & Reward Models: 70
    Personality & Non-Verifiable Rewards: 58
    LLM Creativity: 35
    Strengths
    '1B Personas' — persona-conditioned synthetic data at massive scale (187 citations)
    'Crossing the Reward Bridge' — RLVR across verifiable domains (ACL 2026)
    Gaps
    No direct work on personality/humor/tone reward modeling — creativity axis is weak
    DH

    Donghai Hong

    medium hireability

    MS student@Peking University

    45
    RLHF / RLVR: 80
    Evals & Reward Models: 60
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 5
    Strengths
    Align Anything (2024): multimodal RLHF framework, 20 citations
    Safe RLHF-V (2025): safety-aligned RLHF for vision-language models
    Gaps
    No personality/humor/creativity work — consumer post-training axis weak
    DL

    Dongrui Liu

    medium hireability

    Research Scientist@Shanghai AI Lab

    Previously: IC Design Internship @ MediaTek

    Shanghai, CN

    40
    RLHF / RLVR: 82
    Personality & Non-Verifiable Rewards: 55
    Evals & Reward Models: 42
    Synthetic Data & Self-Play: 15
    LLM Creativity: 5
    Strengths
    ExGRPO (ICLR 2026) — GRPO variant for LLM RL post-training, core topic
    Entropy regularization + conditional advantage in RLVR (2 papers, 2025)
    Gaps
    No synthetic data generation or self-play pipeline work found
    DZ

    Dongzhan Zhou

    medium hireability

    Researcher@Shanghai Artificial Intelligence Laboratory

    Previously: PhD student @ The University of Sydney

    Shanghai, CN

    32
    RLHF / RLVR: 65
    Evals & Reward Models: 62
    Synthetic Data & Self-Play: 20
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    SophiaVL-R1: RLVR for MLLMs with thinking reward (2025)
    LLaMA-Berry (NAACL 2025): Pairwise Preference Reward Model + MCTS
    Gaps
    Primary focus is AI for Science — not general-purpose LLM post-training
    DT

    Duyu Tang

    medium hireability

    Researcher@Huawei

    Previously: Principal Researcher @ Tencent

    Beijing, CN

    39
    RLHF / RLVR: 72
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 38
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 8
    Strengths
    ToolACE (2025): state-of-the-art LLM function calling, 73 citations
    "Is PRM Necessary?" (2025): RLVR directly induces reward-model capability
    Gaps
    No consumer post-training work: personality, tone, humor, or creative writing
    DP

    Duy Van Phung

    medium hireability

    Researcher@Intelligent Internet

    Previously: Researcher @ SynthLabs

    54
    RLHF / RLVR: 90
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 10
    Strengths
    trlX: RLHF distributed training framework, lead contributor (4.7K stars)
    Generative Reward Models (2025) — trained reward models from scratch
    Gaps
    No personality/tone/creativity-specific post-training work found
    EB

    Edward Emanuel Beeching

    medium hireability

    Research Scientist@Hugging Face

    Previously: Research And Development Intern: Deep Reinforcement Learning @ Ubisoft LaForge

    Lyon, FR

    61
    RLHF / RLVR: 92
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 72
    Personality & Non-Verifiable Rewards: 38
    LLM Creativity: 20
    Strengths
    huggingface/trl — core contributor to the canonical LLM RL training library
    Zephyr + Alignment Handbook — DPO/SFT alignment recipes widely adopted
    Gaps
    No specific work on personality, humor, tone, or non-verifiable reward shaping
    EZ

    Enyu Zhou

    medium hireability

    PhD student@Fudan University

    Previously: Research Intern @ China Qijizhifeng Ltd.Co

    Shanghai, CN

    48
    Evals & Reward Models: 90
    RLHF / RLVR: 88
    Personality & Non-Verifiable Rewards: 38
    Synthetic Data & Self-Play: 20
    LLM Creativity: 5
    Strengths
    "Secrets of RLHF Part II: Reward Modeling" — 136 cites, core RLHF reference
    RMB (ICLR 2025) — comprehensive reward model benchmarking
    Gaps
    No personality, humor, or creativity-specific reward modeling work
    EH

    Eric Hambro

    medium hireability

    Member of Technical Staff@Anthropic

    Previously: Research Engineer @ Meta

    London, GB

    56
    RLHF / RLVR: 88
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 20
    Strengths
    "Teaching LLMs to Reason with RL" (Anthropic 2024) — core RLHF/RLVR paper
    "Understanding RLHF Effects on LLM Generalisation" (2023) — reward model analysis
    Gaps
    No explicit personality, tone, or humor reward modeling papers
    ET

    Eric Tang

    medium hireability

    Software Engineer - LLMs@Anyscale

    Previously: Research Engineering Intern @ DeepMind

    San Francisco, US

    37
    RLHF / RLVR: 82
    Evals & Reward Models: 50
    Synthetic Data & Self-Play: 38
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    180 PRs to NovaSky-AI/SkyRL — core RLVR framework contributor
    DAPO on Qwen3.5-35B-A3B — exact model class Neuralace is post-training
    Gaps
    No personality/RLAIF/non-verifiable reward work — pure RLVR focus
    EH

    Ermo Hua

    medium hireability

    PhD student@Tsinghua University

    47
    RLHF / RLVR: 82
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 22
    LLM Creativity: 10
    Strengths
    TTRL: test-time RL on unlabeled data; 211% Qwen-2.5-Math improvement on AIME 2024
    CoGenesis (ACL 2024): small+large LLM routing — exact match for tool-call routing vision
    Gaps
    No published work on personality, humor, or non-verifiable reward design
    ED

    Esin DURMUS

    medium hireability

    Research Scientist@Anthropic

    Previously: Postdoctoral Scholar @ Stanford University

    San Francisco, US

    51
    Personality & Non-Verifiable Rewards: 90
    Evals & Reward Models: 88
    RLHF / RLVR: 55
    Synthetic Data & Self-Play: 12
    LLM Creativity: 10
    Strengths
    "Sycophancy in LLMs" (476 cites) — non-verifiable reward signal research
    "Collective Constitutional AI" — RLAIF for public values alignment
    Gaps
    Focus is values/safety evals, not RL training mechanics (PPO/DPO/GRPO)
    EF

    Evan Frick

    medium hireability

    Member of Technical Staff@LMArena

    Previously: Research Engineer @ Nexusflow

    65
    Evals & Reward Models: 92
    RLHF / RLVR: 88
    Synthetic Data & Self-Play: 72
    Personality & Non-Verifiable Rewards: 65
    LLM Creativity: 10
    Strengths
    Starling-7B RLAIF post-training — helpfulness/harmlessness, 140 citations
    Nectar: large-scale AI-feedback preference dataset for reward modeling
    Gaps
    No evidence of persona-conditioned synthetic data or self-play pipelines
    FB

    Faeze Brahman

    medium hireability

    Research Scientist@Allen Institute for AI

    Previously: Research Intern @ Microsoft

    Seattle, US

    77
    LLM Creativity: 88
    Evals & Reward Models: 82
    Personality & Non-Verifiable Rewards: 80
    RLHF / RLVR: 78
    Synthetic Data & Self-Play: 58
    Strengths
    Tulu 3 co-author — full SFT+DPO+RLVR post-training pipeline
    Trust or Escalate: ICLR 2025 oral, LLM judges for eval
    Gaps
    No explicit persona-conditioned self-play or rollout ranking work
    FZ

    Fan Zhou

    medium hireability

    1st year PhD student@Shanghai Jiao Tong University

    Previously: MS student @ Shanghai Jiao Tong University

    Shanghai, CN

    52
    Synthetic Data & Self-Play: 82
    RLHF / RLVR: 78
    Evals & Reward Models: 60
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 12
    Strengths
    Qwen3-Coder + Qwen3.5 contributor — direct target-model family experience
    OctoThinker (ICML 2025) — RL scaling via mid-training, rigorous ablations
    Gaps
    RLVR work focused on verifiable domains (math, code) — consumer personality axis thin
    FM

    Fei Mi

    medium hireability

    Principal Research Scientist@Huawei

    Previously: Sr Director, Sys Arch @ Huawei

    Shenzhen, CN

    55
    Synthetic Data & Self-Play: 72
    Evals & Reward Models: 70
    RLHF / RLVR: 52
    Personality & Non-Verifiable Rewards: 50
    LLM Creativity: 32
    Strengths
    'A Synthetic Data Generation Framework for Grounded Dialogues' — ACL 2023, direct match
    'One Cannot Stand for Everyone!' — user simulator training, persona-conditioned data gen
    Gaps
    No explicit PPO/DPO/GRPO work; alignment approach is SFT-based (mistake analysis), not RL-based
    FF

    Felipe Vieira Frujeri

    medium hireability

    AI Researcher@NVIDIA

    Previously: Staff AI Researcher @ Vatic Labs

    Seattle, US

    44
    RLHF / RLVR: 80
    Evals & Reward Models: 62
    Personality & Non-Verifiable Rewards: 50
    Synthetic Data & Self-Play: 20
    LLM Creativity: 10
    Strengths
    APA paper: Advantage-Induced Policy Alignment (2024, 48 cit.) — RLHF post-training
    RLHF/RLAIF alignment on OpenAI core models at Microsoft Azure AI
    Gaps
    No synthetic data / self-play pipeline work evident
    FX

    Frank F. Xu

    medium hireability

    Member of Technical Staff@Microsoft

    Previously: Graduate Research Assistant @ Carnegie Mellon University

    San Francisco, US

    24
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 35
    RLHF / RLVR: 10
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    WebArena (ICLR 2024) — verifiable web-agent eval benchmark
    TheAgentCompany — real-world agentic task benchmark for LLMs
    Gaps
    No RLHF/PPO/DPO/GRPO post-training work published
    GL

    Gang Li

    medium hireability

    Research Scientist@Orby AI

    Previously: Senior Software Engineer @ DeepMind

    San Francisco, US

    29
    RLHF / RLVR: 82
    Evals & Reward Models: 40
    Synthetic Data & Self-Play: 15
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DRPO (ICLR 2026): GRPO variant, decoupled reward policy optimization for LLMs
    DisCO (NeurIPS 2025): verifiable-reward RL for LLM reasoning
    Gaps
    No evidence of personality, tone, or non-verifiable reward modeling work
    GC

    Ganqu Cui

    medium hireability

    Research Scientist@Shanghai AI Laboratory

    Previously: PhD student @ Tsinghua University

    Shanghai, CN

    60
    RLHF / RLVR: 95
    Evals & Reward Models: 88
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 10
    Strengths
    ULTRAFEEDBACK (664 cites) — foundational AI preference data generation
    OpenRLHF contributor — production RLHF training framework
    Gaps
    No explicit work on personality, humor, or creative writing tasks
    GC

    Geoffrey Cideron

    medium hireability

    Research Engineer@Google

    Previously: Research intern @ Meta

    Paris, FR

    60
    RLHF / RLVR: 88
    Evals & Reward Models: 82
    Personality & Non-Verifiable Rewards: 75
    LLM Creativity: 35
    Synthetic Data & Self-Play: 18
    Strengths
    WARM (ICML 2024, 103 cites): reward model weight averaging, anti-reward-hacking
    BOND (2025): Best-of-N distillation for LLM alignment — RLHF at Google scale
    Gaps
    No published work on synthetic data pipelines or self-play for LLMs
    GZ

    Ge Zhang

    medium hireability

    Principal Product Manager - AI@ByteDance

    Previously: Sr. Product Manager - Core AI @ eBay

    San Francisco, US

    65
    Evals & Reward Models: 85
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 58
    LLM Creativity: 52
    Personality & Non-Verifiable Rewards: 50
    Strengths
    ReTool (147 citations): RL for when/how LLMs call tools — core JD skill
    VideoScore: reward model for non-verifiable human feedback on video generation
    Gaps
    Limited direct work on personality/humor/tone alignment (COIG-P is adjacent)
    GH

    Guanhua Huang

    medium hireability

    Algorithm Engineer@Tencent

    Previously: Research Intern @ Tencent Hunyuan

    Beijing, CN

    30
    RLHF / RLVR: 75
    Evals & Reward Models: 40
    Synthetic Data & Self-Play: 20
    Personality & Non-Verifiable Rewards: 10
    LLM Creativity: 5
    Strengths
    "Low-probability Tokens in RLVR" — core RLVR exploration paper (2025)
    "RL on Pre-Training Data" — RL-guided training data selection (2025)
    Gaps
    No work on personality, humor, or non-verifiable reward modeling
    HL

    Haipeng Luo

    medium hireability

    Intern@Tencent

    Previously: Intern @ Microsoft


    50
    Synthetic Data & Self-Play: 80
    RLHF / RLVR: 75
    Evals & Reward Models: 70
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 5
    Strengths
    WizardMath first author — RLVR for math reasoning, 565 citations
    Arena Learning — post-training data flywheel via simulated self-play arena
    Gaps
    No work on personality, tone, creativity, or consumer LLM directions
    HM

    Haitao Mi

    medium hireability

    Head of Language Intelligence Research Group, Tencent AI Lab@Tencent

    Previously: Staff Engineer @ Ant Financial

    US

    62
    RLHF / RLVR: 90
    Synthetic Data & Self-Play: 88
    Evals & Reward Models: 72
    Personality & Non-Verifiable Rewards: 42
    LLM Creativity: 18
    Strengths
    'Crossing the Reward Bridge' — verifiable-reward RL across domains (2025)
    'Scaling Synthetic Data with 1B Personas' — persona SFT pipelines at scale
    Gaps
    No direct work on personality, humor, tone, or non-verifiable reward shaping
    HA

    Haitham Bou Ammar

    medium hireability

    Senior Principal Scientist - ML Tech Leader@Noah's Ark Lab

    Previously: Technology Expert & Advisor @ Sanome

    Cambridge, GB

    45
    RLHF / RLVR: 82
    Evals & Reward Models: 75
    Personality & Non-Verifiable Rewards: 35
    Synthetic Data & Self-Play: 25
    LLM Creativity: 10
    Strengths
    Group Robust Preference Optimization (reward-free RLHF, 2024) — direct axis hit
    Bayesian Reward Models for LLM Alignment — reward model architecture expertise
    Gaps
    No evidence of personality/tone/creativity reward modeling work
    HI

    Hamish Ivison

    medium hireability

    PhD student@University of Washington

    Previously: Predoctoral Young Investigator @ Allen Institute for AI

    Seattle, US

    65
    RLHF / RLVR95
    Evals & Reward Models78
    Synthetic Data & Self-Play70
    Personality & Non-Verifiable Rewards55
    LLM Creativity25
    Strengths
    328 commits to allenai/open-instruct — core Tulu RLVR+DPO pipeline builder
    Tulu 3 (496 cites) — flagship open post-training paper, RLVR + SFT
    Gaps
    No explicit work on personality adherence, humor, or creative writing
    HZ

    Hang Zhang

    medium hireability

    Researcher@Alibaba

    Previously: PhD student @ Sichuan University

    CN

    13
    Evals & Reward Models25
    Synthetic Data & Self-Play20
    RLHF / RLVR10
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    Qwen2.5-VL co-author — PDF/doc/PPT understanding (knowledge post-training use case)
    VideoLLaMA: instruction-tuned multimodal LLM — applied SFT post-training at scale
    Gaps
    No RLHF, DPO, PPO, or GRPO post-training papers
    HD

    Hanjun Dai

    medium hireability

    Researcher@Precur AI

    Previously: Research Manager @ Google

    San Francisco, US

    41
    RLHF / RLVR85
    Evals & Reward Models60
    Synthetic Data & Self-Play30
    Personality & Non-Verifiable Rewards20
    LLM Creativity10
    Strengths
    'Value-Incentivized Preference Optimization' — unified online/offline RLHF (68 citations)
    Matryoshka Pilot — small LLM orchestrating large LLM (directly mirrors JD tool-call vision)
    Gaps
    No consumer/personality post-training work — focus on verifiable/structured rewards
    HC

    Hao Cheng

    medium hireability

    Data & Applied Scientist II@Microsoft

    Previously: Senior Data Scientist @ Johnson & Johnson

    New York, US

    60
    RLHF / RLVR82
    Evals & Reward Models70
    Synthetic Data & Self-Play65
    Personality & Non-Verifiable Rewards55
    LLM Creativity30
    Strengths
    RL for reasoning in LLMs with one training example — NeurIPS 2025, direct RLVR
    CollabLLM ICML 2025 oral — agentic post-training, active collaborator design
    Gaps
    No explicit DPO/PPO/GRPO preference optimization papers found
    HY

    Hao Yu

    medium hireability

    PhD student@THU

    Beijing, CN

    33
    Evals & Reward Models75
    RLHF / RLVR55
    Synthetic Data & Self-Play28
    Personality & Non-Verifiable Rewards5
    LLM Creativity3
    Strengths
    UI-TARS-2: multi-turn RLVR applied to GUI computer-use agents
    AgentBench (ICLR 2024, 795 citations) — top-tier agent eval benchmark
    Gaps
    No consumer post-training work — personality, humor, creativity entirely absent
    HG

    Hongyi Guo

    medium hireability

    Research Scientist@ByteDance

    Previously: Research Intern @ ByteDance

    San Francisco, US

    51
    RLHF / RLVR92
    Synthetic Data & Self-Play72
    Evals & Reward Models65
    Personality & Non-Verifiable Rewards20
    LLM Creativity8
    Strengths
    "Provably Mitigating Overoptimization in RLHF" — NeurIPS 2024, 86 citations
    BRiTE (ICML 2025): RLVR for bootstrapped thinking/reasoning
    Gaps
    No personality/creativity/non-verifiable reward work — consumer post-training gap
    HL

    Hsien-chin Lin

    medium hireability

    Postdoctoral Researcher@Heinrich Heine University Düsseldorf

    Previously: PhD student @ Heinrich Heine University Düsseldorf

    Düsseldorf, DE

    45
    Synthetic Data & Self-Play70
    Personality & Non-Verifiable Rewards65
    RLHF / RLVR50
    Evals & Reward Models35
    LLM Creativity5
    Strengths
    2025 paper on post-training LLMs via RL self-feedback — core JD topic
    RLSF (2024): RL from self-feedback applied to LLM reasoning
    Gaps
No PPO/DPO/GRPO work — RL experience is at dialogue-policy scale, not yet at LLM post-training scale
    IO

    Ian Osband

    medium hireability

    Research Scientist@Google

    Previously: Member of Technical Staff @ OpenAI

    London, GB

    33
    RLHF / RLVR65
    Evals & Reward Models60
    Personality & Non-Verifiable Rewards20
    Synthetic Data & Self-Play15
    LLM Creativity5
    Strengths
    GPT-4o + o1 system cards — direct OpenAI post-training involvement
    ChatGPT data flywheel — applied RLHF/post-training pipeline at OpenAI
    Gaps
    Primary published research is exploration/uncertainty, not LLM post-training specifically
    IK

    Ilia Kulikov

    medium hireability

    Research Scientist@Meta

    Previously: Research Assistant @ Courant Institute of Mathematical Sciences

    New York, US

    62
    Synthetic Data & Self-Play88
    RLHF / RLVR85
    Evals & Reward Models85
    Personality & Non-Verifiable Rewards30
    LLM Creativity22
    Strengths
    Self-Taught Evaluators (2025): reward models from unlabeled data, outperforms GPT-4
    Diverse Preference Optimization (2025): novel DPO variant for RLHF
    Gaps
    No direct personality, humor, or creativity reward modeling work
    JH

    Jacob Hilton

    medium hireability

    executive director@Alignment Research Center

    Previously: Researcher @ OpenAI

    62
    RLHF / RLVR97
    Evals & Reward Models90
    Personality & Non-Verifiable Rewards68
    Synthetic Data & Self-Play32
    LLM Creativity22
    Strengths
    InstructGPT co-author (18K citations) — foundational RLHF pioneer
    Scaling Laws for Reward Model Overoptimization — core reward model research
    Gaps
    No explicit personality/creativity post-training work
    JW

    Jeffrey Wu

    medium hireability

    PhD Student@Anthropic AI, OpenAI

    Previously: Undergraduate Researcher @ Berkeley Artificial Intelligence Research

    New York, US

    61
    RLHF / RLVR97
    Evals & Reward Models82
    Personality & Non-Verifiable Rewards78
    Synthetic Data & Self-Play25
    LLM Creativity22
    Strengths
    InstructGPT (2022) co-author — defined modern PPO-RLHF paradigm
    'Learning to summarize with human feedback' (2020) — foundational RLHF paper
    Gaps
    No published work on synthetic data generation or self-play pipelines
    JC

    Jiacheng Chen

    medium hireability

    PhD student@The Chinese University of Hong Kong

    Previously: Visiting Researcher @ Caltech

    Hong Kong, HK

    29
    RLHF / RLVR75
    Evals & Reward Models42
    Synthetic Data & Self-Play18
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    'Entropy Mechanism of RL for Reasoning LMs' — core RLVR theory paper
    P1: Physics Olympiad RL — complex verifiable-reward training
    Gaps
    No work on personality, non-verifiable rewards, or creative LLM outputs
    JC

    Jiale Cheng

    medium hireability

    PhD Student@University of Michigan, Ann Arbor

    Ann Arbor, US

    73
    Evals & Reward Models90
    Synthetic Data & Self-Play85
    RLHF / RLVR82
    Personality & Non-Verifiable Rewards68
    LLM Creativity42
    Strengths
    SPaR: self-play + tree-search refinement for instruction following (2024)
    VisionReward: multi-dim preference learning — direct reward modeling work
    Gaps
    No direct work on tool-call / agentic / computer-use post-training
    JJ

    Jiaming Ji

    medium hireability

    PhD Student@Peking University

    Beijing, CN

    49
    RLHF / RLVR90
    Evals & Reward Models78
    Synthetic Data & Self-Play40
    Personality & Non-Verifiable Rewards30
    LLM Creativity8
    Strengths
    Safe RLHF paper (521 citations) — ICLR, reward-constrained value alignment
    BeaverTails (662 citations) — large-scale human-preference dataset for RLHF
    Gaps
    No personality, humor, or creativity post-training work
    JY

    Jianxin Yang

    medium hireability

    Principal Researcher@Alibaba

    Previously: NLP Algorithm Engineer @ Tencent

    Hangzhou, CN

    45
    RLHF / RLVR80
    Synthetic Data & Self-Play55
    Evals & Reward Models40
    LLM Creativity25
    Personality & Non-Verifiable Rewards25
    Strengths
    Firefly: full SFT+DPO training framework for Qwen2.5 and 30+ LLMs
    NeurIPS 2025 RLVR paper on high-entropy token selection strategy
    Gaps
    No explicit work on non-verifiable rewards, personality modeling, or creative writing
    JZ

    Jiayi Zhou

    medium hireability

    PhD student@Peking University

    Previously: Researcher @ Peking University

    47
    RLHF / RLVR85
    Evals & Reward Models82
    Personality & Non-Verifiable Rewards38
    Synthetic Data & Self-Play22
    LLM Creativity10
    Strengths
    Seq2Seq Reward Modeling (AAAI 2025 Oral) — language feedback reward models
    56 commits to PKU-Alignment/align-anything — RLHF training infrastructure
    Gaps
    No consumer post-training work — no personality, humor, or creativity modeling
    JF

    Jiazhan Feng

    medium hireability

    Research Scientist@ByteDance

    Previously: Research Intern @ Microsoft

    Oxford, GB

    64
    RLHF / RLVR80
    LLM Creativity65
    Synthetic Data & Self-Play65
    Evals & Reward Models60
    Personality & Non-Verifiable Rewards50
    Strengths
    ReTool: RLVR for tool use on Qwen2.5-32B — exact JD fit
    UI-TARS-2: multi-turn RL for GUI/computer use
    Gaps
Primary tool-use RL work targets math/code reasoning — consumer personality RL is less evidenced
    JF

    Jie Fu

    medium hireability

    Research Scientist@Shanghai AI Lab

    Previously: Visiting Scholar @ The Hong Kong University of Science and Technology

    Shanghai, CN

    62
    Evals & Reward Models80
    Personality & Non-Verifiable Rewards75
    LLM Creativity60
    RLHF / RLVR50
    Synthetic Data & Self-Play45
    Strengths
    RoleLLM (365 citations) — benchmarks & elicits role-playing, personality adherence
    ChatEval (783 citations) — multi-agent LLM evaluator framework (evals axis)
    Gaps
No dedicated PPO/DPO/GRPO post-training paper — RLVR coverage is indirect, via survey work
    JM

    Jincheng Mei

    medium hireability

    senior research scientist@DeepMind

    Previously: research scientist @ Google

    London, GB

    26
    RLHF / RLVR78
    Evals & Reward Models30
    Personality & Non-Verifiable Rewards10
    Synthetic Data & Self-Play8
    LLM Creativity5
    Strengths
    VPO (ICLR 2025): unified online+offline RLHF via value-incentivized preference opt
    Faster WIND (AISTATS 2025): accelerated iterative BoN distillation for alignment
    Gaps
    No work on personality, humor, or subjective/non-verifiable reward modeling
    JS

    Jing Shao

    medium hireability

    Young Research Scientist, Group Leader@Shanghai AI Laboratory

    Previously: Research Director @ SenseTime

    CN

    57
    RLHF / RLVR85
    Evals & Reward Models80
    Personality & Non-Verifiable Rewards65
    Synthetic Data & Self-Play45
    LLM Creativity10
    Strengths
    HarmRLVR (2025) — RLVR with verifiable rewards for LLM safety alignment
    Multi-Objective DPO (2024, 70 citations) — multi-preference optimization
    Gaps
    Safety/harmlessness focus — limited work on consumer personality or creativity
    JF

    Johan Ferret

    medium hireability

    Research Scientist@DeepMind

    Previously: PhD Candidate @ Inria

    Paris, FR

    63
    RLHF / RLVR92
    Evals & Reward Models78
    Synthetic Data & Self-Play60
    Personality & Non-Verifiable Rewards58
    LLM Creativity25
    Strengths
    RLAIF (958 citations) — scaling RLHF with AI feedback, first key contributor
    WARM: reward model averaging reduces reward hacking (115 citations)
    Gaps
    No explicit personality / humor / creativity-focused post-training work
    JH

    Johannes Heidecke

    medium hireability

    Research Engineer@OpenAI

    Previously: AI Safety Analyst @ OpenAI

    Barcelona, ES

    57
    RLHF / RLVR85
    Evals & Reward Models78
    Personality & Non-Verifiable Rewards70
    Synthetic Data & Self-Play30
    LLM Creativity22
    Strengths
    Rule-Based Rewards for LM Safety (2024, 71 cites) — reward modeling for post-training alignment
    Diverse Red Teaming with Auto-generated Rewards + Multi-step RL — RL reward engineering
    Gaps
    No explicit synthetic conversation dataset or self-play pipeline work found
    JC

    Jon Ander Campos

    medium hireability

    Staff Member of Technical Staff@Cohere

    Previously: Senior Member of Technical Staff @ Cohere

    San Francisco, US

    61
    RLHF / RLVR82
    Evals & Reward Models75
    Personality & Non-Verifiable Rewards72
    Synthetic Data & Self-Play48
    LLM Creativity28
    Strengths
    Post-Training Lead at Cohere — production post-training role
    "Learning from Natural Language Feedback" (214 cit) — core RLHF work
    Gaps
    No explicit self-play or persona-conditioned data generation work
    JT

    Jonathan Tow

    medium hireability
    61
    RLHF / RLVR85
    Synthetic Data & Self-Play75
    Personality & Non-Verifiable Rewards65
    Evals & Reward Models45
    LLM Creativity35
    Strengths
    51 commits to CarperAI/trlx — 2nd contributor, PPO/ILQL RLHF practitioner
    2nd author on StableLM 2 1.6B tech report — core Stability AI researcher
    Gaps
    Location unconfirmed — no geography signal on GitHub or website
    JG

    Jules Gagnon-Marchand

    medium hireability
    41
    RLHF / RLVR72
    Synthetic Data & Self-Play60
    Evals & Reward Models58
    Personality & Non-Verifiable Rewards8
    LLM Creativity5
    Strengths
    Marg-Li-CoT: active RLVR + rejection sampling research repo (2024–2025)
    Multi-GPU RLVR training (ray train, 8 GPUs, Slurm) — production infra
    Gaps
No published papers on RLVR/CoT — research unpublished as of qualification date
    JM

    Julian Michael

    medium hireability

    Researcher on AI safety, evaluation, and alignment@Meta

    Previously: Head of the Safety, Evaluations, and Alignment Lab (SEAL) @ Scale AI

    22
    Evals & Reward Models72
    Personality & Non-Verifiable Rewards15
    RLHF / RLVR12
    LLM Creativity5
    Synthetic Data & Self-Play5
    Strengths
    GPQA benchmark (1.5K citations) — defines grad-level eval standard
    SuperGLUE co-creator (3K citations) — foundational NLU evals
    Gaps
    No RLHF/RLVR training work — purely evaluator, not trainer
    JD

    Juntao Dai

    medium hireability

    Researcher@Peking University

    Previously: PhD student @ Zhejiang University

    Hangzhou, CN

    51
    RLHF / RLVR85
    Evals & Reward Models74
    Synthetic Data & Self-Play58
    Personality & Non-Verifiable Rewards35
    LLM Creativity5
    Strengths
    Safe RLHF paper (538 cites) — constrained RLHF framework co-author
    BeaverTails (665 cites) — human-preference dataset at scale
    Gaps
    Safety-focused alignment — personality/creativity/tone work minimal
    KW

    Kaile Wang

    medium hireability

    Undergrad student@Peking University

    Beijing, CN

    38
    RLHF / RLVR80
    Evals & Reward Models60
    Synthetic Data & Self-Play25
    Personality & Non-Verifiable Rewards20
    LLM Creativity5
    Strengths
    Merged PR: RLOO/REINFORCE/GROUP_NORM in PPO for align-anything (Mar 2025)
    lmm-r1: OpenRLHF extension for multimodal DeepSeek-R1 — direct RLVR work
    Gaps
    No evidence of personality/humor/non-verifiable reward work (consumer post-training axis)
    KS

    Kaitao Song

    medium hireability

    Senior Researcher@Microsoft

    Previously: Researcher @ Microsoft

    Shanghai, CN

    47
    RLHF / RLVR65
    Evals & Reward Models65
    Personality & Non-Verifiable Rewards50
    LLM Creativity35
    Synthetic Data & Self-Play20
    Strengths
    HuggingGPT (1,623 cites) — orchestrating model calls from smaller front-end LLM
    Conditional Reward Modeling for LLM Reasoning (2025) — RLVR post-training
    Gaps
    No large-scale SFT / self-play data pipeline evidence
    KN

    Kamal Ndousse

    medium hireability

    Member Of Technical Staff@Anthropic

    Previously: Member of Technical Staff @ Stealth Co

    San Francisco, US

    64
    RLHF / RLVR95
    Personality & Non-Verifiable Rewards85
    Evals & Reward Models75
    Synthetic Data & Self-Play35
    LLM Creativity30
    Strengths
    HH-RLHF co-author (2022, 3K citations) — defining RLHF post-training paper
    Constitutional AI co-author — RLAIF for harmlessness, non-verifiable reward shaping
    Gaps
    Limited synthetic data / self-play pipeline experience in public record
    KH

    Kourosh Hakhamaneshi

    medium hireability

    Team Lead (AI)@Anyscale

    Previously: Software Engineer (RL and ML) @ Anyscale

    San Francisco, US

    28
    RLHF / RLVR68
    Evals & Reward Models32
    Synthetic Data & Self-Play25
    Personality & Non-Verifiable Rewards10
    LLM Creativity5
    Strengths
    SkyRL-v0: RLVR for long-horizon LLM agents at Anyscale (2025)
    "LLMs Learn to Reason from Demonstrations" — 58 citations, 2025
    Gaps
    No work on personality, tone, humor, or non-verifiable reward modeling
    KZ

    Kunlun Zhu

    medium hireability

    Graduate Student@University of Illinois Urbana-Champaign

    Previously: Research Assistant @ Tsinghua University

    US

    35
    RLHF / RLVR68
    Evals & Reward Models65
    Synthetic Data & Self-Play30
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    ToolLLM (ICLR 2024, 984 cit.) — core tool-use post-training benchmark
    OpenManus-RL — live RL tuning project for LLM agents (pinned repo)
    Gaps
    No work on personality/creativity or non-verifiable reward modeling
    LS

    Lei Shu

    medium hireability

    Staff Research Scientist@DeepMind

    Previously: Senior Research Scientist @ DeepMind

    Seattle, US

    35
    RLHF / RLVR65
    Evals & Reward Models65
    Personality & Non-Verifiable Rewards20
    LLM Creativity15
    Synthetic Data & Self-Play10
    Strengths
    'Automated Process Supervision' (ICLR 2025) — verifiable reward / process RM
    'Critique Ability of LLMs' (ICLR 2024) — LLM eval and reward modeling
    Gaps
    No work on personality, humor, or non-verifiable reward shaping
    LG

    Leo Gao

    medium hireability

    Researcher@OpenAI

    Previously: Researcher @ EleutherAI

    56
    Evals & Reward Models90
    RLHF / RLVR80
    Synthetic Data & Self-Play40
    LLM Creativity35
    Personality & Non-Verifiable Rewards35
    Strengths
    Creator of lm-evaluation-harness — the dominant LLM eval framework
    Scaling Laws for Reward Model Overoptimization — core RM scaling research
    Gaps
    No direct RLHF fine-tuning or PPO/DPO training pipeline work visible
    LE

    Leon Ericsson

    medium hireability
    40
    RLHF / RLVR78
    Evals & Reward Models48
    Synthetic Data & Self-Play42
    Personality & Non-Verifiable Rewards22
    LLM Creativity10
    Strengths
    29 commits to huggingface/trl — active RL post-training contributor
    Feb 2026 blog: geometric view of OPSM/PPO off-policy masking — original technical work
    Gaps
    No personality/creative training work — purely technical RL focus
    LL

    Lewei Lu

    medium hireability

    Senior Research Director@SenseTime

    Previously: Senior Researcher @ SenseTime

    Beijing, CN

    50
    RLHF / RLVR82
    Evals & Reward Models78
    Synthetic Data & Self-Play52
    Personality & Non-Verifiable Rewards22
    LLM Creativity15
    Strengths
    VisualPRM (2025): process reward model for multimodal reasoning — core RLVR signal
    Mixed Preference Optimization (2024, 139 cites): DPO-style post-training for MLLMs
    Gaps
    No personality/creativity reward work — focus is on verifiable reasoning rewards
    LT

    Lewis Tunstall

    medium hireability

    Machine Learning Engineer@Hugging Face

    Previously: Senior Data Scientist @ Swisscom

    Bern, CH

    57
    RLHF / RLVR97
    Evals & Reward Models72
    Synthetic Data & Self-Play55
    Personality & Non-Verifiable Rewards40
    LLM Creativity22
    Strengths
    TRL library creator (480 citations) — de facto RLHF/DPO/GRPO toolchain
    Open-R1: led HF's DeepSeek-R1 RLVR reproduction with GRPO
    Gaps
    Limited work on consumer post-training (personality, humor, creativity, non-VR rewards)
    LS

    Linxin Song

    medium hireability

    Ph.D. Student@University of Southern California

    Previously: Research Intern @ Salesforce Research

    Los Angeles, US

    53
    RLHF / RLVR72
    Evals & Reward Models70
    Synthetic Data & Self-Play68
    Personality & Non-Verifiable Rewards35
    LLM Creativity18
    Strengths
    ExeVRM: reward modeling for computer-use agents — exact RLVR fit
    Efficient RL Finetuning via Adaptive Curriculum (32 citations, Apr 2025)
    Gaps
    No creative writing / roleplay / humor work — weak on consumer personality axis
    LD

    Lisa Dunlap

    medium hireability

    PhD student@UC Berkeley

    Previously: core contributor @ Chatbot Arena

    28
    Evals & Reward Models70
    Personality & Non-Verifiable Rewards50
    LLM Creativity10
    RLHF / RLVR5
    Synthetic Data & Self-Play5
    Strengths
    VibeCheck (ICLR 2025): measures qualitative LLM traits like tone, humor, personality
    VisionArena: 230K VLM conversations with human preference labels
    Gaps
    No post-training / RLHF / DPO experience — eval-only researcher
    LC

    Longze Chen

    medium hireability

    PhD student@Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

    Previously: Undergrad student @ Shandong University

    Shenzhen, CN

    56
    RLHF / RLVR85
    Evals & Reward Models75
    Synthetic Data & Self-Play72
    LLM Creativity30
    Personality & Non-Verifiable Rewards20
    Strengths
    "Implicit Actor Critic Coupling for RLVR" — PACS framework, +8-9% on math benchmarks
    "Learning Ordinal Probabilistic Reward from Preferences" (2026) — novel reward modeling
    Gaps
Personality/non-verifiable reward work absent — persona work targets math, not conversation
    LA

    Loubna Ben Allal

    medium hireability

    Research Engineer@Hugging Face

    Previously: Member of the core team @ BigCode

    Paris, FR

    27
    Evals & Reward Models55
    Synthetic Data & Self-Play50
    RLHF / RLVR20
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    SmolLM2 COLM 2025 spotlight — data-centric SFT training pipeline end-to-end
    Cosmopedia: persona-conditioned synthetic data at scale (SFT data relevance)
    Gaps
    Pre-training focused, not a post-training RLHF/RLVR specialist
    MF

    Marzieh Fadaee

    medium hireability

    Head of Cohere Labs@Cohere

    Previously: Staff Research Scientist @ Cohere

    Amsterdam, NL

    74
    RLHF / RLVR92
    Evals & Reward Models88
    Synthetic Data & Self-Play88
    Personality & Non-Verifiable Rewards72
    LLM Creativity30
    Strengths
    "Back to Basics" REINFORCE/RLHF paper — 431 citations, direct RLHF methodology
    Leads Cohere Labs — runs post-training research across instruction tuning and alignment
    Gaps
    LLM creativity/roleplay/humor — no published work in this direction
    MN

    Matvei Novikov

    medium hireability

    Senior Deep Learning Software Engineer@NVIDIA

    Previously: Senior Deep Learning Software Engineer @ NVIDIA

    San Francisco, US

    44
    RLHF / RLVR70
    Synthetic Data & Self-Play70
    Evals & Reward Models55
    Personality & Non-Verifiable Rewards15
    LLM Creativity10
    Strengths
    Nemotron-CrossThink: RL + synthetic/real data self-learning across diverse domains
    Llama-Nemotron: SFT + large-scale RL post-training for efficient reasoning models
    Gaps
    No evidence of personality, tone, humor, or non-verifiable reward modeling
    M⍼

    max ⍼

    medium hireability
    57
    RLHF / RLVR90
    Personality & Non-Verifiable Rewards70
    Evals & Reward Models65
    Synthetic Data & Self-Play35
    LLM Creativity25
    Strengths
    #1 CarperAI/trlx contributor — 71 merged PRs on RLHF library
    Built RFT trainer (ReST/RFT paper) in trlx — post-training SFT pipeline
    Gaps
    No location confirmed — timezone suggests Europe but not verified
    MM

    Maximilian Mozes

    medium hireability

    Team Lead, Post-Training@Cohere

    Previously: Senior Research Scientist @ Cohere

    London, GB

    51
    RLHF / RLVR82
    Evals & Reward Models62
    Personality & Non-Verifiable Rewards50
    Synthetic Data & Self-Play35
    LLM Creativity25
    Strengths
    "Reverse Engineering Human Preferences with RL" — NeurIPS 2025 Spotlight
    Cohere Post-Training Team Lead — production post-training at scale
    Gaps
    No specific papers on personality modeling or non-verifiable reward design
    MT

    Meg Tong

    medium hireability

    Member of Technical Staff@LLM company

    Previously: Researcher @ Various research organisations

    San Francisco, US

    51
    Personality & Non-Verifiable Rewards82
    Evals & Reward Models72
    RLHF / RLVR70
    Synthetic Data & Self-Play18
    LLM Creativity15
    Strengths
    Sycophancy paper (ICLR 2024) — RLHF reward hacking, core non-verifiable rewards research
    Constitutional Classifiers (2025) — reward model for subjective safety/personality criteria
    Gaps
    No synthetic data generation or self-play pipeline experience
    ML

    Mingdao Liu

    medium hireability

    PhD student@Tsinghua University

    Beijing, CN

    42
    RLHF / RLVR65
    Synthetic Data & Self-Play65
    Evals & Reward Models50
    Personality & Non-Verifiable Rewards20
    LLM Creativity10
    Strengths
    GLM-4.5 Agentic (2025) — core author on post-trained tool-use model
    ChatGLM SFT + RLHF pipeline: multi-stage post-training at production scale
    Gaps
    No dedicated work on personality, tone, or non-verifiable reward modeling
    MW

    Minzheng Wang

    medium hireability

    Ph.D student@Institute of Automation, Chinese Academy of Sciences

    Previously: Research Intern @ Alibaba

    Beijing, CN

    54
    RLHF / RLVR72
    Synthetic Data & Self-Play65
    Evals & Reward Models60
    Personality & Non-Verifiable Rewards45
    LLM Creativity30
    Strengths
    AMPO (ICLR 2026): RL policy optimization for social agents, scores 8/8/8/6
    BuPO (co-first): bottom-up policy optimization in LLMs
    Gaps
No direct RLHF/preference-optimization (PPO, DPO, GRPO) work published yet
    NS

    Nicholas Schiefer

    medium hireability

    Member of Technical Staff@Anthropic

    Previously: Resident Member of Technical Staff @ Anthropic

    San Francisco, US

    60
    Personality & Non-Verifiable Rewards90
    RLHF / RLVR78
    Evals & Reward Models75
    Synthetic Data & Self-Play40
    LLM Creativity15
    Strengths
    Constitutional AI (RLAIF) co-author — 2,629 citations, defines non-verifiable reward training
    Sycophancy paper (582 citations) — personality alignment via preference optimization
    Gaps
    No evidence of tool-use / computer-use RLVR or agentic post-training work
    NM

    Niklas Muennighoff

    medium hireability

    AI Research@Meta

    Previously: AI Research @ Ai2

    US

    57
    Evals & Reward Models90
    RLHF / RLVR78
    Synthetic Data & Self-Play55
    Personality & Non-Verifiable Rewards40
    LLM Creativity20
    Strengths
    MODPO (823 citations) — direct preference optimization / reward modeling work
    OctoPack: code instruction tuning for LLMs (348 citations)
    Gaps
    No direct work on personality adherence, humor, or creative roleplay
    ND

    Nouha Dziri

    medium hireability

    Research Scientist@Allen Institute for AI

    Previously: Postdoc @ Allen Institute for AI

    US

    67
    Evals & Reward Models95
    RLHF / RLVR90
    Personality & Non-Verifiable Rewards65
    Synthetic Data & Self-Play50
    LLM Creativity35
    Strengths
    RewardBench (417 cites) — definitive reward model evaluation framework
    Tulu 3 (320 cites) — allenai's flagship RLHF/post-training pipeline co-author
    Gaps
    No specific RLVR/verifiable-reward work (code/tool-use RL)
    OP

    Olivier Pietquin

    medium hireability

    Chief Scientist@Earth Species Project

    Previously: Director, Reinforcement Learning and Interaction Research @ Cohere

    Lille, FR

    63
    RLHF / RLVR95
    Evals & Reward Models75
    Personality & Non-Verifiable Rewards70
    LLM Creativity45
    Synthetic Data & Self-Play30
    Strengths
    'Back to Basics' (2024, 481 citations) — redefining REINFORCE-style RLHF for LLMs
    ShiQ (2025) + Self-Improving RPO — cutting-edge preference optimization
    Gaps
Current role at Earth Species Project is bioacoustics, not LLM post-training
    PW

    Pengcheng Wen

    medium hireability
    32
    RLHF / RLVR45
    Evals & Reward Models45
    LLM Creativity30
    Synthetic Data & Self-Play20
    Personality & Non-Verifiable Rewards20
    Strengths
    GRPO remote RM in align-anything — direct RLVR post-training implementation
    eval-anything benchmarks (MathVision, OlympiadBench) — evaluation infrastructure
    Gaps
    Junior MPhil student — engineering contributor, not lead researcher
    PY

    Pengcheng Yin

    medium hireability

    Researcher@DeepMind

    Previously: Researcher Intern @ Microsoft

    San Francisco, US

    28
    Evals & Reward Models60
    Synthetic Data & Self-Play58
    RLHF / RLVR15
    LLM Creativity3
    Personality & Non-Verifiable Rewards3
    Strengths
    14 LLM code gen papers; h-index 35 — recognized expert in NL-to-code
    Learn-by-interact (2025): synthetic agent trajectory data without human annotation
    Gaps
    No RLHF, RLVR, DPO, or reward modeling work in published record
    PD

    Pradeep Dasigi

    medium hireability

    Senior Research Scientist@Allen Institute for AI

    Previously: Research Scientist @ Allen Institute for AI

    Seattle, US

    51
    RLHF / RLVR85
    Evals & Reward Models75
    Synthetic Data & Self-Play50
    Personality & Non-Verifiable Rewards40
    LLM Creativity5
    Strengths
    Tulu 3 contributor — AllenAI's flagship open post-training pipeline
    'Generalizing Verifiable Instruction Following' — direct RLVR evidence
    Gaps
    No evident work on personality, humor, or creative writing post-training
    PZ

    Pu Zhao

    medium hireability

    Principal Researcher@Microsoft

    Previously: Researcher @ Microsoft

    Beijing, CN

    62
    Synthetic Data & Self-Play85
    RLHF / RLVR80
    Evals & Reward Models75
    Personality & Non-Verifiable Rewards40
    LLM Creativity30
    Strengths
    WizardMath: Reinforced Evol-Instruct (PPO-based RLVR for math reasoning)
    Self-Evolved Reward Learning for LLMs: reward model innovation
    Gaps
    No specific work on personality, humor, sarcasm, or non-verifiable subjective rewards

    Qingfeng Sun

    medium hireability

    Partner Engineering Manager@Microsoft

    Previously: Principal Dev Manager @ Microsoft

    Seattle, US

    60
    Synthetic Data & Self-Play: 90
    RLHF / RLVR: 85
    Evals & Reward Models: 65
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 20
    Strengths
    Evol-Instruct inventor — scalable synthetic instruction data generation
    RLEIF (WizardMath) — RLVR for reasoning, directly query-relevant
    Gaps
    No published work on personality/humor/non-verifiable reward modeling

    Qingwei Lin

    medium hireability

    Partner Researcher/Partner Research Manager@Microsoft

    Previously: Principal Researcher/Principal Research Manager @ Microsoft

    Beijing, CN

    53
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 80
    Evals & Reward Models: 68
    Personality & Non-Verifiable Rewards: 22
    LLM Creativity: 12
    Strengths
    WizardMath: Reinforced Evol-Instruct — 656 citations, RLVR for math post-training
    Arena Learning: self-play chatbot arena as post-training data flywheel
    Gaps
    No evidence of work on personality, humor, or subjective non-verifiable rewards

    Qipeng Guo

    medium hireability

    Young Research Scientist@Shanghai AI Laboratory

    Previously: Investment Manager @ Cowin Venture Capital

    Shanghai, CN

    53
    RLHF / RLVR: 85
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 75
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    InternLM2 co-author: large-scale LLM post-training (RLHF, SFT, DPO)
    IFDECORATOR: instruction-following RLVR with verifiable rewards (2025)
    Gaps
    Minimal consumer-facing personality work (humor, sarcasm, roleplay not evident)

    Quentin Gallouédec

    medium hireability

    Researcher@Hugging Face

    Previously: PhD student @ Ecole Centrale de Lyon

    59
    RLHF / RLVR: 97
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 68
    Personality & Non-Verifiable Rewards: 38
    LLM Creativity: 18
    Strengths
    #1 TRL contributor (720 commits) — owns DPO, GRPO, PPO, SFT trainers
    Open R1 co-author — end-to-end RLVR pipeline with GRPO at scale
    Gaps
    No public work on personality, tone, humor, or non-verifiable reward modeling

    Rafael Rafailov

    medium hireability

    Analyst@Goldman Sachs

    Previously: Equity Research Officer/Risk Manager @ Berkeley Investment Group

    New York, US

    76
    RLHF / RLVR: 100
    Evals & Reward Models: 90
    Personality & Non-Verifiable Rewards: 75
    LLM Creativity: 60
    Synthetic Data & Self-Play: 55
    Strengths
    DPO inventor — NeurIPS 2023 Outstanding Paper, 5,769 citations, single most-cited RLHF paper
    Generative Reward Models (2025) — novel reward model for post-training pipelines
    Gaps
    No explicit small-model (≤30B) deployment or efficiency-constrained post-training work

    Rajkumar Ramamurthy

    medium hireability

    Director of Engineering- Transmission Controls@Bosch

    Previously: Director-Simultaneous Engineering @ Automotive Steering Column LLC

    Auburn Hills, US

    57
    RLHF / RLVR: 88
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 45
    LLM Creativity: 15
    Strengths
    allenai/RL4LMs: 58 commits — primary author of RLHF-for-LMs library
    ICLR 2023 RL4LMs paper — RL policy optimization benchmarks for NLP
    Gaps
    No explicit work on personality adherence, humor, or creative writing rewards

    Rémi Munos

    medium hireability

    Researcher@Meta

    Previously: Research scientist @ DeepMind

    Villeneuve d'Ascq, FR

    58
    RLHF / RLVR: 97
    Personality & Non-Verifiable Rewards: 82
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 25
    LLM Creativity: 15
    Strengths
    "Beyond Verifiable Rewards" (2025) — RL on non-verifiable LLM data
    Nash RLHF (2023, 186 cites) — pioneered game-theoretic preference optimization
    Gaps
    No synthetic data / self-play data generation work found

    Rui Lu

    medium hireability
    34
    RLHF / RLVR: 70
    Evals & Reward Models: 55
    Synthetic Data & Self-Play: 35
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DeepDive (arXiv:2509.10446): first author, multi-turn RLVR for search agents
    SLIME contributor — merged PR to THUDM's RL post-training framework (GLM-4.5/5.x)
    Gaps
    No evidence of personality, tone, or non-verifiable reward work

    Rui Pan

    medium hireability
    21
    RLHF / RLVR: 55
    Evals & Reward Models: 40
    Synthetic Data & Self-Play: 5
    Personality & Non-Verifiable Rewards: 5
    LLM Creativity: 0
    Strengths
    22 merged PRs to PKU-Alignment/align-anything — core contributor
    Safe RLHF-V: Reward Model-V + Cost Model-V implementation (#203)
    Gaps
    Undergraduate student — no production-scale post-training experience

    Run Luo

    medium hireability

    MS student@University of Chinese Academy of Sciences

    CN

    70
    Synthetic Data & Self-Play: 88
    RLHF / RLVR: 82
    Evals & Reward Models: 75
    Personality & Non-Verifiable Rewards: 55
    LLM Creativity: 48
    Strengths
    GUI-R1: GRPO-based RLVR for GUI agents, first author (2025, 84 citations)
    MMEvol: Evol-Instruct synthetic instruction data for MLLMs, first author
    Gaps
    MS student — junior; limited production-scale deployment experience

    Runxin Xu

    medium hireability

    researcher@DeepSeek

    Previously: Quant researcher @ Metabit Trading

    Barcelona, ES

    56
    RLHF / RLVR: 97
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 15
    Strengths
    DeepSeek-R1 (GRPO) — 5348 citations, defining RLVR paper in field
    DeepSeekMath (2677 cites) — RLVR post-training for verifiable math reasoning
    Gaps
    No consumer post-training work (personality, creativity, humor, non-verifiable rewards)

    Sagnik Mukherjee

    medium hireability

    Graduate Research Assistant@University of Illinois Urbana-Champaign

    Previously: Research Intern @ Microsoft

    Champaign, US

    31
    RLHF / RLVR: 65
    Evals & Reward Models: 50
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 5
    Synthetic Data & Self-Play: 5
    Strengths
    NeurIPS 2025: RL sparsity in LLMs — mechanistic RL post-training analysis
    ICML 2025 PARC: CoT chain verification and error identification
    Gaps
    No synthetic data generation or self-play pipeline experience

    Sahil Chaudhary

    medium hireability
    29
    Synthetic Data & Self-Play: 85
    RLHF / RLVR: 30
    Evals & Reward Models: 18
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    glaiveai/reasoning-v1-20m: 22.2M sample reasoning SFT dataset
    glaiveai/l4: 1.28M function-calling training samples — direct tool-call alignment
    Gaps
    No evidence of RLHF/PPO/DPO/GRPO applied to preference optimization

    Sainbayar Sukhbaatar

    medium hireability

    Research Scientist@Meta

    Previously: Research Intern @ DeepMind

    US

    53
    RLHF / RLVR: 92
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 12
    Strengths
    Self-Rewarding LMs (548 citations) — RLHF via self-judging reward loop
    Meta-Rewarding (2025) — meta-judge reward model for self-improving alignment
    Gaps
    No work on personality/tone/humor or non-verifiable reward shaping for conversation

    Samuel R. Bowman

    medium hireability

    Member of Technical Staff@Anthropic

    Previously: Technical Advisor @ ASAPP

    San Francisco, US

    59
    Evals & Reward Models: 90
    RLHF / RLVR: 85
    Personality & Non-Verifiable Rewards: 82
    Synthetic Data & Self-Play: 25
    LLM Creativity: 12
    Strengths
    Constitutional AI (2022) — foundational RLAIF for non-verifiable reward training
    Pretraining LMs with Human Preferences — RLHF from scratch research
    Gaps
    No synthetic data generation or self-play / persona-conditioned data pipeline work

    Sergio Paniego Blanco

    medium hireability
    37
    RLHF / RLVR: 75
    Synthetic Data & Self-Play: 50
    Evals & Reward Models: 40
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Strengths
    110+ merged PRs on huggingface/trl — RLHF post-training library
    AsyncGRPO + GRPO examples on Qwen3 (same base model Neuralace uses)
    Gaps
    No personality/non-verifiable reward work — GRPO is RLVR only

    Shafiq Joty

    medium hireability

    Senior Research Director@Salesforce

    Previously: Research Director @ Salesforce

    San Francisco, US

    53
    Evals & Reward Models: 82
    RLHF / RLVR: 78
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 20
    Strengths
    Diffusion Model Alignment via DPO — 408 citations, strong preference optimization pedigree
    Direct Judgement Preference Optimization (2025) — current DPO research
    Gaps
    No personality, humor, or creative writing post-training work found

    Shaoguang Mao

    medium hireability

    Technical Staff@Moonshot AI

    Previously: Senior Research SDE @ Microsoft

    Beijing, CN

    38
    Evals & Reward Models: 55
    LLM Creativity: 48
    Personality & Non-Verifiable Rewards: 42
    Synthetic Data & Self-Play: 25
    RLHF / RLVR: 20
    Strengths
    Kimi K2 authorship (2025) — active Moonshot AI post-training team
    TaskMatrix.AI (229 citations) — tool-call & API orchestration expertise
    Gaps
    No published RLHF/RLVR/DPO/PPO papers — reward modeling depth unclear

    Shauna M Kravec

    medium hireability

    Member Of Technical Staff@Anthropic

    Previously: Machine Learning Engineer @ Clostra

    US

    69
    RLHF / RLVR: 95
    Personality & Non-Verifiable Rewards: 95
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 50
    LLM Creativity: 20
    Strengths
    Constitutional AI co-author — defining work on AI feedback and personality shaping
    RLHF paper (3,057 citations) — foundational post-training with human preferences
    Gaps
    No explicit RLVR/verifiable-reward work (code gen, tool-use, agentic RL)

    Sheng Shen

    medium hireability

    Member of Technical Staff@xAI

    Previously: Research Scientist @ Meta

    San Francisco, US

    45
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 68
    Evals & Reward Models: 65
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 8
    Strengths
    LLaVA-RLHF — factually augmented RLHF, pinned repo with code
    "Learning to Solve and Verify" (2025) — self-play for code/test gen
    Gaps
    No work on personality, humor, or consumer-side subjective RLHF

    Shiyi Cao

    medium hireability

    Ph.D. student@UC Berkeley EECS

    Previously: Researcher @ CMU

    San Francisco, US

    27
    RLHF / RLVR: 62
    Synthetic Data & Self-Play: 35
    Evals & Reward Models: 28
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Sky-T1: RLVR reasoning model (Qwen3-7B) trained for $450, EMNLP 2025
    SkyRL: modular full-stack RL library for long-horizon LLM agent training
    Gaps
    No work on consumer post-training: personality, humor, or non-verifiable rewards

    Shu Liu

    medium hireability

    PhD Student@University of California, Berkeley

    Previously: Research Intern @ Max Planck Institute for Software Systems

    Berkeley, US

    35
    RLHF / RLVR: 80
    Evals & Reward Models: 45
    Synthetic Data & Self-Play: 40
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Sky-T1 (163 citations) — GRPO RLVR o1-style reasoning training at $450
    NovaSky-AI/SkyRL — modular RL library built for LLM agentic workloads
    Gaps
    No consumer post-training work — personality, humor, or non-verifiable rewards

    Sourab Mangrulkar

    medium hireability
    14
    RLHF / RLVR: 30
    Synthetic Data & Self-Play: 20
    Evals & Reward Models: 8
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    huggingface/peft creator — 119+ PRs, LoRA/QLoRA/adapters widely used in post-training
    DPO trainer fix + FSDP+QLoRA enablement in TRL
    Gaps
    No reward modeling or RLHF algorithm research (PPO, GRPO, DPO methods — not just fixes)

    Sumanth R Hegde

    medium hireability
    34
    RLHF / RLVR: 82
    Evals & Reward Models: 48
    Synthetic Data & Self-Play: 25
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    Core SkyRL contributor — GRPO, SFT trainer, Megatron backend
    Chunked logprobs for Qwen 3.5 248k vocab — Qwen RL infra
    Gaps
    No non-verifiable reward work — personality, humor, creativity absent

    Suraj Subramanian

    medium hireability
    20
    RLHF / RLVR: 52
    Evals & Reward Models: 20
    Synthetic Data & Self-Play: 18
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    OpenEnv: agentic RL training framework, integrates with TRL/SkyRL/Unsloth (Dec 2025)
    LoRA vs FFT skills-transferability experiments — fine-tuning methodology research
    Gaps
    No evidence of reward modeling, DPO, or RLHF pipeline work

    Szymon Tworkowski

    medium hireability

    Member of Technical Staff@xAI

    Previously: Student Researcher @ DeepMind

    San Francisco, US

    38
    RLHF / RLVR: 92
    Evals & Reward Models: 52
    Synthetic Data & Self-Play: 38
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Led Grok 4.20 reasoning RL training algorithm and scaling at xAI
    100x RL scaling: foundational contributor to Grok 3 Reasoning stability
    Gaps
    No evidence of consumer post-training (personality, tone, non-verifiable rewards)

    Tianhao Wu

    medium hireability

    PhD student@UC Berkeley

    Previously: Algo Developer @ Hudson River Trading

    Berkeley, US

    66
    RLHF / RLVR: 92
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 72
    Synthetic Data & Self-Play: 52
    LLM Creativity: 28
    Strengths
    RouteLLM: LLM routing by query complexity — matches JD's small-model-as-frontend concept
    Starling-7B (RLAIF, 176 citations) — RLHF post-training for helpfulness/harmlessness
    Gaps
    No published work on personality adherence, humor, or creative writing specifically

    Tianyi Tang

    medium hireability

    Member of the Qwen Team@Alibaba

    Previously: AI Engineer @ Unisound AI

    Hangzhou, CN

    65
    Personality & Non-Verifiable Rewards: 85
    RLHF / RLVR: 70
    Evals & Reward Models: 62
    LLM Creativity: 58
    Synthetic Data & Self-Play: 50
    Strengths
    Co-author Qwen2.5 & Qwen3 — direct post-training production experience
    ICLR 2025: Neuron-based Personality Trait Induction — on-point for consumer alignment
    Gaps
    No standalone RLVR / verifiable-reward RL paper (PPO, DPO, GRPO)

    Tomek Korbak

    medium hireability

    Member of Technical Staff@OpenAI

    Previously: Senior Research Scientist @ AI Security Institute

    San Francisco, US

    62
    RLHF / RLVR: 98
    Personality & Non-Verifiable Rewards: 80
    Evals & Reward Models: 72
    LLM Creativity: 32
    Synthetic Data & Self-Play: 30
    Strengths
    'Open problems... RLHF' — 836 cites, co-authored survey on RLHF limitations
    PhD at Sussex focused entirely on RL from human feedback
    Gaps
    Only ~6 months at OpenAI — relatively fresh hire, may not be actively looking

    Tyler Griggs

    medium hireability
    31
    RLHF / RLVR: 78
    Synthetic Data & Self-Play: 30
    Evals & Reward Models: 28
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Strengths
    SkyRL-Agent paper: multi-turn agent RL training (2025)
    GRPO off-policy per-token masking commit to NovaSky-AI/SkyRL
    Gaps
    No work on personality, creativity, or non-verifiable reward modeling

    Valentina Pyatkin

    medium hireability

    Researcher@ETH AI Center

    Previously: Research Intern @ Allen Institute for AI

    Zurich, CH

    63
    Evals & Reward Models: 95
    RLHF / RLVR: 90
    Personality & Non-Verifiable Rewards: 60
    Synthetic Data & Self-Play: 55
    LLM Creativity: 15
    Strengths
    TÜLU 3: end-to-end open LLM post-training (PPO/DPO/RLVR) — 489 citations
    RewardBench (432 citations): standard benchmark for reward model eval
    Gaps
    No direct work on personality modeling, humor, or creativity generation

    Vardhan Dongre

    medium hireability

    Research Scientist Intern@Adobe

    Previously: AI/ML Research Software Engineer Intern @ Brunswick Corporation

    San Francisco, US

    23
    Personality & Non-Verifiable Rewards: 38
    Evals & Reward Models: 35
    Synthetic Data & Self-Play: 28
    RLHF / RLVR: 10
    LLM Creativity: 5
    Strengths
    Advised by Dilek Hakkani-Tür — leading conversational AI researcher at UIUC
    Drift No More? — multi-turn LLM context quality, key consumer post-training concern
    Gaps
    No evidence of RLHF, DPO, reward modeling, or post-training methodology

    Wangchunshu Zhou

    medium hireability

    Director of OPPO Personal AI Lab@OPPO

    Previously: Co-Founder & Chief Technology Officer @ AIWaves

    Hangzhou, CN

    66
    LLM Creativity: 88
    Personality & Non-Verifiable Rewards: 85
    Evals & Reward Models: 62
    Synthetic Data & Self-Play: 62
    RLHF / RLVR: 35
    Strengths
    RoleLLM (344 citations) — eliciting + benchmarking role-play in LLMs
    Weaver: foundation model specifically for creative writing (NeurIPS 2024)
    Gaps
    No explicit RLHF/PPO/DPO/GRPO training papers — mostly SFT-based

    Wanjun Zhong

    medium hireability

    Senior Research Scientist@ByteDance

    Previously: Research Scientist @ Huawei

    CN

    46
    RLHF / RLVR: 82
    Evals & Reward Models: 78
    Personality & Non-Verifiable Rewards: 35
    Synthetic Data & Self-Play: 25
    LLM Creativity: 12
    Strengths
    ReTool (2025): RL for strategic tool use — direct RLVR match
    OTC (2025): optimal tool calls via RL — second tool-RL paper
    Gaps
    No evidence of personality/non-verifiable reward post-training work

    Wenwei Zhang

    medium hireability

    Young Research Scientist@Shanghai Artificial Intelligence Laboratory

    Previously: PhD student @ Nanyang Technological University

    Singapore, SG

    47
    Evals & Reward Models: 92
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 30
    Personality & Non-Verifiable Rewards: 22
    LLM Creativity: 10
    Strengths
    "Exploring Limit of Outcome Reward" — RLVR for math reasoning (2025)
    InternLM-XComposer2.5-Reward — multimodal reward model (30 citations, 2025)
    Gaps
    No direct work on personality, humor, or non-verifiable reward shaping

    Wenxiang Hu

    medium hireability

    Senior Machine Learning Engineer@Microsoft

    Previously: Senior Research Software Engineer @ Microsoft

    Seattle, US

    32
    Synthetic Data & Self-Play: 80
    RLHF / RLVR: 30
    Evals & Reward Models: 20
    LLM Creativity: 15
    Personality & Non-Verifiable Rewards: 15
    Strengths
    WizardCoder (846 cites): Evol-Instruct synthetic data pipeline for code SFT
    EpiCoder (ICML 2025): feature tree-based synthesis — controllable complexity & diversity
    Gaps
    No explicit RLHF/DPO/PPO/GRPO post-training work — focus is SFT data generation

    Xiangxin Zhou

    medium hireability

    RedStar Intern@Xiaohongshu Hi Lab

    Previously: Associate Member @ Sea AI Lab

    31
    RLHF / RLVR: 68
    Personality & Non-Verifiable Rewards: 55
    Evals & Reward Models: 15
    Synthetic Data & Self-Play: 12
    LLM Creativity: 5
    Strengths
    VeriFree (ICLR 2026): RL for reasoning without verifiers — non-VR rewards
    Variational Reasoning for LMs (ICLR 2026): second LLM RL paper same cycle
    Gaps
    No explicit work on personality, humor, sarcasm, or creative writing

    Xin Cong

    medium hireability
    39
    RLHF / RLVR: 72
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 38
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    AgentCPM-Explore: RL post-training for 4B model with reward signal denoising (Feb 2026)
    AgentRM: reward modeling (explicit/implicit/LLM-as-judge) for agent policy guidance (Feb 2025)
    Gaps
    No consumer post-training work — personality, humor, sarcasm, creativity all absent

    Xinting Huang

    medium hireability

    Senior Researcher@Tencent

    Previously: Research Engineer Intern @ ByteDance

    Shenzhen, CN

    32
    Synthetic Data & Self-Play: 72
    RLHF / RLVR: 45
    Evals & Reward Models: 30
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    Explore-Instruct (EMNLP 2023): domain-specific instruction data via active exploration
    TeG-Instruct (2024): text-grounded premium instruction-tuning data pipeline
    Gaps
    No evidence on personality, humor, creativity, or roleplay post-training

    Xuechen Li

    medium hireability

    Member of Technical Staff@xAI

    Previously: Member of Technical Staff @ xAI

    San Francisco, US

    53
    RLHF / RLVR: 85
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 50
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 10
    Strengths
    AlpacaFarm (NeurIPS 2023): RLHF simulation — seminal preference-learning work
    19-commit lead on stanford_alpaca — core instruction-tuning contributor
    Gaps
    No direct work on personality / non-verifiable reward modeling

    Xuyao Wang

    medium hireability
    30
    RLHF / RLVR: 55
    Evals & Reward Models: 55
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 10
    LLM Creativity: 5
    Strengths
    PPO, DPO, SFT pipelines implemented across multiple model families in align-anything
    Qwen3/Qwen3MoE post-training support — directly relevant to Neuralace's Qwen work
    Gaps
    No published research papers — engineering contributor only, not research lead

    Yang An (An Yang)

    medium hireability
    79
    RLHF / RLVR: 95
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 80
    Synthetic Data & Self-Play: 72
    LLM Creativity: 62
    Strengths
    Group Sequence Policy Optimization — novel RL algorithm for LLM post-training (2025)
    WorldPM: Scaling Human Preference Modeling — reward model for subjective preferences
    Gaps
    Deeply embedded at Alibaba Qwen team — competitive to recruit

    Yann Dubois

    medium hireability

    Member of Technical Staff@OpenAI

    Previously: Research Assistant @ Vector Institute

    San Francisco, US

    72
    Evals & Reward Models: 92
    RLHF / RLVR: 88
    Synthetic Data & Self-Play: 78
    Personality & Non-Verifiable Rewards: 72
    LLM Creativity: 30
    Strengths
    AlpacaFarm (686 cit.) — RLHF simulation with LLM-as-judge for non-verifiable rewards
    AlpacaEval (824 cit.) — industry-standard instruction-following eval framework
    Gaps
    No direct work on creative writing, roleplay, or personality adherence training

    Yan Wang

    medium hireability

    Principal Researcher@Tencent

    Previously: Research Scientist @ miHoYo

    41
    Personality & Non-Verifiable Rewards: 72
    LLM Creativity: 55
    Evals & Reward Models: 38
    Synthetic Data & Self-Play: 22
    RLHF / RLVR: 18
    Strengths
    'Harry Potter' character alignment — LLMs aligned to personality (EMNLP 2023)
    'Generate, Delete and Rewrite' — persona consistency in dialogue (ACL 2020)
    Gaps
    No RLHF/PPO/DPO/GRPO post-training pipeline evidence

    Yaowei Zheng

    medium hireability
    57
    RLHF / RLVR: 95
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 70
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 15
    Strengths
    LLaMA-Factory (ACL 2024, 1,205 citations) — PPO/DPO/GRPO, 2,186 commits
    EasyR1: GRPO/DAPO/RLOO RL training framework, 4.9K stars, multi-modal
    Gaps
    No dedicated research on personality fine-tuning or non-verifiable reward modeling

    Yaxi Lu

    medium hireability

    Eng.D. student@Tsinghua University

    Beijing, CN

    33
    Evals & Reward Models: 75
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 10
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    AgentRM (ACL 2025): reward model for agent generalization
    Reflective Reinforcement Tool Learning (2026): RLVR for tool use
    Gaps
    No consumer post-training work (personality, creativity, humor, roleplay)

    Yining Ye

    medium hireability

    Master's student@THUNLP Lab, Tsinghua University

    Previously: Topseed Intern @ Bytedance

    Beijing, CN

    35
    RLHF / RLVR: 65
    Evals & Reward Models: 62
    Synthetic Data & Self-Play: 38
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    ToolLLM (ICLR 2024 Spotlight, 1005 citations) — tool calling training platform
    UI-TARS-2 multi-turn RL for GUI/computer-use agents
    Gaps
    No consumer post-training (personality, humor, emotional understanding)

    Yixuan Su

    medium hireability

    Modelling Lead, Agentic Reasoning@Cohere

    Previously: Research Scientist @ Cohere

    London, GB

    45
    Evals & Reward Models: 65
    LLM Creativity: 60
    RLHF / RLVR: 45
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 25
    Strengths
    Modelling Lead on Command-A-Reasoning — hands-on post-training for reasoning
    "Replacing Judges with Juries" (2024) — multi-model eval framework, 164 cites
    Gaps
    No published RLHF/RLVR papers — post-training expertise inferred from role only

    Yongbin Li

    medium hireability

    Principal Research Scientist@Alibaba

    77
    RLHF / RLVR: 85
    Personality & Non-Verifiable Rewards: 82
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 70
    LLM Creativity: 68
    Strengths
    'Preference Ranking Optimization for Human Alignment' — 327 citations, core RLHF work
    'CPO: Reward Ambiguity in Role-playing Dialogue' (2025) — non-verifiable reward for roleplay
    Gaps
    RLVR on code/math (Knowledge Work axis) weaker than consumer post-training axis

    Younes Belkada

    medium hireability

    MS student@ENS Paris Saclay

    Previously: Researcher @ Technology Innovation Institute

    Paris, FR

    46
    RLHF / RLVR: 92
    Evals & Reward Models: 55
    Personality & Non-Verifiable Rewards: 38
    Synthetic Data & Self-Play: 35
    LLM Creativity: 12
    Strengths
    241 commits to huggingface/trl — PPO/DPO/GRPO RLHF library core contributor
    Zephyr paper (779 cites): direct LM alignment distillation via DPO
    Gaps
    No specific work on personality, humor, or creativity-focused reward modeling

    Yujia Qin

    medium hireability

    Seed@ByteDance

    Previously: Founder @ SeqAI Inc.

    49
    RLHF / RLVR: 78
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 5
    Strengths
    ToolLLM — 956 citations, foundational LLM tool-use training (ICLR 2024 spotlight)
    ReTool (2025) — RL for strategic tool use, direct RLVR post-training evidence
    Gaps
    No work on personality/creativity/humor post-training (Consumer direction weak)

    Yujia Qin

    medium hireability
    35
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 55
    RLHF / RLVR: 40
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    ToolLLM (ICLR 2024 Spotlight) — 16k API tool-use training at scale
    ToolBench — open eval platform for tool learning, directly maps to Evals axis
    Gaps
    No work on consumer post-training: personality, humor, sarcasm, or non-VR rewards

    Yunshui Li

    medium hireability

    Researcher@ByteDance

    Previously: MS student @ University of the Chinese Academy of Sciences

    49
    Synthetic Data & Self-Play: 80
    RLHF / RLVR: 70
    Evals & Reward Models: 40
    LLM Creativity: 30
    Personality & Non-Verifiable Rewards: 25
    Strengths
    Seed1.5-Thinking co-author — ByteDance RLVR reasoning post-training (2025)
    NUGGETS: instruction data prospector for SFT quality filtering (ACL 2024)
    Gaps
    No direct RLAIF/constitutional AI or personality reward modeling work

    Yuqing Du

    medium hireability

    Research Scientist@DeepMind

    Previously: Visiting Researcher @ Meta

    San Francisco, US

    55
    RLHF / RLVR: 88
    Evals & Reward Models: 72
    Personality & Non-Verifiable Rewards: 65
    Synthetic Data & Self-Play: 30
    LLM Creativity: 22
    Strengths
    DPOK (NeurIPS 2023, 419 citations): RLHF/RLVR applied to generative model fine-tuning
    Aligning T2I with Human Feedback (410 citations): preference reward modeling
    Gaps
    No explicit work on LLM personality, humor, or creativity alignment

    Yuxuan Zhang

    medium hireability

    PhD student@University of Liverpool

    Previously: Undergrad student @ Xi'an Jiaotong-Liverpool University

    Liverpool, GB

    30
    RLHF / RLVR: 65
    Evals & Reward Models: 25
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 20
    LLM Creativity: 15
    Strengths
    GLM-4.1V-Thinking: RLCS (RL with curriculum sampling) for multimodal reasoning
    GLM-4.5 contributor — agentic, tool use, coding post-training at scale
    Gaps
    Contributor role on large teams — ownership of RL/post-training components unclear

    Zhenyu Li

    medium hireability

    PhD student@Tsinghua University

    Beijing, CN

    53
    RLHF / RLVR: 78
    Synthetic Data & Self-Play: 78
    Evals & Reward Models: 65
    Personality & Non-Verifiable Rewards: 32
    LLM Creativity: 14
    Strengths
    Doubao Super Mode: led RL pipeline for agentic tool/search/code capabilities
    Agent-World: self-evolving training arena, 14B beats DeepSeek-V3-685B on BFCL-V4
    Gaps
    No explicit work on personality, humor, or creative writing post-training

    Zhiheng Xi

    medium hireability

    Senior Staff Machine Learning Engineer@Apple

    Previously: Staff Software Engineer (ASE) @ Apple

    Seattle, US

    61
    RLHF / RLVR: 95
    Evals & Reward Models: 90
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 20
    Strengths
    "Delve into PPO" (278 cit.) — seminal RLHF PPO implementation paper
    "Secrets of RLHF Part II: Reward Modeling" (192 cit.) — reward model depth
    Gaps
    No work on consumer personality, humor, or creative writing

    Zhihong Shao

    medium hireability

    Member of Technical Staff@DeepSeek

    Previously: Research Intern @ Microsoft

    Beijing, CN

    58
    RLHF / RLVR: 97
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 78
    LLM Creativity: 18
    Personality & Non-Verifiable Rewards: 12
    Strengths
    Invented GRPO — the dominant RLVR algorithm for LLM post-training
    DeepSeek-R1 key author — RL scaling for complex reasoning
    Gaps
    No public work on personality adherence, humor, or non-verifiable reward modeling

    Zhu Qihao

    medium hireability
    43
    RLHF / RLVR: 95
    Evals & Reward Models: 60
    Synthetic Data & Self-Play: 50
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DeepSeek-R1 co-author — GRPO/RLVR for reasoning, landmark 2025 paper
    DeepSeek-Prover-V1.5 & V2 — RLPAF + MCTS self-play for theorem proving
    Gaps
    No evidence of consumer post-training (personality, humor, non-VR rewards)

    Zihan Wang

    medium hireability

    MS student@Tsinghua University

    CN

    41
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 70
    Evals & Reward Models: 45
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 8
    Strengths
    RLOJF: direct RLVR with online judge verifiable reward
    SciInstruct: self-reflective annotation pipeline (NeurIPS 2024)
    Gaps
    No work on personality, emotions, humor, or non-verifiable reward modeling

    阿丹 (Adan)

    medium hireability
    39
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 50
    RLHF / RLVR: 45
    LLM Creativity: 25
    Evals & Reward Models: 15
    Strengths
    AutoPlan toolcall finetuning — loss masking on Observation tokens
    dpo_trainer_new — DPO with SFT cross-entropy to prevent catastrophic forgetting
    Gaps
    No formal RL research or reward modeling publications

    Abbas Abdolmaleki

    low hireability

    Research Scientist@Google

    Portugal

    40
    RLHF / RLVR: 82
    Evals & Reward Models: 52
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 25
    LLM Creativity: 10
    Strengths
    'Preference optimization as probabilistic inference' — direct RLHF theory contribution
    MPO/V-MPO inventor — KL-constrained RL foundation for preference optimization
    Gaps
    Core background is robotic control RL, not LLM personality or creativity post-training

    Adam Roberts

    low hireability

    Director of Research@DeepMind

    Previously: Senior Staff Software Engineer @ DeepMind

    San Francisco, US

    51
    Synthetic Data & Self-Play: 80
    LLM Creativity: 72
    Evals & Reward Models: 55
    RLHF / RLVR: 30
    Personality & Non-Verifiable Rewards: 20
    Strengths
    FLAN-v2: instruction fine-tuning at scale — 5.3K citations
    Flan Collection: data design methods for SFT — directly maps to synthetic data pipeline
    Gaps
    Instruction tuning is SFT-centric — limited RLHF/PPO/DPO/reward modeling work

    Adam Santoro

    low hireability

    DeepMind

    37
    Evals & Reward Models: 80
    RLHF / RLVR: 60
    LLM Creativity: 20
    Synthetic Data & Self-Play: 15
    Personality & Non-Verifiable Rewards: 10
    Strengths
    Representation geometry paper (2025) directly compares SFT, DPO, RLVR post-training dynamics
    BIG-bench framework — major LLM evaluation benchmark (2016 citations)
    Gaps
    No evidence of reward modeling or RLHF pipeline implementation

    Adam Tauman Kalai

    low hireability

    Research Scientist@OpenAI

    Previously: Senior Principal Researcher @ Microsoft

    42
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 45
    LLM Creativity: 25
    RLHF / RLVR: 20
    Strengths
    'Using LLMs to Simulate Multiple Humans' (778 citations) — persona data gen
    OpenAI o1 system card contributor — RL-based training exposure
    Gaps
    No direct RLHF/PPO/DPO/GRPO training work published

    Adam X. Yang

    low hireability

    PhD student@Mistral AI

    Bristol, GB

    45
    RLHF / RLVR: 88
    Evals & Reward Models: 80
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 18
    LLM Creativity: 10
    Strengths
    Bayesian Reward Models for LLM Alignment — reward overoptimization & RLHF (29 citations)
    SparsePO (ICLR 2025) — preference alignment via sparse token weighting
    Gaps
    No evidence of personality/tone/creativity reward modeling (consumer post-training gap)

    Addie Foote

    low hireability

    Research Scholar@ML Alignment & Theory Scholars

    Previously: Research Scholar @ ML Alignment & Theory Scholars

    San Francisco, US

    17
    RLHF / RLVR: 30
    Synthetic Data & Self-Play: 20
    Evals & Reward Models: 15
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Strengths
    Trellis: 50x faster LoRA fine-tuning on 1T Kimi K2 Thinking MoE (March 2026)
    Expert parallelism + INT4 dequant on 8xH200 — hands-on distributed post-training stack
    Gaps
    Very early career — h-index 2, UT Austin 2024 undergrad

    Afra Amini

    low hireability

    Research Scientist@DeepMind

    Previously: Research Intern @ Ai2

    CH

    42
    RLHF / RLVR: 90
    Evals & Reward Models: 60
    Synthetic Data & Self-Play: 30
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    ODPO (ACL 2024, 102 citations) — direct preference optimization innovation
    NeurIPS 2025: KL divergence for RLHF — reward model quality signal
    Gaps
    No work on personality adherence, humor, or non-verifiable reward shaping

    Ahmed Hassan Awadallah

    low hireability

    Partner Research Manager@Microsoft

    59
    RLHF / RLVR: 90
    Synthetic Data & Self-Play: 88
    Evals & Reward Models: 80
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 10
    Strengths
    Hybrid LLM (2024): small-to-large model routing — exact Neuralace vision
    Orca/Orca-2: GPT-4 trace synthetic data pipeline for SLM post-training
    Gaps
    No published work on personality, tone, humor, or non-verifiable subjective rewards

    Ahmet Üstün

    low hireability

    Code Agents Lead@Cohere

    Previously: Senior Research Scientist @ Cohere

    Groningen, NL

    46
    RLHF / RLVR: 85
    Evals & Reward Models: 60
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    "Back to Basics: REINFORCE for LHF" (2024, 481 citations) — core RLHF post-training work
    "RLHF Can Speak Many Languages" — multilingual preference optimization
    Gaps
    No evidence of personality/creativity reward modeling or non-verifiable reward design

    Akshay Krishnamurthy

    low hireability

    Senior Principal Research Manager@Microsoft

    Previously: Principal Researcher @ Microsoft

    New York, US

    39
    RLHF / RLVR: 92
    Evals & Reward Models: 55
    Synthetic Data & Self-Play: 32
    Personality & Non-Verifiable Rewards: 10
    LLM Creativity: 5
    Strengths
    XPO (ICLR 2025): provably efficient exploration in RLHF
    Chi-Squared Preference Opt (ICLR 2025 spotlight): direct alignment sans overoptimization
    Gaps
    No work on personality, tone, humor, or subjective reward modeling

    Albert Villanova del Moral

    low hireability
    32
    RLHF / RLVR: 75
    Evals & Reward Models: 55
    Personality & Non-Verifiable Rewards: 15
    Synthetic Data & Self-Play: 10
    LLM Creativity: 5
    Strengths
    387 merged PRs on huggingface/trl — core DPO/KTO/Reward trainer maintainer
    Commits landing today (May 7 2026): KTO/DPO alignment, RewardTrainer fixes
    Gaps
    No published papers on RLHF, preference learning, or post-training

    Alec Koppel

    low hireability

    Senior Professional Staff@Johns Hopkins Applied Physics Laboratory

    Previously: Research Lead/Vice President @ JPMorgan Chase

    Laurel, US

    46
    RLHF / RLVR: 90
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 35
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 5
    Strengths
    MaxMin-RLHF (2024, 88 cites) — diverse-preference alignment directly on-target
    PARL (2024, 37 cites) — unified RLHF policy alignment framework
    Gaps
    No work on personality, humor, or non-verifiable creative reward modeling

    Alexander Havrilla

    low hireability

    research scientist@DeepMind

    Previously: PhD student @ Georgia Institute of Technology

    London, GB

    78
    RLHF / RLVR: 92
    LLM Creativity: 76
    Synthetic Data & Self-Play: 76
    Personality & Non-Verifiable Rewards: 74
    Evals & Reward Models: 72
    Strengths
    trlX: large-scale RLHF framework, EMNLP 2023, CarperAI co-founder
    "Teaching LLMs to Reason with RL" — 130-citation RLVR paper
    Gaps
    No public tool-call or agentic LLM work found

    Alexander Wettig

    low hireability

    Research Scientist@Cursor

    Previously: Software Engineering Intern @ Google

    San Francisco, US

    33
    Evals & Reward Models: 78
    Synthetic Data & Self-Play: 72
    RLHF / RLVR: 15
    LLM Creativity: 0
    Personality & Non-Verifiable Rewards: 0
    Strengths
    SWE-bench (ICLR 2024 Oral, 1195 cites) — top eval for code agents
    SWE-smith (2025): scales SFT data generation for agentic code tasks
    Gaps
    No RLHF/RLVR or preference optimization work

    Alexandre Ramé

    low hireability

    Research Scientist@DeepMind

    Previously: Research Scientist Intern @ Meta

    Paris, FR

    52
    RLHF / RLVR: 90
    Evals & Reward Models: 87
    Personality & Non-Verifiable Rewards: 52
    Synthetic Data & Self-Play: 18
    LLM Creativity: 12
    Strengths
    WARM (103 citations) — reward model robustness via weight averaging
    Rewarded Soups (207 citations) — Pareto-optimal multi-reward alignment
    Gaps
    No synthetic data or self-play pipeline work found

    Alex Beutel

    low hireability

    Member of Technical Staff, Research Scientist@OpenAI

    Previously: Senior Staff Research Scientist @ Google

    New York, US

    54
    RLHF / RLVR: 82
    Evals & Reward Models: 78
    Personality & Non-Verifiable Rewards: 72
    Synthetic Data & Self-Play: 22
    LLM Creativity: 18
    Strengths
    OpenAI o1 post-training author — reasoning model safety training
    'Instruction Hierarchy' trains LLMs on privileged instruction priority (ICLR 2025)
    Gaps
    No explicit work on synthetic conversation data generation or self-play pipelines

    Alexey Bukhtiyarov

    low hireability
    30
    Personality & Non-Verifiable Rewards: 65
    RLHF / RLVR: 35
    LLM Creativity: 30
    Evals & Reward Models: 15
    Synthetic Data & Self-Play: 5
    Strengths
    NLP Team Lead at Ex-Human — consumer personality/character AI company
    Slingshot AI (Ash) — foundational LLM for therapy, emotion & intent understanding
    Gaps
    No published research — applied practitioner, not researcher

    Alex Tamkin

    low hireability

    Member of Technical Staff@Anthropic

    Previously: PhD @ Stanford University

    San Francisco, US

    53
    Personality & Non-Verifiable Rewards: 85
    RLHF / RLVR: 65
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 30
    LLM Creativity: 20
    Strengths
    Collective Constitutional AI — RLAIF with broad human preference input (124 citations)
    Eliciting Human Preferences with LLMs — preference reward modeling at NeurIPS 2024
    Gaps
    No clear RLVR / tool-use or code-gen work — knowledge-work axis weak

    Alon Albalak

    low hireability

    Research Scientist, Open-Endedness@Lila Sciences

    Previously: Data Team Lead, Member of Technical Staff @ SynthLabs

    San Francisco, US

    52
    Synthetic Data & Self-Play: 88
    RLHF / RLVR: 72
    Evals & Reward Models: 68
    Personality & Non-Verifiable Rewards: 18
    LLM Creativity: 12
    Strengths
    Generative Reward Models (2025, 92 cit) — direct reward modeling work
    Big-math RL dataset @ SynthLabsAI — RLVR data pipeline at scale
    Gaps
    Minimal consumer post-training work — no personality, creativity, or RLAIF evidence

    Amanda Askell

    low hireability

    Member Of Technical Staff@Anthropic

    Previously: Research Scientist (Policy) @ OpenAI

    San Francisco, US

    72
    Personality & Non-Verifiable Rewards: 100
    RLHF / RLVR: 98
    Evals & Reward Models: 85
    LLM Creativity: 40
    Synthetic Data & Self-Play: 35
    Strengths
    Constitutional AI paper — canonical RLAIF / non-VR reward design (2.6K citations)
    InstructGPT co-author — PPO-based RLHF at scale (21K citations)
    Gaps
    No direct evidence of synthetic data / self-play pipelines

    Amjad Almahairi

    low hireability

    Staff Research Scientist@Google

    Previously: Research Scientist @ Anyscale

    San Francisco, US

    38
    RLHF / RLVR: 80
    Evals & Reward Models: 45
    Personality & Non-Verifiable Rewards: 35
    Synthetic Data & Self-Play: 20
    LLM Creativity: 10
    Strengths
    RouteLLM (ICLR 2025) — preference-data LLM routing, exact match to JD tool-call vision
    LLaMA 2 RLHF post-training — core contributor at Meta AI
    Gaps
    No dedicated personality/creativity/non-verifiable reward work evidenced

    Andrea Madotto

    low hireability

    Research Scientist@Meta

    Previously: PhD @ The Hong Kong University of Science and Technology

    San Francisco, US

    46
    Personality & Non-Verifiable Rewards: 80
    LLM Creativity: 55
    Evals & Reward Models: 45
    RLHF / RLVR: 30
    Synthetic Data & Self-Play: 20
    Strengths
    PPLM (1257 citations) — foundational controllable personality/style generation work
    MoEL (310 citations) — empathy and emotion modeling in dialogue
    Gaps
    No evidence of modern RLHF at scale (PPO, DPO, GRPO on large LLMs)

    Angelica Chen

    low hireability

    Senior Research Scientist@DeepMind

    Previously: Doctoral Student @ New York University

    New York, US

    49
    RLHF / RLVR: 85
    Evals & Reward Models: 75
    Personality & Non-Verifiable Rewards: 60
    Synthetic Data & Self-Play: 15
    LLM Creativity: 10
    Strengths
    "Pretraining Language Models with Human Preferences" (ICML 2023, 263 citations)
    "Preference Learning Algorithms Do Not Learn Preference Rankings" (NeurIPS 2024)
    Gaps
    No work on personality, humor, or creative writing post-training

    Anirudh Goyal

    low hireability

    Researcher@DeepMind

    Previously: PhD student @ University of Montreal

    London, GB

    64
    LLM Creativity: 85
    Evals & Reward Models: 75
    RLHF / RLVR: 70
    Synthetic Data & Self-Play: 45
    Personality & Non-Verifiable Rewards: 45
    Strengths
    HypoSpace (2025): LLM creativity eval as set-valued hypothesis generators
    MCTS + iterative preference learning (2024, 170 cit.) — DPO-style preference optimization
    Gaps
    No evidence of tool-call or agentic post-training work

    Asli Celikyilmaz

    low hireability

    Research Manager@Meta

    Previously: Senior Principal Researcher @ Microsoft

    Seattle, US

    77
    Evals & Reward Models: 84
    RLHF / RLVR: 80
    Personality & Non-Verifiable Rewards: 75
    Synthetic Data & Self-Play: 73
    LLM Creativity: 72
    Strengths
    RLCD (2024): RL from contrastive distillation for LM alignment, 60 cit
    PrefPalette (2025): personalized preference modeling with latent attributes
    Gaps
    No explicit GRPO or RLVR on verifiable tasks (code, tool execution confirmed)

    Aston Zhang

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Research Scientist @ Meta

    San Francisco, US

    47
    RLHF / RLVR: 75
    Evals & Reward Models: 70
    Synthetic Data & Self-Play: 45
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 15
    Strengths
    Self-generated critiques boost reward modeling (2025) — direct reward model paper
    Systematic Examination of Preference Learning (2025) — RLHF/DPO methodology
    Gaps
    No explicit published work on personality/creativity/non-verifiable reward design

    Banghua Zhu

    low hireability

    Principal Researcher@NVIDIA

    66
    Evals & Reward Models: 95
    RLHF / RLVR: 90
    Personality & Non-Verifiable Rewards: 82
    Synthetic Data & Self-Play: 40
    LLM Creativity: 25
    Strengths
    Chatbot Arena (1063 citations) — human preference LLM evaluation pioneer
    Starling-7B RLAIF — direct experience training with AI feedback for helpfulness/harmlessness
    Gaps
    Primarily evaluation/reward-model focused; limited consumer personality or creative writing training work

    Barret Zoph

    low hireability

    Something New@OpenAI

    Previously: CTO, Co-Founder @ Thinking Machines

    San Francisco, US

    74
    RLHF / RLVR: 95
    Evals & Reward Models: 90
    Personality & Non-Verifiable Rewards: 75
    Synthetic Data & Self-Play: 72
    LLM Creativity: 40
    Strengths
    VP Research Post-Training at OpenAI — ran ChatGPT, GPT-4, o1 post-training
    FLAN (scaling instruction-finetuned models) — 4,896 citations, canonical IT work
    Gaps
    Just restarted at OpenAI as IC ~4 months ago — low near-term hireability

    Behnam Hedayatnia

    low hireability

    Senior Machine Learning Engineer@Apple

    Previously: Senior Research Scientist @ Amazon

    San Francisco, US

    44
    Personality & Non-Verifiable Rewards: 75
    Evals & Reward Models: 65
    LLM Creativity: 30
    Synthetic Data & Self-Play: 30
    RLHF / RLVR: 20
    Strengths
    DialGuide (2023): behavior alignment via natural-language guidelines — personality adherence
    7 years at Amazon Alexa Prize — conversation quality, emotion, engagement research
    Gaps
    No direct RLHF/PPO/DPO/GRPO post-training papers

    Bei Chen

    low hireability

    Senior Researcher@Microsoft

    Previously: Intern @ Alibaba

    Beijing, CN

    28
    Evals & Reward Models: 55
    RLHF / RLVR: 40
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 15
    LLM Creativity: 5
    Strengths
    CodeT (506 cites): test execution as verifiable reward for code gen
    Step-Aware Verifier (358 cites): process reward model for LLM reasoning
    Gaps
    No direct PPO/DPO/GRPO post-training work on LLMs

    Bilal Piot

    low hireability

    Research scientist@DeepMind

    Previously: ATER @ Université Lille3

    London, GB

    51
    RLHF / RLVR: 97
    Evals & Reward Models: 88
    Personality & Non-Verifiable Rewards: 35
    Synthetic Data & Self-Play: 28
    LLM Creativity: 5
    Strengths
    Nash RLHF (2024, 216 cit) — novel game-theoretic RLHF framework
    General paradigm for learning from preferences (2024, 835 cit) — foundational theory
    Gaps
    No personality/creativity/humor-focused reward modeling work

    Binxing Jiao

    low hireability

    VP@StepFun

    Previously: Principal Software Engineering Manager @ Microsoft

    Beijing, CN

    52
    RLHF / RLVR: 88
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 12
    Strengths
    ROVER (2025): novel RLVR algorithm, +8.2 pass@1 over existing methods
    Step 3.5 Flash: scalable RL combining verifiable + preference signals at scale
    Gaps
    No evidence of personality, humor, or creativity-focused reward modeling

    Bowen Baker

    low hireability

    Research Scientist@OpenAI

    Previously: Research Scientist Intern @ OpenAI

    Nevada City, US

    41
    Evals & Reward Models: 82
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 42
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    "Let's Verify Step by Step" — foundational PRM paper (1898 citations, 2023)
    OpenAI o1 System Card contributor — RLVR reasoning model training
    Gaps
    Zero work on consumer post-training — personality, tone, creativity, non-verifiable rewards

    Bowen Jin

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Research Intern @ Apple

    San Francisco, US

    37
    RLHF / RLVR: 88
    Evals & Reward Models: 68
    Synthetic Data & Self-Play: 20
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Search-R1: scalable RLVR framework for reasoning + search tool calling
    Rm-r1: reward modeling framed as reasoning — novel reward model approach
    Gaps
    No work on personality, tone, humor, or non-verifiable reward modeling

    Boyuan Chen

    low hireability

    Undergraduate Student in Artificial Intelligence@Peking University

    Previously: Research Intern @ Peking University

    Beijing, CN

    49
    RLHF / RLVR: 90
    Evals & Reward Models: 70
    Personality & Non-Verifiable Rewards: 50
    Synthetic Data & Self-Play: 25
    LLM Creativity: 10
    Strengths
    BeaverTails (665 citations) — built RLHF preference dataset from scratch
    PKU-SafeRLHF: ACL 2025 Best Paper on multi-level safety alignment
    Gaps
    No explicit work on personality/tone/humor or consumer conversation modeling

    Boyuan Zheng

    low hireability

    Member of Technical Staff@xAI

    Previously: Research Intern @ Allen Institute for AI

    San Francisco, US

    40
    RLHF / RLVR: 65
    Synthetic Data & Self-Play: 60
    Evals & Reward Models: 55
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Strengths
    AI2 RLVR intern — open-instruct/Tulu RLVR post-training pipeline
    MTS at xAI (Dec 2025) — post-training/LLM team
    Gaps
    No published work on personality/non-verifiable reward modeling

    Caiming Xiong

    low hireability

    SVP, AI Research & Applied Research@Salesforce

    Previously: VP of AI Research & Applied AI @ Salesforce

    San Francisco, US

    58
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 85
    RLHF / RLVR: 75
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 10
    Strengths
    Authored LLM post-training survey (2025) — field authority
    APIGen (91 citations) — verifiable function-calling dataset pipeline
    Gaps
    No work on personality/emotion/humor/sarcasm post-training

    Carlos E. Jimenez

    low hireability

    Researcher@Anthropic

    Previously: Teaching Assistant @ University of Utah

    San Francisco, US

    45
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 70
    Personality & Non-Verifiable Rewards: 35
    RLHF / RLVR: 30
    LLM Creativity: 5
    Strengths
    SWE-bench (540 cit) — gold-standard agentic code eval framework
    SWE-smith — synthetic data scaling for agent training (37 cit)
    Gaps
    No RLHF/PPO/DPO or reward modeling work published

    Carlos Miguel Patiño

    low hireability

    Research Engineer@Hugging Face

    Previously: Staff Machine Learning Engineer @ Factored

    NL

    26
    RLHF / RLVR: 60
    Synthetic Data & Self-Play: 40
    Evals & Reward Models: 20
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DistillationTrainer (TRL PR #5407) — on-policy KD with external teacher server
    GOLD trainer buffered rollouts — GRPO-style on-policy generation pipeline
    Gaps
    No personality, humor, or non-verifiable reward work

    cat-state

    low hireability
    33
    RLHF / RLVR: 75
    Evals & Reward Models: 45
    Synthetic Data & Self-Play: 25
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Strengths
    NeMo PPO trainer (trlx) — 1,500+ lines, scales to 20B NeMo Megatron
    PrimeIntellect/verifiers 2025 — merged PRs on RLVR rollout infra
    Gaps
    No work on personality, humor, or non-verifiable subjective rewards

    Chao Jia

    low hireability

    Senior Staff Researcher, GenAI Unit@DeepMind

    Previously: Principal Research Scientist @ AIML @ Apple

    San Francisco, US

    43
    RLHF / RLVR: 70
    Evals & Reward Models: 50
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 35
    Synthetic Data & Self-Play: 20
    Strengths
    Role title 'Gemini Multimodal Post-Training' — exact query match
    Gemini 2.5 co-author (2025) — frontier post-training at scale
    Gaps
    Only 7 months at DeepMind — very low likelihood of near-term departure

    Chenghao Yang

    low hireability

    Graduate Research Assistant@University of Chicago

    Previously: Student Researcher @ Google

    Chicago, US

    49
    RLHF / RLVR: 82
    Evals & Reward Models: 62
    Personality & Non-Verifiable Rewards: 55
    LLM Creativity: 30
    Synthetic Data & Self-Play: 15
    Strengths
    f-DPO (ICLR 2024 Spotlight, 126 cites): DPO generalization for diversity + alignment
    EAD-RLVR (2025): verifiable RL via exploratory annealed decoding
    Gaps
    Synthetic data / self-play pipelines: no clear evidence in publications

    Chenguang Zhu

    low hireability

    Senior Research Scientist@Meta

    Previously: Teaching Assistant @ The University of Texas at Austin

    San Francisco, US

    60
    RLHF / RLVR: 90
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 45
    LLM Creativity: 20
    Strengths
    WPO (EMNLP 2024): direct weighted preference optimization for RLHF
    Self-Generated Critiques: reward model improvement via self-critique (NAACL 2025)
    Gaps
    No published work on personality, humor, or creativity alignment

    Chen Xing

    low hireability

    Research Scientist@Meta

    Previously: Senior Research Scientist, Strategic Partnership Lead @ Scale AI

    San Francisco, US

    49
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 75
    RLHF / RLVR: 45
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 20
    Strengths
    ReGenesis (ICLR 2025 Oral): LLM self-improvement, self-play for reasoning data
    MultiChallenge (ACL 2025): multi-turn conversation eval benchmark
    Gaps
    No explicit RLHF/DPO/PPO/GRPO papers — post-training via self-improvement, not RL alignment

    Ching-An Cheng

    low hireability

    Senior Research Scientist@Google

    Previously: Principal Researcher @ Microsoft

    Redmond, US

    37
    RLHF / RLVR: 75
    Evals & Reward Models: 55
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 20
    LLM Creativity: 10
    Strengths
    Direct Nash Optimization (2024) — preference optimization for LLMs, 162 citations
    LLF-Bench (ICLR 2025) — benchmark for interactive language feedback evaluation
    Gaps
    No work on personality, tone, humor, or non-verifiable reward modeling

    Chong Ruan

    low hireability

    Researcher@DeepSeek

    Previously: MS student @ Peking University

    49
    RLHF / RLVR: 90
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 10
    LLM Creativity: 5
    Strengths
    DeepSeek-R1 co-author — pioneered GRPO/RLVR for reasoning (5230+ citations)
    Inference-Time Scaling for Generalist Reward Modeling — direct reward model research
    Gaps
    No evidence of consumer/personality post-training or non-verifiable rewards

    Chunting Zhou

    low hireability

    Researcher@Stealth

    Previously: Research Scientist @ Meta

    San Francisco, US

    51
    Synthetic Data & Self-Play: 82
    Personality & Non-Verifiable Rewards: 60
    Evals & Reward Models: 55
    RLHF / RLVR: 40
    LLM Creativity: 20
    Strengths
    LIMA (NeurIPS 2023, 1674 citations) — seminal low-data alignment paper
    Self-Alignment via Instruction Backtranslation — synthetic data for SFT
    Gaps
    Alignment is SFT-first — limited PPO/DPO/GRPO RL post-training evidence

    Conghui He

    low hireability

    Young Leading Scientist@Shanghai AI Lab

    Previously: Researcher @ Sensetime

    Shanghai, CN

    58
    Evals & Reward Models: 88
    Synthetic Data & Self-Play: 75
    Personality & Non-Verifiable Rewards: 60
    RLHF / RLVR: 50
    LLM Creativity: 15
    Strengths
    MMBench (1520 cites) — leading LLM eval benchmark; 41 eval papers
    MinerU: open-source PDF/doc extraction tool, directly relevant to docs wrangling
    Gaps
    No explicit PPO/GRPO/RL rollout work — more data-quality focused than RL-optimization focused

    Corentin Tallec

    low hireability

    Research Scientist@DeepMind

    Previously: PhD Student @ Laboratoire de Recherche en Informatique

    Paris, FR

    15
    RLHF / RLVR: 35
    LLM Creativity: 20
    Evals & Reward Models: 8
    Personality & Non-Verifiable Rewards: 8
    Synthetic Data & Self-Play: 5
    Strengths
    Gemini 2.5 co-author (2025) — on frontier LLM team at Google DeepMind
    2025 code-gen patent — practical LLM knowledge post-training alignment
    Gaps
    No direct RLHF, DPO, or PPO for LLM post-training work documented

    Daxin Jiang

    low hireability

    Co-Founder & CEO@StepFun

    Previously: Vice President @ Microsoft

    Beijing, CN

    59
    RLHF / RLVR: 90
    Synthetic Data & Self-Play: 85
    Evals & Reward Models: 75
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 20
    Strengths
    Open-Reasoner-Zero (2025): open-source base model RL — RLVR at scale
    WizardLM (1057 citations): Evol-Instruct synthetic instruction data pipeline
    Gaps
    No published work on personality, humor, or non-VR subjective reward modeling

    Deep Ganguli

    low hireability

    Member Of Technical Staff@Anthropic

    Previously: Research Director @ Stanford University

    US

    71
    RLHF / RLVR: 95
    Personality & Non-Verifiable Rewards: 90
    Evals & Reward Models: 88
    Synthetic Data & Self-Play: 55
    LLM Creativity: 25
    Strengths
    RLHF paper co-author (Anthropic, 2022) — 3059 citations, defining work
    Constitutional AI (RLAIF) co-author — canonical non-verifiable reward modeling
    Gaps
    No evidence of synthetic self-play or persona-conditioned data generation

    Dejian Yang

    low hireability

    Researcher@DeepSeek AI

    Previously: Researcher @ Microsoft

    40
    RLHF / RLVR: 95
    Evals & Reward Models: 60
    Synthetic Data & Self-Play: 35
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DeepSeek-R1: pure RL on reasoning — foundational RLVR expertise
    DeepSeek-Prover-V2: RL for subgoal decomposition (formal math RLVR)
    Gaps
    No work on personality, tone, or non-verifiable reward modeling

    Dinghuai Zhang

    low hireability

    Senior Researcher@Microsoft

    Previously: Intern @ Apple

    Beijing, CN

    32
    RLHF / RLVR: 82
    Evals & Reward Models: 38
    Synthetic Data & Self-Play: 22
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 8
    Strengths
    FlowRL (2025): reward distribution matching for LLM reasoning RL
    Rollout-Training Mismatch paper — RL stability/efficiency for LLMs
    Gaps
    No consumer post-training work: personality, creativity, humor, non-verifiable rewards

    DJ Strouse

    low hireability

    Member of Technical Staff@ReflectionAI

    Previously: PhD Student @ University of Michigan

    New York, US

    50
    RLHF / RLVR: 88
    Evals & Reward Models: 68
    Synthetic Data & Self-Play: 45
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 10
    Strengths
    Constrained RLHF (ICLR 2024 Spotlight) — reward overoptimization direct contribution
    Direct Nash Optimization (2024) — LLM self-improvement via general preferences
    Gaps
    No visible work on personality, tone, creativity, or non-verifiable rewards

    Edward Grefenstette

    low hireability

    Director of Research, Frontier AI Board Member, and Assistants Program Area Lead@DeepMind

    Previously: Head of Machine Learning @ Cohere

    London, GB

    50
    RLHF / RLVR: 80
    Personality & Non-Verifiable Rewards: 65
    Evals & Reward Models: 50
    LLM Creativity: 30
    Synthetic Data & Self-Play: 25
    Strengths
    Assistants Program Area Lead at DeepMind — leads LLM post-training org
    'Understanding the Effects of RLHF on LLM Generalisation and Diversity' (2023)
    Gaps
    No explicit synthetic data or self-play pipeline work visible

    Eric Hartford

    low hireability
    71
    Personality & Non-Verifiable Rewards: 85
    Synthetic Data & Self-Play: 80
    LLM Creativity: 75
    Evals & Reward Models: 60
    RLHF / RLVR: 55
    Strengths
    Dolphin series: production SFT post-training on Qwen2/Llama3/Gemma2, millions of downloads
    Samantha model — personality-conditioned AI companion for consumer LLM use cases
    Gaps
    Founder/CEO of Quixi AI — actively building own company, availability low

    Eric Michael Smith

    low hireability

    Research Scientist, Generative AI@Meta

    Previously: Research Engineer @ Meta

    New York, US

    61
    Personality & Non-Verifiable Rewards: 82
    Evals & Reward Models: 65
    RLHF / RLVR: 62
    LLM Creativity: 58
    Synthetic Data & Self-Play: 38
    Strengths
    Llama 2 & 3 co-author — chat fine-tuning with RLHF at billion-parameter scale
    Empathetic conversation (1.4K cites) — emotional and implicit understanding in dialogue
    Gaps
    Specific RLHF sub-role in Llama 2/3 unclear — likely on responsible AI/safety side vs. RL training

    Eric Mitchell

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Machine Learning Research Engineer @ Samsung

    San Francisco, US

    69
    RLHF / RLVR: 97
    Personality & Non-Verifiable Rewards: 85
    Evals & Reward Models: 82
    Synthetic Data & Self-Play: 60
    LLM Creativity: 22
    Strengths
    Co-leads Post-training Frontiers at OpenAI (o1, o3, GPT-5-Thinking)
    DPO author — foundational preference optimization paper (2023)
    Gaps
    No published work specifically on creative writing or roleplay post-training

    Eric Wallace

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Doctoral Student @ University of California, Berkeley

    San Francisco, US

    36
    RLHF / RLVR: 72
    Evals & Reward Models: 60
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 20
    LLM Creativity: 5
    Strengths
    Co-leads OpenAI Alignment Training — direct RLHF/post-training leadership
    The Instruction Hierarchy (2024): post-training for instruction following
    Gaps
    No published work on RLVR, tool use, or agentic tasks

    Ethan Perez

    low hireability

    Research Scientist@Anthropic

    Previously: Research Advisor @ New York University

    San Francisco, US

    70
    RLHF / RLVR: 95
    Personality & Non-Verifiable Rewards: 85
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 65
    LLM Creativity: 25
    Strengths
    'Pretraining LMs with Human Preferences' — RLHF post-training research
    'Towards Understanding Sycophancy' (527 cit.) — non-VR reward design
    Gaps
    No direct work on persona-conditioned SFT data pipelines or self-play data gen

    Fandong Meng

    low hireability

    Senior Researcher@Tencent

    Previously: PhD Student @ Institute of Computing Technology, Chinese Academy of Sciences

    Beijing, CN

    55
    Evals & Reward Models: 82
    RLHF / RLVR: 75
    Personality & Non-Verifiable Rewards: 55
    Synthetic Data & Self-Play: 45
    LLM Creativity: 20
    Strengths
    RewardAnything (2025): principle-following reward models — direct RM hit
    GRAM-R (2025): self-training foundation reward model for reasoning
    Gaps
    No RLVR/verifiable-reward RL (code, tool) work found — mostly preference/RM

    Fei Huang

    low hireability

    Chief Scientist and Senior Director of Language Technologies Lab@DAMO Academy

    Previously: VP of Security Strategy @ SUSE

    San Francisco, US

    82
    Personality & Non-Verifiable Rewards: 90
    RLHF / RLVR: 88
    Evals & Reward Models: 85
    LLM Creativity: 75
    Synthetic Data & Self-Play: 72
    Strengths
    VP Alibaba Cloud — heads the Qwen Language Tech Lab (post-training team)
    "Editing Personality for LLMs" (2024) — direct personality post-training research
    Gaps
    Very senior VP/executive — unlikely to join as IC researcher

    Fenia Christopoulou

    low hireability

    Member of Engineering (Applied Research)@poolside

    Previously: NLP Research Scientist @ Huawei

    Paris, FR

    32
    RLHF / RLVR: 72
    Evals & Reward Models: 35
    Personality & Non-Verifiable Rewards: 28
    Synthetic Data & Self-Play: 15
    LLM Creativity: 10
    Strengths
    SparsePO (EMNLP 2025): sparse token-level preference optimization for LLMs
    RL for reasoning at Poolside AI — active applied post-training work
    Gaps
    Only ~6 months at Poolside — recent hire, low hireability

    Florian Strub

    low hireability

    Head of RLVR and Post-training engineering@Cohere

    Previously: Co-head of Command A and Command R7B Post-training @ Cohere

    Paris, FR

    51
    RLHF / RLVR: 96
    Evals & Reward Models: 78
    Synthetic Data & Self-Play: 38
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 12
    Strengths
    Head of RLVR and Post-training engineering at Cohere — exact JD match
    Co-led Command A and R7B post-training — production-scale RLHF experience
    Gaps
    Limited work on personality adherence or non-verifiable reward modeling

    Furu Wei

    low hireability

    Chief Scientist@Microsoft

    Previously: Partner Research Manager @ Microsoft

    Beijing, CN

    60
    RLHF / RLVR: 88
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 82
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 15
    Strengths
    Reward Reasoning Model (NeurIPS 2025) — reward modeling for post-training
    Preference Optimization with Pseudo Feedback (2025) — RLVR/DPO at scale
    Gaps
    No direct work on personality design, humor, or non-verifiable consumer rewards

    Gabriel Synnaeve

    low hireability

    Research Scientist@Meta

    Previously: Postdoctoral Fellow @ Meta

    Paris, FR

    50
    RLHF / RLVR: 88
    Synthetic Data & Self-Play: 85
    Evals & Reward Models: 68
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    SWE-RL (Feb 2025): verifiable-reward RL for SE; 41% SWE-bench Verified
    Self-play SWE-RL (Dec 2025): self-play synthetic data gen without human labels
    Gaps
    Zero work on personality, humor, tone, or non-verifiable reward modeling

    Gerasimos Lampouras

    low hireability

    Principal Research Scientist@Huawei

    Previously: Research Associate @ University of Cambridge

    London, GB

    48
    RLHF / RLVR: 75
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 10
    Strengths
    SparsePO (ICLR 2025) — token-level preference optimization, DPO variant
    Code-Optimise (2024) — self-generated preference data for code SFT
    Gaps
    No published work on personality design or non-verifiable reward modeling

    Hang Li

    low hireability

    Head of Research@ByteDance

    Previously: Director of AI Lab @ ByteDance

    Beijing, CN

    39
    RLHF / RLVR: 72
    Evals & Reward Models: 50
    Synthetic Data & Self-Play: 35
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 10
    Strengths
    ReFT (2024, 188 citations) — RL-based reasoning fine-tuning, core RLVR
    AGILE (2024) — RL framework for LLM tool-use agents
    Gaps
    No direct work on personality/humor/creativity non-verifiable rewards

    Haodong Duan

    low hireability

    Postdoctoral Researcher, Young Scientist@Shanghai AI Laboratory

    Previously: Applied Scientist Intern @ Amazon

    Hong Kong, HK

    55
    Evals & Reward Models: 93
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 62
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 8
    Strengths
    InternLM-XComposer2.5-Reward — multi-modal reward model (ACL Findings 2025)
    Visual-RFT (239 citations) — RLVR applied to vision-language tasks
    Gaps
    No evidence of personality, humor, or subjective-quality reward modeling

    Haoxiang Wang

    low hireability

    Research Scientist@Luma AI

    Previously: Research Scientist @ NVIDIA

    San Francisco, US

    50
    RLHF / RLVR: 88
    Evals & Reward Models: 82
    Synthetic Data & Self-Play: 48
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 5
    Strengths
    RLHF Workflow (TMLR 2024) — end-to-end online RLHF recipe paper
    ArmoRM: multi-objective MoE reward model integrated into RewardBench
    Gaps
    No personality, creativity, or non-verifiable reward modeling work found

    Harrison Edwards

    low hireability

    Research Scientist@DeepMind

    Previously: Research Scientist @ OpenAI

    London, GB

    54
    Evals & Reward Models: 90
    RLHF / RLVR: 88
    Synthetic Data & Self-Play: 48
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 10
    Strengths
    "Let's Verify Step by Step" (ICLR 2024) — authored PRM800K process reward dataset
    "Prover-Verifier Games" (2025) — adversarial RL training for verifiable LLM outputs
    Gaps
    Only ~7 months into DeepMind role — very low hireability

    Harshit Sikchi

    low hireability

    Researcher@OpenAI

    Previously: Graduate Research Assistant @ The University of Texas at Austin

    San Francisco, US

    44
    RLHF / RLVR: 85
    Evals & Reward Models: 78
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 20
    LLM Creativity: 5
    Strengths
    CPL: preference learning without RL (ICLR 2024, 119 citations)
    Scaling Laws for Reward Model Overoptimization (NeurIPS 2024, 85 citations)
    Gaps
    No personality, humor, or non-verifiable reward modeling work

    Hongcheng Gao

    low hireability

    Incoming PhD student@College of AI at Tsinghua University

    Previously: Intern @ Tsinghua University

    Beijing, CN

    26
    RLHF / RLVR: 70
    Evals & Reward Models: 38
    Synthetic Data & Self-Play: 10
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Kimi k1.5 (741 citations): RL scaling for LLMs, direct RLVR evidence
    Kimi k2/k2.5: agentic intelligence, tool use alignment at scale
    Gaps
    No consumer post-training work: personality, creativity, humor absent

    Hongkun Yu

    low hireability

    Principal Engineer@DeepMind

    Previously: Senior Staff Software Engineer @ Google

    San Francisco, US

    55
    RLHF / RLVR: 72
    Personality & Non-Verifiable Rewards: 68
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 65
    LLM Creativity: 5
    Strengths
    Conditional Language Policy (2024): steerable multi-objective finetuning for LLMs
    TIR-Judge (ICLR 2026): RLVR + tool-integrated RL for LLM evaluation
    Gaps
    No evidence of creative writing, roleplay, or humor/personality-focused post-training

    HUANG Fei

    low hireability
    31
    Evals & Reward Models: 70
    RLHF / RLVR: 40
    Personality & Non-Verifiable Rewards: 20
    Synthetic Data & Self-Play: 15
    LLM Creativity: 10
    Strengths
    Qwen3Guard (arXiv:2510.14276): safety reward model for Qwen3 LLMs
    Qwen3-4B-SafeRL: RL fine-tuning using guard model as reward signal
    Gaps
    Reward modeling is safety/content moderation, not personality or creativity

    Hyung Won Chung

    low hireability

    AI Research Scientist@Meta

    Previously: Research Scientist @ OpenAI

    San Francisco, US

    67
    RLHF / RLVR: 92
    Evals & Reward Models: 82
    Synthetic Data & Self-Play: 72
    Personality & Non-Verifiable Rewards: 65
    LLM Creativity: 25
    Strengths
    o1 System Card co-author — frontier RLVR / reasoning RL work at OpenAI
    Deliberative Alignment — scalable reward modeling for non-verifiable policy adherence
    Gaps
    No direct work on personality adherence, humor, or creative roleplay data

    Jack Hessel

    low hireability

    Member of Technical Staff@Anthropic

    Previously: Founding Researcher @ Samaya AI

    Seattle, US

    81
    LLM Creativity: 88
    Personality & Non-Verifiable Rewards: 82
    Synthetic Data & Self-Play: 80
    Evals & Reward Models: 78
    RLHF / RLVR: 75
    Strengths
    RL4LMs (2022): RLHF benchmarks + baselines for NLP post-training
    SODA: 1M-scale social dialogue distillation — production synthetic data
    Gaps
    ~8 months at Anthropic — low near-term hireability

    Jane Yu

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Research Scientist @ Meta

    San Francisco, US

    59
    RLHF / RLVR: 75
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 70
    Personality & Non-Verifiable Rewards: 55
    LLM Creativity: 25
    Strengths
    Toolformer (2550 cit): LLMs teaching themselves to use tools
    Teaching LLMs to Reason with RL — RLVR / PPO for reasoning (2024)
    Gaps
    Likely recent hire at Meta AI — low near-term hireability

    Jan Hendrik Kirchner

    low hireability

    Researcher@Anthropic

    Previously: Researcher @ OpenAI

    31
    RLHF / RLVR: 68
    Evals & Reward Models: 52
    Synthetic Data & Self-Play: 20
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 5
    Strengths
    Weak-to-Strong Generalization (404 citations) — RLHF scalable oversight
    Prover-Verifier Games — game-theoretic verifier/reward training
    Gaps
    No work on personality, humor, sarcasm, or non-verifiable creative rewards

    Jan Leike

    low hireability

    Lead of the Alignment Science team@Anthropic

    Previously: Co-lead of the Superalignment Team @ OpenAI

    57
    RLHF / RLVR: 95
    Evals & Reward Models: 88
    Personality & Non-Verifiable Rewards: 55
    Synthetic Data & Self-Play: 30
    LLM Creativity: 15
    Strengths
    InstructGPT RLHF paper co-author — 20K+ citations
    Deep RL from Human Preferences (2017) — reward learning pioneer
    Gaps
    No published work on personality/humor/creativity reward modeling

    Jared Kaplan

    low hireability

    Anthropic

    56
    RLHF / RLVR: 90
    Evals & Reward Models: 80
    Personality & Non-Verifiable Rewards: 80
    Synthetic Data & Self-Play: 20
    LLM Creativity: 10
    Strengths
    'Training helpful & harmless assistant with RLHF' (2022, 2,977 citations) — foundational RLHF
    Constitutional AI / RLAIF (2022, 2,230 citations) — non-verifiable reward modeling
    Gaps
    Co-founder of Anthropic — extremely unlikely to leave his own company

    Jason E Weston

    low hireability

    Research Scientist@Meta

    Previously: Researcher @ Meta

    New York, US

    81
    RLHF / RLVR: 95
    Evals & Reward Models: 92
    Synthetic Data & Self-Play: 88
    Personality & Non-Verifiable Rewards: 85
    LLM Creativity: 45
    Strengths
    Self-Rewarding LMs (2024, 548 cit) — foundational RLHF reward modeling
    Meta-Rewarding LMs (2025) — LLM-as-meta-judge, self-improving alignment
    Gaps
    12 years at Meta FAIR — entrenched senior researcher, low mobility signals

    Jason Wei

    low hireability

    Research Scientist@Meta

    Previously: Research Scientist @ OpenAI

    San Francisco, US

    49
    RLHF / RLVR: 85
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 50
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 15
    Strengths
    FLAN (Finetuned LMs are Zero-Shot Learners) — instruction tuning at scale, 4.8K citations
    Chain-of-thought prompting — core post-training reasoning technique, 22K citations
    Gaps
    No direct published work on personality, humor, or non-verifiable reward modeling

    Jianfeng Gao

    low hireability

    Distinguished Scientist & Vice President@Microsoft

    Previously: Partner Research Manager in Business AI @ Microsoft

    Woodinville, US

    58
    RLHF / RLVR: 80
    Synthetic Data & Self-Play: 75
    Evals & Reward Models: 70
    Personality & Non-Verifiable Rewards: 45
    LLM Creativity: 20
    Strengths
    'RL for Reasoning in LLMs' (2025, 80 citations) — RLVR core work
    'FlowRL' reward distribution matching for LLM reasoning (ICLR 2026)
    Gaps
    No specific work on personality, humor, or sarcasm detection

    John Schulman

    low hireability

    cofounder and chief scientist@Thinking Machines

    Previously: researcher on the Alignment Science team @ Anthropic

    63
    RLHF / RLVR: 100
    Evals & Reward Models: 92
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 45
    LLM Creativity: 15
    Strengths
    PPO inventor (31K citations) — RLHF backbone algorithm
    InstructGPT co-author — pioneered RLHF for LLMs
    Gaps
    Co-founder of Thinking Machines — very low recruiting likelihood

    Joykirat Singh

    low hireability

    Research Assistant@University of North Carolina at Chapel Hill

    Previously: Research Fellow @ Microsoft

    Chapel Hill, US

    38
    RLHF / RLVR: 75
    Synthetic Data & Self-Play: 60
    Evals & Reward Models: 40
    Personality & Non-Verifiable Rewards: 10
    LLM Creativity: 5
    Strengths
    Agentic RL tool-use paper (2505.01441) — RL for LLM tool calls, 34 citations
    Self-evolved DPO — self-play preference optimization for small models
    Gaps
    No work on personality, tone, humor, or non-verifiable reward modeling

    Junlong Li

    low hireability

    Ph.D. Student@HKUST

    Previously: Lecturer @ Shanghai Jiao Tong University

    Hong Kong, HK

    63
    Evals & Reward Models: 90
    RLHF / RLVR: 85
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 60
    LLM Creativity: 15
    Strengths
    DeepSeek-R1 co-author — RLVR at massive scale (4530 citations)
    Generative Judge (ICLR 2024) — reward model for alignment evaluation
    Gaps
    Just started PhD at HKUST (Sep 2025) — low near-term hireability

    Junting Pan

    low hireability

    Research Scientist@Apple

    Previously: Research Scientist Intern @ Meta

    San Francisco, US

    26
    RLHF / RLVR: 65
    Evals & Reward Models: 42
    Synthetic Data & Self-Play: 15
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Step-Controlled DPO (ICLR 2025) — direct preference optimization for reasoning
    SpiritSight Agent — GUI/computer use; agentic knowledge post-training
    Gaps
    No personality or consumer post-training work (humor, creativity, sarcasm)

    Kai Chen

    low hireability

    Research Scientist & Head of Large Model Center@Shanghai AI Laboratory

    Previously: Director @ SenseTime

    Shanghai, CN

    60
    Evals & Reward Models: 92
    RLHF / RLVR: 80
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 50
    LLM Creativity: 25
    Strengths
    Leads InternLM post-training team — direct match to Neuralace's use case
    InternLM2 tech report (497 cites) — RLHF + SFT pipeline at scale
    Gaps
    No work on consumer personality, humor, or creative writing post-training

    Kaipeng Zhang

    low hireability

    Principal Researcher@Shanda AI Research

    Previously: Researcher @ Shanghai AI Lab

    Shanghai, CN

    38
    Evals & Reward Models: 78
    RLHF / RLVR: 65
    Synthetic Data & Self-Play: 20
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 5
    Strengths
    MM-Eureka (2025, 178 cit): rule-based RLVR — direct post-training RL work
    ProJudge: MLLM process judge dataset — reward model / verifiable eval signal
    Gaps
    No direct consumer post-training work (personality, humor, tone, roleplay, creativity)

    Karl Cobbe

    low hireability

    Research Scientist@OpenAI

    San Francisco, US

    45
    RLHF / RLVR: 90
    Evals & Reward Models: 88
    Synthetic Data & Self-Play: 20
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 5
    Strengths
    Let's Verify Step by Step — seminal process reward model paper
    GSM8K benchmark (5786 citations) — RLVR eval standard
    Gaps
    No published work on personality, humor, or non-verifiable creative rewards

    Leandro Von Werra

    low hireability

    Head of Research@Hugging Face

    Previously: Machine Learning Engineer @ Hugging Face

    Bern, CH

    56
    RLHF / RLVR: 97
    Evals & Reward Models: 70
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 10
    Strengths
    TRL library creator — industry-standard RLHF/DPO/GRPO training framework
    'N Implementation Details of RLHF with PPO' — seminal PPO post-training paper
    Gaps
    No evidence of personality/creativity/humor reward modeling work

    Le Hou

    low hireability

    Senior Staff Software Engineer@DeepMind

    Previously: Staff Software Engineer @ DeepMind

    San Francisco, US

    46
    Synthetic Data & Self-Play: 68
    RLHF / RLVR: 65
    Evals & Reward Models: 42
    Personality & Non-Verifiable Rewards: 38
    LLM Creativity: 15
    Strengths
    FLAN Collection: landmark instruction-tuning data/methods work at Google Brain scale
    Conditional Language Policy: multi-objective steerable finetuning framework (2024)
    Gaps
    No explicit PPO/DPO/GRPO reward-modeling publications — primarily SFT/instruction-tuning focused

    Lilian Weng

    low hireability

    Research Scientist@OpenAI

    48
    RLHF / RLVR: 85
    Evals & Reward Models: 80
    Personality & Non-Verifiable Rewards: 50
    Synthetic Data & Self-Play: 15
    LLM Creativity: 10
    Strengths
    Rule-Based Rewards for LM Safety — reward model design with verifiable rules
    Multi-step RL + auto-generated rewards, red teaming (NeurIPS 2024)
    Gaps
    No published work on personality, humor, or consumer NVR reward modeling

    Louis Castricato

    low hireability
    84
    RLHF / RLVR: 95
    LLM Creativity: 85
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 85
    Synthetic Data & Self-Play: 70
    Strengths
    trlX: built the canonical RLHF training framework (2023, 76 cites)
    Generative Reward Models (2025, 92 cites) — reward modeling for post-training
    Gaps
    CEO of Overworld AI — active startup (PRs March 2026), very low availability

    Luca Soldaini

    low hireability

    Lead Research Scientist@Ai2

    Previously: Senior Research Scientist @ Ai2

    Seattle, US

    46
    RLHF / RLVR: 75
    Synthetic Data & Self-Play: 65
    Evals & Reward Models: 60
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    Tulu 3 co-author — RLVR+DPO post-training pipeline, 465 citations
    Led Ai2 OLMo post-training team (2022–early 2026)
    Gaps
    No personality/tone/humor reward modeling in published work

    Michal Valko

    low hireability

    Chief Models Officer, Member of the Founding Team, Member of Technical Staff@Stealth AI Startup

    Previously: Principal Llama Engineer @ Meta

    Paris, FR

    47
    RLHF / RLVR: 95
    Evals & Reward Models: 72
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 25
    LLM Creativity: 15
    Strengths
    Built online RL stack for Llama 3 — hands-on RLHF at scale
    Nash Learning from Human Feedback — ICML 2023 Best Paper
    Gaps
    Co-founded Isara Labs 2025 — very low hireability as active founder

    Mrinank Sharma

    low hireability

    Intern@University of Oxford

    Previously: Research Internship @ Indian Institute of Technology, Delhi

    Oxford, GB

    51
    Personality & Non-Verifiable Rewards: 88
    RLHF / RLVR: 65
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 30
    LLM Creativity: 8
    Strengths
    Sycophancy paper (542 cit): key work on non-verifiable rewards shaping LLM personality
    Constitutional AI co-author — foundational RLHF alignment methodology
    Gaps
    No visible work on synthetic conversation data gen or self-play pipelines

    Nan Du

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Principal Researcher @ Apple

    San Francisco, US

    55
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 68
    Personality & Non-Verifiable Rewards: 65
    Evals & Reward Models: 50
    LLM Creativity: 20
    Strengths
    FLAN co-author — foundational instruction finetuning (4821 citations)
    ReAct co-author — seminal tool use / agentic reasoning (5045 citations)
    Gaps
    No explicit RLVR / verifiable-reward RL work (code evals, tool verification)

    Nathan Lambert

    low hireability

    Senior Research Scientist@Allen Institute for AI

    Previously: Research Scientist & RLHF Team Lead @ Hugging Face

    Seattle, US

    71
    RLHF / RLVR: 97
    Evals & Reward Models: 95
    Synthetic Data & Self-Play: 80
    Personality & Non-Verifiable Rewards: 65
    LLM Creativity: 20
    Strengths
    Led Tülu 3 RLVR pipeline — open LLM post-training SOTA at AllenAI
    RewardBench author — standard benchmark for reward model evaluation
    Gaps
    Founding SAIL Media — likely not seeking employment

    Olivier Bachem

    low hireability

    Senior Director, Research Scientist@DeepMind

    Previously: Director, Research Scientist @ DeepMind

    Zurich, CH

    73
    RLHF / RLVR: 93
    Evals & Reward Models: 88
    Personality & Non-Verifiable Rewards: 72
    Synthetic Data & Self-Play: 58
    LLM Creativity: 52
    Strengths
    BOND (2025): Best-of-N distillation — directly LLM alignment via reward
    WARM + WARP: reward model weight averaging; production-grade RLHF
    Gaps
    No visible work on tool-call or agentic post-training

    Oyvind Tafjord

    low hireability

    Staff Research Scientist@DeepMind

    Previously: Principal Research Scientist @ Allen Institute for AI

    Seattle, US

    47
    RLHF / RLVR: 75
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 40
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 15
    Strengths
    Tulu 3 co-author: introduces RLVR for verifiable-reward post-training
    OLMES (NAACL 2025): standardized eval framework for LMs
    Gaps
    No explicit work on personality reward modeling or subjective quality RLHF

    Phil Blunsom

    low hireability

    Chief Technology Officer@Cohere

    Previously: Chief Scientist @ Cohere

    London, GB

    58
    RLHF / RLVR: 88
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 42
    LLM Creativity: 22
    Strengths
    "Improving reward models with synthetic critiques" (2025) — RM + synthetic data
    "Uncertainty-Aware Step-wise Verification with Generative Reward Models" (2025)
    Gaps
    No direct evidence of personality-design or creative-writing RL work

    Pramodith Ballapuram

    low hireability
    40
    RLHF / RLVR: 82
    Evals & Reward Models: 52
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 20
    LLM Creativity: 15
    Strengths
    33 TRL commits — GRPOTrainer async tool calls, SAPO/CISPO losses (Nov–Jan 2026)
    Async reward functions in GRPO — unlocks non-verifiable LLM-judge rewards
    Gaps
    No evidence of personality / creative post-training or non-verifiable reward design

    Qian Liu

    low hireability

    Member of Technical Staff@xAI

    Previously: Researcher @ TikTok

    Singapore, SG

    51
    RLHF / RLVR: 85
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 15
    Strengths
    SimpleRL-Zoo (2025): RLVR/zero-RL — investigates RL for open base model reasoning
    SimpleTIR (2025): multi-turn RL for tool-integrated reasoning — aligns with tool-call direction
    Gaps
    Singapore-based — outside stated target locations (USA/Europe/China/India)

    Quoc V Le

    low hireability

    Research Scientist@Google

    Previously: Research Visitor @ Max Planck Institute for Biological Cybernetics

    San Francisco, US

    59
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 75
    Evals & Reward Models: 62
    Personality & Non-Verifiable Rewards: 55
    LLM Creativity: 20
    Strengths
    FLAN series co-author — instruction fine-tuning at scale
    "SFT Memorizes, RL Generalizes" (2025) — direct RL post-training comparison
    Gaps
    No specific work on personality/humor/creative writing for LLMs

    Rohan Taori

    low hireability

    Member of Technical Staff@Anthropic

    Previously: OSV Fellow @ O'Shaughnessy Ventures

    San Francisco, US

    65
    Synthetic Data & Self-Play: 90
    Evals & Reward Models: 85
    RLHF / RLVR: 82
    Personality & Non-Verifiable Rewards: 52
    LLM Creativity: 15
    Strengths
    AlpacaFarm: RLHF simulation framework, reward modeling without human data
    Stanford Alpaca: canonical self-instruct SFT data pipeline (52K examples)
    Gaps
    No published work on personality, tone, or non-verifiable reward modeling

    Scott R Johnston

    low hireability

    Anthropic

    Previously: Senior Displays Engineer @ Apple

    San Francisco, US

    53
    RLHF / RLVR: 92
    Personality & Non-Verifiable Rewards: 82
    Evals & Reward Models: 70
    Synthetic Data & Self-Play: 12
    LLM Creativity: 10
    Strengths
    Anthropic RLHF paper (2022) — 2,977 citations, foundational post-training work
    'Towards Understanding Sycophancy' — personality adherence and non-VR evals
    Gaps
    No synthetic data / self-play published work

    Sharan Narang

    low hireability

    Director, AI Research@Meta

    Previously: Tech Lead @ Google

    San Francisco, US

    49
    Evals & Reward Models: 80
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 40
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 20
    Strengths
    FLAN (arXiv:2210.11416) — instruction finetuning at scale, core post-training
    Llama 2 co-author — RLHF chat model post-training at Meta
    Gaps
    No standalone RLHF/PPO/DPO/GRPO paper — involvement implicit via Llama 2

    Sharath Chandra Raparthy

    low hireability

    Research Engineer@DeepMind

    Previously: Member of Technical Staff @ Reka AI

    London, GB

    44
    RLHF / RLVR75
    Synthetic Data & Self-Play62
    Evals & Reward Models55
    Personality & Non-Verifiable Rewards20
    LLM Creativity10
    Strengths
    Llama 3 tool-use + math reasoning post-training — shipped at scale
    Rainbow Teaming: adversarial synthetic LLM data generation (NeurIPS 2024)
    Gaps
    No personality, creativity, or non-verifiable reward modeling work
    SH

    Shengyi Huang

    low hireability

    Researcher@Allen Institute for Artificial Intelligence

    Previously: Researcher @ Hugging Face

    49
    RLHF / RLVR97
    Evals & Reward Models72
    Synthetic Data & Self-Play45
    Personality & Non-Verifiable Rewards20
    LLM Creativity10
    Strengths
    Tulu 3 co-author: AllenAI RLVR + DPO post-training pipeline
    CleanRL author — canonical PPO implementations, 10K+ GitHub stars
    Gaps
    No published work on personality, tone, humor, or non-verifiable reward shaping
    SY

    Shunyu Yao

    low hireability

    Research Scientist@OpenAI

    Previously: Research Intern @ Sierra

    San Francisco, US

    38
    Evals & Reward Models82
    RLHF / RLVR50
    Synthetic Data & Self-Play40
    Personality & Non-Verifiable Rewards10
    LLM Creativity8
    Strengths
    tau-bench: SOTA Tool-Agent-User eval benchmark (2024, 204 cites)
    Reflexion: verbal RL / self-improvement for agents (3360 cites)
    Gaps
    No work on personality, humor, or non-verifiable reward modeling
    SZ

    Songyang Zhang

    low hireability

    Young Scientist@Shanghai AI Laboratory

    Previously: Postdoctoral Researcher @ ShanghaiTech University

    Shanghai, CN

    62
    Evals & Reward Models95
    RLHF / RLVR78
    Synthetic Data & Self-Play58
    Personality & Non-Verifiable Rewards50
    LLM Creativity28
    Strengths
    OpenCompass creator — leading LLM eval platform (372 citations)
    CompassJudger-1/2 — generalist judge/reward models, verifiable + subjective
    Gaps
    Just started at Tencent Hunyuan (~2 months) — very low hireability window
    TE

    Teknium

    low hireability
    66
    Synthetic Data & Self-Play90
    Personality & Non-Verifiable Rewards80
    LLM Creativity78
    RLHF / RLVR55
    Evals & Reward Models28
    Strengths
    OpenHermes-2.5: ~1M synthetic SFT samples, 1290+ models trained on it
    GPTeacher: roleplay + toolformer datasets, dual consumer + knowledge signal
    Gaps
    Co-founder at NousResearch — low hireability
    TL

    Tianle Li

    low hireability

    Member of Technical Staff@xAI

    Previously: full-time team member @ Nexusflow

    60
    Evals & Reward Models90
    RLHF / RLVR88
    Synthetic Data & Self-Play62
    Personality & Non-Verifiable Rewards42
    LLM Creativity18
    Strengths
    Led RL post-training + recipe studies for Grok 4.1 and Grok 4.2 at xAI
    Grok 4: synthetic datasets, tool-use training, evals — directly query-aligned
    Gaps
    No personality adherence or creative writing work (Consumer Post-Training gap)
    TL

    Tianyu Liu

    low hireability

    Researcher@Alibaba

    Previously: Senior Researcher @ Tencent

    Beijing, CN

    53
    RLHF / RLVR83
    Evals & Reward Models82
    Synthetic Data & Self-Play48
    Personality & Non-Verifiable Rewards42
    LLM Creativity10
    Strengths
    Alibaba Qwen team staff researcher — direct Qwen post-training experience
    ACL 2025: scalable long-CoT RL — RLVR post-training at production scale
    Gaps
    Consumer post-training gap — no personality/humor/roleplay/creativity work
    TS

    Timo Schick

    low hireability

    Member of Technical Staff@Microsoft

    Previously: Member of Technical Staff @ Microsoft

    San Francisco, US

    46
    Synthetic Data & Self-Play80
    Evals & Reward Models45
    RLHF / RLVR35
    LLM Creativity35
    Personality & Non-Verifiable Rewards35
    Strengths
    Toolformer (2023, 2599 cit.) — LLMs teaching themselves tool/API use
    DINO: Generating Datasets with PLMs (264 cit.) — direct synthetic data pipeline work
    Gaps
    No direct RLHF/PPO/DPO/GRPO work — tool-calling trained via SFT not RL
    WX

    Wei Xiong

    low hireability

    Senior Research Scientist@NVIDIA

    Previously: Research Scientist @ Adobe

    San Francisco, US

    60
    RLHF / RLVR95
    Evals & Reward Models85
    Synthetic Data & Self-Play65
    Personality & Non-Verifiable Rewards45
    LLM Creativity10
    Strengths
    RAFT paper (583 citations) — reward-ranked SFT data selection, core post-training
    RLHF Workflow paper (248 citations) — end-to-end online RLHF recipe
    Gaps
    No work on personality, humor, or subjective non-verifiable reward modeling
    WL

    Wing Lian

    low hireability
    42
    RLHF / RLVR80
    Evals & Reward Models55
    Personality & Non-Verifiable Rewards35
    Synthetic Data & Self-Play25
    LLM Creativity15
    Strengths
    Axolotl: 1,418 commits — DPO, GRPO, KTO, ORPO, reward modeling all supported
    13 commits to huggingface/trl — core RLHF library
    Gaps
    No academic research output — engineer/builder, not researcher
    XC

    Xinyun Chen

    low hireability

    AI Research Scientist@Meta

    Previously: Staff Research Scientist @ DeepMind

    37
    Evals & Reward Models75
    RLHF / RLVR65
    Synthetic Data & Self-Play35
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    AlphaCode (1774 citations) — competition-level code gen with test-based RLVR filtering
    Teaching LLMs to Self-Debug — code verification and post-hoc correction
    Gaps
    No personality, tone, or non-verifiable reward work
    XZ

    Xizhou Zhu

    low hireability

    Researcher@Shanghai AI Laboratory

    Previously: Researcher @ SenseTime

    35
    Evals & Reward Models65
    RLHF / RLVR55
    Synthetic Data & Self-Play40
    Personality & Non-Verifiable Rewards8
    LLM Creativity5
    Strengths
    ZeroGUI: online RL for GUI agents at zero human annotation cost
    VisualPRM: process reward model + VisualPRM400K eval dataset
    Gaps
    All post-training work is multimodal (vision-language), not text-only LLM
    XR

    Xuancheng Ren

    low hireability

    Researcher@Alibaba

    Previously: PhD student @ Peking University

    CN

    55
    RLHF / RLVR78
    Evals & Reward Models68
    Synthetic Data & Self-Play62
    Personality & Non-Verifiable Rewards38
    LLM Creativity28
    Strengths
    #1 contributor to QwenLM/Qwen3 (108 commits) — core team
    Qwen2.5 post-training: multistage RL + 1M+ SFT samples
    Gaps
    No direct evidence of personality/creativity post-training work
    XP

    Xuehai Pan

    low hireability

    Code Engineer of Agent/RL Infra@DeepSeek AI

    Previously: Technical Staff @ Moonshot AI

    Beijing, CN

    55
    RLHF / RLVR90
    Evals & Reward Models72
    Synthetic Data & Self-Play65
    Personality & Non-Verifiable Rewards45
    LLM Creativity5
    Strengths
    88 commits to safe-rlhf — primary RLHF framework implementer
    BeaverTails: human-preference dataset for reward model training
    Gaps
    No evidence of personality/humor/creativity reward modeling
    YM

    Yunlin Mao

    low hireability
    31
    Evals & Reward Models82
    RLHF / RLVR45
    Synthetic Data & Self-Play18
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    400+ merged PRs to modelscope/evalscope — core maintainer
    TIR-Bench & SWE-Smith evals — tool-calling and code agent evaluation
    Gaps
    No evidence of personality, creativity, or non-verifiable reward work
    YQ

    Yu Qiao

    low hireability

    Principal Researcher@Shanghai AI Laboratory

    Previously: Professor @ Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

    Shenzhen, CN

    52
    Evals & Reward Models85
    RLHF / RLVR75
    Synthetic Data & Self-Play50
    Personality & Non-Verifiable Rewards30
    LLM Creativity20
    Strengths
    VisualPRM: process reward model for multimodal reasoning (73 cites, 2025)
    VideoChat-R1: RL fine-tuning / RLVR applied to video MLLMs (93 cites)
    Gaps
    No explicit work on personality, humor, or consumer-facing non-verifiable reward modeling
    YW

    Yu Wu

    low hireability

    Head of LLM Alignment Team@DeepSeek AI

    Previously: Senior Researcher @ Microsoft

    49
    RLHF / RLVR95
    Evals & Reward Models88
    Synthetic Data & Self-Play40
    Personality & Non-Verifiable Rewards15
    LLM Creativity8
    Strengths
    DeepSeek-R1: applied GRPO at scale for RLVR reasoning (5,230 citations)
    Math-Shepherd: process reward model without human annotations (593 citations)
    Gaps
    No published work on personality, tone, or non-verifiable reward modeling
    ZH

    Zac Hatfield-Dodds

    low hireability

    Member of Technical Staff@Anthropic

    Previously: unknown @ Autonomy, Agency, and Assurance Institute

    San Francisco, US

    51
    RLHF / RLVR80
    Evals & Reward Models78
    Personality & Non-Verifiable Rewards72
    Synthetic Data & Self-Play18
    LLM Creativity8
    Strengths
    "Training HH Assistant with RLHF" — co-author, 3,057 citations, foundational post-training paper
    Sycophancy paper (ICLR 2024): analyzes how RLHF produces non-truthful preference alignment
    Gaps
    No evidence of RLVR, verifiable-reward RL, or tool-call training
    ZD

    Zhengxiao Du

    low hireability

    Tech Lead@ZhipuAI

    Previously: Research Intern @ Beijing Academy of Artificial Intelligence

    Beijing, CN

    60
    RLHF / RLVR88
    Evals & Reward Models75
    Synthetic Data & Self-Play72
    Personality & Non-Verifiable Rewards42
    LLM Creativity22
    Strengths
    ChatGLM-RLHF (2024): production PPO alignment pipeline for 30B+ model
    Does RLHF Scale? (2025): empirical RLHF scaling across data/model/method
    Gaps
    Hireability low: ~9 months into new senior role at ZhipuAI

    Runs

    #1 · completed · 0 qualified / 0 found · May 7, 1:30 PM