
Neuralace · Post-Training Researcher (LLM)

Completed · 370 qualified · 1 run · May 7, 1:30 PM · company-name-neuralace-sabi-locations-usa-europe-china-india-1778160600
Parsed: Neuralace · 5 topics · Researcher · USA, Europe, China, India

    Qualified Candidates (357)

Aakriti Agrawal (high hireability)
Research Assistant @ University of Maryland
Previously: Research Internship @ Capital One
Hyattsville, US
Overall: 24
Evals & Reward Models: 50 · RLHF / RLVR: 35 · Synthetic Data & Self-Play: 20 · Personality & Non-Verifiable Rewards: 10 · LLM Creativity: 5
Strengths:
  EnsemW2S: weak-to-strong generalization via LLM ensembles
  Easy2Hard-Bench: LLM eval difficulty labeling (NeurIPS 2024)
Gaps:
  No direct RLHF/RLVR training runs (PPO, DPO, GRPO) — only adjacent alignment work

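For context on the "PPO, DPO, GRPO" gap flagged across several cards: DPO trains the policy directly from preference pairs, with no separate reward model or RL loop. A minimal illustrative sketch of the per-pair DPO loss on summed sequence log-probabilities (pure Python; `beta` and the log-prob values below are made-up assumptions, not from any candidate's work):

```python
import math

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.

    Each argument is the summed log-probability of a full response
    under the trainable policy or the frozen reference model.
    """
    # Implicit reward margin: how much the policy favors the chosen
    # response over the rejected one, relative to the reference.
    logits = beta * ((policy_chosen_lp - ref_chosen_lp)
                     - (policy_rejected_lp - ref_rejected_lp))
    # Negative log-sigmoid of the margin (binary logistic loss).
    return -math.log(1.0 / (1.0 + math.exp(-logits)))

# The loss shrinks as the policy puts more mass on the chosen response.
easy = dpo_loss(-10.0, -20.0, -15.0, -15.0)  # policy already prefers chosen
hard = dpo_loss(-20.0, -10.0, -15.0, -15.0)  # policy prefers rejected
```

At a zero margin the loss is log 2; it decreases monotonically as the policy's preference for the chosen response grows.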
Ameya Prabhu (high hireability)
Postdoctoral Researcher @ University of Tuebingen
Previously: Machine Learning Intern @ Intel
Tübingen, DE
Overall: 38
Evals & Reward Models: 78 · RLHF / RLVR: 68 · Synthetic Data & Self-Play: 28 · Personality & Non-Verifiable Rewards: 10 · LLM Creativity: 5
Strengths:
  verl-tool: RLVR framework fork for diverse tool use (pinned repo)
  LinkedIn headline calls out 'RL Post-training' as primary focus
Gaps:
  No published work on personality, tone, or non-verifiable reward modeling

Andrew Lee (high hireability)
Postdoc @ Harvard University
Previously: Research Scientist Intern (FAIR) @ Meta
Ann Arbor, US
Overall: 45
RLHF / RLVR: 72 · Personality & Non-Verifiable Rewards: 62 · LLM Creativity: 35 · Evals & Reward Models: 32 · Synthetic Data & Self-Play: 22
Strengths:
  Pairwise Cringe Loss (2023, 119 cit.) — preference optimization for LLMs
  ICML 2024 Oral: mechanistic understanding of DPO alignment
Gaps:
  No tool-call, agentic, or RLVR work

Boshi Wang (high hireability)
PhD Student @ The Ohio State University
Previously: Research Intern @ Microsoft
Overall: 21
Evals & Reward Models: 55 · Synthetic Data & Self-Play: 25 · RLHF / RLVR: 15 · LLM Creativity: 5 · Personality & Non-Verifiable Rewards: 5
Strengths:
  Mind2Web (692 cit.) — pioneering web-agent eval benchmark
  Tool Learning via Simulated Trial and Error (ACL 2024) — tool-call training angle
Gaps:
  No RLHF, DPO, PPO, or RLVR post-training work

Carel van Niekerk (high hireability)
Postdoctoral Researcher @ Heinrich-Heine University
Previously: Doctoral Candidate and Research Scientist @ Heinrich-Heine University
Düsseldorf, DE
Overall: 57
Synthetic Data & Self-Play: 75 · Personality & Non-Verifiable Rewards: 72 · RLHF / RLVR: 68 · Evals & Reward Models: 50 · LLM Creativity: 20
Strengths:
  "Post-training LLMs via RL from Self-Feedback" (2025) — direct RLVR post-training paper
  RLSF (2024) — RL from self-feedback for reasoning
Gaps:
  No published work on tool use, code gen, or agentic tasks

Chi Han (high hireability)
Graduate Student @ University of Illinois Urbana-Champaign
Urbana, US
Overall: 27
Personality & Non-Verifiable Rewards: 60 · RLHF / RLVR: 20 · Evals & Reward Models: 20 · Synthetic Data & Self-Play: 20 · LLM Creativity: 15
Strengths:
  LM-Steer (ACL 2024 Outstanding) — embedding-based LLM behavior/personality steering
  Tool Learning with Foundation Models — 438 citations, core tool-call work
Gaps:
  No direct RLHF/PPO/DPO/GRPO post-training pipeline work

Hang Yan (high hireability)
Postdoc @ Chinese University of Hong Kong
Previously: PhD student @ Fudan University
Hong Kong, HK
Overall: 58
RLHF / RLVR: 90 · Synthetic Data & Self-Play: 82 · Evals & Reward Models: 80 · Personality & Non-Verifiable Rewards: 30 · LLM Creativity: 10
Strengths:
  "Secrets of RLHF" Parts I & II — direct PPO + reward modeling work
  SynthRL (ICLR 2026) — RLVR + verifiable data synthesis
Gaps:
  No explicit personality/creativity/non-verifiable reward work

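"PPO + reward modeling" pipelines of the kind cited above typically fit the reward model with a Bradley-Terry pairwise loss before any policy optimization happens. A minimal illustrative sketch (pure Python; the scalar scores stand in for a reward model's outputs and are assumptions):

```python
import math

def bradley_terry_loss(reward_chosen, reward_rejected):
    """Pairwise reward-model loss: -log P(chosen beats rejected),
    where P is the sigmoid of the reward gap."""
    gap = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-gap)))

def batch_loss(pairs):
    """Mean Bradley-Terry loss over (chosen, rejected) score pairs."""
    return sum(bradley_terry_loss(c, r) for c, r in pairs) / len(pairs)
```

Minimizing this loss pushes the reward model to score human-preferred responses above rejected ones; the trained model then supplies the scalar reward that PPO maximizes.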
Hanlin Zhu (high hireability)
Ph.D. candidate @ University of California, Berkeley
San Francisco, US
Overall: 58
RLHF / RLVR: 80 · Personality & Non-Verifiable Rewards: 72 · Evals & Reward Models: 68 · LLM Creativity: 35 · Synthetic Data & Self-Play: 35
Strengths:
  Starling-7B: RLAIF for helpfulness/harmlessness (COLM 2024 Oral, 176 citations)
  Personalized alignment eval for open-ended text generation (EMNLP 2024)
Gaps:
  Limited synthetic data / self-play pipeline work — no evidence of large-scale SFT data gen

Hantao Lou (high hireability)
Member of Technical Staff, Manager @ Anthropic
Previously: Research Fellow @ Machine Intelligence Research Institute
San Francisco, US
Overall: 25
RLHF / RLVR: 55 · Evals & Reward Models: 35 · Personality & Non-Verifiable Rewards: 20 · Synthetic Data & Self-Play: 10 · LLM Creativity: 5
Strengths:
  align-anything: full PPO/DPO RLHF framework for multimodal LLMs
  Aligner (2024, 126 citations) — post-training alignment via correction
Gaps:
  Core identity is interpretability/theoretical alignment, not applied post-training

Han Zhou (high hireability)
AI Scientist Intern @ Mistral
Previously: Student Researcher @ Google
London, GB
Overall: 57
Evals & Reward Models: 75 · Personality & Non-Verifiable Rewards: 72 · RLHF / RLVR: 62 · Synthetic Data & Self-Play: 55 · LLM Creativity: 20
Strengths:
  ZEPO (EMNLP 2024): preference elicitation for human-aligned LLM judgments
  PairS (COLM 2024): pairwise preference LLM evaluator — direct reward model work
Gaps:
  No explicit PPO/DPO/GRPO at scale — policy optimization work is agent-focused

Hao Zhu (high hireability)
Postdoctoral Scholar @ Stanford University
Previously: PhD Student @ Carnegie Mellon University
US
Overall: 66
Personality & Non-Verifiable Rewards: 88 · Synthetic Data & Self-Play: 80 · Evals & Reward Models: 78 · RLHF / RLVR: 62 · LLM Creativity: 22
Strengths:
  SOTOPIA-RL: reward design for social/personality behaviors — core non-verifiable reward work
  SOTOPIA-S4: large-scale persona-conditioned synthetic conversation data generation
Gaps:
  No demonstrated work on standard RLHF/PPO/DPO at LLM scale (consumer post-training)

Hongru Wang (high hireability)
Research Associate @ University of Edinburgh
Previously: Research Intern @ ByteDance
Edinburgh, GB
Overall: 45
RLHF / RLVR: 80 · Evals & Reward Models: 75 · Synthetic Data & Self-Play: 35 · Personality & Non-Verifiable Rewards: 30 · LLM Creativity: 5
Strengths:
  ToolRL: RLVR for tool learning via GRPO (78 citations, 2025)
  RM-R1: reward modeling as reasoning (ICLR 2026, 49 citations)
Gaps:
  No personality/tone/creative reward modeling evidence

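The GRPO method named above replaces a learned value critic with group-relative baselines: several rollouts are sampled per prompt and each rollout's reward is z-scored against its own group. A minimal illustrative sketch of that advantage computation (pure Python; the reward values are made-up assumptions):

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one prompt's G rollouts:
    z-score each rollout's reward against the group mean/std."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts of one prompt with binary (e.g. verifier) rewards:
# only the passing rollouts receive positive advantage.
adv = grpo_advantages([1.0, 0.0, 1.0, 0.0])
```

Because the baseline is computed within the group, the advantages always sum to (approximately) zero, so no separate critic network is needed.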
Huayu Chen (high hireability)
PhD Candidate @ Tsinghua University
Previously: Research Intern @ Nvidia
Beijing, CN
Overall: 42
RLHF / RLVR: 88 · Evals & Reward Models: 78 · Synthetic Data & Self-Play: 30 · Personality & Non-Verifiable Rewards: 8 · LLM Creativity: 5
Strengths:
  NCA (NeurIPS 2024): explicit reward modeling for LLM alignment
  PRIME paper (159 citations): process reward model for reasoning RL
Gaps:
  No work on personality, tone, humor, or non-verifiable reward modeling

Jin Peng Zhou (high hireability)
Chief of Staff, Cornell University Student Assembly @ Cornell University
Ithaca, US
Overall: 47
RLHF / RLVR: 90 · Evals & Reward Models: 82 · Personality & Non-Verifiable Rewards: 45 · LLM Creativity: 10 · Synthetic Data & Self-Play: 10
Strengths:
  Q# (NeurIPS 2025): distributional RL theory applied to LM post-training
  RLHF for personalization — consumer post-training alignment work
Gaps:
  No synthetic data generation or self-play pipeline experience found

Junyang Lin (high hireability)
Research Scientist @ Qwen
Previously: Staff Engineer @ Alibaba
Beijing, CN
Overall: 70
RLHF / RLVR: 95 · Evals & Reward Models: 88 · Synthetic Data & Self-Play: 80 · Personality & Non-Verifiable Rewards: 72 · LLM Creativity: 15
Strengths:
  Core Qwen team: co-led Qwen2.5-Math and Qwen3 post-training
  Process Reward Models paper (2025) — leads RLVR reward modeling research
Gaps:
  Limited explicit work on creative writing or personality/humor modeling

Shengding Hu (high hireability)
Intern @ DeepSeek
Previously: Intern @ ByteDance
Overall: 47
Evals & Reward Models: 72 · RLHF / RLVR: 65 · Synthetic Data & Self-Play: 65 · Personality & Non-Verifiable Rewards: 20 · LLM Creativity: 15
Strengths:
  DeepSeek intern — current focus on RL scaling (O1 paradigm, GRPO environment)
  MiniCPM: 35 commits, co-author — scalable post-training strategies for small LLMs
Gaps:
  No explicit published RLHF/DPO/PPO/reward modeling papers yet

Siyu Yuan (high hireability)
Research Intern @ Moonshot AI
Previously: Research Intern @ ByteDance
Overall: 70
Personality & Non-Verifiable Rewards: 88 · RLHF / RLVR: 72 · LLM Creativity: 72 · Evals & Reward Models: 65 · Synthetic Data & Self-Play: 55
Strengths:
  InCharacter (164 citations) — personality fidelity eval for role-playing agents
  Moonshot AI RL Team intern — contributed to Seed1.5-Thinking (RL post-training)
Gaps:
  No first-author PPO/DPO/GRPO paper — RL work is via team contributions rather than original RL methods

Souradip Chakraborty (high hireability)
PhD Student Research Intern @ Google
Previously: AI Research Intern @ Chase
Seattle, US
Overall: 58
RLHF / RLVR: 90 · Evals & Reward Models: 75 · Personality & Non-Verifiable Rewards: 65 · Synthetic Data & Self-Play: 50 · LLM Creativity: 12
Strengths:
  MaxMin-RLHF (133 cit.) — alignment with diverse human preferences
  PARL: unified RLHF framework — NeurIPS-level RLHF theory
Gaps:
  No direct work on personality/humor/creativity-specific rewards

Xiangru Tang (high hireability)
Ph.D. Candidate @ Yale University
Previously: Assistant Professor @ Yale University
New Haven, US
Overall: 26
Evals & Reward Models: 62 · Synthetic Data & Self-Play: 38 · RLHF / RLVR: 20 · LLM Creativity: 5 · Personality & Non-Verifiable Rewards: 5
Strengths:
  ToolLLM (950 citations): LLMs mastering 16,000+ APIs
  OpenHands: generalist coding-agent platform (292 citations)
Gaps:
  No RLHF/RLVR/DPO training methodology publications

Yuanhao Yue (high hireability)
Overall: 37
Synthetic Data & Self-Play: 65 · Evals & Reward Models: 62 · RLHF / RLVR: 45 · Personality & Non-Verifiable Rewards: 8 · LLM Creativity: 5
Strengths:
  Post-training for LLMs is stated research focus (KD, data synthesis, evals)
  17 commits on QwenLM/Qwen3 — direct contributor to target model family
Gaps:
  No published work on RLHF/RLVR or preference optimization (PPO/DPO/GRPO)

Zhenhailong Wang (high hireability)
Research Assistant @ BLENDER Lab
Previously: Applied Scientist Intern @ Amazon
Champaign, US
Overall: 40
RLHF / RLVR: 60 · Personality & Non-Verifiable Rewards: 55 · Evals & Reward Models: 45 · Synthetic Data & Self-Play: 30 · LLM Creativity: 10
Strengths:
  PAPO: policy optimization for multimodal reasoning (ICLR 2026)
  Multimodal Policy Internalization for Conversational Agents (ICLR 2026)
Gaps:
  No dedicated DPO/GRPO or text-only LLM preference optimization work

Aakanksha Chowdhery (medium hireability)
Member of Technical Staff @ ReflectionAI
Previously: Senior Staff Research Scientist @ Meta
San Francisco, US
Overall: 37
RLHF / RLVR: 65 · Evals & Reward Models: 58 · Synthetic Data & Self-Play: 35 · Personality & Non-Verifiable Rewards: 15 · LLM Creativity: 10
Strengths:
  Scaling instruction-finetuned LMs (FLAN) — 5,380 citations, post-training at scale
  RL for agentic LLMs at ReflectionAI — directly relevant to RLVR / knowledge post-training
Gaps:
  No published work on personality alignment, non-verifiable rewards, or creative writing

Abhinav Chinta (medium hireability)
Research Assistant @ Stanford University
Previously: Researcher @ University of Illinois Urbana-Champaign
San Francisco, US
Overall: 39
Personality & Non-Verifiable Rewards: 68 · Evals & Reward Models: 42 · RLHF / RLVR: 35 · LLM Creativity: 28 · Synthetic Data & Self-Play: 22
Strengths:
  Unsupervised Human Preference Learning (EMNLP 2024) — on-target for personality/non-verifiable rewards
  Preference-agent approach: a small model steers a large LLM toward individual preferences
Gaps:
  No direct RL work — no PPO, DPO, GRPO, or reward model training

Abhinav Rastogi (medium hireability)
Research Scientist @ Mistral AI
Previously: Staff Research Scientist & Tech Lead Manager @ DeepMind
San Francisco, US
Overall: 59
RLHF / RLVR: 92 · Synthetic Data & Self-Play: 74 · Evals & Reward Models: 72 · Personality & Non-Verifiable Rewards: 45 · LLM Creativity: 10
Strengths:
  RLAIF vs RLHF (ICML 2024) — 922-citation landmark RLHF/RLAIF paper
  Robust Multi-Objective Online DPO alignment (AAAI 2025)
Gaps:
  No work on personality adherence, tone, humor, or creative writing

Achal Dave (medium hireability)
Member of Technical Staff @ Anthropic
Previously: Research Scientist @ Toyota Research Institute
San Francisco, US
Overall: 49
RLHF / RLVR: 75 · Synthetic Data & Self-Play: 70 · Evals & Reward Models: 45 · Personality & Non-Verifiable Rewards: 40 · LLM Creativity: 15
Strengths:
  RLAIF patent 'Scaling RL With AI Feedback' — direct RLHF/RLAIF evidence
  Post-training geometry paper (arXiv 2025) — Anthropic post-training research
Gaps:
  No published work on personality adherence, humor, or creative NLG

Aditi Chaudhary (medium hireability)
Research Scientist @ DeepMind
Previously: Graduate Research Assistant @ Carnegie Mellon University
San Francisco, US
Overall: 28
Evals & Reward Models: 50 · Synthetic Data & Self-Play: 30 · RLHF / RLVR: 25 · Personality & Non-Verifiable Rewards: 25 · LLM Creativity: 10
Strengths:
  Gemini 2.5 contributor — DeepMind post-training/eval team experience
  DB-confirmed expertise: LLM post-training, instruction fine-tuning
Gaps:
  No published RLHF/DPO/GRPO/reward-modeling papers

Aishwarya Padmakumar (medium hireability)
Senior Dialogue Scientist @ NVIDIA
Previously: Senior Applied Scientist @ Amazon
San Francisco, US
Overall: 55
RLHF / RLVR: 75 · Evals & Reward Models: 62 · Personality & Non-Verifiable Rewards: 60 · Synthetic Data & Self-Play: 50 · LLM Creativity: 30
Strengths:
  NVIDIA role explicitly covers RLHF for LLMs (raw data signal)
  Data-Efficient Alignment with RLHF (2023) — direct RLHF alignment work
Gaps:
  No RLVR or verifiable-reward RL work (tool use, code, agentic tasks)

Akbir Khan (medium hireability)
Member of Technical Staff @ Anthropic
Previously: Research Analyst @ Cooperative AI Foundation
San Francisco, US
Overall: 52
RLHF / RLVR: 75 · Evals & Reward Models: 62 · Personality & Non-Verifiable Rewards: 62 · Synthetic Data & Self-Play: 45 · LLM Creativity: 15
Strengths:
  'Language Models Learn to Mislead Humans via RLHF' — direct RLHF post-training work
  Best Paper at ICML 2024 for LLM debate / scalable oversight research
Gaps:
  No RLVR / tool-use or agentic post-training work found

Albert Q. Jiang (medium hireability)
Research Scientist @ Mistral AI
Previously: Intern @ Meta
London, GB
Overall: 33
RLHF / RLVR: 62 · Evals & Reward Models: 60 · Synthetic Data & Self-Play: 35 · LLM Creativity: 5 · Personality & Non-Verifiable Rewards: 5
Strengths:
  Devstral: fine-tuning LMs for coding-agent applications (Mistral, 2025)
  Magistral: RLVR-based reasoning model at Mistral AI (2025)
Gaps:
  No consumer post-training work — personality, humor, creativity absent

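RLVR work like that cited above scores completions with a programmatic verifier rather than a learned reward model. A minimal illustrative sketch of such a verifiable reward for math-style answers (pure Python; the `\boxed{}` answer convention and the numeric tolerance are assumptions for illustration):

```python
import re

def verifiable_reward(completion, reference_answer, tol=1e-6):
    """Binary reward: 1.0 if the last \\boxed{...} value in the
    completion matches the reference answer numerically, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no final answer emitted
    try:
        return 1.0 if abs(float(matches[-1]) - float(reference_answer)) <= tol else 0.0
    except ValueError:
        return 0.0  # non-numeric answer
```

Because the check is deterministic, rewards of this kind are immune to reward-model hacking, which is the usual argument for RLVR on math, code, and tool-use tasks.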
Albert Webson (medium hireability)
Senior Research Scientist @ DeepMind
Previously: Research Scientist @ DeepMind
Overall: 56
RLHF / RLVR: 90 · Synthetic Data & Self-Play: 65 · Evals & Reward Models: 60 · Personality & Non-Verifiable Rewards: 45 · LLM Creativity: 20
Strengths:
  Primary RL lead for Gemini — production RLHF at massive scale
  Flan-T5 co-author (5.3K citations) — foundational post-training work
Gaps:
  No published work on personality/non-verifiable reward modeling or creativity

Alekh Agarwal (medium hireability)
Staff Research Scientist @ Google
Previously: Principal Research Manager @ Microsoft
Seattle, US
Overall: 48
RLHF / RLVR: 90 · Evals & Reward Models: 88 · Personality & Non-Verifiable Rewards: 45 · Synthetic Data & Self-Play: 10 · LLM Creativity: 5
Strengths:
  "Minimaximalist Approach to RLHF" (NeurIPS 2024) — core RLHF algorithm design
  "Rewarding Progress: Scaling Process Verifiers" (2025) — RLVR for reasoning
Gaps:
  No work on personality, humor, sarcasm, or creative writing

Alexander M. Rush (medium hireability)
Research Scientist @ Cursor
Previously: Researcher @ Hugging Face
New York, US
Overall: 47
RLHF / RLVR: 85 · Evals & Reward Models: 60 · Synthetic Data & Self-Play: 45 · Personality & Non-Verifiable Rewards: 30 · LLM Creativity: 15
Strengths:
  Multi-Turn Code Gen (ICML 2025 Spotlight): RLVR for multi-step tool/code tasks
  Zephyr (773 citations): DPO-based LM alignment, widely adopted recipe
Gaps:
  Primarily code/tool-use focused — limited consumer personality or creativity work

Alexander Spangher (medium hireability)
Postdoctoral Researcher @ Stanford University
Previously: Data Scientist and Journalist @ New York Times
San Francisco, US
Overall: 28
LLM Creativity: 72 · Evals & Reward Models: 32 · Personality & Non-Verifiable Rewards: 22 · RLHF / RLVR: 8 · Synthetic Data & Self-Play: 5
Strengths:
  EMNLP 2024 Outstanding Paper: human-level narrative generation evaluation
  ICML 2024 Spotlight: classifier-free guidance for topic-controlled LLM output
Gaps:
  No RLHF/RLVR/DPO/PPO work — no post-training methodology experience

Aliaksei Severyn (medium hireability)
Research Scientist @ Google
Overall: 54
RLHF / RLVR: 90 · Evals & Reward Models: 80 · Synthetic Data & Self-Play: 75 · Personality & Non-Verifiable Rewards: 15 · LLM Creativity: 10
Strengths:
  BOND (ICLR 2025) — Best-of-N distillation, direct LLM alignment
  West-of-N — synthetic preference generation for reward modeling
Gaps:
  No published work on personality, humor, creativity, or non-verifiable reward shaping

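Best-of-N distillation approaches like BOND train the policy to imitate a Best-of-N sampler. The teacher behavior being distilled is simple to sketch (pure Python; `generate` and `reward` are hypothetical stand-ins for a sampler and a reward model, supplied by the caller):

```python
def best_of_n(prompt, generate, reward, n=4):
    """Draw n candidate responses and return the highest-reward one.

    generate(prompt, i) -> candidate string for seed i
    reward(prompt, response) -> scalar score
    """
    candidates = [generate(prompt, i) for i in range(n)]
    return max(candidates, key=lambda c: reward(prompt, c))

# Toy example: the stand-in 'reward' simply prefers longer responses.
pick = best_of_n(
    "hi",
    generate=lambda p, i: p * (i + 1),  # "hi", "hihi", "hihihi", ...
    reward=lambda p, c: len(c),
)
```

Running Best-of-N at inference time is expensive (n forward passes per query); distillation bakes the same reward-weighted selection behavior into a single forward pass.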
Alon Benhaim (medium hireability)
Senior Applied Scientist @ Microsoft
Previously: Applied Scientist 2 @ Microsoft
Seattle, US
Overall: 41
RLHF / RLVR: 80 · Evals & Reward Models: 50 · Personality & Non-Verifiable Rewards: 45 · Synthetic Data & Self-Play: 20 · LLM Creativity: 10
Strengths:
  '*PO' paper: empirical DPO/RLHF analysis + LN-DPO — core preference optimization
  POROver (2025): preference optimization for safety/overrefusal — non-verifiable reward alignment
Gaps:
  No work on personality/creativity/humor or subjective reward modeling

Aman Madaan (medium hireability)
AI Researcher and Engineer @ xAI
Previously: Graduate Research Assistant @ Carnegie Mellon University
San Francisco, US
Overall: 49
Synthetic Data & Self-Play: 65 · RLHF / RLVR: 60 · Evals & Reward Models: 55 · Personality & Non-Verifiable Rewards: 35 · LLM Creativity: 30
Strengths:
  Self-Refine (NeurIPS 2023) — foundational iterative-feedback-loop paper
  AutoMix — routes queries across model sizes, an exact match for the JD's 'call a bigger model' goal
Gaps:
  Core PPO/DPO/GRPO post-training work not directly evidenced — mainly inference-time methods

Angela Fan (medium hireability)
Meta
Previously: Research Scientist @ Meta
Overall: 52
LLM Creativity: 85 · Personality & Non-Verifiable Rewards: 70 · RLHF / RLVR: 55 · Evals & Reward Models: 30 · Synthetic Data & Self-Play: 20
Strengths:
  LLaMA 2 co-author — direct RLHF post-training for chat alignment
  Hierarchical Neural Story Generation (ACL 2018) — seminal creative-writing work
Gaps:
  No standalone reward modeling or preference optimization (DPO/PPO) papers

An Yan (medium hireability)
Research Scientist @ Salesforce
Previously: Research Intern @ Microsoft
San Diego, US
Overall: 32
Synthetic Data & Self-Play: 68 · Evals & Reward Models: 48 · RLHF / RLVR: 25 · LLM Creativity: 10 · Personality & Non-Verifiable Rewards: 8
Strengths:
  List Items One by One (COLM 2024) — synthetic data recipe for multimodal post-training
  MTA-Agent — synthetic RL data pipeline for multimodal search agents
Gaps:
  No PPO/DPO/GRPO or explicit reward modeling work

An Yang (medium hireability)
Researcher @ Alibaba
Previously: MS student @ Peking University
Overall: 69
RLHF / RLVR: 95 · Evals & Reward Models: 80 · Synthetic Data & Self-Play: 72 · Personality & Non-Verifiable Rewards: 68 · LLM Creativity: 30
Strengths:
  QwQ-32B — co-authored core RLVR reasoning paper (409 citations, 2025)
  WorldPM — reward/preference modeling at scale (2025)
Gaps:
  Limited creative writing / personality adherence work — focused on RLVR

Aohan Zeng (medium hireability)
Overall: 53
Evals & Reward Models: 82 · RLHF / RLVR: 78 · Synthetic Data & Self-Play: 52 · Personality & Non-Verifiable Rewards: 35 · LLM Creativity: 20
Strengths:
  ChatGLM-RLHF paper — applied RLHF on a production LLM at ZhipuAI
  "Does RLHF Scale?" (2025) — explicit RLHF scaling research
Gaps:
  No explicit personality/tone/humor reward modeling work

Aohan Zeng (medium hireability)
PhD student @ Tsinghua University
Beijing, CN
Overall: 50
RLHF / RLVR: 88 · Evals & Reward Models: 82 · Synthetic Data & Self-Play: 38 · Personality & Non-Verifiable Rewards: 30 · LLM Creativity: 10
Strengths:
  ChatGLM-RLHF (2024) — implemented RLHF for production LLM alignment
  Does RLHF Scale? (2025) — systematic RLHF scaling experiments
Gaps:
  No focused work on personality, humor, or creative writing (non-verifiable rewards)

Baolin Peng (medium hireability)
Principal Researcher @ Microsoft
Previously: Senior Researcher @ Tencent
Seattle, US
Overall: 62
RLHF / RLVR: 88 · Evals & Reward Models: 78 · Synthetic Data & Self-Play: 68 · Personality & Non-Verifiable Rewards: 55 · LLM Creativity: 20
Strengths:
  "Advantage Modeling for RLHF" (2025) — direct RLHF post-training work
  Nash Policy Optimization (2025) — general preference alignment at LLM scale
Gaps:
  Essentially no work on LLM creativity (roleplay, humor, cultural zeitgeist)

Baosong Yang (medium hireability)
Algorithm Expert @ Alibaba
Previously: Postgraduate Intern @ Tencent
Hangzhou, CN
Overall: 40
Evals & Reward Models: 55 · Personality & Non-Verifiable Rewards: 50 · RLHF / RLVR: 40 · Synthetic Data & Self-Play: 35 · LLM Creativity: 20
Strengths:
  Qwen3 co-author — first-hand knowledge of the target model's training pipeline
  Qwen2 + Qwen2.5-Omni contributor — spans text and multimodal post-training
Gaps:
  No standalone RLHF/RLVR/DPO papers — individual post-training role within Qwen not isolated

Barna Pásztor (medium hireability)
Doctoral Fellow @ ETH AI Center
Previously: Core contributor and post-training team lead @ Swiss AI Initiative
Zürich, CH
Overall: 36
RLHF / RLVR: 75 · Evals & Reward Models: 45 · Personality & Non-Verifiable Rewards: 32 · Synthetic Data & Self-Play: 20 · LLM Creativity: 10
Strengths:
  Led the post-training team for Apertus-70B — direct open-source LLM RLHF experience
  Stackelberg RLHF paper — preference optimization as a sequential game (EWRL 2025)
Gaps:
  Primarily theoretical (game-theoretic) rather than applied RLHF engineering

Barun Patra (medium hireability)
Member of Technical Staff @ Microsoft
Previously: Senior Applied Scientist @ Microsoft
Seattle, US
Overall: 41
RLHF / RLVR: 75 · Evals & Reward Models: 50 · Synthetic Data & Self-Play: 45 · Personality & Non-Verifiable Rewards: 30 · LLM Creativity: 5
Strengths:
  'A Practical Analysis of Human Alignment with *PO' — proposes LN-DPO (NAACL 2025)
  Phi-3 co-author — small-model post-training directly matches the Qwen post-training context
Gaps:
  No work on personality, tone, creativity, or humor alignment

Behnam Neyshabur (medium hireability)
Member of Technical Staff @ Anthropic
Previously: Senior Staff Research Scientist & Team Lead @ DeepMind
San Francisco, US
Overall: 49
Synthetic Data & Self-Play: 85 · RLHF / RLVR: 72 · Evals & Reward Models: 70 · Personality & Non-Verifiable Rewards: 15 · LLM Creativity: 5
Strengths:
  "Beyond Human Data" (TMLR 2024) — self-training/self-play data pipeline for LLMs
  Co-led DeepMind's Blueshift team, which fed into Gemini post-training in production
Gaps:
  No explicit reward modeling or RLHF-specific papers published

Bertie Vidgen (medium hireability)
AI Research @ Mercor
Previously: Data + Evaluation @ Contextual AI
US
Overall: 48
Evals & Reward Models: 82 · Personality & Non-Verifiable Rewards: 80 · RLHF / RLVR: 35 · Synthetic Data & Self-Play: 32 · LLM Creativity: 10
Strengths:
  PRISM dataset: individualized human feedback for subjective LLM alignment (186 citations)
  "Socioaffective alignment" (2025) — emotions, personality, AI-human relationship design
Gaps:
  No technical RLHF/PPO/DPO/GRPO training implementation evidence

Bhrij Patel (medium hireability)
Incoming Research Intern @ AG2
Previously: Machine Learning Research Intern @ Qualcomm
San Francisco, US
Overall: 20
Evals & Reward Models: 55 · RLHF / RLVR: 20 · Personality & Non-Verifiable Rewards: 15 · LLM Creativity: 5 · Synthetic Data & Self-Play: 5
Strengths:
  ACL 2026: lightweight function calling — direct tool-call alignment
  EMNLP 2025: API learning from demonstrations for tool-based agents
Gaps:
  No direct RLHF/RLVR post-training experience; RL work is theoretical (average-reward)

Bill Yuchen Lin (medium hireability)
Member of Technical Staff @ xAI
Previously: Research Scientist @ Allen Institute for AI
San Francisco, US
Overall: 66
Evals & Reward Models: 92 · Synthetic Data & Self-Play: 88 · RLHF / RLVR: 80 · Personality & Non-Verifiable Rewards: 50 · LLM Creativity: 20
Strengths:
  RewardBench (417 citations) — the defining paper for reward model evaluation
  Magpie (2025) — alignment data synthesis from scratch, SFT pipeline
Gaps:
  No published work on personality, humor, sarcasm, or creative-writing post-training

Bin Liang (medium hireability)
Postdoctoral Fellow @ The Chinese University of Hong Kong
Previously: PhD student @ Harbin Institute of Technology
CN
Overall: 27
Evals & Reward Models: 55 · Personality & Non-Verifiable Rewards: 40 · RLHF / RLVR: 15 · Synthetic Data & Self-Play: 15 · LLM Creativity: 8
Strengths:
  CoreEval (ACL 2025): builds contamination-resilient LLM eval datasets
  Multi-persona Framework (ACL 2025): persona-conditioned quality scoring
Gaps:
  No direct RLHF/PPO/DPO/GRPO post-training work

Bin Wang (medium hireability)
Principal Researcher @ Xiaomi
Previously: Full Professor @ Institute of Information Engineering, Chinese Academy of Sciences
Overall: 47
Synthetic Data & Self-Play: 78 · RLHF / RLVR: 65 · Evals & Reward Models: 62 · Personality & Non-Verifiable Rewards: 20 · LLM Creativity: 10
Strengths:
  TaP (2025): taxonomy-guided automated preference data generation framework
  MobileIPL (2025): iterative DPO preference learning for agentic thinking
Gaps:
  No evidence of personality, humor, or creative-writing post-training work

Binyuan Hui (medium hireability)
Senior Staff Algorithm Engineer @ Alibaba
Previously: Staff Algorithm Engineer @ Alibaba
Beijing, CN
Overall: 64
RLHF / RLVR: 85 · Evals & Reward Models: 80 · Synthetic Data & Self-Play: 70 · Personality & Non-Verifiable Rewards: 65 · LLM Creativity: 20
Strengths:
  Core Qwen team: co-authored Qwen, Qwen2.5, Qwen2.5-Coder, and Qwen3
  WorldPM (2025): scaling human preference modeling for reward models
Gaps:
  No creative writing / personality / humor-specific published work

Bobak Shahriari (medium hireability)
Researcher @ DeepMind
Previously: Research Scientist @ DeepMind
Overall: 52
RLHF / RLVR: 85 · Personality & Non-Verifiable Rewards: 80 · Evals & Reward Models: 70 · Synthetic Data & Self-Play: 15 · LLM Creativity: 10
Strengths:
  BOND (ICLR 2025) — LLM alignment via Best-of-N distillation
  'Capturing individual human preferences with reward features' (2025) — personalized reward modeling
Gaps:
  No evident synthetic data generation or self-play pipeline work

Bofei Gao (medium hireability)
MS student @ Peking University
CN
Overall: 48
RLHF / RLVR: 80 · Evals & Reward Models: 75 · Synthetic Data & Self-Play: 45 · Personality & Non-Verifiable Rewards: 35 · LLM Creativity: 5
Strengths:
  Preference learning survey — comprehensive DPO/PPO/GRPO coverage
  MATH-Minos: natural-language-feedback math verifier (reward model for reasoning)
Gaps:
  Work focused on verifiable rewards (math/code) — limited personality/conversation post-training

Bowen Li (medium hireability)
Shanghai AI Lab
Previously: Researcher @ Shanghai AI Lab
Overall: 34
Synthetic Data & Self-Play: 65 · Evals & Reward Models: 50 · RLHF / RLVR: 45 · LLM Creativity: 5 · Personality & Non-Verifiable Rewards: 5
Strengths:
  EvoSyn: evolutionary synthetic data generation framework for RLVR (Oct 2025)
  TESSY: teacher-student SFT data synthesis, +11% code-gen gains (2026)
Gaps:
  No explicit reward model or RLHF/PPO/DPO methodology papers

    BT

    Bowen Tan

    medium hireability

    AI Research Scientist@Meta

    Previously: Machine Learning Researcher @ Apple

    US

    45
    Synthetic Data & Self-Play65
    RLHF / RLVR55
    Evals & Reward Models50
    LLM Creativity30
    Personality & Non-Verifiable Rewards25
    Strengths
    Efficient Soft Q-Learning for Text Generation — RL for generation (70 citations, 2022)
    Learning Data Manipulation for Augmentation — NeurIPS 2019, 148 citations
    Gaps
    No direct PPO/DPO/GRPO post-training work on modern instruction-tuned LLMs
    BY

    Bowen Yu

    medium hireability

    Algorithm Expert@Alibaba

    Previously: PhD student @ Chinese Academy of Sciences

    Beijing, CN

    81
    RLHF / RLVR: 92
    Evals & Reward Models: 88
    Synthetic Data & Self-Play: 80
    Personality & Non-Verifiable Rewards: 78
    LLM Creativity: 65
    Strengths
    Leads Qwen-Instruct post-training — exact match for search query
    'Preference Ranking Optimization for Human Alignment' (AAAI 2024, 326 cit.)
    Gaps
    No explicit consumer/emotional intelligence (sarcasm, humor) papers found
    CX

    Can Xu

    medium hireability

    Software Engineer@Microsoft

    Previously: Software Engineer @ JPMorgan Chase & Co.

    New York, US

    61
    Synthetic Data & Self-Play: 97
    RLHF / RLVR: 75
    Evals & Reward Models: 68
    Personality & Non-Verifiable Rewards: 50
    LLM Creativity: 15
    Strengths
    Arena Learning: self-play chatbot arena as data flywheel (NeurIPS 2024)
    Evol-Instruct creator — canonical synthetic instruction data pipeline
    Gaps
    No direct work on personality, humor, or tone-based non-verifiable rewards
    CA

    casinca

    medium hireability
    26
    RLHF / RLVR: 83
    Evals & Reward Models: 20
    Synthetic Data & Self-Play: 15
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    15 merged TRL PRs: GRPO variants (VESPO, SAPO, OPSM), DPO norm, async rollout
    VESPO implementation in grpo_trainer.py — paper-to-code contribution
    Gaps
    No evidence of personality/creative RLHF or non-verifiable reward work
    CG

    Chang Gao

    medium hireability

    Researcher@Alibaba

    Previously: Research Intern @ Z.ai

    Beijing, CN

    42
    RLHF / RLVR: 92
    Evals & Reward Models: 55
    Synthetic Data & Self-Play: 30
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 5
    Strengths
    Qwen3 co-author — direct experience fine-tuning the exact model
    GSPO paper: novel GRPO/RLVR algorithm (2025, 94 citations)
    Gaps
    No work on personality, creativity, or non-verifiable reward modeling
    CD

    Chenghao Deng

    medium hireability

    Research Intern@TikTok

    Previously: Undergraduate Intern @ Penn State University

    San Francisco, US

    16
    Evals & Reward Models: 40
    RLHF / RLVR: 20
    Synthetic Data & Self-Play: 10
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    EnsemW2S: token-level ensemble for weak-to-strong LLM alignment (NeurIPS 2024)
    Easy2Hard-Bench: difficulty-graded eval benchmark for LLMs (NeurIPS 2024)
    Gaps
    No direct RLHF/RLVR or reward modeling pipeline work
    CW

    Chenglong Wang

    medium hireability

    PhD candidate@Northeastern University (Shenyang, China)

    Shenyang, CN

    45
    Evals & Reward Models: 85
    RLHF / RLVR: 75
    Synthetic Data & Self-Play: 40
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 5
    Strengths
    GRAM (ICML 2025): generative foundation reward model — core RM contribution
    MRMBench (AAAI 2026): multi-dimensional reward model eval framework
    Gaps
    No personality, tone, humor, or non-verifiable reward work
    CQ

    Cheng Qian

    medium hireability

    Professor@University of Illinois Urbana-Champaign

    Previously: Associate Professor @ University of Illinois Urbana-Champaign

    Urbana-Champaign, US

    60
    RLHF / RLVR: 85
    Evals & Reward Models: 80
    LLM Creativity: 60
    Personality & Non-Verifiable Rewards: 50
    Synthetic Data & Self-Play: 25
    Strengths
    ToolRL (NeurIPS 2025): RLVR reward shaping for tool use — exact JD priority match
    RM-R1 (2025): reward modeling as reasoning — eval/reward model expertise
    Gaps
    2nd-year PhD — not in typical graduation/industry transition window yet
    CL

    Chengqi Lyu

    medium hireability

    Researcher@Shanghai AI Laboratory

    Previously: Researcher @ SenseTime

    50
    RLHF / RLVR: 82
    Evals & Reward Models: 78
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 22
    LLM Creativity: 8
    Strengths
    OREAL (2025): outcome reward RL, 7B → 94% MATH-500 — core RLVR work
    CompassVerifier: verifier for LLM eval + outcome reward signals (2025)
    Gaps
    No work on personality, tone, humor, or non-verifiable reward shaping
    CH

    Chengsong Huang

    medium hireability

    Ph.D. Student in Computer Science@Washington University in St. Louis

    Previously: Research Intern @ Tencent

    St. Louis, US

    50
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 78
    Evals & Reward Models: 72
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 8
    Strengths
    'Taming Overconfidence in LLMs': reward calibration in RLHF (ICLR 2025)
    R-Zero: self-evolving LLM from zero data — direct self-play approach (ICLR 2026)
    Gaps
    No work on personality, tone, or non-verifiable creative rewards
    CW

    Chengyu Wang

    medium hireability

    Algorithm Expert@Alibaba

    Previously: PhD student @ East China Normal University

    Hangzhou, CN

    34
    Synthetic Data & Self-Play: 75
    RLHF / RLVR: 50
    Evals & Reward Models: 30
    Personality & Non-Verifiable Rewards: 10
    LLM Creativity: 5
    Strengths
    AgenticQwen (ACL 2026): industrial tool-use training for small Qwen — direct hit
    Mock Worlds, Real Skills (ACL 2026): rubric-based rewards + synthetic task environments
    Gaps
    Primary focus is knowledge distillation, not RLHF/DPO/PPO/GRPO post-training
    CZ

    Chen Zhu

    medium hireability

    Research Scientist@Meta

    Previously: Member of Technical Staff @ xAI

    San Francisco, US

    50
    RLHF / RLVR: 95
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 25
    LLM Creativity: 15
    Strengths
    ODIN (ICML 2024): disentangled reward model, prevents reward hacking
    Perfect Blend/CGPO (2025): multi-objective RLHF; outperforms PPO/DPO on chat+code+math
    Gaps
    No explicit personality, tone, or creativity reward modeling work
    CC

    Chi Chen

    medium hireability

    Researcher@Tsinghua University

    Previously: PhD student @ Tsinghua University

    CN

    31
    Evals & Reward Models: 60
    RLHF / RLVR: 55
    Synthetic Data & Self-Play: 30
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    AgentCPM-GUI: GRPO-based RL fine-tuning for GUI agents (SOTA 5 benchmarks)
    MiniCPM-V co-author: efficient small MLLM matching GPT-4V on edge devices
    Gaps
    No consumer post-training work (personality, tone, humor, non-verifiable rewards)
    CN

    Chirag Nagpal

    medium hireability

    AI Research Scientist@Meta

    Previously: Research Scientist @ Google

    San Francisco, US

    44
    RLHF / RLVR: 88
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 28
    Synthetic Data & Self-Play: 12
    LLM Creativity: 5
    Strengths
    "Helping or Herding?" — reward model ensemble robustness, reward hacking (118 cites)
    "Rewarding Progress" — process verifiers for LLM reasoning, RLVR-adjacent (155 cites)
    Gaps
    No work on personality, humor, sarcasm, or non-verifiable subjective reward design
    CZ

    Chong Zhang

    medium hireability

    PhD student@MiroMind AI; Fudan University

    22
    RLHF / RLVR: 62
    Evals & Reward Models: 28
    Synthetic Data & Self-Play: 12
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Co-first author '100 days after DeepSeek-R1' — RLVR/SFT survey (2025)
    MiroMind-M1 co-author: RLVR with multi-stage policy optimization on Qwen
    Gaps
    No consumer post-training work (personality, humor, creativity, non-verifiable rewards)
    CN

    Christoforos Nalmpantis

    medium hireability

    AI Researcher@Prima Mente

    Previously: Postdoctoral Researcher @ Meta

    London, GB

    33
    RLHF / RLVR: 82
    Evals & Reward Models: 35
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 12
    LLM Creativity: 10
    Strengths
    'Teaching LLMs to Reason with RL' (2024, 150 citations) — core post-training RL
    'Understanding RLHF on LLM Generalisation' (2023, 272 citations) — RLHF depth
    Gaps
    No evidence of personality or non-verifiable reward modeling work
    CD

    Christoph Dann

    medium hireability

    Research Scientist@Google

    Previously: Research Intern @ Google

    Pittsburgh, US

    39
    RLHF / RLVR: 92
    Evals & Reward Models: 62
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 8
    LLM Creativity: 5
    Strengths
    Minimaximalist Approach to RLHF (2024) — 124 citations, theoretical RLHF foundations
    P3O: pessimistic preference policy optimization — robust alignment (2024)
    Gaps
    No work on personality, creativity, or non-verifiable reward design for LLMs
    CZ

    Chujie Zheng

    medium hireability

    Researcher@Alibaba Group

    Previously: Research Intern @ 01.AI

    Beijing, CN

    60
    RLHF / RLVR: 92
    Evals & Reward Models: 88
    Personality & Non-Verifiable Rewards: 55
    Synthetic Data & Self-Play: 35
    LLM Creativity: 28
    Strengths
    GSPO (Group Sequence Policy Optimization) — GRPO variant, core RL post-training work
    Qwen3 co-author — direct experience on the exact model Neuralace will post-train
    Gaps
    Limited explicit work on personality post-training or non-verifiable reward optimization
    CP

    Clara Pohland

    medium hireability
    23
    RLHF / RLVR: 68
    Evals & Reward Models: 22
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Synthetic Data & Self-Play: 5
    Strengths
    BCOTrainer: created standalone trainer in huggingface/trl
    10 merged TRL PRs — BCO, KTO, MoE load balancing
    Gaps
    No synthetic data generation or self-play data pipeline work
    CW

    Cunxiang Wang

    medium hireability

    Tech Leader@ZhipuAI

    Previously: Research Intern @ Amazon

    Hangzhou, CN

    56
    Evals & Reward Models: 90
    Synthetic Data & Self-Play: 82
    RLHF / RLVR: 78
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    SPaR (2025): self-play + tree-search refinement for instruction-following data gen
    RLAR (2026): agentic multi-task RL reward system — direct RLVR hit
    Gaps
    No personality/creativity/humor-focused work — consumer post-training axis underserved
    DL

    Dacheng Li

    medium hireability

    Research Assistant@Sailing Lab

    Previously: Research Assistant @ Machine Learning, Perception, and Cognition Lab

    San Francisco, US

    46
    RLHF / RLVR: 82
    Evals & Reward Models: 78
    Synthetic Data & Self-Play: 38
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 8
    Strengths
    Sky-T1: RLVR reasoning model at O1-level within $450 academic budget
    SkyRL: full-stack modular RL library for LLM post-training
    Gaps
    Consumer post-training largely absent — no personality/creativity/roleplay work
    DD

    Damai Dai

    medium hireability

    Researcher@DeepSeek AI

    Previously: PhD student @ Peking University

    46
    RLHF / RLVR: 95
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 50
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DeepSeek-R1 co-author — defining RLVR paper for LLM reasoning (5,561 cites)
    Math-Shepherd — process reward model for step-level RL verification
    Gaps
    No personality, tone, or non-verifiable reward work
    DC

    Daniele Calandriello

    medium hireability

    Researcher@DeepMind

    Previously: Postdoc @ Università degli Studi di Genova, Istituto Italiano di Tecnologia

    50
    RLHF / RLVR: 95
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 50
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 5
    Strengths
    Nash RLHF (186 citations) — game-theoretic alternative to standard RLHF
    General paradigm for learning from human preferences (740 citations, seminal)
    Gaps
    No work on non-verifiable rewards (personality, humor, creativity post-training)
    DW

    Danqing Wang

    medium hireability

    PhD student@Meta AI (FAIR)

    Previously: Research Scientist Intern @ Meta

    Pittsburgh, US

    57
    Personality & Non-Verifiable Rewards: 75
    Evals & Reward Models: 72
    RLHF / RLVR: 68
    LLM Creativity: 52
    Synthetic Data & Self-Play: 20
    Strengths
    "Learning Personalized Alignment" (EMNLP 2024) — reward modeling for open-ended text
    Meta AI internship on personalized LLM alignment under Yuandong Tian
    Gaps
    No published DPO/PPO/GRPO work — alignment exposure more eval-focused than RL optimizer
    DC

    Daoyuan Chen

    medium hireability

    Senior Algorithm Engineer@Alibaba

    Previously: Senior Algorithm Engineer on Computer Vision @ Huawei

    Beijing, CN

    50
    RLHF / RLVR: 80
    Synthetic Data & Self-Play: 78
    Evals & Reward Models: 60
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    Trinity-RFT: general-purpose RLFT framework for LLMs — core maintainer
    Data-Juicer 2.0 (NeurIPS 2025 Spotlight) — cloud-scale SFT data pipeline
    Gaps
    No evidence of personality/tone/humor reward modeling (non-verifiable rewards)
    DW

    David Wadden

    medium hireability

    Research Scientist@DeepMind

    Previously: Research Scientist @ Allen Institute for AI

    Seattle, US

    34
    RLHF / RLVR: 65
    Evals & Reward Models: 45
    Synthetic Data & Self-Play: 30
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    Tulu 2: DPO + RLHF instruction tuning at scale (292 citations)
    Gemini post-training RS at DeepMind — production-scale RL
    Gaps
    No evidence of personality, humor, or creativity-focused reward modeling
    DG

    Daya Guo

    medium hireability

    Associate Professor@Sun Yat-sen University

    Previously: Postdoctoral Fellow @ Clemson University

    Zhuhai, CN

    56
    RLHF / RLVR: 95
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 70
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 10
    Strengths
    DeepSeek-R1 co-author — defines the RLVR paradigm (4,805 citations)
    DeepSeekMath co-author — RL for verifiable reasoning at scale
    Gaps
    No work on personality, tone, humor, or non-verifiable reward modeling
    DY

    Da Yan

    medium hireability

    Member Of Technical Staff@Anthropic

    Previously: Independent Contractor @ OpenAI

    New York, US

    27
    Personality & Non-Verifiable Rewards: 55
    RLHF / RLVR: 45
    Evals & Reward Models: 25
    LLM Creativity: 5
    Synthetic Data & Self-Play: 5
    Strengths
    Sycophancy in LLMs (582 citations, 2024) — core RLHF behavior research
    Embedded in Anthropic post-training team with Askell, Perez, Korbak
    Gaps
    Core expertise is GPU compute/compilers, not post-training
    DL

    Dayiheng Liu

    medium hireability

    Researcher@Alibaba

    Previously: Intern @ Microsoft

    Hangzhou, CN

    72
    RLHF / RLVR: 85
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 70
    Synthetic Data & Self-Play: 65
    LLM Creativity: 55
    Strengths
    WorldPM: Scaling Human Preference Modeling (2025) — direct preference RM work
    Core Qwen team (Qwen–Qwen3, QwQ-32B) — production post-training at scale
    Gaps
    Hangzhou, China — relocation barrier for US/Europe positions
    DC

    Deng Cai

    medium hireability

    Research Scientist@ByteDance

    Previously: Senior Researcher @ Tencent

    CN

    47
    Personality & Non-Verifiable Rewards: 68
    Synthetic Data & Self-Play: 65
    LLM Creativity: 45
    Evals & Reward Models: 38
    RLHF / RLVR: 20
    Strengths
    Harry Potter alignment paper — character personality via SFT (EMNLP 2023)
    'Let LLMs Find Data to Train Themselves' — self-curation synthetic data (2025)
    Gaps
    No explicit RLHF/PPO/DPO/GRPO reward modeling papers found
    DH

    Devamanyu Hazarika

    medium hireability

    Research Scientist@Meta

    Previously: Senior Applied Scientist @ Amazon

    San Francisco, US

    52
    Personality & Non-Verifiable Rewards: 85
    RLHF / RLVR: 70
    Evals & Reward Models: 50
    Synthetic Data & Self-Play: 35
    LLM Creativity: 20
    Strengths
    "Do LLMs Recognize Your Preferences?" ICLR 2025 Oral — LLM personalization
    Co-led Amazon AGI model alignment team; core dev Amazon Nova
    Gaps
    No RLVR / tool-call or agentic post-training evidence found
    DY

    Dian Yu

    medium hireability

    Senior Researcher@Tencent

    Previously: Research intern @ Bosch

    Seattle, US

    67
    Synthetic Data & Self-Play: 88
    RLHF / RLVR: 82
    Evals & Reward Models: 70
    Personality & Non-Verifiable Rewards: 58
    LLM Creativity: 35
    Strengths
    '1B Personas' — persona-conditioned synthetic data at massive scale (187 citations)
    'Crossing the Reward Bridge' — RLVR across verifiable domains (ACL 2026)
    Gaps
    No direct work on personality/humor/tone reward modeling — creativity axis is weak
    DH

    Donghai Hong

    medium hireability

    MS student@Peking University

    45
    RLHF / RLVR: 80
    Evals & Reward Models: 60
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 5
    Strengths
    Align Anything (2024): multimodal RLHF framework, 20 citations
    Safe RLHF-V (2025): safety-aligned RLHF for vision-language models
    Gaps
    No personality/humor/creativity work — consumer post-training axis weak
    DL

    Dongrui Liu

    medium hireability

    Research Scientist@Shanghai AI Lab

    Previously: IC Design Internship @ MediaTek

    Shanghai, CN

    40
    RLHF / RLVR: 82
    Personality & Non-Verifiable Rewards: 55
    Evals & Reward Models: 42
    Synthetic Data & Self-Play: 15
    LLM Creativity: 5
    Strengths
    ExGRPO (ICLR 2026) — GRPO variant for LLM RL post-training, core topic
    Entropy regularization + conditional advantage in RLVR (2 papers, 2025)
    Gaps
    No synthetic data generation or self-play pipeline work found
    DZ

    Dongzhan Zhou

    medium hireability

    Researcher@Shanghai Artificial Intelligence Laboratory

    Previously: PhD student @ The University of Sydney

    Shanghai, CN

    32
    RLHF / RLVR: 65
    Evals & Reward Models: 62
    Synthetic Data & Self-Play: 20
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    SophiaVL-R1: RLVR for MLLMs with thinking reward (2025)
    LLaMA-Berry (NAACL 2025): Pairwise Preference Reward Model + MCTS
    Gaps
    Primary focus is AI for Science — not general-purpose LLM post-training
    DT

    Duyu Tang

    medium hireability

    Researcher@Huawei

    Previously: Principal Researcher @ Tencent

    Beijing, CN

    39
    RLHF / RLVR: 72
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 38
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 8
    Strengths
    ToolACE (2025): state-of-the-art LLM function calling, 73 citations
    "Is PRM Necessary?" (2025): RLVR directly induces reward-model capability
    Gaps
    No consumer post-training work: personality, tone, humor, or creative writing
    DP

    Duy Van Phung

    medium hireability

    Researcher@Intelligent Internet

    Previously: Researcher @ SynthLabs

    54
    RLHF / RLVR: 90
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 10
    Strengths
    trlX: RLHF distributed training framework, lead contributor (4.7K stars)
    Generative Reward Models (2025) — trained reward models from scratch
    Gaps
    No personality/tone/creativity-specific post-training work found
    EB

    Edward Emanuel Beeching

    medium hireability

    Research Scientist@Hugging Face

    Previously: Research And Development Intern: Deep Reinforcement Learning @ Ubisoft LaForge

    Lyon, FR

    61
    RLHF / RLVR: 92
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 72
    Personality & Non-Verifiable Rewards: 38
    LLM Creativity: 20
    Strengths
    huggingface/trl — core contributor to the canonical LLM RL training library
    Zephyr + Alignment Handbook — DPO/SFT alignment recipes widely adopted
    Gaps
    No specific work on personality, humor, tone, or non-verifiable reward shaping
    EZ

    Enyu Zhou

    medium hireability

    PhD student@Fudan University

    Previously: Research Intern @ China Qijizhifeng Ltd.Co

    Shanghai, CN

    48
    Evals & Reward Models: 90
    RLHF / RLVR: 88
    Personality & Non-Verifiable Rewards: 38
    Synthetic Data & Self-Play: 20
    LLM Creativity: 5
    Strengths
    "Secrets of RLHF Part II: Reward Modeling" — 136 cites, core RLHF reference
    RMB (ICLR 2025) — comprehensive reward model benchmarking
    Gaps
    No personality, humor, or creativity-specific reward modeling work
    EH

    Eric Hambro

    medium hireability

    Member of Technical Staff@Anthropic

    Previously: Research Engineer @ Meta

    London, GB

    56
    RLHF / RLVR: 88
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 20
    Strengths
    "Teaching LLMs to Reason with RL" (Anthropic 2024) — core RLHF/RLVR paper
    "Understanding RLHF Effects on LLM Generalisation" (2023) — reward model analysis
    Gaps
    No explicit personality, tone, or humor reward modeling papers
    ET

    Eric Tang

    medium hireability

    Software Engineer - LLMs@Anyscale

    Previously: Research Engineering Intern @ DeepMind

    San Francisco, US

    37
    RLHF / RLVR: 82
    Evals & Reward Models: 50
    Synthetic Data & Self-Play: 38
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    180 PRs to NovaSky-AI/SkyRL — core RLVR framework contributor
    DAPO on Qwen3.5-35B-A3B — exact model class Neuralace is post-training
    Gaps
    No personality/RLAIF/non-verifiable reward work — pure RLVR focus
    EH

    Ermo Hua

    medium hireability

    PhD student@Tsinghua University

    47
    RLHF / RLVR: 82
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 22
    LLM Creativity: 10
    Strengths
    TTRL: test-time RL on unlabeled data; 211% Qwen-2.5-Math improvement on AIME 2024
    CoGenesis (ACL 2024): small+large LLM routing — exact match for tool-call routing vision
    Gaps
    No published work on personality, humor, or non-verifiable reward design
    ED

    Esin DURMUS

    medium hireability

    Research Scientist@Anthropic

    Previously: Postdoctoral Scholar @ Stanford University

    San Francisco, US

    51
    Personality & Non-Verifiable Rewards: 90
    Evals & Reward Models: 88
    RLHF / RLVR: 55
    Synthetic Data & Self-Play: 12
    LLM Creativity: 10
    Strengths
    "Sycophancy in LLMs" (476 cites) — non-verifiable reward signal research
    "Collective Constitutional AI" — RLAIF for public values alignment
    Gaps
    Focus is values/safety evals, not RL training mechanics (PPO/DPO/GRPO)
    EF

    Evan Frick

    medium hireability

    Member of Technical Staff@LMArena

    Previously: Research Engineer @ Nexusflow

    65
    Evals & Reward Models: 92
    RLHF / RLVR: 88
    Synthetic Data & Self-Play: 72
    Personality & Non-Verifiable Rewards: 65
    LLM Creativity: 10
    Strengths
    Starling-7B RLAIF post-training — helpfulness/harmlessness, 140 citations
    Nectar: large-scale AI-feedback preference dataset for reward modeling
    Gaps
    No evidence of persona-conditioned synthetic data or self-play pipelines
    FB

    Faeze Brahman

    medium hireability

    Research Scientist@Allen Institute for AI

    Previously: Research Intern @ Microsoft

    Seattle, US

    77
    LLM Creativity: 88
    Evals & Reward Models: 82
    Personality & Non-Verifiable Rewards: 80
    RLHF / RLVR: 78
    Synthetic Data & Self-Play: 58
    Strengths
    Tulu 3 co-author — full SFT+DPO+RLVR post-training pipeline
    Trust or Escalate: ICLR 2025 oral, LLM judges for eval
    Gaps
    No explicit persona-conditioned self-play or rollout ranking work
    FZ

    Fan Zhou

    medium hireability

    1st year PhD student@Shanghai Jiao Tong University

    Previously: MS student @ Shanghai Jiao Tong University

    Shanghai, CN

    52
    Synthetic Data & Self-Play: 82
    RLHF / RLVR: 78
    Evals & Reward Models: 60
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 12
    Strengths
    Qwen3-Coder + Qwen3.5 contributor — direct target-model family experience
    OctoThinker (ICML 2025) — RL scaling via mid-training, rigorous ablations
    Gaps
    RLVR work focused on verifiable domains (math, code) — consumer personality axis thin
    FM

    Fei Mi

    medium hireability

    Principal Research Scientist@Huawei

    Previously: Sr Director, Sys Arch @ Huawei

    Shenzhen, CN

    55
    Synthetic Data & Self-Play: 72
    Evals & Reward Models: 70
    RLHF / RLVR: 52
    Personality & Non-Verifiable Rewards: 50
    LLM Creativity: 32
    Strengths
    'A Synthetic Data Generation Framework for Grounded Dialogues' — ACL 2023, direct match
    'One Cannot Stand for Everyone!' — user simulator training, persona-conditioned data gen
    Gaps
    No explicit PPO/DPO/GRPO work; alignment approach is SFT-based (mistake analysis), not RL-based
    FF

    Felipe Vieira Frujeri

    medium hireability

    AI Researcher@NVIDIA

    Previously: Staff AI Researcher @ Vatic Labs

    Seattle, US

    44
    RLHF / RLVR: 80
    Evals & Reward Models: 62
    Personality & Non-Verifiable Rewards: 50
    Synthetic Data & Self-Play: 20
    LLM Creativity: 10
    Strengths
    APA paper: Advantage-Induced Policy Alignment (2024, 48 cit.) — RLHF post-training
    RLHF/RLAIF alignment on OpenAI core models at Microsoft Azure AI
    Gaps
    No synthetic data / self-play pipeline work evident
    FX

    Frank F. Xu

    medium hireability

    Member of Technical Staff@Microsoft

    Previously: Graduate Research Assistant @ Carnegie Mellon University

    San Francisco, US

    24
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 35
    RLHF / RLVR: 10
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    WebArena (ICLR 2024) — verifiable web-agent eval benchmark
    TheAgentCompany — real-world agentic task benchmark for LLMs
    Gaps
    No RLHF/PPO/DPO/GRPO post-training work published
    GL

    Gang Li

    medium hireability

    Research Scientist@Orby AI

    Previously: Senior Software Engineer @ DeepMind

    San Francisco, US

    29
    RLHF / RLVR: 82
    Evals & Reward Models: 40
    Synthetic Data & Self-Play: 15
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DRPO (ICLR 2026): GRPO variant, decoupled reward policy optimization for LLMs
    DisCO (NeurIPS 2025): verifiable-reward RL for LLM reasoning
    Gaps
    No evidence of personality, tone, or non-verifiable reward modeling work
    GC

    Ganqu Cui

    medium hireability

    Research Scientist@Shanghai AI Laboratory

    Previously: PhD student @ Tsinghua University

    Shanghai, CN

    60
    RLHF / RLVR: 95
    Evals & Reward Models: 88
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 10
    Strengths
    ULTRAFEEDBACK (664 cites) — foundational AI preference data generation
    OpenRLHF contributor — production RLHF training framework
    Gaps
    No explicit work on personality, humor, or creative writing tasks
    GC

    Geoffrey Cideron

    medium hireability

    Research Engineer@Google

    Previously: Research intern @ Meta

    Paris, FR

    60
    RLHF / RLVR: 88
    Evals & Reward Models: 82
    Personality & Non-Verifiable Rewards: 75
    LLM Creativity: 35
    Synthetic Data & Self-Play: 18
    Strengths
    WARM (ICML 2024, 103 cites): reward model weight averaging, anti-reward-hacking
    BOND (2025): Best-of-N distillation for LLM alignment — RLHF at Google scale
    Gaps
    No published work on synthetic data pipelines or self-play for LLMs
    GZ

    Ge Zhang

    medium hireability

    Principal Product Manager - AI@ByteDance

    Previously: Sr. Product Manager - Core AI @ eBay

    San Francisco, US

    65
    Evals & Reward Models: 85
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 58
    LLM Creativity: 52
    Personality & Non-Verifiable Rewards: 50
    Strengths
    ReTool (147 citations): RL for when/how LLMs call tools — core JD skill
    VideoScore: reward model for non-verifiable human feedback on video generation
    Gaps
    Limited direct work on personality/humor/tone alignment (COIG-P is adjacent)
    GH

    Guanhua Huang

    medium hireability

    Algorithm Engineer@Tencent

    Previously: Research Intern @ Tencent Hunyuan

    Beijing, CN

    30
    RLHF / RLVR: 75
    Evals & Reward Models: 40
    Synthetic Data & Self-Play: 20
    Personality & Non-Verifiable Rewards: 10
    LLM Creativity: 5
    Strengths
    "Low-probability Tokens in RLVR" — core RLVR exploration paper (2025)
    "RL on Pre-Training Data" — RL-guided training data selection (2025)
    Gaps
    No work on personality, humor, or non-verifiable reward modeling
    HL

    Haipeng Luo

    medium hireability

    Intern@Tencent

    Previously: Intern @ Microsoft


    50
    Synthetic Data & Self-Play: 80
    RLHF / RLVR: 75
    Evals & Reward Models: 70
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 5
    Strengths
    WizardMath first author — RLVR for math reasoning, 565 citations
    Arena Learning — post-training data flywheel via simulated self-play arena
    Gaps
    No work on personality, tone, creativity, or consumer LLM directions
    HM

    Haitao Mi

    medium hireability

    Head of Language Intelligence Research Group, Tencent AI Lab@Tencent

    Previously: Staff Engineer @ Ant Financial

    US

    62
    RLHF / RLVR: 90
    Synthetic Data & Self-Play: 88
    Evals & Reward Models: 72
    Personality & Non-Verifiable Rewards: 42
    LLM Creativity: 18
    Strengths
    'Crossing the Reward Bridge' — verifiable-reward RL across domains (2025)
    'Scaling Synthetic Data with 1B Personas' — persona SFT pipelines at scale
    Gaps
    No direct work on personality, humor, tone, or non-verifiable reward shaping
    HA

    Haitham Bou Ammar

    medium hireability

    Senior Principal Scientist - ML Tech Leader@Noah's Ark Lab

    Previously: Technology Expert & Advisor @ Sanome

    Cambridge, GB

    45
    RLHF / RLVR: 82
    Evals & Reward Models: 75
    Personality & Non-Verifiable Rewards: 35
    Synthetic Data & Self-Play: 25
    LLM Creativity: 10
    Strengths
    Group Robust Preference Optimization (reward-free RLHF, 2024) — direct axis hit
    Bayesian Reward Models for LLM Alignment — reward model architecture expertise
    Gaps
    No evidence of personality/tone/creativity reward modeling work
    HI

    Hamish Ivison

    medium hireability

    PhD student@University of Washington

    Previously: Predoctoral Young Investigator @ Allen Institute for AI

    Seattle, US

    65
    RLHF / RLVR95
    Evals & Reward Models78
    Synthetic Data & Self-Play70
    Personality & Non-Verifiable Rewards55
    LLM Creativity25
    Strengths
    328 commits to allenai/open-instruct — core Tulu RLVR+DPO pipeline builder
    Tulu 3 (496 cites) — flagship open post-training paper, RLVR + SFT
    Gaps
    No explicit work on personality adherence, humor, or creative writing
    HZ

    Hang Zhang

    medium hireability

    Researcher@Alibaba

    Previously: PhD student @ Sichuan University

    CN

    13
    Evals & Reward Models25
    Synthetic Data & Self-Play20
    RLHF / RLVR10
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    Qwen2.5-VL co-author — PDF/doc/PPT understanding (knowledge post-training use case)
    VideoLLaMA: instruction-tuned multimodal LLM — applied SFT post-training at scale
    Gaps
    No RLHF, DPO, PPO, or GRPO post-training papers
    HD

    Hanjun Dai

    medium hireability

    Researcher@Precur AI

    Previously: Research Manager @ Google

    San Francisco, US

    41
    RLHF / RLVR85
    Evals & Reward Models60
    Synthetic Data & Self-Play30
    Personality & Non-Verifiable Rewards20
    LLM Creativity10
    Strengths
    'Value-Incentivized Preference Optimization' — unified online/offline RLHF (68 citations)
    Matryoshka Pilot — small LLM orchestrating large LLM (directly mirrors JD tool-call vision)
    Gaps
    No consumer/personality post-training work — focus on verifiable/structured rewards
    HC

    Hao Cheng

    medium hireability

    Data & Applied Scientist II@Microsoft

    Previously: Senior Data Scientist @ Johnson & Johnson

    New York, US

    60
    RLHF / RLVR82
    Evals & Reward Models70
    Synthetic Data & Self-Play65
    Personality & Non-Verifiable Rewards55
    LLM Creativity30
    Strengths
    RL for reasoning in LLMs with one training example — NeurIPS 2025, direct RLVR
    CollabLLM ICML 2025 oral — agentic post-training, active collaborator design
    Gaps
    No explicit DPO/PPO/GRPO preference optimization papers found
    HY

    Hao Yu

    medium hireability

    PhD student@THU

    Beijing, CN

    33
    Evals & Reward Models75
    RLHF / RLVR55
    Synthetic Data & Self-Play28
    Personality & Non-Verifiable Rewards5
    LLM Creativity3
    Strengths
    UI-TARS-2: multi-turn RLVR applied to GUI computer-use agents
    AgentBench (ICLR 2024, 795 citations) — top-tier agent eval benchmark
    Gaps
    No consumer post-training work — personality, humor, creativity entirely absent
    HG

    Hongyi Guo

    medium hireability

    Research Scientist@ByteDance

    Previously: Research Intern @ ByteDance

    San Francisco, US

    51
    RLHF / RLVR92
    Synthetic Data & Self-Play72
    Evals & Reward Models65
    Personality & Non-Verifiable Rewards20
    LLM Creativity8
    Strengths
    "Provably Mitigating Overoptimization in RLHF" — NeurIPS 2024, 86 citations
    BRiTE (ICML 2025): RLVR for bootstrapped thinking/reasoning
    Gaps
    No personality/creativity/non-verifiable reward work — consumer post-training gap
    HL

    Hsien-chin Lin

    medium hireability

    Postdoctoral Researcher@Heinrich Heine University Düsseldorf

    Previously: PhD student @ Heinrich Heine University Düsseldorf

    Düsseldorf, DE

    45
    Synthetic Data & Self-Play70
    Personality & Non-Verifiable Rewards65
    RLHF / RLVR50
    Evals & Reward Models35
    LLM Creativity5
    Strengths
    2025 paper on post-training LLMs via RL self-feedback — core JD topic
    RLSF (2024): RL from self-feedback applied to LLM reasoning
    Gaps
No PPO/DPO/GRPO work — RL experience is at dialogue-policy scale, not yet at LLM post-training scale
    IO

    Ian Osband

    medium hireability

    Research Scientist@Google

    Previously: Member of Technical Staff @ OpenAI

    London, GB

    33
    RLHF / RLVR65
    Evals & Reward Models60
    Personality & Non-Verifiable Rewards20
    Synthetic Data & Self-Play15
    LLM Creativity5
    Strengths
    GPT-4o + o1 system cards — direct OpenAI post-training involvement
    ChatGPT data flywheel — applied RLHF/post-training pipeline at OpenAI
    Gaps
    Primary published research is exploration/uncertainty, not LLM post-training specifically
    IK

    Ilia Kulikov

    medium hireability

    Research Scientist@Meta

    Previously: Research Assistant @ Courant Institute of Mathematical Sciences

    New York, US

    62
    Synthetic Data & Self-Play88
    RLHF / RLVR85
    Evals & Reward Models85
    Personality & Non-Verifiable Rewards30
    LLM Creativity22
    Strengths
    Self-Taught Evaluators (2025): reward models from unlabeled data, outperforms GPT-4
    Diverse Preference Optimization (2025): novel DPO variant for RLHF
    Gaps
    No direct personality, humor, or creativity reward modeling work
    JH

    Jacob Hilton

    medium hireability

    executive director@Alignment Research Center

    Previously: Researcher @ OpenAI

    62
    RLHF / RLVR97
    Evals & Reward Models90
    Personality & Non-Verifiable Rewards68
    Synthetic Data & Self-Play32
    LLM Creativity22
    Strengths
    InstructGPT co-author (18K citations) — foundational RLHF pioneer
    Scaling Laws for Reward Model Overoptimization — core reward model research
    Gaps
    No explicit personality/creativity post-training work
    JW

    Jeffrey Wu

    medium hireability

    PhD Student@Anthropic AI, OpenAI

    Previously: Undergraduate Researcher @ Berkeley Artificial Intelligence Research

    New York, US

    61
    RLHF / RLVR97
    Evals & Reward Models82
    Personality & Non-Verifiable Rewards78
    Synthetic Data & Self-Play25
    LLM Creativity22
    Strengths
    InstructGPT (2022) co-author — defined modern PPO-RLHF paradigm
    'Learning to summarize with human feedback' (2020) — foundational RLHF paper
    Gaps
    No published work on synthetic data generation or self-play pipelines
    JC

    Jiacheng Chen

    medium hireability

    PhD student@The Chinese University of Hong Kong

    Previously: Visiting Researcher @ Caltech

    Hong Kong, HK

    29
    RLHF / RLVR75
    Evals & Reward Models42
    Synthetic Data & Self-Play18
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    'Entropy Mechanism of RL for Reasoning LMs' — core RLVR theory paper
    P1: Physics Olympiad RL — complex verifiable-reward training
    Gaps
    No work on personality, non-verifiable rewards, or creative LLM outputs
    JC

    Jiale Cheng

    medium hireability

    PhD Student@University of Michigan, Ann Arbor

    Ann Arbor, US

    73
    Evals & Reward Models90
    Synthetic Data & Self-Play85
    RLHF / RLVR82
    Personality & Non-Verifiable Rewards68
    LLM Creativity42
    Strengths
    SPaR: self-play + tree-search refinement for instruction following (2024)
    VisionReward: multi-dim preference learning — direct reward modeling work
    Gaps
    No direct work on tool-call / agentic / computer-use post-training
    JJ

    Jiaming Ji

    medium hireability

    PhD Student@Peking University

    Beijing, CN

    49
    RLHF / RLVR90
    Evals & Reward Models78
    Synthetic Data & Self-Play40
    Personality & Non-Verifiable Rewards30
    LLM Creativity8
    Strengths
    Safe RLHF paper (521 citations) — ICLR, reward-constrained value alignment
    BeaverTails (662 citations) — large-scale human-preference dataset for RLHF
    Gaps
    No personality, humor, or creativity post-training work
    JY

    Jianxin Yang

    medium hireability

    Principal Researcher@Alibaba

    Previously: NLP Algorithm Engineer @ Tencent

    Hangzhou, CN

    45
    RLHF / RLVR80
    Synthetic Data & Self-Play55
    Evals & Reward Models40
    LLM Creativity25
    Personality & Non-Verifiable Rewards25
    Strengths
    Firefly: full SFT+DPO training framework for Qwen2.5 and 30+ LLMs
    NeurIPS 2025 RLVR paper on high-entropy token selection strategy
    Gaps
    No explicit work on non-verifiable rewards, personality modeling, or creative writing
    JZ

    Jiayi Zhou

    medium hireability

    PhD student@Peking University

    Previously: Researcher @ Peking University

    47
    RLHF / RLVR85
    Evals & Reward Models82
    Personality & Non-Verifiable Rewards38
    Synthetic Data & Self-Play22
    LLM Creativity10
    Strengths
    Seq2Seq Reward Modeling (AAAI 2025 Oral) — language feedback reward models
    56 commits to PKU-Alignment/align-anything — RLHF training infrastructure
    Gaps
    No consumer post-training work — no personality, humor, or creativity modeling
    JF

    Jiazhan Feng

    medium hireability

    Research Scientist@ByteDance

    Previously: Research Intern @ Microsoft

    Oxford, GB

    64
    RLHF / RLVR80
    LLM Creativity65
    Synthetic Data & Self-Play65
    Evals & Reward Models60
    Personality & Non-Verifiable Rewards50
    Strengths
    ReTool: RLVR for tool use on Qwen2.5-32B — exact JD fit
    UI-TARS-2: multi-turn RL for GUI/computer use
    Gaps
Primary tool-use RL work targets math/code reasoning — consumer personality RL is less evidenced
    JF

    Jie Fu

    medium hireability

    Research Scientist@Shanghai AI Lab

    Previously: Visiting Scholar @ The Hong Kong University of Science and Technology

    Shanghai, CN

    62
    Evals & Reward Models80
    Personality & Non-Verifiable Rewards75
    LLM Creativity60
    RLHF / RLVR50
    Synthetic Data & Self-Play45
    Strengths
    RoleLLM (365 citations) — benchmarks & elicits role-playing, personality adherence
    ChatEval (783 citations) — multi-agent LLM evaluator framework (evals axis)
    Gaps
No dedicated PPO/DPO/GRPO post-training paper — RLVR coverage is indirect, via survey work
    JM

    Jincheng Mei

    medium hireability

    senior research scientist@DeepMind

    Previously: research scientist @ Google

    London, GB

    26
    RLHF / RLVR78
    Evals & Reward Models30
    Personality & Non-Verifiable Rewards10
    Synthetic Data & Self-Play8
    LLM Creativity5
    Strengths
    VPO (ICLR 2025): unified online+offline RLHF via value-incentivized preference opt
    Faster WIND (AISTATS 2025): accelerated iterative BoN distillation for alignment
    Gaps
    No work on personality, humor, or subjective/non-verifiable reward modeling
    JS

    Jing Shao

    medium hireability

    Young Research Scientist, Group Leader@Shanghai AI Laboratory

    Previously: Research Director @ SenseTime

    CN

    57
    RLHF / RLVR85
    Evals & Reward Models80
    Personality & Non-Verifiable Rewards65
    Synthetic Data & Self-Play45
    LLM Creativity10
    Strengths
    HarmRLVR (2025) — RLVR with verifiable rewards for LLM safety alignment
    Multi-Objective DPO (2024, 70 citations) — multi-preference optimization
    Gaps
    Safety/harmlessness focus — limited work on consumer personality or creativity
    JF

    Johan Ferret

    medium hireability

    Research Scientist@DeepMind

    Previously: PhD Candidate @ Inria

    Paris, FR

    63
    RLHF / RLVR92
    Evals & Reward Models78
    Synthetic Data & Self-Play60
    Personality & Non-Verifiable Rewards58
    LLM Creativity25
    Strengths
    RLAIF (958 citations) — scaling RLHF with AI feedback, first key contributor
    WARM: reward model averaging reduces reward hacking (115 citations)
    Gaps
    No explicit personality / humor / creativity-focused post-training work
    JH

    Johannes Heidecke

    medium hireability

    Research Engineer@OpenAI

    Previously: AI Safety Analyst @ OpenAI

    Barcelona, ES

    57
    RLHF / RLVR85
    Evals & Reward Models78
    Personality & Non-Verifiable Rewards70
    Synthetic Data & Self-Play30
    LLM Creativity22
    Strengths
    Rule-Based Rewards for LM Safety (2024, 71 cites) — reward modeling for post-training alignment
    Diverse Red Teaming with Auto-generated Rewards + Multi-step RL — RL reward engineering
    Gaps
    No explicit synthetic conversation dataset or self-play pipeline work found
    JC

    Jon Ander Campos

    medium hireability

    Staff Member of Technical Staff@Cohere

    Previously: Senior Member of Technical Staff @ Cohere

    San Francisco, US

    61
    RLHF / RLVR82
    Evals & Reward Models75
    Personality & Non-Verifiable Rewards72
    Synthetic Data & Self-Play48
    LLM Creativity28
    Strengths
    Post-Training Lead at Cohere — production post-training role
    "Learning from Natural Language Feedback" (214 cit) — core RLHF work
    Gaps
    No explicit self-play or persona-conditioned data generation work
    JT

    Jonathan Tow

    medium hireability
    61
    RLHF / RLVR85
    Synthetic Data & Self-Play75
    Personality & Non-Verifiable Rewards65
    Evals & Reward Models45
    LLM Creativity35
    Strengths
    51 commits to CarperAI/trlx — 2nd contributor, PPO/ILQL RLHF practitioner
    2nd author on StableLM 2 1.6B tech report — core Stability AI researcher
    Gaps
    Location unconfirmed — no geography signal on GitHub or website
    JG

    Jules Gagnon-Marchand

    medium hireability
    41
    RLHF / RLVR72
    Synthetic Data & Self-Play60
    Evals & Reward Models58
    Personality & Non-Verifiable Rewards8
    LLM Creativity5
    Strengths
    Marg-Li-CoT: active RLVR + rejection sampling research repo (2024–2025)
    Multi-GPU RLVR training (ray train, 8 GPUs, Slurm) — production infra
    Gaps
No published papers on RLVR/CoT — research unpublished as of qualification date
    JM

    Julian Michael

    medium hireability

    Researcher on AI safety, evaluation, and alignment@Meta

    Previously: Head of the Safety, Evaluations, and Alignment Lab (SEAL) @ Scale AI

    22
    Evals & Reward Models72
    Personality & Non-Verifiable Rewards15
    RLHF / RLVR12
    LLM Creativity5
    Synthetic Data & Self-Play5
    Strengths
    GPQA benchmark (1.5K citations) — defines grad-level eval standard
    SuperGLUE co-creator (3K citations) — foundational NLU evals
    Gaps
    No RLHF/RLVR training work — purely evaluator, not trainer
    JD

    Juntao Dai

    medium hireability

    Researcher@Peking University

    Previously: PhD student @ Zhejiang University

    Hangzhou, CN

    51
    RLHF / RLVR85
    Evals & Reward Models74
    Synthetic Data & Self-Play58
    Personality & Non-Verifiable Rewards35
    LLM Creativity5
    Strengths
    Safe RLHF paper (538 cites) — constrained RLHF framework co-author
    BeaverTails (665 cites) — human-preference dataset at scale
    Gaps
    Safety-focused alignment — personality/creativity/tone work minimal
    KW

    Kaile Wang

    medium hireability

    Undergrad student@Peking University

    Beijing, CN

    38
    RLHF / RLVR80
    Evals & Reward Models60
    Synthetic Data & Self-Play25
    Personality & Non-Verifiable Rewards20
    LLM Creativity5
    Strengths
    Merged PR: RLOO/REINFORCE/GROUP_NORM in PPO for align-anything (Mar 2025)
    lmm-r1: OpenRLHF extension for multimodal DeepSeek-R1 — direct RLVR work
    Gaps
    No evidence of personality/humor/non-verifiable reward work (consumer post-training axis)
    KS

    Kaitao Song

    medium hireability

    Senior Researcher@Microsoft

    Previously: Researcher @ Microsoft

    Shanghai, CN

    47
    RLHF / RLVR65
    Evals & Reward Models65
    Personality & Non-Verifiable Rewards50
    LLM Creativity35
    Synthetic Data & Self-Play20
    Strengths
    HuggingGPT (1,623 cites) — orchestrating model calls from smaller front-end LLM
    Conditional Reward Modeling for LLM Reasoning (2025) — RLVR post-training
    Gaps
    No large-scale SFT / self-play data pipeline evidence
    KN

    Kamal Ndousse

    medium hireability

    Member Of Technical Staff@Anthropic

    Previously: Member of Technical Staff @ Stealth Co

    San Francisco, US

    64
    RLHF / RLVR95
    Personality & Non-Verifiable Rewards85
    Evals & Reward Models75
    Synthetic Data & Self-Play35
    LLM Creativity30
    Strengths
    HH-RLHF co-author (2022, 3K citations) — defining RLHF post-training paper
    Constitutional AI co-author — RLAIF for harmlessness, non-verifiable reward shaping
    Gaps
    Limited synthetic data / self-play pipeline experience in public record
    KH

    Kourosh Hakhamaneshi

    medium hireability

    Team Lead (AI)@Anyscale

    Previously: Software Engineer (RL and ML) @ Anyscale

    San Francisco, US

    28
    RLHF / RLVR68
    Evals & Reward Models32
    Synthetic Data & Self-Play25
    Personality & Non-Verifiable Rewards10
    LLM Creativity5
    Strengths
    SkyRL-v0: RLVR for long-horizon LLM agents at Anyscale (2025)
    "LLMs Learn to Reason from Demonstrations" — 58 citations, 2025
    Gaps
    No work on personality, tone, humor, or non-verifiable reward modeling
    KZ

    Kunlun Zhu

    medium hireability

    Graduate Student@University of Illinois Urbana-Champaign

    Previously: Research Assistant @ Tsinghua University

    US

    35
    RLHF / RLVR68
    Evals & Reward Models65
    Synthetic Data & Self-Play30
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    ToolLLM (ICLR 2024, 984 cit.) — core tool-use post-training benchmark
    OpenManus-RL — live RL tuning project for LLM agents (pinned repo)
    Gaps
    No work on personality/creativity or non-verifiable reward modeling
    LS

    Lei Shu

    medium hireability

    Staff Research Scientist@DeepMind

    Previously: Senior Research Scientist @ DeepMind

    Seattle, US

    35
    RLHF / RLVR65
    Evals & Reward Models65
    Personality & Non-Verifiable Rewards20
    LLM Creativity15
    Synthetic Data & Self-Play10
    Strengths
    'Automated Process Supervision' (ICLR 2025) — verifiable reward / process RM
    'Critique Ability of LLMs' (ICLR 2024) — LLM eval and reward modeling
    Gaps
    No work on personality, humor, or non-verifiable reward shaping
    LG

    Leo Gao

    medium hireability

    Researcher@OpenAI

    Previously: Researcher @ EleutherAI

    56
    Evals & Reward Models90
    RLHF / RLVR80
    Synthetic Data & Self-Play40
    LLM Creativity35
    Personality & Non-Verifiable Rewards35
    Strengths
    Creator of lm-evaluation-harness — the dominant LLM eval framework
    Scaling Laws for Reward Model Overoptimization — core RM scaling research
    Gaps
    No direct RLHF fine-tuning or PPO/DPO training pipeline work visible
    LE

    Leon Ericsson

    medium hireability
    40
    RLHF / RLVR78
    Evals & Reward Models48
    Synthetic Data & Self-Play42
    Personality & Non-Verifiable Rewards22
    LLM Creativity10
    Strengths
    29 commits to huggingface/trl — active RL post-training contributor
    Feb 2026 blog: geometric view of OPSM/PPO off-policy masking — original technical work
    Gaps
    No personality/creative training work — purely technical RL focus
    LL

    Lewei Lu

    medium hireability

    Senior Research Director@SenseTime

    Previously: Senior Researcher @ SenseTime

    Beijing, CN

    50
    RLHF / RLVR82
    Evals & Reward Models78
    Synthetic Data & Self-Play52
    Personality & Non-Verifiable Rewards22
    LLM Creativity15
    Strengths
    VisualPRM (2025): process reward model for multimodal reasoning — core RLVR signal
    Mixed Preference Optimization (2024, 139 cites): DPO-style post-training for MLLMs
    Gaps
    No personality/creativity reward work — focus is on verifiable reasoning rewards
    LT

    Lewis Tunstall

    medium hireability

    Machine Learning Engineer@Hugging Face

    Previously: Senior Data Scientist @ Swisscom

    Bern, CH

    57
    RLHF / RLVR97
    Evals & Reward Models72
    Synthetic Data & Self-Play55
    Personality & Non-Verifiable Rewards40
    LLM Creativity22
    Strengths
    TRL library creator (480 citations) — de facto RLHF/DPO/GRPO toolchain
    Open-R1: led HF's DeepSeek-R1 RLVR reproduction with GRPO
    Gaps
    Limited work on consumer post-training (personality, humor, creativity, non-VR rewards)
    LS

    Linxin Song

    medium hireability

    Ph.D. Student@University of Southern California

    Previously: Research Intern @ Salesforce Research

    Los Angeles, US

    53
    RLHF / RLVR72
    Evals & Reward Models70
    Synthetic Data & Self-Play68
    Personality & Non-Verifiable Rewards35
    LLM Creativity18
    Strengths
    ExeVRM: reward modeling for computer-use agents — exact RLVR fit
    Efficient RL Finetuning via Adaptive Curriculum (32 citations, Apr 2025)
    Gaps
    No creative writing / roleplay / humor work — weak on consumer personality axis
    LD

    Lisa Dunlap

    medium hireability

    PhD student@UC Berkeley

    Previously: core contributor @ Chatbot Arena

    28
    Evals & Reward Models70
    Personality & Non-Verifiable Rewards50
    LLM Creativity10
    RLHF / RLVR5
    Synthetic Data & Self-Play5
    Strengths
    VibeCheck (ICLR 2025): measures qualitative LLM traits like tone, humor, personality
    VisionArena: 230K VLM conversations with human preference labels
    Gaps
    No post-training / RLHF / DPO experience — eval-only researcher
    LC

    Longze Chen

    medium hireability

    PhD student@Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

    Previously: Undergrad student @ Shandong University

    Shenzhen, CN

    56
    RLHF / RLVR85
    Evals & Reward Models75
    Synthetic Data & Self-Play72
    LLM Creativity30
    Personality & Non-Verifiable Rewards20
    Strengths
    "Implicit Actor Critic Coupling for RLVR" — PACS framework, +8-9% on math benchmarks
    "Learning Ordinal Probabilistic Reward from Preferences" (2026) — novel reward modeling
    Gaps
Personality/non-verifiable reward work absent — persona work targets math, not conversation
    LA

    Loubna Ben Allal

    medium hireability

    Research Engineer@Hugging Face

    Previously: Member of the core team @ BigCode

    Paris, FR

    27
    Evals & Reward Models55
    Synthetic Data & Self-Play50
    RLHF / RLVR20
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    SmolLM2 COLM 2025 spotlight — data-centric SFT training pipeline end-to-end
    Cosmopedia: persona-conditioned synthetic data at scale (SFT data relevance)
    Gaps
    Pre-training focused, not a post-training RLHF/RLVR specialist
    MF

    Marzieh Fadaee

    medium hireability

    Head of Cohere Labs@Cohere

    Previously: Staff Research Scientist @ Cohere

    Amsterdam, NL

    74
    RLHF / RLVR92
    Evals & Reward Models88
    Synthetic Data & Self-Play88
    Personality & Non-Verifiable Rewards72
    LLM Creativity30
    Strengths
    "Back to Basics" REINFORCE/RLHF paper — 431 citations, direct RLHF methodology
    Leads Cohere Labs — runs post-training research across instruction tuning and alignment
    Gaps
    LLM creativity/roleplay/humor — no published work in this direction
    MN

    Matvei Novikov

    medium hireability

    Senior Deep Learning Software Engineer@NVIDIA

    Previously: Senior Deep Learning Software Engineer @ NVIDIA

    San Francisco, US

    44
    RLHF / RLVR70
    Synthetic Data & Self-Play70
    Evals & Reward Models55
    Personality & Non-Verifiable Rewards15
    LLM Creativity10
    Strengths
    Nemotron-CrossThink: RL + synthetic/real data self-learning across diverse domains
    Llama-Nemotron: SFT + large-scale RL post-training for efficient reasoning models
    Gaps
    No evidence of personality, tone, humor, or non-verifiable reward modeling
    M⍼

    max ⍼

    medium hireability
    57
    RLHF / RLVR90
    Personality & Non-Verifiable Rewards70
    Evals & Reward Models65
    Synthetic Data & Self-Play35
    LLM Creativity25
    Strengths
    #1 CarperAI/trlx contributor — 71 merged PRs on RLHF library
    Built RFT trainer (ReST/RFT paper) in trlx — post-training SFT pipeline
    Gaps
    No location confirmed — timezone suggests Europe but not verified
    MM

    Maximilian Mozes

    medium hireability

    Team Lead, Post-Training@Cohere

    Previously: Senior Research Scientist @ Cohere

    London, GB

    51
    RLHF / RLVR82
    Evals & Reward Models62
    Personality & Non-Verifiable Rewards50
    Synthetic Data & Self-Play35
    LLM Creativity25
    Strengths
    "Reverse Engineering Human Preferences with RL" — NeurIPS 2025 Spotlight
    Cohere Post-Training Team Lead — production post-training at scale
    Gaps
    No specific papers on personality modeling or non-verifiable reward design
    MT

    Meg Tong

    medium hireability

    Member of Technical Staff@LLM company

    Previously: Researcher @ Various research organisations

    San Francisco, US

    51
    Personality & Non-Verifiable Rewards82
    Evals & Reward Models72
    RLHF / RLVR70
    Synthetic Data & Self-Play18
    LLM Creativity15
    Strengths
    Sycophancy paper (ICLR 2024) — RLHF reward hacking, core non-verifiable rewards research
    Constitutional Classifiers (2025) — reward model for subjective safety/personality criteria
    Gaps
    No synthetic data generation or self-play pipeline experience
    ML

    Mingdao Liu

    medium hireability

    PhD student@Tsinghua University

    Beijing, CN

    42
    RLHF / RLVR65
    Synthetic Data & Self-Play65
    Evals & Reward Models50
    Personality & Non-Verifiable Rewards20
    LLM Creativity10
    Strengths
    GLM-4.5 Agentic (2025) — core author on post-trained tool-use model
    ChatGLM SFT + RLHF pipeline: multi-stage post-training at production scale
    Gaps
    No dedicated work on personality, tone, or non-verifiable reward modeling
    MW

    Minzheng Wang

    medium hireability

    Ph.D student@Institute of Automation, Chinese Academy of Sciences

    Previously: Research Intern @ Alibaba

    Beijing, CN

    54
    RLHF / RLVR72
    Synthetic Data & Self-Play65
    Evals & Reward Models60
    Personality & Non-Verifiable Rewards45
    LLM Creativity30
    Strengths
    AMPO (ICLR 2026): RL policy optimization for social agents, scores 8/8/8/6
    BuPO (co-first): bottom-up policy optimization in LLMs
    Gaps
No direct RLHF/preference-optimization (PPO, DPO, GRPO) work published yet
    NS

    Nicholas Schiefer

    medium hireability

    Member of Technical Staff@Anthropic

    Previously: Resident Member of Technical Staff @ Anthropic

    San Francisco, US

    60
    Personality & Non-Verifiable Rewards90
    RLHF / RLVR78
    Evals & Reward Models75
    Synthetic Data & Self-Play40
    LLM Creativity15
    Strengths
    Constitutional AI (RLAIF) co-author — 2,629 citations, defines non-verifiable reward training
    Sycophancy paper (582 citations) — personality alignment via preference optimization
    Gaps
    No evidence of tool-use / computer-use RLVR or agentic post-training work
    NM

    Niklas Muennighoff

    medium hireability

    AI Research@Meta

    Previously: AI Research @ Ai2

    US

    57
    Evals & Reward Models90
    RLHF / RLVR78
    Synthetic Data & Self-Play55
    Personality & Non-Verifiable Rewards40
    LLM Creativity20
    Strengths
    MODPO (823 citations) — direct preference optimization / reward modeling work
    OctoPack: code instruction tuning for LLMs (348 citations)
    Gaps
    No direct work on personality adherence, humor, or creative roleplay
    ND

    Nouha Dziri

    medium hireability

    Research Scientist@Allen Institute for AI

    Previously: Postdoc @ Allen Institute for AI

    US

    67
    Evals & Reward Models95
    RLHF / RLVR90
    Personality & Non-Verifiable Rewards65
    Synthetic Data & Self-Play50
    LLM Creativity35
    Strengths
    RewardBench (417 cites) — definitive reward model evaluation framework
    Tulu 3 (320 cites) — allenai's flagship RLHF/post-training pipeline co-author
    Gaps
    No specific RLVR/verifiable-reward work (code/tool-use RL)
    OP

    Olivier Pietquin

    medium hireability

    Chief Scientist@Earth Species Project

    Previously: Director, Reinforcement Learning and Interaction Research @ Cohere

    Lille, FR

    63
    RLHF / RLVR95
    Evals & Reward Models75
    Personality & Non-Verifiable Rewards70
    LLM Creativity45
    Synthetic Data & Self-Play30
    Strengths
    'Back to Basics' (2024, 481 citations) — redefining REINFORCE-style RLHF for LLMs
    ShiQ (2025) + Self-Improving RPO — cutting-edge preference optimization
    Gaps
Current role at Earth Species Project is bioacoustics, not LLM post-training
    PW

    Pengcheng Wen

    medium hireability
    32
    RLHF / RLVR45
    Evals & Reward Models45
    LLM Creativity30
    Synthetic Data & Self-Play20
    Personality & Non-Verifiable Rewards20
    Strengths
    GRPO remote RM in align-anything — direct RLVR post-training implementation
    eval-anything benchmarks (MathVision, OlympiadBench) — evaluation infrastructure
    Gaps
    Junior MPhil student — engineering contributor, not lead researcher
    PY

    Pengcheng Yin

    medium hireability

    Researcher@DeepMind

    Previously: Researcher Intern @ Microsoft

    San Francisco, US

    28
    Evals & Reward Models60
    Synthetic Data & Self-Play58
    RLHF / RLVR15
    LLM Creativity3
    Personality & Non-Verifiable Rewards3
    Strengths
    14 LLM code gen papers; h-index 35 — recognized expert in NL-to-code
    Learn-by-interact (2025): synthetic agent trajectory data without human annotation
    Gaps
    No RLHF, RLVR, DPO, or reward modeling work in published record
    PD

    Pradeep Dasigi

    medium hireability

    Senior Research Scientist@Allen Institute for AI

    Previously: Research Scientist @ Allen Institute for AI

    Seattle, US

    51
    RLHF / RLVR85
    Evals & Reward Models75
    Synthetic Data & Self-Play50
    Personality & Non-Verifiable Rewards40
    LLM Creativity5
    Strengths
    Tulu 3 contributor — AllenAI's flagship open post-training pipeline
    'Generalizing Verifiable Instruction Following' — direct RLVR evidence
    Gaps
    No evident work on personality, humor, or creative writing post-training
    PZ

    Pu Zhao

    medium hireability

    Principal Researcher@Microsoft

    Previously: Researcher @ Microsoft

    Beijing, CN

    62
    Synthetic Data & Self-Play85
    RLHF / RLVR80
    Evals & Reward Models75
    Personality & Non-Verifiable Rewards40
    LLM Creativity30
    Strengths
    WizardMath: Reinforced Evol-Instruct (PPO-based RLVR for math reasoning)
    Self-Evolved Reward Learning for LLMs: reward model innovation
    Gaps
    No specific work on personality, humor, sarcasm, or non-verifiable subjective rewards

    Qingfeng Sun

    medium hireability

    Partner Engineering Manager@Microsoft

    Previously: Principal Dev Manager @ Microsoft

    Seattle, US

    60
    Synthetic Data & Self-Play: 90
    RLHF / RLVR: 85
    Evals & Reward Models: 65
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 20
    Strengths
    Evol-Instruct inventor — scalable synthetic instruction data generation
    RLEIF (WizardMath) — RLVR for reasoning, directly query-relevant
    Gaps
    No published work on personality/humor/non-verifiable reward modeling

    Qingwei Lin

    medium hireability

    Partner Researcher/Partner Research Manager@Microsoft

    Previously: Principal Researcher/Principal Research Manager @ Microsoft

    Beijing, CN

    53
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 80
    Evals & Reward Models: 68
    Personality & Non-Verifiable Rewards: 22
    LLM Creativity: 12
    Strengths
    WizardMath: Reinforced Evol-Instruct — 656 citations, RLVR for math post-training
    Arena Learning: self-play chatbot arena as post-training data flywheel
    Gaps
    No evidence of work on personality, humor, or subjective non-verifiable rewards

    Qipeng Guo

    medium hireability

    Young Research Scientist@Shanghai AI Laboratory

    Previously: Investment Manager @ Cowin Venture Capital

    Shanghai, CN

    53
    RLHF / RLVR: 85
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 75
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    InternLM2 co-author: large-scale LLM post-training (RLHF, SFT, DPO)
    IFDECORATOR: instruction-following RLVR with verifiable rewards (2025)
    Gaps
    Minimal consumer-facing personality work (humor, sarcasm, roleplay not evident)

    Quentin Gallouédec

    medium hireability

    Researcher@Hugging Face

    Previously: PhD student @ Ecole Centrale de Lyon

    59
    RLHF / RLVR: 97
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 68
    Personality & Non-Verifiable Rewards: 38
    LLM Creativity: 18
    Strengths
    #1 TRL contributor (720 commits) — owns DPO, GRPO, PPO, SFT trainers
    Open R1 co-author — end-to-end RLVR pipeline with GRPO at scale
    Gaps
    No public work on personality, tone, humor, or non-verifiable reward modeling

    Rafael Rafailov

    medium hireability

    Analyst@Goldman Sachs

    Previously: Equity Research Officer/Risk Manager @ Berkeley Investment Group

    New York, US

    76
    RLHF / RLVR: 100
    Evals & Reward Models: 90
    Personality & Non-Verifiable Rewards: 75
    LLM Creativity: 60
    Synthetic Data & Self-Play: 55
    Strengths
    DPO inventor — NeurIPS 2023 Outstanding Paper, 5,769 citations, single most-cited RLHF paper
    Generative Reward Models (2025) — novel reward model for post-training pipelines
    Gaps
    No explicit small-model (≤30B) deployment or efficiency-constrained post-training work

    Rajkumar Ramamurthy

    medium hireability

    Director of Engineering- Transmission Controls@Bosch

    Previously: Director-Simultaneous Engineering @ Automotive Steering Column LLC

    Auburn Hills, US

    57
    RLHF / RLVR: 88
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 45
    LLM Creativity: 15
    Strengths
    allenai/RL4LMs: 58 commits — primary author of RLHF-for-LMs library
    ICLR 2023 RL4LMs paper — RL policy optimization benchmarks for NLP
    Gaps
    No explicit work on personality adherence, humor, or creative writing rewards

    Rémi Munos

    medium hireability

    Researcher@Meta

    Previously: Research scientist @ DeepMind

    Villeneuve d'Ascq, FR

    58
    RLHF / RLVR: 97
    Personality & Non-Verifiable Rewards: 82
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 25
    LLM Creativity: 15
    Strengths
    "Beyond Verifiable Rewards" (2025) — RL on non-verifiable LLM data
    Nash RLHF (2023, 186 cites) — pioneered game-theoretic preference optimization
    Gaps
    No synthetic data / self-play data generation work found

    Rui Lu

    medium hireability
    34
    RLHF / RLVR: 70
    Evals & Reward Models: 55
    Synthetic Data & Self-Play: 35
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DeepDive (arXiv:2509.10446): first author, multi-turn RLVR for search agents
    SLIME contributor — merged PR to THUDM's RL post-training framework (GLM-4.5/5.x)
    Gaps
    No evidence of personality, tone, or non-verifiable reward work

    Rui Pan

    medium hireability
    21
    RLHF / RLVR: 55
    Evals & Reward Models: 40
    Synthetic Data & Self-Play: 5
    Personality & Non-Verifiable Rewards: 5
    LLM Creativity: 0
    Strengths
    22 merged PRs to PKU-Alignment/align-anything — core contributor
    Safe RLHF-V: Reward Model-V + Cost Model-V implementation (#203)
    Gaps
    Undergraduate student — no production-scale post-training experience

    Run Luo

    medium hireability

    MS student@University of Chinese Academy of Sciences

    CN

    70
    Synthetic Data & Self-Play: 88
    RLHF / RLVR: 82
    Evals & Reward Models: 75
    Personality & Non-Verifiable Rewards: 55
    LLM Creativity: 48
    Strengths
    GUI-R1: GRPO-based RLVR for GUI agents, first author (2025, 84 citations)
    MMEvol: Evol-Instruct synthetic instruction data for MLLMs, first author
    Gaps
    MS student — junior; limited production-scale deployment experience

    Runxin Xu

    medium hireability

    researcher@DeepSeek

    Previously: Quant researcher @ Metabit Trading

    Barcelona, ES

    56
    RLHF / RLVR: 97
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 15
    Strengths
    DeepSeek-R1 (GRPO) — 5348 citations, defining RLVR paper in field
    DeepSeekMath (2677 cites) — RLVR post-training for verifiable math reasoning
    Gaps
    No consumer post-training work (personality, creativity, humor, non-verifiable rewards)

    Sagnik Mukherjee

    medium hireability

    Graduate Research Assistant@University of Illinois Urbana-Champaign

    Previously: Research Intern @ Microsoft

    Champaign, US

    31
    RLHF / RLVR: 65
    Evals & Reward Models: 50
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 5
    Synthetic Data & Self-Play: 5
    Strengths
    NeurIPS 2025: RL sparsity in LLMs — mechanistic RL post-training analysis
    ICML 2025 PARC: CoT chain verification and error identification
    Gaps
    No synthetic data generation or self-play pipeline experience

    Sahil Chaudhary

    medium hireability
    29
    Synthetic Data & Self-Play: 85
    RLHF / RLVR: 30
    Evals & Reward Models: 18
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    glaiveai/reasoning-v1-20m: 22.2M sample reasoning SFT dataset
    glaiveai/l4: 1.28M function-calling training samples — direct tool-call alignment
    Gaps
    No evidence of RLHF/PPO/DPO/GRPO applied to preference optimization

    Sainbayar Sukhbaatar

    medium hireability

    Research Scientist@Meta

    Previously: Research Intern @ DeepMind

    US

    53
    RLHF / RLVR: 92
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 12
    Strengths
    Self-Rewarding LMs (548 citations) — RLHF via self-judging reward loop
    Meta-Rewarding (2025) — meta-judge reward model for self-improving alignment
    Gaps
    No work on personality/tone/humor or non-verifiable reward shaping for conversation

    Samuel R. Bowman

    medium hireability

    Member of Technical Staff@Anthropic

    Previously: Technical Advisor @ ASAPP

    San Francisco, US

    59
    Evals & Reward Models: 90
    RLHF / RLVR: 85
    Personality & Non-Verifiable Rewards: 82
    Synthetic Data & Self-Play: 25
    LLM Creativity: 12
    Strengths
    Constitutional AI (2022) — foundational RLAIF for non-verifiable reward training
    Pretraining LMs with Human Preferences — RLHF from scratch research
    Gaps
    No synthetic data generation or self-play / persona-conditioned data pipeline work

    Sergio Paniego Blanco

    medium hireability
    37
    RLHF / RLVR: 75
    Synthetic Data & Self-Play: 50
    Evals & Reward Models: 40
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Strengths
    110+ merged PRs on huggingface/trl — RLHF post-training library
    AsyncGRPO + GRPO examples on Qwen3 (same base model Neuralace uses)
    Gaps
    No personality/non-verifiable reward work — GRPO is RLVR only

    Shafiq Joty

    medium hireability

    Senior Research Director@Salesforce

    Previously: Research Director @ Salesforce

    San Francisco, US

    53
    Evals & Reward Models: 82
    RLHF / RLVR: 78
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 20
    Strengths
    Diffusion Model Alignment via DPO — 408 citations, strong preference optimization pedigree
    Direct Judgement Preference Optimization (2025) — current DPO research
    Gaps
    No personality, humor, or creative writing post-training work found

    Shaoguang Mao

    medium hireability

    Technical Staff@Moonshot AI

    Previously: Senior Research SDE @ Microsoft

    Beijing, CN

    38
    Evals & Reward Models: 55
    LLM Creativity: 48
    Personality & Non-Verifiable Rewards: 42
    Synthetic Data & Self-Play: 25
    RLHF / RLVR: 20
    Strengths
    Kimi K2 authorship (2025) — active Moonshot AI post-training team
    TaskMatrix.AI (229 citations) — tool-call & API orchestration expertise
    Gaps
    No published RLHF/RLVR/DPO/PPO papers — reward modeling depth unclear

    Shauna M Kravec

    medium hireability

    Member Of Technical Staff@Anthropic

    Previously: Machine Learning Engineer @ Clostra

    US

    69
    RLHF / RLVR: 95
    Personality & Non-Verifiable Rewards: 95
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 50
    LLM Creativity: 20
    Strengths
    Constitutional AI co-author — defining work on AI feedback and personality shaping
    RLHF paper (3,057 citations) — foundational post-training with human preferences
    Gaps
    No explicit RLVR/verifiable-reward work (code gen, tool-use, agentic RL)

    Sheng Shen

    medium hireability

    Member of Technical Staff@xAI

    Previously: Research Scientist @ Meta

    San Francisco, US

    45
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 68
    Evals & Reward Models: 65
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 8
    Strengths
    LLaVA-RLHF — factually augmented RLHF, pinned repo with code
    "Learning to Solve and Verify" (2025) — self-play for code/test gen
    Gaps
    No work on personality, humor, or consumer-side subjective RLHF

    Shiyi Cao

    medium hireability

    Ph.D. student@UC Berkeley EECS

    Previously: Researcher @ CMU

    San Francisco, US

    27
    RLHF / RLVR: 62
    Synthetic Data & Self-Play: 35
    Evals & Reward Models: 28
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Sky-T1: RLVR reasoning model (Qwen3-7B) trained for $450, EMNLP 2025
    SkyRL: modular full-stack RL library for long-horizon LLM agent training
    Gaps
    No work on consumer post-training: personality, humor, or non-verifiable rewards

    Shu Liu

    medium hireability

    PhD Student@University of California, Berkeley

    Previously: Research Intern @ Max Planck Institute for Software Systems

    Berkeley, US

    35
    RLHF / RLVR: 80
    Evals & Reward Models: 45
    Synthetic Data & Self-Play: 40
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Sky-T1 (163 citations) — GRPO RLVR o1-style reasoning training at $450
    NovaSky-AI/SkyRL — modular RL library built for LLM agentic workloads
    Gaps
    No consumer post-training work — personality, humor, or non-verifiable rewards

    Sourab Mangrulkar

    medium hireability
    14
    RLHF / RLVR: 30
    Synthetic Data & Self-Play: 20
    Evals & Reward Models: 8
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    huggingface/peft creator — 119+ PRs, LoRA/QLoRA/adapters widely used in post-training
    DPO trainer fix + FSDP+QLoRA enablement in TRL
    Gaps
    No reward modeling or RLHF algorithm research (PPO, GRPO, DPO methods — not just fixes)

    Sumanth R Hegde

    medium hireability
    34
    RLHF / RLVR: 82
    Evals & Reward Models: 48
    Synthetic Data & Self-Play: 25
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    Core SkyRL contributor — GRPO, SFT trainer, Megatron backend
    Chunked logprobs for Qwen 3.5 248k vocab — Qwen RL infra
    Gaps
    No non-verifiable reward work — personality, humor, creativity absent

    Suraj Subramanian

    medium hireability
    20
    RLHF / RLVR: 52
    Evals & Reward Models: 20
    Synthetic Data & Self-Play: 18
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    OpenEnv: agentic RL training framework, integrates with TRL/SkyRL/Unsloth (Dec 2025)
    LoRA vs FFT skills-transferability experiments — fine-tuning methodology research
    Gaps
    No evidence of reward modeling, DPO, or RLHF pipeline work

    Szymon Tworkowski

    medium hireability

    Member of Technical Staff@xAI

    Previously: Student Researcher @ DeepMind

    San Francisco, US

    38
    RLHF / RLVR: 92
    Evals & Reward Models: 52
    Synthetic Data & Self-Play: 38
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Led Grok 4.20 reasoning RL training algorithm and scaling at xAI
    100x RL scaling: foundational contributor to Grok 3 Reasoning stability
    Gaps
    No evidence of consumer post-training (personality, tone, non-verifiable rewards)

    Tianhao Wu

    medium hireability

    PhD student@UC Berkeley

    Previously: Algo Developer @ Hudson River Trading

    Berkeley, US

    66
    RLHF / RLVR: 92
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 72
    Synthetic Data & Self-Play: 52
    LLM Creativity: 28
    Strengths
    RouteLLM: LLM routing by query complexity — matches JD's small-model-as-frontend concept
    Starling-7B (RLAIF, 176 citations) — RLHF post-training for helpfulness/harmlessness
    Gaps
    No published work on personality adherence, humor, or creative writing specifically

    Tianyi Tang

    medium hireability

    Member of the Qwen Team@Alibaba

    Previously: AI Engineer @ Unisound AI

    Hangzhou, CN

    65
    Personality & Non-Verifiable Rewards: 85
    RLHF / RLVR: 70
    Evals & Reward Models: 62
    LLM Creativity: 58
    Synthetic Data & Self-Play: 50
    Strengths
    Co-author Qwen2.5 & Qwen3 — direct post-training production experience
    ICLR 2025: Neuron-based Personality Trait Induction — on-point for consumer alignment
    Gaps
    No standalone RLVR / verifiable-reward RL paper (PPO, DPO, GRPO)

    Tomek Korbak

    medium hireability

    Member of Technical Staff@OpenAI

    Previously: Senior Research Scientist @ AI Security Institute

    San Francisco, US

    62
    RLHF / RLVR: 98
    Personality & Non-Verifiable Rewards: 80
    Evals & Reward Models: 72
    LLM Creativity: 32
    Synthetic Data & Self-Play: 30
    Strengths
    'Open problems... RLHF' — 836 cites, co-authored survey on RLHF limitations
    PhD at Sussex focused entirely on RL from human feedback
    Gaps
    Only ~6 months at OpenAI — relatively fresh hire, may not be actively looking

    Tyler Griggs

    medium hireability
    31
    RLHF / RLVR: 78
    Synthetic Data & Self-Play: 30
    Evals & Reward Models: 28
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Strengths
    SkyRL-Agent paper: multi-turn agent RL training (2025)
    GRPO off-policy per-token masking commit to NovaSky-AI/SkyRL
    Gaps
    No work on personality, creativity, or non-verifiable reward modeling

    Valentina Pyatkin

    medium hireability

    Researcher@ETH AI Center

    Previously: Research Intern @ Allen Institute for AI

    Zurich, CH

    63
    Evals & Reward Models: 95
    RLHF / RLVR: 90
    Personality & Non-Verifiable Rewards: 60
    Synthetic Data & Self-Play: 55
    LLM Creativity: 15
    Strengths
    TÜLU 3: end-to-end open LLM post-training (PPO/DPO/RLVR) — 489 citations
    RewardBench (432 citations): standard benchmark for reward model eval
    Gaps
    No direct work on personality modeling, humor, or creativity generation

    Vardhan Dongre

    medium hireability

    Research Scientist Intern@Adobe

    Previously: AI/ML Research Software Engineer Intern @ Brunswick Corporation

    San Francisco, US

    23
    Personality & Non-Verifiable Rewards: 38
    Evals & Reward Models: 35
    Synthetic Data & Self-Play: 28
    RLHF / RLVR: 10
    LLM Creativity: 5
    Strengths
    Advised by Dilek Hakkani-Tür — leading conversational AI researcher at UIUC
    Drift No More? — multi-turn LLM context quality, key consumer post-training concern
    Gaps
    No evidence of RLHF, DPO, reward modeling, or post-training methodology

    Wangchunshu Zhou

    medium hireability

    Director of OPPO Personal AI Lab@OPPO

    Previously: Co-Founder & Chief Technology Officer @ AIWaves

    Hangzhou, CN

    66
    LLM Creativity: 88
    Personality & Non-Verifiable Rewards: 85
    Evals & Reward Models: 62
    Synthetic Data & Self-Play: 62
    RLHF / RLVR: 35
    Strengths
    RoleLLM (344 citations) — eliciting + benchmarking role-play in LLMs
    Weaver: foundation model specifically for creative writing (NeurIPS 2024)
    Gaps
    No explicit RLHF/PPO/DPO/GRPO training papers — mostly SFT-based

    Wanjun Zhong

    medium hireability

    Senior Research Scientist@ByteDance

    Previously: Research Scientist @ Huawei

    CN

    46
    RLHF / RLVR: 82
    Evals & Reward Models: 78
    Personality & Non-Verifiable Rewards: 35
    Synthetic Data & Self-Play: 25
    LLM Creativity: 12
    Strengths
    ReTool (2025): RL for strategic tool use — direct RLVR match
    OTC (2025): optimal tool calls via RL — second tool-RL paper
    Gaps
    No evidence of personality/non-verifiable reward post-training work

    Wenwei Zhang

    medium hireability

    Young Research Scientist@Shanghai Artificial Intelligence Laboratory

    Previously: PhD student @ Nanyang Technological University

    Singapore, SG

    47
    Evals & Reward Models: 92
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 30
    Personality & Non-Verifiable Rewards: 22
    LLM Creativity: 10
    Strengths
    "Exploring Limit of Outcome Reward" — RLVR for math reasoning (2025)
    InternLM-XComposer2.5-Reward — multimodal reward model (30 citations, 2025)
    Gaps
    No direct work on personality, humor, or non-verifiable reward shaping

    Wenxiang Hu

    medium hireability

    Senior Machine Learning Engineer@Microsoft

    Previously: Senior Research Software Engineer @ Microsoft

    Seattle, US

    32
    Synthetic Data & Self-Play: 80
    RLHF / RLVR: 30
    Evals & Reward Models: 20
    LLM Creativity: 15
    Personality & Non-Verifiable Rewards: 15
    Strengths
    WizardCoder (846 cites): Evol-Instruct synthetic data pipeline for code SFT
    EpiCoder (ICML 2025): feature tree-based synthesis — controllable complexity & diversity
    Gaps
    No explicit RLHF/DPO/PPO/GRPO post-training work — focus is SFT data generation

    Xiangxin Zhou

    medium hireability

    RedStar Intern@Xiaohongshu Hi Lab

    Previously: Associate Member @ Sea AI Lab

    31
    RLHF / RLVR: 68
    Personality & Non-Verifiable Rewards: 55
    Evals & Reward Models: 15
    Synthetic Data & Self-Play: 12
    LLM Creativity: 5
    Strengths
    VeriFree (ICLR 2026): RL for reasoning without verifiers — non-VR rewards
    Variational Reasoning for LMs (ICLR 2026): second LLM RL paper same cycle
    Gaps
    No explicit work on personality, humor, sarcasm, or creative writing

    Xin Cong

    medium hireability
    39
    RLHF / RLVR: 72
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 38
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    AgentCPM-Explore: RL post-training for 4B model with reward signal denoising (Feb 2026)
    AgentRM: reward modeling (explicit/implicit/LLM-as-judge) for agent policy guidance (Feb 2025)
    Gaps
    No consumer post-training work — personality, humor, sarcasm, creativity all absent

    Xinting Huang

    medium hireability

    Senior Researcher@Tencent

    Previously: Research Engineer Intern @ ByteDance

    Shenzhen, CN

    32
    Synthetic Data & Self-Play: 72
    RLHF / RLVR: 45
    Evals & Reward Models: 30
    Personality & Non-Verifiable Rewards: 8
    LLM Creativity: 5
    Strengths
    Explore-Instruct (EMNLP 2023): domain-specific instruction data via active exploration
    TeG-Instruct (2024): text-grounded premium instruction-tuning data pipeline
    Gaps
    No evidence on personality, humor, creativity, or roleplay post-training

    Xuechen Li

    medium hireability

    Member of Technical Staff@xAI

    Previously: Member of Technical Staff @ xAI

    San Francisco, US

    53
    RLHF / RLVR: 85
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 50
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 10
    Strengths
    AlpacaFarm (NeurIPS 2023): RLHF simulation — seminal preference-learning work
    19-commit lead on stanford_alpaca — core instruction-tuning contributor
    Gaps
    No direct work on personality / non-verifiable reward modeling

    Xuyao Wang

    medium hireability
    30
    RLHF / RLVR: 55
    Evals & Reward Models: 55
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 10
    LLM Creativity: 5
    Strengths
    PPO, DPO, SFT pipelines implemented across multiple model families in align-anything
    Qwen3/Qwen3MoE post-training support — directly relevant to Neuralace's Qwen work
    Gaps
    No published research papers — engineering contributor only, not research lead

    Yang An (An Yang)

    medium hireability
    79
    RLHF / RLVR: 95
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 80
    Synthetic Data & Self-Play: 72
    LLM Creativity: 62
    Strengths
    Group Sequence Policy Optimization — novel RL algorithm for LLM post-training (2025)
    WorldPM: Scaling Human Preference Modeling — reward model for subjective preferences
    Gaps
    Deeply embedded at Alibaba Qwen team — competitive to recruit

    Yann Dubois

    medium hireability

    Member of Technical Staff@OpenAI

    Previously: Research Assistant @ Vector Institute

    San Francisco, US

    72
    Evals & Reward Models: 92
    RLHF / RLVR: 88
    Synthetic Data & Self-Play: 78
    Personality & Non-Verifiable Rewards: 72
    LLM Creativity: 30
    Strengths
    AlpacaFarm (686 cit.) — RLHF simulation with LLM-as-judge for non-verifiable rewards
    AlpacaEval (824 cit.) — industry-standard instruction-following eval framework
    Gaps
    No direct work on creative writing, roleplay, or personality adherence training

    Yan Wang

    medium hireability

    Principal Researcher@Tencent

    Previously: Research Scientist @ miHoYo

    41
    Personality & Non-Verifiable Rewards: 72
    LLM Creativity: 55
    Evals & Reward Models: 38
    Synthetic Data & Self-Play: 22
    RLHF / RLVR: 18
    Strengths
    'Harry Potter' character alignment — LLMs aligned to personality (EMNLP 2023)
    'Generate, Delete and Rewrite' — persona consistency in dialogue (ACL 2020)
    Gaps
    No RLHF/PPO/DPO/GRPO post-training pipeline evidence

    Yaowei Zheng

    medium hireability
    57
    RLHF / RLVR: 95
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 70
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 15
    Strengths
    LLaMA-Factory (ACL 2024, 1,205 citations) — PPO/DPO/GRPO, 2,186 commits
    EasyR1: GRPO/DAPO/RLOO RL training framework, 4.9K stars, multi-modal
    Gaps
    No dedicated research on personality fine-tuning or non-verifiable reward modeling

    Yaxi Lu

    medium hireability

    Eng.D. student@Tsinghua University

    Beijing, CN

    33
    Evals & Reward Models: 75
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 10
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    AgentRM (ACL 2025): reward model for agent generalization
    Reflective Reinforcement Tool Learning (2026): RLVR for tool use
    Gaps
    No consumer post-training work (personality, creativity, humor, roleplay)

    Yining Ye

    medium hireability

    Master's student@THUNLP Lab, Tsinghua University

    Previously: Topseed Intern @ Bytedance

    Beijing, CN

    35
    RLHF / RLVR: 65
    Evals & Reward Models: 62
    Synthetic Data & Self-Play: 38
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    ToolLLM (ICLR 2024 Spotlight, 1005 citations) — tool calling training platform
    UI-TARS-2 multi-turn RL for GUI/computer-use agents
    Gaps
    No consumer post-training (personality, humor, emotional understanding)

    Yixuan Su

    medium hireability

    Modelling Lead, Agentic Reasoning@Cohere

    Previously: Research Scientist @ Cohere

    London, GB

    45
    Evals & Reward Models: 65
    LLM Creativity: 60
    RLHF / RLVR: 45
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 25
    Strengths
    Modelling Lead on Command-A-Reasoning — hands-on post-training for reasoning
    "Replacing Judges with Juries" (2024) — multi-model eval framework, 164 cites
    Gaps
    No published RLHF/RLVR papers — post-training expertise inferred from role only

    Yongbin Li

    medium hireability

    Principal Research Scientist@Alibaba

    77
    RLHF / RLVR: 85
    Personality & Non-Verifiable Rewards: 82
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 70
    LLM Creativity: 68
    Strengths
    'Preference Ranking Optimization for Human Alignment' — 327 citations, core RLHF work
    'CPO: Reward Ambiguity in Role-playing Dialogue' (2025) — non-verifiable reward for roleplay
    Gaps
    RLVR on code/math (Knowledge Work axis) weaker than consumer post-training axis

    Younes Belkada

    medium hireability

    MS student@ENS Paris Saclay

    Previously: Researcher @ Technology Innovation Institute

    Paris, FR

    46
    RLHF / RLVR: 92
    Evals & Reward Models: 55
    Personality & Non-Verifiable Rewards: 38
    Synthetic Data & Self-Play: 35
    LLM Creativity: 12
    Strengths
    241 commits to huggingface/trl — PPO/DPO/GRPO RLHF library core contributor
    Zephyr paper (779 cites): direct LM alignment distillation via DPO
    Gaps
    No specific work on personality, humor, or creativity-focused reward modeling

    Yujia Qin

    medium hireability

    Seed@ByteDance

    Previously: Founder @ SeqAI Inc.

    49
    RLHF / RLVR: 78
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 5
    Strengths
    ToolLLM — 956 citations, foundational LLM tool-use training (ICLR 2024 spotlight)
    ReTool (2025) — RL for strategic tool use, direct RLVR post-training evidence
    Gaps
    No work on personality/creativity/humor post-training (Consumer direction weak)

    Yujia Qin

    medium hireability
    35
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 55
    RLHF / RLVR: 40
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    ToolLLM (ICLR 2024 Spotlight) — 16k API tool-use training at scale
    ToolBench — open eval platform for tool learning, directly maps to Evals axis
    Gaps
    No work on consumer post-training: personality, humor, sarcasm, or non-VR rewards

    Yunshui Li

    medium hireability

    Researcher@ByteDance

    Previously: MS student @ University of the Chinese Academy of Sciences

    49
    Synthetic Data & Self-Play: 80
    RLHF / RLVR: 70
    Evals & Reward Models: 40
    LLM Creativity: 30
    Personality & Non-Verifiable Rewards: 25
    Strengths
    Seed1.5-Thinking co-author — ByteDance RLVR reasoning post-training (2025)
    NUGGETS: instruction data prospector for SFT quality filtering (ACL 2024)
    Gaps
    No direct RLAIF/constitutional AI or personality reward modeling work

    Yuqing Du

    medium hireability

    Research Scientist@DeepMind

    Previously: Visiting Researcher @ Meta

    San Francisco, US

    55
    RLHF / RLVR: 88
    Evals & Reward Models: 72
    Personality & Non-Verifiable Rewards: 65
    Synthetic Data & Self-Play: 30
    LLM Creativity: 22
    Strengths
    DPOK (NeurIPS 2023, 419 citations): RLHF/RLVR applied to generative model fine-tuning
    Aligning T2I with Human Feedback (410 citations): preference reward modeling
    Gaps
    No explicit work on LLM personality, humor, or creativity alignment

    Yuxuan Zhang

    medium hireability

    PhD student@University of Liverpool

    Previously: Undergrad student @ Xi'an Jiaotong-Liverpool University

    Liverpool, GB

    30
    RLHF / RLVR: 65
    Evals & Reward Models: 25
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 20
    LLM Creativity: 15
    Strengths
    GLM-4.1V-Thinking: RLCS (RL with curriculum sampling) for multimodal reasoning
    GLM-4.5 contributor — agentic, tool use, coding post-training at scale
    Gaps
    Contributor role on large teams — ownership of RL/post-training components unclear

    Zhenyu Li

    medium hireability

    PhD student@Tsinghua University

    Beijing, CN

    53
    RLHF / RLVR: 78
    Synthetic Data & Self-Play: 78
    Evals & Reward Models: 65
    Personality & Non-Verifiable Rewards: 32
    LLM Creativity: 14
    Strengths
    Doubao Super Mode: led RL pipeline for agentic tool/search/code capabilities
    Agent-World: self-evolving training arena, 14B beats DeepSeek-V3-685B on BFCL-V4
    Gaps
    No explicit work on personality, humor, or creative writing post-training

    Zhiheng Xi

    medium hireability

    Senior Staff Machine Learning Engineer@Apple

    Previously: Staff Software Engineer (ASE) @ Apple

    Seattle, US

    61
    RLHF / RLVR: 95
    Evals & Reward Models: 90
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 20
    Strengths
    "Delve into PPO" (278 cit.) — seminal RLHF PPO implementation paper
    "Secrets of RLHF Part II: Reward Modeling" (192 cit.) — reward model depth
    Gaps
    No work on consumer personality, humor, or creative writing

    Zhihong Shao

    medium hireability

    Member of Technical Staff@DeepSeek

    Previously: Research Intern @ Microsoft

    Beijing, CN

    58
    RLHF / RLVR: 97
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 78
    LLM Creativity: 18
    Personality & Non-Verifiable Rewards: 12
    Strengths
    Invented GRPO — the dominant RLVR algorithm for LLM post-training
    DeepSeek-R1 key author — RL scaling for complex reasoning
    Gaps
    No public work on personality adherence, humor, or non-verifiable reward modeling

    Zhu Qihao

    medium hireability
    43
    RLHF / RLVR: 95
    Evals & Reward Models: 60
    Synthetic Data & Self-Play: 50
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DeepSeek-R1 co-author — GRPO/RLVR for reasoning, landmark 2025 paper
    DeepSeek-Prover-V1.5 & V2 — RLPAF + MCTS self-play for theorem proving
    Gaps
    No evidence of consumer post-training (personality, humor, non-VR rewards)

    Zihan Wang

    medium hireability

    MS student@Tsinghua University

    CN

    41
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 70
    Evals & Reward Models: 45
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 8
    Strengths
    RLOJF: direct RLVR with online judge verifiable reward
    SciInstruct: self-reflective annotation pipeline (NeurIPS 2024)
    Gaps
    No work on personality, emotions, humor, or non-verifiable reward modeling

    阿丹 (Adan)

    medium hireability
    39
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 50
    RLHF / RLVR: 45
    LLM Creativity: 25
    Evals & Reward Models: 15
    Strengths
    AutoPlan toolcall finetuning — loss masking on Observation tokens
    dpo_trainer_new — DPO with SFT cross-entropy to prevent catastrophic forgetting
    Gaps
    No formal RL research or reward modeling publications

    Abbas Abdolmaleki

    low hireability

    Research Scientist@Google

    Portugal

    40
    RLHF / RLVR: 82
    Evals & Reward Models: 52
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 25
    LLM Creativity: 10
    Strengths
    'Preference optimization as probabilistic inference' — direct RLHF theory contribution
    MPO/V-MPO inventor — KL-constrained RL foundation for preference optimization
    Gaps
    Core background is robotic control RL, not LLM personality or creativity post-training

    Adam Roberts

    low hireability

    Director of Research@DeepMind

    Previously: Senior Staff Software Engineer @ DeepMind

    San Francisco, US

    51
    Synthetic Data & Self-Play: 80
    LLM Creativity: 72
    Evals & Reward Models: 55
    RLHF / RLVR: 30
    Personality & Non-Verifiable Rewards: 20
    Strengths
    FLAN-v2: instruction fine-tuning at scale — 5.3K citations
    Flan Collection: data design methods for SFT — directly maps to synthetic data pipeline
    Gaps
    Instruction tuning is SFT-centric — limited RLHF/PPO/DPO/reward modeling work

    Adam Santoro

    low hireability

    DeepMind

    37
    Evals & Reward Models: 80
    RLHF / RLVR: 60
    LLM Creativity: 20
    Synthetic Data & Self-Play: 15
    Personality & Non-Verifiable Rewards: 10
    Strengths
    Representation geometry paper (2025) directly compares SFT, DPO, RLVR post-training dynamics
    BIG-bench framework — major LLM evaluation benchmark (2016 citations)
    Gaps
    No evidence of reward modeling or RLHF pipeline implementation

    Adam Tauman Kalai

    low hireability

    Research Scientist@OpenAI

    Previously: Senior Principal Researcher @ Microsoft

    42
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 45
    LLM Creativity: 25
    RLHF / RLVR: 20
    Strengths
    'Using LLMs to Simulate Multiple Humans' (778 citations) — persona data gen
    OpenAI o1 system card contributor — RL-based training exposure
    Gaps
    No direct RLHF/PPO/DPO/GRPO training work published

    Adam X. Yang

    low hireability

    PhD student@Mistral AI

    Bristol, GB

    45
    RLHF / RLVR: 88
    Evals & Reward Models: 80
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 18
    LLM Creativity: 10
    Strengths
    Bayesian Reward Models for LLM Alignment — reward overoptimization & RLHF (29 citations)
    SparsePO (ICLR 2025) — preference alignment via sparse token weighting
    Gaps
    No evidence of personality/tone/creativity reward modeling (consumer post-training gap)

    Addie Foote

    low hireability

    Research Scholar@ML Alignment & Theory Scholars

    Previously: Research Scholar @ ML Alignment & Theory Scholars

    San Francisco, US

    17
    RLHF / RLVR: 30
    Synthetic Data & Self-Play: 20
    Evals & Reward Models: 15
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Strengths
    Trellis: 50x faster LoRA fine-tuning on 1T Kimi K2 Thinking MoE (March 2026)
    Expert parallelism + INT4 dequant on 8xH200 — hands-on distributed post-training stack
    Gaps
    Very early career — h-index 2, UT Austin 2024 undergrad

    Afra Amini

    low hireability

    Research Scientist@DeepMind

    Previously: Research Intern @ Ai2

    CH

    42
    RLHF / RLVR: 90
    Evals & Reward Models: 60
    Synthetic Data & Self-Play: 30
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    ODPO (ACL 2024, 102 citations) — direct preference optimization innovation
    NeurIPS 2025: KL divergence for RLHF — reward model quality signal
    Gaps
    No work on personality adherence, humor, or non-verifiable reward shaping

    Ahmed Hassan Awadallah

    low hireability

    Partner Research Manager@Microsoft

    59
    RLHF / RLVR: 90
    Synthetic Data & Self-Play: 88
    Evals & Reward Models: 80
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 10
    Strengths
    Hybrid LLM (2024): small-to-large model routing — exact Neuralace vision
    Orca/Orca-2: GPT-4 trace synthetic data pipeline for SLM post-training
    Gaps
    No published work on personality, tone, humor, or non-verifiable subjective rewards

    Ahmet Üstün

    low hireability

    Code Agents Lead@Cohere

    Previously: Senior Research Scientist @ Cohere

    Groningen, NL

    46
    RLHF / RLVR: 85
    Evals & Reward Models: 60
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    "Back to Basics: REINFORCE for LHF" (2024, 481 citations) — core RLHF post-training work
    "RLHF Can Speak Many Languages" — multilingual preference optimization
    Gaps
    No evidence of personality/creativity reward modeling or non-verifiable reward design

    Akshay Krishnamurthy

    low hireability

    Senior Principal Research Manager@Microsoft

    Previously: Principal Researcher @ Microsoft

    New York, US

    39
    RLHF / RLVR: 92
    Evals & Reward Models: 55
    Synthetic Data & Self-Play: 32
    Personality & Non-Verifiable Rewards: 10
    LLM Creativity: 5
    Strengths
    XPO (ICLR 2025): provably efficient exploration in RLHF
    Chi-Squared Preference Opt (ICLR 2025 spotlight): direct alignment sans overoptimization
    Gaps
    No work on personality, tone, humor, or subjective reward modeling

    Albert Villanova del Moral

    low hireability
    32
    RLHF / RLVR: 75
    Evals & Reward Models: 55
    Personality & Non-Verifiable Rewards: 15
    Synthetic Data & Self-Play: 10
    LLM Creativity: 5
    Strengths
    387 merged PRs on huggingface/trl — core DPO/KTO/Reward trainer maintainer
    Commits landing today (May 7 2026): KTO/DPO alignment, RewardTrainer fixes
    Gaps
    No published papers on RLHF, preference learning, or post-training

    Alec Koppel

    low hireability

    Senior Professional Staff@Johns Hopkins Applied Physics Laboratory

    Previously: Research Lead/Vice President @ JPMorgan Chase

    Laurel, US

    46
    RLHF / RLVR: 90
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 35
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 5
    Strengths
    MaxMin-RLHF (2024, 88 cites) — diverse-preference alignment directly on-target
    PARL (2024, 37 cites) — unified RLHF policy alignment framework
    Gaps
    No work on personality, humor, or non-verifiable creative reward modeling

    Alexander Havrilla

    low hireability

    research scientist@DeepMind

    Previously: PhD student @ Georgia Institute of Technology

    London, GB

    78
    RLHF / RLVR: 92
    LLM Creativity: 76
    Synthetic Data & Self-Play: 76
    Personality & Non-Verifiable Rewards: 74
    Evals & Reward Models: 72
    Strengths
    trlX: large-scale RLHF framework, EMNLP 2023, CarperAI co-founder
    "Teaching LLMs to Reason with RL" — 130-citation RLVR paper
    Gaps
    No public tool-call or agentic LLM work found

    Alexander Wettig

    low hireability

    Research Scientist@Cursor

    Previously: Software Engineering Intern @ Google

    San Francisco, US

    33
    Evals & Reward Models: 78
    Synthetic Data & Self-Play: 72
    RLHF / RLVR: 15
    LLM Creativity: 0
    Personality & Non-Verifiable Rewards: 0
    Strengths
    SWE-bench (ICLR 2024 Oral, 1195 cites) — top eval for code agents
    SWE-smith (2025): scales SFT data generation for agentic code tasks
    Gaps
    No RLHF/RLVR or preference optimization work

    Alexandre Ramé

    low hireability

    Research Scientist@DeepMind

    Previously: Research Scientist Intern @ Meta

    Paris, FR

    52
    RLHF / RLVR: 90
    Evals & Reward Models: 87
    Personality & Non-Verifiable Rewards: 52
    Synthetic Data & Self-Play: 18
    LLM Creativity: 12
    Strengths
    WARM (103 citations) — reward model robustness via weight averaging
    Rewarded Soups (207 citations) — Pareto-optimal multi-reward alignment
    Gaps
    No synthetic data or self-play pipeline work found

    Alex Beutel

    low hireability

    Member of Technical Staff, Research Scientist@OpenAI

    Previously: Senior Staff Research Scientist @ Google

    New York, US

    54
    RLHF / RLVR: 82
    Evals & Reward Models: 78
    Personality & Non-Verifiable Rewards: 72
    Synthetic Data & Self-Play: 22
    LLM Creativity: 18
    Strengths
    OpenAI o1 post-training author — reasoning model safety training
    'Instruction Hierarchy' trains LLMs on privileged instruction priority (ICLR 2025)
    Gaps
    No explicit work on synthetic conversation data generation or self-play pipelines

    Alexey Bukhtiyarov

    low hireability
    30
    Personality & Non-Verifiable Rewards: 65
    RLHF / RLVR: 35
    LLM Creativity: 30
    Evals & Reward Models: 15
    Synthetic Data & Self-Play: 5
    Strengths
    NLP Team Lead at Ex-Human — consumer personality/character AI company
    Slingshot AI (Ash) — foundational LLM for therapy, emotion & intent understanding
    Gaps
    No published research — applied practitioner, not researcher

    Alex Tamkin

    low hireability

    Member of Technical Staff@Anthropic

    Previously: PhD @ Stanford University

    San Francisco, US

    53
    Personality & Non-Verifiable Rewards: 85
    RLHF / RLVR: 65
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 30
    LLM Creativity: 20
    Strengths
    Collective Constitutional AI — RLAIF with broad human preference input (124 citations)
    Eliciting Human Preferences with LLMs — preference reward modeling at NeurIPS 2024
    Gaps
    No clear RLVR / tool-use or code-gen work — knowledge-work axis weak

    Alon Albalak

    low hireability

    Research Scientist, Open-Endedness@Lila Sciences

    Previously: Data Team Lead, Member of Technical Staff @ SynthLabs

    San Francisco, US

    52
    Synthetic Data & Self-Play: 88
    RLHF / RLVR: 72
    Evals & Reward Models: 68
    Personality & Non-Verifiable Rewards: 18
    LLM Creativity: 12
    Strengths
    Generative Reward Models (2025, 92 cit) — direct reward modeling work
    Big-math RL dataset @ SynthLabsAI — RLVR data pipeline at scale
    Gaps
    Minimal consumer post-training work — no personality, creativity, or RLAIF evidence

    Amanda Askell

    low hireability

    Member Of Technical Staff@Anthropic

    Previously: Research Scientist (Policy) @ OpenAI

    San Francisco, US

    72
    Personality & Non-Verifiable Rewards: 100
    RLHF / RLVR: 98
    Evals & Reward Models: 85
    LLM Creativity: 40
    Synthetic Data & Self-Play: 35
    Strengths
    Constitutional AI paper — canonical RLAIF / non-VR reward design (2.6K citations)
    InstructGPT co-author — PPO-based RLHF at scale (21K citations)
    Gaps
    No direct evidence of synthetic data / self-play pipelines

    Amjad Almahairi

    low hireability

    Staff Research Scientist@Google

    Previously: Research Scientist @ Anyscale

    San Francisco, US

    38
    RLHF / RLVR: 80
    Evals & Reward Models: 45
    Personality & Non-Verifiable Rewards: 35
    Synthetic Data & Self-Play: 20
    LLM Creativity: 10
    Strengths
    RouteLLM (ICLR 2025) — preference-data LLM routing, exact match to JD tool-call vision
    LLaMA 2 RLHF post-training — core contributor at Meta AI
    Gaps
    No dedicated personality/creativity/non-verifiable reward work evidenced

    Andrea Madotto

    low hireability

    Research Scientist@Meta

    Previously: PhD @ The Hong Kong University of Science and Technology

    San Francisco, US

    46
    Personality & Non-Verifiable Rewards: 80
    LLM Creativity: 55
    Evals & Reward Models: 45
    RLHF / RLVR: 30
    Synthetic Data & Self-Play: 20
    Strengths
    PPLM (1257 citations) — foundational controllable personality/style generation work
    MoEL (310 citations) — empathy and emotion modeling in dialogue
    Gaps
    No evidence of modern RLHF at scale (PPO, DPO, GRPO on large LLMs)

    Angelica Chen

    low hireability

    Senior Research Scientist@DeepMind

    Previously: Doctoral Student @ New York University

    New York, US

    49
    RLHF / RLVR: 85
    Evals & Reward Models: 75
    Personality & Non-Verifiable Rewards: 60
    Synthetic Data & Self-Play: 15
    LLM Creativity: 10
    Strengths
    "Pretraining Language Models with Human Preferences" (ICML 2023, 263 citations)
    "Preference Learning Algorithms Do Not Learn Preference Rankings" (NeurIPS 2024)
    Gaps
    No work on personality, humor, or creative writing post-training

    Anirudh Goyal

    low hireability

    Researcher@DeepMind

    Previously: PhD student @ University of Montreal

    London, GB

    64
    LLM Creativity: 85
    Evals & Reward Models: 75
    RLHF / RLVR: 70
    Synthetic Data & Self-Play: 45
    Personality & Non-Verifiable Rewards: 45
    Strengths
    HypoSpace (2025): LLM creativity eval as set-valued hypothesis generators
    MCTS + iterative preference learning (2024, 170 cit.) — DPO-style preference optimization
    Gaps
    No evidence of tool-call or agentic post-training work

    Asli Celikyilmaz

    low hireability

    Research Manager@Meta

    Previously: Senior Principal Researcher @ Microsoft

    Seattle, US

    77
    Evals & Reward Models: 84
    RLHF / RLVR: 80
    Personality & Non-Verifiable Rewards: 75
    Synthetic Data & Self-Play: 73
    LLM Creativity: 72
    Strengths
    RLCD (2024): RL from contrastive distillation for LM alignment, 60 cit
    PrefPalette (2025): personalized preference modeling with latent attributes
    Gaps
    No explicit GRPO or RLVR on verifiable tasks (code, tool execution confirmed)

    Aston Zhang

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Research Scientist @ Meta

    San Francisco, US

    47
    RLHF / RLVR: 75
    Evals & Reward Models: 70
    Synthetic Data & Self-Play: 45
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 15
    Strengths
    Self-generated critiques boost reward modeling (2025) — direct reward model paper
    Systematic Examination of Preference Learning (2025) — RLHF/DPO methodology
    Gaps
    No explicit published work on personality/creativity/non-verifiable reward design

    Banghua Zhu

    low hireability

    Principal Researcher@NVIDIA

    66
    Evals & Reward Models: 95
    RLHF / RLVR: 90
    Personality & Non-Verifiable Rewards: 82
    Synthetic Data & Self-Play: 40
    LLM Creativity: 25
    Strengths
    Chatbot Arena (1063 citations) — human preference LLM evaluation pioneer
    Starling-7B RLAIF — direct experience training with AI feedback for helpfulness/harmlessness
    Gaps
    Primarily evaluation/reward-model focused; limited consumer personality or creative writing training work

    Barret Zoph

    low hireability

    Something New@OpenAI

    Previously: CTO, Co-Founder @ Thinking Machines

    San Francisco, US

    74
    RLHF / RLVR: 95
    Evals & Reward Models: 90
    Personality & Non-Verifiable Rewards: 75
    Synthetic Data & Self-Play: 72
    LLM Creativity: 40
    Strengths
    VP Research Post-Training at OpenAI — ran ChatGPT, GPT-4, o1 post-training
    FLAN (scaling instruction-finetuned models) — 4,896 citations, canonical IT work
    Gaps
    Just restarted at OpenAI as IC ~4 months ago — low near-term hireability

    Behnam Hedayatnia

    low hireability

    Senior Machine Learning Engineer@Apple

    Previously: Senior Research Scientist @ Amazon

    San Francisco, US

    44
    Personality & Non-Verifiable Rewards: 75
    Evals & Reward Models: 65
    LLM Creativity: 30
    Synthetic Data & Self-Play: 30
    RLHF / RLVR: 20
    Strengths
    DialGuide (2023): behavior alignment via natural-language guidelines — personality adherence
    7 years at Amazon Alexa Prize — conversation quality, emotion, engagement research
    Gaps
    No direct RLHF/PPO/DPO/GRPO post-training papers

    Bei Chen

    low hireability

    Senior Researcher@Microsoft

    Previously: Intern @ Alibaba

    Beijing, CN

    28
    Evals & Reward Models: 55
    RLHF / RLVR: 40
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 15
    LLM Creativity: 5
    Strengths
    CodeT (506 cites): test execution as verifiable reward for code gen
    Step-Aware Verifier (358 cites): process reward model for LLM reasoning
    Gaps
    No direct PPO/DPO/GRPO post-training work on LLMs

    Bilal Piot

    low hireability

    Research scientist@DeepMind

    Previously: ATER @ Université Lille3

    London, GB

    51
    RLHF / RLVR: 97
    Evals & Reward Models: 88
    Personality & Non-Verifiable Rewards: 35
    Synthetic Data & Self-Play: 28
    LLM Creativity: 5
    Strengths
    Nash RLHF (2024, 216 cit) — novel game-theoretic RLHF framework
    General paradigm for learning from preferences (2024, 835 cit) — foundational theory
    Gaps
    No personality/creativity/humor-focused reward modeling work

    Binxing Jiao

    low hireability

    VP@StepFun

    Previously: Principal Software Engineering Manager @ Microsoft

    Beijing, CN

    52
    RLHF / RLVR: 88
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 12
    Strengths
    ROVER (2025): novel RLVR algorithm, +8.2 pass@1 over existing methods
    Step 3.5 Flash: scalable RL combining verifiable + preference signals at scale
    Gaps
    No evidence of personality, humor, or creativity-focused reward modeling

    Bowen Baker

    low hireability

    Research Scientist@OpenAI

    Previously: Research Scientist Intern @ OpenAI

    Nevada City, US

    41
    Evals & Reward Models: 82
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 42
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    "Let's Verify Step by Step" — foundational PRM paper (1898 citations, 2023)
    OpenAI o1 System Card contributor — RLVR reasoning model training
    Gaps
    Zero work on consumer post-training — personality, tone, creativity, non-verifiable rewards

    Bowen Jin

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Research Intern @ Apple

    San Francisco, US

    37
    RLHF / RLVR: 88
    Evals & Reward Models: 68
    Synthetic Data & Self-Play: 20
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Search-R1: scalable RLVR framework for reasoning + search tool calling
    Rm-r1: reward modeling framed as reasoning — novel reward model approach
    Gaps
    No work on personality, tone, humor, or non-verifiable reward modeling

    Boyuan Chen

    low hireability

    Undergraduate Student in Artificial Intelligence@Peking University

    Previously: Research Intern @ Peking University

    Beijing, CN

    49
    RLHF / RLVR: 90
    Evals & Reward Models: 70
    Personality & Non-Verifiable Rewards: 50
    Synthetic Data & Self-Play: 25
    LLM Creativity: 10
    Strengths
    BeaverTails (665 citations) — built RLHF preference dataset from scratch
    PKU-SafeRLHF: ACL 2025 Best Paper on multi-level safety alignment
    Gaps
    No explicit work on personality/tone/humor or consumer conversation modeling

    Boyuan Zheng

    low hireability

    Member of Technical Staff@xAI

    Previously: Research Intern @ Allen Institute for AI

    San Francisco, US

    40
    RLHF / RLVR: 65
    Synthetic Data & Self-Play: 60
    Evals & Reward Models: 55
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Strengths
    AI2 RLVR intern — open-instruct/Tulu RLVR post-training pipeline
    MTS at xAI (Dec 2025) — post-training/LLM team
    Gaps
    No published work on personality/non-verifiable reward modeling

    Caiming Xiong

    low hireability

    SVP, AI Research & Applied Research@Salesforce

    Previously: VP of AI Research & Applied AI @ Salesforce

    San Francisco, US

    58
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 85
    RLHF / RLVR: 75
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 10
    Strengths
    Authored LLM post-training survey (2025) — field authority
    APIGen (91 citations) — verifiable function-calling dataset pipeline
    Gaps
    No work on personality/emotion/humor/sarcasm post-training

    Carlos E. Jimenez

    low hireability

    Researcher@Anthropic

    Previously: Teaching Assistant @ University of Utah

    San Francisco, US

    45
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 70
    Personality & Non-Verifiable Rewards: 35
    RLHF / RLVR: 30
    LLM Creativity: 5
    Strengths
    SWE-bench (540 cit) — gold-standard agentic code eval framework
    SWE-smith — synthetic data scaling for agent training (37 cit)
    Gaps
    No RLHF/PPO/DPO or reward modeling work published

    Carlos Miguel Patiño

    low hireability

    Research Engineer@Hugging Face

    Previously: Staff Machine Learning Engineer @ Factored

    NL

    26
    RLHF / RLVR: 60
    Synthetic Data & Self-Play: 40
    Evals & Reward Models: 20
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DistillationTrainer (TRL PR #5407) — on-policy KD with external teacher server
    GOLD trainer buffered rollouts — GRPO-style on-policy generation pipeline
    Gaps
    No personality, humor, or non-verifiable reward work

    cat-state

    low hireability
    33
    RLHF / RLVR: 75
    Evals & Reward Models: 45
    Synthetic Data & Self-Play: 25
    Personality & Non-Verifiable Rewards: 15
    LLM Creativity: 5
    Strengths
    NeMo PPO trainer (trlx) — 1,500+ lines, scales to 20B NeMo Megatron
    PrimeIntellect/verifiers 2025 — merged PRs on RLVR rollout infra
    Gaps
    No work on personality, humor, or non-verifiable subjective rewards

    Chao Jia

    low hireability

    Senior Staff Researcher, GenAI Unit@DeepMind

    Previously: Principal Research Scientist @ AIML @ Apple

    San Francisco, US

    43
    RLHF / RLVR: 70
    Evals & Reward Models: 50
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 35
    Synthetic Data & Self-Play: 20
    Strengths
    Role title 'Gemini Multimodal Post-Training' — exact query match
    Gemini 2.5 co-author (2025) — frontier post-training at scale
    Gaps
    Only 7 months at DeepMind — very low likelihood of near-term departure

    Chenghao Yang

    low hireability

    Graduate Research Assistant@University of Chicago

    Previously: Student Researcher @ Google

    Chicago, US

    49
    RLHF / RLVR: 82
    Evals & Reward Models: 62
    Personality & Non-Verifiable Rewards: 55
    LLM Creativity: 30
    Synthetic Data & Self-Play: 15
    Strengths
    f-DPO (ICLR 2024 Spotlight, 126 cites): DPO generalization for diversity + alignment
    EAD-RLVR (2025): verifiable RL via exploratory annealed decoding
    Gaps
    Synthetic data / self-play pipelines: no clear evidence in publications

    Chenguang Zhu

    low hireability

    Senior Research Scientist@Meta

    Previously: Teaching Assistant @ The University of Texas at Austin

    San Francisco, US

    60
    RLHF / RLVR: 90
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 45
    LLM Creativity: 20
    Strengths
    WPO (EMNLP 2024): direct weighted preference optimization for RLHF
    Self-Generated Critiques: reward model improvement via self-critique (NAACL 2025)
    Gaps
    No published work on personality, humor, or creativity alignment

    Chen Xing

    low hireability

    Research Scientist@Meta

    Previously: Senior Research Scientist, Strategic Partnership Lead @ Scale AI

    San Francisco, US

    49
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 75
    RLHF / RLVR: 45
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 20
    Strengths
    ReGenesis (ICLR 2025 Oral): LLM self-improvement, self-play for reasoning data
    MultiChallenge (ACL 2025): multi-turn conversation eval benchmark
    Gaps
    No explicit RLHF/DPO/PPO/GRPO papers — post-training via self-improvement, not RL alignment

    Ching-An Cheng

    low hireability

    Senior Research Scientist@Google

    Previously: Principal Researcher @ Microsoft

    Redmond, US

    37
    RLHF / RLVR: 75
    Evals & Reward Models: 55
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 20
    LLM Creativity: 10
    Strengths
    Direct Nash Optimization (2024) — preference optimization for LLMs, 162 citations
    LLF-Bench (ICLR 2025) — benchmark for interactive language feedback evaluation
    Gaps
    No work on personality, tone, humor, or non-verifiable reward modeling

    Chong Ruan

    low hireability

    Researcher@DeepSeek

    Previously: MS student @ Peking University

    49
    RLHF / RLVR: 90
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 60
    Personality & Non-Verifiable Rewards: 10
    LLM Creativity: 5
    Strengths
    DeepSeek-R1 co-author — pioneered GRPO/RLVR for reasoning (5230+ citations)
    Inference-Time Scaling for Generalist Reward Modeling — direct reward model research
    Gaps
    No evidence of consumer/personality post-training or non-verifiable rewards

    Chunting Zhou

    low hireability

    Researcher@Stealth

    Previously: Research Scientist @ Meta

    San Francisco, US

    51
    Synthetic Data & Self-Play: 82
    Personality & Non-Verifiable Rewards: 60
    Evals & Reward Models: 55
    RLHF / RLVR: 40
    LLM Creativity: 20
    Strengths
    LIMA (NeurIPS 2023, 1674 citations) — seminal low-data alignment paper
    Self-Alignment via Instruction Backtranslation — synthetic data for SFT
    Gaps
    Alignment is SFT-first — limited PPO/DPO/GRPO RL post-training evidence

    Conghui He

    low hireability

    Young Leading Scientist@Shanghai AI Lab

    Previously: Researcher @ Sensetime

    Shanghai, CN

    58
    Evals & Reward Models: 88
    Synthetic Data & Self-Play: 75
    Personality & Non-Verifiable Rewards: 60
    RLHF / RLVR: 50
    LLM Creativity: 15
    Strengths
    MMBench (1520 cites) — leading LLM eval benchmark; 41 eval papers
    MinerU: open-source PDF/doc extraction tool, directly relevant to docs wrangling
    Gaps
    No explicit PPO/GRPO/RL rollout work — more data-quality focused than RL-optimization focused

    Corentin Tallec

    low hireability

    Research Scientist@DeepMind

    Previously: PhD Student @ Laboratoire de Recherche en Informatique

    Paris, FR

    15
    RLHF / RLVR: 35
    LLM Creativity: 20
    Evals & Reward Models: 8
    Personality & Non-Verifiable Rewards: 8
    Synthetic Data & Self-Play: 5
    Strengths
    Gemini 2.5 co-author (2025) — on frontier LLM team at Google DeepMind
    2025 code-gen patent — practical LLM knowledge post-training alignment
    Gaps
    No direct RLHF, DPO, or PPO for LLM post-training work documented

    Daxin Jiang

    low hireability

    Co-Founder & CEO@StepFun

    Previously: Vice President @ Microsoft

    Beijing, CN

    59
    RLHF / RLVR: 90
    Synthetic Data & Self-Play: 85
    Evals & Reward Models: 75
    Personality & Non-Verifiable Rewards: 25
    LLM Creativity: 20
    Strengths
    Open-Reasoner-Zero (2025): open-source base model RL — RLVR at scale
    WizardLM (1057 citations): Evol-Instruct synthetic instruction data pipeline
    Gaps
    No published work on personality, humor, or non-VR subjective reward modeling

    Deep Ganguli

    low hireability

    Member Of Technical Staff@Anthropic

    Previously: Research Director @ Stanford University

    US

    71
    RLHF / RLVR: 95
    Personality & Non-Verifiable Rewards: 90
    Evals & Reward Models: 88
    Synthetic Data & Self-Play: 55
    LLM Creativity: 25
    Strengths
    RLHF paper co-author (Anthropic, 2022) — 3059 citations, defining work
    Constitutional AI (RLAIF) co-author — canonical non-verifiable reward modeling
    Gaps
    No evidence of synthetic self-play or persona-conditioned data generation

    Dejian Yang

    low hireability

    Researcher@DeepSeek AI

    Previously: Researcher @ Microsoft

    40
    RLHF / RLVR: 95
    Evals & Reward Models: 60
    Synthetic Data & Self-Play: 35
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    DeepSeek-R1: pure RL on reasoning — foundational RLVR expertise
    DeepSeek-Prover-V2: RL for subgoal decomposition (formal math RLVR)
    Gaps
    No work on personality, tone, or non-verifiable reward modeling

    Dinghuai Zhang

    low hireability

    Senior Researcher@Microsoft

    Previously: Intern @ Apple

    Beijing, CN

    32
    RLHF / RLVR: 82
    Evals & Reward Models: 38
    Synthetic Data & Self-Play: 22
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 8
    Strengths
    FlowRL (2025): reward distribution matching for LLM reasoning RL
    Rollout-Training Mismatch paper — RL stability/efficiency for LLMs
    Gaps
    No consumer post-training work: personality, creativity, humor, non-verifiable rewards

    DJ Strouse

    low hireability

    Member of Technical Staff@ReflectionAI

    Previously: PhD Student @ University of Michigan

    New York, US

    50
    RLHF / RLVR: 88
    Evals & Reward Models: 68
    Synthetic Data & Self-Play: 45
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 10
    Strengths
    Constrained RLHF (ICLR 2024 Spotlight) — reward overoptimization direct contribution
    Direct Nash Optimization (2024) — LLM self-improvement via general preferences
    Gaps
    No visible work on personality, tone, creativity, or non-verifiable rewards

    Edward Grefenstette

    low hireability

    Director of Research, Frontier AI Board Member, and Assistants Program Area Lead@DeepMind

    Previously: Head of Machine Learning @ Cohere

    London, GB

    50
    RLHF / RLVR: 80
    Personality & Non-Verifiable Rewards: 65
    Evals & Reward Models: 50
    LLM Creativity: 30
    Synthetic Data & Self-Play: 25
    Strengths
    Assistants Program Area Lead at DeepMind — leads LLM post-training org
    'Understanding the Effects of RLHF on LLM Generalisation and Diversity' (2023)
    Gaps
    No explicit synthetic data or self-play pipeline work visible

    Eric Hartford

    low hireability
    71
    Personality & Non-Verifiable Rewards: 85
    Synthetic Data & Self-Play: 80
    LLM Creativity: 75
    Evals & Reward Models: 60
    RLHF / RLVR: 55
    Strengths
    Dolphin series: production SFT post-training on Qwen2/Llama3/Gemma2, millions of downloads
    Samantha model — personality-conditioned AI companion for consumer LLM use cases
    Gaps
    Founder/CEO of Quixi AI — actively building own company, availability low

    Eric Michael Smith

    low hireability

    Research Scientist, Generative AI@Meta

    Previously: Research Engineer @ Meta

    New York, US

    61
    Personality & Non-Verifiable Rewards: 82
    Evals & Reward Models: 65
    RLHF / RLVR: 62
    LLM Creativity: 58
    Synthetic Data & Self-Play: 38
    Strengths
    Llama 2 & 3 co-author — chat fine-tuning with RLHF at billion-parameter scale
    Empathetic conversation (1.4K cites) — emotional and implicit understanding in dialogue
    Gaps
    Specific RLHF sub-role in Llama 2/3 unclear — likely on responsible AI/safety side vs. RL training

    Eric Mitchell

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Machine Learning Research Engineer @ Samsung

    San Francisco, US

    69
    RLHF / RLVR: 97
    Personality & Non-Verifiable Rewards: 85
    Evals & Reward Models: 82
    Synthetic Data & Self-Play: 60
    LLM Creativity: 22
    Strengths
    Co-leads Post-training Frontiers at OpenAI (o1, o3, GPT-5-Thinking)
    DPO author — foundational preference optimization paper (2023)
    Gaps
    No published work specifically on creative writing or roleplay post-training

    Eric Wallace

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Doctoral Student @ University of California, Berkeley

    San Francisco, US

    36
    RLHF / RLVR: 72
    Evals & Reward Models: 60
    Personality & Non-Verifiable Rewards: 25
    Synthetic Data & Self-Play: 20
    LLM Creativity: 5
    Strengths
    Co-leads OpenAI Alignment Training — direct RLHF/post-training leadership
    The Instruction Hierarchy (2024): post-training for instruction following
    Gaps
    No published work on RLVR, tool use, or agentic tasks

    Ethan Perez

    low hireability

    Research Scientist@Anthropic

    Previously: Research Advisor @ New York University

    San Francisco, US

    70
    RLHF / RLVR: 95
    Personality & Non-Verifiable Rewards: 85
    Evals & Reward Models: 80
    Synthetic Data & Self-Play: 65
    LLM Creativity: 25
    Strengths
    'Pretraining LMs with Human Preferences' — RLHF post-training research
    'Towards Understanding Sycophancy' (527 cit.) — non-VR reward design
    Gaps
    No direct work on persona-conditioned SFT data pipelines or self-play data gen

    Fandong Meng

    low hireability

    Senior Researcher@Tencent

    Previously: PhD Student @ Institute of Computing Technology, Chinese Academy of Sciences

    Beijing, CN

    55
    Evals & Reward Models: 82
    RLHF / RLVR: 75
    Personality & Non-Verifiable Rewards: 55
    Synthetic Data & Self-Play: 45
    LLM Creativity: 20
    Strengths
    RewardAnything (2025): principle-following reward models — direct RM hit
    GRAM-R (2025): self-training foundation reward model for reasoning
    Gaps
    No RLVR/verifiable-reward RL (code, tool) work found — mostly preference/RM

    Fei Huang

    low hireability

    Chief Scientist and Senior Director of Language Technologies Lab@DAMO Academy

    Previously: VP of Security Strategy @ SUSE

    San Francisco, US

    82
    Personality & Non-Verifiable Rewards: 90
    RLHF / RLVR: 88
    Evals & Reward Models: 85
    LLM Creativity: 75
    Synthetic Data & Self-Play: 72
    Strengths
    VP Alibaba Cloud — heads the Qwen Language Tech Lab (post-training team)
    "Editing Personality for LLMs" (2024) — direct personality post-training research
    Gaps
    Very senior VP/executive — unlikely to join as IC researcher

    Fenia Christopoulou

    low hireability

    Member of Engineering (Applied Research)@poolside

    Previously: NLP Research Scientist @ Huawei

    Paris, FR

    32
    RLHF / RLVR: 72
    Evals & Reward Models: 35
    Personality & Non-Verifiable Rewards: 28
    Synthetic Data & Self-Play: 15
    LLM Creativity: 10
    Strengths
    SparsePO (EMNLP 2025): sparse token-level preference optimization for LLMs
    RL for reasoning at Poolside AI — active applied post-training work
    Gaps
    Only ~6 months at Poolside — recent hire, low hireability

    Florian Strub

    low hireability

    Head of RLVR and Post-training engineering@Cohere

    Previously: Co-head of Command A and Command R7B Post-training @ Cohere

    Paris, FR

    51
    RLHF / RLVR: 96
    Evals & Reward Models: 78
    Synthetic Data & Self-Play: 38
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 12
    Strengths
    Head of RLVR and Post-training engineering at Cohere — exact JD match
    Co-led Command A and R7B post-training — production-scale RLHF experience
    Gaps
    Limited work on personality adherence or non-verifiable reward modeling

    Furu Wei

    low hireability

    Chief Scientist@Microsoft

    Previously: Partner Research Manager @ Microsoft

    Beijing, CN

    60
    RLHF / RLVR: 88
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 82
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 15
    Strengths
    Reward Reasoning Model (NeurIPS 2025) — reward modeling for post-training
    Preference Optimization with Pseudo Feedback (2025) — RLVR/DPO at scale
    Gaps
    No direct work on personality design, humor, or non-verifiable consumer rewards

    Gabriel Synnaeve

    low hireability

    Research Scientist@Meta

    Previously: Postdoctoral Fellow @ Meta

    Paris, FR

    50
    RLHF / RLVR: 88
    Synthetic Data & Self-Play: 85
    Evals & Reward Models: 68
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    SWE-RL (Feb 2025): verifiable-reward RL for SE; 41% SWE-bench Verified
    Self-play SWE-RL (Dec 2025): self-play synthetic data gen without human labels
    Gaps
    Zero work on personality, humor, tone, or non-verifiable reward modeling

    Gerasimos Lampouras

    low hireability

    Principal Research Scientist@Huawei

    Previously: Research Associate @ University of Cambridge

    London, GB

    48
    RLHF / RLVR: 75
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 10
    Strengths
    SparsePO (ICLR 2025) — token-level preference optimization, DPO variant
    Code-Optimise (2024) — self-generated preference data for code SFT
    Gaps
    No published work on personality design or non-verifiable reward modeling

    Hang Li

    low hireability

    Head of Research@ByteDance

    Previously: Director of AI Lab @ ByteDance

    Beijing, CN

    39
    RLHF / RLVR: 72
    Evals & Reward Models: 50
    Synthetic Data & Self-Play: 35
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 10
    Strengths
    ReFT (2024, 188 citations) — RL-based reasoning fine-tuning, core RLVR
    AGILE (2024) — RL framework for LLM tool-use agents
    Gaps
    No direct work on personality/humor/creativity non-verifiable rewards

    Haodong Duan

    low hireability

    Postdoctoral Researcher, Young Scientist@Shanghai AI Laboratory

    Previously: Applied Scientist Intern @ Amazon

    Hong Kong, HK

    55
    Evals & Reward Models: 93
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 62
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 8
    Strengths
    InternLM-XComposer2.5-Reward — multi-modal reward model (ACL Findings 2025)
    Visual-RFT (239 citations) — RLVR applied to vision-language tasks
    Gaps
    No evidence of personality, humor, or subjective-quality reward modeling

    Haoxiang Wang

    low hireability

    Research Scientist@Luma AI

    Previously: Research Scientist @ NVIDIA

    San Francisco, US

    50
    RLHF / RLVR: 88
    Evals & Reward Models: 82
    Synthetic Data & Self-Play: 48
    Personality & Non-Verifiable Rewards: 28
    LLM Creativity: 5
    Strengths
    RLHF Workflow (TMLR 2024) — end-to-end online RLHF recipe paper
    ArmoRM: multi-objective MoE reward model integrated into RewardBench
    Gaps
    No personality, creativity, or non-verifiable reward modeling work found

    Harrison Edwards

    low hireability

    Research Scientist@DeepMind

    Previously: Research Scientist @ OpenAI

    London, GB

    54
    Evals & Reward Models: 90
    RLHF / RLVR: 88
    Synthetic Data & Self-Play: 48
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 10
    Strengths
    "Let's Verify Step by Step" (ICLR 2024) — authored PRM800K process reward dataset
    "Prover-Verifier Games" (2025) — adversarial RL training for verifiable LLM outputs
    Gaps
    Only ~7 months into DeepMind role — very low hireability

    Harshit Sikchi

    low hireability

    Researcher@OpenAI

    Previously: Graduate Research Assistant @ The University of Texas at Austin

    San Francisco, US

    44
    RLHF / RLVR: 85
    Evals & Reward Models: 78
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 20
    LLM Creativity: 5
    Strengths
    CPL: preference learning without RL (ICLR 2024, 119 citations)
    Scaling Laws for Reward Model Overoptimization (NeurIPS 2024, 85 citations)
    Gaps
    No personality, humor, or non-verifiable reward modeling work

    Hongcheng Gao

    low hireability

    Incoming PhD student@College of AI at Tsinghua University

    Previously: Intern @ Tsinghua University

    Beijing, CN

    26
    RLHF / RLVR: 70
    Evals & Reward Models: 38
    Synthetic Data & Self-Play: 10
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Kimi k1.5 (741 citations): RL scaling for LLMs, direct RLVR evidence
    Kimi k2/k2.5: agentic intelligence, tool use alignment at scale
    Gaps
    No consumer post-training work: personality, creativity, humor absent

    Hongkun Yu

    low hireability

    Principal Engineer@DeepMind

    Previously: Senior Staff Software Engineer @ Google

    San Francisco, US

    55
    RLHF / RLVR: 72
    Personality & Non-Verifiable Rewards: 68
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 65
    LLM Creativity: 5
    Strengths
    Conditional Language Policy (2024): steerable multi-objective finetuning for LLMs
    TIR-Judge (ICLR 2026): RLVR + tool-integrated RL for LLM evaluation
    Gaps
    No evidence of creative writing, roleplay, or humor/personality-focused post-training

    HUANG Fei

    low hireability
    31
    Evals & Reward Models: 70
    RLHF / RLVR: 40
    Personality & Non-Verifiable Rewards: 20
    Synthetic Data & Self-Play: 15
    LLM Creativity: 10
    Strengths
    Qwen3Guard (arXiv:2510.14276): safety reward model for Qwen3 LLMs
    Qwen3-4B-SafeRL: RL fine-tuning using guard model as reward signal
    Gaps
    Reward modeling is safety/content moderation, not personality or creativity

    Hyung Won Chung

    low hireability

    AI Research Scientist@Meta

    Previously: Research Scientist @ OpenAI

    San Francisco, US

    67
    RLHF / RLVR: 92
    Evals & Reward Models: 82
    Synthetic Data & Self-Play: 72
    Personality & Non-Verifiable Rewards: 65
    LLM Creativity: 25
    Strengths
    o1 System Card co-author — frontier RLVR / reasoning RL work at OpenAI
    Deliberative Alignment — scalable reward modeling for non-verifiable policy adherence
    Gaps
    No direct work on personality adherence, humor, or creative roleplay data

    Jack Hessel

    low hireability

    Member of Technical Staff@Anthropic

    Previously: Founding Researcher @ Samaya AI

    Seattle, US

    81
    LLM Creativity: 88
    Personality & Non-Verifiable Rewards: 82
    Synthetic Data & Self-Play: 80
    Evals & Reward Models: 78
    RLHF / RLVR: 75
    Strengths
    RL4LMs (2022): RLHF benchmarks + baselines for NLP post-training
    SODA: 1M-scale social dialogue distillation — production synthetic data
    Gaps
    ~8 months at Anthropic — low near-term hireability

    Jane Yu

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Research Scientist @ Meta

    San Francisco, US

    59
    RLHF / RLVR: 75
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 70
    Personality & Non-Verifiable Rewards: 55
    LLM Creativity: 25
    Strengths
    Toolformer (2550 cit): LLMs teaching themselves to use tools
    Teaching LLMs to Reason with RL — RLVR / PPO for reasoning (2024)
    Gaps
    Likely recent hire at Meta AI — low near-term hireability

    Jan Hendrik Kirchner

    low hireability

    Researcher@Anthropic

    Previously: Researcher @ OpenAI

    31
    RLHF / RLVR: 68
    Evals & Reward Models: 52
    Synthetic Data & Self-Play: 20
    Personality & Non-Verifiable Rewards: 12
    LLM Creativity: 5
    Strengths
    Weak-to-Strong Generalization (404 citations) — RLHF scalable oversight
    Prover-Verifier Games — game-theoretic verifier/reward training
    Gaps
    No work on personality, humor, sarcasm, or non-verifiable creative rewards

    Jan Leike

    low hireability

    Lead of the Alignment Science team@Anthropic

    Previously: Co-lead of the Superalignment Team @ OpenAI

    57
    RLHF / RLVR: 95
    Evals & Reward Models: 88
    Personality & Non-Verifiable Rewards: 55
    Synthetic Data & Self-Play: 30
    LLM Creativity: 15
    Strengths
    InstructGPT RLHF paper co-author — 20K+ citations
    Deep RL from Human Preferences (2017) — reward learning pioneer
    Gaps
    No published work on personality/humor/creativity reward modeling

    Jared Kaplan

    low hireability

    Anthropic

    56
    RLHF / RLVR: 90
    Evals & Reward Models: 80
    Personality & Non-Verifiable Rewards: 80
    Synthetic Data & Self-Play: 20
    LLM Creativity: 10
    Strengths
    'Training helpful & harmless assistant with RLHF' (2022, 2,977 citations) — foundational RLHF
    Constitutional AI / RLAIF (2022, 2,230 citations) — non-verifiable reward modeling
    Gaps
    Co-founder of Anthropic — extremely unlikely to leave his own company

    Jason E Weston

    low hireability

    Research Scientist@Meta

    Previously: Researcher @ Meta

    New York, US

    81
    RLHF / RLVR: 95
    Evals & Reward Models: 92
    Synthetic Data & Self-Play: 88
    Personality & Non-Verifiable Rewards: 85
    LLM Creativity: 45
    Strengths
    Self-Rewarding LMs (2024, 548 cit) — foundational RLHF reward modeling
    Meta-Rewarding LMs (2025) — LLM-as-meta-judge, self-improving alignment
    Gaps
    12 years at Meta FAIR — entrenched senior researcher, low mobility signals

    Jason Wei

    low hireability

    Research Scientist@Meta

    Previously: Research Scientist @ OpenAI

    San Francisco, US

    49
    RLHF / RLVR: 85
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 50
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 15
    Strengths
    FLAN (Finetuned LMs are Zero-Shot Learners) — instruction tuning at scale, 4.8K citations
    Chain-of-thought prompting — core post-training reasoning technique, 22K citations
    Gaps
    No direct published work on personality, humor, or non-verifiable reward modeling

    Jianfeng Gao

    low hireability

    Distinguished Scientist & Vice President@Microsoft

    Previously: Partner Research Manager in Business AI @ Microsoft

    Woodinville, US

    58
    RLHF / RLVR: 80
    Synthetic Data & Self-Play: 75
    Evals & Reward Models: 70
    Personality & Non-Verifiable Rewards: 45
    LLM Creativity: 20
    Strengths
    'RL for Reasoning in LLMs' (2025, 80 citations) — RLVR core work
    'FlowRL' reward distribution matching for LLM reasoning (ICLR 2026)
    Gaps
    No specific work on personality, humor, or sarcasm detection

    John Schulman

    low hireability

    cofounder and chief scientist@Thinking Machines

    Previously: researcher on the Alignment Science team @ Anthropic

    63
    RLHF / RLVR: 100
    Evals & Reward Models: 92
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 45
    LLM Creativity: 15
    Strengths
    PPO inventor (31K citations) — RLHF backbone algorithm
    InstructGPT co-author — pioneered RLHF for LLMs
    Gaps
    Co-founder of Thinking Machines — very low recruiting likelihood

    Joykirat Singh

    low hireability

    Research Assistant@University of North Carolina at Chapel Hill

    Previously: Research Fellow @ Microsoft

    Chapel Hill, US

    38
    RLHF / RLVR: 75
    Synthetic Data & Self-Play: 60
    Evals & Reward Models: 40
    Personality & Non-Verifiable Rewards: 10
    LLM Creativity: 5
    Strengths
    Agentic RL tool-use paper (2505.01441) — RL for LLM tool calls, 34 citations
    Self-evolved DPO — self-play preference optimization for small models
    Gaps
    No work on personality, tone, humor, or non-verifiable reward modeling

    Junlong Li

    low hireability

    Ph.D. Student@HKUST

    Previously: Lecturer @ Shanghai Jiao Tong University

    Hong Kong, HK

    63
    Evals & Reward Models: 90
    RLHF / RLVR: 85
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 60
    LLM Creativity: 15
    Strengths
    DeepSeek-R1 co-author — RLVR at massive scale (4530 citations)
    Generative Judge (ICLR 2024) — reward model for alignment evaluation
    Gaps
    Just started PhD at HKUST (Sep 2025) — low near-term hireability

    Junting Pan

    low hireability

    Research Scientist@Apple

    Previously: Research Scientist Intern @ Meta

    San Francisco, US

    26
    RLHF / RLVR: 65
    Evals & Reward Models: 42
    Synthetic Data & Self-Play: 15
    LLM Creativity: 5
    Personality & Non-Verifiable Rewards: 5
    Strengths
    Step-Controlled DPO (ICLR 2025) — direct preference optimization for reasoning
    SpiritSight Agent — GUI/computer use; agentic knowledge post-training
    Gaps
    No personality or consumer post-training work (humor, creativity, sarcasm)

    Kai Chen

    low hireability

    Research Scientist & Head of Large Model Center@Shanghai AI Laboratory

    Previously: Director @ SenseTime

    Shanghai, CN

    60
    Evals & Reward Models: 92
    RLHF / RLVR: 80
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 50
    LLM Creativity: 25
    Strengths
    Leads InternLM post-training team — direct match to Neuralace's use case
    InternLM2 tech report (497 cites) — RLHF + SFT pipeline at scale
    Gaps
    No work on consumer personality, humor, or creative writing post-training

    Kaipeng Zhang

    low hireability

    Principal Researcher@Shanda AI Research

    Previously: Researcher @ Shanghai AI Lab

    Shanghai, CN

    38
    Evals & Reward Models: 78
    RLHF / RLVR: 65
    Synthetic Data & Self-Play: 20
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 5
    Strengths
    MM-Eureka (2025, 178 cit): rule-based RLVR — direct post-training RL work
    ProJudge: MLLM process judge dataset — reward model / verifiable eval signal
    Gaps
    No direct consumer post-training work (personality, humor, tone, roleplay, creativity)

    Karl Cobbe

    low hireability

    Research Scientist@OpenAI

    San Francisco, US

    45
    RLHF / RLVR: 90
    Evals & Reward Models: 88
    Synthetic Data & Self-Play: 20
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 5
    Strengths
    Let's Verify Step by Step — seminal process reward model paper
    GSM8K benchmark (5786 citations) — RLVR eval standard
    Gaps
    No published work on personality, humor, or non-verifiable creative rewards

    Leandro Von Werra

    low hireability

    Head of Research@Hugging Face

    Previously: Machine Learning Engineer @ Hugging Face

    Bern, CH

    56
    RLHF / RLVR: 97
    Evals & Reward Models: 70
    Synthetic Data & Self-Play: 65
    Personality & Non-Verifiable Rewards: 40
    LLM Creativity: 10
    Strengths
    TRL library creator — industry-standard RLHF/DPO/GRPO training framework
    'N Implementation Details of RLHF with PPO' — seminal PPO post-training paper
    Gaps
    No evidence of personality/creativity/humor reward modeling work

    Le Hou

    low hireability

    Senior Staff Software Engineer@DeepMind

    Previously: Staff Software Engineer @ DeepMind

    San Francisco, US

    46
    Synthetic Data & Self-Play: 68
    RLHF / RLVR: 65
    Evals & Reward Models: 42
    Personality & Non-Verifiable Rewards: 38
    LLM Creativity: 15
    Strengths
    FLAN Collection: landmark instruction-tuning data/methods work at Google Brain scale
    Conditional Language Policy: multi-objective steerable finetuning framework (2024)
    Gaps
    No explicit PPO/DPO/GRPO reward-modeling publications — primarily SFT/instruction-tuning focused

    Lilian Weng

    low hireability

    Research Scientist@OpenAI

    48
    RLHF / RLVR: 85
    Evals & Reward Models: 80
    Personality & Non-Verifiable Rewards: 50
    Synthetic Data & Self-Play: 15
    LLM Creativity: 10
    Strengths
    Rule-Based Rewards for LM Safety — reward model design with verifiable rules
    Multi-step RL + auto-generated rewards, red teaming (NeurIPS 2024)
    Gaps
    No published work on personality, humor, or consumer NVR reward modeling

    Louis Castricato

    low hireability
    84
    RLHF / RLVR: 95
    LLM Creativity: 85
    Evals & Reward Models: 85
    Personality & Non-Verifiable Rewards: 85
    Synthetic Data & Self-Play: 70
    Strengths
    trlX: built the canonical RLHF training framework (2023, 76 cites)
    Generative Reward Models (2025, 92 cites) — reward modeling for post-training
    Gaps
    CEO of Overworld AI — active startup (PRs March 2026), very low availability

    Luca Soldaini

    low hireability

    Lead Research Scientist@Ai2

    Previously: Senior Research Scientist @ Ai2

    Seattle, US

    46
    RLHF / RLVR: 75
    Synthetic Data & Self-Play: 65
    Evals & Reward Models: 60
    Personality & Non-Verifiable Rewards: 20
    LLM Creativity: 10
    Strengths
    Tulu 3 co-author — RLVR+DPO post-training pipeline, 465 citations
    Led Ai2 OLMo post-training team (2022–early 2026)
    Gaps
    No personality/tone/humor reward modeling in published work

    Michal Valko

    low hireability

    Chief Models Officer, Member of the Founding Team, Member of Technical Staff@Stealth AI Startup

    Previously: Principal Llama Engineer @ Meta

    Paris, FR

    47
    RLHF / RLVR: 95
    Evals & Reward Models: 72
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 25
    LLM Creativity: 15
    Strengths
    Built online RL stack for Llama 3 — hands-on RLHF at scale
    Nash Learning from Human Feedback — ICML 2023 Best Paper
    Gaps
    Co-founded Isara Labs 2025 — very low hireability as active founder

    Mrinank Sharma

    low hireability

    Intern@University of Oxford

    Previously: Research Internship @ Indian Institute of Technology, Delhi

    Oxford, GB

    51
    Personality & Non-Verifiable Rewards: 88
    RLHF / RLVR: 65
    Evals & Reward Models: 65
    Synthetic Data & Self-Play: 30
    LLM Creativity: 8
    Strengths
    Sycophancy paper (542 cit): key work on non-verifiable rewards shaping LLM personality
    Constitutional AI co-author — foundational RLHF alignment methodology
    Gaps
    No visible work on synthetic conversation data gen or self-play pipelines

    Nan Du

    low hireability

    Member of Technical Staff@OpenAI

    Previously: Principal Researcher @ Apple

    San Francisco, US

    55
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 68
    Personality & Non-Verifiable Rewards: 65
    Evals & Reward Models: 50
    LLM Creativity: 20
    Strengths
    FLAN co-author — foundational instruction finetuning (4821 citations)
    ReAct co-author — seminal tool use / agentic reasoning (5045 citations)
    Gaps
    No explicit RLVR / verifiable-reward RL work (code evals, tool verification)

    Nathan Lambert

    low hireability

    Senior Research Scientist@Allen Institute for AI

    Previously: Research Scientist & RLHF Team Lead @ Hugging Face

    Seattle, US

    71
    RLHF / RLVR: 97
    Evals & Reward Models: 95
    Synthetic Data & Self-Play: 80
    Personality & Non-Verifiable Rewards: 65
    LLM Creativity: 20
    Strengths
    Led Tülu 3 RLVR pipeline — open LLM post-training SOTA at AllenAI
    RewardBench author — standard benchmark for reward model evaluation
    Gaps
    Founding SAIL Media — likely not seeking employment

    Olivier Bachem

    low hireability

    Senior Director, Research Scientist@DeepMind

    Previously: Director, Research Scientist @ DeepMind

    Zurich, CH

    73
    RLHF / RLVR: 93
    Evals & Reward Models: 88
    Personality & Non-Verifiable Rewards: 72
    Synthetic Data & Self-Play: 58
    LLM Creativity: 52
    Strengths
    BOND (2025): Best-of-N distillation — directly LLM alignment via reward
    WARM + WARP: reward model weight averaging; production-grade RLHF
    Gaps
    No visible work on tool-call or agentic post-training

    Oyvind Tafjord

    low hireability

    Staff Research Scientist@DeepMind

    Previously: Principal Research Scientist @ Allen Institute for AI

    Seattle, US

    47
    RLHF / RLVR: 75
    Evals & Reward Models: 75
    Synthetic Data & Self-Play: 40
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 15
    Strengths
    Tulu 3 co-author: introduces RLVR for verifiable-reward post-training
    OLMES (NAACL 2025): standardized eval framework for LMs
    Gaps
    No explicit work on personality reward modeling or subjective quality RLHF

    Phil Blunsom

    low hireability

    Chief Technology Officer@Cohere

    Previously: Chief Scientist @ Cohere

    London, GB

    58
    RLHF / RLVR: 88
    Evals & Reward Models: 85
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 42
    LLM Creativity: 22
    Strengths
    "Improving reward models with synthetic critiques" (2025) — RM + synthetic data
    "Uncertainty-Aware Step-wise Verification with Generative Reward Models" (2025)
    Gaps
    No direct evidence of personality-design or creative-writing RL work

    Pramodith Ballapuram

    low hireability
    40
    RLHF / RLVR: 82
    Evals & Reward Models: 52
    Personality & Non-Verifiable Rewards: 30
    Synthetic Data & Self-Play: 20
    LLM Creativity: 15
    Strengths
    33 TRL commits — GRPOTrainer async tool calls, SAPO/CISPO losses (Nov–Jan 2026)
    Async reward functions in GRPO — unlocks non-verifiable LLM-judge rewards
    Gaps
    No evidence of personality / creative post-training or non-verifiable reward design

    Qian Liu

    low hireability

    Member of Technical Staff@xAI

    Previously: Researcher @ TikTok

    Singapore, SG

    51
    RLHF / RLVR: 85
    Evals & Reward Models: 72
    Synthetic Data & Self-Play: 55
    Personality & Non-Verifiable Rewards: 30
    LLM Creativity: 15
    Strengths
    SimpleRL-Zoo (2025): RLVR/zero-RL — investigates RL for open base model reasoning
    SimpleTIR (2025): multi-turn RL for tool-integrated reasoning — aligns with tool-call direction
    Gaps
    Singapore-based — outside stated target locations (USA/Europe/China/India)

    Quoc V Le

    low hireability

    Research Scientist@Google

    Previously: Research Visitor @ Max Planck Institute for Biological Cybernetics

    San Francisco, US

    59
    RLHF / RLVR: 82
    Synthetic Data & Self-Play: 75
    Evals & Reward Models: 62
    Personality & Non-Verifiable Rewards: 55
    LLM Creativity: 20
    Strengths
    FLAN series co-author — instruction fine-tuning at scale
    "SFT Memorizes, RL Generalizes" (2025) — direct RL post-training comparison
    Gaps
    No specific work on personality/humor/creative writing for LLMs

    Rohan Taori

    low hireability

    Member of Technical Staff@Anthropic

    Previously: OSV Fellow @ O'Shaughnessy Ventures

    San Francisco, US

    65
    Synthetic Data & Self-Play: 90
    Evals & Reward Models: 85
    RLHF / RLVR: 82
    Personality & Non-Verifiable Rewards: 52
    LLM Creativity: 15
    Strengths
    AlpacaFarm: RLHF simulation framework, reward modeling without human data
    Stanford Alpaca: canonical self-instruct SFT data pipeline (52K examples)
    Gaps
    No published work on personality, tone, or non-verifiable reward modeling

    Scott R Johnston

    low hireability

    Anthropic

    Previously: Senior Displays Engineer @ Apple

    San Francisco, US

    53
    RLHF / RLVR: 92
    Personality & Non-Verifiable Rewards: 82
    Evals & Reward Models: 70
    Synthetic Data & Self-Play: 12
    LLM Creativity: 10
    Strengths
    Anthropic RLHF paper (2022) — 2,977 citations, foundational post-training work
    'Towards Understanding Sycophancy' — personality adherence and non-VR evals
    Gaps
    No synthetic data / self-play published work

    Sharan Narang

    low hireability

    Director, AI Research@Meta

    Previously: Tech Lead @ Google

    San Francisco, US

    49
    Evals & Reward Models: 80
    RLHF / RLVR: 72
    Synthetic Data & Self-Play: 40
    Personality & Non-Verifiable Rewards: 35
    LLM Creativity: 20
    Strengths
    FLAN (arXiv:2210.11416) — instruction finetuning at scale, core post-training
    Llama 2 co-author — RLHF chat model post-training at Meta
    Gaps
    No standalone RLHF/PPO/DPO/GRPO paper — involvement implicit via Llama 2

    Sharath Chandra Raparthy

    low hireability

    Research Engineer@DeepMind

    Previously: Member of Technical Staff @ Reka AI

    London, GB

    44
    RLHF / RLVR75
    Synthetic Data & Self-Play62
    Evals & Reward Models55
    Personality & Non-Verifiable Rewards20
    LLM Creativity10
    Strengths
    Llama 3 tool-use + math reasoning post-training — shipped at scale
    Rainbow Teaming: adversarial synthetic LLM data generation (NeurIPS 2024)
    Gaps
    No personality, creativity, or non-verifiable reward modeling work
    SH

    Shengyi Huang

    low hireability

    Researcher@Allen Institute for Artificial Intelligence

    Previously: Researcher @ Hugging Face

    49
    RLHF / RLVR97
    Evals & Reward Models72
    Synthetic Data & Self-Play45
    Personality & Non-Verifiable Rewards20
    LLM Creativity10
    Strengths
    Tulu 3 co-author: AllenAI RLVR + DPO post-training pipeline
    CleanRL author — canonical PPO implementations, 10K+ GitHub stars
    Gaps
    No published work on personality, tone, humor, or non-verifiable reward shaping
    SY

    Shunyu Yao

    low hireability

    Research Scientist@OpenAI

    Previously: Research Intern @ Sierra

    San Francisco, US

    38
    Evals & Reward Models82
    RLHF / RLVR50
    Synthetic Data & Self-Play40
    Personality & Non-Verifiable Rewards10
    LLM Creativity8
    Strengths
    tau-bench: SOTA Tool-Agent-User eval benchmark (2024, 204 cites)
    Reflexion: verbal RL / self-improvement for agents (3360 cites)
    Gaps
    No work on personality, humor, or non-verifiable reward modeling
    SZ

    Songyang Zhang

    low hireability

    Young Scientist@Shanghai AI Laboratory

    Previously: Postdoctoral Researcher @ ShanghaiTech University

    Shanghai, CN

    62
    Evals & Reward Models95
    RLHF / RLVR78
    Synthetic Data & Self-Play58
    Personality & Non-Verifiable Rewards50
    LLM Creativity28
    Strengths
    OpenCompass creator — leading LLM eval platform (372 citations)
    CompassJudger-1/2 — generalist judge/reward models, verifiable + subjective
    Gaps
    Just started at Tencent Hunyuan (~2 months) — very low hireability window
    TE

    Teknium

    low hireability
    66
    Synthetic Data & Self-Play90
    Personality & Non-Verifiable Rewards80
    LLM Creativity78
    RLHF / RLVR55
    Evals & Reward Models28
    Strengths
    OpenHermes-2.5: ~1M synthetic SFT samples, 1290+ models trained on it
    GPTeacher: roleplay + toolformer datasets, dual consumer + knowledge signal
    Gaps
    Co-founder at NousResearch — low hireability
    TL

    Tianle Li

    low hireability

    Member of Technical Staff@xAI

    Previously: full-time team member @ Nexusflow

    60
    Evals & Reward Models90
    RLHF / RLVR88
    Synthetic Data & Self-Play62
    Personality & Non-Verifiable Rewards42
    LLM Creativity18
    Strengths
    Led RL post-training + recipe studies for Grok 4.1 and Grok 4.2 at xAI
    Grok 4: synthetic datasets, tool-use training, evals — directly query-aligned
    Gaps
    No personality adherence or creative writing work (Consumer Post-Training gap)
    TL

    Tianyu Liu

    low hireability

    Researcher@Alibaba

    Previously: Senior Researcher @ Tencent

    Beijing, CN

    53
    RLHF / RLVR83
    Evals & Reward Models82
    Synthetic Data & Self-Play48
    Personality & Non-Verifiable Rewards42
    LLM Creativity10
    Strengths
    Alibaba Qwen team staff researcher — direct Qwen post-training experience
    ACL 2025: scalable long-CoT RL — RLVR post-training at production scale
    Gaps
    Consumer post-training gap — no personality/humor/roleplay/creativity work
    TS

    Timo Schick

    low hireability

    Member of Technical Staff@Microsoft

    Previously: Member of Technical Staff @ Microsoft

    San Francisco, US

    46
    Synthetic Data & Self-Play80
    Evals & Reward Models45
    RLHF / RLVR35
    LLM Creativity35
    Personality & Non-Verifiable Rewards35
    Strengths
    Toolformer (2023, 2599 cit.) — LLMs teaching themselves tool/API use
    DINO: Generating Datasets with PLMs (264 cit.) — direct synthetic data pipeline work
    Gaps
    No direct RLHF/PPO/DPO/GRPO work — tool-calling trained via SFT not RL
    WX

    Wei Xiong

    low hireability

    Senior Research Scientist@NVIDIA

    Previously: Research Scientist @ Adobe

    San Francisco, US

    60
    RLHF / RLVR95
    Evals & Reward Models85
    Synthetic Data & Self-Play65
    Personality & Non-Verifiable Rewards45
    LLM Creativity10
    Strengths
    RAFT paper (583 citations) — reward-ranked SFT data selection, core post-training
    RLHF Workflow paper (248 citations) — end-to-end online RLHF recipe
    Gaps
    No work on personality, humor, or subjective non-verifiable reward modeling
    WL

    Wing Lian

    low hireability
    42
    RLHF / RLVR80
    Evals & Reward Models55
    Personality & Non-Verifiable Rewards35
    Synthetic Data & Self-Play25
    LLM Creativity15
    Strengths
    Axolotl: 1,418 commits — DPO, GRPO, KTO, ORPO, reward modeling all supported
    13 commits to huggingface/trl — core RLHF library
    Gaps
    No academic research output — engineer/builder, not researcher
    XC

    Xinyun Chen

    low hireability

    AI Research Scientist@Meta

    Previously: Staff Research Scientist @ DeepMind

    37
    Evals & Reward Models75
    RLHF / RLVR65
    Synthetic Data & Self-Play35
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    AlphaCode (1774 citations) — competition-level code gen with test-based RLVR filtering
    Teaching LLMs to Self-Debug — code verification and post-hoc correction
    Gaps
    No personality, tone, or non-verifiable reward work
    XZ

    Xizhou Zhu

    low hireability

    Researcher@Shanghai AI Laboratory

    Previously: Researcher @ SenseTime

    35
    Evals & Reward Models65
    RLHF / RLVR55
    Synthetic Data & Self-Play40
    Personality & Non-Verifiable Rewards8
    LLM Creativity5
    Strengths
    ZeroGUI: online RL for GUI agents at zero human annotation cost
    VisualPRM: process reward model + VisualPRM400K eval dataset
    Gaps
    All post-training work is multimodal (vision-language), not text-only LLM
    XR

    Xuancheng Ren

    low hireability

    Researcher@Alibaba

    Previously: PhD student @ Peking University

    CN

    55
    RLHF / RLVR78
    Evals & Reward Models68
    Synthetic Data & Self-Play62
    Personality & Non-Verifiable Rewards38
    LLM Creativity28
    Strengths
    #1 contributor to QwenLM/Qwen3 (108 commits) — core team
    Qwen2.5 post-training: multistage RL + 1M+ SFT samples
    Gaps
    No direct evidence of personality/creativity post-training work
    XP

    Xuehai Pan

    low hireability

    Code Engineer of Agent/RL Infra@DeepSeek AI

    Previously: Technical Staff @ Moonshot AI

    Beijing, CN

    55
    RLHF / RLVR90
    Evals & Reward Models72
    Synthetic Data & Self-Play65
    Personality & Non-Verifiable Rewards45
    LLM Creativity5
    Strengths
    88 commits to safe-rlhf — primary RLHF framework implementer
    BeaverTails: human-preference dataset for reward model training
    Gaps
    No evidence of personality/humor/creativity reward modeling
    YM

    Yunlin Mao

    low hireability
    31
    Evals & Reward Models82
    RLHF / RLVR45
    Synthetic Data & Self-Play18
    LLM Creativity5
    Personality & Non-Verifiable Rewards5
    Strengths
    400+ merged PRs to modelscope/evalscope — core maintainer
    TIR-Bench & SWE-Smith evals — tool-calling and code agent evaluation
    Gaps
    No evidence of personality, creativity, or non-verifiable reward work
    YQ

    Yu Qiao

    low hireability

    Principal Researcher@Shanghai AI Laboratory

    Previously: Professor @ Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences

    Shenzhen, CN

    52
    Evals & Reward Models85
    RLHF / RLVR75
    Synthetic Data & Self-Play50
    Personality & Non-Verifiable Rewards30
    LLM Creativity20
    Strengths
    VisualPRM: process reward model for multimodal reasoning (73 cites, 2025)
    VideoChat-R1: RL fine-tuning / RLVR applied to video MLLMs (93 cites)
    Gaps
    No explicit work on personality, humor, or consumer-facing non-verifiable reward modeling
    YW

    Yu Wu

    low hireability

    Head of LLM Alignment Team@DeepSeek AI

    Previously: Senior Researcher @ Microsoft

    49
    RLHF / RLVR95
    Evals & Reward Models88
    Synthetic Data & Self-Play40
    Personality & Non-Verifiable Rewards15
    LLM Creativity8
    Strengths
    DeepSeek-R1: applied GRPO at scale for RLVR reasoning (5,230 citations)
    Math-Shepherd: process reward model without human annotations (593 citations)
    Gaps
    No published work on personality, tone, or non-verifiable reward modeling
    ZH

    Zac Hatfield-Dodds

    low hireability

    Member of Technical Staff@Anthropic

    Previously: unknown @ Autonomy, Agency, and Assurance Institute

    San Francisco, US

    51
    RLHF / RLVR80
    Evals & Reward Models78
    Personality & Non-Verifiable Rewards72
    Synthetic Data & Self-Play18
    LLM Creativity8
    Strengths
    "Training HH Assistant with RLHF" — co-author, 3,057 citations, foundational post-training paper
    Sycophancy paper (ICLR 2024): analyzes how RLHF produces non-truthful preference alignment
    Gaps
    No evidence of RLVR, verifiable-reward RL, or tool-call training
    ZD

    Zhengxiao Du

    low hireability

    Tech Lead@ZhipuAI

    Previously: Research Intern @ Beijing Academy of Artificial Intelligence

    Beijing, CN

    60
    RLHF / RLVR88
    Evals & Reward Models75
    Synthetic Data & Self-Play72
    Personality & Non-Verifiable Rewards42
    LLM Creativity22
    Strengths
    ChatGLM-RLHF (2024): production PPO alignment pipeline for 30B+ model
    Does RLHF Scale? (2025): empirical RLHF scaling across data/model/method
    Gaps
    Hireability low: ~9 months into new senior role at ZhipuAI

    Runs

    #1 · completed · 0 qualified / 0 found · May 7, 1:30 PM