People often ask me to recommend classic papers on AI Agents and large models. Here is a list of papers that I have found genuinely enlightening, which can serve as a reading list.

Most of these papers were published just this year, but the list also includes some classic papers on large text models and on image/video generation models. Understanding these classics is key to understanding large models.

If you read all of these papers, even if you only grasp the core ideas, I guarantee you will no longer be just a prompt engineer: you will be able to hold in-depth discussions with professional large-model researchers.

2024 Update

Here is a selection of papers from this year (2024); this part of the list is still being updated.

AI Infra

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (Multi-head Latent Attention and Mixture-of-Experts) https://arxiv.org/pdf/2405.04434
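
Since MoE comes up again further down this list (GLaM, Mixtral), it is worth seeing how small the core routing idea is. Below is a minimal top-k routing sketch of my own, not DeepSeek-V2's actual layer, which adds shared experts, load-balancing losses, and Multi-head Latent Attention on top:

```python
import numpy as np

def moe_layer(x, gate_w, experts, top_k=2):
    """Toy top-k Mixture-of-Experts forward pass for one token.
    x:       (d,) token hidden state
    gate_w:  (d, n_experts) router weights
    experts: list of callables, each a small FFN (d,) -> (d,)
    """
    logits = x @ gate_w                        # router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-k experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts
    # Only the chosen experts run -- this is where the compute savings come from.
    return sum(w * experts[i](x) for w, i in zip(weights, top))

# Usage: 4 tiny "experts", each token routed to the best 2.
rng = np.random.default_rng(0)
d, n = 16, 4
experts = [lambda x, W=rng.normal(size=(d, d)): np.tanh(x @ W) for _ in range(n)]
gate_w = rng.normal(size=(d, n))
y = moe_layer(rng.normal(size=d), gate_w, experts)
```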

Mooncake: Kimi’s KVCache-centric Architecture for LLM Serving (Prefix Cache and Prefill/Decode Split) https://arxiv.org/abs/2407.00079v1

DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf
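
The reason Mooncake and DistServe both split prefill from decode: prefill runs the whole prompt in one parallel, compute-bound pass and fills the KV cache, while decode emits one token at a time, memory-bandwidth-bound, reusing that cache. A toy single-head sketch of the two phases (my own illustration, with random weights standing in for a real model):

```python
import numpy as np

def attend(q, K, V):
    s = K @ q / np.sqrt(q.size)
    p = np.exp(s - s.max()); p /= p.sum()
    return p @ V

rng = np.random.default_rng(0)
d = 16
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

# Prefill: one parallel pass over all prompt tokens builds the KV cache.
prompt = rng.normal(size=(8, d))               # 8 prompt-token embeddings
K_cache, V_cache = prompt @ Wk, prompt @ Wv    # computed once, reused below

# Decode: each step appends one token's K/V, then attends over the cache.
x = rng.normal(size=d)                         # embedding of the latest token
for _ in range(4):                             # generate 4 tokens
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    x = attend(x @ Wq, K_cache, V_cache)       # stand-in for the next embedding
```

Both systems then run the two phases on separate GPU pools and ship the KV cache between them, because the optimal batching and hardware profiles of the two phases differ.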

Optimizing AI Inference at Character.AI https://research.character.ai/optimizing-inference/

Original Text from December 2023

More Interesting AI Agents

Generative Agents: Interactive Simulacra of Human Behavior https://arxiv.org/abs/2304.03442

RoleLLM: Benchmarking, Eliciting, and Enhancing Role-Playing Abilities of Large Language Models https://arxiv.org/abs/2310.00746

Role play with large language models https://www.nature.com/articles/s41586-023-06647-8

Exploring Large Language Models for Communication Games: An Empirical Study on Werewolf https://arxiv.org/abs/2309.04658

MemGPT: Towards LLMs as Operating Systems https://arxiv.org/abs/2310.08560

Augmenting Language Models with Long-Term Memory https://arxiv.org/abs/2306.07174

Do LLMs Possess a Personality? Making the MBTI Test an Amazing Evaluation for Large Language Models https://arxiv.org/pdf/2307.16180.pdf

More Useful AI Agents

The Rise and Potential of Large Language Model Based Agents: A Survey https://arxiv.org/abs/2309.07864

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework https://arxiv.org/abs/2308.00352

Communicative Agents for Software Development https://arxiv.org/pdf/2307.07924.pdf

Large Language Models Can Self-Improve https://arxiv.org/abs/2210.11610

Evaluating Human-Language Model Interaction https://arxiv.org/abs/2212.09746

Large Language Models can Learn Rules https://arxiv.org/abs/2310.07064

AgentBench: Evaluating LLMs as Agents https://arxiv.org/abs/2308.03688

WebArena: A Realistic Web Environment for Building Autonomous Agents https://arxiv.org/abs/2307.13854

TableGPT: Towards Unifying Tables, Nature Language and Commands into One GPT https://arxiv.org/abs/2307.08674

Task Planning and Decomposition

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/abs/2201.11903
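
Chain-of-thought deserves a pause because the entire technique is a prompting pattern, not an architecture change: put worked reasoning in the few-shot exemplars and the model produces intermediate steps before its answer. A minimal version built around the paper's well-known tennis-ball exemplar (the second question is my own):

```python
# Few-shot chain-of-thought prompt: the exemplar shows the reasoning
# steps, not just the final answer, so the model imitates that style.
COT_PROMPT = """\
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls.
Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is
6 tennis balls. 5 + 6 = 11. The answer is 11.

Q: {question}
A:"""

prompt = COT_PROMPT.format(
    question="A baker made 24 rolls and sold 3 boxes of 6 rolls. How many are left?"
)
# Send `prompt` to any completion-style LLM and parse the text after
# "The answer is"; without the worked exemplar, smaller models tend
# to answer directly and make arithmetic mistakes.
```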

Tree of Thoughts: Deliberate Problem Solving with Large Language Models https://arxiv.org/abs/2305.10601

Implicit Chain of Thought Reasoning via Knowledge Distillation https://arxiv.org/abs/2311.01460

ReAct: Synergizing Reasoning and Acting in Language Models https://arxiv.org/abs/2210.03629
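
ReAct is just as compact when written as code: alternate model-generated Thought/Action steps with tool Observations until the model emits a final answer. A minimal sketch, where the `llm` completion function and the tool registry are placeholders (the paper itself uses Wikipedia search/lookup tools and a more careful prompt format):

```python
import re

def react_loop(llm, tools, question, max_steps=5):
    """Minimal ReAct-style agent loop. `llm` is any text-completion
    function; `tools` maps a tool name to a Python callable. The model
    interleaves Thought/Action lines; we execute each Action and feed
    the result back as an Observation, stopping at "Final Answer:"."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript + "Thought:")
        transcript += "Thought:" + step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        m = re.search(r"Action: (\w+)\[(.*?)\]", step)
        if m:
            name, arg = m.groups()
            result = tools[name](arg) if name in tools else "unknown tool"
            transcript += f"Observation: {result}\n"
    return None
```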

ART: Automatic multi-step reasoning and tool-use for large language models https://arxiv.org/abs/2303.09014

Branch-Solve-Merge Improves Large Language Model Evaluation and Generation https://arxiv.org/abs/2310.15123

WizardLM: Empowering Large Language Models to Follow Complex Instructions https://arxiv.org/pdf/2304.12244.pdf

Hallucination

Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models https://arxiv.org/pdf/2309.01219.pdf

Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback https://arxiv.org/abs/2302.12813

SelfCheckGPT: Zero-Resource Black-Box Hallucination Detection for Generative Large Language Models https://arxiv.org/abs/2303.08896

WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus https://arxiv.org/abs/2304.04358

Multimodal

Learning Transferable Visual Models From Natural Language Supervision (CLIP) https://arxiv.org/abs/2103.00020
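
The CLIP paper famously includes a few lines of numpy-style pseudocode for its objective; here is a filled-in version of that idea, assuming the image and text encoders have already produced a batch of aligned embeddings:

```python
import numpy as np

def clip_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss over an aligned batch: image i and
    text i form the positive pair; every other pairing is a negative."""
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (n, n) scaled cosine similarities
    labels = np.arange(len(logits))             # correct matches on the diagonal

    def xent(l):                                # row-wise cross-entropy
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```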

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale (ViT): https://arxiv.org/abs/2010.11929

MiniGPT-v2: large language model as a unified interface for vision-language multi-task learning https://arxiv.org/abs/2310.09478

MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models https://arxiv.org/abs/2304.10592

NExT-GPT: Any-to-Any Multimodal LLM https://arxiv.org/pdf/2309.05519.pdf

Visual Instruction Tuning (LLaVA) https://arxiv.org/pdf/2304.08485.pdf

Improved Baselines with Visual Instruction Tuning (LLaVA-1.5) https://arxiv.org/abs/2310.03744

Sequential Modeling Enables Scalable Learning for Large Vision Models (LVM) https://arxiv.org/pdf/2312.00785.pdf

CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation https://arxiv.org/pdf/2311.18775.pdf

Neural Discrete Representation Learning (VQ-VAE) https://browse.arxiv.org/pdf/1711.00937.pdf
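
The heart of VQ-VAE is a single operation: snap each encoder output to its nearest codebook vector. A numpy sketch of just that step (training additionally needs the codebook and commitment losses and a straight-through gradient, none of which appears here):

```python
import numpy as np

def vector_quantize(z, codebook):
    """Nearest-neighbor quantization, the core of VQ-VAE.
    z:        (n, d) encoder outputs
    codebook: (k, d) learned embedding vectors
    Returns the quantized vectors and the chosen code indices."""
    # squared distance from every z to every codebook entry
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (n, k)
    idx = d2.argmin(axis=1)
    return codebook[idx], idx
```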

Taming Transformers for High-Resolution Image Synthesis (VQ-GAN) https://arxiv.org/abs/2012.09841

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows https://arxiv.org/abs/2103.14030

BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models https://browse.arxiv.org/pdf/2301.12597.pdf

InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning https://browse.arxiv.org/pdf/2305.06500.pdf

ImageBind: One Embedding Space To Bind Them All https://arxiv.org/abs/2305.05665

Meta-Transformer: A Unified Framework for Multimodal Learning https://arxiv.org/abs/2307.10802

Image/Video Generation

High-Resolution Image Synthesis with Latent Diffusion Models https://arxiv.org/pdf/2112.10752.pdf

Structure and Content-Guided Video Synthesis with Diffusion Models (RunwayML Gen1) https://browse.arxiv.org/pdf/2302.03011.pdf

Hierarchical Text-Conditional Image Generation with CLIP Latents (DALL-E 2) https://arxiv.org/pdf/2204.06125.pdf

AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning https://arxiv.org/abs/2307.04725

Adding Conditional Control to Text-to-Image Diffusion Models (ControlNet) https://arxiv.org/abs/2302.05543

SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis https://arxiv.org/abs/2307.01952

Speech Synthesis

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (VITS) https://browse.arxiv.org/pdf/2106.06103.pdf

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (VALL-E) https://arxiv.org/abs/2301.02111

Speak Foreign Languages with Your Own Voice: Cross-Lingual Neural Codec Language Modeling (VALL-E X) https://arxiv.org/pdf/2303.03926.pdf

MusicLM: Generating Music From Text https://arxiv.org/abs/2301.11325

Foundations of Large Models

Attention Is All You Need https://arxiv.org/abs/1706.03762
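
If you internalize only one formula from this list, make it scaled dot-product attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V. A direct numpy transcription:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q: (n, d_k), K: (m, d_k), V: (m, d_v). Each query produces a
    softmax-weighted average of the values, weighted by how well it
    matches each key; sqrt(d_k) keeps the logits well-scaled."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # (n, m)
    if causal:  # decoder-style: position i may only see positions <= i
        scores += np.triu(np.full(scores.shape, -np.inf), k=1)
    scores -= scores.max(axis=-1, keepdims=True)       # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (n, d_v)
```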

Sequence to Sequence Learning with Neural Networks https://arxiv.org/abs/1409.3215

Neural Machine Translation by Jointly Learning to Align and Translate https://arxiv.org/abs/1409.0473

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/abs/1810.04805

Scaling Laws for Neural Language Models https://arxiv.org/pdf/2001.08361.pdf

Emergent Abilities of Large Language Models https://openreview.net/pdf?id=yzkSU5zdwD

Training Compute-Optimal Large Language Models (Chinchilla scaling law) https://arxiv.org/abs/2203.15556
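
The Chinchilla result compresses to a rule of thumb: with training compute C ≈ 6·N·D FLOPs, loss is minimized when parameters N and training tokens D grow together, at roughly 20 tokens per parameter. A back-of-the-envelope helper, treating that ratio as a heuristic rather than the paper's exact fitted exponents:

```python
def chinchilla_optimal(compute_flops, tokens_per_param=20):
    """Approximate compute-optimal model/data split.
    Combines C ~= 6*N*D with the ~20-tokens-per-parameter heuristic
    the paper's fits roughly imply, so C ~= 120*N**2."""
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# A 1e24-FLOP budget comes out around 90B parameters and ~1.8T tokens;
# Chinchilla itself (~5.76e23 FLOPs) lands near its actual 70B / 1.4T.
N, D = chinchilla_optimal(1e24)
print(f"~{N/1e9:.0f}B params, ~{D/1e12:.1f}T tokens")
```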

Scaling Instruction-Finetuned Language Models https://arxiv.org/pdf/2210.11416.pdf

Direct Preference Optimization: Your Language Model is Secretly a Reward Model https://arxiv.org/pdf/2305.18290.pdf
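
DPO's whole contribution fits in one loss: skip the reward model and the RL loop, and directly push up the log-probability margin of the chosen response over the rejected one, measured relative to a frozen reference model. A sketch of the loss on per-response log-probabilities (in practice these are summed token log-probs from the policy and reference LLMs):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Direct Preference Optimization loss for one preference pair.
    Each argument is the log-probability of a full response under the
    trainable policy (logp_*) or the frozen reference (ref_logp_*);
    beta controls how far the policy may drift from the reference."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written stably as log(1 + exp(-margin))
    return np.logaddexp(0.0, -margin)
```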

Progress measures for grokking via mechanistic interpretability https://arxiv.org/abs/2301.05217

Language Models Represent Space and Time https://arxiv.org/abs/2310.02207

GLaM: Efficient Scaling of Language Models with Mixture-of-Experts https://arxiv.org/abs/2112.06905

Adam: A Method for Stochastic Optimization https://arxiv.org/abs/1412.6980
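
Adam is short enough to write out in full: exponential moving averages of the gradient and its square, bias-corrected, giving a per-parameter step size. This follows the paper's Algorithm 1:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; t is the step count starting at 1."""
    m = b1 * m + (1 - b1) * grad          # 1st-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad**2       # 2nd-moment (uncentered variance)
    m_hat = m / (1 - b1**t)               # bias corrections: the averages
    v_hat = v / (1 - b2**t)               # start at zero and must be rescaled
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```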

Efficient Estimation of Word Representations in Vector Space (Word2Vec) https://arxiv.org/abs/1301.3781

Distributed Representations of Words and Phrases and their Compositionality https://arxiv.org/abs/1310.4546

GPT

Language Models are Few-Shot Learners (GPT-3) https://arxiv.org/abs/2005.14165

Language Models are Unsupervised Multitask Learners (GPT-2) https://d4mucfpksywv.cloudfront.net/better-language-models/language_models_are_unsupervised_multitask_learners.pdf

Open Source Large Models

LLaMA: Open and Efficient Foundation Language Models https://arxiv.org/abs/2302.13971

Llama 2: Open Foundation and Fine-Tuned Chat Models https://arxiv.org/pdf/2307.09288.pdf

Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality https://lmsys.org/blog/2023-03-30-vicuna/

LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset https://arxiv.org/abs/2309.11998

Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena https://arxiv.org/abs/2306.05685

How Long Can Open-Source LLMs Truly Promise on Context Length? https://lmsys.org/blog/2023-06-29-longchat/

Mixtral of experts https://mistral.ai/news/mixtral-of-experts/

OpenChat: Advancing Open-source Language Models with Mixed-Quality Data https://arxiv.org/abs/2309.11235

RWKV: Reinventing RNNs for the Transformer Era https://arxiv.org/abs/2305.13048

Mamba: Linear-Time Sequence Modeling with Selective State Spaces https://arxiv.org/ftp/arxiv/papers/2312/2312.00752.pdf

Retentive Network: A Successor to Transformer for Large Language Models https://arxiv.org/abs/2307.08621

Baichuan 2: Open Large-scale Language Models https://arxiv.org/abs/2309.10305

GLM-130B: An Open Bilingual Pre-trained Model https://arxiv.org/abs/2210.02414

Qwen Technical Report https://arxiv.org/abs/2309.16609

Skywork: A More Open Bilingual Foundation Model https://arxiv.org/abs/2310.19341

Fine-Tuning

Learning to summarize from human feedback https://arxiv.org/abs/2009.01325

Self-Instruct: Aligning Language Models with Self-Generated Instructions https://arxiv.org/abs/2212.10560

Scaling Down to Scale Up: A Guide to Parameter-Efficient Fine-Tuning https://arxiv.org/abs/2303.15647

LoRA: Low-Rank Adaptation of Large Language Models https://arxiv.org/abs/2106.09685
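
LoRA freezes the pretrained weight W and learns only a low-rank update (the product of two thin matrices), so a d_in×d_out layer is adapted with just r·(d_in+d_out) trainable parameters, and the update can be merged back into W for inference. A minimal numpy sketch:

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer plus a trainable low-rank update:
    y = x W + (alpha / r) * x A B.  Only A and B are trained."""
    def __init__(self, W, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = W.shape
        self.W = W                                 # frozen pretrained weight
        self.A = rng.normal(size=(d_in, r)) / r    # random down-projection
        self.B = np.zeros((r, d_out))              # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return x @ self.W + self.scale * (x @ self.A) @ self.B

    def merge(self):
        """Fold the adapter back into W for zero-overhead inference."""
        return self.W + self.scale * self.A @ self.B
```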

VeRA: Vector-based Random Matrix Adaptation https://arxiv.org/pdf/2310.11454.pdf

QLoRA: Efficient Finetuning of Quantized LLMs https://arxiv.org/abs/2305.14314

Chain of Hindsight Aligns Language Models with Feedback https://arxiv.org/abs/2302.02676

Beyond Human Data: Scaling Self-Training for Problem-Solving with Language Models https://arxiv.org/pdf/2312.06585.pdf

Performance Optimization

Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM) https://arxiv.org/abs/2309.06180

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness https://arxiv.org/abs/2205.14135
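
The math behind FlashAttention is the "online softmax": stream the keys in blocks while carrying a running max, normalizer, and weighted value sum, so the full n×n score matrix never materializes. The paper's real contribution is the engineering (tiling into GPU SRAM, the IO analysis, recomputation in the backward pass); this sketch shows only the math, for a single query:

```python
import numpy as np

def online_softmax_attention(q, K, V, block=64):
    """Attention for one query q: (d,), K: (n, d), V: (n, d_v),
    processing keys block by block without storing all n scores."""
    d = q.shape[-1]
    m, l = -np.inf, 0.0                       # running max and normalizer
    acc = np.zeros(V.shape[-1])               # running weighted value sum
    for i in range(0, K.shape[0], block):
        s = K[i:i+block] @ q / np.sqrt(d)     # scores for this key block
        m_new = max(m, s.max())
        alpha = np.exp(m - m_new)             # rescales the old running state
        p = np.exp(s - m_new)
        l = alpha * l + p.sum()
        acc = alpha * acc + p @ V[i:i+block]
        m = m_new
    return acc / l
```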

S-LoRA: Serving Thousands of Concurrent LoRA Adapters https://arxiv.org/abs/2311.03285

GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism https://proceedings.neurips.cc/paper/2019/file/093f65e080a295f8076b1c5722a46aa2-Paper.pdf

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism https://arxiv.org/pdf/1909.08053.pdf

ZeRO: Memory Optimizations Toward Training Trillion Parameter Models https://arxiv.org/pdf/1910.02054.pdf

Fast Transformer Decoding: One Write-Head is All You Need https://arxiv.org/abs/1911.02150
