200 Interview Questions on Large Models
This article is companion material for the book Hands-On Large Language Models.
When interviewing candidates and attending industry seminars, I often find that many people have extensive practical experience but know very little about the basic principles of models. To help readers better understand this book, and to help those preparing for interviews read it with clear goals, I have systematically compiled common interview questions in the field of large models around the themes of each chapter. Most of the answers can be found directly in the book, while some advanced questions can be answered from the book’s references or from recent papers online. I hope all readers will read this book with these questions in mind.
Chapter 1: Introduction to Large Language Models
- What is the difference between the encoder and decoder in Transformer, and are models with only an encoder or only a decoder useful?
- What are the differences between GPT and the model architecture in the original Transformer paper?
- What are the advantages and disadvantages of encoder-only (BERT-like), decoder-only (GPT-like), and full encoder-decoder architectures?
- Why is the self-attention mechanism of Transformer considered a significant advancement over the attention mechanism in early RNNs?
- Why do large language models have the concept of maximum context length? Why does it refer to the total length of input and output?
- How are the first token latency, input throughput, and output throughput of large language models calculated? What requirements do different application scenarios place on each of these metrics? (see the worked example after this list)
- Why is the two-step paradigm of pre-training and fine-tuning so important? What core capabilities does the foundational model acquire through pre-training? What role does fine-tuning play in guiding the model to follow instructions, answer questions, and align with human values?
- How does LLaMA-3 8B achieve comprehensive capabilities stronger than LLaMA-1 70B?
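To make the latency and throughput question above concrete, here is a back-of-the-envelope calculation. All numbers are hypothetical; only the definitions (first token latency measured from request arrival to the first output token, throughput as tokens over wall-clock time) carry over.

```python
# Hypothetical timing numbers for one request; all values are illustrative.
prompt_tokens = 2000        # input length
output_tokens = 500         # generated length
t_first_token = 0.35        # seconds from request arrival to first output token (TTFT)
t_total = 10.35             # seconds from request arrival to last output token

# Input (prefill) throughput: prompt tokens processed before the first token appears.
input_throughput = prompt_tokens / t_first_token               # ~5714 tokens/s

# Output (decode) throughput: generated tokens over the decoding phase.
output_throughput = output_tokens / (t_total - t_first_token)  # = 50 tokens/s

print(f"TTFT: {t_first_token * 1000:.0f} ms")
print(f"prefill: {input_throughput:.0f} tok/s, decode: {output_throughput:.0f} tok/s")
```

Interactive chat is most sensitive to first token latency and decode speed, while offline batch jobs mostly care about total throughput.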
Chapter 2: Tokens and Embeddings
- What is the difference between the tokenizer of large language models and traditional Chinese word segmentation? Is there only one unique way to tokenize a sentence for a given vocabulary?
- Why is traditional BM25 retrieval sensitive to the quality of Chinese word segmentation, while large language models are not sensitive to the choice of tokenizer?
- What are the advantages of byte-level BPE tokenizers used by modern large language models like GPT-4 and LLaMA compared to traditional BPE tokenizers?
- How do large language models pre-trained in China manage to represent Chinese text with fewer tokens than overseas models?
- How do large models distinguish between what the user says and what the AI says in chat history? (see the chat-template sketch after this list)
- When large models make tool calls, how are the tool call parameters distinguished from text responses?
- In the case of training song embeddings with playlist data in the reference section, design a system using embedding technology to solve e-commerce product recommendations. What data would you use as the equivalent of a “sentence”? How would you incorporate user behavior into the embedding model?
- What is the role of negative samples in the training process of Word2vec?
- What is the difference between traditional static word embeddings (like word2vec) and contextual embeddings generated by large language models? What value do static word embeddings still have with the advent of contextual embeddings?
- How do contextual embeddings solve the problem of polysemy, for example the English word “token” meaning an authentication token, a blockchain token, or a text token depending on the technical context, and the Chinese word “推理” meaning either reasoning or inference?
- In word2vec and other word embedding spaces, there is a phenomenon of king - man + woman ≈ queen. Why is this? Does the token embedding space of large language models have similar properties?
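As a companion to the user-vs-AI question above: chat models separate turns with role-specific special tokens defined by a chat template. A minimal sketch using the Hugging Face transformers API; the checkpoint name is just an example, and any chat model with a template works the same way.

```python
from transformers import AutoTokenizer

# Example checkpoint; any chat model with a chat template behaves similarly.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Who said the AI's lines in this chat?"},
    {"role": "assistant", "content": "They are wrapped in assistant-role special tokens."},
]

# Render the conversation as the exact string the model sees: role markers
# (e.g. <|start_header_id|>user<|end_header_id|> in Llama 3's template)
# delimit each turn, which is how the model tells the speakers apart.
print(tokenizer.apply_chat_template(messages, tokenize=False))
```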
Chapter 3: Illustrated Large Language Models
- How does a large model know when its output should end?
- How to prevent the model from seeing future tokens during training?
- How does the attention mechanism compute the correlation between tokens in the context? Does each attention head focus on only one token? Why divide by the square root of d_k before the softmax? (see the attention sketch after this list)
- Q and K in the attention expression look symmetrical, so why does the KV cache store only K and V, and not Q? (see the KV-cache sketch after this list)
- How much would inference performance decrease without the KV cache?
- Why are residual connections needed in Transformer?
- What is the difference between LayerNorm in Transformer and BatchNorm in ResNet, and why did LLaMA-3 switch to RMSNorm?
- What is the role of the feedforward network in Transformer? Is the feedforward network necessary since there is already a softmax non-linear layer in the attention layer?
- If you need to modify as few parameter values as possible to make the model forget a specific piece of knowledge, should you modify the parameters of the attention layer or the feedforward network layer?
- Why are large language models often inaccurate in mathematical calculations?
- How do model depth (number of layers), width (hidden dimension size), number of attention heads, context length, and other parameters affect each other? If you want to train a model ten times larger than the current model, how would you adjust these parameters?
- Take an open-source model you are familiar with as an example and introduce the size and shape of each matrix in the model.
- During the inference process of large language models, which is the bottleneck, memory bandwidth or computing power? Take an open-source model you are familiar with as an example and calculate the input batch size at which memory bandwidth and computing power are balanced.
- From a statistical perspective, what distribution do Transformer output layers assume tokens follow?
- Given an open-source model that supports 8K context, how can it be extended to support 32K context? What challenges will increasing context length bring to the KV cache?
- Why does the attention mechanism need multiple heads? How do GQA and MQA optimizations differ from simply reducing the number of attention heads? Do GQA and MQA optimize the training phase or the inference phase?
- FlashAttention does not reduce the amount of computation, so why can it achieve acceleration? How does FlashAttention achieve incremental computation of softmax?
- What are the advantages of RoPE (rotary position embedding) compared to absolute position encoding in the Transformer paper? What challenges does RoPE face when extrapolating to long contexts?
- Since the length of training samples is often less than the maximum context length, how can interference be avoided when multiple training samples are placed in the same context for training?
- How can a small-scale large language model be used to improve the inference performance of a large-scale model while minimizing the impact on the inference results of the large model? Speculative decoding does not reduce the amount of computation, so why can it improve inference performance?
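For the future-token and square-root-of-d_k questions above, here is a toy NumPy sketch of single-head causal attention. It is an illustration of the mechanism, not an optimized implementation.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Single-head scaled dot-product attention with a causal mask.
    Q, K, V: arrays of shape (seq_len, d_k)."""
    d_k = Q.shape[-1]
    # Scale by sqrt(d_k) so logit variance stays ~1 and the softmax is not saturated.
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len)
    # Causal mask: position i may only attend to positions j <= i;
    # this is how future tokens are hidden from the model during training.
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(causal_attention(Q, K, V).shape)  # (4, 8)
```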
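And for the KV-cache question: during decoding, each new token’s query is used exactly once, while all past keys and values are reused at every step, which is why only K and V are cached. A toy incremental-decoding sketch:

```python
import numpy as np

d_k, rng = 8, np.random.default_rng(1)
W_q, W_k, W_v = (rng.normal(size=(d_k, d_k)) for _ in range(3))
K_cache, V_cache = [], []   # grows by one row per generated token

def decode_step(x):
    """x: embedding of the newest token, shape (d_k,)."""
    q = x @ W_q                     # Q for this step only; never stored
    K_cache.append(x @ W_k)         # K and V are appended to the cache...
    V_cache.append(x @ W_v)
    K, V = np.stack(K_cache), np.stack(V_cache)
    s = K @ q / np.sqrt(d_k)        # ...and reused against every later query
    w = np.exp(s - s.max()); w /= w.sum()
    return w @ V

for _ in range(5):                  # without the cache, every step would have to
    out = decode_step(rng.normal(size=d_k))  # recompute K and V for all past tokens
```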
Chapter 4: Text Classification
- How can text classification be achieved based on embedding vectors generated by representational models? (see the sketch after this list)
- What are the advantages and disadvantages of using embedding vectors for classification compared to directly classifying with generative models?
- If there is no labeled data, how can text classification be achieved based on embedding models? How can label descriptions be optimized to improve the accuracy of zero-shot classification?
- The classification method of embedding model + logistic regression in the book achieved an F1 score of 0.85, while the zero-shot classification method achieved an F1 score of 0.78. Under what circumstances would you choose zero-shot classification if labeled data is available?
- Why does Transformer perform much better than a naive Bayes classifier? What is the problem with the conditional independence assumption of the naive Bayes classifier?
- What is masked language modeling, and what masking strategy does BERT use? How does this pre-training method help the model achieve better performance in downstream text classification tasks?
- Suppose you have a dataset containing 1 million customer reviews, but only 1,000 labeled data. How would you build a classification system that combines the advantages of representational models and generative models using both labeled and unlabeled data?
- When using generative models for text classification, which of the following three prompts would be more effective?
- “Is the following sentence positive or negative?”
- “Classify the sentiment of this movie review as positive or negative.”
- “You are a sentiment analysis expert. Given a movie review, determine if it expresses a positive or negative opinion. Return only the label ‘positive’ or ‘negative’.”
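A minimal sketch of the embedding-plus-classifier recipe from the first question above, using sentence-transformers and scikit-learn. The model name and the data are placeholders.

```python
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

# Placeholder data; in practice use a labeled dataset such as movie reviews.
train_texts = ["great movie", "terrible plot", "loved it", "waste of time"]
train_labels = [1, 0, 1, 0]

# Representational model: a frozen encoder that turns text into fixed-size vectors.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
X_train = encoder.encode(train_texts)

# Lightweight classification head trained on top of the frozen embeddings.
clf = LogisticRegression().fit(X_train, train_labels)
print(clf.predict(encoder.encode(["what a fantastic film"])))  # -> [1]
```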
Chapter 5: Text Clustering and Topic Modeling
- With powerful generative large language models, what is the use of embedding models? Please give an example suitable for embedding models but not for generative models. (Hint: recommendation systems)
- Given a large number of documents, how can they be clustered into several groups, and how can the theme of each group be summarized?
- What is the difference in principle between the bag-of-words method and document embedding? Is the bag-of-words method completely useless?
- What is the difference between c-TF-IDF in BERTopic and traditional TF-IDF? How does this difference help improve the quality of topic representation?
- What are the advantages and disadvantages of topic models like LDA, BTM, NMF, BERTopic, and Top2Vec? Which model should be used for long documents, for short documents, and for vertical domains with high quality requirements?
- What are the advantages and disadvantages of centroid-based and density-based text clustering algorithms?
- Why is it beneficial to separate the clustering and topic representation steps in the topic modeling process?
- In a topic modeling project, you find that there are many overlapping keywords in the generated topics. How can you use the techniques introduced in this chapter to improve the distinction between topics?
- When using BERTopic, if a large proportion of documents are classified as outliers, what might be the cause? How can clustering parameters be adjusted? (see the sketch after this list)
- In news or social media recommendation systems, topics often evolve rapidly over time. How can emerging topics be detected?
- How can a recommendation system for a content platform be built, providing recommendations through text clustering and topic modeling during cold start, and using user interaction data to improve recommendation effectiveness once a certain amount of user interaction data is available?
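For the outlier question above: BERTopic delegates clustering to HDBSCAN, so the outlier rate is mostly governed by HDBSCAN’s parameters. A sketch of the relevant knobs, with illustrative values:

```python
from sklearn.datasets import fetch_20newsgroups
from hdbscan import HDBSCAN
from bertopic import BERTopic

docs = fetch_20newsgroups(subset="all", remove=("headers", "footers", "quotes")).data

# Smaller min_cluster_size / min_samples make HDBSCAN less conservative,
# so fewer documents land in the -1 (outlier) topic; these values are illustrative.
hdbscan_model = HDBSCAN(min_cluster_size=10, min_samples=5, prediction_data=True)
topic_model = BERTopic(hdbscan_model=hdbscan_model)

topics, probs = topic_model.fit_transform(docs)
print(sum(t == -1 for t in topics) / len(topics))  # fraction of outlier documents

# BERTopic can also reassign leftover outliers to their nearest topics afterwards.
topics = topic_model.reduce_outliers(docs, topics)
```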
Chapter 6: Prompt Engineering
- For translation tasks, creative writing tasks, and brainstorming tasks, how should `temperature` and `top_p` be set? How can you verify that your chosen parameter settings are optimal?
- Why do some models still have some uncertainty in their output even when the temperature is set to 0? (Hint: speculative decoding)
- How can hallucinations be reduced in a specified large model through prompts?
- What components should a professional prompt template consist of? Why is it necessary to describe role definitions in prompts?
- For a complex prompt, how can you test which parts are useful and which parts are useless?
- How can prompt templates be designed to prevent prompt injection as much as possible? How can prompt injection attacks be detected at the system level?
- If user information is placed in system prompts, but the large model often forgets user information after many rounds of conversation, how can this be resolved?
- How can ChatGPT be made to output its own system prompts?
- Before inference models, how can models be made to think before answering? What are the advantages and disadvantages of techniques like chain of thought, self-consistency, and thought trees?
- In creative writing tasks, how can models generate multiple possible outputs and then select the best one?
- If the model needs to follow a specified format for output, how should the prompt be written?
- How can it be ensured that the model’s output is always in a valid JSON format? (Hint: constrained sampling)
- When using large models for classification tasks, how can it be ensured that the output is always one of several categories and does not output irrelevant content? (Hint: constrained sampling; see the sketch after this list)
- If creating an English learning application, how can it be ensured that what it says is always within a specified vocabulary and never includes out-of-scope new words? (Hint: constrained sampling)
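Several questions above hint at constrained sampling. The core trick is to mask the logits so that only allowed tokens can be sampled; everything else follows from applying this at each decoding step. A framework-free toy sketch:

```python
import numpy as np

def constrained_sample(logits, allowed_ids, temperature=1.0):
    """Sample only from `allowed_ids` by masking every other logit to -inf.
    This is the core trick behind guaranteed-valid JSON or label-only outputs."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_ids] = logits[allowed_ids]
    probs = np.exp((masked - masked.max()) / temperature)
    probs /= probs.sum()
    return np.random.choice(len(logits), p=probs)

# Toy vocabulary and raw model scores, purely for illustration.
vocab = {"positive": 0, "negative": 1, "the": 2, "banana": 3}
logits = np.array([2.0, 1.5, 3.0, 0.5])
allowed = [vocab["positive"], vocab["negative"]]  # classification: only two labels legal
print(constrained_sample(logits, allowed))        # always 0 or 1
```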
Chapter 7: Advanced Text Generation Techniques and Tools
- If we need to generate novel titles, character settings, and storylines, and a single model call does not yield good results, how can we generate them step by step?
- If the user has too many conversation rounds with the model, exceeding the model’s context limit, but still wants to retain as much of the user’s dialogue information as possible, what should be done?
- In role-playing scenarios, after the user has too many conversation rounds with the model (but not exceeding the context limit), the model often fails to notice key events that occurred in past dialogues. What should be done?
- After many conversation rounds with the model, the pre-fill latency for processing input tokens increases. How should this be resolved? (Hint: Persist KV cache)
- How to write an agent that can autonomously think about what keywords to search for next and which webpage to browse, like OpenAI Deep Research?
- How to write an agent to help users plan a trip that includes flight bookings, hotel arrangements, and sightseeing? What tools need to be configured? How to ensure the system can still provide reasonable advice when faced with incomplete or contradictory information?
- If the prompt for a single agent is too long, leading to performance degradation, how can it be split into multiple agents and call different agents at the right time? How can effective context transfer and result integration be achieved between different agents?
- Different foundational models perform differently on different tasks. How can the most suitable model be automatically selected based on task characteristics?
- If a tool takes a long time to call, how can the agent continue to interact with the user or call other tools while waiting for the tool call to return, and promptly take the next action when the tool call returns? (see the sketch after this list)
- For continuous dialogue tasks in role-playing scenarios, how can role settings and historical dialogues be cached to reduce the cost and latency of input tokens?
- How does an agent handle time information in memory, such as “the issue discussed yesterday”? How to proactively ask the user when the user does not reply for a long time?
- When multiple agents are discussing in the same room, how to prevent multiple agents from talking over each other and avoid awkward silences?
- How can an agent that supports real-time voice maintain low latency while avoiding talking over the user?
- How can an agent that supports voice input use non-acoustic methods to semantically understand whether the user is speaking to someone nearby or to it?
- What are the differences between PTQ and QAT quantization methods, and what are their advantages and disadvantages?
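For the slow-tool question above, one common pattern is to run the tool call as a background task so the dialogue loop stays responsive. A minimal asyncio sketch; the tool and the agent’s replies are hypothetical placeholders:

```python
import asyncio

async def slow_tool() -> str:
    """Hypothetical tool that takes a long time, e.g. a web crawl."""
    await asyncio.sleep(5)
    return "tool result"

async def agent_loop():
    # Launch the tool without blocking the dialogue.
    tool_task = asyncio.create_task(slow_tool())
    while not tool_task.done():
        # Keep chatting (placeholder for a real model call) while the tool runs.
        print("agent: still working on it, feel free to keep talking...")
        await asyncio.sleep(1)
    # As soon as the tool returns, inject the result into context and act on it.
    print(f"agent: got {tool_task.result()!r}, here is your answer.")

asyncio.run(agent_loop())
```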
Chapter 8: Semantic Search and Retrieval-Augmented Generation (RAG)
- In RAG, why should documents be divided into multiple chunks for indexing? How to solve the problem of missing content context after document chunking? How to handle dependencies across segments?
- If the matching effect of vector similarity retrieval is not good, what other methods are there besides changing the embedding model?
- Vector similarity retrieval cannot achieve precise keyword matching, and traditional keyword retrieval cannot match semantically similar words. How can this tension be resolved? (see the fusion sketch after this list)
- Vector similarity retrieval is already based on semantic similarity matching, so why is a re-ranking model still needed?
- Why should user input be rewritten before vector similarity retrieval?
- The documents retrieved by the RAG system may contain conflicting information or outdated data. How to prevent being misled by this information when generating answers?
- How to enable the retrieval module to receive feedback from the generation module and dynamically adjust the retrieval strategy, such as annotating different documents with credibility?
- How to enhance the interpretability of the RAG system, including clearly annotating the source of generated content and quantitatively displaying the system’s confidence in the answers?
- How can an agent summarize the experience of handling enterprise tasks into a knowledge base and introduce the experience from the knowledge base in subsequent tasks? How to ensure that experience continues to accumulate rather than simply overwriting existing experience with new experience?
- If you need to answer questions based on the content of a long novel, and the length of the novel far exceeds the context limit, how should you comprehensively use summary and RAG techniques to answer both the story outline and story details?
- How to extend the RAG system from pure text to multimodal, supporting the retrieval of images, videos, and documents with both text and images, and presenting answers in a multimodal form, such as including charts and videos from the original documents?
- If you need to design an AI smart companion that records everything the user says and does every day for several months, how can you quickly retrieve relevant memories when needed, allowing the AI to answer questions based on memory? Integrate dialogue history windowing, summary, RAG, and other technologies.
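For the keyword-versus-semantic question above, a standard answer is hybrid retrieval: run BM25 and vector search separately, then merge the two ranked lists, for example with reciprocal rank fusion (RRF). A toy sketch of the fusion step; the document ids are placeholders:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of doc ids. A document scores 1/(k + rank)
    in each list it appears in; k=60 is the value from the original RRF paper."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc3", "doc1", "doc7"]    # exact keyword matches (placeholder ids)
vector_hits = ["doc1", "doc9", "doc3"]  # semantic matches (placeholder ids)
print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
# doc1 and doc3 rise to the top because both retrievers agree on them.
```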
Chapter 9: Multimodal Large Language Models
- Why can’t ViT simply assign a unique, discrete ID to each image block like processing text tokens, but must use linear projection to generate continuous embedding vectors?
- In CLIP training, why is it necessary to simultaneously maximize the similarity of matching image-text pairs and minimize the similarity of non-matching pairs? (see the loss sketch after this list)
- BLIP-2 adopts a strategy of freezing pre-trained ViT and LLM and only training the Q-Former. What is the core motivation and advantage of this design?
- How does BLIP-2 connect the pre-trained image encoder and pre-trained LLM? Why not directly connect the output of the visual encoder to the language model, but introduce the Q-Former as an intermediate layer structure?
- When mapping multimodal features to the text feature space, information loss is inevitable. What are the differences in information retention capabilities between cross-attention, Q-Former, and linear mapping methods?
- What are the roles of image-text contrastive learning, image-text matching, and image-based text generation tasks in the BLIP-2 model? How do they differ from today’s multimodal models like Qwen-VL?
- When building a multimodal model based on pre-trained modality encoders, modality decoders, and text large language models, what data is needed for the two stages of multimodal pre-training and multimodal fine-tuning, and which parameters of the model need to be frozen?
- When processing images, both CLIP and BLIP-2 preprocess them into a fixed size. How to handle images with large differences in aspect ratio?
- How does the model handle both input images and text questions when implementing visual question answering (VQA) in BLIP-2?
- Taking an open-source multimodal model you are familiar with as an example, what is the first token latency when inputting a 512×512 image and a 100-token question, and how much of that latency do the modality encoder, the Q-Former, and the LLM each account for?
- How long does each step of operation typically take for a multimodal large language model that can operate a computer graphical interface, and what constitutes the delay?
- Humans operate unfamiliar interfaces slowly but quickly operate familiar interfaces. How can a multimodal model quickly operate familiar interfaces like humans?
- If there is a weaker multimodal model and a stronger text model (such as DeepSeek R1), how can the capabilities of both be combined to answer multimodal questions?
- If training data for image-text pairs in a vertical domain (such as medicine) is extremely limited, how can a multimodal large language model be built for that domain?
- How to build an AI photo assistant that can index tens of thousands of user photos and efficiently retrieve relevant photos based on user queries?
- In end-to-end speech models, how is speech converted into token representations?
- How do end-to-end speech models continue real-time voice interaction with users during tool calls, and how are the results of tool calls and user voice inputs distinguished in the model’s context?
- What are the similarities and differences in the technical routes between image generation models (such as Stable Diffusion) and image understanding models (such as CLIP, BLIP-2)? Why do diffusion models require noise during inference, while autoregressive models do not?
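For the CLIP question above, the symmetric contrastive objective is easiest to see in code. This follows the pseudocode in the CLIP paper; the batch size, embedding dimension, and temperature value are illustrative:

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb: torch.Tensor, text_emb: torch.Tensor, temperature=0.07):
    """image_emb, text_emb: (batch, dim); row i of each is a matching pair."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Every pairwise similarity in the batch; the diagonal holds matching pairs.
    logits = image_emb @ text_emb.T / temperature    # (batch, batch)
    targets = torch.arange(logits.size(0))
    # Cross-entropy pushes the diagonal up AND the off-diagonal down:
    # maximizing matched-pair similarity alone would let all embeddings collapse.
    loss_i = F.cross_entropy(logits, targets)        # image -> correct text
    loss_t = F.cross_entropy(logits.T, targets)      # text -> correct image
    return (loss_i + loss_t) / 2

print(clip_loss(torch.randn(8, 512), torch.randn(8, 512)))
```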
Chapter 10: Creating Text Embedding Models
- Why does learning through contrast (similar/dissimilar samples) usually capture the semantics or specific task features of text more effectively than learning only similar samples?
- How to generate negative samples to improve model performance? How to construct high-quality hard negative samples?
- What is the difference between dual encoders and cross encoders? If you need to build a large-scale semantic search engine, which architecture would you prioritize for calculating the similarity between queries and documents, and why? If the task changes to precise re-ranking of a small number of candidate pairs, would your choice change?
- What are the advantages and disadvantages of multiple-negatives ranking (MNR) loss, cosine similarity loss, and softmax loss when training embedding models? In what scenarios might cosine similarity loss be more appropriate than MNR loss? (see the training sketch after this list)
- Why does TSDAE choose to use special tokens instead of average pooling as sentence representation?
- Compared to supervised methods, what are the advantages and disadvantages of unsupervised pre-training methods like TSDAE when handling out-of-domain data or domain adaptation?
- What improvements does MTEB have over basic semantic similarity testing (STSB)? What categories of embedding tasks are included?
- How to continuously improve the performance of the re-ranking model in the RAG system based on user preference feedback data?
- If a RAG system has no human users and is only used by AI Agents, how can AI Agent feedback be automatically collected to continuously improve the performance of the re-ranking model in the RAG system?
- If you want to build a text embedding model similar to Google Image Search to find similar images based on input images, how should it be trained?
- If you want to build a semantic search system for a non-natural language vertical domain (such as amino acid sequences, integrated circuit design) with very little labeled data, how should the embedding model be trained?
- As new data and new concepts continue to emerge, how can you detect when to update the text embedding model to achieve incremental continuous learning?
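For the loss question above, multiple-negatives ranking loss needs only (query, relevant passage) pairs, because the other passages in the same batch serve as in-batch negatives. A minimal sketch using the classic sentence-transformers fit API; the model name and data are placeholders:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

model = SentenceTransformer("all-MiniLM-L6-v2")

# Positive pairs only; within a batch, every other example's passage
# acts as an in-batch negative for MNR loss.
train_examples = [
    InputExample(texts=["what is the capital of france",
                        "Paris is the capital of France."]),
    InputExample(texts=["how do planes fly",
                        "Lift is generated by air flowing over the wings."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1)
```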
Chapter 11: Fine-Tuning Representational Models for Classification Tasks
- In fine-tuning tasks, which layers’ weights should be frozen? What is the difference between fine-tuning the first few layers of the encoder, the last few layers of the encoder, and the feedforward network layers? (see the freezing sketch after this list)
- If there is very little labeled training data, how can the amount of training data be augmented? (Hint: SetFit)
- Before training the classification head, SetFit first uses contrastive learning to fine-tune the Sentence Transformer. Why is this fine-tuning step crucial for achieving high performance with very few labeled samples?
- Compared to directly using a frozen general Sentence Transformer to extract embedding vectors and then training a classifier, what characteristics can SetFit’s contrastive learning fine-tuning method enable embedding vectors to learn that are more suitable for downstream classification tasks?
- During continued pre-training, how can the model acquire specific domain knowledge while maximizing the retention of its general capabilities?
- Please compare the advantages and disadvantages of the following three schemes for vertical domain text classification tasks: (a) Directly fine-tuning a general BERT model; (b) Continuing to pre-train BERT on medical text and then fine-tuning; (c) Pre-training a model from scratch with medical text and then fine-tuning.
- In continued pre-training based on masked language modeling, how should the position and probability of mask occurrences be designed?
- During fine-tuning, why is the model usually more sensitive to hyperparameters such as learning rate compared to the pre-training stage?
- In named entity recognition tasks, when BERT splits a word into multiple tokens, how is the label alignment problem solved?
- How to train a small model using domain data for use on embedded devices, while handling text classification, named entity recognition, and semantic search tasks?
- Suppose the training corpus of an embedding model is mainly composed of English, and its performance in Chinese is poor. How can its Chinese capability be improved at a relatively low cost of continued pre-training?
- For a critical scenario classification task, such as misclassifying “severe adverse reaction” as “mild adverse reaction” being more dangerous than the reverse error, how to choose evaluation metrics, solve the problem of class imbalance in the dataset, and modify the loss function?
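For the layer-freezing question above, here is a sketch using the Hugging Face BERT implementation: freeze the embeddings and the lowest N encoder layers, and train only the top layers plus the classification head. The value of N is a judgment call, not a rule:

```python
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2
)

N_FROZEN = 8  # freeze embeddings + the lowest 8 of 12 encoder layers (illustrative)

for param in model.bert.embeddings.parameters():
    param.requires_grad = False
for layer in model.bert.encoder.layer[:N_FROZEN]:
    for param in layer.parameters():
        param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable:,}")  # far fewer than the full ~110M
```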
Chapter 12: Fine-Tuning Generation Models
- Based on the LLaMA-3 70B open-source model, how can the model be fine-tuned to make its output style more concise, more like WeChat chat, and ensure that the output content complies with China’s large model safety requirements? How much data do you think needs to be prepared, how many GPUs are needed, and how long should the training take?
- Someone claims that an article was generated using DeepSeek R1 and provides you with the complete prompt used for generation. How should you verify or falsify this claim? How to quantify the probability of this prompt generating the article? (Hint: Use perplexity)
- For a model with 96 Transformer blocks, each with a 12,288 × 12,288 weight matrix, how many parameters need to be fine-tuned after applying a rank-8 LoRA? How much computation is required for each step of fine-tuning? How much does it reduce compared to full fine-tuning? (see the worked calculation after this list)
- How does block quantization in QLoRA solve the information loss problem caused by ordinary quantization?
- Given a corporate knowledge base composed of several articles, hoping to make the model remember it through SFT, how can it be converted into a dataset suitable for SFT? How to determine the size of the dataset required for SFT?
- What impact will it have if the end-of-sequence marker `</s>` is missing from the fine-tuning data template?
- When fine-tuning a model, how should hyperparameters such as the learning rate, LoRA alpha, and LoRA rank be set? How do you decide when to stop training, and is a lower validation loss always better?
- During fine-tuning, should the loss function be calculated only for the output part, or for both the input and output parts? What are the advantages and disadvantages of each approach?
- After the fine-tuned model goes live, if some use cases repeatedly fail, how should the SFT dataset be modified?
- After many conversation rounds, the model exhibits “echo” problems such as repeating the user’s questions or previous answers. How should this be resolved through fine-tuning methods?
- What are the most popular models currently performing well in different fields? Why do some models perform well on leaderboards but not in practical use?
- What are the advantages and disadvantages of Chatbot Arena’s model evaluation method compared to a fixed test set?
- What are the differences between PPO and DPO in terms of computational efficiency, implementation complexity, and training stability?
- If there is a high-quality but limited human preference dataset, should PPO or DPO be used?
- What does “Proximal” mean in PPO? How to prevent the model’s generalization ability from declining on issues outside the fine-tuning dataset? How to prevent the model from converging to a single type of high-reward answer?
- What are the roles of the actor model, critic model, reward model, and reference model in PPO?
- How does PPO solve the classic sparse reward and reward hacking problems in RL?
- What are the roles of key techniques such as normalized advantage function, value function clipping, and entropy regularization in PPO?
- What does the beta parameter mean in DPO, and what impact does increasing or decreasing it have?
- Imagine a website where all content is AI-generated, and the average user dwell time for each piece of content is recorded. How can this be converted into preference data required by DPO? How does the processing differ for websites like Xiaohongshu and Zhihu?
- For a ChatGPT-type website, how can user behavior be converted into DPO data? For example, likes and dislikes, regeneration, copying, sharing, follow-up questions, etc.
- What is the alignment problem of large language models? How to prevent large language models from outputting personal privacy information in the training corpus?
- How to solve the prompt injection problem as much as possible through model fine-tuning?
- Given 100 rules for answering user questions, and placing them entirely in the prompt does not achieve good instruction-following results, how to construct a fine-tuning dataset and use RL training to enable the model to follow these 100 rules after fine-tuning?
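The LoRA question above is pure arithmetic once you recall that a rank-r LoRA adapter for a d×d matrix adds two thin matrices, A (d×r) and B (r×d). A worked calculation for the numbers in the question:

```python
d, r, blocks = 12_288, 8, 96

full = blocks * d * d              # full fine-tuning: 96 * 12288^2 ≈ 14.5B params
lora = blocks * (d * r + r * d)    # LoRA: A (d x r) + B (r x d) per matrix
                                   #     = 96 * 2 * 12288 * 8 ≈ 18.9M params

print(f"full:  {full:,}")          # 14,495,514,624
print(f"lora:  {lora:,}")          # 18,874,368
print(f"ratio: {lora / full:.4%}") # ≈ 0.13% of the full parameter count
```

Note that the per-step compute saving is much smaller than the parameter saving, since the frozen base weights still participate in every forward and backward pass; LoRA mainly saves gradient and optimizer-state memory.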
Illustrated Reasoning Large Language Models
- According to scaling laws, how can we estimate the size of the pre-training dataset and the computational power required to train a large language model of a specific scale? (see the estimate after this list)
- From the perspective of the principles of large language models, why is it impossible for the LLaMA-3 70B model to reliably solve the 24-point problem without outputting a chain of thought? (That is, input the description of the 24-point problem and four integers under 100, and immediately output a single word Yes or No)
- How does the chain-of-thought mode triggered by a “let’s think step by step” prompt differ in principle from reasoning models? Why is the ceiling of reasoning models higher, even though both spend extra computation at test time?
- What is the difference between RL in reasoning models and RLHF in non-reasoning models?
- Based on research on AlphaZero playing board games, what is the optimal ratio of computational power during training and testing?
- If fine-tuning a reasoning model for a vertical domain is needed, in what scenarios are the process reward model (PRM) and outcome reward model (ORM) suitable?
- In the MCTS method, how do you balance exploration and exploitation? What methods are used to evaluate exploration and exploitation respectively?
- How does the STaR method allow a model to improve itself through self-generated reasoning data? What are its advantages and disadvantages?
- During the post-training process of reasoning models, the chain of thought becomes longer, improving accuracy but increasing response delay. How can the trade-off between reasoning depth and response delay be managed?
- How can reasoning models automatically adjust reasoning depth based on problem complexity, user needs, and system load?
- Why is the cost per output token generally higher for reasoning models compared to non-reasoning models with the same architecture and parameter size?
- In real-time voice dialogue applications, how can reasoning models be utilized without causing users to endure excessive response delays?
- How can RL methods enhance a large language model’s tool-calling capabilities? How can a model be trained to intelligently decide when to rely on internal reasoning abilities and when to call external tools, such as writing a piece of code to solve complex reasoning problems, instead of exhaustively listing all possibilities during the output reasoning process?
- In what scenarios should prompt engineering, RAG, SFT, and RLHF methods be applied? For example: rapid iteration of basic capabilities (prompt engineering), user personalized memory (prompt engineering), case library and factual knowledge (RAG), output format and language style (SFT), domain foundational capabilities (SFT), domain deep thinking capabilities (RL), domain tool-calling capabilities (RL), continuous optimization based on user feedback (RLHF).
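For the scaling-law question at the top of this list, the Chinchilla rule of thumb (about 20 training tokens per parameter, training compute of roughly 6ND FLOPs) gives a quick estimate. A back-of-the-envelope sketch; the hardware numbers are assumptions, not a training plan:

```python
N = 70e9                 # target model size: 70B parameters
tokens_per_param = 20    # Chinchilla-optimal rule of thumb
D = N * tokens_per_param # ≈ 1.4e12 training tokens

flops = 6 * N * D        # standard ≈6ND estimate for training FLOPs
print(f"data: {D:.2e} tokens, compute: {flops:.2e} FLOPs")  # ≈ 5.9e23 FLOPs

# Convert to GPU-time under hypothetical hardware assumptions:
gpu_flops = 300e12       # assume ~300 TFLOPS sustained per GPU (illustrative)
gpu_seconds = flops / gpu_flops
print(f"≈ {gpu_seconds / 86_400 / 1024:.0f} days on 1,024 such GPUs")
```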
(Note: Most of the content on illustrated reasoning large language models will be included in my next translation work “Illustrated DeepSeek”)
DeepSeek R1
- What are the differences in the training process between DeepSeek R1 and R1-Zero, and what are their respective advantages and disadvantages? Given that the reasoning process generated by R1-Zero has poor readability and performs worse on non-reasoning tasks than R1, what is the value of R1-Zero’s existence? How does the R1 training process address the aforementioned issues of R1-Zero?
- Why is it said that DeepSeek R1-Zero might open a path for models to surpass human intelligence levels?
- Why can DeepSeek R1 produce much more interesting content than the DeepSeek V3 base model in creative writing tasks with only a short thought process?
- Why didn’t DeepSeek R1 use methods like PRM, MCTS, or beam search?
- What is the difference between GRPO used by DeepSeek R1 and PPO? How does advantage normalization solve the value function estimation problem in traditional PPO algorithms? (see the sketch after this list)
- What is the role of the KL penalty term in GRPO? Why can a KL penalty term that is too large or too small affect training outcomes?
- Why does DeepSeek R1 include 200,000 training samples unrelated to reasoning in the SFT stage?
- How does DeepSeek distill the reasoning capabilities of R1 into smaller models? If we want to distill a smaller vertical domain model ourselves, how can we retain as much of R1’s capabilities in a specific domain as possible?
- Compared to MQA, DeepSeek MLA actually occupies more KV cache, so why is MLA better than MQA? On which dimension does MLA perform low-rank compression?
- How does DeepSeek MLA resolve the incompatibility between RoPE positional encoding and low-rank KV? What issues might arise if other attention bias-based positional encodings are used?
- Why do the first three layers of the DeepSeek MoE model use dense connections while subsequent layers use MoE? What would be the impact if all layers used MoE?
- What is the difference between DeepSeek MoE and Mixtral MoE? What are the advantages of fine-grained expert partitioning and shared expert isolation in DeepSeek MoE?
- How does expert load balancing in DeepSeek MoE solve the routing collapse problem?
- From the perspective of large language models modeling concepts in language, why does R1-Zero’s chain of thought exhibit multilingual mixing phenomena?
- The method of R1-Zero is mainly suitable for tasks with clear verification mechanisms (such as mathematics, programming). How can this method be extended to more subjective fields (such as creative writing or strategic analysis)?
- If we want to train a model with an error rate of less than 1% for four arithmetic operations within 1000 on the basis of a non-reasoning model through RL, what is the minimum expected size of the base model, and how long is the RL process expected to take with how many GPUs? (Hint: TinyZero)
- On the basis of the QwQ-32B reasoning model, how can vertical domain capabilities be strengthened through RL in scenarios similar to OpenAI Deep Research? How should the training dataset be constructed, and how should the reward function be designed?
- DeepSeek R1 does not support multimodal. If we want to support image reasoning on the basis of R1, such as learning to navigate mazes or inferring geographic locations from photos, how should the training dataset be constructed, and how should the reward function be designed?
- What advantages does DeepSeek V3’s multi-token prediction method have in terms of sample utilization efficiency and reasoning efficiency compared to predicting one token at a time?
- In which matrix computations does DeepSeek V3’s mixed-precision training use FP8 quantization? How does DeepSeek V3 perform grouped quantization of activation values and weights to minimize the impact on model precision?
- What advantages does DeepSeek’s DualPipe parallel training algorithm have over traditional pipeline parallelism? How does it work in conjunction with expert parallelism to solve the load balancing problem of MoE models?
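For the GRPO question above: GRPO drops PPO’s learned value function and instead samples a group of responses per prompt, normalizing rewards within the group to obtain advantages. A minimal sketch of that step:

```python
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantages: z-score each reward within its group of samples
    for the same prompt, replacing PPO's learned critic/value baseline."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards for 8 sampled answers to ONE prompt (e.g., 1 = verified correct, 0 = wrong).
rewards = np.array([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(group_relative_advantages(rewards))
# Correct answers get a positive advantage, wrong ones a negative advantage,
# with no value model needed.
```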
(Note: Most of the content on DeepSeek R1 will be included in my next translation work “Illustrated DeepSeek”)
About “Illustrated Large Models”
“Illustrated Large Models—Principles and Practice of Generative AI” (Hands-On Large Language Models) is my first translation work, which will be released in mid-May.
The development of large models is rapid; as the saying goes, “a day in AI is like a year in the world.” Many people are lost in the flourishing garden of models, unsure of which model to use for their application scenarios, and unable to predict the direction models will take in the coming year, which often makes them anxious. In fact, almost all large models today are based on the Transformer architecture: endless variations on the same core.
This book, “Illustrated Large Models,” is an excellent resource to help you systematically understand the basic principles and capability boundaries of Transformers and large models. When Turing Company approached me to translate this book, I immediately agreed upon seeing the author’s name, as it was Jay Alammar’s blog post “The Illustrated Transformer” that truly helped me understand Transformers (Chapter 3 of this book is an expansion of that blog post). Although there are countless books and articles explaining large models on the market, the exquisite illustrations and the depth and clarity of explanation in this book are rare. The book starts with tokens and embeddings, not limited to generative models, but also includes representation models that many people overlook. Additionally, the book covers practical content such as text classification, text clustering, prompt engineering, RAG, and model fine-tuning.
I am very honored to be the translator of this book, working with editor Liu Meiying to bring this book to Chinese readers.
Take some time to read this book and systematically understand the basic principles and capability boundaries of Transformers and large models, just like having a map and compass on an adventure journey in the world of large models. This way, we won’t worry about newly released models rendering long-term engineering accumulation useless overnight, and we can develop products for future models. Once the model capabilities are ready, the product can immediately scale up.
I hope this book can become a sightseeing bus in the garden of large models, allowing more people to see the full picture of large models. Thus, the ever-expanding capability boundaries of large models become a visual feast rather than a monster devouring everything; we have the opportunity to stand at the forefront of AI, achieve more dreams, and gain more freedom.
Praise for the Book (Chinese Edition)
Many thanks to Yuan Jinhui, founder of SiliconFlow, Zhou Lidong, director of Microsoft Research Asia, Lin Junyang, algorithm lead of Alibaba’s Qwen team, Li Guohao, founder of the CAMEL-AI.org community, and Zhong Tai, founder of AgentUniverse, for their strong recommendations!