Making Friends with Foundational Model Companies—Six Forks Podcast
Original podcast content: Six Forks Podcast “R&D Positions Must Embrace AI, Then Spend the Remaining Time Conducting Experiments—A Conversation with Huawei’s First Batch of Genius Youth Li Bojie”
The following content is about 30,000 words, organized by the author using AI based on the podcast content. Thanks to Hunter Leslie for the wonderful interview and post-production. The two-hour session was a blast, with no re-recordings. Also, thanks to AI for allowing me to organize 30,000 words of content in one afternoon and supplement it with other previously written materials.
Hunter Leslie: Welcome to Six Forks, I’m Hunter. Today we’re going to talk about R&D-related topics, and our guest is Li Bojie. Bojie was jointly trained by USTC and Microsoft, and was part of Huawei’s first batch of genius youth. In just three years at Huawei, he became a level 20 senior expert. In July 2023, driven by his belief in AI, he ventured into entrepreneurship in the fields of large models and Web3. Bojie, please say hello to everyone.
Li Bojie: Hello everyone, my name is Li Bojie. I was an undergraduate at USTC in 2010, then a Ph.D. student at USTC and MSRA (Microsoft Research Asia) in 2014, and part of Huawei’s first batch of genius youth in 2019. In 2023, I left Huawei to start a business with my classmates.
Hunter Leslie: Yes, exactly. So the first question I really want to ask is: you joined Huawei as one of its first Genius Youth hires in 2019 and reached level 20 in just two or three years. Having been at Huawei myself, I know how difficult reaching that level is. Everything seemed to be going smoothly, so why suddenly start a business? Advancing that quickly within a platform is genuinely rare.
Li Bojie: If I were to sum it up in one sentence, I wanted to experience a different life through entrepreneurship and enable AI to better benefit humanity.
If I elaborate a bit more, I can share the story of how I first connected with AI. I was doing systems research as an undergraduate and didn't understand AI. But when I joined MSRA (Microsoft Research Asia), which is considered China's best AI lab and is often called the "Huangpu Military Academy" of AI, I was exposed to a lot of AI-related work even though I was doing systems and networking. I didn't initially learn AI algorithms, because many of us in systems joked that AI only had as much intelligence as the human labor poured into it. Why? Because AI was still quite primitive at the time and couldn't truly understand natural language. It could only capture patterns relating inputs to outputs in the data, and whether it truly understood anything was questionable.
I think it was in early 2017 when a lecture changed my entire perspective. I was at MSRA, and I can’t remember which professor it was, but they talked about two movies from 2013: one was “Her,” and the other was an episode from “Black Mirror” (Be Right Back).
In “Her,” as many of us now know, it depicts a general AI assistant that can listen, see, and speak, helping with daily tasks, making phone calls, solving social anxiety issues, and providing emotional value to the protagonist. The male protagonist, going through a divorce, finds solace in the AI and eventually falls in love with it.
The “Black Mirror” episode tells another story. The female protagonist’s husband dies, but she discovers she’s pregnant. Her friend recommends an AI digital clone, initially a text-based chat using online data. Later, it upgrades to voice, and she uploads videos to create a voice version. Eventually, she even orders a physical robot that looks just like her deceased husband. They continue living together, but the film also raises ethical questions about whether AI can truly replace a real person, which is still challenging.
These two movies were recommended by the professor, and after watching them, I was deeply moved. MSRA had many studies on text and voice processing, and it seemed technically feasible, especially the text chat in “Black Mirror.” Even in 2017, MSRA’s technology could achieve that. So, I wondered if I could train something using my chat logs because I had just broken up with my girlfriend, and we had 100,000 chat records. I tried using those records to see if I could train something.
But I didn’t know AI, and our group didn’t have GPUs since we were focused on systems and networks. Coincidentally, cryptocurrency mining was booming, and Bitcoin prices were soaring. I realized that mining GPUs were becoming expensive, and buying some might even be profitable. So, I spent tens of thousands on dozens of GPUs, like the old 980 and 1080Ti models, because it was 2017. I rented a basement in Beijing…
Hunter Leslie: Were you still in school at that time?
Li Bojie: Yes, I was still in school. I was part of the USTC joint program, but I spent most of my time in Beijing at MSRA. I found a cheap basement, ran an electrical line because the machines consumed too much power for regular wiring, and set up fans to prevent it from becoming a furnace. I assembled some cases and installed the GPUs. Most of the time, I wasn’t training models because I didn’t have much time; I was mainly mining to make money. When I sold the machines, I made more money than I spent on the GPUs because Bitcoin prices had risen so much that even second-hand mining cards were more expensive than their original prices, so I made a small profit.
Meanwhile, I occasionally trained models, but my skills were limited, and I didn’t understand AI well, so I could only use others’ models, which didn’t work well. We know that in 2017, even Transformers hadn’t emerged, so the AI models were quite outdated and ineffective.
Later, I learned about the Microsoft Xiaoice team, which was famous for its chatbot since 2013. In 2014, they hired a celebrity as a product manager, making it very popular. Xiaoice had many capabilities, including text conversation, voice, riddles, couplets, and poetry, so I learned a lot from that team and gained some understanding of AI.
Now, seven years later, in 2024, I find that the things we couldn’t achieve before are now possible because AI models have advanced rapidly. Whether it’s voice or text, these are no longer issues. As mentioned earlier, whether it’s the virtual assistant scenario in “Her” or the digital clone scenario in “Black Mirror,” they are now feasible, and our company has the technology.
Even for scenarios that are still challenging, like creating a robot identical to a boyfriend in “Black Mirror,” I thought it would take 20 years. But now, with embodied intelligence, or robotics, I think it might be achievable in five years because embodied intelligence is advancing rapidly.
So, I believe it’s an excellent era to turn many sci-fi movie scenarios into reality. We’ve seen many sci-fi movies, like Avatar and Marvel, involving physical laws or mechanical limitations that are hard to overcome in the short term. But with AI, these movie scenarios are either already a reality or could become one in the near future. Therefore, I think AI is incredibly exciting because sci-fi movies represent humanity’s aspirations for future technology. Turning sci-fi into reality will undoubtedly have immense commercial value.
Hunter Leslie: Yes. You mentioned embodied intelligence, and I think there's still a lack of consensus among entrepreneurs, investors, and some scholars. People believe that besides the brain you need a body, and even with some control algorithms it doesn't look easy. Even with powerful tools like Google's RT-2 it still seems challenging. Some say it might take ten years or more, while Elon Musk predicts around 2030. You seem relatively optimistic; can I understand it that way?
Li Bojie: Five years aligns with Musk’s expectations. I’m not overly optimistic. I think there are two major events in five years: one is AGI, where AI reaches or surpasses human general intelligence; the second is embodied intelligence, where humanoid robots become commercially viable.
Hunter Leslie: But in the next year or two, it might be AGI.
Li Bojie: Yes, AGI might come sooner in the next couple of years. Our company doesn’t work on embodied intelligence, and I’m not very knowledgeable about it, so I’ll just share some bold opinions. I think the biggest challenge is the latency of foundational models because current embodied intelligence still relies on non-large model methods, using traditional reinforcement learning for control. Large models have high latency, making it difficult to achieve precise low-latency control at the millisecond level. However, model advancements are rapid, and we can discuss this further later, as model latency is one of our key focuses.
Additionally, I feel that the mechanical aspect of robotics is relatively ready because many robot manufacturers demonstrate with someone controlling the robot remotely, and it works well. So, I think the missing piece is AI; once AI is solved, embodied intelligence will naturally follow.
Although I don’t have the capability to work on embodied intelligence, nor do I plan to touch on this area, I believe its potential is truly immense. As we mentioned earlier, the part of AI depicted in sci-fi movies is already on its way to becoming a reality, while the remaining mechanical and space exploration parts are tasks for embodied intelligence, whose current bottleneck also seems to be AI.
A quote from Liu Cixin deeply moved me, “You promised me the stars and the sea, but all I got was Facebook… From a long-term perspective, among countless possible futures, no matter how prosperous Earth becomes, those futures without space travel are dim.” Why has the world turned out as Liu described? Traveling to other planets is almost everyone’s common dream, so why does capital flow into the internet and AI instead of manned spaceflight? Because in recent decades, there hasn’t been a major breakthrough in energy technology, and the vast distances and strong gravity of the universe have become nearly insurmountable obstacles for human physical exploration.
However, we know that the speed of information transmission is the speed of light, making vast distances seem not so unreachable. Even if information requires a physical carrier, embodied intelligence might be more suitable for space environments than the human body. I believe AI is currently the most feasible technological route to spread human civilization deep into the universe. If AI can carry human intelligence on chips and survive, reproduce, and evolve autonomously, then why can’t chips be another form of life? Life has evolved significantly in form to adapt from the ocean to land. Why can’t adapting to the cosmic environment be another evolution of life? I don’t wish for humans on Earth to be replaced by AI, but why must life in space and on other planets take the form of human bodies?
Therefore, although some say AI is a bubble and AI products are hard to monetize, I don’t care about these. As long as what I do can help humanity realize the scenes in sci-fi movies, I have immense passion.
Hunter Leslie: Is your company now called Logenic?
Li Bojie: Actually, Logenic was our earliest name, and we haven’t used that name for a long time.
Hunter Leslie: So I’m not sure if this can be discussed. For example, what is your company doing now, and what problem do you want to solve?
Li Bojie: Actually, the name Logenic was something my co-founder Siyuan and I came up with together, but we didn’t think too much about what to do at the time, so we just picked a name. After naming it, we kept changing directions, and after changing directions, we stopped using that name because Logenic felt too generic and lacked specificity. Later, we switched to a more focused name, but that new name hasn’t been promoted or made public.
I think Lei Jun said something quite insightful, which is that in the early stages of entrepreneurship, don’t make a big fuss about promoting the personal relationship between the entrepreneur and the company. Why? He said, “If a person is unknown, they can focus on honing their skills.” He mentioned that when he founded Xiaomi, he had already succeeded in entrepreneurship with Kingsoft, and many people had high expectations of him. If he started again with MIUI, starting small, two problems might arise.
First, the company’s team might leverage his reputation for promotion, leading to two reactions when people hear about Lei Jun’s work. The first reaction is, “How could someone as impressive as Lei Jun create something as simple as MIUI?” The second reaction is, “Lei Jun’s work must be amazing, so I’ll use it without thinking.” This way, people actually ignore whether the product itself is good or not and don’t care.
Additionally, he has many resources, so he might directly buy traffic, right? Many foundational model companies do this too, spending 15 or 20 yuan per user to buy traffic, and suddenly gaining 10 million users. But by the second month, 95% of those users are gone. I think this kind of thing is mostly a waste of money or just convenient for fundraising, without much other use. So we haven’t promoted this aspect ourselves. But I think this might just be my personal opinion, and it might not be good, because most people still go for this rapid growth wave, right?
Hunter Leslie: Yes, yes, the "six little tigers," right? Very aggressive. So if this project is currently at a relatively confidential stage, looking at the medium to long term, say three to five years, what problem do you want to solve with AI in this wave of entrepreneurship?
Li Bojie: Before talking about the problem I want to solve, I think I should first share my thoughts on what small-scale startups like ours should do.
This year, I think the biggest realization is that small-scale startups must make friends with foundational model companies, not enemies.
Foundational model companies are those that pre-train foundational models, like OpenAI, Anthropic, or the domestic "six little tigers" you mentioned. These companies have enormous resources; a single funding round might be over a billion dollars, which lets them chase AGI. We know there's a scaling law for models: the larger the model, the higher its performance ceiling. So these foundational model companies can explore AGI, while small companies like ours find that very difficult. Nowadays even a billion dollars might not be enough to pursue AGI; it might take a hundred billion or a trillion dollars. Resources like that are clearly beyond entrepreneurs at our level, and competing with them on raw model capability is very risky.
The other situation is building an application that is just a thin layer on top of someone else's model. That is also dangerous, because you often see a foundational model company like OpenAI release a new model and a batch of startups suddenly gets killed. The problem is that you are competing on the same track as them. Many people say that every time OpenAI holds an event, a wave of startups dies. These startups are, in effect, making enemies of the foundational model companies.
But I think AI is still in the rapid ascent phase of the S-curve, and the capabilities of foundational models are rapidly improving. At this time, if I only do a little packaging on top of it, make some small engineering optimizations…
Hunter Leslie: A wrapper?
Li Bojie: It’s hard to have a moat because it gets replaced quickly. I have a deep understanding of this because I’ve made this mistake myself.
Last year, our team initially did a lot of fine-tuning work, one for voice and one for text. The voice fine-tuning was about creating celebrity voices, like Musk, Trump…
Hunter Leslie: Guo Degang
Li Bojie: For example, if I want to make a Guo Degang voice pack, I download a bunch of Guo Degang's voice clips and tune on them for a long time until whatever the model says sounds like Guo Degang. But this requires very clean, high-quality recordings. With Guo Degang that's fine, because he's a crosstalk performer and there are plenty of clear recordings. But take Musk, who speaks haltingly and whose YouTube audio isn't high quality: the downloaded voice data isn't clean, and training frequently breaks down. Once it breaks, there are many corner cases that are hard to fix, so the final result isn't good.
How was this problem finally solved? This year, new models came out with zero-shot learning, meaning I just need to upload a one-minute voice clip, and it doesn’t matter if there’s background noise or stuttering. No matter what kind of voice it is, it can mimic it, even stuttering.
Hunter Leslie: I saw ByteDance also released such a product, and the effect is quite good.
Li Bojie: Actually, some open-source ones are even better. For example, there's an open-source project I like called Fish Speech, from the company Fish Audio. You just upload a one-minute voice clip and it handles everything for you. Of course it's not perfect yet and there's still plenty of room for improvement, but it's at least commercially usable.
That was the first fine-tuning effort; what ultimately solved it was an improvement at the foundational model level, which made much of our earlier engineering optimization useless. The second effort was text fine-tuning. Back then we believed in one thing: take a small model and do a bit of fine-tuning. Fine-tuning means, for example, Trump speaks in an interesting way, so I want a model that mimics Trump's speech; I gather a lot of Trump speech material, and it really can imitate his style. But once that works, the problem is that the fine-tuned model often loses some of its original capabilities: it can mimic Trump's speech, yet it may fail the simplest elementary-school math problem. There are many such issues that are hard to resolve.
Also, consider the models from back then. When we first started we used LLaMA 1, the earliest open-source LLaMA. My co-founder at Berkeley worked on Vicuna, one of the first open-source dialogue models built on LLaMA. But because the foundational model's capabilities were limited, it often started talking nonsense after 20 rounds of dialogue, not knowing what to say. That wasn't a problem simple fine-tuning could solve.
But now, with the same cost, models at the 7B or 8B level don’t have this problem. Whether it’s domestic Qwen 2.5 or overseas LLaMA 3.1, or the newly released Yi Lightning, they don’t have this issue. So with the improvement of foundational model capabilities, fine-tuning might not even be necessary. I just need to set the character, like putting Trump’s speech style into a prompt, and using the best models now, like Yi Lightning, OpenAI’s latest GPT-4o, or Claude 3.5 Sonnet, they can all handle it without needing fine-tuning.
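To make "setting the character in a prompt" concrete, here is a minimal sketch using the OpenAI Python SDK as one example of a chat API; the persona text and model name are illustrative placeholders, and the same pattern works with any sufficiently strong chat model.

```python
# Minimal sketch: persona via system prompt instead of fine-tuning.
# Assumes the OpenAI Python SDK (>=1.0) and OPENAI_API_KEY in the environment;
# the persona text and model name are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

PERSONA = (
    "You are role-playing a bombastic politician. Speak in short, punchy sentences, "
    "use superlatives ('tremendous', 'the best'), and wander off on tangents."
)

def chat(user_message: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o",  # any capable chat model works here
        messages=[
            {"role": "system", "content": PERSONA},  # the "character card"
            {"role": "user", "content": user_message},
        ],
        temperature=0.9,
    )
    return resp.choices[0].message.content

print(chat("What do you think of electric cars?"))
```

No gradient updates, no risk of the base model forgetting arithmetic; the character lives entirely in the prompt.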
I might have rambled a bit earlier, but the point I wanted to make is that the fine-tuning and engineering changes we made can easily be swept away by the progress of foundational models. So it’s like we’re doing things that make us enemies with foundational model companies.
Then I started thinking about how to make friends with foundational model companies. I now have two angles of thought. The first is that we should do some system optimization. OpenAI or Anthropic, these big companies, are improving the algorithms inside the models, but on the system periphery, like if I’m making a complete application, there are many things outside the model that need improvement, which I’ll discuss in more detail later.
The second thing is, I think now these AIs, whether models or products, are still unknown to most people. Even many AI industry practitioners don’t have a good understanding of the latest developments in AI models, the boundaries of each model, when to use which model, and how to write prompts. So I think bridging the information gap is also crucial.
Hunter Leslie: So, in the future, you might mainly focus on these two directions?
Li Bojie: Yes, in the future, I might mainly want to focus on these two directions.
Specifically, the first thing is system optimization based on first principles.
My path runs from tinkering with systems in the Linux user group as an undergraduate, to Ph.D. research on high-performance data center systems at MSRA (which did a feature story on me titled "Tinkering with Systems to Improve Performance by 10 Times"), to system performance optimization work at Huawei. I have a habit of thinking from first principles: given the hardware's capabilities, what should this application's performance be, what is it now, and what accounts for the gap in between? First principles aren't only Musk's thing; Google's Jeff Dean advocates the same way of thinking.
When many people talk about system optimization, they think of AI training and inference optimization, which is also very competitive. Many people might think that training and inference optimization mainly involves optimizing CUDA operators and designing new Attention algorithms or position encoding, but that’s not the case. There are far more areas worth optimizing in the system than just the model itself.
Something recently left a deep impression on me. We had previously calculated that serving a 70B model on H100 machines and selling the API at today's prevailing market prices would definitely lose money. Are the companies doing inference and selling APIs really operating at a loss? Then in September, the community version of Berkeley's vLLM 0.6 suddenly improved performance by 2.7 times. Why? In the previous version, only 38% of the time was spent on GPU computation; the rest was wasted in the HTTP API server and scheduling, including contention on Python's global interpreter lock (GIL). The 2.7x improvement came not from CUDA operators, attention algorithms, or quantization, but from the seemingly mundane HTTP server and scheduler. I talked to friends at several companies, and they said they had already done these optimizations internally, plus some the vLLM community version still lacks, so their actual inference performance is much higher than the community version's.
If we don't focus only on model inference but look at the entire AI application end to end, we find many points that matter for AI applications that no one is addressing. For example, API call latency is critical for real-time interactive applications, yet many large model, speech synthesis, and speech recognition APIs have high time to first token (TTFT). Take Fish Speech, the speech synthesis open-source project I like: its first-token latency is only 200 milliseconds when called from the US East Coast, but over 600 milliseconds from the West Coast. A ping from the West Coast to the East Coast is only 75 milliseconds, and US data center networks are fast, so the 500 KB of synthesized speech should take only about 5 milliseconds to transmit. The latency from the West Coast should therefore be around 300 milliseconds in theory; why is it 600? One reason is TCP slow start, another is the multiple round trips of connection establishment. These are very basic wide-area network optimizations, the kind that might not even make for great papers, yet neither Google nor Cloudflare has done them, nor have I seen anyone else do so.
The OpenAI API is similar. First-token latency is only about 400 milliseconds when accessed from the West Coast, but it can be 1 second from Asia. The network latency between the West Coast and Asia is only about 200 milliseconds, so where did the remaining 400 milliseconds go? Again, connection-establishment overhead. OpenAI actually fronts its API with Cloudflare, and the access point in Asia resolves to Cloudflare's Asian edge IPs, which means even a service as large as Cloudflare leaves a lot of optimization on the table. And is this API latency unique to AI? No, other APIs are the same. The problem has existed since the day the internet appeared, for decades; academia knows the solutions, and many big companies quietly use them internally, but most people simply don't know.
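One concrete way to see where those extra hundreds of milliseconds go is to measure time-to-first-byte on a cold connection versus a reused (keep-alive) one. The sketch below uses httpx against a hypothetical streaming endpoint; the URL and payload are placeholders, but the pattern, paying DNS, TCP, and TLS handshakes and slow start only once per client instead of once per call, is exactly the kind of plumbing optimization described above.

```python
# Sketch: measure time-to-first-byte (a proxy for TTFT) with and without
# connection reuse. The endpoint and payload are hypothetical placeholders.
import time
import httpx

URL = "https://api.example.com/v1/tts/stream"   # placeholder streaming API
PAYLOAD = {"text": "hello world", "voice": "demo"}

def ttfb(client: httpx.Client) -> float:
    start = time.perf_counter()
    with client.stream("POST", URL, json=PAYLOAD) as r:
        for _ in r.iter_bytes():
            return time.perf_counter() - start  # time to first chunk
    return time.perf_counter() - start

# Cold connection every time: pays DNS + TCP + TLS handshakes + slow start per call.
for _ in range(3):
    with httpx.Client() as cold:
        print("cold:", round(ttfb(cold) * 1000), "ms")

# Warm connection: one client, handshakes amortized across calls.
with httpx.Client() as warm:
    for _ in range(3):
        print("warm:", round(ttfb(warm) * 1000), "ms")
```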
I think many people, especially those working on algorithms, only care about model performance metrics and not system performance metrics. But in real business scenarios, performance metrics are often crucial. For example, many companies providing large model API services don’t pay attention to the TTFT (first token latency) metric. It’s not just the voice calls mentioned earlier; AI Agents are also very sensitive to TTFT. For instance, Claude 3.5’s Computer Use is still relatively slow, and other RPA Agents that fully use large models and can operate phones and computers take four to five seconds to respond, which is a latency issue. Embodied intelligence robots now find it difficult to directly use large models for control, also due to latency issues.
Another example I want to share is voice calls. At the end of last year, when we made our first voice call demo, the latency was as high as 5 seconds; ChatGPT's earliest voice call feature had a similar latency. So we analyzed what was slow: for the AI side, we worked out the theoretical compute and memory traffic required for a single model inference and compared that with the GPU's spec sheet. The actual time was often more than 10 times the theoretical minimum, which means we had more than 10x optimization headroom. On top of that, we meticulously shaved off the time wasted in network protocols, database access, and the client. Step by step we went from 5 seconds to 2.5, 2, 1.5, 1 second, 750 milliseconds, and now we achieve 500 to 600 milliseconds end to end, faster than anything else I've seen, and the whole system runs on a single 4090.
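As an illustration of that first-principles calculation, here is a back-of-the-envelope estimate of the fastest possible per-token decode time on a single consumer GPU: single-stream decoding is memory-bandwidth bound, so each generated token must read every weight once, and the floor is roughly model size divided by memory bandwidth. The model size and bandwidth numbers below are ballpark assumptions for illustration, not measurements of our system.

```python
# Back-of-the-envelope, first-principles decode latency estimate.
# Assumption: single-request decoding is memory-bandwidth bound, so each token
# requires streaming all weights from GPU memory once. Numbers are ballpark.
params          = 8e9        # 8B-parameter model (assumed)
bytes_per_param = 2          # FP16/BF16 weights
mem_bandwidth   = 1.0e12     # ~1 TB/s, roughly an RTX 4090 (assumed)

weight_bytes = params * bytes_per_param        # ~16 GB to read per token
t_per_token  = weight_bytes / mem_bandwidth    # lower bound, in seconds
print(f"theoretical floor: {t_per_token*1e3:.1f} ms/token")   # ~16 ms

# If the measured end-to-end time per token is, say, 200 ms, the gap between
# that and the ~16 ms floor is the "10x optimization space" referred to above.
```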
Moreover, because we use the latest open-source voice cloning model, the AI can mimic anyone's voice. I can use any person or game character I like as a voice pack, and I can even pull Trump and Musk into a group chat with me. This is something OpenAI absolutely cannot offer. It's not that OpenAI is incapable of voice cloning; its scale is simply too large and the copyright risk too high. OpenAI already got into trouble over GPT-4o merely because one of its voices sounded like the actress who played the AI in "Her." If every celebrity's voice could be freely mimicked, OpenAI would certainly be in trouble.
Such an extremely optimized voice call that can mimic anyone's voice costs no more than 3 cents (0.03 yuan) per hour, while OpenAI's latest realtime API costs 120 yuan per hour and still has higher end-to-end latency than ours. If you stitch a system together from OpenAI's speech recognition, large model, and speech synthesis, it still costs about 6 yuan per hour, and the end-to-end latency is certainly over 2 seconds. After getting used to a 500-millisecond voice call, going back to a 2-second one really feels broken. What does 3 cents per hour mean? Watching an hour of high-definition video on Bilibili probably costs more than 3 cents in bandwidth. Look at the pricing of Tencent Cloud's and Agora's RTC services: even the externally sold price of WeChat-grade voice calling is close to 3 cents per hour. This means large model cost is no longer the issue; many applications no longer need to charge users subscriptions and can adopt the free internet business model instead.
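The cost claim is simple amortization: divide the hourly cost of the GPU by the number of concurrent calls it can sustain. The rental price and concurrency below are purely illustrative assumptions to show the shape of the calculation, not our actual figures.

```python
# Illustrative amortization of voice-call cost on one GPU. All numbers are
# assumptions for the sake of the arithmetic, not measured figures.
gpu_cost_per_hour = 2.0    # yuan/hour to rent one RTX 4090 (assumed)
concurrent_calls  = 60     # simultaneous voice sessions one card sustains (assumed)

cost_per_call_hour = gpu_cost_per_hour / concurrent_calls
print(f"{cost_per_call_hour:.3f} yuan per call-hour")   # ~0.03 yuan, i.e. about 3 cents
```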
The second thing is to bridge the information gap.
I found that many things I consider common knowledge within the circle are unknown to some people within the same circle. For example, some people think Anthropic was the first to make AI operate computers, but AI operating computers and phones has been around for a long time, and RPA has been around for many years. Anthropic just improved a 7.9% benchmark to 14.9%, and today this benchmark has been surpassed again, while humans are at 75%.
There’s also the issue of this year’s Nobel Prize in Physics. Many people ask, what does a Boltzmann machine have to do with large models? When GPT-4o’s realtime API came out, many people were trying it and saying AI could finally make calls. I reminded them to be careful not to blow up their bills. Some people don’t know that Claude 3.5 Sonnet is currently the model with the strongest general programming ability.
AI can help a lot in bridging the information gap. We know that previously, people searched for information through websites and search engines, then information found people through recommendation systems. Now AI can generate unique information for each person. After AI interacts with you more, it knows where your knowledge boundaries are, making recommendations much more efficient.
Speaking of bridging the information gap, I remember an interesting project I participated in at school, the USTC Course Evaluation Community, which is a website where students review courses. The reason for its creation was that my girlfriend at the time didn’t know which course to choose and had no idea where to find relevant information, so she wanted to create a website for students to share course experiences. She pulled me and my roommate to develop this website, and now it has hundreds of thousands of visitors every month. Some students who went to other schools for graduate studies said that without a course evaluation community like this, they didn’t know how to choose courses. This is an example of bridging the information gap.
One regret I have is that much of my earlier work was never published, or was published as a paper but never open-sourced. For example, our AKG automatic operator generator, published at PLDI 2021, already contained the core idea behind FlashAttention: it automatically performs operator fusion and loop tiling, trading off recomputation against storing intermediate results. I remember that when I joined Huawei in 2019, I was responsible for writing one operator: softmax. To fuse the softmax operator, an online algorithm was needed. I searched and finally found a paper NVIDIA published in 2018 proposing an online algorithm that computes softmax in a single scan over the data. With that algorithm, combined with the AKG framework, the preceding matrix multiplication and the following softmax could be fused. But at that time I didn't even know what Attention was, so I couldn't have invented FlashAttention. Still, if AKG had been open-sourced back then and the community had used it, FlashAttention might have been invented earlier.
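For readers who haven't seen it, here is a minimal sketch of that one-pass ("online") softmax idea: keep a running maximum and a running normalizer, and rescale the normalizer whenever the maximum changes. This single-scan formulation is what lets softmax be fused with the preceding matrix multiplication, and the same trick later became a building block of FlashAttention.

```python
# Minimal sketch of online (one-pass) softmax: maintain a running max m and a
# running normalizer d, rescaling d whenever the max changes. The max and the
# normalizer are obtained in a single scan over the inputs.
import math

def online_softmax(xs):
    m = float("-inf")   # running maximum
    d = 0.0             # running sum of exp(x - m)
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return [math.exp(x - m) / d for x in xs]

print(online_softmax([1.0, 2.0, 3.0]))
# Matches the usual multi-pass softmax, but needs only one pass to get m and d.
```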
Similarly with RPC: I built a very high-performance RPC framework at Huawei. Earlier we mentioned the high API call latency between the West Coast and the East Coast; the framework I built could solve that. But the work wasn't worth a paper, because the techniques involved had already been proposed by academia; what's missing is a good open engineering implementation. Each big company probably has plenty of proprietary optimization tricks internally, but those tricks are locked in safes, which itself creates an information gap.
There are also some works I did during my Ph.D., like ClickNP, a framework for developing network functions on FPGA using high-level languages. At that time, it was just a paper and not open-sourced. If it had been open-sourced, programming network functions on FPGA in academia would probably be much simpler. ClickNP was mainly used for research purposes at Microsoft, and after I stopped doing FPGA research, it was probably locked in a safe. Therefore, I think this is a waste of valuable intellectual resources for humanity. To this day, there isn’t an open-source framework like ClickNP that supports developing network functions on FPGA using high-level languages. Students in academia writing network functions on FPGA either use difficult-to-write Verilog or general HLS tools, without a framework specifically optimized for network programming. If it were open-sourced, many people would continue to use and improve it, even if one day I no longer contribute to the project, as long as the project hasn’t been eliminated, others would continue to maintain it.
I think there is a lot of information asymmetry because experts assume things are too obvious and don’t explain them in detail, but most people don’t understand. For example, in the Transformer paper, there is actually a small footnote that hints at the KV Cache concept. The authors might have thought this was obvious, but for most readers, it wasn’t, so KV Cache was reinvented.
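Since the KV Cache comes up here, a minimal sketch of the idea (plain NumPy, single head, no batching, names chosen just for illustration): during autoregressive decoding, keep the keys and values of all previous tokens so each new token only computes attention for its own query, instead of recomputing the whole prefix every step.

```python
# Minimal single-head KV-cache sketch in NumPy: cache K/V for past tokens so
# each decoding step only processes the newest token's query.
import numpy as np

d = 64
K_cache, V_cache = [], []   # grows by one row per generated token

def decode_step(x_new, Wq, Wk, Wv):
    """x_new: (d,) hidden state of the newest token."""
    q = x_new @ Wq
    K_cache.append(x_new @ Wk)        # cache this token's key
    V_cache.append(x_new @ Wv)        # and its value
    K = np.stack(K_cache)             # (t, d): keys of all tokens so far
    V = np.stack(V_cache)
    scores = K @ q / np.sqrt(d)       # attention logits for the new query only
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V                      # attention output for the newest position

rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
for _ in range(5):                    # 5 decoding steps, no prefix recomputation
    out = decode_step(rng.standard_normal(d), Wq, Wk, Wv)
print(out.shape)                      # (64,)
```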
Some information asymmetry is because most people’s attention is limited. If an article covers too much, things in the corners go unnoticed. Leslie Lamport, a Turing Award winner from Microsoft Research and a pioneer in distributed systems, told us that his most famous paper introduced the concept of relative space-time from relativity into computer distributed systems, proposing the concept of logical clocks. He mentioned in the paper that using such relative clocks can determine the order of all input messages, thus implementing any state machine and any distributed system. However, many people told him they never saw the state machine in the paper, making him doubt his memory. I think it’s because the concept of logical clocks is already mind-bending, and most readers feel they’ve learned a lot just by understanding it, so they might not have paid much attention to the state machine part.
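For readers who haven't read that paper, a minimal sketch of Lamport's logical clock rules: each process keeps a counter, increments it on every local event and send, and on receive sets it to max(local, received) + 1. Ordering events by (timestamp, process id) then yields one total order of messages that every replica can apply, which is exactly the state-machine construction most readers missed. This is only an illustrative toy, not the paper's full machinery.

```python
# Minimal sketch of Lamport logical clocks. Ordering events by (clock, pid)
# gives a total order consistent with causality, which replicated state
# machines can use to apply the same commands in the same order.
class Process:
    def __init__(self, pid):
        self.pid = pid
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self, msg):
        self.clock += 1
        return (self.clock, self.pid, msg)   # timestamped message

    def receive(self, stamped):
        ts, _, msg = stamped
        self.clock = max(self.clock, ts) + 1  # jump past the sender's timestamp
        return msg

a, b = Process("A"), Process("B")
m = a.send("apply: x = 1")
b.local_event()
b.receive(m)
print(a.clock, b.clock)   # B's clock has advanced past the message's timestamp
```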
Since most people's attention is limited, the lesson for both papers and products is that the focus must be sharp enough to explain in one sentence; otherwise good things hidden in the corners are hard to discover. A product manager told me that several recently popular apps have features that ByteDance's Coze (its agent platform) also offers, but Coze hosts so many agents that users can't tell at a glance what to do, so its traffic doesn't match that of a single focused hit.
I have another thought: bridging the information gap actually runs against the essence of business, which relies on information asymmetry for its moat. Moreover, most people are too lazy to think and unwilling to learn new things, so bridging the information gap is painful for most people. That's why many companies that start out dedicated to bridging the information gap begin with very high-quality content, but after reaching a certain scale they drift toward the lowest common denominator and turn into time sinks. If I had to choose between founding a Wikipedia and a TikTok, I would definitely choose Wikipedia.
Hunter Leslie: I have a question. I've talked with some friends who are starting companies, and it's hard, because large models iterate quickly and swallow up many capabilities. Finding a direction is inherently difficult. In such a huge demand space, how do you find the first thing to do? Do you have any good methods? If I were starting a business now, I would certainly have to think about this. Is there a good method for locating something worth doing?
Li Bojie: Are you saying that I now set a big framework saying I want to start a business, and now, okay, start looking for where I want to do AI, right?
Hunter Leslie: But I don’t want to be wiped out by OpenAI, so what should I do? Is there a methodology for this demand thing? What do you think?
Li Bojie: At the beginning I didn't know what to do either. I explored slowly, adjusted direction back and forth, and stepped on plenty of pitfalls. But later, talking to experienced people in entrepreneurial circles and watching interviews with outstanding founders like Zuckerberg and Lei Jun, I found they tend to prefer the so-called "20% time entrepreneurship" idea. What does that mean? They first spend 20% of their time on something part-time, discover that the side project is genuinely popular with users, and only then expand it into a commercial project. That has a much higher chance of success. A recent example: Google's hottest product, NotebookLM, famous for its podcast-style audio overviews, is a Google 20% project. Google has spent so much money and so many people on so many products, but none of them…
Hunter Leslie: Made a comeback, right?
Li Bojie: Yes, made a comeback, this is a 20% project.
This fundamentally raises a question, why is it difficult for big companies to innovate? We all know the difference between top-down innovation and bottom-up innovation.
Once I decide to start a business, okay, now there’s money, people, everything is set up for you, and you have to find a nail to hit quickly, and it must be scalable, right? At this time, sometimes actions become distorted, meaning I don’t know what to do, or I think this thing is too small to be worth doing because many things potentially have a lot of market potential or many users, but at first, you don’t know it has so much demand. So, maybe at the beginning of the discussion, you think this thing probably won’t be used by many people, so you dismiss it.
At this time, it’s easy to converge on some common needs, which are those common needs that everyone can see. In the end, it’s very likely to compete head-on with big companies, for example, I want to make a better ChatGPT, or I want to make a better Siri, or I want to make a better foundational model, right? Basically, it’s about tossing around a few things, which is definitely the track of head-on competition with big companies.
Hunter Leslie: Hmm. Say I'm a researcher at DeepMind and I want to start a business. I understand the initial direction might come from my own interest, or from a pain point I have myself, and then I go do that thing. It's a bit like elite entrepreneurship: I spend some time building an MVP, see how it runs, and if it works I keep refining it, gradually getting closer to the thing I ultimately want to do, rather than being fully committed from day one. So there's a process; it's not like someone stands up and says "I'm going to build Apple" and then just gets it done.
Li Bojie: Hmm, this logic is correct because you see Facebook was also made when he was in school, right? And at first, it was just about which student photo looked better, right? And Google wasn’t specifically started as a business either, it was also first made as a search engine algorithm in school, right? Then Larry Page and others took it out to start a business. Many other companies are actually similar.
Unless there’s a model like Copy To China, which can be done. That is, it’s already there, right? Then I take it over and copy it, I have money to spend, and buy users and traffic, I think this can be done.
Hunter Leslie: Do you think this is feasible now? Because you see Meituan, and including Alibaba’s e-commerce, which originally copied eBay, it seemed possible during the mobile internet era. Do you think today, for example, in the recruitment industry, which is quite popular recently, like Mercor with a valuation of 250 million dollars, and Final Run, if I directly copy it and do it in China, what do you think?
Li Bojie: Today, this logic, I think, is also valid. You say all our other foundational model companies are not all copying OpenAI? It’s not just China, you say Anthropic is also made by people from OpenAI, right? So OpenAI is the pioneer, and everyone else is chasing after it. But now, Anthropic seems to be running possibly faster than OpenAI, right? This is also hard to say, and you can’t say that those running behind are necessarily followers, right?
Hunter Leslie: For copy-to-China in AI, what if some adaptation is needed? China and the US are different, and user habits differ too. 2C and 2B are also different; say I decide to do a 2C product, I might need some micro-innovation, and the domestic models are a lot weaker than the overseas ones. The environments differ as well: one issue is willingness to pay, and the other is that their models are inherently much stronger than ours, and for AI products the underlying model is pretty much your lower bound. So it feels hard to build something that surpasses the equivalent Silicon Valley application. That's my impression.
Li Bojie: You see, Baidu’s search effect hasn’t surpassed Google, right? But Baidu is still doing well in China, right?
So, I think first, in many scenarios, the model is enough, which also involves my judgment of the future. I think a GPT-4 level model, as long as its cost drops a bit, which is already showing a rapid downward trend, is basically enough for most application scenarios. Because now, for example, GPT-4, it’s basically like a liberal arts student, right? That liberal arts student’s abilities are basically very useful for normal writing and other daily tasks.
But many people haven’t used it, I think mainly for two reasons: The first is that its cost is still relatively high, so in many places, it can only use a paywall to block users, for example, you can only use the best model if you pay, and you can’t afford it if you don’t pay. The second is that most users haven’t developed the habit, just like when the iPhone 1 first came out, maybe those who liked electronics thought this thing was really cool, but most people still said my Nokia is better, right?
Hunter Leslie: There’s a first adopter, user habit development issue.
Li Bojie: If its cost really drops to a very low price, and users can use it freely, then I think a GPT-4 level model is actually enough for most daily scenarios. In this way, most domestic foundational model companies have actually reached this level. The next step is to figure out how to reduce the cost to a very low price.
The other path mentioned is moving toward AGI, and here I personally lean toward the view of Anthropic's CEO. He recently published a long essay arguing that future AGI-level models will certainly be very large and possibly very expensive. Such AGI may not be for ordinary people to use; instead there will be millions of these super-intelligences, each smarter and more talented than any real person, forming a "country of geniuses."
This so-called country of geniuses would be used to tackle the most important problems in human science: medicine, social science, natural science, biology, fields that require a lot of experiments which humans run very inefficiently, so progress in them is very slow. With AI, you can make many copies, which is like having millions of top scientists doing research for you every day. The scientific progress of the next 50 to 100 years could then be compressed into 5 to 10 years. He expects AGI within about 5 years; give it another 5 to 10 years after that, and you get 50 to 100 years' worth of scientific progress.
Hunter Leslie: So he expects that in 10 or 20 years, the average human lifespan could reach 150 years.
Li Bojie: You probably saw that too, right? Of course, that might be a bit optimistic, but I personally agree with this direction. I think those models, the really impressive ones, will be very expensive. So, they are meant for high-end scientific research scenarios.
Hunter Leslie: The next question: you just mentioned that new models keep iterating, and applications still depend heavily on the underlying models. Recently we've seen OpenAI, whether under shareholder pressure or for fundraising, release o1, and there's talk that Orion may come out in December with 100 times the performance of GPT-4. The Anthropic CEO's statements you mentioned might have similar motives: this work is very complex, needs funding, and needs to keep the team stable, so promoting it externally attracts attention and resources. Do you think this is normal, or is there some hype involved?
Li Bojie: I think you have a point; there are definitely some purposes for fundraising. Just like OpenAI’s usual approach is to hold back big announcements until everything is ready, but you can see that the recently released real-time voice API and o1 both have an unfinished research feel. Actually, these are because it might need to raise a large amount of money, so it has to do this. Including the Anthropic CEO, who previously had a relatively pessimistic attitude, calling for AI to be safe. His original intention of leaving OpenAI to start Anthropic was to take things slowly. But why has he suddenly changed to a more optimistic tone? It’s definitely to raise money.
I think there is indeed a fundamental problem, which is that creating AGI requires a lot of funding. That’s also why I don’t want to go head-to-head with OpenAI or Anthropic, because now OpenAI has used tens of billions of dollars, which is far from enough for AGI. Many think tank analysis reports show that from GPT-2 to GPT-4, computing power may have increased by 1,000 to 10,000 times. If it comes to AGI, a similar level of increase might still be needed.
GPT-4 already used on the order of 100,000 chips; a 1,000-fold increase would mean 100 million chips, and 10,000-fold would mean a billion. Given today's global chip manufacturing and energy capacity, if 100 million chips were needed, the energy consumed would already exceed the total power draw of all data centers worldwide. But human energy production is still growing roughly linearly, and controlled nuclear fusion hasn't made much progress in decades, so it's hard to expect energy supply to suddenly grow tenfold in five years. Limited energy and chip manufacturing capacity are therefore major challenges for AGI.
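The energy claim can be sanity-checked with rough numbers: an H100-class accelerator draws on the order of 1 kW once cooling and overhead are included, so 100 million of them would need on the order of 100 GW of continuous power, whereas all data centers worldwide today are estimated to draw only a few tens of gigawatts on average. These are order-of-magnitude assumptions, not precise figures.

```python
# Order-of-magnitude sanity check of the energy argument. All inputs are rough
# assumptions (per-accelerator power including overhead, chip count).
chips          = 100e6     # 100 million accelerators (the 1,000x scenario above)
watts_per_chip = 1_000     # ~1 kW each with cooling/overhead (assumed)

total_power_gw = chips * watts_per_chip / 1e9
print(f"{total_power_gw:.0f} GW of continuous power")   # ~100 GW

# For comparison, estimates put today's worldwide data-center draw at a few
# tens of GW on average, so this scenario exceeds all existing data centers.
```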
With only that much energy, and with chip manufacturing capacity that can only improve gradually, I think the scaling might hit its limit at roughly a 1,000-fold increase. And at that 1,000-fold point, can you still afford to train AGI? It means needing 1,000 times the money, going from roughly 1 billion dollars to 1 trillion dollars for energy and chips, which requires raising an enormous amount of capital. So yes, o1 certainly serves that purpose too, including making the whole of society realize how important this is and how big it could become.
Speaking of o1, let me say a few more words. My personal view is that o1 is actually a very big breakthrough, and many in the industry also say it has opened up a new paradigm.
The first part is reinforcement learning: using RL to make up for the shortage of training data. As we said, models at the GPT-4 level have essentially exhausted the high-quality text data of human society. Where does new data come from? If you just let the model generate freely, it's garbage in, garbage out, which is hard to work with. o1's answer is reinforcement learning in the style of AlphaGo's self-play. It leans on domains like mathematics and programming, where answers are clearly right or wrong, because the reward function has to know whether an answer is correct. Since correctness there is easy to judge, the model can generate an essentially unlimited stream of training data for post-training, and post-training may end up using more data than pre-training. In this way the data can expand without limit.
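A toy illustration of why math and code make good reward signals: the checker is just exact execution or exact comparison, so generated solutions can be graded automatically at any scale. The snippet below is a hypothetical grader of my own, not OpenAI's actual training setup.

```python
# Toy "verifiable reward" for generated code: run the candidate against unit
# tests and return 1.0 only if everything passes. Purely illustrative; real
# RL post-training pipelines are far more involved (sandboxing, timeouts, ...).
def reward_for_code(candidate_src: str, tests: list[tuple[tuple, object]]) -> float:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)            # define the candidate function
        f = namespace["solution"]
        return 1.0 if all(f(*args) == expected for args, expected in tests) else 0.0
    except Exception:
        return 0.0                                # crashes score zero

candidate = "def solution(a, b):\n    return a + b\n"
print(reward_for_code(candidate, [((1, 2), 3), ((5, 7), 12)]))   # 1.0
```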
The second part is test-time scaling: spending more "slow thinking" time during inference, which is also crucial. If you give me a math problem and demand an answer within one second, it's hard for me too, right? Each token carries only a fixed amount of computation, so if the model must answer immediately, its thinking is bounded. Give it more thinking time, letting it write out the intermediate reasoning step by step, and its accuracy can improve a great deal.
For example, models previously often couldn't tell which of 3.8 and 3.11 is larger. That's like being handed two long numbers and asked to compare them within a second: working on intuition, I'd easily get it wrong too. Given more time, I have a methodology: compare digit by digit. OpenAI's o1 actually does this digit-by-digit comparison, with the logic written explicitly into the RL process and the test-time process, so by following the methodology it won't make mistakes.
I think this is very, very crucial because the issue of AI making mistakes was actually a key factor preventing it from being used normally in large-scale commercial applications. For example, in many commercial scenarios, like some 2B cases, some banks come to us and ask if we can use large models to do accounting. I say this can’t be done now because even a single-digit error in accounting is a big problem, and the current large model accuracy is at most 90%, which is far from what is desired, much lower than human accuracy.
Secondly, some scenarios are more complex. I've been working on agents, having them perform slightly more involved tasks step by step. If each step succeeds only 90% of the time, then after 10 steps the overall success rate falls to roughly 35%, and it keeps decaying exponentially from there. But if each step succeeds 99.9% of the time, after 10 steps the success rate is still about 99%. It's an exponentially compounding process, so single-step accuracy must be high enough, at least higher than a human's, for the agent to be useful. If o1 keeps going in this direction, it solves the critical question of whether AI can be used in serious, high-value commercial scenarios.
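The compounding is a one-line calculation:

```python
# Exponential compounding of per-step success rates over a 10-step task.
print(0.9   ** 10)   # ≈ 0.349: a 90% per-step agent finishes only ~35% of 10-step tasks
print(0.999 ** 10)   # ≈ 0.990: a 99.9% per-step agent still succeeds ~99% of the time
```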
Hunter Leslie: So this CoT (chain of thought) approach, do you think it might be a completely new paradigm, different from the previous next-token prediction?
Li Bojie: It’s still Next Token Prediction.
Hunter Leslie: You think it’s still Next Token Prediction?
Li Bojie: Yes, it definitely is, because it still generates one token at a time; it just writes out the thinking process. I remember a line from "Sapiens" saying that human thinking is conducted through language. CoT is exactly that: it writes the thinking process out in linguistic form. Its language might not be English or Chinese, it might be its own intermediate language, but either way the thinking is written out as language, and so the thinking process itself becomes data.
Hunter Leslie: Yes, I suddenly thought of a question. There now seems to be a consensus that by 2029 or 2030, AGI might be 10,000 times stronger than human intelligence. If some black swan event occurs, what we're saying today might turn out to be wrong; maybe in 2029 we can revisit whether our predictions held up. Do you think anything could stop this from happening? Like Huawei's red team versus blue team exercises, where one side imagines why the thing cannot happen: what might cause AGI not to be realized?
Li Bojie: Many things could lead to that. The first is that the so-called scaling law hits a bottleneck and can't go further, which is possible. Everyone knows the scaling from GPT-2 to GPT-3 to GPT-4 worked, but whether it scales to GPT-5, nobody knows; even inside OpenAI they may be hitting difficulties. If they had GPT-5 ready, they wouldn't need o1 to fill the gap, right? So GPT-5 hasn't been trained to the level they want. They may have many interesting things internally that they don't feel ready to disclose, but it does suggest real difficulties.
The second thing is that even if the scaling law holds and continues to grow, before reaching the AGI level, human electricity energy or chip production capacity might be exhausted, meaning that even if all of humanity’s production capacity is concentrated, AGI might not be achievable. You can’t cover the Earth’s surface with solar panels to make it happen, right?
The third possibility is that investors have lost confidence, because after all, this isn’t a matter of life and death for humanity. People like me, who are very radical, the so-called e/acc faction, are still relatively few. Most people are pragmatic. If investors see no profit after five years, they might stop investing, right? Because a large part of humanity is pragmatic, if they find they can’t get returns in the short term, they might stop further investment, which is the third possibility.
The fourth possibility is geopolitical factors, because AI has a very high potential to pose a significant threat to humanity, which is why people like Ilya often bring this up. Could it be that one day it’s considered similar to nuclear weapons, meaning it truly has the capability to threaten humanity because it reaches the same level of intelligence as humans, allowing it to autonomously control many things, and if not managed well, it could directly wipe out humanity, right? Could it be that governments or other organizations might restrict its development?
I think these four points could potentially prevent AI from reaching AGI. However, I hope these four points do not occur, which is also the consensus of the entire industry.
Hunter Leslie: Since you are involved in R&D, I understand that you might focus more on engineering than on algorithms. You might also be hiring for your own company and building your team. I’m curious, since you’ve worked at Huawei, how do you define an R&D engineer or an R&D position at this point in time? What makes someone a good R&D engineer, or what skills are necessary? Has your definition of competency changed from your time at Huawei to now as an entrepreneur?
Li Bojie: I think a fundamental skill that is crucial in both large companies and startups is having a solid foundation in computer science. One must have a thorough understanding of basic computer system concepts and the capabilities and limitations of each model or system component, such as operating systems and databases. This is important everywhere. It’s not necessary to have published many papers; having sufficient project experience and engineering skills is enough.
Secondly, there is a difference between large companies and startups. In large companies, one can focus on a specific task without needing strong learning abilities and still remain there. However, limited learning ability might restrict growth.
In startups, strong learning ability is crucial due to the rapid changes. Startups often pivot and change products frequently. For example, if I hire someone for NLP and then switch to CV, the engineer must adapt quickly.
Hunter Leslie: So, having a strong ability to learn and adapt is essential.
Li Bojie: Additionally, startups require strong self-motivation. Employees should know what to do without much management or external pressure and complete tasks with high quality.
I’ve learned this the hard way. Huawei’s management system is comprehensive, but in a startup, everything must be self-managed. Initially, I thought hiring a programmer who could work was enough, but management issues arose because Huawei had a complete system, including performance evaluations, attendance, HR policies, and company culture. In a startup, building company culture from scratch is challenging, and everyone must be committed and self-motivated. If someone is only there for a paycheck, even if they’re skilled, they might not be suitable for a startup due to high management costs.
Hunter Leslie: So, in startups, selecting the right people becomes crucial.
Li Bojie: Yes, many large companies also believe that people can’t be trained, only selected. ByteDance, for example, has a similar philosophy.
Hunter Leslie: Yes, ByteDance said that.
Li Bojie: Many large companies are similar. Campus recruitment might focus more on training since graduates lack engineering experience. For example, Huawei’s “Genius Youth” program starts with individual contributors to familiarize them with company processes and culture before leading small teams. I spent about six months before leading a small team, gradually increasing responsibilities.
It’s a gradual process, allowing for personal growth. Starting as an individual contributor, then a project leader, and eventually leading larger teams presents new challenges, such as indirect management. Although I haven’t reached higher levels, it’s a challenging process that requires training. In growing companies or startups, everyone might experience this process.
Hunter Leslie: Yes, you mentioned three skills: foundational professional skills, learning ability, and self-motivation. These seem simple but are high standards, right? Very high, and few people meet these criteria. It’s a challenge for entrepreneurs to find such people. Where can they be found, especially outside the Bay Area, where everyone wants to change the world? Are there reliable methods or channels in China?
Li Bojie: I can’t say I’ve found them because I haven’t found many who meet all three criteria. However, from listening to many industry leaders, it’s clear that such people are rare, which is why the failure rate of startups is 99%. Successful startup teams must have all the right conditions, including resources, direction, and timing.
To increase success rates, one should have a well-formed idea with real user demand and growth potential. At that point, it’s easier to attract like-minded people. If you only have a couple of slides, it’s hard to convince others that it can become a billion-dollar company.
Hunter Leslie: So, it’s important to show potential, and early partners should be trusted and familiar with your capabilities.
Hunter Leslie: There are millions of engineers globally, and with tools like Cursor and OpenAI's models, many tasks are being automated. For engineers, what changes and what stays the same? People are worried about job security, especially with the layoffs at large companies.
Li Bojie: The changes and constants in this context are interesting. With AI, should we worry about unemployment?
I believe there’s no need to worry about unemployment. AI increases efficiency, leading to more demand. Over the decades, from assembly language to C, C++, and now Python and Java, each technological advancement has created new demands. The IT industry has grown, and programmers are more in demand. Initially, only the military could afford programming, but now small businesses can hire programmers to realize their ideas.
However, not all ideas can be realized. There's a joke that every idea is "just one programmer away." Programmers are scarce because ideas are plentiful but hiring programmers is costly. Developing an app used to cost a million dollars; now, with AI-assisted tools like Cursor, it might only cost $100,000.
Some independent developers with high skills might not need additional programmers. This could happen within two years, where strong product managers can articulate their needs to AI, eliminating the need for programmers. People will focus on conceptualizing rather than implementation details.
Sam Altman predicts billion-dollar companies with just one person, which is possible. A person with strong business and technical skills, leveraging AI, can achieve what used to require a large team. The hottest AI companies in Silicon Valley reached billion-dollar valuations with small teams, using AI extensively. For example, in meetings, they use AI notetakers for automatic minutes. AI meeting notes aren’t cutting-edge, as even Tencent Meetings offers them, but most companies haven’t adopted them.
Therefore, as long as programmers are good at using AI and learning new technologies, they will never be unemployed. AI is definitely expanding the opportunities.
Another point: for those who used to work on foundational tasks, like me, someone whose specialty is infra and systems optimization, am I going to be unemployed? I don't think so. The optimization of these underlying systems remains a highly specialized field that AI finds very difficult to replace. Even today, assembly language programmers haven't been replaced, because every compiler or operating system has some core high-performance code that interacts directly with hardware and must be written in assembly. That can't be replaced, so this kind of work will always have its value.
Hunter Leslie: As an outsider here, assembly language isn't in the same category as Python or Java, right?
Li Bojie: It's a very low-level language. You have to tell it to move a value from register A to register B, then add the values of registers A and B together and put the result in register C. Imagine that a computer has just those eight general-purpose registers, and you work with memory addresses directly, loading four bytes from one address and storing them to another. It's very low-level stuff. So if you wanted to use this to develop an Android app, how much work do you think it would take? How many memory operations would you need just to draw an interface? It would be extremely hard to write.
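To make the contrast concrete, here is a minimal illustrative sketch: the register-level steps described above written as simplified, hypothetical x86-style pseudo-assembly in comments (not any exact instruction set), next to the single line a high-level language needs for the same addition.

```python
# Roughly the steps described above, as simplified x86-style pseudo-assembly
# (illustrative only; real instruction names and operands vary by ISA):
#
#   mov eax, [addr_a]   ; load 4 bytes of "a" from memory into a register
#   mov ebx, [addr_b]   ; load 4 bytes of "b" into another register
#   add eax, ebx        ; add the two registers
#   mov [addr_c], eax   ; store the result back to memory as "c"
#
# The same operation in a high-level language is one line; the compiler or
# interpreter handles registers and memory addresses for you.
a, b = 3, 4
c = a + b
print(c)  # 7
```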
Hunter Leslie: So, current AI programming tools like Copilot can’t handle assembly?
Li Bojie: It can write it, but it might not optimize as well as professional performance engineers. It can write some basic assembly. For example, if I want to write an operating system, I can use AI to help, but it definitely can't optimize as well as Linux. AI can also help you develop a website using design templates, but it certainly can't build something as smooth as TikTok the way a professional team can.
Hunter Leslie: So essentially, as an engineer or architect, the unchanged part is still thinking about the product and the user. But the changing part might be that what used to take many person-days to accomplish can now be done much faster. Moreover, you must use this tool because if you don’t, it might be difficult, and you might be eliminated.
Li Bojie: Yes, I think so. A lot of daily work is actually spent on these details. For example, everyday tasks like filling out forms, claiming expenses, collecting invoices one by one, right? Now AI can basically do this. For programmers, it might mean writing so-called glue code. The front end provides an interface document, and the back end has to implement each interface in that document, which is nothing more than CRUD (create, read, update, delete) operations, right? User CRUD, content CRUD. These things consume a lot of daily development energy, but they can all be handed to AI.
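As a concrete picture of the kind of glue code meant here, below is a minimal sketch of boilerplate CRUD endpoints of the sort an AI assistant can generate almost verbatim from an interface document. The endpoint and field names are hypothetical, and FastAPI is used only as a familiar example framework.

```python
# A minimal sketch of typical "glue code" CRUD endpoints (hypothetical names;
# FastAPI chosen only as a common example framework).
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI()

class User(BaseModel):
    id: int
    name: str

_users: dict[int, User] = {}  # in-memory store standing in for a database

@app.post("/users")
def create_user(user: User):
    _users[user.id] = user
    return user

@app.get("/users/{user_id}")
def read_user(user_id: int):
    if user_id not in _users:
        raise HTTPException(status_code=404, detail="not found")
    return _users[user_id]

@app.put("/users/{user_id}")
def update_user(user_id: int, user: User):
    if user_id not in _users:
        raise HTTPException(status_code=404, detail="not found")
    _users[user_id] = user
    return user

@app.delete("/users/{user_id}")
def delete_user(user_id: int):
    _users.pop(user_id, None)
    return {"deleted": user_id}
```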
If someone doesn’t use AI at all, their development efficiency might be at least half as slow as others. At least for myself, using AI has definitely doubled my development capability.
Hunter Leslie: So, is there a saying that if you truly embrace AI, as a developer, how should you allocate your energy? For example, how much time should be spent coding, and how much time on researching products and users? What’s your view on this? What might the ratio look like?
Li Bojie: I think for a programmer, if they don’t want to transition to a product manager, they might not need to focus too much on products and users too early. They just need to focus on how to use AI to quickly complete the tasks assigned by their boss, which is very important. For example, if the boss assigns a task to implement a new page, it might have taken a week before, but now with AI, it might be done in a day. This greatly improves efficiency. After improving efficiency, the remaining time can be used to rest or, as you said, think about product and user-related issues to enhance understanding in other areas.
Hunter Leslie: So, the main focus is still on completing engineering tasks, right?
Li Bojie: My feeling is that with the current capabilities of AI, it essentially frees up your time. Once time is freed up, I don’t need to do those “manual labor” tasks. I can think and do more valuable and meaningful things.
For programmers, what are more valuable and meaningful things? This is actually a matter of perspective. I prefer the concept of Google’s 20% project. I think, perhaps in China, because of the busy 996 work schedule, most programmers don’t have time for such 20% projects, so the overall innovation capability is relatively poor. But in Silicon Valley, one good thing is that many programmers have enough time to work on a part-time project, and many companies there are relatively tolerant of these things. Like Google’s 20% project, it has become deeply embedded in the company culture.
I believe a lot of innovation actually comes from the bottom up, arising from difficult problems encountered in daily life and work, and then creating a project to solve them. If the solution is clever, you like using it, and the problem itself has enough promotional value, meaning many people have the same need, then it’s a good product with PMF that can be launched. But I think it’s hard to completely plan this from the top down. So, that’s why the 20% project is meaningful. I think if programmers use AI in the future, reducing work time and shortening the time to complete existing engineering tasks, they can spend more time on projects they are interested in. And now with AI assistance, one person might be able to use AI to create an MVP without necessarily hiring a front-end, back-end, and designer.
But can we change the 20% time to 20% of the people, letting 20% of employees focus on innovation instead of product development? Many big companies’ AI Labs do this, but few succeed. Why? I think a fundamental point is that 20% project innovation is about solving problems encountered in daily life and work, not just coming up with ideas out of thin air. If 20% of people sit in an office thinking about innovation, the ideas might only be good for publishing papers, with no real demand.
Hunter Leslie: So, spending time on experiments is meaningful. So, if I’m now doing R&D in a big company and AI comes along, I need to use it, which might involve a learning, adaptation, and transition process. Do you have any suggestions? Because my feeling is, for example, I’m already 40 this year, and I might not have the motivation or desire to learn new things. But some colleagues might want to transition but don’t have strong motivation to learn new things, and some might want to transition but don’t have a good approach. Since you’re also starting a business, do you have any suggestions for everyone? How to embrace AI faster and let AI empower oneself? What methods or ideas can you share?
Li Bojie: My suggestion is to first look at how others are using AI effectively. For example, online, like your Podcast, there are probably many, right, teaching people how to use AI, seeing how others complete a decent-looking game in half an hour, right? Then you’ll know how AI should be used.
Including myself: until last year I mostly used ChatGPT on its own. I'd hit a problem, say wanting it to write a piece of code, type the prompt into ChatGPT, copy the code out, and paste it into PyCharm or another IDE. That was inefficient.
When did this change happen? It was when Cursor became popular, around April or May this year, and I started using Cursor intensively because a strong model, Claude 3.5 Sonnet, came out with excellent coding capabilities. What’s the difference between using it in an IDE and outside in ChatGPT? ChatGPT doesn’t know the environment of your surrounding code, so when you ask questions, they’re always limited, and it can’t directly help you modify existing long code. But in IDEs like Cursor or GitHub Copilot, it’s completely different because they have the context of the entire project. They can read the code and know where to change it, sometimes even without needing to locate the specific line. There’s a powerful feature where you just fill in a dialog box, input what you want to do, and it handles everything.
Of course, it can’t handle complex requirements now, as the model’s capabilities are still limited, and you might need to make further changes. But anyway, it often finds what needs to be changed without much effort, so it’s about how machines and humans collaborate, exploring the boundaries of the model, which are constantly changing. Previously, with weaker models, I needed to do more, like finding which line of code to change, marking it, and then changing it. Now, it might understand the code itself, so you don’t need to mark the line, just tell it the requirement, which is a new advancement.
If it’s a project with hundreds of thousands of lines, it still can’t cover everything at once. So, I still need to tell it which module to change because I truly understand the project and know which module needs to be changed for my requirement. I need to provide the relevant files for that module, and then it can make the changes. Maybe one day, when the model’s capabilities are stronger, I won’t even need to tell it which module to change.
Another issue is debugging. The model often writes code with bugs, and I still need to debug and fix them. Maybe in the future, it can debug itself, which would be another step forward.
So, each of us needs to learn and adapt to the development trend of model capabilities. But I think there are two points here:
First, if you’re using it as a productivity tool, make sure to use the best model, not a poor one. Sometimes, using a poor model is like buying a very bad phone as your first phone, like a knockoff, and then you might have a bad impression of phones, thinking they’re hard to use, right? But if your first phone is an Apple, you might think phones are great, right? So, if you start with a poor model, you might have a bad impression of AI overall and lose motivation later.
Then the second thing is to observe how others are using it. So, I think what you’re doing with this podcast, or the work many are doing to bridge the information gap, is very meaningful.
Hunter Leslie: So, first, you might need some channels to access this information gap, whether it’s listening to podcasts or watching videos on YouTube. Another thing is to try it yourself. You mentioned using the best models earlier. Is the best model determined by evaluation scores? Where can I find them? For 2C products or some products, there are rankings on Product Hunt. Are there other places? If I haven’t used them, where can I find the best models?
Li Bojie: Chatbot Arena is an academic project from Berkeley. It's essentially a blind-testing platform: a user asks a question, gets answers from two randomly chosen models without knowing which is which, and votes in an A/B comparison on which answer is better. All these blind-test votes are aggregated into a ranking list. Currently OpenAI's models rank quite high, as does Yi Lightning in China, which is both cheap and good. Google's models and Anthropic's Claude are also up there.
There are also category rankings. The overall ranking is what I mentioned earlier, but in the programming category, Anthropic's Claude is currently number one. So if I'm doing programming, I would definitely look at the programming category ranking. These authoritative international rankings are quite reliable, and from them you can judge the models. As for products, I think Product Hunt, as you mentioned, is quite good. If you look at its programming category, Cursor, GitHub Copilot, Devin, and similar programming tools are likely at the top.
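For readers curious how pairwise blind-test votes can become a leaderboard, here is a tiny illustrative sketch using an Elo-style rating update. This is not Chatbot Arena's exact methodology (its published approach uses more sophisticated statistical fitting), and the model names are hypothetical.

```python
# Illustrative only: turning pairwise blind-test votes into a ranking with a
# simple Elo-style update. Chatbot Arena's real methodology is more involved.
K = 32  # update step size

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B given current ratings."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool) -> tuple[float, float]:
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + K * (s_a - e_a), r_b + K * ((1 - s_a) - (1 - e_a))

ratings = {"model_x": 1000.0, "model_y": 1000.0}  # hypothetical model names
# One blind-test vote: the user preferred model_x's answer.
ratings["model_x"], ratings["model_y"] = update(
    ratings["model_x"], ratings["model_y"], a_won=True
)
print(sorted(ratings.items(), key=lambda kv: -kv[1]))
```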
Hunter Leslie: Another question is, maybe you don’t face this issue because you’ve had a relatively smooth career in big companies and then started your own business doing what you love. But in fact, many people in the industry in China might worry about being laid off one day. There are two scenarios: one where you’re still working, and another where you’re not doing it anymore. So, I wonder if you’ve thought about it, or if you know someone who has successfully transitioned after ten years in R&D and doesn’t want to continue. If I want to transition to something else, do you have any good ideas? For example, what might be more successful or valuable?
Li Bojie: I know quite a few people who have transitioned, especially those older than me. They might feel less interested in technology after a few years in R&D or find it too exhausting with the 996 work culture. They want a better work-life balance. In such cases, I think there are a few directions.
First, for smart people, quantitative trading might be a good idea. It’s about following the market, and if your methods are good enough, you might not need a large team. You could work alone or join a small elite team, invest some money, and as long as you make money, that’s enough. The only goal is to make money in the securities market. It’s a relatively closed field, so you don’t need to deal with operations or managing large teams. Many smart people have made a lot of money in this area.
However, it’s a winner-takes-all situation because you’re competing with the world’s top minds. If you feel your intelligence isn’t enough to compete with these top people, you’ll likely end up as a loser.
Hunter Leslie: So, if you’re not that outstanding, it’s better to choose another direction.
Li Bojie: I feel that product and technical planning are two good directions. Transitioning from technology to product and technical planning is advantageous. For example, if a product manager has no technical background, they might propose unrealistic requirements because they don’t understand the technical boundaries and capabilities. Many product managers encounter this. Sometimes, they think a requirement is simple, but it might take a year to complete. Conversely, some things they think are difficult might be easy for technical people to handle in a day. So, if someone with a technical background becomes a product manager, they have a better grasp of technical difficulty and complexity, which is an advantage.
The second direction is technical planning. Generally, large companies have planners or think tank analysts responsible for technical planning. If someone with a technical background does technical planning, they can have more insights because planning requires predicting the future. For example, if you ask me to predict what AI will be like in five years, whether AGI will appear, I can provide a lot of analysis. But if someone doesn’t even understand the current AI models, asking them to plan for the future would be problematic.
Besides these two directions, there are many other options. For example, opening a guesthouse or working in education are good choices. Especially in education, many people want to share their knowledge without living a 996 lifestyle, so they choose this field. Whether it’s basic education, university education, or online podcasts or courses, these are good choices. I’ve seen many classmates or friends succeed in this area.
Hunter Leslie: You’ve probably experienced a lot from last year to this year, gaining new insights into direction and team. You seem to be someone who likes to think, so I want to ask what you’ve been pondering recently. You might not have an answer yet, but maybe you’re looking for someone wise to discuss it with. What have you been thinking about lately?
Li Bojie: I definitely have many questions I’d like to ask. If I could only ask one, it would be: Do you think AGI can be achieved, and when? This question is crucial as it determines the upper limit of AI.
Is this wave of AI going to advance rapidly straight to AGI, or will there be twists and turns like the previous wave? For example, around 2016, CV models kept getting stronger as they got bigger, but they couldn't handle anything other than CV, not even NLP. It was only with Transformers that CV and NLP were unified. Will Transformers likewise hit limitations and be unable to handle certain things?
But I think this wave has an advantage: it at least has multimodal capabilities now, OpenAI's GPT-4o and Anthropic's Claude 3.5 Sonnet have good coding abilities, and o1's reasoning ability is also showing promise. These things give the impression that the remaining problems might not be that hard, and that with enough computing power and insight they can be solved.
Some even suggest that not much computing power is needed. For example, recently, Kai-Fu Lee mentioned that he trained a high-ranking Yi Lightning model with just $3 million, significantly reducing costs. And with o1, OpenAI, as a pioneer, invested a lot of computing power to train this reinforcement learning thing, but my gut feeling is that if the method is right, not much computing power is needed. Look at how quickly AlphaZero evolved, reaching top human levels by noon and surpassing humans by evening. If the feedback mechanism is right, a medium-sized company or even a school might be able to develop reasoning abilities comparable to o1 mini. So, this is very exciting.
Hunter Leslie: (Whether AGI can be achieved) is indeed a difficult question to answer, even if you discuss it with Elon Musk and Sam Altman, everyone has their own views. But it’s a very important question, especially if you’re starting a business. The capability boundaries of models might affect your judgment and understanding when planning products, so this is crucial.
Li Bojie: Therefore, I believe it’s essential to be friends with foundational model companies, as I mentioned earlier. If foundational model companies see you as an enemy, eventually, no foundational model company will be willing to share information with you, or even let you use their API, because they fear your company might replace them.
But if you become friends, many companies might let you use some internal, unreleased things first, or share their insights and ongoing projects. For example, the team behind the AI programming agent Devin got beta access to o1 before it was released and made it into o1's launch showcase. Chatbot Arena got the anonymous version of GPT-4o before its release and let users test it in the arena. Agora and LiveKit had already adapted to the Realtime API for real-time voice calls before it was released.
Once you become friends with foundational model companies, you might have an edge over others in understanding what the future will be like. As we discussed earlier, foundational models are rapidly developing, and their capabilities determine what applications can and cannot do. So, it’s best to be friends with top foundational model companies like OpenAI, Anthropic, or Google.
Hunter Leslie: What’s been causing you the most anxiety recently?
Li Bojie: Recently, my biggest anxiety is about how to approach this matter. I remember when I interned at MSRA, former dean Harry Shum said he didn’t care whether he was in academia or industry; he often switched between the two. He only cared if each project was impactful because a person’s career is made up of a series of projects, and as long as each project is impactful, that’s enough. This had a significant impact on me, and he also said that not everything is suitable for the industry, nor is everything suitable for academia; different things require different approaches.
I’ve talked to many people and have a superficial understanding that there are many ways to approach something.
One way is the typical startup: raising a lot of money from investors up front and aiming to grow fast and go public. Such companies are often held up as the most successful examples, but this path may not suit every company, especially in certain fields. As a company grows, conflicts between commercial reality and technical ideals become unavoidable, as we've seen at OpenAI over the past year, where many people left amid compromises between the two. Once a company grows, it may no longer be as cool or as technology-driven.
Another possible way is the small and beautiful startup company. As a small and beautiful company, they start as a small team, perhaps without VC investment, but they have a very clear PMF, solving a real problem, so they can sustain themselves financially. If one day the market is large enough, they might scale up, but if not, they maintain their status.
For example, many people like so-called "9-6-5" (no overtime) companies. In China there are basically only two kinds: mature foreign companies and small-and-beautiful startups. These companies can keep a technology-driven, cool vibe and a good internal technical atmosphere for longer, because they don't face much competitive pressure and can sustain themselves, so they don't need to move too fast. This is the second type of approach, and I think it's also quite good.
The third type is the community approach, for example open source communities. In the early days of Linux, if Linus had gone to an angel investor saying he was building an operating system similar to UNIX and asked for funding, he would have been lucky not to be shown the door. Instead, Linux expanded gradually through the community, and it has intrinsic value as a truly open-source operating system at a time when all the alternatives were commercial.
Open source has its value, but there is a problem with open-source projects: once they grow large, or when individuals or teams face financial pressure, it involves a commercial issue: how to commercialize open-source projects? This is another challenge because the interests of open-source communities and companies are hard to balance. When I create something new, should it be turned into a closed commercial version, or should it be contributed to the community? This is quite troublesome.
Actually, I've been observing this recently, and vLLM is a very clear example. vLLM hasn't been commercialized yet, but many companies have forked it and done a lot of optimization on their own. One thing that struck me was the release of vLLM version 0.6 in September, which improved performance by 2.7 times. That 2.7x boost wasn't from a bunch of fancy optimizations like hand-tuning operators. It turned out the HTTP server was wasting a lot of performance, and Python's GIL (global interpreter lock) meant a lot of time was lost in scheduling. By fixing that, they raised GPU utilization from 38% to a much higher level, and performance nearly tripled. When I talked to people at several big companies, domestic and international, they all said they had done such optimizations internally long ago but hadn't contributed them to the community, and that their internal versions had much better inference performance than the open-source vLLM. Every big company keeps a lot of things hidden, so there is always this balance issue between commercial interests and open source.
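As a rough sketch of the class of problem described here (this is not vLLM's actual code or architecture), the idea is that when the HTTP front end and the GPU engine loop share one Python process, request handling competes with the scheduler under the same GIL; moving the front end into a separate process that talks to the engine over queues is the general shape of the fix.

```python
# Illustrative sketch only (not vLLM's actual architecture or API):
# decoupling the HTTP front end from the model engine so request parsing and
# serialization no longer compete with the scheduling loop for the same GIL.
import multiprocessing as mp

def engine_process(request_q, result_q):
    """Stands in for the GPU engine loop: take requests, run 'inference', reply."""
    while True:
        req = request_q.get()
        if req is None:  # shutdown signal
            break
        prompt_id, prompt = req
        result_q.put((prompt_id, f"completion for: {prompt}"))  # fake inference

if __name__ == "__main__":
    request_q = mp.Queue()
    result_q = mp.Queue()
    engine = mp.Process(target=engine_process, args=(request_q, result_q))
    engine.start()

    # The HTTP server (e.g. an ASGI app) would live in this front-end process
    # and only push to / pull from the queues, so JSON parsing and response
    # streaming never block the engine's scheduler.
    request_q.put((1, "hello"))
    print(result_q.get())

    request_q.put(None)
    engine.join()
```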
As mentioned earlier, besides open-source communities, there are also non-profit projects. They may not be open source but are similar community projects. Wikipedia is a good example. If Wikipedia had initially said it wanted to create an encyclopedia and sought funding, it might have been difficult to secure investment. But it has its intrinsic value.
The third thing is Web3, which is also a good example of community projects. For instance, Bitcoin is the progenitor of Web3. If someone wanted to create a decentralized anonymous currency and sought funding, it would be difficult, right? But it has its intrinsic value. Many projects start as community-driven and then move to Web3 for funding, which can also lead to success. However, Web3 currently has a problem: there are many financial speculation projects in this field. In such a mixed environment, if your project is truly technology-driven, can it stand out and let people know its long-term value, rather than being wiped out by a Bitcoin cycle? This is also a challenging issue.
For community projects, regardless of which of the three routes they take, it’s still quite difficult. But I think this is something that technical idealists might enjoy doing.
Community projects may not have clear commercial value at the beginning, but at least they have community value, a public good, solving a public interest issue for humanity, right? If it’s something I just want to play with myself and don’t know if it’s useful, it might be an academic project. Many big names are doing academic projects and are quite successful.
So, I think these are four different types: typical startups, small-and-beautiful companies, community projects, and academic projects, each suited to different stages and kinds of projects. I'm actually still thinking about which of these approaches I should take. That's part of what I've been pondering.
Hunter Leslie: Alright, thank you for listening to this episode of the podcast. You’re welcome to follow, like, and share. If you have any thoughts, feel free to interact and leave comments. The next episode will be even more exciting!