(This article was first published on Zhihu)

It’s still too early to talk about a GPT era for vision, but LVM is indeed a very interesting piece of work. It attracted a lot of attention even before the source code was released; many people I talked to over the past few days brought it up. The fundamental reason is that LVM closely matches the end-to-end visual large model architecture that many of us have been imagining. I suspect GPT-4V may have a similar architecture.

Current multimodal large models basically work by taking a frozen text LLM (such as LLaMA), connecting it to a frozen encoder and a frozen decoder, and training a thin projection layer (a glue layer) in between to stitch the encoder/decoder and the central Transformer together. MiniGPT-4, LLaVA, and the recent MiniGPT-v2 (which now has Meta authors on it as well and is worth a look) all follow this idea.
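As a rough illustration of how thin that glue layer usually is, here is a minimal sketch; the dimensions and the two-layer MLP are assumptions for illustration, not any particular paper’s exact design:

```python
import torch
import torch.nn as nn

class VisionToLLMProjector(nn.Module):
    """Sketch of the 'glue layer': a small MLP that maps frozen
    vision-encoder features into the frozen LLM's embedding space.
    Dimensions are illustrative, not taken from a specific model."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim) from a frozen encoder
        return self.proj(vision_feats)  # (batch, num_patches, llm_dim)

# The encoder and the LLM stay frozen; only the projector receives gradients.
projector = VisionToLLMProjector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
```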

Demos of these existing multimodal large models look good, but they have some fundamental problems. For example, their speech recognition accuracy is low and their speech synthesis is not very clear, falling short of specialized models like Whisper and VITS, and their image generation is not as fine-grained as Stable Diffusion. That is before even considering tasks that require precise correspondence between input and output images or speech, such as placing a logo from the input image onto an output image generated from a prompt, or doing voice style transfer the way XTTS-v2 does. This is an interesting phenomenon: although in theory the projection layer could model richer information, in practice the results are less accurate than simply using text as the intermediate representation.

The fundamental reason is that no image information was present when the text LLM was trained, so the encoding spaces do not match. It is like a person blind from birth: however much text they read, some information about color will always be missing.

So I have always believed that multimodal large models should bring in text, image, and speech data at the pre-training stage, rather than pre-training a separate model for each modality and then stitching the models together.

The interesting part of the LVM work is that it shows that image data alone, without any text annotations, is enough to achieve good results on image completion tasks, and even to grasp some of the logic in images (for example, IQ-test-style questions).

This is similar to animals: most of them have no language, yet they have developed visual and auditory systems, and they do not need to translate images into language to achieve complex image understanding.

Of course, this is not to say that future large models will use only image data. A visual large model that truly deserves to be called the GPT era of vision must be pre-trained on multimodal data covering text, images, and speech. Its architecture is likely to resemble the one in the LVM paper: a VQ-VAE or VQ-GAN as the encoder/decoder, with a Transformer as the backbone of the autoregressive model.
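A minimal sketch of that pipeline is below. The codebook size, sequence length, and model dimensions are placeholders rather than the LVM paper’s actual settings: a VQ encoder (not shown) turns each image into a grid of discrete codes, and a decoder-only Transformer is trained to predict the next code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

CODEBOOK_SIZE = 8192   # number of discrete visual tokens (placeholder)
SEQ_LEN = 256          # e.g. a 16x16 grid of codes per image (placeholder)

class TinyVisualAR(nn.Module):
    """Sketch of an LVM-style model: an autoregressive Transformer over
    discrete image tokens produced by a VQ-VAE/VQ-GAN encoder (not shown)."""

    def __init__(self, d_model: int = 512, n_layers: int = 8, n_heads: int = 8):
        super().__init__()
        self.tok_emb = nn.Embedding(CODEBOOK_SIZE, d_model)
        self.pos_emb = nn.Embedding(SEQ_LEN, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=4 * d_model, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, CODEBOOK_SIZE)

    def forward(self, codes: torch.Tensor) -> torch.Tensor:
        # codes: (batch, seq) integer indices from the VQ encoder
        b, t = codes.shape
        pos = torch.arange(t, device=codes.device)
        x = self.tok_emb(codes) + self.pos_emb(pos)
        causal_mask = nn.Transformer.generate_square_subsequent_mask(t).to(codes.device)
        x = self.blocks(x, mask=causal_mask)
        return self.head(x)  # next-token logits over the visual codebook

# Training objective: plain next-token prediction over visual codes.
codes = torch.randint(0, CODEBOOK_SIZE, (2, SEQ_LEN))
logits = TinyVisualAR()(codes)
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, CODEBOOK_SIZE), codes[:, 1:].reshape(-1)
)
```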

Some people may ask where to find multimodal data.

In fact, every website, app, book, newspaper, and magazine in the world is already dual-modal, mixing images and text. Could we feed screenshots of websites and apps, and scans of books, directly to the large model as multimodal data? A medium-resolution image comes out of the encoder as only a few hundred tokens, roughly the same as the number of words the picture can hold, so this would even save the OCR step. OCR actually loses a lot of information: the illustrations and formulas in textbooks and professional STEM books are largely gone after OCR. Moreover, current large models are missing the structural information inside Chinese characters, which is why they struggle to understand "Martian script" (stylized character substitutions) and ASCII art.
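A rough back-of-the-envelope check of that token count, assuming a typical VQ-GAN-style spatial downsampling factor of 16 (an assumption for illustration, not a number from the LVM paper):

```python
def visual_token_count(width: int, height: int, downsample: int = 16) -> int:
    """Number of discrete codes a VQ-GAN-style encoder with the given
    spatial downsampling factor would produce for one image (sketch)."""
    return (width // downsample) * (height // downsample)

# A 256x256 screenshot -> 16 * 16 = 256 tokens; 512x512 -> 1024 tokens.
# Both are in the same ballpark as the few hundred words a screenshot
# or book page typically contains.
print(visual_token_count(256, 256))  # 256
print(visual_token_count(512, 512))  # 1024
```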

I suspect that if you trained the LVM architecture on a dataset of website screenshots and book scans, without any annotations, it would certainly learn a genuinely multimodal ability combining text and vision, which could be big news. The LVM dataset surely already contains some website screenshots and book photos, but probably not in a large enough proportion for the model to learn to read text from images.

In addition, YouTube videos and Netflix shows are all multimodal data. Whisper was trained on a large amount of subtitled film and TV data, which is why it often transcribes silence as something like "subtitles by XXX fansub group".

Some people are criticizing the "GPT era" label, but the paper itself only echoes Microsoft's "Sparks of AGI" paper and never clings to the GPT name; that label was probably added by self-media editors. I still believe that a visual model worthy of being called the GPT era must be a multimodal model pre-trained on text, image, and speech data.

I have always believed that data matters enormously. LVM shows that feeding image data into a Transformer teaches it vision, Whisper shows that feeding speech data into a Transformer teaches it speech, and Transformers can even do sequence analysis in AI for science. Some Transformer-only multimodal papers have appeared recently; a Transformer-only architecture may turn out to be a general solution to multimodal problems.

Recently, Berkeley's Starling-7B has approached the general capability of 70B models in roleplay, writing, and reasoning, though it has not gained much on factual benchmarks like MMLU (which mainly depend on model capacity). The reason is a new fine-tuning method that feeds the model a large amount of high-quality data generated by GPT-4.

Many domestic companies working on foundational large models are now using similar methods to distill data from GPT-4, with some spending tens of millions on GPT API bills in a single month. It turns out that a significant portion of OpenAI's revenue comes from helping other companies prepare their training data.
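A minimal sketch of what this kind of data distillation looks like in practice, using the OpenAI Python client; the seed questions, model name, and JSONL output format here are illustrative assumptions, not any company's actual pipeline:

```python
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

seed_questions = [
    "Explain the difference between a frozen encoder and a trainable one.",
    "Why does OCR lose information from textbooks full of formulas?",
]

# Ask a stronger model to produce high-quality answers, then save them as
# instruction-tuning pairs for fine-tuning a smaller model.
with open("distilled_data.jsonl", "w", encoding="utf-8") as f:
    for question in seed_questions:
        resp = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": question}],
            temperature=0.7,
        )
        answer = resp.choices[0].message.content
        f.write(json.dumps({"instruction": question, "output": answer}) + "\n")
```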

Twitter discovered this way of making money from data early on, and began restricting crawlers in July 2023. A developer account with access to the full Twitter history costs 5,000 USD per month and is limited to 1 million tweets, which works out to 0.005 USD per tweet; reading Elon Musk's roughly 30,000 tweets would cost 150 USD.

The reason Character AI knows so much about celebrities and anime characters, imitates their tone, and reads more like real human conversation rather than ChatGPT's long-winded replies, is that its training data contains a large amount of conversational data. A 3B model like that, properly optimized, can even run on a phone, which opens up a whole new world.

Most companies just starting out on foundational large models are thinking about how to make the model bigger, while the leaders, such as OpenAI and Character AI, are already thinking about how to make the model smaller. GPT-3.5-Turbo reportedly uses an MoE architecture and activates only about 20B parameters per inference. GPT-4-Turbo likely also shrinks GPT-4 through model distillation to cut inference cost, and GPT-5 is said to include thousands of experts. Some domestic companies have reached a parameter count close to GPT-4, but their models are fully dense, so inference is unbearably expensive. Inference cost matters a great deal; otherwise, the more you sell, the more you lose.
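To make the "only a fraction of the parameters is active per token" point concrete, here is a minimal top-k MoE feed-forward layer; the expert count, hidden sizes, and k are placeholders, and this is of course not OpenAI's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Sketch of a mixture-of-experts feed-forward layer: each token is
    routed to only k experts, so most parameters stay inactive on any
    given forward pass."""

    def __init__(self, d_model: int = 512, n_experts: int = 8, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        scores = self.router(x)                           # (tokens, experts)
        topk_val, topk_idx = scores.topk(self.k, dim=-1)  # choose k experts per token
        weights = F.softmax(topk_val, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.k):
            chosen = topk_idx[:, slot]
            for e, expert in enumerate(self.experts):
                mask = chosen == e
                if mask.any():
                    out[mask] += weights[mask, slot, None] * expert(x[mask])
        return out

# With 8 experts and k=2, roughly a quarter of the expert parameters are
# touched per token, which is the basic trick behind cheaper inference.
y = TopKMoE()(torch.randn(16, 512))
```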

When we were using VITS fine-tuning to generate celebrity voices, we found Elon Musk hard to handle: he often stutters when speaking, so the alignment between speech and transcript is poor and the training data is low quality. Using VITS few-shot or XTTS-v2 instead gives better results. Voices like Donald Trump and Paimon from Genshin Impact, however, come out very well with VITS fine-tuning because the underlying data itself is high quality.

2023-12-06