Recently, if you have been paying even a little attention to technology news, you will have noticed that the field of humanoid robots is incredibly "hot"! Major tech companies at home and abroad are entering the fray, from domestic players like Tencent, Meituan, Alibaba, Xiaomi, and Huawei to foreign giants like Tesla, OpenAI, Microsoft, and NVIDIA, without exception. For instance, Tencent has invested in UBTECH, Meituan is betting on Unitree Robotics, OpenAI and Microsoft have invested in Figure AI, Google once acquired Boston Dynamics, Tesla has launched its own humanoid robot, and NVIDIA is involved as well; these are just a few examples. Humanoid robots are experiencing explosive growth.
At this point, some may ask: do robots require maintenance? From a product perspective, the answer is definitely yes. Just as our cars need regular maintenance, robots also need to be "cared for" from time to time: lubricated when their joints run dry, and recharged when the battery is depleted, since most humanoid robots are powered primarily by electricity.
The moves of the AI giants entering this field, especially OpenAI's, are particularly noteworthy. As we all know, OpenAI is a software company focused on developing large AI models; it makes its living writing AI software. In the past, software companies and hardware production seemed to "stay in their own lanes", because the software industry enjoys high profit margins. For example, once Microsoft's Windows operating system and Office software are developed, they can be replicated and sold almost without limit, yielding substantial profits. In contrast, the hardware industry operates on transparent margins and fierce competition; apart from companies like Apple with strong brand premiums, most hardware manufacturers run on thin margins. So why is OpenAI venturing into hardware to build humanoid robots? Behind this lies a significant secret about the development of artificial intelligence.
In our last live stream, we discussed the future direction of artificial intelligence: AGI (Artificial General Intelligence), which refers to systems that may possess intelligence comparable to, or even surpassing, that of humans, as depicted in science fiction films. However, the path to AGI is not smooth, and we are currently facing bottlenecks. Previously, I shared a paper from the journal Nature discussing the limitations of large models. Today, I would like to introduce another Nature paper, published about half a year ago, which presents an interesting viewpoint: language is primarily a tool for communication, not for thinking.
We all know that today's large language models are trained on vast amounts of text data and can fluently answer all kinds of questions. However, through research on deaf children, this paper found that even children who have never been exposed to language, or people who have lost access to language, can still learn mathematics, perform relational reasoning, and build causal chains. This shows that language is not a necessary condition for complex thinking; being unable to speak does not prevent thinking and reasoning. Turing Award winner Yann LeCun holds a similar view: thinking is not equivalent to language, and training large models purely on language data in the hope of achieving human-like thinking and intelligence is fundamentally unrealistic.
Yann LeCun has also pointed out several problems with current large models. First, large models lack persistent memory. Take ChatGPT as an example: it has a context window measured in tokens; the original ChatGPT could handle roughly 4,000 tokens, while GPT-4's larger variant supports around 32,000. This means the amount of text you can feed into a conversation is limited, and once the limit is exceeded, the model can no longer "remember" earlier content. Humans, in contrast, have no such hard cutoff; we can remember important things even after a long time.
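To make the token limit concrete, here is a minimal sketch of a sliding context window. It assumes the open-source tiktoken tokenizer, and the 4,000-token budget is only illustrative; actual limits differ by model:

```python
# A rough illustration of a context window: once the running conversation
# exceeds the token budget, the oldest messages are simply dropped,
# which is why the model no longer "remembers" them.
# Assumes the open-source `tiktoken` tokenizer; the budget value is illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by recent OpenAI models
TOKEN_BUDGET = 4000                         # illustrative, similar to the original ChatGPT

def fit_into_window(messages: list[str], budget: int = TOKEN_BUDGET) -> list[str]:
    """Keep only the most recent messages whose combined token count fits the budget."""
    kept: list[str] = []
    total = 0
    for msg in reversed(messages):      # walk from newest to oldest
        n = len(enc.encode(msg))
        if total + n > budget:          # anything older than this point is forgotten
            break
        kept.append(msg)
        total += n
    return list(reversed(kept))
```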
Second, large models cannot plan complex sequences of actions. Imagine planning a trip for the Spring Festival holiday: a multitude of considerations instantly springs to mind, such as how many days off you have, whether you can take leave, what the weather will be like, and where you want to go. Yet in this respect a large model cannot even match a cat: when a cat sees a mouse, it estimates the distance, sizes up the environment, and then controls its muscles to pounce, a series of complex planning steps that large models simply cannot perform.
Therefore, training large models purely on language symbols cannot produce true reasoning ability, nor can it reach human-level intelligence. Today's large models are like children raised in a greenhouse: trained only on text data and lacking interaction with the real world, they find it hard to form common sense. A classic example is that many large models believed 9.11 is greater than 9.9, a clear failure of common sense.
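To see how easy the 9.11 versus 9.9 mistake is to make, compare the two as decimal numbers and as version-like strings. This is only an illustration of the arithmetic, not a claim about how any particular model reasons internally:

```python
# Comparing 9.11 and 9.9 as decimals vs. as version-like strings.
print(9.11 > 9.9)          # False: as decimal numbers, 9.11 < 9.9

# If the numbers are compared piece by piece, like software versions,
# the second component 11 beats 9, which flips the answer.
left = [int(p) for p in "9.11".split(".")]    # [9, 11]
right = [int(p) for p in "9.9".split(".")]    # [9, 9]
print(left > right)        # True: "version-style" comparison says 9.11 is bigger
```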
Human common sense arises from interacting with nature through our senses. What we see, hear, smell, touch, and taste are all sources of information. Yet once this information is converted into text, some 99% of it is discarded, and it is precisely this discarded "redundant information" that forms the basis of human common sense. For example, a nine-month-old baby cannot yet speak but already grasps physical common sense such as gravity and inertia; shown a cup suspended in mid-air, the baby will look surprised. Yet large models built with enormous human and material resources lack even this basic common sense.
Some models today are trained on video, which adds an information dimension, but they still lack auditory, tactile, and other sensory information, so the problem remains. For large models to acquire common sense, they must interact extensively with the physical world and pick up the information that text leaves out.
To address this issue, many groups are experimenting with solutions. The Beijing Institute for General Artificial Intelligence (BIGAI) has made an interesting attempt. It built a virtual physical world, akin to a very large 3D game, containing virtual parks, buildings, characters, vehicles, and so on. Living in this virtual world is an AI agent named "Tongtong", who is like a young child. Researchers have participants put on VR headsets and hold controllers to interact with "Tongtong" in the virtual scenes. For instance, if someone spills milk in the scene, "Tongtong" will actively look for a cloth to wipe it up, tidy the room, turn on the lights, and so on. With this approach, they do not rely on massive text training but instead let "Tongtong" grow up gradually in the virtual environment, developing intelligence along the way. According to their tests, by early 2024 "Tongtong" had reached the intelligence level of a three- to four-year-old child. This is a meaningful exploration; although a gap remains between the virtual world and reality, it offers a new line of thinking for the development of artificial intelligence.
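As a very rough sketch of this style of training, the toy loop below has an agent learn from rewards inside a simulated home instead of from text. Every class and method name here is invented for illustration and is not BIGAI's actual code:

```python
# Hypothetical sketch of an embodied agent learning inside a simulated home.
# It only illustrates the observe -> act -> feedback loop that replaces text-only training.
import random

class SimulatedHome:
    """A toy environment: the agent is rewarded for restoring the room to order."""
    def __init__(self) -> None:
        self.state = {"milk": "spilled", "lights": "off"}

    def act(self, action: str) -> float:
        """Apply an action and return a reward signal (no text involved)."""
        if action == "wipe_up_milk" and self.state["milk"] == "spilled":
            self.state["milk"] = "clean"
            return 1.0
        if action == "turn_on_lights" and self.state["lights"] == "off":
            self.state["lights"] = "on"
            return 1.0
        return -0.1   # useless action, slight penalty

class ToyAgent:
    """A simple bandit-style learner: tries actions and keeps what earns reward."""
    ACTIONS = ["wipe_up_milk", "turn_on_lights", "do_nothing"]

    def __init__(self) -> None:
        self.values = {a: 0.0 for a in self.ACTIONS}

    def choose(self) -> str:
        if random.random() < 0.2:                       # explore occasionally
            return random.choice(self.ACTIONS)
        return max(self.values, key=self.values.get)    # otherwise exploit the best known action

    def learn(self, action: str, reward: float) -> None:
        self.values[action] += 0.1 * (reward - self.values[action])

env, agent = SimulatedHome(), ToyAgent()
for _ in range(100):                 # the agent "grows up" through interaction, not reading
    action = agent.choose()
    reward = env.act(action)
    agent.learn(action, reward)
```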
The entry of AI giants into the robotics industry also stems from recognizing this problem. They realize that language training alone cannot get them to AGI; artificial intelligence must interact with the physical world, and humanoid robots are the key to achieving this. Today's humanoid robots are packed with sensors: cameras let them see, microphones let them hear, fingertips made of new materials can sense pressure and texture, and some even carry olfactory sensors.
Once these humanoid robots enter households, they can collect vast amounts of data, precisely the "common sense material" that large models lack. The data is uploaded to the cloud, where the AI "brain" performs computation and analysis. Whichever company breaks through here first, enabling its artificial intelligence to form common sense, will hold a decisive advantage and be able to outclass its competitors. This is why so many AI giants are rushing into the robotics industry; their goal is not merely to sell robots for profit.
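To picture what that "common sense material" could look like, here is a hedged sketch of a single multimodal sensor record; every field name and the JSON format are invented for illustration, not any vendor's real schema:

```python
# Illustrative only: a made-up record format for one moment of a robot's
# multimodal experience, of the kind that might be batched and sent to the cloud.
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class SensorFrame:
    timestamp: float
    camera_frame_id: str                 # reference to a stored image, not the pixels themselves
    audio_clip_id: str                   # reference to a recorded sound snippet
    fingertip_pressure_n: list[float]    # per-finger pressure readings, in newtons
    surface_roughness: float             # unitless estimate from the tactile sensor
    odor_reading: float | None           # present only if an olfactory sensor is fitted

frame = SensorFrame(
    timestamp=time.time(),
    camera_frame_id="cam-000123",
    audio_clip_id="mic-000123",
    fingertip_pressure_n=[0.8, 0.6, 0.5, 0.4, 0.3],
    surface_roughness=0.12,
    odor_reading=None,
)

payload = json.dumps(asdict(frame))      # serialized record ready for upload
print(payload)
```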
Moreover, the amount of high-quality data available on the internet for training large models is actually quite limited. Earlier estimates put it at only around 18 TB, roughly the capacity of a few portable hard drives, and this data is nearly exhausted. The development of GPT-5, for instance, runs into exactly this problem: continuing to train on existing data makes further improvement difficult, so new sources must be found.
Of course, the motivations of manufacturers building humanoid robots vary. Some are simply bullish on the market and believe there is money to be made, so they follow the trend without necessarily harboring the grand goal of achieving AGI. Still, if their products win a large market share, they can collect more data, which can be sold to companies specializing in artificial intelligence and represent a significant source of income. This is similar to the big-data price discrimination we discussed before, where e-commerce platforms use vast amounts of data to quote different prices to different users; it shows just how valuable data is.
It is foreseeable that consumer-grade robots will develop rapidly over the next few years. Some domestic robots are already priced below 100,000 yuan, while Tesla's robot is reportedly targeted at around 20,000 USD (over 140,000 yuan). That is not cheap, but prices may come down as the technology matures. For now, robots' capabilities are still limited: they can handle simple, repetitive tasks like cleaning, but complex tasks such as cooking, doing laundry, and caring for the elderly remain out of reach in the short term. The service robots in some restaurants, hotels, and shopping malls today can only perform fairly simple jobs, and their practicality still needs to improve.
Do you remember the movie I, Robot? Its scenes might represent one possible future for AGI. At first, household service robots enter thousands of homes, lacking human-level intelligence and performing only simple tasks, until one day they suddenly become intelligent and even begin to control humans. The film is artistically exaggerated, of course, but with large numbers of humanoid robots entering households and continuously collecting data, no one can say for certain whether something similar might happen in the future.
Technological development is full of unknowns and challenges. As with cloning technology and the atomic bomb, people worry about the risks yet cannot stop the pace of exploration, because if you don't do the research, someone else will. Toward the development of AI and robotics, we should hold expectations while staying vigilant, and hope they develop in a direction that benefits humanity. That concludes today's sharing; what are your thoughts on this topic? Feel free to leave a comment and discuss.