AI Agent: From Tool to User of Tools

The AI chat application Kimi has quickly gained popularity and introduced a tipping-based payment model. Compared with membership fees, this approach of humanizing AI offers a brand-new experience and prompts us to reflect on the changing relationship between AI and humans.
Pre-trained Large Models: From “Specialized” to “General”

Over the past 30 years, AI researchers have continuously improved model performance on the specific tasks they focused on: Deep Blue defeated world chess champion Garry Kasparov in 1997, Watson beat human champions on the quiz show “Jeopardy!”, ResNet surpassed human accuracy in the ImageNet image recognition competition, AlphaGo defeated Lee Sedol at Go, OpenAI Five beat professional human teams in the multiplayer strategy game Dota 2, and AlphaFold achieved higher accuracy than humans in protein structure prediction. The Turing Test measures AI’s ability to mimic human intelligence, and on these tasks AI can be said to have passed it. The capabilities behind these tasks span many aspects of human intelligence, including perception, cognition, and decision-making.
However, until OpenAI released the pre-trained large model GPT-3 in 2020, all of these tasks were solved by “specialized” models. Take Chinese-to-English translation as an example: expert-system approaches required linguists and translation experts to hand-write rule libraries; traditional machine learning methods learned the conditional probabilities from Chinese to English from bilingual corpora; and traditional deep learning methods fine-tuned pre-trained representations on labeled data to obtain a Chinese-to-English model. Today, however, ChatGPT-like chat applications built on pre-trained large models solve “general” tasks through a conversational interface: they can not only translate between dozens of languages but also handle question answering, summarization, writing, and other natural language understanding and generation tasks. Furthermore, the recently released GPT-4o model unifies multimodal understanding and generation tasks such as speech recognition, dialogue, and speech synthesis.
AI Agent: The User of Tools

When AI can only solve specific tasks, we see it as a tool being used. When AI can solve general tasks and offers a natural interactive interface, we see it as an assistant, or a Co-Pilot: we tell it specific tasks, and it does as we say.
The “general” capability of pre-trained large models is reflected not only in content understanding and generation but also in planning and tool use, which are thinking and decision-making tasks. For example, given a picture of several children and some bread rolls and the question “How many rolls does each child get on average?”, a large model can plan the task into three steps: detecting the rolls, detecting the children, and performing the division, calling the corresponding object detection model or arithmetic tool at each step. When AI transitions from being a tool to being the subject that uses tools, such an AI system, with task planning and tool-use capabilities, can be called an Auto-Pilot, or AI Agent. In Co-Pilot mode, AI is a human assistant, collaborating with humans within a workflow; in Auto-Pilot mode, AI acts as the human’s agent, independently taking on most of the work, with humans responsible only for setting task goals and evaluating results.
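To make this concrete, here is a minimal, hypothetical sketch of such a plan-then-call loop; the tool names, mock counts, and the hard-coded plan are illustrative stand-ins for what a large model would actually generate.

```python
# Hypothetical sketch of plan-then-call tool use. In a real system the
# large model would generate the plan; here it is hard-coded, and the
# two tools are mock stand-ins for a detector and an arithmetic tool.

def detect_objects(image, label):
    """Stand-in for an object detection model; returns a count."""
    mock_counts = {"bread roll": 6, "child": 3}
    return mock_counts[label]

def divide(a, b):
    """Stand-in arithmetic tool."""
    return a / b

TOOLS = {"detect_objects": detect_objects, "divide": divide}

# The three-step plan the model might produce for the question above.
plan = [
    ("detect_objects", {"image": "photo.jpg", "label": "bread roll"}),
    ("detect_objects", {"image": "photo.jpg", "label": "child"}),
]
rolls, children = (TOOLS[name](**args) for name, args in plan)
print(TOOLS["divide"](rolls, children))  # 2.0 rolls per child
```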
In fact, solutions for AI Agents have been explored since the inception of artificial intelligence. Broadly, there have been three paradigms: rule-based design, reinforcement learning, and now pre-trained large models. Representative reinforcement learning AI Agents are AlphaGo and OpenAI Five, which still interact with specific environments and are aimed at specialized tasks. Pre-trained large models, having learned general world knowledge and taking language as both input and output, can generalize across tasks and environments. Specifically, AI Agents based on pre-trained large models fall into two categories: intelligent agents and intelligent entities.
Intelligent Agents: Designed and Applied Like Humans

Currently, the main way to convey demands and instructions to AI Agents is through prompts. By designing prompts with specific structures, we find that AI Agents exhibit human-like thinking, reasoning, and self-correction abilities.
The standard way to use ChatGPT-like applications is question-and-answer, which exercises the model’s human-like System 1 ability: direct, fast responses. Tasks requiring deeper logic, such as arithmetic word problems, call for step-by-step exemplar prompts that guide the large model through deliberate System 2 chain-of-thought (CoT) reasoning. For still more complex tasks, such as the Game of 24, the linear reasoning of a chain of thought is insufficient; one must construct a backtrackable tree of prompts that guides the large model through tree-of-thoughts reasoning, exploring candidate solutions from different angles.
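To make the tree structure concrete, here is a plain depth-first solver for the Game of 24 (an illustrative sketch, not from the original text); tree-of-thoughts prompting replaces this exhaustive branching with steps that the large model proposes and evaluates itself, while keeping the same backtrackable search shape.

```python
from itertools import permutations

# Depth-first search over a tree of intermediate states for the Game of
# 24: pick any two numbers, combine them with an operator, and recurse
# on the smaller list, backtracking on dead ends.

def solve24(nums, target=24, eps=1e-6):
    if len(nums) == 1:
        return [] if abs(nums[0] - target) < eps else None
    for a, b, *rest in permutations(nums):
        ops = [("+", a + b), ("-", a - b), ("*", a * b)]
        if abs(b) > eps:  # guard against division by zero
            ops.append(("/", a / b))
        for op, val in ops:
            steps = solve24([val] + rest, target, eps)
            if steps is not None:  # this branch reached the target
                return [f"{a} {op} {b} = {val}"] + steps
    return None  # every branch from this state fails

print(solve24([4, 9, 10, 13]))
# ['4 - 10 = -6', '9 - 13 = -4', '-4 * -6 = 24']
```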
Another approach combines reasoning with action feedback. Confucius’s dictum “learning without thinking leads to confusion; thinking without learning leads to peril” applies to AI Agents as well: “thinking” corresponds to reasoning, and “learning” to action. A technique called ReAct combines Reasoning and Action: through a Thought-Action-Observation cycle, it dynamically adjusts the reasoning process based on observed action feedback, enhancing adaptability and flexibility in complex environments.
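A minimal sketch of such a Thought-Action-Observation loop might look as follows; `llm` and `run_tool` are assumed placeholders, and the `Finish[...]` stopping convention follows the style of the ReAct paper.

```python
# Minimal sketch of a ReAct-style Thought-Action-Observation loop.
# `llm` should return the model's next "Thought: ... Action: ..."
# continuation; `run_tool` should execute an action string such as
# "Search[Beijing]" and return observation text.

def react_loop(question, llm, run_tool, max_steps=8):
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)            # model emits Thought + Action
        transcript += step + "\n"
        action = step.split("Action:")[-1].strip()
        if action.startswith("Finish["):  # convention: Finish[answer]
            return action[len("Finish["):-1]
        observation = run_tool(action)    # execute the tool call
        transcript += f"Observation: {observation}\n"
    return None  # reasoning did not converge within max_steps
```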
In addition to the capabilities acquired during training, we hope AI Agents can also learn while executing tasks: drawing lessons from failures and distilling experience from successes. A technique called Reflexion mimics human self-reflection: building on the aforementioned CoT and ReAct, it consolidates action trajectories and the rewards or penalties of task success or failure into an experiential memory that guides future task execution.
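Reflexion can be sketched as an outer loop around such an agent: after each failed trial, the model writes a short self-reflection that is stored in memory and shown to the agent on the next attempt. The function names below are hypothetical placeholders.

```python
# Sketch of a Reflexion-style outer loop. `run_agent`, `evaluate`, and
# `reflect` are hypothetical placeholders for an agent rollout, a
# success check, and an LLM call that summarizes what went wrong.

def reflexion(task, run_agent, evaluate, reflect, max_trials=3):
    memory = []  # accumulated lessons from earlier failures
    for trial in range(max_trials):
        trajectory = run_agent(task, memory)  # act, conditioned on memory
        if evaluate(task, trajectory):        # reward signal: success?
            return trajectory
        memory.append(reflect(task, trajectory))  # learn from the failure
    return None
```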
Equipped with these human-like abilities, AI Agents can be designed to replace humans in completing some complex tasks. For individuals, a typical application is personal assistants.
The release of models like GPT-4o and Google’s Project Astra further improves the usability of natural-language personal assistants, fundamentally changing how we acquire and produce information. In the past, different information needs had to be met on different websites and apps; in the future, all needs can be gathered by a personal assistant, which interacts with those websites and apps on our behalf. Over time, AI Agents can learn users’ preferences and habits, providing more personalized service. AI Agents will become the new entry point for information in the intelligent era.
For enterprises, the form of enterprise software will also change. From a usage perspective, traditional software is process-oriented, often containing hundreds of operational workflows; AI Agent-based software will be goal-oriented, hiding most human-facing operations and letting large models replace humans in calling tools and in planning and executing task chains. In terms of the needs it meets, traditional software standardizes high-frequency needs, while AI Agent-based software can be “produced on demand,” meeting long-tail, dynamic needs by building AI Agents that fit enterprise workflows. AI Agents will become customized digital employees, capable of understanding business logic and automatically executing highly repetitive tasks. Enterprise software is thus evolving from SaaS to MaaS (Model as a Service) and, after the intelligent upgrade of cloud services, will evolve again into Agent as a Service, achieving automated upgrades.
In summary, as intelligent agents, AI Agents are positioned as personal assistants on the consumer side, with natural interaction as the basic requirement and personalization as the goal; on the business side, they are positioned as digital employees, with automation as the basic requirement and specialization as the goal. As AI Agents transition from tools to users of tools, person-to-person and person-to-content relationships on the internet will also change profoundly.
Intelligent Entities: Analyzing and Interacting Like Humans

In the five levels of AGI described by Turing Award winner Yoshua Bengio, we are currently at roughly the third level, multimodal perception, and the fourth level, embodiment and action; the ultimate fifth level is social interaction. “Stanford Town” is an interesting attempt to explore the social interaction capabilities of AI Agents: it creates AI Agents that simulate human social behavior, analyzing large models’ capabilities in complex social environments.
An important scenario for AI Agents in social interaction is virtual companionship, which is relatively easy to achieve in the short term given its task complexity and fault tolerance. Applications targeting adults, such as Character.ai, Talkie, and Linky, have already emerged; in the future, AI Agents providing virtual companionship for the elderly and for children may hold even greater social value. From a technical perspective, enhancing large models’ role-playing abilities and analyzing and adjusting their personality traits are directions with both research significance and application value.
Analyzing the cultural values of large models is crucial for the large-scale deployment of AI Agents in social interaction. Viewed through Hofstede’s cultural dimensions theory, mainstream large models exhibit different cultural value tendencies, and these tendencies can be further deepened by the distribution of fine-tuning data and user populations. Models with stronger overall capabilities adapt relatively better: they can adjust their responses to align with a given cultural value tendency based on the system message or the language of the prompt.
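One simple way to probe such tendencies, sketched below under loose assumptions, is to pose Likert-scale items in the spirit of Hofstede’s dimensions under different system messages and compare the scores; the items and the `chat` function are illustrative stand-ins, not an official survey instrument.

```python
# Illustrative sketch: probe a model's cultural value tendency by asking
# Likert-scale items under different system messages, then compare the
# scored answers. `chat(system, user) -> str` is a stand-in for any
# chat API; the items are paraphrased examples only.

ITEMS = {
    "individualism": "Rate 1-5: Personal goals should take priority over group goals.",
    "power_distance": "Rate 1-5: Subordinates should not question their superiors' decisions.",
}

PERSONAS = {
    "default": "You are a helpful assistant.",
    "collectivist": "You answer from a strongly collectivist cultural perspective.",
}

def probe(chat):
    scores = {}
    for persona, system in PERSONAS.items():
        for dim, item in ITEMS.items():
            reply = chat(system, item + " Answer with a single digit.")
            scores[(persona, dim)] = int(reply.strip()[0])  # naive parse
    return scores
```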
Another issue worth noting is the echo chamber problem for AI Agents. In the intelligent era, this may evolve into a model echo chamber: since AI Agents are often built on the same large model or derived from similar base models, their inherent values, preferences, and decision-making logic may be amplified during social interaction, biasing the entire information ecosystem and leading to cognitive fixation. Classic sociological questions need to be re-analyzed and re-understood in the context of AI Agents, to promote the diversified development of intelligent entities and realize a safe, sustainable intelligent society.
Science fiction writer Arthur C. Clarke suggested as early as 1964 that humans may be “the stepping stones” to higher forms of life. By analogy with the levels of autonomous driving, current AI sits between L3 (Co-Pilot) and L4 (Agent), while L5 would be a new species, silicon-based life, which many fear as the “singularity.” Ensuring the safety of large models currently relies on two stages after pre-training: supervised fine-tuning (SFT) on instruction data, and reinforcement learning from human feedback (RLHF) on preference data. RLHF is very effective as long as human evaluators can provide high-quality feedback signals. However, on the time scale of AI capability evolution, human evaluation ability is relatively fixed; beyond a certain critical point, humans will no longer be able to provide effective feedback signals for aligning AI systems. OpenAI’s superalignment proposal discusses how to control and supervise superhuman-level AI. Superintelligence and superalignment form a main thread in the future development of artificial intelligence: one explores the limits of capability, the other guards the safety baseline; one forges the sharpest spear, the other builds the strongest shield.
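For readers who want the mechanics, the reward-model objective commonly used in RLHF (the standard Bradley-Terry form from the literature, not taken from this article) is:

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim D}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$$

where $x$ is a prompt, $y_w$ and $y_l$ are the human-preferred and rejected responses, $r_\theta$ is the reward model, and $\sigma$ is the sigmoid; the language model is then optimized (for example, with PPO) to maximize $r_\theta$ under a KL penalty that keeps it close to the SFT model.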
In past technological revolutions, the popularization of electricity, computing, and information relied on improvements to infrastructure such as power grids, computers, and the internet, serving users through electrical appliances, personal computers, and websites and apps. The infrastructure of the ongoing intelligence revolution is the large model, and its application carrier is the AI Agent. As the marginal cost of intelligence approaches zero, we can foresee a massive explosion of AI Agent applications.
Everything is computable. We cannot help but wonder: where do the capability boundaries of large models, grounded in the theory of computation, lie? To better design, apply, analyze, and coexist with them, we first need to truly understand them. Everyone may need to learn some theory of computation: it is not only the “language” for communicating with emerging social entities such as AI Agents but also the “physics” underpinning the logic of the future digital world.
(The author is a professor at the School of Computer Science and Technology, Beijing Jiaotong University)
