Comprehensive Breakdown: What Is an AI Agent?

Source: zhuanlan.zhihu.com/p/657737603

Introduction
1. Research Background
2. What is an AI Agent
3. Development Path from NLP to AGI
4. Why We Need AI Agents
5. Framework of Intelligent Agents
6. Application Scenarios of Intelligent Agents
Open Questions Discussion

Author Introduction

Lulu who loves eating avocados, Master’s in Electronics and Communication Engineering from Peking University.

An Agent refers to an intelligent entity.

Introduction

AI Agents have recently become a hot topic in AI circles, and venture capital investors are closely watching the progress of related startups. Many people complain that we have not yet fully understood large models and now AI Agents have arrived. There is no need to worry, though: Agents are still at an early stage.

The capabilities of AI Agents are actually intertwined with those of large models, as the boundaries of large model capabilities ultimately determine the boundaries of AI Agent capabilities.

Recently, researchers at Fudan University and Stanford University have each published their views on and understanding of AI Agents.

1. Research Background

As early as the 1950s, Alan Turing expanded the concept of “intelligence” to artificial entities and proposed the famous Turing Test. These artificial intelligence entities are usually referred to as agents. The concept of “agent” originates from philosophy, describing an entity that possesses desires, beliefs, intentions, and the ability to take action. A Stanford University paper titled “Generative Agents: Interactive Simulacra of Human Behavior” delves into the AI Agent’s memory, reactions, and planning.

2. What is an AI Agent

AI Agents are considered the next direction for OpenAI. OpenAI co-founder Andrej Karpathy mentioned at a recent public event that “Compared to model training methods, OpenAI is currently more focused on changes in the Agent field, and whenever a new AI Agents paper comes out, there is excitement and serious discussion internally.”

In the field of artificial intelligence, the term “agent” has taken on a new meaning: an intelligent entity characterized by autonomy, reactivity, proactivity, and social ability.

An AI Agent is designed to think and act independently. You only need to give it a goal, such as writing a game or building a web page, and it generates a sequence of tasks and works through them, guided by feedback from the environment and its own inner monologue. In effect, the AI prompts itself and responds to its own feedback, continuously adapting so that it reaches the goal you set as well as it can.
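The goal-to-task loop described above can be sketched in a few lines. This is a minimal illustration, not any particular framework's implementation: `run_agent`, `toy_plan`, `toy_act`, and `toy_revise` are hypothetical names, and the three toy functions stand in for what would be LLM calls and real environment interaction.

```python
from collections import deque
from typing import Callable, Deque, List


def run_agent(goal: str,
              plan: Callable[[str], List[str]],
              act: Callable[[str], str],
              revise: Callable[[str, str], List[str]],
              max_steps: int = 10) -> List[str]:
    """Plan tasks for a goal, execute them, and let feedback add new tasks."""
    tasks: Deque[str] = deque(plan(goal))
    log: List[str] = []
    steps = 0
    while tasks and steps < max_steps:
        task = tasks.popleft()
        feedback = act(task)                  # feedback from the environment
        log.append(f"{task} -> {feedback}")
        tasks.extend(revise(task, feedback))  # self-prompting: feedback may spawn follow-up tasks
        steps += 1
    return log


# Hypothetical stand-ins for LLM calls and tool execution.
def toy_plan(goal: str) -> List[str]:
    return [f"outline {goal}", f"implement {goal}"]


def toy_act(task: str) -> str:
    return "failed" if task.startswith("implement") else "ok"


def toy_revise(task: str, feedback: str) -> List[str]:
    return [f"debug {task}"] if feedback == "failed" else []


log = run_agent("a web page", toy_plan, toy_act, toy_revise)
```

The `max_steps` cap is a design choice worth keeping in any real loop, since a self-prompting agent can otherwise keep generating follow-up tasks indefinitely.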

3. Development Path from NLP to AGI

The development path from NLP to AGI can be divided into five levels: corpus, internet, perception, embodiment, and social attributes. Large language models have currently reached the second level, with text input and output at internet scale. If LLM-based Agents are additionally given perception and action spaces, they will reach the third and fourth levels. Beyond that, multiple agents that interact and cooperate can solve more complex tasks or reflect real-world social behavior, potentially reaching the fifth level: an agent society.

4. Why We Need AI Agents

Why do we need AI Agents so soon after large language models (LLMs) became popular? Combining LLMs with tooling such as LangChain has opened up many possibilities in content generation, coding, and analysis. A representative example in ChatGPT is the Code Interpreter plugin, an application in which the concept of an Agent plays a crucial role.

Here, the Agent can be viewed as the brain of artificial intelligence, using LLMs for reasoning, planning, and taking action.

Large language models (LLMs) are limited to the knowledge they were trained on, and that knowledge quickly becomes outdated. (Retraining such a large model every day on the latest information is impractical.)

Some drawbacks of LLMs

Can produce hallucinations.
Results are not always factual.
Limited or no understanding of current events.
Struggle with complex calculations.

This is where AI Agents come into play, as they can leverage external tools to overcome these limitations.

What are these tools? Tools are plugins, integrated APIs, code libraries, etc., that the agent uses to complete specific tasks, such as:

Google Search: For obtaining the latest information
Python REPL: For executing code
Wolfram: For performing complex calculations
External APIs: For retrieving specific information

LangChain provides a general framework that makes it easy to invoke these tools based on instructions from a large language model. AI Agents exist to handle a variety of complex tasks, and they fall into two main categories: action-oriented agents and plan-and-execute agents.
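To make the idea of tool calling concrete, here is a minimal sketch in plain Python. It deliberately does not use LangChain's API. The `tool_name: input` line format, the `dispatch` function, and the canned `lookup` data are all assumptions made for illustration; a real agent would receive that line from an LLM.

```python
import ast
import operator
from typing import Callable, Dict

_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv,
        ast.Pow: operator.pow}


def calculator(expression: str) -> str:
    """Evaluate +, -, *, /, ** on numbers without calling eval()."""
    def ev(node: ast.AST):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -ev(node.operand)
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))


def lookup(key: str) -> str:
    """Stand-in for a search engine or external API (canned, invented data)."""
    return {"capital of France": "Paris"}.get(key, "no result")


TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator, "lookup": lookup}


def dispatch(model_output: str) -> str:
    """Route a 'tool_name: input' line, as a model might emit it, to the matching tool."""
    name, _, arg = model_output.partition(":")
    tool = TOOLS.get(name.strip())
    if tool is None:
        return f"unknown tool: {name.strip()}"
    return tool(arg.strip())


answer = dispatch("calculator: 12 * (3 + 4)")
```

The tool's result would normally be fed back into the model's context so that it can decide on the next step.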

5. Framework of Intelligent Agents

The conceptual framework of LLM-based Agents consists of three components: control end (Brain), perception end (Perception), and action end (Action). Below, we will introduce each:

Control End: Brain

This is the core of the intelligent agent. It not only stores memory and knowledge but also performs essential functions such as information processing and decision-making. It can expose its reasoning and planning processes and respond effectively to unseen tasks, which reflects the generalization and transferability of intelligent agents. Because the brain is the most critical component of an intelligent agent, the authors elaborate on its capabilities from five aspects:

Natural Language Interaction: Language is the medium of communication, containing rich information. Thanks to the powerful natural language generation and understanding capabilities of LLMs, intelligent agents can interact with the outside world through multi-turn dialogues to achieve goals. Specifically, it can be divided into two aspects:

High-Quality Text Generation: Numerous evaluation experiments show that LLMs can generate fluent, diverse, novel, and controllable text. Although performance may be lacking in some languages, they generally possess good multilingual capabilities.
Understanding Implications: In addition to the content presented, language may also convey the speaker’s intentions, preferences, and other information. Understanding implications helps agents communicate and cooperate more efficiently, and large models have already shown potential in this regard.

Knowledge: Based on large-scale corpus training, LLMs have the ability to store vast amounts of knowledge. In addition to language knowledge, common sense knowledge and specialized skill knowledge are important components of LLM-based Agents.

Although LLMs still face issues of knowledge obsolescence and hallucinations, some existing research can alleviate these problems to some extent through knowledge editing or external knowledge base calls.

Memory: In this framework, the memory module stores the agent’s past observations, thoughts, and action sequences. Through specific memory mechanisms, the agent can effectively reflect on and apply previous strategies, leveraging past experiences to adapt to unfamiliar environments.

Short-Term Memory: All in-context learning (see prompt engineering) can be regarded as using the model’s short-term memory.
Long-Term Memory: This gives the agent the ability to retain and recall information over long periods (in principle, indefinitely), usually by means of an external vector store with fast retrieval.

There are three common approaches to enhancing memory capabilities:

Extending the Backbone’s Length Limit: Relaxing the inherent sequence-length limitation of Transformers.
Summarizing Memory: Summarizing memories so that the agent can more easily extract key details from them.
Compressing Memory: Compressing memories into vectors or other suitable data structures to improve retrieval efficiency.

Additionally, the retrieval methods for memory are also crucial; only by retrieving appropriate content can the agent access the most relevant and accurate information.
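The split between short-term and long-term memory, and the role of retrieval, can be sketched as follows. This is a toy illustration under stated assumptions: `AgentMemory` is a hypothetical class, the bag-of-words `embed` function stands in for a learned embedding model, and a plain list stands in for an external vector store.

```python
import math
from collections import Counter, deque
from typing import Deque, List, Tuple


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a real system would use a learned encoder)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class AgentMemory:
    def __init__(self, short_term_size: int = 3) -> None:
        # Short-term memory: a bounded window, like a model's context.
        self.short_term: Deque[str] = deque(maxlen=short_term_size)
        # Long-term memory: (vector, text) pairs, standing in for a vector store.
        self.long_term: List[Tuple[Counter, str]] = []

    def remember(self, observation: str) -> None:
        self.short_term.append(observation)
        self.long_term.append((embed(observation), observation))

    def recall(self, query: str, k: int = 1) -> List[str]:
        """Return the k stored observations most similar to the query."""
        q = embed(query)
        ranked = sorted(self.long_term, key=lambda item: cosine(q, item[0]), reverse=True)
        return [text for _, text in ranked[:k]]


memory = AgentMemory(short_term_size=2)
for obs in ["the key is under the mat", "the door is locked",
            "it is raining outside", "the cat is asleep"]:
    memory.remember(obs)
hits = memory.recall("where is the key")
```

Here the oldest observation has already dropped out of the short-term window, yet it can still be recovered from long-term memory because retrieval ranks it as the closest match to the query.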

Reasoning & Planning: Reasoning ability is crucial for intelligent agents to make decisions and analyze complex tasks. Specifically for LLMs, this involves a series of prompting methods represented by Chain-of-Thought (CoT). Planning is a commonly used strategy when facing large challenges. It helps agents organize thoughts, set goals, and determine the steps to achieve these goals. In practical implementation, planning can include two steps:

Plan Formulation: The agent decomposes complex tasks into more manageable subtasks. For example: decomposing once and executing in sequence, step-by-step planning and execution, multi-path planning and selecting the optimal path, etc. In scenarios requiring specialized knowledge, agents can integrate with specific domain Planner modules to enhance capabilities.
Plan Reflection: After formulating a plan, reflection and evaluation of its pros and cons can occur. This reflection generally comes from three aspects: leveraging internal feedback mechanisms; interacting with humans for feedback; obtaining feedback from the environment.
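The two planning steps can be sketched as a small loop: draft a plan, then let a reflection step revise it until no further changes are proposed. The names `plan_with_reflection`, `toy_formulate`, and `toy_reflect` are hypothetical, and the rule-based reflection below stands in for feedback that would come from the model itself, from a human, or from the environment.

```python
from typing import Callable, List, Optional


def plan_with_reflection(task: str,
                         formulate: Callable[[str], List[str]],
                         reflect: Callable[[List[str]], Optional[List[str]]],
                         max_rounds: int = 3) -> List[str]:
    """Draft a plan, then revise it until reflection proposes no change."""
    plan = formulate(task)
    for _ in range(max_rounds):
        revised = reflect(plan)
        if revised is None:      # reflection is satisfied with the plan
            break
        plan = revised
    return plan


def toy_formulate(task: str) -> List[str]:
    # Plan formulation: decompose the task once into ordered subtasks.
    return ["mix ingredients", "bake", "serve"]


def toy_reflect(plan: List[str]) -> Optional[List[str]]:
    # Plan reflection: a fixed rule standing in for model, human, or environment feedback.
    if "preheat oven" not in plan:
        i = plan.index("bake")
        return plan[:i] + ["preheat oven"] + plan[i:]
    return None


plan = plan_with_reflection("bake a cake", toy_formulate, toy_reflect)
```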

Transferability & Generalization: LLMs with world knowledge endow intelligent agents with strong transfer and generalization capabilities. A good agent is not just a static knowledge base but should also possess dynamic learning abilities:

Generalization to Unknown Tasks: As model scale and training data increase, LLMs have demonstrated remarkable abilities in solving unknown tasks. Large models fine-tuned with instructions perform well in zero-shot tests, achieving results comparable to expert models across many tasks.
In-context Learning: Large models can learn by analogy from a few examples given in context, and this capability also extends to multimodal scenarios beyond text, opening up more possibilities for agents in real-world applications.
Continual Learning: The main challenge of continual learning is catastrophic forgetting, where models tend to lose knowledge from previous tasks when learning new ones. Specialized intelligent agents should strive to avoid losing knowledge from general domains.

Perception End: Perception

Multimodal perception deepens the agent’s understanding of the working environment and significantly enhances its generality.

Text Input: As the most fundamental capability of LLMs, this will not be elaborated further.

Visual Input: LLMs themselves do not possess visual perception capabilities and can only understand discrete text content. Visual input typically contains a wealth of information about the world, including object properties, spatial relationships, scene layouts, etc. Common methods include:

Transforming visual input into corresponding text descriptions (Image Captioning): This can be directly understood by LLMs and has high interpretability.
Encoding visual information: Forming a perception module with visual foundation models + LLMs, allowing the model to understand content from different modalities through alignment operations, which can be trained end-to-end.

Auditory Input: Auditory perception is also an important component of human perception. Because LLMs are good at calling tools, a straightforward idea is to use the LLM as a control hub that invokes existing toolsets or expert models in cascade to perceive audio information. Audio can also be represented visually as a spectrogram, a 2D image of frequency content over time, which allows some visual processing methods to be transferred to the audio domain.
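To show what turning audio into a 2D image means in practice, here is a minimal magnitude spectrogram computed with a naive discrete Fourier transform. It is a sketch for illustration only: the function name, frame size, hop, and test tone are all assumptions, and real systems would use an FFT library and a window function.

```python
import cmath
import math
from typing import List


def spectrogram(samples: List[float], frame_size: int, hop: int) -> List[List[float]]:
    """Magnitude spectrogram: rows are time frames, columns are frequency bins."""
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = samples[start:start + frame_size]
        row = []
        for k in range(frame_size // 2 + 1):  # non-negative frequencies only
            acc = sum(x * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n, x in enumerate(frame))
            row.append(abs(acc))
        frames.append(row)
    return frames


# An 8 Hz sine tone sampled at 64 Hz (values chosen so each frame holds whole cycles).
sample_rate = 64
tone = [math.sin(2 * math.pi * 8 * t / sample_rate) for t in range(128)]
spec = spectrogram(tone, frame_size=32, hop=16)

# With 32-sample frames, bins are 2 Hz apart, so the 8 Hz tone peaks in bin 4.
peak_bins = [row.index(max(row)) for row in spec]
```

Each row of `spec` can be treated as one column of pixels in an image, which is what lets image models be reused on audio.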

Other Inputs: Information in the real world extends beyond text, visuals, and audio. The authors hope that future intelligent agents will be equipped with richer perception modules, such as touch, smell, and other senses, so that they can capture richer attributes of target objects. Agents should also have a clear perception of the temperature, humidity, and brightness of their surroundings, so that they can take more environment-aware actions.

Furthermore, agents can be given the ability to perceive the broader environment by using mature perception modules such as LiDAR, GPS, and inertial measurement units.

Action End: Action

After the brain analyzes and makes decisions, the agent must also take action to adapt to or change the environment:

Text Output: As the most basic capability of LLMs, this will not be elaborated further.

Tool Usage: Although LLMs have excellent knowledge reserves and professional capabilities, they may still suffer from robustness issues, hallucinations, and other challenges when facing specific problems. Tools, as extensions of their user’s capabilities, can help with specialization, factuality, and interpretability. For example, a calculator can be used to solve mathematical problems, and a search engine can be used to find real-time information.

Moreover, tools can also expand the action space of intelligent agents. For instance, by calling voice generation, image generation, and other expert models, multimodal action methods can be obtained. Therefore, how to make agents proficient tool users, i.e., learning how to effectively utilize tools, is a very important and promising direction.

Currently, the main approaches to tool learning are learning from demonstrations and learning from feedback. Meta-learning and curriculum learning can also help agents generalize across a variety of tools. Going further, intelligent agents can learn to create their own tools and become “self-sufficient,” increasing their autonomy and independence.

Embodied Action: Embodiment refers to the agent’s ability to understand and modify the environment and to update its own state while interacting with it. Embodied action is seen as a bridge between virtual intelligence and physical reality.

Traditional reinforcement learning-based agents face limitations in sample efficiency, generalization, and complex problem reasoning. In contrast, LLM-based agents, by introducing the rich internal knowledge of large models, enable embodied agents to actively perceive and influence the physical environment like humans. Depending on the agent’s level of autonomy in the task or the complexity of the action, the following atomic actions can be identified:

Observation can help intelligent agents locate themselves in the environment, perceive objects, and gather other environmental information;
Manipulation involves completing specific tasks like grasping and pushing;
Navigation requires intelligent agents to change their location based on task objectives and update their state according to environmental information.

By combining these atomic actions, agents can accomplish more complex tasks. For example, answering a question like “Is the watermelon in the kitchen larger than the bowl?” requires the agent to navigate to the kitchen and observe the sizes of both to arrive at an answer.
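The watermelon example can be sketched by composing two of the atomic actions, navigation and observation. Everything here is a toy assumption: the `WORLD` dictionary, the object sizes, and the `EmbodiedAgent` class are invented for illustration, and a real embodied agent would perceive through sensors rather than read a lookup table.

```python
from dataclasses import dataclass, field
from typing import Dict

# Toy world: room -> {object name: size in cm}. All values are invented.
WORLD: Dict[str, Dict[str, float]] = {
    "hallway": {},
    "kitchen": {"watermelon": 30.0, "bowl": 18.0},
}


@dataclass
class EmbodiedAgent:
    location: str = "hallway"
    seen: Dict[str, float] = field(default_factory=dict)

    def navigate(self, room: str) -> None:
        """Navigation: change location and reset what is in view."""
        if room not in WORLD:
            raise ValueError(f"unknown room: {room}")
        self.location = room
        self.seen = {}

    def observe(self) -> Dict[str, float]:
        """Observation: perceive the objects in the current room."""
        self.seen = dict(WORLD[self.location])
        return self.seen

    def is_larger(self, a: str, b: str, room: str) -> bool:
        """Compose atomic actions to answer 'is a in this room larger than b?'."""
        self.navigate(room)
        seen = self.observe()
        return seen[a] > seen[b]


agent = EmbodiedAgent()
answer = agent.is_larger("watermelon", "bowl", "kitchen")
```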

6. Application Scenarios of Intelligent Agents

Three application paradigms of LLM-based Agents: single-agent, multi-agent, and human-machine interaction.

Single-Agent Scenarios

Intelligent agents that accept natural language commands from humans and carry out everyday tasks are currently favored by users and have significant practical value. The authors first elaborate on the diverse application scenarios and corresponding capabilities of single intelligent agents.

In the paper, the applications of single intelligent agents are categorized into three levels:

Three levels of single-agent application scenarios: task-oriented, innovation-oriented, and lifecycle-oriented.

In task-oriented deployments, agents assist human users in handling basic daily tasks. They need basic capabilities for understanding instructions, decomposing tasks, and interacting with the environment. Based on existing task types, the practical applications of such agents can be divided into simulated web environments and simulated real-life scenarios.
In innovation-oriented deployments, agents can demonstrate the potential for autonomous exploration in cutting-edge scientific fields. Although inherent complexities from specialized domains and a lack of training data pose challenges to building intelligent agents, significant progress has been made in fields like chemistry, materials, and computer science.
In lifecycle-oriented deployments, agents possess the ability to continuously explore, learn, and acquire new skills in an open world, ensuring long-term survival. In this section, the authors use the game “Minecraft” as an example. The survival challenges in the game can be seen as a microcosm of the real world, and many researchers have used it as a unique platform to develop and test agents’ comprehensive abilities.

Multi-Agent Scenarios

Two interaction forms in multi-agent application scenarios: cooperative interaction and adversarial interaction.

Cooperative Interaction: As the most widely deployed type in practical applications, cooperative agent systems can effectively enhance task efficiency and improve decision-making collaboratively. Specifically, based on different forms of cooperation, the authors further categorize cooperative interactions into unordered and ordered cooperation.

When all agents express their opinions freely and collaborate in an unordered manner, it is called unordered cooperation.
When all agents follow certain rules, such as expressing their opinions one by one in a pipeline format, the entire cooperation process is orderly, referred to as ordered cooperation.

Adversarial Interaction: Intelligent agents interact in a tit-for-tat manner. Through competition, negotiation, and debate, agents discard potentially erroneous beliefs and meaningfully reflect on their actions or reasoning processes, ultimately enhancing the overall quality of system responses.

Human-Machine Interaction Scenarios

Two modes of human-machine interaction scenarios: Instructor-Executor mode vs. Equal Partnership mode.

Instructor-Executor Mode: Humans act as instructors, providing instructions and feedback, while agents act as executors, adjusting and optimizing step by step according to those directives. This mode has been widely applied in education, healthcare, and business.
Equal Partnership Mode: Some studies have observed that agents can exhibit empathy in interactions with humans or participate in task execution as equal partners. Intelligent agents demonstrate potential applications in daily life and are expected to integrate into human society in the future.

Overview of AI Agents

Open Questions Discussion

1. How should research on intelligent agents and large language models mutually promote and develop together?

Large models demonstrate strong potential in language understanding, decision-making, and generalization capabilities, becoming key roles in the agent construction process. Conversely, advancements in agents also place higher demands on large models.

2. What challenges and concerns will LLM-based Agents bring?

Before intelligent agents can truly be deployed, they require rigorous safety assessments to avoid harm to the real world. The authors also summarize further potential threats, such as illegal misuse, unemployment risks, and impacts on human well-being.

3. What opportunities and challenges will arise from scaling up the number of agents?

In simulated societies, increasing the number of individuals can significantly enhance the credibility and realism of simulations. However, as the number of agents increases, communication and message propagation issues can become quite complex, leading to distortions, misunderstandings, or hallucination phenomena that can significantly reduce the overall efficiency of the simulation system.

4. The ongoing debate online about whether LLM-based Agents are the right path to AGI.

Some researchers believe that large models represented by GPT-4 have been trained on sufficient corpora, and that agents built on this foundation have the potential to unlock the door to AGI. Other researchers argue that autoregressive language modeling does not exhibit true intelligence, because such models merely respond. In their view, a more comprehensive modeling approach, such as world models, is needed to reach AGI.

5. The evolutionary process of collective intelligence. Collective intelligence is a process of aggregating opinions from many to convert them into decisions.

However, will merely increasing the number of agents produce genuine “intelligence”? And how can individual agents be coordinated so that the agent society overcomes “groupthink” and individual cognitive biases?

6. Agent as a Service (AaaS).

Since LLM-based Agents are more complex than the large models themselves, small and medium-sized enterprises and individuals find it harder to build them locally. Cloud vendors could therefore consider offering intelligent agents as a service. Like other cloud services, AaaS has the potential to provide users with high flexibility and on-demand self-service.
