What Is an AI Agent?

Author | Yang Tao

With the launch of plugins and function calling in ChatGPT, building an AI Agent with an LLM (Large Language Model) as its core controller has become an idea with enormous potential. The usefulness of LLMs is not limited to generating high-quality articles, stories, essays, and programs; they can also serve as the foundation of a powerful general-purpose problem solver. AI Agents are undoubtedly the most exciting line of development for today's large models.

1. Concept

In computer science and artificial intelligence, the term ‘Agent’ is usually rendered as ‘intelligent entity’: a software or hardware entity that exhibits one or more intelligent characteristics, such as autonomy, reactivity, sociality, proactivity, reflectiveness, and cognition, within a given environment. An AI Agent is such an entity driven by an LLM. There is no widely accepted definition yet, but it can be described as a system that uses an LLM to reason about problems, autonomously creates plans to solve them, and uses a set of tools to execute those plans.

In short, an AI Agent is a system with complex reasoning capabilities, memory, and task execution abilities, as shown in the figure below:

[Figure: overview of an AI Agent — reasoning, memory, and task execution]

An AI Agent consists of the following core components:
  • Core of the Agent

The central coordination module manages the agent's core logic and behavioral characteristics and makes the key decisions. It needs to define the following:

    • Overall Goals of the Agent
      The overall goals and objectives that the agent must achieve.
    • Execution Tools
      A brief list of all tools that the agent can use (or a ‘user manual’).
    • Explanation of How to Use the Planning Module
      A detailed explanation of the functions of different planning modules and when to use them.
    • Relevant Memory
      This is a dynamic part that fills in the most relevant content from the user’s past conversations. ‘Relevance’ is determined based on the questions posed by the user.
    • Agent Personality (Optional)
      If the LLM is required to prefer certain types of tools or exhibit certain characteristics in the final response, the desired personality can be clearly described.
The following figure is a prompt template for solving the user’s question: ‘How much did profits grow between Q1 and Q2 of FY2024?’
[Figure: prompt template for the agent core, instantiated for the profit-growth question]
When this prompt is fed to the LLM, the LLM decides that it needs to use a search tool:

[Figure: the LLM's decision — it needs to call the search tool]
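In practice, the core prompt is usually assembled programmatically from the pieces listed above. The sketch below is a minimal, hypothetical Python template that stitches together the goal, the tool manual, the planning instructions, retrieved memory, and an optional personality; the field names and the overall layout are illustrative assumptions, not a fixed API.

```python
# Minimal sketch of an agent-core prompt builder (illustrative only).
CORE_PROMPT = """You are a financial analysis agent.
Goal: answer the user's question accurately using the tools available.

Tools you may use:
{tool_manual}

Planning: if the question needs multiple steps, first ask the planning
module to decompose it; otherwise answer directly.

Relevant memory from earlier conversations:
{memory}

{personality}

User question: {question}
Respond with either a tool call or a final answer."""

def build_core_prompt(question, tool_manual, memory_snippets, personality=""):
    return CORE_PROMPT.format(
        tool_manual=tool_manual,
        memory="\n".join(memory_snippets),
        personality=personality,
        question=question,
    )

prompt = build_core_prompt(
    question="How much did profits grow between Q1 and Q2 of FY2024?",
    tool_manual="- search(query): look up financial filings\n"
                "- calculator(expr): evaluate arithmetic",
    memory_snippets=["User previously asked about FY2023 revenue."],
)
```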

  • Memory Module

The memory module records the agent's internal logs and its interaction history with users. There are two types of memory:

    • Short-term Memory
      The thoughts and actions that the agent experiences while trying to answer a single question posed by the user. This is usually the context in prompt engineering; once the context limit is exceeded, the LLM will forget the previously input information.
    • Long-term Memory
      Behaviors and thoughts related to interactions between the user and the agent, containing conversation records spanning weeks or months. This is usually an external vector database that can retain and quickly retrieve historical information almost indefinitely.
The memory module retrieves information based on more than semantic similarity alone: retrieval is typically ranked by a composite score combining semantic similarity, importance, recency, and other task-specific metrics, so that the most useful specific information is returned.
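As a rough illustration of such a composite score, the sketch below ranks stored memories by a weighted sum of semantic similarity, importance, and recency. The weights, the exponential-decay half-life, and the record fields are assumptions chosen for clarity, not values from any particular system.

```python
import math
import time

# Hypothetical memory record: {"embedding": [...], "importance": 0..1,
#                              "timestamp": unix seconds, "text": "..."}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb + 1e-9)

def score(memory, query_embedding, now, half_life_hours=24.0,
          w_sim=1.0, w_imp=0.5, w_rec=0.5):
    similarity = cosine(memory["embedding"], query_embedding)
    age_hours = (now - memory["timestamp"]) / 3600.0
    recency = 0.5 ** (age_hours / half_life_hours)  # exponential decay
    return w_sim * similarity + w_imp * memory["importance"] + w_rec * recency

def retrieve(memories, query_embedding, k=3):
    """Return the k memories with the highest composite score."""
    now = time.time()
    return sorted(memories, key=lambda m: score(m, query_embedding, now),
                  reverse=True)[:k]
```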
  • Toolset

The toolset is a set of well-defined, executable workflows that the agent uses to perform tasks; typically these are specialized third-party APIs.
For example, the agent can use RAG (Retrieval-Augmented Generation) to generate answers grounded in retrieved context; use a code interpreter (such as a Python interpreter) to solve complex tasks programmatically; use search APIs to look up information on the internet; or call any simple API service, such as a weather API for forecasts or an instant-messaging API to send and receive messages.
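One common way to expose such a toolset to an LLM is a registry that maps a tool name to a callable plus a short description the model can read (the ‘user manual’ mentioned above). The sketch below is a generic, hypothetical example; the tool names and the registry shape are assumptions for illustration.

```python
# Hypothetical tool registry: each tool has a name, a description the LLM sees,
# and a Python callable that actually executes it.
TOOLS = {}

def register(name, description):
    def wrap(fn):
        TOOLS[name] = {"description": description, "fn": fn}
        return fn
    return wrap

@register("calculator", "Evaluate a simple arithmetic expression, e.g. '132 - 118'.")
def calculator(expression: str) -> str:
    # eval is unsafe in production; acceptable here for a sketch with trusted input.
    return str(eval(expression, {"__builtins__": {}}))

@register("search", "Search a document index for a query string.")
def search(query: str) -> str:
    # Placeholder: a real agent would call a search API or vector store here.
    return f"(stub) top documents for: {query}"

def tool_manual() -> str:
    """Render the 'user manual' that is pasted into the agent-core prompt."""
    return "\n".join(f"- {name}: {t['description']}" for name, t in TOOLS.items())

def call_tool(name: str, argument: str) -> str:
    return TOOLS[name]["fn"](argument)
```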
  • Planning Module

Complex problems, such as analyzing a set of financial statements to answer a higher-level business question, often require a step-by-step approach. For LLM-driven agents, planning is essentially the disciplined application of advanced prompt-engineering techniques. Complex problems can be addressed by combining two techniques:

    • Task and Problem Decomposition

Composite questions, or questions whose answer must be inferred from several pieces of information, require some form of decomposition. For example, the question ‘How much did profits grow between Q1 and Q2 of FY2024?’ can be decomposed into multiple sub-questions:

      • What is the profit for Q1?

      • What is the profit for Q2?

      • What is the difference between the two results above?

A capable AI Agent must be able to drive this decomposition itself, for example by using a prompt template like the following:

[Figure: prompt template for task decomposition]

When the specific problem is input to the LLM, the LLM will respond with the results of the problem decomposition:

[Figure: the LLM's decomposition of the example question into sub-questions]
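A decomposition step can be as simple as a dedicated prompt that asks the model to list sub-questions, one per line. The sketch below assumes a hypothetical `llm(prompt)` callable that returns the model's text; it is an illustrative stand-in, not the exact template shown in the figure.

```python
DECOMPOSE_PROMPT = """Break the question below into the smallest set of
sub-questions that must be answered first, one per line, ending with a line
that combines their results.

Question: {question}
Sub-questions:"""

def decompose(question: str, llm) -> list[str]:
    """`llm` is any callable that maps a prompt string to a completion string."""
    raw = llm(DECOMPOSE_PROMPT.format(question=question))
    return [line.strip("-• ").strip() for line in raw.splitlines() if line.strip()]

# Expected shape of the output for the running example:
# ["What is the profit for Q1 of FY2024?",
#  "What is the profit for Q2 of FY2024?",
#  "What is the difference between the two results above?"]
```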

    • Reflection or Critique

Techniques such as ReAct (Reason and Act), Reflexion, Chain of Thought, and Graph of Thought are frameworks built on critique-based or evidence-based prompting. They are widely used to improve the reasoning and response quality of LLMs, and they can also be used to refine the execution plans an agent generates.

For example, ReAct (Reason + Act), which is supported by the LangChain agent framework, breaks a complex problem down, retrieves missing information from external sources through tools, and folds the results back into the answer.

A ReAct prompt template generally cycles repeatedly through the following steps:

[Figure: ReAct prompt template — alternating Thought, Action, and Observation steps]

For the example question ‘What other devices can control the program that Apple Remote was originally designed to interact with?’, ReAct thinks and acts as follows:
  • First Thought: I need to find the program that Apple Remote was originally designed to interact with. First Action: search for ‘Apple Remote’. Observation: an introduction to the Apple Remote.

  • Second Thought: the results show that Apple Remote was originally designed to control the Front Row program; I need information about Front Row. Second Action: search for ‘Front Row’. Observation: the search returns nothing relevant.

  • Third Thought: I should add ‘software’ and search again. Third Action: search for ‘Front Row (software)’. Observation: an introduction to the Front Row software.

  • Fourth Thought: the key information is in the results: Front Row can be controlled by the Apple Remote or by the keyboard's function keys. Fourth Action: finish and return the answer: ‘keyboard function keys’.

This process is illustrated in the following figure:

[Figure: the ReAct reasoning loop for the Apple Remote question]
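Frameworks such as LangChain implement this loop for you, but the control flow fits in a few lines. The sketch below is a hand-rolled, simplified ReAct loop that reuses the hypothetical `tool_manual()` and `call_tool()` helpers from the toolset sketch above; the `llm` callable, the `Action: tool[input]` / `Final Answer:` line format, and the stop conditions are assumptions chosen for clarity rather than LangChain's own API.

```python
import re

REACT_PROMPT = """Answer the question by alternating Thought, Action, and
Observation steps. Actions must look like: Action: tool_name[input].
When you know the answer, reply with: Final Answer: <answer>.

Tools:
{tool_manual}

Question: {question}
{scratchpad}"""

def react(question, llm, max_steps=6):
    scratchpad = ""
    for _ in range(max_steps):
        out = llm(REACT_PROMPT.format(tool_manual=tool_manual(),
                                      question=question,
                                      scratchpad=scratchpad))
        if "Final Answer:" in out:
            return out.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", out)
        if not match:                       # model drifted off format; stop early
            return out.strip()
        tool, arg = match.group(1), match.group(2)
        observation = call_tool(tool, arg)  # run the tool, feed the result back
        scratchpad += f"{out}\nObservation: {observation}\n"
    return "No answer within the step limit."
```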

2. Challenges

Building an agent on top of an LLM currently faces a series of challenges:
  • Limited Context Length

A limited context length restricts how much historical information, detailed instruction, API-call context, and API response can fit into a single prompt, so the agent system must be designed to work within this limited communication bandwidth. Mechanisms such as self-reflection, by contrast, benefit from long or effectively unlimited context windows. Vector stores can provide access to a much larger knowledge base, but their expressiveness is weaker than attention over in-context tokens.
  • Insufficient Long-term Planning and Task Decomposition Capability

Agents struggle to plan over long histories and to explore the solution space effectively. LLMs find it difficult to adjust a plan when they hit unexpected errors; compared with humans, who learn from failure, they are still not robust enough.
  • Unreliable Natural Language Interface

Current agent systems rely heavily on natural language as the interface between the LLM and external components such as memory and tools. However, LLM output is not always reliable: models make formatting errors and sometimes fail to follow instructions. As a result, much of an agent's engineering effort goes into parsing model output.
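A common mitigation is to ask the model for structured output and parse it defensively, retrying when it drifts off format. The sketch below shows one such defensive parse of a hypothetical JSON tool call; the expected keys, the retry policy, and the correction message are illustrative assumptions, not a standard interface.

```python
import json
import re

def parse_tool_call(text: str) -> dict | None:
    """Try to pull a JSON object like {"tool": ..., "input": ...} out of raw model output."""
    match = re.search(r"\{.*\}", text, re.DOTALL)  # ignore prose around the JSON
    if not match:
        return None
    try:
        obj = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    return obj if "tool" in obj and "input" in obj else None

def call_with_retry(llm, prompt, retries=2):
    for _ in range(retries + 1):
        parsed = parse_tool_call(llm(prompt))
        if parsed is not None:
            return parsed
        prompt += "\nYour last reply was not valid JSON. Reply with JSON only."
    raise ValueError("Model never produced a parseable tool call.")
```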

3. Outlook

The generative-AI revolution has evolved to the point where three modes of collaboration between humans and AI have emerged:

[Figure: three modes of human-AI collaboration — Embedding, Copilot, and Agents]

In Agents mode, humans set the goals and provide the necessary resources, while the AI independently carries out most of the work and humans supervise the process and evaluate the final results. In this mode the AI fully embodies the interactivity, autonomy, and adaptability of an agent, approaching the role of an independent actor, while humans act mainly as supervisors and evaluators. Agents mode is more efficient than the Embedding and Copilot modes and may become the dominant mode of human-machine collaboration in the future.

AI Agents are a major force pushing artificial intelligence toward becoming infrastructure. Looking back at the history of technology, mature technologies end up as infrastructure: electricity has become as unnoticed as air yet indispensable, and cloud computing is following the same path. Nearly everyone agrees that artificial intelligence will become part of the infrastructure of future society, and agents are what is driving that shift. Because AI Agents can adapt to different tasks and environments and can learn to optimize their own performance, they are applicable across a wide range of fields and can become foundational support for many industries and social activities.
