AI-Agent Series: Intelligent Agents Centered on Large Models

An AI agent can be thought of as a large model broken down into smaller units: independent intelligent agents, much as a biological organism is composed of individual cells. More precisely, each cell carries the genes of the entire organism, so every agent holds a holographic image of the whole.

Each AI agent built on a large model carries the “genes” of its underlying LLM. Like LEGO blocks, each agent has a simple, clearly defined structure and a specific functional range, much like a cell or a gene. The “large” model exerts its power through the “small” agents, and the “large” must realize its true value within the “small,” reflecting the philosophical idea that “reversal is the movement of the Way.” –EBATOM

—

Basic Composition and Technical Principles

The basic composition of AI Agents

The NLP team at Fudan University proposes in their paper “The Rise and Potential of Large Language Model Based Agents: A Survey” that an AI Agent built on a large language model has an overall framework of three key parts: brain, perception, and action:

· Brain: Mainly composed of a large language model, which not only stores knowledge and memory but also undertakes information processing and decision-making functions, capable of presenting reasoning and planning processes to effectively handle unknown tasks.

· Perception: The core purpose of the perception module is to extend the agent’s perceptual space from the pure text domain to text, auditory, and visual modalities.

· Action: In the construction of the agent, the action module receives the action sequences sent by the brain module and executes actions that interact with the environment.

After perceiving the environment, humans integrate, analyze, and reason about the perceived information in the brain and make decisions. They then use their nervous system to control their bodies and perform adaptive or creative actions, such as conversing, avoiding obstacles, or starting a fire. When an agent possesses a brain-like structure, along with knowledge, memory, reasoning, planning, generalization capabilities, and multimodal perception abilities, it can also respond to the surrounding environment in various human-like ways.
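The brain, perception, and action split described above can be sketched as a small loop. This is only an illustrative toy, not code from the paper: the `Agent` class, `rule_based_brain`, and the action names are all hypothetical, and the "brain" is a keyword rule standing in for an LLM call.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Toy brain/perception/action agent; every name here is hypothetical."""
    # Brain: stands in for an LLM; maps (observation, memory) to an action name.
    brain: Callable[[str, List[str]], str]
    # Action module: maps action names to functions that act on the environment.
    actions: Dict[str, Callable[[str], str]]
    memory: List[str] = field(default_factory=list)

    def perceive(self, raw_input: str) -> str:
        # Perception: normalize raw input into a text observation.
        # A multimodal agent would also transcribe audio or caption images here.
        return raw_input.strip().lower()

    def step(self, raw_input: str) -> str:
        observation = self.perceive(raw_input)
        action_name = self.brain(observation, self.memory)
        result = self.actions[action_name](observation)
        # Remember what was observed and which action was chosen.
        self.memory.append(f"{observation} -> {action_name}")
        return result


def rule_based_brain(observation: str, memory: List[str]) -> str:
    # Placeholder for LLM reasoning: a single keyword rule.
    return "avoid" if "obstacle" in observation else "talk"


agent = Agent(
    brain=rule_based_brain,
    actions={
        "avoid": lambda obs: "steering around the obstacle",
        "talk": lambda obs: f"replying to: {obs}",
    },
)
```

Calling `agent.step("Obstacle ahead")` routes the input through perception, the brain, and the action module in turn; swapping `rule_based_brain` for a real model call would leave the rest of the loop unchanged.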

For those interested in details, further reading can be found in the original paper: https:///pdf/2309.07864.pdf

From the above architecture, we can see that an AI Agent may involve many components. We cannot list them all, but we can discuss some of them.

Perception: Perception is the sensory input the AI agent receives from its environment. It provides information about the current state of the observable environment in which the agent operates. For example, if the AI agent is a customer service chatbot, its perception may include:

· User messages

· User profile information

· User location

· Chat history

· Language preferences (e.g., Simplified Chinese or English)

· Time and date

· User preferences

· User emotion recognition
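One possible way to bundle the inputs listed above is a single record passed to the agent on every turn. The sketch below is an assumption for illustration: the `Percept` class, its field names, and `to_prompt_context` are hypothetical and do not come from any particular chatbot framework.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Percept:
    """One snapshot of what a support chatbot observes (hypothetical schema)."""
    user_message: str
    timestamp: datetime
    chat_history: List[str] = field(default_factory=list)
    user_profile: Dict[str, str] = field(default_factory=dict)
    user_preferences: Dict[str, str] = field(default_factory=dict)
    location: Optional[str] = None
    language: str = "en"           # e.g. "zh-Hans" or "en"
    emotion: Optional[str] = None  # output of an emotion classifier, if any

    def to_prompt_context(self) -> str:
        """Flatten the percept into text that could precede an LLM prompt."""
        lines = [
            f"time: {self.timestamp.isoformat()}",
            f"language: {self.language}",
        ]
        # Optional fields are included only when they are present.
        if self.location:
            lines.append(f"location: {self.location}")
        if self.emotion:
            lines.append(f"emotion: {self.emotion}")
        lines.extend(f"history: {turn}" for turn in self.chat_history)
        lines.append(f"user: {self.user_message}")
        return "\n".join(lines)
```

Keeping the percept as one typed object makes it explicit which signals the agent function is allowed to condition on.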

Agent Function: The core of the agent architecture is the agent’s function. It maps the agent’s perception of the environment to the actions it should take. In other words, the agent function allows the AI to determine what actions to take based on the information collected. This is where the “intelligence” of the agent lies, as it involves reasoning and selecting actions to achieve its goals. Software agents and AI tools have learning and performance elements, meaning that as the agent performs tasks, the agent function improves based on the agent’s history and training data.

Execution: The executor is essentially the “muscle” of the agent, executing the decisions made by the agent function. These actions can range widely, from driving autonomous vehicles to typing text on a chatbot’s screen. Some common executors include:

· Text response generator: This executor is responsible for generating text-based responses and sending them to users. It receives text replies from the chatbot and sends them to users through the chat interface.

· Service integration API: The chatbot may need to integrate with a system, such as the company’s CRM system, to access customer data, create support tickets, or check order status. These integrations involve API calls as executors, allowing the chatbot to interact with external systems and retrieve or update information as needed.

· Notifications and reminders: Notification executors can send email notifications, text messages, or push notifications to users’ devices, reminding them of upcoming appointments, order status changes, promotions, or other relevant updates. These executors help keep users informed and engaged.
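The three kinds of executors above can be pictured as functions behind a small dispatch table. The following is a minimal sketch under stated assumptions: the executor names, payload keys, and in-memory lists are hypothetical stand-ins for a chat interface, a CRM API, and a notification service.

```python
from typing import Any, Callable, Dict, List

# In-memory stand-ins for a chat window, a CRM, and a notification service.
outbox: List[str] = []
tickets: List[Dict[str, str]] = []
notifications: List[str] = []


def send_text(payload: Dict[str, str]) -> str:
    # Text response generator: deliver the reply to the user.
    outbox.append(payload["text"])
    return "sent"


def create_ticket(payload: Dict[str, str]) -> str:
    # Service integration: a real executor would call the CRM's API here.
    tickets.append({"customer": payload["customer"], "issue": payload["issue"]})
    return f"ticket-{len(tickets)}"


def notify(payload: Dict[str, str]) -> str:
    # Notifications and reminders: queue a message on a given channel.
    notifications.append(f"{payload['channel']}: {payload['message']}")
    return "queued"


EXECUTORS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "text_response": send_text,
    "service_api": create_ticket,
    "notification": notify,
}


def execute(decision: Dict[str, Any]) -> str:
    """Run the executor named by the agent function's decision."""
    name = decision["executor"]
    if name not in EXECUTORS:
        raise ValueError(f"unknown executor: {name}")
    return EXECUTORS[name](decision["payload"])
```

For instance, `execute({"executor": "service_api", "payload": {"customer": "c1", "issue": "late order"}})` would record a ticket and return its identifier. Keeping executors behind one table means the agent function only has to choose a name and a payload.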

Knowledge Base: The knowledge base is where the AI agent stores its initial knowledge about the environment. This knowledge is typically predefined or learned during training. It forms the basis of the agent’s decision-making process. For example, an autonomous vehicle may have a knowledge base containing information about road rules, while an automated customer service agent can access detailed information about the company’s products.

Feedback: Over time, feedback is crucial for the improvement of AI agents. This feedback can come from two sources: evaluators or the environment itself. Evaluators can be humans or another AI system used to assess agent performance. Alternatively, the environment can provide feedback in the form of results generated by the agent’s actions. This feedback loop allows the agent to adapt, learn from experience, and make better decisions in the future.

It is important to emphasize that, depending on the chosen embodiment approach, agents can take various forms, such as software operations, robots, or autonomous vehicles. They are not limited to software-level actions (e.g., deciding to call a specific plugin/API as needed).
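As a rough illustration of the knowledge base and feedback ideas, the sketch below pairs a lookup table of predefined facts with a tracker that averages feedback scores per action. Everything in it is assumed for the example: the product facts are made up, and `lookup` and `FeedbackTracker` are hypothetical names, not part of any specific agent library.

```python
from collections import defaultdict
from typing import Dict, List

# Knowledge base: predefined facts the agent can consult (made-up product data).
KNOWLEDGE_BASE: Dict[str, str] = {
    "return policy": "Items can be returned within 30 days.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}


def lookup(question: str) -> str:
    """Return the first stored fact whose topic appears in the question."""
    q = question.lower()
    for topic, fact in KNOWLEDGE_BASE.items():
        if topic in q:
            return fact
    return "I don't know."


class FeedbackTracker:
    """Keeps per-action scores from evaluator or environment feedback."""

    def __init__(self) -> None:
        self._scores: Dict[str, List[float]] = defaultdict(list)

    def record(self, action: str, score: float) -> None:
        self._scores[action].append(score)

    def average(self, action: str) -> float:
        # Actions with no feedback yet default to 0.0.
        scores = self._scores.get(action, [])
        return sum(scores) / len(scores) if scores else 0.0

    def best_action(self, candidates: List[str]) -> str:
        # Prefer the candidate with the highest average feedback so far.
        return max(candidates, key=self.average)
```

A real agent would likely use retrieval over documents rather than substring matching, and learning rather than simple averaging, but the division of roles is the same: the knowledge base supplies facts, and feedback shifts which actions are preferred.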

For example, the embodied intelligence framework VoxPoser, launched by the team of renowned AI expert Fei-Fei Li, connects large models to robots, transforming complex instructions into specific action plans (without additional data and training) and achieving impressive performance in various robotic manipulation tasks in both simulated and real-world environments.

Meanwhile, autonomous driving is, in my opinion, the most advanced form of embodied intelligence to date.

How AI Agents Work

A typical AI Agent designed to achieve specified goals generally follows these steps (though the order may vary depending on the agent’s design or objectives).

AI Agents operate much like other popular AI solutions: the user supplies a goal, and the agent sets off toward it by calling the core large language model running in the background, which returns a first output demonstrating its understanding of the task at hand.

Next comes the task list. Driven by the goal, the agent formulates a series of tasks and prioritizes them in order of completion. Once satisfied with its plan, it moves on to information retrieval. The agent acts like an experienced computer user, searching the internet to gather relevant information. Some advanced agents collaborate with other AI models to handle specialized tasks such as image generation and computer vision (i.e., function calls and tool usage).

The agent manages all collected data, relaying information back to the user and refining its strategy. As each task is completed, the agent seeks feedback from external sources and from its own internal reasoning to estimate how far it is from the final goal. Until the goal is achieved, the agent keeps iterating, formulating new tasks and seeking more data and feedback. AutoGPT is one example.
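Independently of any particular framework, the generic workflow described above (goal, prioritized task list, execution, feedback check) can be sketched as a short loop. This is a simplified assumption-laden example: `run_agent` and the `toy_*` helpers are hypothetical, the planner and executor are stubs rather than LLM or tool calls, and re-planning of new tasks is omitted for brevity.

```python
import heapq
from typing import Callable, List, Tuple


def run_agent(
    goal: str,
    plan: Callable[[str], List[Tuple[int, str]]],
    execute: Callable[[str], str],
    is_done: Callable[[List[str]], bool],
    max_steps: int = 10,
) -> List[str]:
    """Plan tasks for a goal, run them by priority, stop when feedback says done."""
    # Lower number = higher priority; heapq pops the smallest tuple first.
    queue = plan(goal)
    heapq.heapify(queue)
    results: List[str] = []
    for _ in range(max_steps):
        # Feedback check: stop when nothing is left or the goal is judged reached.
        if not queue or is_done(results):
            break
        _, task = heapq.heappop(queue)
        results.append(execute(task))
    return results


# Hypothetical stand-ins for the LLM planner, tool calls, and self-evaluation.
def toy_plan(goal: str) -> List[Tuple[int, str]]:
    return [(2, f"summarize findings on {goal}"), (1, f"search the web for {goal}")]


def toy_execute(task: str) -> str:
    return f"done: {task}"


def toy_is_done(results: List[str]) -> bool:
    return any("summarize" in r for r in results)


report = run_agent("AI agents", toy_plan, toy_execute, toy_is_done)
```

The `max_steps` cap is a design choice worth keeping in any real loop of this kind, since an agent that keeps generating tasks without converging would otherwise run indefinitely.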

AutoGPT is an AI Agent framework based on GPT-4 for automated content generation. It is notable for operating almost entirely independently on text-based work (such as gathering and organizing industry information, writing market research reports, or generating code), with minimal human intervention. The simple process below illustrates how AutoGPT receives tasks, processes information, and provides solutions:

· Initialization and goal setting: When starting with AutoGPT, the first step is to set an identifier (like a name) and clarify the task it needs to accomplish. This step helps AutoGPT define its goal direction, laying the foundation for subsequent decision-making and task execution.

· Data analysis: AutoGPT begins working with the information you provide, deeply analyzing this data to identify patterns and key details. This process deepens its understanding of the task, laying the groundwork for generating solution prompts.

· Generating prompts: Based on the data analysis, AutoGPT generates its own prompts for solving the task. These prompts guide AutoGPT toward achieving the goal effectively.

· Autonomous information gathering: AutoGPT is not limited to the data provided at the start; it actively collects more information from the internet to enrich its knowledge base, thereby enhancing the depth and accuracy of task processing.

· Data review and optimization: The newly collected information is carefully reviewed and evaluated by the system to ensure the authenticity and validity of all information. Any misleading or inaccurate content is excluded, ensuring the reliability of the decision-making basis.

· Continuous learning and improvement: AutoGPT emphasizes learning and self-improvement from each task. By analyzing execution results and feedback, the system continuously adjusts and optimizes, making it more efficient and precise in handling subsequent tasks.

· Output results: After a series of analyses, learning, and optimizations, AutoGPT provides a solution that integrates all available information and analysis. This output reflects a deep understanding and comprehensive response to the task.
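The gathering, review, and output steps above can be pictured with a small pipeline. To be clear, this is not AutoGPT's actual code or API: `Finding`, `review`, `run_pipeline`, and `fake_gather` are hypothetical names, the confidence scores are invented, and the continuous-learning step is left out.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Finding:
    text: str
    source: str
    confidence: float  # 0.0 to 1.0, as judged by some reviewer (hypothetical)


def review(findings: List[Finding], threshold: float = 0.5) -> List[Finding]:
    """Data review step: drop findings the reviewer considers unreliable."""
    return [f for f in findings if f.confidence >= threshold]


def run_pipeline(
    name: str,
    goal: str,
    provided: List[Finding],
    gather: Callable[[str], List[Finding]],
) -> str:
    """Goal setting -> use provided data -> gather more -> review -> output."""
    prompt = f"{name}, your goal is: {goal}"   # self-generated prompt
    collected = provided + gather(prompt)      # autonomous information gathering
    trusted = review(collected)                # review and filtering
    bullets = "\n".join(f"- {f.text} ({f.source})" for f in trusted)
    return f"Report for goal '{goal}':\n{bullets}"


def fake_gather(prompt: str) -> List[Finding]:
    # Stand-in for web search; returns canned results.
    return [
        Finding("Market grew last year", "example.com/a", 0.9),
        Finding("Unverified rumor", "example.com/b", 0.2),
    ]
```

With the default threshold, a call such as `run_pipeline("ResearchBot", "market report", [], fake_gather)` keeps the high-confidence finding and drops the rumor, which mirrors the idea that unreliable content is excluded before the final output is produced.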
