(Source: MIT Technology Review)
To a robot’s “eyes,” the real world is a flood of information waiting to be processed. Making sense of every data point in a scene often demands significant computation and time, and using that information to decide how best to assist a human is harder still.
Now, roboticists at the Massachusetts Institute of Technology have developed a method that helps robots filter out the noise and focus on the features of a scene most relevant to assisting humans.
They named this method “Relevance.” Using it, a robot draws on cues in the scene, such as audio and video information, to infer a person’s goal and quickly identify the objects most likely to help achieve it. The robot then carries out a sequence of actions to safely deliver those objects or complete the corresponding task.
The researchers demonstrated the method in experiments simulating a conference breakfast. They set up a table stocked with fruits, beverages, snacks, and utensils, along with a robotic arm fitted with a microphone and a camera. With the new “Relevance” method applied, the robot accurately identified people’s goals in different scenarios and provided appropriate assistance.
In one scenario, the robot captured the visual cue of a human reaching for a cup of prepared coffee and quickly delivered milk and a stirrer to that person; in another scenario, after hearing two people discussing coffee, the robot brought them a can of coffee and cream.
Overall, the robot predicted human goals with 90% accuracy and identified relevant objects with 96% accuracy. The method also made the robot markedly safer, reducing collisions by more than 60% compared with performing the same tasks without it.
Kamal Youcef-Toumi, a professor of mechanical engineering at MIT, stated, “This method of achieving relevance makes interactions between robots and humans much easier. Robots do not need to frequently ask humans for their needs but can actively gather information from the scene and autonomously determine how to provide assistance.”
Youcef-Toumi’s team is exploring how robots programmed with “Relevance” could be used in smart manufacturing and warehouse settings. They envision a future in which robots work alongside humans, providing assistance seamlessly.
Youcef-Toumi, along with graduate students Xiaotong Zhang and Dingcheng Huang, will present the new method at the IEEE International Conference on Robotics and Automation this May.

Finding Focus
The team’s method is inspired by humans’ ability to filter relevant information in daily life. Humans can filter out distracting information and focus on important content thanks to the Reticular Activating System (RAS) in the brain. The RAS is a bundle of neurons in the brainstem that operates at a subconscious level, filtering out unnecessary stimuli and allowing individuals to consciously perceive key information. For example, the RAS prevents our senses from becoming overloaded with too much information, allowing us to focus on the task at hand, such as pouring coffee, without being distracted by every item on the kitchen counter.
Kamal explained, “These neurons filter out all unimportant information, allowing the brain to focus on currently relevant content, which is truly amazing. This is essentially the idea behind our research.”
He and his team developed a robotic system that largely mimics the RAS’s ability to filter and process information. The method consists of four main stages. The first is a “Perception” stage, in which the robot observes and learns from the scene, taking in audio and video cues from its microphone and cameras and feeding them continuously into an AI “toolkit.” This toolkit may include a large language model (LLM) that processes audio conversations to identify keywords and phrases, as well as algorithms for detecting and classifying objects, people, body movements, and task goals. Like the brain’s RAS, the AI toolkit runs in the background, continuously filtering information at a subconscious level.
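To make that division of labor concrete, here is a minimal Python sketch of what such a background toolkit might look like. Every name here (SceneSnapshot, PerceptionToolkit, the injected model callables) is an illustrative assumption, not the authors’ implementation, which the article does not describe at this level of detail.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SceneSnapshot:
    """What the background toolkit has extracted at one moment in time (hypothetical)."""
    keywords: List[str] = field(default_factory=list)   # e.g. ["coffee"] from the LLM
    objects: List[str] = field(default_factory=list)    # e.g. ["cup", "creamer", "apple"]
    actions: List[str] = field(default_factory=list)    # e.g. ["reaching for cup"]


class PerceptionToolkit:
    """Stage 1: continuously fuses audio and video cues, playing the role of the RAS."""

    def __init__(self,
                 speech_model: Callable[[bytes], List[str]],
                 vision_model: Callable[[bytes], List[str]],
                 action_model: Callable[[bytes], List[str]]):
        self.speech_model = speech_model   # LLM / speech pipeline extracting keywords
        self.vision_model = vision_model   # object detector and classifier
        self.action_model = action_model   # body-movement / goal classifier

    def update(self, audio_frame: bytes, video_frame: bytes) -> SceneSnapshot:
        # Runs in the background on every frame, whether or not anyone is present.
        return SceneSnapshot(
            keywords=self.speech_model(audio_frame),
            objects=self.vision_model(video_frame),
            actions=self.action_model(video_frame),
        )
```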
The second stage is the “Trigger Check” stage, where the system periodically checks for significant events, such as whether someone has entered the environment. Once a person is detected, the system enters the third stage. This stage is the core of the entire system, determining which features in the environment are most likely related to assisting humans.
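Continuing the sketch, the trigger check could be as simple as polling the latest snapshot for a person; the article does not spell out the actual trigger conditions, so this condition is an assumption.

```python
def trigger_check(snapshot: SceneSnapshot) -> bool:
    """Stage 2: has a significant event occurred (here, a person entering the scene)?"""
    return "person" in snapshot.objects
```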
To establish this relevance, the researchers developed an algorithm that can receive real-time predictions made by the AI toolkit. For example, the LLM in the toolkit might identify the keyword “coffee,” while the action classification algorithm might label a person reaching for a cup as having the goal of “making coffee.” The team’s “Relevance” method synthesizes this information to first determine the “categories” of objects most relevant to the goal of “making coffee.” Thus, categories like “fruits” and “snacks” may be automatically excluded, while categories like “cups” and “creamers” are retained. Next, the algorithm further filters within the relevant categories to identify the most relevant “elements.” For instance, based on visual cues in the environment, the system might mark the cup closest to the person as more relevant and helpful than a cup further away.
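The two-level filtering described above, first by category and then by element, might look roughly like the following. The goal-to-category table, the object attributes, and the distance-based ranking are stand-ins for whatever the AI toolkit actually predicts, chosen only to mirror the coffee example.

```python
# Hypothetical goal -> relevant-category table; the real system derives this
# from the AI toolkit's predictions rather than from a fixed mapping.
GOAL_CATEGORIES = {
    "making coffee": {"cups", "creamers", "stirrers", "beverages"},
}

# Hypothetical scene inventory produced by the object detector.
SCENE_OBJECTS = [
    {"name": "near cup", "category": "cups",     "distance_to_person": 0.4},
    {"name": "far cup",  "category": "cups",     "distance_to_person": 1.6},
    {"name": "creamer",  "category": "creamers", "distance_to_person": 0.8},
    {"name": "apple",    "category": "fruits",   "distance_to_person": 0.5},
]


def infer_goal(snapshot: SceneSnapshot) -> str:
    # Toy goal inference: the real method fuses LLM keywords with action labels.
    return "making coffee" if "coffee" in snapshot.keywords else "unknown"


def select_relevant_objects(goal: str, objects: list, top_k: int = 2) -> list:
    """Stage 3: filter by relevant category, then rank the remaining elements."""
    relevant_categories = GOAL_CATEGORIES.get(goal, set())
    # Category-level filtering: "fruits" and "snacks" drop out for "making coffee".
    candidates = [o for o in objects if o["category"] in relevant_categories]
    # Element-level filtering: e.g. prefer the cup closest to the person.
    candidates.sort(key=lambda o: o["distance_to_person"])
    return candidates[:top_k]
```

With the toy inventory above and the goal “making coffee,” the function returns the near cup and the creamer, mirroring the behavior described in the experiments.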
In the fourth and final stage, the robot retrieves the identified relevant objects and plans a path to deliver these objects to the human.
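Tying the stages together, the overall loop might read like the sketch below; `planner.fetch_and_deliver` is a placeholder for the robot’s actual grasping and motion-planning stack, which the article does not detail.

```python
def assist_loop(toolkit: PerceptionToolkit, planner, audio_stream, video_stream):
    """Runs the four stages end to end; `planner` stands in for the motion planner."""
    for audio_frame, video_frame in zip(audio_stream, video_stream):
        snapshot = toolkit.update(audio_frame, video_frame)      # Stage 1: perception
        if not trigger_check(snapshot):                          # Stage 2: trigger check
            continue
        goal = infer_goal(snapshot)                              # e.g. "making coffee"
        targets = select_relevant_objects(goal, SCENE_OBJECTS)   # Stage 3: relevance
        for obj in targets:                                      # Stage 4: fetch and deliver
            planner.fetch_and_deliver(obj["name"])
```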

Assistant Mode
The researchers tested the new system in the simulated conference breakfast experiment. They built the scenario around a publicly available breakfast-actions dataset, which contains videos and images of typical breakfast activities such as making coffee, cooking pancakes, pouring cereal, and frying eggs. Each video and image is labeled with the actions performed and the overall goal (e.g., frying eggs or making coffee).
Using this dataset, the team tested the various algorithms in the AI toolkit, checking that they could accurately label and classify human tasks, goals, and relevant objects when observing human actions in new scenarios.
In the experiment, they set up a robotic arm and gripper, configuring the system to provide assistance when humans approached a table filled with beverages, snacks, and utensils. The results showed that when no one was present, the robot’s AI toolkit continuously operated in the background, marking and classifying the objects on the table.
During the trigger check process, once the robot detected a human, it would immediately respond, initiating the relevance analysis phase and quickly identifying the most likely relevant objects in the scene based on the human goals determined by the AI toolkit.
Co-author Zhang stated, “The relevance method can guide robots to provide seamless, intelligent, safe, and efficient assistance in highly dynamic environments.”
Looking ahead, the team hopes to apply this system to similar workplace and warehouse environments, as well as in everyday household tasks.
Zhang said, “I want to test this system at home: can it bring me a cup of coffee while I read the newspaper, grab the laundry bag while I do the laundry, or hand me a screwdriver while I make repairs? Our vision is to achieve more natural and fluid human-robot interaction.”
Original link:
https://news.mit.edu/2025/robotic-system-zeroes-objects-most-relevant-helping-humans-0424
