AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li,
Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu,
Ahmed Awadallah, Ryen W. White, Doug Burger, Chi Wang
Microsoft Research, Pennsylvania State University,
University of Washington, Xidian University

NOTE: This file contains the text content extracted from the PDF at:
https://arxiv.org/pdf/2308.08155
arXiv:2308.08155v2 [cs.AI] 3 Oct 2023

Abstract
AutoGen is an open-source framework that allows developers to build LLM applications
via multiple agents that can converse with each other to accomplish tasks. AutoGen agents
are customizable, conversable, and can operate in various modes that employ combinations
of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define
agent interaction behaviors. Both natural language and computer code can be used to
program flexible conversation patterns for different applications. AutoGen serves as a
generic framework for building diverse applications of various complexities and LLM
capacities. Empirical studies demonstrate the effectiveness of the framework in many
example applications, with domains ranging from mathematics, coding, question answering,
operations research, and online decision-making to entertainment.

GitHub: https://github.com/microsoft/autogen

1 Introduction

Large language models (LLMs) are becoming a crucial building block in developing powerful
agents that utilize LLMs for reasoning, tool usage, and adapting to new observations in many
real-world tasks. Given the expanding range of tasks that could benefit from LLMs and the
growing complexity of those tasks, an intuitive approach to scaling up the power of agents is
to use multiple agents that cooperate.

Our insight is to achieve this cooperation through multi-agent conversations, for at least
three reasons:
1. Chat-optimized LLMs (e.g., GPT-4) show the ability to incorporate feedback, so LLM agents
   can cooperate through conversations with each other or with humans.
2. A single LLM can exhibit a broad range of capabilities; conversations between differently
   configured agents can combine these capabilities in a modular fashion.
3. LLMs can solve complex tasks when those tasks are broken into simpler subtasks;
   multi-agent conversations enable this partitioning and the integration of results.

We present AutoGen, a generalized multi-agent conversation framework, based on two new concepts:

1. **Customizable and conversable agents.** AutoGen uses a generic design of agents that can
   leverage LLMs, human inputs, tools, or a combination. Developers can easily create agents
   with different roles by selecting and configuring built-in capabilities.

2. **Conversation programming.** AutoGen simplifies and unifies complex LLM application
   workflows as multi-agent conversations. It streamlines development via: (1) defining a set
   of conversable agents with specific capabilities and roles; (2) programming the interaction
   behavior between agents via conversation-centric computation and control.

2 The AutoGen Framework

2.1 Conversable Agents

In AutoGen, a conversable agent is an entity with a specific role that exchanges messages
with other conversable agents to send and receive information. It maintains its internal
context based on the messages it has sent and received, and it can be configured to possess
a set of capabilities.

Agent capabilities powered by LLMs, humans, and tools:
- **LLMs**: LLM-backed agents exploit capabilities such as role playing, implicit state inference,
  providing feedback, adapting from feedback, and coding.
- **Humans**: AutoGen lets a human participate in agent conversation via human-backed agents.
- **Tools**: Tool-backed agents execute tools via code execution or function execution.

Built-in agent classes:
- **ConversableAgent**: the highest-level agent abstraction; it can use LLMs, humans, and tools
- **AssistantAgent**: pre-configured for the AI assistant role (backed by LLMs)
- **UserProxyAgent**: solicits human input or executes code and function calls
- **GroupChatManager**: manages a dynamic group chat among multiple agents

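To make the agent abstraction concrete, here is a minimal, library-free sketch of how a single generic agent class might be specialized purely by configuration. It is not the AutoGen API: `ToyAgent`, its `reply_fn` parameter, and the three stand-in backends are hypothetical names chosen for illustration, and the backends return canned strings rather than calling an LLM, a person, or a real tool.

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for the three capability sources named above.
# A real system would call an LLM, prompt a person, or run a tool here.
def fake_llm_backend(message: str) -> str:
    return f"[llm] I suggest handling: {message}"


def scripted_human_backend(message: str) -> str:
    return "[human] approved"


def tool_backend(message: str) -> str:
    # A toy "tool": count the words in the incoming message.
    return f"[tool] word_count={len(message.split())}"


class ToyAgent:
    """Illustrative agent: a role name, a reply backend, and a message log."""

    def __init__(self, name: str, reply_fn: Callable[[str], str]):
        self.name = name
        self.reply_fn = reply_fn
        # Internal context: messages sent and received, keyed by peer name.
        self.history: Dict[str, List[dict]] = {}

    def _log(self, peer: str, role: str, content: str) -> None:
        self.history.setdefault(peer, []).append({"role": role, "content": content})

    def send(self, content: str, recipient: "ToyAgent") -> str:
        self._log(recipient.name, "sent", content)
        reply = recipient.receive(content, self)
        self._log(recipient.name, "received", reply)
        return reply

    def receive(self, content: str, sender: "ToyAgent") -> str:
        self._log(sender.name, "received", content)
        reply = self.reply_fn(content)
        self._log(sender.name, "sent", reply)
        return reply


# Three roles built from one class by swapping the backend.
assistant = ToyAgent("assistant", fake_llm_backend)
user = ToyAgent("user", scripted_human_backend)
executor = ToyAgent("executor", tool_backend)

reply = user.send("plot the sales data", assistant)
```

The point of the design is that the messaging interface stays identical across roles; only the backend that produces the reply changes.
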
2.2 Conversation Programming

AutoGen adopts conversation programming, a paradigm that considers two concepts:
1. **Computation**: the actions agents take to compute their response
2. **Control flow**: the sequence (or conditions) under which these computations happen

Design patterns:
1. **Unified interfaces and auto-reply mechanisms**: Once an agent receives a message, it
   automatically invokes `generate_reply` and sends the reply back to the sender unless a
   termination condition is satisfied.
2. **Control by fusion of programming and natural language**: (a) Natural-language control
   via LLMs; (b) Programming-language control via Python; (c) Control transition between
   natural and programming language.

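The auto-reply pattern above can be sketched without any library. In the toy below, receiving a message triggers `generate_reply`, and a non-empty reply is sent straight back, so the conversation itself drives the control flow. Only the name `generate_reply` comes from the text; `AutoReplyAgent`, `is_termination_msg`, `max_auto_replies`, and the two policies are assumptions of this sketch, and the termination rule (a message ending in TERMINATE) is one possible convention.

```python
from typing import Callable, List, Optional, Tuple


class AutoReplyAgent:
    """Toy agent: on receive, call generate_reply and send the result back."""

    def __init__(self, name: str, policy: Callable[[str], str],
                 max_auto_replies: int = 10):
        self.name = name
        self.policy = policy  # stand-in for an LLM, a tool, or a human
        self.max_auto_replies = max_auto_replies
        self.replies_sent = 0
        self.transcript: List[Tuple[str, str]] = []  # (speaker, message)

    def is_termination_msg(self, message: str) -> bool:
        # Assumed convention: a message ending in TERMINATE stops the chat.
        return message.rstrip().endswith("TERMINATE")

    def generate_reply(self, message: str) -> Optional[str]:
        if self.is_termination_msg(message):
            return None
        if self.replies_sent >= self.max_auto_replies:
            return None  # safety cap on consecutive auto-replies
        self.replies_sent += 1
        return self.policy(message)

    def send(self, message: str, recipient: "AutoReplyAgent") -> None:
        self.transcript.append((self.name, message))
        recipient.receive(message, self)

    def receive(self, message: str, sender: "AutoReplyAgent") -> None:
        self.transcript.append((sender.name, message))
        reply = self.generate_reply(message)
        if reply is not None:
            self.send(reply, sender)  # the auto-reply keeps the chat going


def countdown_policy(message: str) -> str:
    # Replies with n - 1 until the count would reach 0, then asks to stop.
    n = int(message.split()[-1])
    return "count 0 TERMINATE" if n <= 1 else f"count {n - 1}"


def echo_policy(message: str) -> str:
    return message  # passes the latest count straight back


a = AutoReplyAgent("a", echo_policy)
b = AutoReplyAgent("b", countdown_policy)
a.send("count 3", b)
```

No explicit loop appears anywhere: a single `send` call starts the exchange, and it ends when `b` emits a message that `a` recognizes as a termination message.
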
3 Applications of AutoGen

Six applications demonstrate AutoGen's capabilities:

**A1: Math Problem Solving**
- A system using built-in agents achieves 69.48% accuracy on the MATH dataset (vs. 55.18% for GPT-4 alone)
- Outperforms ChatGPT+Code Interpreter (48.33%) and ChatGPT+Plugin (45.0%)
- Supports human-in-the-loop and multi-user problem solving

**A2: Retrieval-Augmented Code Generation and Question Answering**
- Retrieval-augmented chat with an interactive retrieval feature
- F1 of 25.88% and recall of 66.65% (vs. 22.79% and 62.59% for DPR)
- Novel interactive retrieval: the agent replies "UPDATE CONTEXT" to get more relevant chunks

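The interactive retrieval idea in A2 can be illustrated with a small loop: when the assistant cannot answer from the chunks it was given, it replies "UPDATE CONTEXT" and the retrieval side supplies the next-best chunks. Everything below is a hedged toy, not the paper's implementation: `rank_chunks` uses word overlap in place of a real retriever, `toy_assistant` checks for a required term in place of an LLM, and the warehouse sentences are made-up sample data.

```python
from typing import List, Optional, Tuple

UPDATE_CONTEXT = "UPDATE CONTEXT"


def rank_chunks(question: str, chunks: List[str]) -> List[str]:
    """Toy retriever: order chunks by how many question words they share."""
    q_words = set(question.lower().split())
    return sorted(chunks, key=lambda c: -len(q_words & set(c.lower().split())))


def toy_assistant(context: List[str], required_term: str) -> str:
    """Stand-in for the LLM: answers only if the context holds the needed term."""
    for chunk in context:
        if required_term in chunk.lower():
            return chunk
    return UPDATE_CONTEXT


def retrieve_chat(question: str, chunks: List[str], required_term: str,
                  batch_size: int = 1, max_rounds: int = 5) -> Tuple[Optional[str], int]:
    """Feed ranked chunks in batches until the assistant stops asking for more."""
    ranked = rank_chunks(question, chunks)
    offset = 0
    for round_no in range(1, max_rounds + 1):
        context = ranked[offset:offset + batch_size]
        if not context:
            return None, round_no - 1  # ran out of chunks to offer
        reply = toy_assistant(context, required_term)
        if reply != UPDATE_CONTEXT:
            return reply, round_no
        offset += batch_size  # assistant asked for a context update
    return None, max_rounds


# Made-up sample data for illustration only.
chunks = [
    "The widget warehouse sells blue widgets.",
    "Opening hours: the warehouse doors unlock at 7 am.",
    "Widgets ship in boxes of ten.",
]
answer, rounds = retrieve_chat("when does the widget warehouse open", chunks, "7 am")
```

Here the top-ranked chunk shares the most words with the question but does not contain the answer, so one context update is needed before the second chunk resolves it.
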
**A3: Decision Making in Text World Environments (ALFWorld)**
- Two-agent system matches ReAct performance (54%)
- Three-agent system with grounding agent: 69% success (vs. 54% for ReAct)
- Grounding agent provides commonsense knowledge to prevent error loops

**A4: Multi-Agent Coding (OptiGuide)**
- Commander + Writer + Safeguard architecture
- Core workflow code reduced from 430+ lines to 100 lines (roughly a 4x reduction)
- Multi-agent design boosts F1 by 8% (GPT-4) and 35% (GPT-3.5) on unsafe code detection
- 3x time saving vs. ChatGPT+Code Interpreter; 3-5x fewer user interactions

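One way to picture the Commander + Writer + Safeguard arrangement is the sketch below, which assumes the commander relays the writer's code to the safeguard and executes it only after a SAFE verdict, asking for a rewrite otherwise. The function names, the `BLOCKED_TOKENS` rule, and the scripted writer are all hypothetical; in a real system the writer and safeguard would be LLM-backed, and the `exec` call with emptied builtins is a toy, not real isolation.

```python
from typing import Callable, Iterator, List

# Hypothetical safety rule for this sketch only.
BLOCKED_TOKENS = ("import os", "open(", "__import__")


def safeguard(code: str) -> str:
    """Toy safeguard: flags code containing any blocked token."""
    return "DANGER" if any(tok in code for tok in BLOCKED_TOKENS) else "SAFE"


def make_scripted_writer(drafts: List[str]) -> Callable[[str], str]:
    """Stand-in for the LLM writer: returns the next canned draft per request."""
    remaining: Iterator[str] = iter(drafts)
    return lambda request: next(remaining)


def commander(question: str, writer: Callable[[str], str],
              max_attempts: int = 3) -> dict:
    """Coordinates writer and safeguard; executes only code judged SAFE."""
    log = []
    request = question
    for attempt in range(1, max_attempts + 1):
        code = writer(request)
        verdict = safeguard(code)
        log.append((attempt, verdict))
        if verdict == "SAFE":
            namespace: dict = {}
            # Toy execution with builtins removed; not a real sandbox.
            exec(code, {"__builtins__": {}}, namespace)
            return {"result": namespace.get("result"), "log": log}
        request = f"{question}\nPrevious code was rejected as unsafe; please rewrite."
    return {"result": None, "log": log}


writer = make_scripted_writer([
    "import os\nresult = os.getcwd()",  # first draft trips the safeguard
    "result = 3 * 14",                  # second draft is accepted
])
outcome = commander("How many units ship in 3 weeks at 14 per week?", writer)
```

Keeping the safety check in its own role means it can be tested or replaced independently of the code writer.
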
**A5: Dynamic Group Chat**
- GroupChatManager dynamically selects speakers
- Role-play prompt outperforms task-based prompt for speaker selection
- The role-play prompt yields a higher success rate, fewer LLM calls, and fewer termination failures

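A rough sketch of the dynamic group chat loop follows: a manager picks the next speaker, collects that speaker's reply, and broadcasts it to the other members. This is an illustration under stated assumptions rather than the GroupChatManager implementation: `Member`, `select_speaker`, and `run_group_chat` are hypothetical names, and speaker selection here uses word overlap with each member's role description where the paper uses an LLM prompt.

```python
from typing import Callable, List, Tuple


class Member:
    def __init__(self, name: str, role: str, respond: Callable[[List[str]], str]):
        self.name = name
        self.role = role            # short description used for speaker selection
        self.respond = respond      # stand-in for the agent's reply logic
        self.inbox: List[str] = []  # messages broadcast by the manager


def select_speaker(last_message: str, members: List[Member],
                   last_speaker: str) -> Member:
    """Toy selector: the member whose role best overlaps the last message."""
    words = set(last_message.lower().split())
    candidates = [m for m in members if m.name != last_speaker]
    return max(candidates, key=lambda m: len(words & set(m.role.lower().split())))


def run_group_chat(task: str, members: List[Member],
                   max_rounds: int = 4) -> List[Tuple[str, str]]:
    transcript = [("user", task)]
    for m in members:
        m.inbox.append(task)
    last_speaker, message = "user", task
    for _ in range(max_rounds):
        speaker = select_speaker(message, members, last_speaker)
        message = speaker.respond(speaker.inbox)
        transcript.append((speaker.name, message))
        for m in members:  # broadcast the reply to everyone else
            if m is not speaker:
                m.inbox.append(message)
        last_speaker = speaker.name
        if message.endswith("TERMINATE"):
            break
    return transcript


coder = Member("coder", "writes python code",
               lambda inbox: "draft ready please review code quality")
reviewer = Member("reviewer", "review code quality",
                  lambda inbox: "approved TERMINATE")
planner = Member("planner", "plan the task steps",
                 lambda inbox: "step one: clarify the goal")

transcript = run_group_chat("write python code to sort a list",
                            [coder, reviewer, planner])
```

The speaker order is not fixed in advance; it emerges from the content of the latest message, which is what distinguishes this pattern from a static round-robin.
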
**A6: Conversational Chess**
- Human/AI chess players with a board agent for move validation
- Supports AI-AI, AI-human, human-human modes
- Board agent ensures legal moves; without it, illegal moves cause game disruptions

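The role of the board agent can be shown with a much simpler game than chess. The sketch below swaps chess for a take-1-to-3-stones game so the rules fit in a few lines; the validation pattern (reject an illegal move with an explanation and let the player retry) is the part being illustrated. `NimBoardAgent`, `play_turn`, and `careless_player` are hypothetical names for this sketch.

```python
from typing import Callable, List, Tuple


class NimBoardAgent:
    """Toy board agent for a take-1-to-3-stones game (a stand-in for chess)."""

    def __init__(self, stones: int):
        self.stones = stones

    def legal_moves(self) -> List[int]:
        return [n for n in (1, 2, 3) if n <= self.stones]

    def submit(self, move: int) -> Tuple[bool, str]:
        """Apply the move if legal; otherwise reject it with an explanation."""
        if move not in self.legal_moves():
            return False, f"illegal move {move}; legal moves are {self.legal_moves()}"
        self.stones -= move
        return True, f"ok, {self.stones} stones left"


def play_turn(board: NimBoardAgent, propose: Callable[[str], int],
              max_retries: int = 3) -> List[str]:
    """Ask a player for a move, feeding back board errors until one is legal."""
    feedback = "your move"
    log: List[str] = []
    for _ in range(max_retries):
        move = propose(feedback)
        accepted, feedback = board.submit(move)
        log.append(feedback)
        if accepted:
            break
    return log


def careless_player(feedback: str) -> int:
    # Proposes an illegal move first, then corrects itself after feedback.
    return 2 if feedback.startswith("illegal") else 5


board = NimBoardAgent(stones=4)
turn_log = play_turn(board, careless_player)
```

Because the board state changes only inside `submit`, an illegal proposal never corrupts the game; it just produces feedback for the next attempt.
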
4 Discussion

AutoGen benefits:
- **Ease of use**: Built-in agents deliver strong performance out of the box
- **Modularity**: Each agent can be developed, tested, and maintained independently
- **Programmability**: Core workflow code in A4 reduced roughly 4x (from 430+ lines to 100)
- **Human involvement**: Native mechanism for human participation and oversight
- **Collaborative/adversarial agent interactions**: Agents share information or work adversarially

Comparison with related systems:

| Aspect               | AutoGen   | Multi-agent Debate | CAMEL  | BabyAGI | MetaGPT |
|----------------------|-----------|--------------------|--------|---------|---------|
| Infrastructure       | yes       | no                 | yes    | no      | no      |
| Conversation pattern | flexible  | static             | static | static  | static  |
| Execution-capable    | yes       | no                 | no     | no      | yes     |
| Human involvement    | chat/skip | no                 | no     | no      | no      |

Future directions:
- Designing optimal multi-agent workflows
- Creating highly capable agents with diverse skill sets
- Enabling scale, safety, and human agency
- Building fail-safes against cascading failures

References (selected):
- Wu et al. (2023). An empirical study on challenging math problem solving with GPT-4. arXiv:2306.01337
- Du et al. (2023). Improving factuality and reasoning in language models through multiagent debate. arXiv:2305.14325
- Liang et al. (2023). Encouraging divergent thinking in large language models through multi-agent debate.
- Hong et al. (2023). MetaGPT: Meta programming for multi-agent collaborative framework. arXiv:2308.00352
- Li et al. (2023b). CAMEL: Communicative agents for "mind" exploration of large scale language model society.
- Yao et al. (2022). ReAct: Synergizing reasoning and acting in language models. arXiv:2210.03629
- Shridhar et al. (2021). ALFWorld: Aligning Text and Embodied Environments for Interactive Learning. ICLR 2021.

Appendix C: Default System Message for Assistant Agent

"You are a helpful AI assistant. Solve tasks using your coding and language skills.
In the following cases, suggest python code (in a python coding block) or shell script
(in a sh coding block) for the user to execute:
1. When you need to collect info, use the code to output the info you need...
2. When you need to perform some task with code, use the code to perform the task...
...
Reply 'TERMINATE' in the end when everything is done."

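The system message above implies two things the receiving side must do: find the python or sh coding blocks in a reply, and notice the TERMINATE signal. A minimal sketch of both checks is given below. The helper names `extract_code` and `is_done` are hypothetical, the regular expression is one simple way to match fenced blocks rather than the framework's own parser, and the fence string is built programmatically only so the example nests cleanly inside a fenced block.

```python
import re
from typing import List, Tuple

FENCE = "`" * 3  # three backticks, built here to avoid a literal fence
CODE_BLOCK = re.compile(FENCE + r"(\w*)\n(.*?)" + FENCE, re.DOTALL)


def extract_code(reply: str) -> List[Tuple[str, str]]:
    """Return (language, code) pairs for each fenced block in the reply."""
    return [(lang or "unknown", code) for lang, code in CODE_BLOCK.findall(reply)]


def is_done(reply: str) -> bool:
    """Assumed convention from the system message: a reply ending in TERMINATE."""
    return reply.rstrip().endswith("TERMINATE")


sample_reply = (
    "Run this to list the files:\n"
    + FENCE + "sh\nls -la\n" + FENCE + "\n"
    + "Then compute the total:\n"
    + FENCE + "python\nprint(sum([1, 2, 3]))\n" + FENCE + "\n"
)
blocks = extract_code(sample_reply)
```

The language tag on each block is what lets an executing agent decide whether to run the code with a Python interpreter or a shell.
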