Designing a generalist scientific agent capable of performing tasks in laboratory settings to assist researchers has become a key goal in recent Artificial Intelligence (AI) research. Unlike everyday tasks, scientific tasks are inherently more delicate and complex, requiring agents to possess stronger reasoning ability, a structured and temporal understanding of their environment, and a strong emphasis on safety. Existing approaches often fail to address these multifaceted requirements. To tackle these challenges, we present DAVIS. Unlike traditional retrieval-augmented generation (RAG) approaches, DAVIS incorporates structured and temporal memory, which enables model-based planning. Additionally, DAVIS implements an agentic, multi-turn retrieval system, similar to a human’s inner monologue, allowing for a greater degree of reasoning over past experiences. DAVIS demonstrates substantially improved performance on the ScienceWorld benchmark compared to previous approaches on 8 out of 9 elementary science subjects. In addition, DAVIS’s World Model demonstrates competitive performance on the HotpotQA and MuSiQueQA datasets for multi-hop question answering. To the best of our knowledge, DAVIS is the first RAG agent to employ an interactive retrieval method in a RAG pipeline.
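
To make the idea of multi-turn retrieval over a structured, temporal memory more concrete, here is a toy sketch. It is not the DAVIS implementation: the `Fact` record, the `TemporalGraphMemory` class, and the `multi_turn_retrieve` function are hypothetical names invented for illustration, and a fixed entity-following rule stands in for whatever the agent actually does to choose its next query.

```python
from collections import namedtuple

# Hypothetical record type: one fact, stamped with the step it was observed at.
Fact = namedtuple("Fact", ["subject", "relation", "obj", "step"])


class TemporalGraphMemory:
    """Toy structured, temporal memory: a list of timestamped triples."""

    def __init__(self):
        self.facts = []

    def add(self, subject, relation, obj, step):
        self.facts.append(Fact(subject, relation, obj, step))

    def about(self, entity):
        """Return facts mentioning `entity`, most recent first."""
        hits = [f for f in self.facts if entity in (f.subject, f.obj)]
        return sorted(hits, key=lambda f: f.step, reverse=True)


def multi_turn_retrieve(memory, start_entity, max_turns=3):
    """Issue follow-up lookups on newly seen entities, one turn at a time."""
    frontier, seen, collected = [start_entity], {start_entity}, []
    for _ in range(max_turns):
        next_frontier = []
        for entity in frontier:
            for fact in memory.about(entity):
                if fact not in collected:
                    collected.append(fact)
                # Entities not seen before become the next turn's queries.
                for neighbor in (fact.subject, fact.obj):
                    if neighbor not in seen:
                        seen.add(neighbor)
                        next_frontier.append(neighbor)
        if not next_frontier:
            break
        frontier = next_frontier
    return collected


# Illustrative facts only; they are not taken from the paper.
memory = TemporalGraphMemory()
memory.add("ice", "located_in", "freezer", step=1)
memory.add("freezer", "located_in", "kitchen", step=2)
memory.add("stove", "located_in", "kitchen", step=3)
facts = multi_turn_retrieve(memory, "ice")
```

With `max_turns=1` the lookup returns only the fact about the ice itself. Later turns pull in the freezer and kitchen facts, which is the kind of chained recall a single retrieval query would miss.
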
DAVIS improves performance across 8/9 ScienceWorld subjects compared to prior work.
| Method | HotpotQA EM | HotpotQA F1 | MuSiQueQA EM | MuSiQueQA F1 |
|---|---|---|---|---|
| GPT-4o | 46.3 | 64.1 | 19.0 | 34.4 |
| GPT-4-turbo | 44.3 | 60.4 | 20.5 | 34.7 |
| GraphReader (GPT-4) | 55.0 | 70.0 | 38.0 | 47.4 |
| HOLMES (GPT-4) | 66.0 | 78.0 | 48.0 | 58.0 |
| GraphRAG (GPT-4o-mini) | 58.7 | 63.3 | 40.0 | 53.5 |
| DAVIS (GPT-4o) | 56.25 | 73.8 | 33.8 | 48.5 |
| DAVIS (GPT-4-turbo) | 55.25 | 71.0 | 34.0 | 47.1 |
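
For readers unfamiliar with the EM and F1 columns, the sketch below shows how exact match and token-level F1 are commonly computed for extractive question answering. The normalization steps (lowercasing, stripping punctuation and English articles) follow the widely used SQuAD-style convention. Whether the numbers in the table were produced with exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token precision and recall after normalization."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    # Multiset intersection counts each shared token at most as often as it
    # appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

A prediction such as "Eiffel Tower in Paris" against the gold answer "Eiffel Tower" scores 0 on EM but about 0.67 on F1, which is why F1 is consistently higher than EM in the table. Benchmark scores are these per-question values averaged over the evaluation set and reported as percentages.
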
We study the contribution of the World Model and the Actor–Critic components to DAVIS’s performance on ScienceWorld tasks of varying difficulty.
| Task | D | D + W |
|---|---|---|
| Long Tasks | ||
| Melt (1-2) | 3.00 | 70.00 |
| Determine Melting Point Unk. (2-3) | 5.00 | 92.33 |
| Medium Tasks | ||
| Mix Paint Secondary (6-1) | 40.00 | 36.37 |
| Test Conductivity (3-3) | 55.00 | 58.33 |
| Short Tasks | ||
| Lifespan Longest-Lived (7-1) | 66.67 | 100.00 |
| Find Living Thing (4-1) | 25.00 | 100.00 |
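
The per-task change between the two columns can be read off with simple arithmetic. The snippet below copies the scores from the table above and computes D + W minus D for each task. It assumes, based on the sentence introducing the table, that W denotes the World Model; the variable names are illustrative.

```python
# Scores copied from the table above: (task, D, D + W).
scores = [
    ("Melt (1-2)", 3.00, 70.00),
    ("Determine Melting Point Unk. (2-3)", 5.00, 92.33),
    ("Mix Paint Secondary (6-1)", 40.00, 36.37),
    ("Test Conductivity (3-3)", 55.00, 58.33),
    ("Lifespan Longest-Lived (7-1)", 66.67, 100.00),
    ("Find Living Thing (4-1)", 25.00, 100.00),
]

# Point difference between the two columns, rounded to two decimals.
gains = {task: round(with_w - base, 2) for task, base, with_w in scores}

for task, gain in gains.items():
    print(f"{task}: {gain:+.2f}")
```

The D + W score is higher on five of the six tasks. The exception is Mix Paint Secondary (6-1), where it is 3.63 points lower.
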

| Type | Task | D + W | D - A | D - C |
|---|---|---|---|---|
| Long | 1-2 | 70.0 (4.38) | 25.0 (1.12) | 23.3 (3.20) |
| Long | 2-3 | 92.3 (1.51) | 79.7 (1.19) | 33.3 (1.29) |
| Medium | 6-1 | 36.4 (3.22) | 100.0 (1.49) | 86.0 (2.49) |
| Medium | 3-3 | 58.3 (2.71) | 28.0 (1.32) | 49.3 (1.72) |
| Short | 7-1 | 83.3 (2.00) | 66.7 (2.00) | 83.3 (2.00) |
| Short | 4-1 | 100.0 (2.50) | 44.7 (2.94) | 25.0 (2.50) |

@inproceedings{dinh-etal-2025-davis,
title = "{DAVIS}: Planning Agent with Knowledge Graph-Powered Inner Monologue",
author = "Dinh, Minh Pham and
Yankoski, Michael G and
Syed, Munira and
Ford, Trenton W.",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.895/",
doi = "10.18653/v1/2025.findings-emnlp.895",
pages = "16490--16505",
ISBN = "979-8-89176-335-7",
abstract = "Designing a generalist scientific agent capable of performing tasks in laboratory settings to assist researchers has become a key goal in recent Artificial Intelligence (AI) research. Unlike everyday tasks, scientific tasks are inherently more delicate and complex, requiring agents to possess a higher level of reasoning ability, structured and temporal understanding of their environment, and a strong emphasis on safety. Existing approaches often fail to address these multifaceted requirements. To tackle these challenges, we present DAVIS. Unlike traditional retrieval-augmented generation (RAG) approaches, DAVIS incorporates structured and temporal memory, which enables model-based planning. Additionally, DAVIS implements an agentic, multi-turn retrieval system, similar to a human{'}s inner monologue, allowing for a greater degree of reasoning over past experiences. DAVIS demonstrates substantially improved performance on the ScienceWorld benchmark comparing to previous approaches on 8 out of 9 elementary science subjects. In addition, DAVIS{'}s World Model demonstrates competitive performance on the famous HotpotQA and MusiqueQA dataset for multi-hop question answering. To the best of our knowledge, DAVIS is the first RAG agent to employ an interactive retrieval method in a RAG pipeline."
}