DAVIS: Planning Agent with Knowledge Graph-Powered Inner Monologue

Minh Pham-Dinh; Michael G. Yankoski; Munira Syed; Trenton W. Ford

doi:10.18653/v1/2025.findings-emnlp.895

Abstract

Designing a generalist scientific agent capable of performing tasks in laboratory settings to assist researchers has become a key goal in recent Artificial Intelligence (AI) research. Unlike everyday tasks, scientific tasks are inherently more delicate and complex, requiring agents to possess a higher level of reasoning ability, a structured and temporal understanding of their environment, and a strong emphasis on safety. Existing approaches often fail to address these multifaceted requirements. To tackle these challenges, we present DAVIS. Unlike traditional retrieval-augmented generation (RAG) approaches, DAVIS incorporates structured and temporal memory, which enables model-based planning. Additionally, DAVIS implements an agentic, multi-turn retrieval system, similar to a human’s inner monologue, allowing for a greater degree of reasoning over past experiences. DAVIS demonstrates substantially improved performance on the ScienceWorld benchmark compared to previous approaches on 8 out of 9 elementary science subjects. In addition, DAVIS’s World Model demonstrates competitive performance on the HotpotQA and MuSiQue datasets for multi-hop question answering. To the best of our knowledge, DAVIS is the first RAG agent to employ an interactive retrieval method in a RAG pipeline.

Diagram showing DAVIS world model, inner monologue retrieval, and actor–critic modules

DAVIS enables planning for interactive scientific reasoning tasks via knowledge-graph memory and an inner-monologue retrieval loop.
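
As a rough illustration of what structured, temporal memory with multi-turn retrieval could look like, the sketch below stores time-stamped triples and answers a query one hop per turn, with each turn starting from the previous turn's answer. This is not the DAVIS implementation: the names `Fact`, `TemporalGraphMemory`, and `multi_turn_retrieve` are hypothetical, and the fixed list of relations is an assumption standing in for the queries that a language model would choose during an inner monologue.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One edge of the graph, stamped with the step at which it was observed."""
    subject: str
    relation: str
    obj: str
    step: int


class TemporalGraphMemory:
    """Hypothetical memory: triples indexed by subject, newest fact wins."""

    def __init__(self):
        self._by_subject = defaultdict(list)

    def add(self, subject, relation, obj, step):
        self._by_subject[subject].append(Fact(subject, relation, obj, step))

    def latest(self, subject, relation):
        # Use .get so that lookups never create empty entries.
        facts = [f for f in self._by_subject.get(subject, [])
                 if f.relation == relation]
        return max(facts, key=lambda f: f.step) if facts else None


def multi_turn_retrieve(memory, start, relations):
    """Follow one relation per turn, feeding each answer into the next turn."""
    trace, current = [], start
    for relation in relations:
        fact = memory.latest(current, relation)
        if fact is None:
            break  # nothing known; a real agent might rephrase the query here
        trace.append(fact)
        current = fact.obj
    return trace


memory = TemporalGraphMemory()
memory.add("thermometer", "located_in", "kitchen", step=1)
memory.add("thermometer", "located_in", "workshop", step=5)  # newer observation
memory.add("workshop", "connects_to", "hallway", step=2)

trace = multi_turn_retrieve(memory, "thermometer", ["located_in", "connects_to"])
print([f.obj for f in trace])  # ['workshop', 'hallway']
```

The time stamp is what lets the second turn start from the workshop rather than the outdated kitchen location, which is the kind of temporal reasoning a flat document store does not provide by default.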

Inner Monologue.

Results

ScienceWorld: Success Rate by Subject

DAVIS improves performance across 8/9 ScienceWorld subjects compared to prior work.

Figure: Bar chart comparing DAVIS and baseline models on each ScienceWorld subject (per-subject breakdown).

World Model QA: HotpotQA / MuSiQue

Method	HotpotQA EM	HotpotQA F1	MuSiQue EM	MuSiQue F1
GPT-4o	46.3	64.1	19.0	34.4
GPT-4-turbo	44.3	60.4	20.5	34.7
GraphReader (GPT-4)	55.0	70.0	38.0	47.4
HOLMES (GPT-4)	66.0	78.0	48.0	58.0
GraphRAG (GPT-4o-mini)	58.7	63.3	40.0	53.5
DAVIS (GPT-4o)	56.25	73.8	33.8	48.5
DAVIS (GPT-4-turbo)	55.25	71.0	34.0	47.1
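
The table reports EM (exact match) and F1 without defining them here. A minimal sketch of how these two metrics are commonly computed for extractive question answering is given below, assuming the widely used SQuAD-style normalization (lowercasing, dropping punctuation and English articles). Whether the numbers above were produced with exactly this normalization is an assumption, and the function names are illustrative.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1.0
print(round(token_f1("Eiffel Tower in Paris", "Eiffel Tower"), 3))  # 0.667
```

Dataset-level scores are the mean of these per-question values, scaled to a percentage. F1 gives partial credit for overlapping tokens, which is why it is never lower than EM in the table.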

Ablation Studies

We study the contribution of the World Model and the Actor–Critic components to DAVIS’s performance on ScienceWorld tasks of varying difficulty.

Effect of World Model (WM)

**Table 1:** DAVIS performance without the World Model (D) and with it (D + W).
Type	Task	D	D + W
Long	Melt (1-2)	3.00	70.00
Long	Determine Melting Point Unk. (2-3)	5.00	92.33
Medium	Mix Paint Secondary (6-1)	40.00	36.37
Medium	Test Conductivity (3-3)	55.00	58.33
Short	Lifespan Longest-Lived (7-1)	66.67	100.00
Short	Find Living Thing (4-1)	25.00	100.00
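
To make the effect of the World Model easier to read off, the short sketch below recomputes the per-task difference between the two columns of Table 1. The values are copied from the table; the variable names are illustrative, and the differences are expressed in the table's own units.

```python
# (without World Model, with World Model), copied from Table 1
table1 = {
    "Melt (1-2)": (3.00, 70.00),
    "Determine Melting Point Unk. (2-3)": (5.00, 92.33),
    "Mix Paint Secondary (6-1)": (40.00, 36.37),
    "Test Conductivity (3-3)": (55.00, 58.33),
    "Lifespan Longest-Lived (7-1)": (66.67, 100.00),
    "Find Living Thing (4-1)": (25.00, 100.00),
}

# Positive values mean the World Model helped on that task.
gains = {task: round(with_wm - without_wm, 2)
         for task, (without_wm, with_wm) in table1.items()}

for task, gain in gains.items():
    print(f"{task}: {gain:+.2f}")

mean_gain = sum(gains.values()) / len(gains)
print(f"mean gain: {mean_gain:.2f}")  # mean gain: 43.73
```

The World Model helps on five of the six tasks, with the largest gains on the two long tasks (+67.00 and +87.33). Mix Paint Secondary (6-1) is the only task where it lowers the score (-3.63).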

Actor–Critic Ablation

**Table 2:** Ablation study: full model (D + W), without Actor (D - A), and without Critic (D - C). Values are success rate (%), with average steps in parentheses.
Type	Task	D + W	D - A	D - C
Long	1-2	70.0 (4.38)	25.0 (1.12)	23.3 (3.20)
Long	2-3	92.3 (1.51)	79.7 (1.19)	33.3 (1.29)
Medium	6-1	36.4 (3.22)	100.0 (1.49)	86.0 (2.49)
Medium	3-3	58.3 (2.71)	28.0 (1.32)	49.3 (1.72)
Short	7-1	83.3 (2.00)	66.7 (2.00)	83.3 (2.00)
Short	4-1	100.0 (2.50)	44.7 (2.94)	25.0 (2.50)
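
Each cell of Table 2 packs two numbers into one string. A small sketch of how such cells can be split and compared is shown below; the helper names `parse_cell` and `full_model_strict_wins` are hypothetical, and the data is copied from the table.

```python
import re

# Matches cells such as "70.0 (4.38)": success rate, then average steps.
CELL = re.compile(r"^\s*([\d.]+)\s*\(([\d.]+)\)\s*$")


def parse_cell(cell):
    """Split a 'rate (steps)' cell into a pair of floats."""
    match = CELL.match(cell)
    if match is None:
        raise ValueError(f"unexpected cell format: {cell!r}")
    return float(match.group(1)), float(match.group(2))


# Columns per task: D + W, D - A, D - C (copied from Table 2)
TABLE2 = {
    "1-2": ("70.0 (4.38)", "25.0 (1.12)", "23.3 (3.20)"),
    "2-3": ("92.3 (1.51)", "79.7 (1.19)", "33.3 (1.29)"),
    "6-1": ("36.4 (3.22)", "100.0 (1.49)", "86.0 (2.49)"),
    "3-3": ("58.3 (2.71)", "28.0 (1.32)", "49.3 (1.72)"),
    "7-1": ("83.3 (2.00)", "66.7 (2.00)", "83.3 (2.00)"),
    "4-1": ("100.0 (2.50)", "44.7 (2.94)", "25.0 (2.50)"),
}


def full_model_strict_wins(table):
    """Tasks where the full model beats both ablations outright."""
    wins = []
    for task, cells in table.items():
        full, no_actor, no_critic = (parse_cell(c)[0] for c in cells)
        if full > no_actor and full > no_critic:
            wins.append(task)
    return wins


print(full_model_strict_wins(TABLE2))  # ['1-2', '2-3', '3-3', '4-1']
```

By this reading, the full model has the strictly highest success rate on four of six tasks, ties the no-Critic variant on 7-1, and is outperformed by both ablations on 6-1.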

BibTeX

@inproceedings{dinh-etal-2025-davis,
    title = "{DAVIS}: Planning Agent with Knowledge Graph-Powered Inner Monologue",
    author = "Dinh, Minh Pham  and
      Yankoski, Michael G  and
      Syed, Munira  and
      Ford, Trenton W.",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.895/",
    doi = "10.18653/v1/2025.findings-emnlp.895",
    pages = "16490--16505",
    ISBN = "979-8-89176-335-7",
    abstract = "Designing a generalist scientific agent capable of performing tasks in laboratory settings to assist researchers has become a key goal in recent Artificial Intelligence (AI) research. Unlike everyday tasks, scientific tasks are inherently more delicate and complex, requiring agents to possess a higher level of reasoning ability, structured and temporal understanding of their environment, and a strong emphasis on safety. Existing approaches often fail to address these multifaceted requirements. To tackle these challenges, we present DAVIS. Unlike traditional retrieval-augmented generation (RAG) approaches, DAVIS incorporates structured and temporal memory, which enables model-based planning. Additionally, DAVIS implements an agentic, multi-turn retrieval system, similar to a human{'}s inner monologue, allowing for a greater degree of reasoning over past experiences. DAVIS demonstrates substantially improved performance on the ScienceWorld benchmark comparing to previous approaches on 8 out of 9 elementary science subjects. In addition, DAVIS{'}s World Model demonstrates competitive performance on the famous HotpotQA and MusiqueQA dataset for multi-hop question answering. To the best of our knowledge, DAVIS is the first RAG agent to employ an interactive retrieval method in a RAG pipeline."
}