DAVIS: Planning Agent with Knowledge Graph-Powered Inner Monologue

Findings of EMNLP 2025 · Suzhou, China · Nov 2025

1 Davis Institute for Artificial Intelligence, 2 University of Notre Dame, 3 William & Mary

Abstract

Designing a generalist scientific agent capable of performing tasks in laboratory settings to assist researchers has become a key goal in recent Artificial Intelligence (AI) research. Unlike everyday tasks, scientific tasks are inherently more delicate and complex, requiring agents to possess a higher level of reasoning ability, a structured and temporal understanding of their environment, and a strong emphasis on safety. Existing approaches often fail to address these multifaceted requirements. To tackle these challenges, we present DAVIS. Unlike traditional retrieval-augmented generation (RAG) approaches, DAVIS incorporates structured and temporal memory, which enables model-based planning. Additionally, DAVIS implements an agentic, multi-turn retrieval system, similar to a human’s inner monologue, allowing for a greater degree of reasoning over past experiences. DAVIS demonstrates substantially improved performance on the ScienceWorld benchmark compared with previous approaches on 8 out of 9 elementary science subjects. In addition, DAVIS’s World Model demonstrates competitive performance on the widely used HotpotQA and MuSiQue datasets for multi-hop question answering. To the best of our knowledge, DAVIS is the first RAG agent to employ an interactive retrieval method in a RAG pipeline.

Figure: DAVIS enables planning for interactive scientific reasoning tasks via knowledge-graph memory and an inner-monologue retrieval loop.

Figure: Inner monologue.
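
To make the idea of multi-turn retrieval over a structured, temporal memory more concrete, here is a minimal sketch. It is not the DAVIS implementation: the names `Fact`, `TemporalGraphMemory`, and `inner_monologue` are hypothetical, and a simple graph traversal stands in for the agentic, language-model-driven query generation described in the abstract. The only point it illustrates is that each retrieval turn uses the results of the previous turn to decide what to look up next, and that time-stamped facts let the most recent observation win.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One time-stamped edge in a toy knowledge graph (hypothetical schema)."""
    subject: str
    relation: str
    obj: str
    step: int  # environment step at which the fact was observed


class TemporalGraphMemory:
    """Toy structured, temporal memory; an assumption made for illustration."""

    def __init__(self):
        self.facts = []

    def add(self, subject, relation, obj, step):
        self.facts.append(Fact(subject, relation, obj, step))

    def about(self, entity):
        """Return all facts that mention the entity, newest first."""
        hits = [f for f in self.facts if entity in (f.subject, f.obj)]
        return sorted(hits, key=lambda f: f.step, reverse=True)


def inner_monologue(memory, start_entity, target_relation, max_turns=3):
    """Multi-turn retrieval: the facts found in one turn pick the next queries."""
    frontier = [start_entity]
    seen = set()
    trace = []  # every fact inspected, in order, as a record of the "monologue"
    for _ in range(max_turns):
        next_frontier = []
        for entity in frontier:
            if entity in seen:
                continue
            seen.add(entity)
            for fact in memory.about(entity):
                trace.append(fact)
                if fact.relation == target_relation:
                    return fact, trace
                # Otherwise, follow the edge and query the neighbor next turn.
                neighbor = fact.obj if fact.subject == entity else fact.subject
                next_frontier.append(neighbor)
        frontier = next_frontier
    return None, trace


memory = TemporalGraphMemory()
memory.add("agent", "located_in", "kitchen", step=1)
memory.add("kitchen", "contains", "stove", step=2)
memory.add("stove", "state", "off", step=3)
memory.add("stove", "state", "on", step=5)

answer, trace = inner_monologue(memory, "agent", "state")
print(answer.obj)  # on
```

A single-shot retriever querying only for "agent" would never reach the stove; here the third turn does, and the newest-first ordering returns the state observed at step 5 rather than the stale one from step 3.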

Results

ScienceWorld: Success Rate by Subject

DAVIS improves performance across 8/9 ScienceWorld subjects compared to prior work.

Figure: ScienceWorld results by subject (per-subject breakdown).

World Model QA: HotpotQA / MuSiQue

| Method | HotpotQA EM | HotpotQA F1 | MuSiQue EM | MuSiQue F1 |
|---|---|---|---|---|
| GPT-4o | 46.3 | 64.1 | 19.0 | 34.4 |
| GPT-4-turbo | 44.3 | 60.4 | 20.5 | 34.7 |
| GraphReader (GPT-4) | 55.0 | 70.0 | 38.0 | 47.4 |
| HOLMES (GPT-4) | 66.0 | 78.0 | 48.0 | 58.0 |
| GraphRAG (GPT-4o-mini) | 58.7 | 63.3 | 40.0 | 53.5 |
| DAVIS (GPT-4o) | 56.25 | 73.8 | 33.8 | 48.5 |
| DAVIS (GPT-4-turbo) | 55.25 | 71.0 | 34.0 | 47.1 |
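
For readers unfamiliar with the EM and F1 columns, the sketch below shows how exact match and token-level F1 are commonly computed for extractive question answering. It follows the widespread SQuAD-style convention (lowercasing, stripping punctuation and English articles); the page does not state which normalization was used for the numbers above, so treat that detail as an assumption. Per-question scores are typically averaged and reported as percentages.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, gold):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction, gold):
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


print(exact_match("The Eiffel Tower", "eiffel tower"))    # 1.0
print(exact_match("Eiffel Tower in Paris", "Eiffel Tower"))  # 0.0
print(round(f1_score("Eiffel Tower in Paris", "Eiffel Tower"), 3))  # 0.667
```

The second and third calls show why F1 is usually higher than EM in the table: an answer that contains the gold span plus extra words earns partial F1 credit but no exact-match credit.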

Ablation Studies

We study the contribution of the World Model and the Actor–Critic components to DAVIS’s performance on ScienceWorld tasks of varying difficulty.

Effect of World Model (WM)

Table 1: DAVIS performance with World Model (D+W) and without World Model (D).

| Task length | Task | D | D + W |
|---|---|---|---|
| Long | Melt (1-2) | 3.00 | 70.00 |
| Long | Determine Melting Point Unk. (2-3) | 5.00 | 92.33 |
| Medium | Mix Paint Secondary (6-1) | 40.00 | 36.37 |
| Medium | Test Conductivity (3-3) | 55.00 | 58.33 |
| Short | Lifespan Longest-Lived (7-1) | 66.67 | 100.00 |
| Short | Find Living Thing (4-1) | 25.00 | 100.00 |
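
As a reading aid, the short script below computes the change attributable to the World Model (D + W minus D) for each task in Table 1 and averages it by task length. The numbers are copied from the table; the function names are illustrative only, and the script adds no data beyond simple subtraction.

```python
# Values copied from Table 1 as (D, D + W) pairs, grouped by task length.
table_1 = {
    "Long": {
        "Melt (1-2)": (3.00, 70.00),
        "Determine Melting Point Unk. (2-3)": (5.00, 92.33),
    },
    "Medium": {
        "Mix Paint Secondary (6-1)": (40.00, 36.37),
        "Test Conductivity (3-3)": (55.00, 58.33),
    },
    "Short": {
        "Lifespan Longest-Lived (7-1)": (66.67, 100.00),
        "Find Living Thing (4-1)": (25.00, 100.00),
    },
}


def world_model_gains(table):
    """Per-task difference between the D + W and D columns."""
    gains = {}
    for tasks in table.values():
        for task, (without_wm, with_wm) in tasks.items():
            gains[task] = round(with_wm - without_wm, 2)
    return gains


def mean_gain_by_length(table):
    """Average the per-task difference within each task-length group."""
    means = {}
    for length, tasks in table.items():
        diffs = [with_wm - without_wm for without_wm, with_wm in tasks.values()]
        means[length] = round(sum(diffs) / len(diffs), 2)
    return means


gains = world_model_gains(table_1)
means = mean_gain_by_length(table_1)

for task, gain in gains.items():
    print(f"{task}: {gain:+.2f}")
for length, mean in means.items():
    print(f"{length} tasks, mean change: {mean:+.2f}")
```

On these six tasks the average change is largest for the long tasks and close to zero for the medium ones, where Mix Paint Secondary (6-1) is the only task that scores lower with the World Model.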

Actor–Critic Ablation

Table 2: Ablation study: Full model (D+W), without Actor (D-A), and without Critic (D-C). Values are success rate (%), with average steps in parentheses.

| Type | Task | D + W | D - A | D - C |
|---|---|---|---|---|
| Long | 1-2 | 70.0 (4.38) | 25.0 (1.12) | 23.3 (3.20) |
| Long | 2-3 | 92.3 (1.51) | 79.7 (1.19) | 33.3 (1.29) |
| Medium | 6-1 | 36.4 (3.22) | 100.0 (1.49) | 86.0 (2.49) |
| Medium | 3-3 | 58.3 (2.71) | 28.0 (1.32) | 49.3 (1.72) |
| Short | 7-1 | 83.3 (2.00) | 66.7 (2.00) | 83.3 (2.00) |
| Short | 4-1 | 100.0 (2.50) | 44.7 (2.94) | 25.0 (2.50) |

BibTeX

@inproceedings{dinh-etal-2025-davis,
    title = "{DAVIS}: Planning Agent with Knowledge Graph-Powered Inner Monologue",
    author = "Dinh, Minh Pham  and
      Yankoski, Michael G  and
      Syed, Munira  and
      Ford, Trenton W.",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.895/",
    doi = "10.18653/v1/2025.findings-emnlp.895",
    pages = "16490--16505",
    ISBN = "979-8-89176-335-7",
    abstract = "Designing a generalist scientific agent capable of performing tasks in laboratory settings to assist researchers has become a key goal in recent Artificial Intelligence (AI) research. Unlike everyday tasks, scientific tasks are inherently more delicate and complex, requiring agents to possess a higher level of reasoning ability, structured and temporal understanding of their environment, and a strong emphasis on safety. Existing approaches often fail to address these multifaceted requirements. To tackle these challenges, we present DAVIS. Unlike traditional retrieval-augmented generation (RAG) approaches, DAVIS incorporates structured and temporal memory, which enables model-based planning. Additionally, DAVIS implements an agentic, multi-turn retrieval system, similar to a human{'}s inner monologue, allowing for a greater degree of reasoning over past experiences. DAVIS demonstrates substantially improved performance on the ScienceWorld benchmark comparing to previous approaches on 8 out of 9 elementary science subjects. In addition, DAVIS{'}s World Model demonstrates competitive performance on the famous HotpotQA and MusiqueQA dataset for multi-hop question answering. To the best of our knowledge, DAVIS is the first RAG agent to employ an interactive retrieval method in a RAG pipeline."
}