H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions
Large language model agents are increasingly deployed in human-human interaction settings, such as meeting assistants and clinical documentation systems. H2HMem evaluates memory capabilities across dyadic and multi-party conversations with multimodal information streams, testing memory recall, reasoning, and application.
Leaderboard
Evaluation of memory agents on the H2HMem benchmark. Results are based on retrieval-augmented methods with the default top-k=5 retrieval setting.
| Rank | Method | Modality | Backbone | Score |
|---|---|---|---|---|
| 1 | A-Mem | Text-based | GPT-4.1-Nano | 57.57 |
| 2 | MuRAG | Multimodal | GPT-4.1-Nano | 55.27 |
| 3 | NGM | Multimodal | GPT-4.1-Nano | 50.49 |
| 4 | NaiveRAG | Text-based | GPT-4.1-Nano | 45.69 |
| 5 | Full (MM) | Multimodal | GPT-4.1-Nano | 39.88 |
| 6 | Full (Text) | Text-based | GPT-4.1-Nano | 34.64 |
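To make the top-k=5 retrieval setting concrete, here is a minimal sketch of how a retrieval-augmented memory agent might pick the five most relevant memory entries for a query. It is not the benchmark's or any listed method's implementation: the function names `cosine` and `retrieve_top_k`, the `(text, vector)` memory layout, and the use of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_vec, memory, k=5):
    """Return the texts of the k memory entries most similar to the query.

    memory is a list of (text, vector) pairs; this layout is hypothetical.
    """
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]
```

In a full pipeline the retrieved texts would be placed in the backbone model's prompt; how embeddings are computed for text and images is left out here because the page does not specify it.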
Evaluation Tasks
H2HMem evaluates memory capabilities across three functional dimensions with nine distinct task types.
Memory Recall
Evaluates whether models can retrieve explicitly presented multimodal information from conversations.
Task Types
- Unimodal Precise Recall (UPR): Retrieve information from a single modality (text or image)
- Cross-modal Related Retrieval (CRR): Retrieve aligned content across modalities, mapping text to image or vice versa
- Knowledge Resolution (KR): Retrieve currently correct information from multi-session dialogues with updated information
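As an illustration of what Knowledge Resolution asks for, the sketch below keeps only the most recent value for each fact across sessions. It assumes, hypothetically, that facts have already been extracted as `(session_id, key, value)` triples and that a higher `session_id` means a later session; the benchmark itself does not prescribe this representation, and `resolve_current_facts` is an invented name.

```python
def resolve_current_facts(updates):
    """Return a dict mapping each key to its most recently stated value.

    updates is an iterable of (session_id, key, value) triples (hypothetical format).
    Later sessions overwrite earlier ones; within one session, later input order wins
    because sorted() is stable.
    """
    current = {}
    for _session_id, key, value in sorted(updates, key=lambda u: u[0]):
        current[key] = value
    return current
```

For example, if a meeting time is stated as "3pm" in session 1 and revised to "4pm" in session 3, the resolved value is "4pm", which is the answer a Knowledge Resolution question would expect.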