H2HMem: A Multimodal Memory Benchmark for Agents in Human-Human Interactions

Large language model agents are increasingly deployed in human-human interaction settings, such as meeting assistants and clinical documentation systems. H2HMem evaluates memory capabilities across dyadic and multi-party conversations with multimodal information streams, testing memory recall, reasoning, and application.

25 Dialogues: Dyadic & multi-party
309 Sessions: Multi-session conversations
2,236 QA Pairs: Across 9 task types
1,300 Images: Multimodal content

Leaderboard

Evaluation of memory agents on the H2HMem benchmark. Results are reported for retrieval-augmented methods using the default retrieval setting of top-k = 5.
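To make the top-k = 5 setting concrete, here is a minimal sketch of top-k memory retrieval. It is an illustration only: the function names (`bag_of_words`, `cosine`, `retrieve_top_k`) are hypothetical, and the bag-of-words similarity is a toy stand-in for whatever embedding model or retriever each evaluated method actually uses, which this page does not specify.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase token counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query: str, memory: list, k: int = 5) -> list:
    """Return the k memory entries most similar to the query (hypothetical helper)."""
    q = bag_of_words(query)
    ranked = sorted(memory, key=lambda m: cosine(q, bag_of_words(m)), reverse=True)
    return ranked[:k]
```

Under this sketch, only the `k` highest-scoring memory entries would be passed to the backbone model when it answers a question, so a larger `k` trades a longer prompt for a better chance that the relevant entry is included.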

Rank  Method       Type        Backbone      Score  Recall  Reasoning  Application
1     A-Mem        Text-based  GPT-4.1-Nano  57.57  59.68   43.45      64.37
2     MuRAG        Multimodal  GPT-4.1-Nano  55.27  56.42   41.36      63.46
3     NGM          Multimodal  GPT-4.1-Nano  50.49  48.00   40.48      64.15
4     NaiveRAG     Text-based  GPT-4.1-Nano  45.69  -       -          -
5     Full (MM)    Multimodal  GPT-4.1-Nano  39.88  -       -          -
6     Full (Text)  Text-based  GPT-4.1-Nano  34.64  -       -          -

Per-dimension scores are listed only for the top three methods.

Evaluation Tasks

H2HMem evaluates memory capabilities across three functional dimensions with nine distinct task types.

Memory Recall

Evaluates whether models can retrieve explicitly presented multimodal information from conversations.

QA Pairs: 772

Task Types

  • Unimodal Precise Recall (UPR)

    Retrieve information from a single modality (text or image)

  • Cross-modal Related Retrieval (CRR)

    Retrieve aligned content across modalities, mapping text to image or vice versa

  • Knowledge Resolution (KR)

    Retrieve currently correct information from multi-session dialogues with updated information
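
As an illustration of what Knowledge Resolution asks of a memory system, the sketch below resolves conflicting statements by keeping the one from the latest session. This is a simplifying assumption, not the benchmark's scoring procedure: the `Fact` record, its fields, and `resolve_current` are hypothetical names, and "currently correct" is taken here to mean "most recently stated".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Fact:
    session: int  # index of the session in which the statement was made
    key: str      # what the statement is about (hypothetical normalised key)
    value: str    # the value stated in that session


def resolve_current(facts: List[Fact], key: str) -> Optional[str]:
    """Return the value from the latest session mentioning `key`, or None."""
    matching = [f for f in facts if f.key == key]
    if not matching:
        return None
    # Assumption: a later session overrides earlier ones for the same key.
    return max(matching, key=lambda f: f.session).value


facts = [
    Fact(session=1, key="meeting_day", value="Tuesday"),
    Fact(session=4, key="meeting_day", value="Thursday"),
    Fact(session=2, key="meeting_room", value="B12"),
]
print(resolve_current(facts, "meeting_day"))
```

Here the session 4 update supersedes the session 1 statement, so a system that returns the stale value "Tuesday" would be answering from outdated memory.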