LightMem: Lightweight and efficient memory-augmented generation
The paper proposes LightMem, a human-inspired memory system for large language model (LLM) agents that preserves long-term context while drastically cutting token usage, API calls, and latency compared with existing memory frameworks.
Research topic and objective
The paper studies how to design an external memory system for LLM agents that is both accurate and computationally efficient during long, multi-turn interactions.
Its main objective is to introduce LightMem, a three-stage memory architecture inspired by human sensory, short-term, and long-term memory, and to show that it improves question‑answering accuracy while sharply reducing cost on long‑context benchmarks.
Key findings and conclusions
- LightMem consistently achieves higher accuracy than strong memory baselines (such as A‑Mem, MemoryOS, Mem0, and LangMem) on the LongMemEval and LoCoMo dialogue benchmarks, with both GPT‑4o‑mini and Qwen3‑30B backbones.
- At the same time, it reduces total token usage by up to 38× (GPT) and 21.8× (Qwen) and cuts API calls by up to 30× and 17.1×, respectively, on LongMemEval; when only online test‑time cost is counted, the savings are even larger, exceeding 100× fewer tokens and 150× fewer API calls.
- On LoCoMo, LightMem improves accuracy by up to about 18 percentage points for GPT and around 29 percentage points for Qwen compared to memory baselines, while reducing total tokens by up to about 20.9× and API calls by up to about 55.5×.
- The authors conclude that a human‑memory‑inspired pipeline—early compression, topic‑aware grouping, and offline “sleep‑time” consolidation—can simultaneously enhance long‑horizon reasoning and efficiency for LLM agents.
Critical data and facts
Architecture and mechanism
- LightMem has three main modules that mimic human memory stages (minimal illustrative sketches of these mechanisms follow this list):
  - Light1 (Sensory memory): A pre‑compression module uses LLMLingua‑2 (or a similar compressor) to filter out low‑value tokens from each turn, keeping only the most informative ones based on token‑level retention probabilities and entropy measures.
  - Light2 (Short‑term memory, STM): Compressed content is buffered and segmented into topic‑coherent groups using a hybrid method that combines attention patterns and semantic similarity between dialogue turns to find topic boundaries; each group is summarized by an LLM once a token threshold is reached.
  - Light3 (Long‑term memory, LTM): Summaries, embeddings, and raw turns are stored as memory entries; new entries are inserted via “soft” updates at test time, while heavier re‑organization, de‑duplication, and abstraction are deferred to an offline “sleep‑time” phase that runs parallel, batched update operations.
- Memory entries store a topic label, an embedding of the summary, and the raw user/model turns, enabling semantic retrieval and later consolidation through time‑aware update queues in which only newer entries may update older ones.
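To make the sensory‑memory stage concrete, the pre‑compression step can be pictured as keeping only the tokens whose predicted retention probability is high enough to hit a target compression rate. Below is a minimal Python sketch under that assumption; the scores are stand‑ins for LLMLingua‑2‑style token‑level probabilities, and `compress_turn` is an illustrative name rather than the paper's API:

```python
from typing import List

def compress_turn(tokens: List[str],
                  retain_probs: List[float],
                  rate: float = 0.6) -> List[str]:
    """Keep the highest-scoring fraction `rate` of tokens, preserving order.

    `retain_probs` stands in for LLMLingua-2-style token-level retention
    probabilities; here they are assumed to come from some external scorer.
    """
    k = max(1, int(len(tokens) * rate))
    # Indices of the k most informative tokens, then restore original order.
    top = sorted(range(len(tokens)), key=lambda i: retain_probs[i], reverse=True)[:k]
    keep = set(top)
    return [tok for i, tok in enumerate(tokens) if i in keep]

# Toy usage with made-up scores: filler words get low retention probabilities.
tokens = "well um I booked the flight to Oslo for March 3rd".split()
scores = [0.1, 0.05, 0.2, 0.9, 0.3, 0.85, 0.4, 0.95, 0.5, 0.8, 0.9]
print(compress_turn(tokens, scores, rate=0.5))
# -> ['booked', 'flight', 'Oslo', 'March', '3rd']
```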
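The short‑term buffer and long‑term store can likewise be sketched as plain data structures: turns accumulate in the STM, a new topic segment starts when semantic similarity to the previous turn drops below a threshold (the paper's hybrid method also uses attention signals, omitted here), and once the buffer exceeds a token budget each segment is summarized into an LTM entry. Class names, thresholds, and the injected `embed`/`summarize` callables below are illustrative assumptions, not the authors' implementation:

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List

import numpy as np

@dataclass
class MemoryEntry:
    topic: str                 # topic label for the segment
    summary: str               # LLM-written summary of the segment
    embedding: np.ndarray      # embedding of the summary, used for retrieval
    raw_turns: List[str]       # original user/model turns
    timestamp: float = field(default_factory=time.time)

class ShortTermMemory:
    def __init__(self, embed: Callable[[str], np.ndarray],
                 summarize: Callable[[List[str]], str],
                 sim_threshold: float = 0.6, token_budget: int = 2000):
        self.embed, self.summarize = embed, summarize
        self.sim_threshold, self.token_budget = sim_threshold, token_budget
        self.buffer: List[List[str]] = []      # list of topic segments
        self.long_term: List[MemoryEntry] = []

    def add_turn(self, turn: str) -> None:
        emb = self.embed(turn)
        if self.buffer and self._similar(emb, self.buffer[-1][-1]):
            self.buffer[-1].append(turn)       # same topic: extend segment
        else:
            self.buffer.append([turn])         # topic shift: open new segment
        if self._buffered_tokens() >= self.token_budget:
            self._flush()

    def _similar(self, emb: np.ndarray, prev_turn: str) -> bool:
        prev = self.embed(prev_turn)
        cos = float(emb @ prev / (np.linalg.norm(emb) * np.linalg.norm(prev)))
        return cos >= self.sim_threshold

    def _buffered_tokens(self) -> int:
        return sum(len(t.split()) for seg in self.buffer for t in seg)

    def _flush(self) -> None:
        # Summarize each topic segment once and append it to long-term memory.
        for seg in self.buffer:
            summary = self.summarize(seg)
            self.long_term.append(MemoryEntry(
                topic=seg[0][:40],             # placeholder topic label
                summary=summary,
                embedding=self.embed(summary),
                raw_turns=seg))
        self.buffer = []
```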
Complexity and efficiency gains
- For a dialogue with \(N\) turns and an average of \(T\) tokens per turn, conventional systems typically require \(O(N)\) summarization calls and updates, whereas LightMem reduces the number of calls to roughly \(\frac{N r^{x} T}{\mathrm{th}}\), where \(r\) is the compression ratio, \(x\) the number of compression iterations, and \(\mathrm{th}\) the STM buffer threshold (a worked example follows this list).
- This design reduces the runtime complexity of memory construction from \(O(N)\) to \(O\!\left(\frac{N r^{x} T}{\mathrm{th}}\right)\), which explains the observed large reductions in API calls and tokens.
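As a rough illustration with hypothetical numbers (not taken from the paper): for \(N = 500\) turns of \(T = 200\) tokens each, a single compression pass (\(x = 1\)) at ratio \(r = 0.5\), and an STM threshold \(\mathrm{th} = 2000\) tokens, memory construction would need roughly \(500 \times 0.5 \times 200 / 2000 = 25\) summarization calls, versus on the order of 500 for a per‑turn memory system.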
Benchmark results (selected)
LongMemEval (GPT‑4o‑mini backbone):
- Strong baseline A‑Mem reaches accuracy around 62.6%, while LightMem configurations reach up to about 68.6% accuracy, a gain of roughly 2.1–6.4 percentage points over the best baseline.
- Compared with baselines, LightMem reduces total token usage by about 10×–38×, reduces API calls by about 3.6×–30×, and delivers runtime speedups of roughly 2.9×–12.4× when both online and offline phases are counted.
- When counting only online test‑time cost, LightMem cuts tokens by up to about 105.9× and API calls by up to about 159.4× relative to other memory systems.
LongMemEval (Qwen3‑30B backbone):
- LightMem improves accuracy by up to about 7.67 percentage points over A‑Mem, with configurations reaching about 70.2% accuracy.
- It reduces total tokens by around 6.9×–21.8× and API calls by roughly 3.3×–17.1×, with runtime speedups of about 1.6×–6.3×.
LoCoMo (GPT‑4o‑mini backbone):
- Memory baselines such as A‑Mem, MemoryOS, and Mem0 generally reach accuracies in the mid‑50s to mid‑60s (percent), while LightMem variants reach around 70–73%, gains of roughly 6.1–18.1 percentage points.
- LightMem reduces total tokens by about 2.87×–20.92×, reduces API calls by about 13.29×–39.78×, and delivers runtime speedups of around 2.63×–8.21×.
LoCoMo (Qwen3‑30B backbone):
- LightMem configurations achieve around 71–73% accuracy, exceeding baselines by roughly 4.4–29.3 percentage points.
- It cuts total tokens by about 3.33×–18.02×, API calls by about 12.96×–55.48×, and runtime by around 1.18×–5.57×.
Analyses of submodules
- Pre‑compression: Compressing to 50–80% of the original tokens on LongMemEval preserves QA performance close to that of uncompressed input while drastically reducing tokens; the compression model runs in under 2 GB of GPU memory and adds negligible overhead.
- Topic segmentation: The hybrid attention‑plus‑similarity method achieves over 80% segmentation accuracy against ground‑truth session boundaries and outperforms attention‑only or similarity‑only variants.
- Ablation: Removing topic segmentation slightly improves efficiency but drops QA accuracy by about 6.3 percentage points (GPT) and 5.4 percentage points (Qwen), showing its importance for preserving semantic units.
- STM threshold: Larger STM thresholds consistently improve efficiency (fewer calls, less token usage) but affect accuracy in a non‑monotonic way; optimal thresholds depend on the compression ratio and model, reflecting a trade‑off between cost and performance.
- Sleep‑time update: Soft, append‑only updates at test time avoid the irreversible information loss that mis‑handled real‑time edits can cause, while offline parallel updates use similarity queues and timestamps to reconcile and consolidate memories with low latency (a minimal sketch of this time‑aware rule follows below).
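The time‑aware update rule can be pictured as a small offline consolidation pass over the long‑term store: entries are compared by embedding similarity, and within a near‑duplicate pair only the newer entry is allowed to update the older one. Below is a minimal sequential sketch under those assumptions; it reuses the MemoryEntry sketch above, and the paper's batched, parallel execution and exact merge rule are not reproduced here:

```python
from typing import List

import numpy as np

def consolidate(entries: List["MemoryEntry"],
                sim_threshold: float = 0.85) -> List["MemoryEntry"]:
    """Offline 'sleep-time' pass: merge near-duplicate entries, newest wins.

    Similarity is cosine over summary embeddings, and only a newer entry may
    update an older one (illustrative rule, not the paper's implementation).
    """
    # Process oldest first so that later (newer) entries update earlier ones.
    ordered = sorted(entries, key=lambda e: e.timestamp)
    kept: List["MemoryEntry"] = []
    for entry in ordered:
        merged = False
        for old in kept:
            cos = float(entry.embedding @ old.embedding /
                        (np.linalg.norm(entry.embedding) *
                         np.linalg.norm(old.embedding)))
            if cos >= sim_threshold and entry.timestamp >= old.timestamp:
                # Newer entry supersedes the near-duplicate older one.
                old.summary = entry.summary
                old.embedding = entry.embedding
                old.raw_turns.extend(entry.raw_turns)
                old.timestamp = entry.timestamp
                merged = True
                break
        if not merged:
            kept.append(entry)
    return kept
```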