Andrea Fabrizi (AI Storage Solutions Product Manager)

AI Tips: Efficient management of long inference AI sessions

May 4, 2026

Comparing solutions offered by KV-Cache offload and Google TurboQuant

The rapid diffusion of large language models (LLMs) and the explosion of agentic AI have created a critical infrastructure challenge: efficiently managing the increasingly long inference sessions. As AI agents evolve to manage complex, multi-step workflows, the context window has expanded from thousands to hundreds of thousands of tokens, but the GPU HBM memory can’t keep up with this growth.

In a previous blog, I discussed the importance of storage to KV-cache. Today, I want to examine the two predominant architectural methods for addressing this increasingly long inference sessions management problem: KV-cache offload (the KV-cache tiered with storage), and Google TurboQuant.

A new era of long sessions

The original paradigm of LLM inference, which relied solely on human interaction (the chatbot), assumed short, isolated exchanges. However, the current generative AI landscape is increasingly dominated by AI agents—autonomous systems capable of planning, using tools, and engaging in multi-turn reasoning. These agents operate over longer periods, maintaining context across hours or days of interaction.

This shift has caused a "session explosion" problem. Unlike a simple chatbot, an AI agent might need to keep track of the history of a code review, the results of a database query, and the logs from a file system operation, all at the same time. As a result, the context window has ballooned. A prompt of 100,000 tokens is no longer rare; it is now a normal requirement for enterprise-level agents.

The bottleneck lies in the Key-Value (KV) Cache. In transformer architectures, the KV cache grows quadratically with the sequence length (O(N^2)). For a 100k-token context, the KV cache can consume more memory than the model itself. Standard GPUs have limited VRAM (up to 200 GB). Once the KV-cache fills the VRAM, the inference process halts, resulting in Out-Of-Memory (OOM) errors. This can lead to hallucinations, force developers to truncate context (losing important information), or use expensive multi-GPU setups (increasing expense significantly).

Introduction to the Technologies

To address the memory constraints of long prompts, three distinct technologies have emerged, each targeting a different layer of the memory stack.

KV-cache offloading (KV-cache tiered with storage)

KV-cache offload is a flexible memory management technique that expands available memory beyond the physical limits of the GPU. It uses a hierarchy of storage: fast, costly GPU VRAM for active tokens and slower, less expensive NVMe for inactive tokens. The system employs algorithms like Least Recently Used (LRU) to determine which parts of the KV cache are no longer needed for immediate token prediction and transfer them to slower storage. When those tokens are required again, they are swapped back into VRAM.

Pros:

  • Scalability: Effectively extends the context window to millions of tokens by utilizing system RAM.
  • Cost-Effective: Leverages cheap, abundant system memory rather than requiring expensive, high-bandwidth GPU memory.
  • Maturity: The concept is well-understood and can be implemented using standard operating system paging mechanisms or custom kernel implementations.

Cons:

  • Latency Overhead: Swapping data between CPU RAM and GPU VRAM introduces latency. If the system swaps too aggressively, the AI agent may experience stuttering or slower response times.
  • Complexity: Implementing a robust tiering system requires careful tuning of swap thresholds and eviction policies to balance speed and capacity.

Google TurboQuant

Google TurboQuant is an advanced AI compression algorithm designed to drastically reduce the memory (KV cache) usage of large language models (LLMs) by 4 to 6 times, while enabling up to 8 times faster attention computation. It operates through a two-stage process: PolarQuant, which compresses AI vectors by transforming them into a polar coordinate "shorthand," and QJL (Quantized Johnson-Lindenstrauss), which corrects residual errors.

Pros:

  • Capacity: Significantly reduces the KV-cache footprint, creating "headroom" for longer sessions.
  • Throughput: Lower memory bandwidth usage can lead to faster inference speeds on hardware with limited memory bandwidth.

Cons:

  • While touted for high efficiency, some early analyses suggest that the claimed speed advantages depend on specific benchmarks.
  • It can be very computationally expensive.
  • It is a partial solution. It reduces the session KV-cache footprint, but it doesn’t fully solve the overall KV-cache growing problem.

Comparison

Feature
KV-cache offload
Google TurboQuant
Primary focusDynamic management of session memory (KV Cache)Compression of model weights
Impact on ContextDirectly enables very long contexts (100k+ tokens)Indirectly enables long contexts by freeing space
Cost efficiencyHigh: Uses NVMe SSDsMedium: Requires quantization pipeline
Latency impactModerate: Dependent on storage speedLow: mostly compute-bound
Accuracy impactNone: The model weights remain intactPotential degradation quantization noise
Maturity levelHigh: Standard in many frameworks and in some innovative storage solutionsMedium: Specific to optimized models (e.g., Gemma)

Conclusion

The deployment of AI agents capable of managing long, complex sessions presents a formidable memory challenge.

While Google TurboQuant offers an effective method for KV-cache compression, it doesn’t fully address the core issue of the expanding KV-cache, as KV-cache offload does. TurboQuant does not solve the problem of session growth; it simply creates some more space for it.

In contrast, the KV-cache offload directly addresses the problem by using large, inexpensive storage systems to provide nearly unlimited resources for caching expensive GPU VRAM. By smartly swapping out inactive tokens, it enables AI agents to maintain context windows of unprecedented size without the high costs associated with multi-GPU setups. Moreover, the KV-cache offload is model-agnostic and a more mature solution. Thus, for developers and enterprises aiming to build resilient, long-context AI agents, KV-cache offload remains the fundamental and most reliable approach.

Stay tuned to the HPE Developer Community blogs and AI Tips for more guides and best practices on AI and Storage for AI.

Related

Madhukar Rupakumar

AI DevCon 2026: Why the future of enterprise AI is a data and storage engineering problem

Apr 3, 2026
Saira Kennedy

Artificial Intelligence and Machine Learning: What Are They and Why Are They Important?

Nov 12, 2020
Vandewilly Silva

Bringing AI assistants to GreenLake with MCP Servers

Dec 3, 2025
Philipp Offenhäuser & Christopher Williams

Closing the gap between High-Performance Computing (HPC) and artificial intelligence (AI)

Sep 15, 2023
Carol McDonald

Demystifying AI, Machine Learning and Deep Learning

Nov 25, 2020
Dave Wright and Elias Alagna

Deploying a Small Language Model in HPE Private Cloud AI using a Jupyter Notebook

Feb 20, 2025
Lydia Duncan, Michelle Strout

Distributed Tuning in Chapel with a Hyperparameter Optimization Example

Oct 9, 2024
Rachel Silver

End-to-End Machine Learning Using Containerization

Feb 5, 2021