
Company news: Lightbits and ScaleFlux demo 100x to 280x KV Cache acceleration

Customer Reviews
The sales staff of Beijing Qianxing Jietong Technology Co., Ltd. are very professional and patient. They provide quotes quickly. The quality and packaging of the products are also very good. Our cooperation goes very smoothly.

—— Festfing DV LLC

When I urgently needed Intel CPUs and Toshiba SSDs, Sandy from Beijing Qianxing Jietong Technology Co., Ltd. gave me a lot of help and got the products to me quickly. I really appreciate her.

—— Kitty Yen

Sandy of Beijing Qianxing Jietong Technology Co., Ltd. is a very attentive salesperson who reminded me of configuration mistakes in time when I was buying a server. The engineers are also very professional and completed the testing process quickly.

—— Strelkin Mikhail Vladimirovich

We are very satisfied with our experience working with Beijing Qianxing Jietong. The product quality is excellent and delivery is always on time. Their sales team is professional, patient, and very helpful with all our questions. We sincerely appreciate their support and look forward to a long-term partnership. Highly recommended!

—— Ahmad Navid

Quality: a very good experience with my supplier. The MikroTik RB3011 was pre-owned, but it was in very good condition and everything worked perfectly. All my problems were solved quickly. A very reliable supplier. Highly recommended.

—— Jeran Kolesio

Lightbits and ScaleFlux demo 100x to 280x KV Cache acceleration
Lightbits Labs and ScaleFlux have achieved a 100x to 280x performance boost for KV cache workloads by leveraging LightInferra cache software to read data from ScaleFlux computational storage SSDs.

The two companies supplied KV cache data to GPUs deployed within a FarmGPU data center environment, and will showcase this breakthrough at Nvidia’s upcoming GTC conference. A KV cache stores token vectors in a GPU’s high-bandwidth memory (HBM). Once HBM capacity is exhausted, KV cache data blocks must be recalculated — a process that consumes time and degrades AI training and inference speeds. This slowdown becomes especially pronounced as AI workloads scale up and the number of tokens used to generate vectors rises sharply.

KV cache software logically expands the cache layer outward: first to the x86 CPU and its DRAM on the GPU server, then to local NVMe drives in the same x86 system, and further to external NVMe SSDs. This tiered expansion eliminates the need to recompute token vectors. While NVMe SSDs naturally have higher access latency than HBM or DRAM, retrieving precomputed token vectors is far faster than recalculating tens of thousands of them from scratch. Lightbits and ScaleFlux claim their solution drastically accelerates KV cache data retrieval from SSDs.
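As a rough mental model, the tier chain described above behaves like a fall-through lookup: check the fastest tier first and only recompute when every tier misses. The sketch below is our own illustration with made-up tier names and contents, not Lightbits code; a real implementation would also promote and evict blocks between tiers rather than just read them.

```python
# Illustrative sketch of a tiered KV-cache lookup, mirroring the
# HBM -> host DRAM -> local NVMe -> external NVMe expansion chain.
# Tier names and block contents are hypothetical.

TIERS = [
    ("HBM",           {"blk_a": b"vec_a"}),
    ("host_DRAM",     {"blk_b": b"vec_b"}),
    ("local_NVMe",    {"blk_c": b"vec_c"}),
    ("external_NVMe", {"blk_d": b"vec_d"}),
]

def lookup(block_id):
    """Return (tier_name, data) for the first tier holding the block,
    or (None, None) if it misses everywhere and must be recomputed."""
    for name, store in TIERS:
        if block_id in store:
            return name, store[block_id]
    return None, None  # full miss: recompute the token vectors from scratch

print(lookup("blk_c"))  # → ('local_NVMe', b'vec_c'): slower than HBM, far faster than recompute
print(lookup("blk_x"))  # → (None, None): only now does recomputation kick in
```

The point of the fall-through order is exactly the latency argument in the text: each step outward is slower, but every tier is still orders of magnitude cheaper than regenerating tens of thousands of token vectors.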

Arthur Rasmusson, Director of AI Architecture at Lightbits Labs, stated: “We’re transforming inference memory from a reactive cache into an intelligent, streamed data layer.”

How?


“By prefetching only the data that matters and delivering it to GPUs over high-speed RDMA before it is needed, we eliminate the stalls that traditionally limit long-context performance. The result is lower Time-to-First-Token (TTFT), more stable throughput under real-world load, and significantly higher effective GPU utilization.”

Keith McKay, Senior Director of Solutions Architecture and Technical Partnerships at ScaleFlux, commented: “What we’re showing at GTC is an early look at how smarter data placement and persistent attention state management could help inference systems stay responsive as context windows grow. This is very much a collaboration we want to shape alongside real operators.”

Both Lightbits and ScaleFlux aim to encourage cloud and infrastructure operators to adopt their software and SSDs, eliminating costly GPU idle time.

Let’s first examine ScaleFlux’s contribution, then move to the more sophisticated Lightbits software layer.

ScaleFlux provides NVMe SSDs and Computational Storage Drives (CSDs) equipped with hardware-based Write Reduction Technology (WRT). Powered by hardware-accelerated compression and SoC-driven metadata management, these drives deliver up to four times more logical capacity than physical storage, while remaining fully transparent to host systems. The company is a member of the Open Flash Platform (OFP) consortium, which is working to redefine AI data infrastructure with dense, low-latency, power-efficient systems — offering 10x the density of conventional file-based AI storage and just one-tenth the power consumption.
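The "up to four times more logical capacity" claim reduces to simple arithmetic: logical capacity is physical capacity times the achieved compression ratio, capped at the drive's advertised expansion limit. A minimal sketch (the function name and sample figures are ours, not ScaleFlux specifications):

```python
def logical_capacity_tb(physical_tb, compression_ratio, max_expansion=4.0):
    """Usable logical capacity behind a transparently compressing drive,
    capped at the advertised expansion limit (4x here, per the
    'up to four times' Write Reduction Technology claim)."""
    return physical_tb * min(compression_ratio, max_expansion)

# Hypothetical examples: an 8 TB physical drive.
print(logical_capacity_tb(8, 2.5))  # → 20.0 TB logical at a 2.5:1 ratio
print(logical_capacity_tb(8, 6.0))  # → 32.0 TB: a 6:1 ratio is capped at 4x
```

The actual ratio depends on how compressible the workload's data is, which is why the vendor hedges with "up to."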

Building on these storage drives, Lightbits adds intelligent prefetching of KV Cache data before GPUs require it, preventing stalls caused by insufficient KV capacity or costly token vector recomputation. Its LightInferra software uses KV Cache-optimized caching algorithms to pull required data into GPU memory at RDMA speeds ahead of actual demand.

Again, how?


The software runs on the x86 host embedded within GPU servers and tracks access patterns of KV Cache data blocks. Using this telemetry, it operates a Sub-Linear Sparse Attention Prefetch (SLSAP) engine to identify the KV blocks most likely to be needed next.

This engine combines locality-sensitive hashing (LSH) with statistical reuse modeling — analyzing historical access locality in attention computations — to score and prioritize KV blocks, then selects those with the highest probability of being requested by GPUs.
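Random-hyperplane hashing is the classic construction for this kind of LSH: vectors pointing in similar directions land on the same side of each hyperplane and so share a bucket, letting their KV blocks be scored together. The toy hyperplanes and vectors below are ours, chosen only to make the collision visible; they are not the SLSAP engine's actual scheme.

```python
import numpy as np

# Fixed hyperplanes for a 4-bit locality-sensitive hash over 4-dim vectors.
# A real system would hash attention-query sketches of each KV block.
planes = np.array([
    [1.0, 0.0,  0.0, 0.0],
    [0.0, 1.0,  0.0, 0.0],
    [0.0, 0.0,  1.0, 0.0],
    [1.0, 1.0, -1.0, 0.0],
])

def lsh_bucket(vec):
    """Map a vector to a 4-bit bucket id: one bit per hyperplane,
    set when the vector lies on the plane's positive side."""
    bits = planes @ np.asarray(vec, dtype=float) > 0
    return int("".join("1" if b else "0" for b in bits), 2)

a = [0.9, 0.8, -0.5, 0.1]
b = [1.0, 0.7, -0.4, 0.2]   # close to a: collides with a's bucket
c = [-0.9, -0.8, 0.5, 0.1]  # roughly opposite: different bucket

print(lsh_bucket(a), lsh_bucket(b), lsh_bucket(c))  # → 13 13 2
```

Bucket collisions are cheap to compute, so historical access statistics can be aggregated per bucket instead of per individual block.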

This selection process leverages the inherent sparsity in GPU data access: most tokens only meaningfully relate to a small subset of previous tokens. By isolating these high-probability blocks, the solution drastically reduces the volume of token vectors that must be streamed back to GPUs.

A second algorithm focuses on reuse patterns: recent tokens, semantically similar tokens, and structural patterns common in RAG or multi-turn chat scenarios are frequently reused and prioritized accordingly.

LightInferra retrieves these token blocks first from the x86 server’s DRAM, or from external ScaleFlux SSDs if necessary, then preloads them into the GPU’s HBM via RDMA links.

Lightbits has benchmarked this approach against re-computing cached content from scratch using large language model workloads, measuring improvements in Time-to-First-Token (TTFT). The reported 100x to 280x acceleration figures are derived directly from these test results.


Of course, we’d love to see benchmark results comparing the Lightbits-ScaleFlux KV Cache acceleration scheme with KV Cache accelerators from DDN, Hammerspace, VAST Data, WEKA and others, but they are not available.


There are charts showing how LightInferra-ScaleFlux progressively improves on cache-regeneration TTFT as the model size increases.




All related benchmark data is presented in log-scale charts, tailored primarily for computer science professionals, but plain language makes the real-world impact far easier to grasp: “The outcome is sustained Time-to-First-Token (TTFT) performance as context scales from 100k tokens toward 1 million and beyond.”

As Jonmichael Hands of FarmGPU puts it, when a 400k-token conversation resumes and the system has to regenerate the entire KV cache from scratch, that means two full minutes of GPU runtime with zero tokens produced. LightInferra changes the economic model entirely: the same workload generates its first token in under half a second, turning a non-viable product tier into a profitable one.
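That anecdote is easy to sanity-check: two minutes of regeneration versus roughly half a second to first token is a ~240x improvement, inside the reported 100x to 280x band. A trivial sketch of the arithmetic (the helper name is ours; the timings are the article's round figures):

```python
def ttft_speedup(recompute_s, cached_s):
    """Speedup = baseline time-to-first-token (full KV-cache
    regeneration) divided by the accelerated, cache-served TTFT."""
    return recompute_s / cached_s

# Round figures from the FarmGPU anecdote: ~120 s to regenerate a
# 400k-token KV cache vs ~0.5 s when it is streamed back from cache.
print(ttft_speedup(120.0, 0.5))  # → 240.0, within the reported 100x-280x range
```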

Lightbits and ScaleFlux have designed this joint solution specifically for next-gen neocloud GPU farms, where large GPU pods run hundreds or even thousands of concurrent AI model workloads. Nearly every one of these workloads will hit the limit of KV cache capacity in the GPU’s high-bandwidth memory (HBM).

 Under traditional setups, teams face two costly options: slowly fetching token vectors from generic external storage, or the far more time-consuming process of recomputing those vectors from scratch—both of which leave GPUs sitting idle for hours on end. The LightInferra and ScaleFlux combination eliminates this crippling industry pain point entirely.

FarmGPU CEO Jonmichael Hands added: “Fast networked storage from Lightbits unlocks a wealth of new use cases for long-context inference. By pairing our managed service with Lightbits’ high-performance storage running on ScaleFlux NVMe drives, we can cut time to first token and boost GPU utilization, drastically lowering the total cost of ownership (TCO) for inference workloads.”

Beijing Qianxing Jietong Technology Co., Ltd.
Sandy Yang/Global Strategy Director
WhatsApp / WeChat: +86 13426366826
Email: yangyd@qianxingdata.com
Website: www.qianxingdata.com/www.storagesserver.com

Business Focus:
ICT Product Distribution/System Integration & Services/Infrastructure Solutions
With 20+ years of IT distribution experience, we partner with leading global brands to deliver reliable products and professional services.
“Using Technology to Build an Intelligent World.” Your Trusted ICT Product Service Provider!
Publication time: 2026-03-18 11:34:46