SPEED-Bench: A Unified and Diverse Benchmark for Accelerated Inference
Introduction
Recent research has explored many ways to improve the inference speed of large language models (LLMs). One of the most prominent is speculative decoding (SD), which significantly improves throughput by using a lightweight draft model to speculate multiple future tokens, which the target model then verifies in parallel. However, existing SD evaluation methods have limitations, including a lack of data diversity, short input sequence lengths, and inconsistent behavior across inference stacks.
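The draft-then-verify loop described above can be sketched as follows. This is a minimal greedy (argmax) illustration with toy next-token functions, not SPEED-Bench's or any engine's actual implementation:

```python
# Minimal greedy speculative decoding sketch (illustrative assumptions:
# deterministic argmax models represented as next-token functions).
from typing import Callable, List

def speculative_decode(
    target: Callable[[List[int]], int],   # next-token fn of the target model
    draft: Callable[[List[int]], int],    # next-token fn of the cheap draft model
    prompt: List[int],
    gamma: int = 4,        # tokens the draft speculates per step
    max_new: int = 16,
) -> List[int]:
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1) The draft speculates gamma tokens autoregressively.
        spec, ctx = [], list(tokens)
        for _ in range(gamma):
            t = draft(ctx)
            spec.append(t)
            ctx.append(t)
        # 2) The target verifies them (a real engine does this in one parallel pass).
        accepted = 0
        for i, t in enumerate(spec):
            if target(tokens + spec[:i]) == t:
                accepted += 1
            else:
                break
        tokens.extend(spec[:accepted])
        # 3) The target emits one token of its own after each verification step,
        #    so at least one token is always produced.
        tokens.append(target(tokens))
    return tokens[len(prompt):][:max_new]
```

The more often the draft agrees with the target, the more tokens each verification step yields, which is exactly the quantity SPEED-Bench sets out to measure.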
To address these issues, this article introduces SPEED-Bench, released by Hugging Face, and explores its components, the analysis results it enables, and why rigorous SD evaluation matters. SPEED-Bench is designed to accurately evaluate SD performance across diverse semantic domains and realistic serving environments.
What is SPEED-Bench?
SPEED-Bench evaluates SD from two perspectives. First, draft quality depends on the semantic domain and the entropy of the input text. Second, the speedup achieved in a real serving environment depends on batch size, input sequence length (ISL), and system constraints. SPEED-Bench builds a benchmarking ecosystem that accounts for both.
SPEED-Bench combines two custom data splits and a unified measurement framework to identify various aspects of SD.
- “Qualitative” data split: Designed to maximize semantic diversity and measure draft accuracy.
- “Throughput” data split: Designed to evaluate system-level speedups across various ISLs and difficulty levels.
- Unified measurement framework: Standardizes evaluation across systems.
Qualitative Split: Semantic Coverage and Draft Accuracy
The goal of the Qualitative split is to measure the quality of SD, specifically conditional acceptance rate (AR) and acceptance length, across various semantic domains. Previous benchmarks have failed to properly evaluate SD performance due to limited scale and diversity. Therefore, SPEED-Bench collected data from 18 public data sources and organized it into 11 categories: coding, mathematics, humanities, STEM, writing, summarization, role-playing, RAG, multilingual, reasoning, and QA. Each category contains 80 samples, providing significantly greater diversity than existing benchmarks.
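The two draft-quality metrics named above can be computed from per-verification-step logs roughly as follows; the field names `accepted` and `proposed` are hypothetical, not SPEED-Bench's actual schema:

```python
# Hedged sketch: draft-quality metrics from per-verification-step logs.
# accepted[i] = drafted tokens the target accepted at step i (assumed field)
# proposed[i] = tokens the draft proposed at step i (assumed field)
def acceptance_metrics(accepted, proposed):
    # Conditional acceptance rate: fraction of drafted tokens the target accepts.
    ar = sum(accepted) / sum(proposed)
    # Mean acceptance length: accepted tokens plus the one token the target
    # itself emits at every verification step.
    mean_len = sum(a + 1 for a in accepted) / len(accepted)
    return ar, mean_len
```

A mean acceptance length of 3.0, for example, means each target forward pass yields three tokens on average, roughly a 3x reduction in target-model passes.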
Furthermore, SPEED-Bench uses a text embedder to map candidate prompts into a vector space and applies a selection algorithm that minimizes average pairwise cosine similarity, securing high semantic diversity. This allows SD performance to be differentiated clearly across domains.
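A diversity-driven selection of this kind might look like the greedy sketch below. The objective (keeping pairwise cosine similarity low) follows the description above, but the greedy farthest-point strategy is an assumption; the benchmark's exact algorithm isn't specified here:

```python
# Greedy diversity selection sketch over prompt embeddings (assumed strategy).
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def select_diverse(embeddings, k):
    chosen = [0]  # seed with the first candidate (arbitrary choice)
    while len(chosen) < k:
        # Add the candidate whose worst-case similarity to the chosen set
        # is smallest, i.e. the one farthest from everything picked so far.
        best = min(
            (i for i in range(len(embeddings)) if i not in chosen),
            key=lambda i: max(cosine(embeddings[i], embeddings[j]) for j in chosen),
        )
        chosen.append(best)
    return chosen
```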
Throughput Split: Realistic Service Workloads
While the Qualitative split is useful for measuring draft accuracy, it is not sufficient for evaluating system-level speedups. The Throughput split of SPEED-Bench is designed to address this. It defines ISL buckets ranging from 1k to 32k tokens, and for each bucket collects data at three difficulty levels: easy, medium, and hard. SPEED-Bench also avoids random-token inputs, which would overestimate speedups, making it suitable for evaluating SD performance under realistic serving conditions.
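Assigning a prompt to an ISL bucket can be sketched as follows, assuming power-of-two bucket edges spanning the 1k–32k range mentioned above (the actual edges are an assumption):

```python
# ISL bucketing sketch. Bucket edges are illustrative assumptions spanning
# the 1k-32k range; SPEED-Bench's actual edges are not specified here.
import bisect

BUCKETS = [1024, 2048, 4096, 8192, 16384, 32768]  # token counts (assumed)

def isl_bucket(num_tokens):
    # Return the smallest bucket that fits the prompt, or None if it
    # exceeds the largest bucket.
    i = bisect.bisect_left(BUCKETS, num_tokens)
    return BUCKETS[i] if i < len(BUCKETS) else None
```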
Unified Measurement Framework
Benchmarking SD across different inference engines is challenging. Each engine may use different chat templates, handle BOS tokens differently, or tokenize inputs inconsistently, and these differences can confound measurements of the SD algorithm itself. SPEED-Bench therefore introduces a lightweight measurement framework that handles tokenization and prompt formatting externally. The framework integrates with production inference engines such as TensorRT-LLM, vLLM, and SGLang, enabling standardized evaluation.
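The core idea, normalizing prompt formatting before any engine sees the request, can be illustrated with a toy sketch. The template string and whitespace tokenizer below are illustrative assumptions, not the framework's actual code:

```python
# Sketch of external prompt normalization: apply one canonical template and
# one BOS policy up front, so every engine receives identical token ids.
def canonical_prompt(user_msg: str, add_bos: bool = True) -> str:
    # One fixed template, applied outside the engine (format is assumed).
    text = f"<|user|>\n{user_msg}\n<|assistant|>\n"
    # BOS is applied exactly once here, never again inside the engine.
    bos = "<s>" if add_bos else ""
    return bos + text

def tokenize(text: str):
    # Stand-in tokenizer; in practice all engines would share one real
    # tokenizer so their inputs match token-for-token.
    return text.split()
```

Because every engine consumes the same pre-formatted input, any remaining difference in acceptance length or throughput can be attributed to the engine's SD implementation rather than to prompt-handling quirks.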
Analysis Results of SPEED-Bench
The analyses enabled by SPEED-Bench provide valuable insights into SD performance. For example, the acceptance length of SD varies greatly by domain, with some low-entropy domains yielding longer acceptance lengths. SPEED-Bench can also reveal the side effects of specific system optimizations: vocabulary cutoff, for instance, can degrade performance in some domains. These findings help improve SD performance in real-world serving environments.
Conclusion
SPEED-Bench is an important tool for evaluating SD in both research and production settings. It supports analysis of SD performance across diverse semantic domains, measurement of speedups under realistic serving workloads, and comparison of inference engines. We look forward to more accurate and realistic SD evaluations built on SPEED-Bench.
Original source: **Introducing SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding**