Introducing Storage Buckets on the Hugging Face Hub

Hugging Face recently introduced storage buckets. The Hugging Face Hub has become an essential platform for sharing models and datasets and for collaborating on them. However, training deep learning models generates large numbers of intermediate files that change frequently and need management, such as checkpoints, optimizer states, and processed data shards, and these files are a poor fit for the Hub's existing Git-based storage. Recognizing this gap, Hugging Face built storage buckets to improve productivity and support efficient data management.

Storage buckets provide mutable, S3-like object storage that can be browsed on the Hub and used from Python scripts. They are built on Xet technology, which efficiently handles ML artifacts with significant overlap between files, reducing bandwidth, speeding up transfers, and using storage space more efficiently. For Enterprise customers, deduplication also reduces billed storage.

Why Storage Buckets?

When training clusters continuously log checkpoints and optimizer states, data pipelines repeatedly process raw data, and agents store traces, memories, and shared knowledge graphs, Git is not the right abstraction. Storage buckets are designed for these workloads. Buckets are unversioned storage containers that live within a user or organization namespace, follow standard Hugging Face permissions, can be public or private, have pages you can view in a browser, and can be addressed programmatically with handles such as hf://buckets/username/my-training-bucket.
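The handle format is regular enough to take apart with the standard library. The helper below is purely illustrative and not part of huggingface_hub:

```python
from urllib.parse import urlsplit

def parse_bucket_handle(handle: str) -> dict:
    """Split an hf://buckets/... handle into namespace, bucket, and path.

    Illustrative helper, not part of huggingface_hub.
    """
    parts = urlsplit(handle)
    # urlsplit places "buckets" in netloc and the rest in path
    segments = [s for s in parts.path.split("/") if s]
    if parts.scheme != "hf" or parts.netloc != "buckets" or len(segments) < 2:
        raise ValueError(f"expected hf://buckets/<namespace>/<bucket>[/<path>]: {handle}")
    return {
        "namespace": segments[0],
        "bucket": segments[1],
        "path": "/".join(segments[2:]),
    }

print(parse_bucket_handle("hf://buckets/username/my-training-bucket/checkpoints"))
```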

The Importance of Xet Technology

Storage buckets are built on Xet, Hugging Face's chunk-based, deduplicating storage backend, and this matters. Rather than treating each file as a single blob, Xet splits files into chunks and deduplicates across them. When you upload a processed dataset that is nearly identical to the raw one, most chunks already exist; the same happens when you save successive checkpoints with most of the model's parameters frozen. The result is less bandwidth, faster transfers, and less storage used.

These characteristics naturally suit ML workloads. Training pipelines consistently generate families of related artifacts: raw and processed data, successive checkpoints, agent traces, and derived summaries. Xet is designed to exploit this redundancy. For Enterprise customers, billing is based on deduplicated storage, so shared chunks are billed only once. Deduplication benefits both speed and cost, and storage buckets inherit these advantages.
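A toy sketch can show why this pays off. Note that Xet itself uses content-defined chunking, which survives insertions; the fixed-size chunks and the synthetic "checkpoints" below are simplifications for illustration only:

```python
import hashlib

def chunk_ids(data: bytes, chunk_size: int = 64) -> list:
    """Split data into fixed-size chunks and hash each one.

    A real backend like Xet uses content-defined chunk boundaries;
    fixed-size chunks keep this sketch short.
    """
    return [
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()
        for i in range(0, len(data), chunk_size)
    ]

# Two synthetic "checkpoints" that differ in only a small region.
v1 = bytes(4096)
v2 = bytes(2048) + b"x" * 8 + bytes(4096 - 2048 - 8)

store = set(chunk_ids(v1))                          # chunks already uploaded
new = [c for c in chunk_ids(v2) if c not in store]  # only these need transfer
print(f"{len(new)} new chunks out of {len(chunk_ids(v2))}")
```

Only the chunk containing the modified bytes has to be uploaded; every other chunk is already present on the backend.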

Moving Data Closer to Computation: Prewarming

Storage buckets live on the Hub and are served from global storage by default. But workloads do not always run where the data is, and for distributed training and large-scale pipelines the location of storage directly affects throughput. Prewarming lets you move hot data closer to the compute that reads it, into a specific cloud provider and region. Instead of pulling data across regions on every read, the bucket is prepared near the job before it starts. This is particularly useful for large datasets, for pipelines whose stages run in different clouds, and for training clusters that need fast access to checkpoints. Prewarming currently works with AWS and GCP, and more cloud providers will be added in the future.

Getting Started

You can create a bucket and start using it in under two minutes with the hf CLI. First, install it and log in:

curl -LsSf https://hf.co/cli/install.sh | bash
hf auth login

Create a bucket for your project:

hf buckets create my-training-bucket --private

If your training job writes checkpoints to a local directory, such as ./checkpoints, synchronize that directory with the bucket using the following command:

hf buckets sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints

For large transfers, it’s a good idea to preview what will happen first. The --dry-run option outputs the plan without performing the actual actions:

hf buckets sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints --dry-run

You can save the plan to a file for review and apply it later:

hf buckets sync ./checkpoints hf://buckets/username/my-training-bucket/checkpoints --plan sync-plan.jsonl
hf buckets sync --apply sync-plan.jsonl
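Because the plan is a JSONL file, it can be inspected with a few lines of Python before applying it. The field names below are assumptions for illustration; the actual schema written by `hf buckets sync --plan` may differ:

```python
import json

# Hypothetical plan entries; the real fields emitted by
# `hf buckets sync --plan` may differ.
plan_lines = [
    '{"op": "upload", "path": "checkpoints/step-100.pt", "size": 1024}',
    '{"op": "skip", "path": "checkpoints/step-050.pt", "size": 1024}',
]

entries = [json.loads(line) for line in plan_lines]
uploads = [e for e in entries if e["op"] == "upload"]
total = sum(e["size"] for e in uploads)
print(f"{len(uploads)} uploads, {total} bytes")
```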

Once the job is complete, list the contents of the bucket using the following command:

hf buckets list username/my-training-bucket -h

Or open it directly in your browser at https://huggingface.co/buckets/username/my-training-bucket.

Using Buckets with Python

All of the operations above are also available through the Python API in huggingface_hub, starting with v1.5.0. The API follows the same pattern: create, sync, and inspect. Here is an example:

from huggingface_hub import create_bucket, list_bucket_tree, sync_bucket

# Create the bucket (no error if it already exists).
create_bucket("my-training-bucket", private=True, exist_ok=True)

# Mirror the local checkpoint directory into the bucket.
sync_bucket(
    "./checkpoints",
    "hf://buckets/username/my-training-bucket/checkpoints",
)

# Walk the synced tree and print each entry with its size.
for item in list_bucket_tree(
    "username/my-training-bucket",
    prefix="checkpoints",
    recursive=True,
):
    print(item.path, item.size)

File System Integration

Storage buckets also work through HfFileSystem, an fsspec-compatible file system in huggingface_hub. This means you can list, read, write, and glob bucket contents using standard file system operations, and any library that supports fsspec can access storage buckets directly without extra configuration. For example, libraries such as pandas, Polars, and Dask can read from and write to storage buckets using hf:// paths.
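For example, after globbing a checkpoint prefix with HfFileSystem you might pick the latest checkpoint by the step number in its name. The selection logic below is plain Python; the glob call shown in the docstring needs authentication and an illustrative bucket path, so it is not executed here:

```python
import re

def latest_checkpoint(paths):
    """Return the path with the highest step number in its name.

    With HfFileSystem the input would come from something like:
        fs = HfFileSystem()
        paths = fs.glob("hf://buckets/username/my-training-bucket/checkpoints/*")
    (bucket path illustrative)
    """
    def step(path):
        match = re.search(r"(\d+)", path)
        return int(match.group(1)) if match else -1

    return max(paths, key=step)

print(latest_checkpoint(["checkpoints/step-2.pt", "checkpoints/step-10.pt"]))
```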

From Buckets to Versioned Repositories

Storage buckets are a fast, mutable workspace for files in progress. When something becomes a stable, publishable artifact, it typically belongs in a versioned model or dataset repository. Planned support for direct transfers between buckets and repositories will let you promote a final checkpoint to a model repository, or commit processed shards to a dataset repository, once a pipeline completes. The working layer and the publishing layer stay separate but fit into a single Hub-native workflow.

Thanks to Our Launch Partners

Before making buckets available to everyone, we ran a private beta with a small group of launch partners. Thank you to Jasper, Arcee, IBM, and PixAI for testing early versions, finding bugs, and giving feedback that directly shaped the feature.

Conclusion and Resources

Storage buckets add a missing storage layer to the Hub: a Hub-native place for everything useful before the final result, including checkpoints, processed data, agent traces, and logs. Built on Xet, they handle the families of related artifacts that AI systems constantly generate more efficiently than forcing them into Git, with less bandwidth, faster transfers, and Enterprise billing based on deduplicated storage. Existing Hub users can keep more of their workflow in one place, and the familiar S3-style model comes with a clear path to final publication on the Hub.

Storage buckets are included in existing Hub storage plans. Free accounts have storage space to get started, and PRO and Enterprise plans provide higher limits. See the storage pricing page for details.


Original source: Introducing Storage Buckets on the Hugging Face Hub
