Getting Corpora

Roadmap Feature

Hugging Face distribution is planned but not yet implemented. Currently, corpora should be downloaded directly from their GitHub repositories. This page describes the current GitHub workflow and the upcoming distribution system.

Current Approach: GitHub

Text-Fabric corpora are currently distributed via GitHub repositories. Text-Fabric includes a built-in downloader that fetches zipped corpus data attached to GitHub releases, but every download goes through GitHub's API and is subject to its rate limits.

GitHub Rate Limiting

GitHub's API enforces rate limits that affect corpus downloads:

Anonymous users: 60 requests per hour
Authenticated users: 5,000 requests per hour (requires personal access token)

Users who download multiple corpora or check for updates frequently can exhaust these limits. When rate-limited, users must wait for the limit to reset (up to an hour), configure an authentication token, or fall back to manual cloning.
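
To make the waiting option concrete, the sketch below computes how long a client would need to pause after hitting the limit. It relies on the `X-RateLimit-Remaining` and `X-RateLimit-Reset` headers that GitHub attaches to REST API responses. The helper name `seconds_until_reset` is hypothetical and is not part of Text-Fabric or Context-Fabric.

```python
import time
from typing import Mapping, Optional


def seconds_until_reset(headers: Mapping[str, str], now: Optional[float] = None) -> int:
    """Return how many seconds to wait before the next GitHub API call.

    Returns 0 when requests remain or when the rate-limit headers are absent.
    """
    # Normalise header names, since HTTP header names are case-insensitive.
    lowered = {key.lower(): value for key, value in headers.items()}
    remaining = lowered.get("x-ratelimit-remaining")
    reset = lowered.get("x-ratelimit-reset")
    if remaining is None or reset is None:
        return 0
    if int(remaining) > 0:
        return 0
    current = time.time() if now is None else now
    # The reset header is a UTC epoch timestamp in seconds.
    return max(0, int(int(reset) - current))


# Illustrative header values, not captured from a real response.
example = {"X-RateLimit-Remaining": "0", "X-RateLimit-Reset": "1700003600"}
wait = seconds_until_reset(example, now=1700000000)
print(f"Rate limited; retry in {wait} seconds")
```

The `now` parameter exists only so the calculation can be checked against a fixed clock; in normal use it can be omitted.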

Manual Download

For now, the most reliable approach is to clone repositories directly:

bash

git clone --depth 1 https://github.com/ETCBC/bhsa.git

python

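from cfabric import Fabric

# Point Fabric at the directory that holds the .tf files,
# here the 'c' subdirectory of the cloned repository's tf/ folder
CF = Fabric('/path/to/bhsa/tf/c')
# Load the corpus features and get the API object
api = CF.loadAll()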

See the Corpus Index for links to available repositories.
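
After cloning, the path passed to `Fabric` has to point at a directory that actually contains the `.tf` files, which in the example above sits one level below `tf/`. The following sketch lists candidate directories in a cloned repository by looking for `otype.tf`. The helper `find_tf_dirs` is hypothetical, and the assumption that every usable data directory contains an `otype.tf` file is taken from the repository layout shown on this page.

```python
from pathlib import Path
from typing import List


def find_tf_dirs(repo_dir: str) -> List[Path]:
    """List directories under <repo_dir>/tf that contain an otype.tf file."""
    tf_root = Path(repo_dir) / "tf"
    if not tf_root.is_dir():
        return []
    # rglob also matches an otype.tf that sits directly inside tf/.
    return sorted({path.parent for path in tf_root.rglob("otype.tf")})


# Prints nothing if the placeholder path does not exist.
for tf_dir in find_tf_dirs("/path/to/bhsa"):
    print(tf_dir)
```

Any directory it prints can be passed to `Fabric` in place of the hard-coded path.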

Planned: Hugging Face Hub

Future versions of Context-Fabric will include a built-in downloader using Hugging Face Hub, eliminating the rate limiting issues with GitHub.

Why Hugging Face

Hugging Face Hub is already the standard platform for sharing large assets in the machine learning community. It is community-driven (anyone can create an account and upload datasets) and provides a standardized API for downloading. This is what corpus distribution needs: a reliable platform, free of GitHub's hourly API quotas, where corpus maintainers can publish independently without centralized gatekeeping.

Benefit	Description
No rate limiting	Unlimited downloads without authentication
Community-driven	Anyone can publish corpora under their namespace
Proven at scale	Already hosts thousands of large datasets
Standardized API	Consistent download experience across all corpora

Planned API

python

import cfabric

# List available corpora
cfabric.list_corpora()

# Download by short name (from registry)
path = cfabric.download('bhsa')

# Or by full Hugging Face repo ID
path = cfabric.download('etcbc/cfabric-bhsa')

# Pin to a specific version
path = cfabric.download('bhsa', revision='v2023.1')

# Load the downloaded corpus
CF = cfabric.Fabric(locations=path)
api = CF.loadAll()
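
Since this API is not implemented yet, the following is only a sketch of how such a `download` function could be layered on the existing `huggingface_hub` library. `snapshot_download` and its `repo_id`, `repo_type`, and `revision` parameters are real; the `REGISTRY` mapping, `resolve_repo_id`, and this `download` function are assumptions made for illustration and may differ from what Context-Fabric eventually ships.

```python
from typing import Optional

# Hypothetical registry mapping short names to Hugging Face repo IDs.
REGISTRY = {"bhsa": "etcbc/cfabric-bhsa"}


def resolve_repo_id(name: str) -> str:
    """Turn a short name or a full repo ID into a full repo ID."""
    if "/" in name:
        # Already of the form "username/repo-name".
        return name
    try:
        return REGISTRY[name]
    except KeyError:
        raise ValueError(f"Unknown corpus: {name!r}") from None


def download(name: str, revision: Optional[str] = None) -> str:
    """Fetch a corpus snapshot and return the local directory path."""
    # Imported lazily so resolve_repo_id works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=resolve_repo_id(name),
        repo_type="dataset",
        revision=revision,
    )


print(resolve_repo_id("bhsa"))
print(resolve_repo_id("etcbc/cfabric-bhsa"))
```

Both calls print the same repo ID, which is the behavior the short-name and full-ID forms above are meant to share.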

Namespace Convention

Corpora on Hugging Face will follow this naming pattern:

text

huggingface.co/datasets/{username}/cfabric-{corpus-name}

Examples:

etcbc/cfabric-bhsa — Official BHSA Hebrew Bible
context-fabric/cfabric-demo — Demo corpus
researcher/cfabric-my-corpus — Community contribution
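
A tool that consumes these names could check them against the pattern before attempting a download. The sketch below does that with a regular expression. The names `CorpusRepo` and `parse_repo_id` are hypothetical, and the permitted character set is a permissive assumption rather than Hugging Face's actual validation rules.

```python
import re
from typing import NamedTuple

# Owner segment, then "cfabric-" followed by the corpus name.
# The allowed characters are an assumption made for illustration.
_REPO_ID = re.compile(r"^(?P<owner>[A-Za-z0-9][\w.-]*)/cfabric-(?P<corpus>[\w.-]+)$")


class CorpusRepo(NamedTuple):
    owner: str
    corpus: str

    @property
    def url(self) -> str:
        """Dataset page URL following the naming pattern above."""
        return f"https://huggingface.co/datasets/{self.owner}/cfabric-{self.corpus}"


def parse_repo_id(repo_id: str) -> CorpusRepo:
    """Split 'username/cfabric-name' into its owner and corpus name."""
    match = _REPO_ID.match(repo_id)
    if match is None:
        raise ValueError(f"Not a cfabric repo ID: {repo_id!r}")
    return CorpusRepo(match["owner"], match["corpus"])


repo = parse_repo_id("etcbc/cfabric-bhsa")
print(repo.owner, repo.corpus, repo.url)
```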

For Corpus Contributors

Once implemented, anyone will be able to publish corpora:

Create a Hugging Face account
Create a dataset repo named cfabric-{your-corpus}
Upload your .tf files
Add a README with the context-fabric tag
Share your repo ID — users download with cfabric.download('username/cfabric-corpus')
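
The steps above could be scripted with the existing `huggingface_hub` client, as in the sketch below. `HfApi.create_repo`, `upload_folder`, and `upload_file` are real methods; the functions `make_dataset_card` and `publish` are hypothetical, as is the assumption that the context-fabric tag belongs in the `tags` list of the README's YAML front matter. Uploading requires being logged in to Hugging Face with a token that has write access.

```python
def make_dataset_card(corpus_name: str, description: str) -> str:
    """Build a minimal README.md carrying the context-fabric tag."""
    return (
        "---\n"
        "tags:\n"
        "- context-fabric\n"
        "---\n"
        "\n"
        f"# cfabric-{corpus_name}\n"
        "\n"
        f"{description}\n"
    )


def publish(username: str, corpus_name: str, tf_dir: str, description: str) -> str:
    """Create the dataset repo, upload the .tf files and README, return the repo ID."""
    # Imported lazily so the card helper works without huggingface_hub installed.
    from huggingface_hub import HfApi

    api = HfApi()
    repo_id = f"{username}/cfabric-{corpus_name}"
    api.create_repo(repo_id=repo_id, repo_type="dataset", exist_ok=True)
    # Place the local .tf files under tf/ in the repository.
    api.upload_folder(
        folder_path=tf_dir, path_in_repo="tf", repo_id=repo_id, repo_type="dataset"
    )
    card = make_dataset_card(corpus_name, description)
    api.upload_file(
        path_or_fileobj=card.encode("utf-8"),
        path_in_repo="README.md",
        repo_id=repo_id,
        repo_type="dataset",
    )
    return repo_id


print(make_dataset_card("my-corpus", "An example corpus."))
```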

Repository Structure

text

cfabric-{corpus-name}/
├── README.md              # Dataset card with metadata
└── tf/                    # Text-Fabric source files
    ├── otype.tf
    ├── oslots.tf
    ├── otext.tf
    └── {features}.tf
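
Before uploading, a contributor might check a local directory against this layout. The sketch below reports which expected files are missing. It treats `otype.tf` and `oslots.tf` as required and `otext.tf` as optional, following Text-Fabric's conventions; whether Context-Fabric will enforce the same rule is an assumption, and the helper `missing_files` is hypothetical.

```python
from pathlib import Path
from typing import List

# Assumed to be mandatory, as in Text-Fabric; otext.tf is treated as optional.
REQUIRED_FEATURES = ("otype.tf", "oslots.tf")


def missing_files(repo_dir: str) -> List[str]:
    """Return the expected files that are absent from a corpus directory."""
    root = Path(repo_dir)
    expected = ["README.md"] + [f"tf/{name}" for name in REQUIRED_FEATURES]
    return [rel for rel in expected if not (root / rel).is_file()]


problems = missing_files("/path/to/cfabric-my-corpus")
if problems:
    print("Missing:", ", ".join(problems))
```

An empty list means the directory has the README and the two core feature files in place.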

Timeline

This distribution system is on the roadmap. Follow the GitHub repository for updates.
