Creating Corpora

Context-Fabric works with any corpus in Text-Fabric format. This page covers how to create your own corpus from scratch or convert existing data.

Future releases of Context-Fabric may include AI-assisted corpus creation tools that can automatically generate TF files from common formats like plain text, XML, or spreadsheets. For now, the approaches below require some Python scripting.

What You Need

Every corpus requires three special files (the "WARP" features):

| File | Purpose |
| --- | --- |
| `otype.tf` | Defines node types (word, sentence, paragraph, etc.) |
| `oslots.tf` | Maps non-slot nodes to their constituent slots |
| `otext.tf` | Configures text rendering and section navigation |

Plus any number of feature files for your annotations (part of speech, lemma, syntactic function, etc.).

See TF Format for complete syntax documentation.
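
As a quick sanity check before loading a corpus, the sketch below tests whether a directory contains the three files listed above. The helper name `missing_warp_files` and the assumption that all `.tf` files sit in one folder are illustrative only; neither is part of Context-Fabric or Text-Fabric.

```python
from pathlib import Path

# File names taken from the table above
REQUIRED_FILES = ("otype.tf", "oslots.tf", "otext.tf")


def missing_warp_files(corpus_dir):
    """Return the required .tf files that are absent from corpus_dir (hypothetical helper)."""
    root = Path(corpus_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means all three files are present; it says nothing about whether their contents are valid.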

Conversion Approaches

From TEI XML

The tfbuilder tool converts TEI-encoded texts to Text-Fabric format. It handles common TEI structures like divisions, paragraphs, and inline annotations.

```bash
pip install tfbuilder
```

From MQL (Emdros)

If you have data in MQL format (from the Emdros database system), Text-Fabric provides MQL conversion tools in its `tf.convert.mql` module.

From Custom Formats

For other formats (CSV, JSON, plain text, etc.), use the Walker API from Text-Fabric to programmatically build a corpus. The walker lets you iterate through your source data and emit TF nodes and features.
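
Before writing a walker, it often helps to load the source into a simple nested structure first. The sketch below groups the rows of a CSV file into paragraphs of words using only the standard library. The column names `paragraph` and `word`, the helper name `read_paragraphs`, and the sample data are all hypothetical; adapt them to your own source.

```python
import csv
import io
from itertools import groupby


def read_paragraphs(csv_file):
    """Group CSV rows into a list of paragraphs, each a list of word strings.

    Assumes one word per row, columns named 'paragraph' and 'word',
    and rows already in reading order.
    """
    rows = csv.DictReader(csv_file)
    return [
        [row["word"] for row in group]
        for _, group in groupby(rows, key=lambda row: row["paragraph"])
    ]


# Hypothetical sample input standing in for a real file
sample = io.StringIO("paragraph,word\n1,Call\n1,me\n1,Ishmael\n2,Some\n2,years\n2,ago\n")
paragraphs = read_paragraphs(sample)
print(paragraphs)
```

The resulting list of lists can then be iterated over when emitting nodes and slots.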

The Walker API (Text-Fabric)

The Walker API is part of Text-Fabric, not Context-Fabric. Use Text-Fabric to create corpora, then load them with Context-Fabric.

The walker pattern has three components:

1. Configuration: Define your node types, slot type, and text formats
2. Director function: Walk through your source data, creating nodes and assigning features
3. Output: Walker validates the graph and writes `.tf` files

Basic Structure

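```python
from tf.fabric import Fabric
from tf.convert.walker import CV

# Placeholder input: each paragraph is a list of word strings
source_data = [['In', 'the', 'beginning'], ['Second', 'paragraph']]

# The CV object writes its .tf files to the Fabric location
cv = CV(Fabric(locations='output_dir'))

def director(cv):
    # Walk through your source data
    for paragraph in source_data:
        # Create a paragraph node; it embeds the slots created below
        para = cv.node('paragraph')

        for word_text in paragraph:
            # Create slot nodes (words)
            slot = cv.slot()
            cv.feature(slot, word=word_text)

        # End the paragraph
        cv.terminate(para)

# Run the conversion; the configuration is passed to walk() as arguments
cv.walk(
    director,
    'word',  # slotType
    otext={'fmt:text-orig-full': '{word} '},
    generic={'author': 'Your Name'},
    # Describe each feature you emit
    featureMeta={'word': {'description': 'the text of the word'}},
)
```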

Key Concepts

Slots are the atomic text units (usually words or characters). Every other node type "contains" slots.

Embedders are nodes that automatically contain subsequently created slots until explicitly terminated. When you call `cv.node('sentence')`, that sentence will contain all slots created until you call `cv.terminate(sentence)`.
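
To make the embedding rule concrete, here is a toy model of it in plain Python. It is not Text-Fabric's implementation: the class `ToyWalker` is hypothetical and only mimics the names `node`, `slot`, and `terminate` to show which slots end up inside which node.

```python
class ToyWalker:
    """Toy model of embedding; NOT Text-Fabric's implementation."""

    def __init__(self):
        self.next_slot = 1   # slots are numbered from 1
        self.active = []     # embedders that are currently open
        self.contents = {}   # node -> list of slot numbers it contains

    def node(self, node_type):
        node = (node_type, len(self.contents) + 1)
        self.contents[node] = []
        self.active.append(node)
        return node

    def slot(self):
        slot = self.next_slot
        self.next_slot += 1
        # Every open embedder receives the new slot
        for node in self.active:
            self.contents[node].append(slot)
        return slot

    def terminate(self, node):
        self.active.remove(node)


toy = ToyWalker()
paragraph = toy.node("paragraph")
first = toy.node("sentence")
toy.slot()
toy.slot()
toy.terminate(first)
second = toy.node("sentence")
toy.slot()
toy.terminate(second)
toy.terminate(paragraph)

for node, slots in toy.contents.items():
    print(node, slots)
```

In this model the paragraph collects all three slots, while each sentence collects only the slots created between its creation and its termination. Nested embedders therefore need no bookkeeping beyond opening and closing them in the right order.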

Features are assigned with `cv.feature(node, name=value)` for node features and `cv.edge(from_node, to_node, name=value)` for edge features.
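
A director might combine both calls as in the sketch below. The feature names `pos` and `dep`, the sample sentence, and the head-index encoding are hypothetical. The function uses only the `cv.node`, `cv.slot`, `cv.feature`, `cv.edge`, and `cv.terminate` calls described on this page, and is meant to be passed to `cv.walk` rather than run on its own.

```python
# Hypothetical sample: (word, part of speech, index of the syntactic head or None)
SENTENCE = [("dogs", "noun", 1), ("bark", "verb", None)]


def director(cv):
    sentence = cv.node("sentence")

    # First pass: create one slot per word and attach node features
    slots = []
    for word, pos, _ in SENTENCE:
        slot = cv.slot()
        cv.feature(slot, word=word, pos=pos)
        slots.append(slot)

    # Second pass: add an edge feature from each head to its dependent
    for slot, (_, _, head) in zip(slots, SENTENCE):
        if head is not None:
            cv.edge(slots[head], slot, dep="subject")

    cv.terminate(sentence)
```

Creating all slots before adding edges means both endpoints exist by the time the edge is recorded.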

Planning Your Corpus

Before writing code, decide on:

- Slot type: What's your atomic unit? Words? Characters? Morphemes?
- Node hierarchy: What structural levels do you need? (document → chapter → paragraph → sentence → word)
- Features: What annotations will you store? Consider:
  - Linguistic: lemma, part of speech, morphology
  - Structural: chapter number, verse number
  - Metadata: speaker, date, source
- Section structure: How will users navigate? (book/chapter/verse, document/paragraph, etc.)

Refer to the Graph Data Model for how these concepts map to the underlying structure.
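
One way to keep these decisions in one place is to write them down as data before writing the conversion code. The sketch below is an assumption-laden illustration: the `CorpusPlan` class and its field names are hypothetical, and it assumes that `otext` expresses section navigation through the `sectionTypes` and `sectionFeatures` keys, given as comma-separated strings.

```python
from dataclasses import dataclass, field


@dataclass
class CorpusPlan:
    """Hypothetical container for the planning decisions listed above."""

    slot_type: str
    node_types: list  # structural levels, largest first, excluding the slot type
    features: dict = field(default_factory=dict)  # feature name -> description
    section_types: list = field(default_factory=list)
    section_features: list = field(default_factory=list)

    def otext(self, text_feature):
        """Build an otext-style dict: one text format plus section settings."""
        config = {"fmt:text-orig-full": "{" + text_feature + "} "}
        if self.section_types:
            config["sectionTypes"] = ",".join(self.section_types)
            config["sectionFeatures"] = ",".join(self.section_features)
        return config


plan = CorpusPlan(
    slot_type="word",
    node_types=["book", "chapter", "verse", "sentence"],
    features={
        "word": "surface text",
        "lemma": "dictionary form",
        "book": "book name",
        "chapter": "chapter number",
        "verse": "verse number",
    },
    section_types=["book", "chapter", "verse"],
    section_features=["book", "chapter", "verse"],
)
print(plan.otext("word"))
```

Writing the plan first makes it easier to spot a section level that has no feature to label it, or a feature that no node type will carry.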

Resources

- TF Format Reference: Complete file format documentation
- Text-Fabric Walker API: Full API reference
- Text-Fabric Corpus Guide: Additional examples
- tfbuilder: TEI XML converter
