Creating Corpora
Context-Fabric works with any corpus in Text-Fabric format. This page covers how to create your own corpus from scratch or convert existing data.
Future releases of Context-Fabric may include AI-assisted corpus creation tools that can automatically generate TF files from common formats like plain text, XML, or spreadsheets. For now, the approaches below require some Python scripting.
What You Need
Every corpus requires three special files (the "WARP" features):
| File | Purpose |
|---|---|
| otype.tf | Defines node types (word, sentence, paragraph, etc.) |
| oslots.tf | Maps non-slot nodes to their constituent slots |
| otext.tf | Configures text rendering and section navigation |
Plus any number of feature files for your annotations (part of speech, lemma, syntactic function, etc.).
See TF Format for complete syntax documentation.
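Before loading a corpus directory, it can help to confirm that the three WARP files are present and to see which feature files accompany them. The following is a minimal sketch using only the Python standard library; the helper names `missing_warp_files` and `feature_files` are made up for this illustration and are not part of Context-Fabric or Text-Fabric.

```python
from pathlib import Path

# The three WARP files named in the table above
WARP_FILES = ("otype.tf", "oslots.tf", "otext.tf")


def missing_warp_files(corpus_dir):
    """Return the WARP file names that are absent from corpus_dir."""
    corpus_path = Path(corpus_dir)
    return [name for name in WARP_FILES if not (corpus_path / name).is_file()]


def feature_files(corpus_dir):
    """Return the names of all other .tf files (the annotation features)."""
    corpus_path = Path(corpus_dir)
    return sorted(
        p.name for p in corpus_path.glob("*.tf") if p.name not in WARP_FILES
    )
```

An empty list from `missing_warp_files` means all three files are in place; it says nothing about whether their contents are valid.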
Conversion Approaches
From TEI XML
The tfbuilder tool converts TEI-encoded texts to Text-Fabric format. It handles common TEI structures like divisions, paragraphs, and inline annotations.
    pip install tfbuilder
From MQL (Emdros)
If you have data in MQL format (from the Emdros database system), Text-Fabric provides MQL conversion tools in its tf.convert.mql module.
From Custom Formats
For other formats (CSV, JSON, plain text, etc.), use the Walker API from Text-Fabric to programmatically build a corpus. The walker lets you iterate through your source data and emit TF nodes and features.
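A common first step with a custom format is to read the source into a nested Python structure that a director function can later iterate over. The sketch below does this for a CSV file using only the standard library. The column names `paragraph`, `word`, and `pos` and the sample rows are hypothetical; adapt them to your own data.

```python
import csv
import io
from itertools import groupby


def read_paragraphs(csv_file):
    """Group CSV rows into paragraphs, preserving row order.

    Assumes hypothetical columns: paragraph, word, pos.
    Rows of the same paragraph must be adjacent in the file.
    """
    rows = csv.DictReader(csv_file)
    paragraphs = []
    for _, group in groupby(rows, key=lambda row: row["paragraph"]):
        paragraphs.append([{"word": r["word"], "pos": r["pos"]} for r in group])
    return paragraphs


# Made-up sample data standing in for a real file
sample = io.StringIO(
    "paragraph,word,pos\n"
    "1,Hello,intj\n"
    "1,world,noun\n"
    "2,Goodbye,intj\n"
)
paragraphs = read_paragraphs(sample)
```

Each entry in `paragraphs` is a list of word records, so a director can create one paragraph node per entry and one slot per record.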
The Walker API (Text-Fabric)
The Walker API is part of Text-Fabric, not Context-Fabric. Use Text-Fabric to create corpora, then load them with Context-Fabric.
The walker pattern has three components:
- Configuration: define your node types, slot type, and text formats
- Director function: walk through your source data, creating nodes and assigning features
- Output: the walker validates the graph and writes .tf files
Basic Structure
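    from tf.fabric import Fabric
    from tf.convert.walker import CV

    # Placeholder input: each paragraph is a list of word strings
    source_data = [['In', 'the', 'beginning'], ['And', 'so', 'on']]

    # Set up the walker; TF files are written to output_dir
    cv = CV(Fabric(locations='output_dir'))

    # Configuration, passed to cv.walk() below
    slotType = 'word'
    otext = {'fmt:text-orig-full': '{word} '}
    generic = {'author': 'Your Name'}
    featureMeta = {'word': {'description': 'the text of the word'}}

    def director(cv):
        # Walk through your source data
        for paragraph in source_data:
            # Create a paragraph node; it embeds the slots created next
            para = cv.node('paragraph')
            for word_text in paragraph:
                # Create slot nodes (words)
                slot = cv.slot()
                cv.feature(slot, word=word_text)
            # End the paragraph
            cv.terminate(para)

    # Run the conversion; the result is True when it succeeds
    good = cv.walk(director, slotType, otext=otext, generic=generic, featureMeta=featureMeta)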
Key Concepts
Slots are the atomic text units (usually words or characters). Every other node type "contains" slots.
Embedders are nodes that automatically contain subsequently created slots until explicitly terminated. When you call cv.node('sentence'), that sentence will contain all slots created until you call cv.terminate(sentence).
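The embedding rule can be illustrated with a small toy model. This is not Text-Fabric's implementation, only a sketch of the behavior described above: the hypothetical `ToyWalker` class keeps a list of open nodes, and every new slot is added to each of them.

```python
class ToyWalker:
    """Toy stand-in for the embedding rule; not Text-Fabric's implementation."""

    def __init__(self):
        self.next_slot = 1
        self.active = []    # nodes that are currently open
        self.contents = {}  # node -> list of slots it contains

    def node(self, node_type):
        # Open a new node; it starts collecting slots immediately
        node = (node_type, len(self.contents) + 1)
        self.contents[node] = []
        self.active.append(node)
        return node

    def slot(self):
        slot = self.next_slot
        self.next_slot += 1
        # Every open node embeds the new slot
        for node in self.active:
            self.contents[node].append(slot)
        return slot

    def terminate(self, node):
        # A terminated node stops collecting slots
        self.active.remove(node)


walker = ToyWalker()
sentence = walker.node("sentence")
phrase = walker.node("phrase")
walker.slot()              # slot 1: in sentence and phrase
walker.slot()              # slot 2: in sentence and phrase
walker.terminate(phrase)
walker.slot()              # slot 3: in sentence only
walker.terminate(sentence)
walker.slot()              # slot 4: in neither

print(walker.contents[sentence])  # [1, 2, 3]
print(walker.contents[phrase])    # [1, 2]
```

Nested embedders therefore need no special handling: a slot created while both a sentence and a phrase are open belongs to both.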
Features are assigned with cv.feature(node, name=value) for node features and cv.edge(from_node, to_node, name=value) for edge features.
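As a sketch of how node features and edge features might appear together in a director, the function below creates two words in one sentence and links them. It uses only the calls described above. The feature names `word`, `pos`, and `head` and the values are hypothetical, and the surrounding setup (the cv.walk call and feature metadata) is omitted.

```python
def director(cv):
    """Create two words in one sentence and link them with an edge."""
    sentence = cv.node("sentence")

    first = cv.slot()
    cv.feature(first, word="Birds", pos="noun")  # node features on slot 1

    second = cv.slot()
    cv.feature(second, word="sing", pos="verb")  # node features on slot 2

    # Edge feature: "head" is a hypothetical name. The edge runs from the
    # dependent word to its head and carries the relation as its value.
    cv.edge(first, second, head="subject")

    cv.terminate(sentence)
```

Edges are directed, so the order of the two node arguments matters when you later query the corpus.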
Planning Your Corpus
Before writing code, decide on:
- Slot type: What's your atomic unit? Words? Characters? Morphemes?
- Node hierarchy: What structural levels do you need? (document → chapter → paragraph → sentence → word)
- Features: What annotations will you store? Consider:
  - Linguistic: lemma, part of speech, morphology
  - Structural: chapter number, verse number
  - Metadata: speaker, date, source
- Section structure: How will users navigate? (book/chapter/verse, document/paragraph, etc.)
Refer to the Graph Data Model for how these concepts map to the underlying structure.
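One way to make these decisions concrete is to write the plan down as data and check it for obvious inconsistencies before writing a converter. The sketch below is an illustration only: the `CorpusPlan` class is hypothetical, and it assumes that each section type is labeled by a feature of the same name (for example, a `chapter` feature on `chapter` nodes). The `sectionTypes` and `sectionFeatures` keys are the comma-separated section settings stored in otext.

```python
from dataclasses import dataclass


@dataclass
class CorpusPlan:
    slot_type: str
    node_types: list     # outermost to innermost, ending with the slot type
    features: dict       # feature name -> short description
    section_types: list  # node types used for navigation

    def check(self):
        """Return a list of problems found in the plan (empty if none)."""
        problems = []
        if not self.node_types or self.node_types[-1] != self.slot_type:
            problems.append("the slot type should be the innermost node type")
        for section in self.section_types:
            if section not in self.node_types:
                problems.append(f"section type {section!r} is not a node type")
            if section not in self.features:
                problems.append(f"section type {section!r} has no feature to label it")
        return problems

    def otext(self):
        """Build section settings in the comma-separated form used by otext."""
        sections = ",".join(self.section_types)
        return {"sectionTypes": sections, "sectionFeatures": sections}


plan = CorpusPlan(
    slot_type="word",
    node_types=["book", "chapter", "verse", "word"],
    features={
        "book": "book name",
        "chapter": "chapter number",
        "verse": "verse number",
        "lemma": "dictionary form",
    },
    section_types=["book", "chapter", "verse"],
)
print(plan.check())  # []
print(plan.otext())
```

The resulting dictionary can be merged with your text format definitions when you configure the conversion.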
Resources
- TF Format Reference — Complete file format documentation
- Text-Fabric Walker API — Full API reference
- Text-Fabric Corpus Guide — Additional examples
- tfbuilder — TEI XML converter