Documentation

validate

Validate Text-Fabric corpora loading in both Text-Fabric and Context-Fabric.

Tests each corpus with:

1. Text-Fabric loading from .tf files
2. Context-Fabric loading from .tf files (which auto-compiles to .cfm)
3. Context-Fabric loading from .cfm cache

Also samples feature values from both .tf and .cfm loading paths to verify data integrity through the compile/load cycle.

Tests each corpus one at a time to ensure clean memory state and accurate error attribution.
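
The one-at-a-time approach can be sketched as a loop that isolates each corpus, records any exception against that corpus only, and forces garbage collection before moving on. This is a minimal sketch, not the module's actual code: `run_sequentially` and the `validate` callable are hypothetical names introduced for illustration.

```python
from __future__ import annotations

import gc
from typing import Callable


def run_sequentially(names: list[str], validate: Callable[[str], bool]) -> dict[str, str]:
    """Validate corpora one by one; `validate` is a hypothetical per-corpus check."""
    outcomes: dict[str, str] = {}
    for name in names:
        try:
            outcomes[name] = "ok" if validate(name) else "mismatch"
        except Exception as exc:
            # The failure is attributed to this corpus and does not stop the run.
            outcomes[name] = f"error: {exc}"
        finally:
            # Release objects from the previous corpus before loading the next.
            gc.collect()
    return outcomes
```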

Usage:

python benchmarks/validate_corpora.py
python benchmarks/validate_corpora.py --corpus bhsa  # Test single corpus
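
The two invocations above suggest an optional `--corpus` argument, with all corpora tested when it is omitted. A minimal sketch of such a parser with `argparse` follows; the function name `parse_args`, the help text, and the default of `None` are assumptions, not taken from the script.

```python
import argparse


def parse_args(argv=None):
    """Parse the command line; `--corpus` is optional (assumed default: None)."""
    parser = argparse.ArgumentParser(description="Validate corpora loading")
    parser.add_argument(
        "--corpus",
        default=None,
        help="name of a single corpus to test (assumed: all corpora if omitted)",
    )
    return parser.parse_args(argv)
```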

Classes

class

CorpusStats

Statistics from loading a corpus.

Attributes

Name           Type                   Description
edge_features  int
error          str | None
max_node       int
max_slot       int
node_features  int
node_types     int
samples        FeatureSamples | None

Methods

__init__(self, max_slot: int = 0, max_node: int = 0, node_types: int = 0, node_features: int = 0, edge_features: int = 0, samples: FeatureSamples | None = None, error: str | None = None) -> None
Parameters
  • max_slot: int = 0
  • max_node: int = 0
  • node_types: int = 0
  • node_features: int = 0
  • edge_features: int = 0
  • samples: FeatureSamples | None = None
  • error: str | None = None
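
Because every field has a default, the constructor signature is consistent with a dataclass. The sketch below is written under that assumption; the real class may be defined differently, and `samples` is typed loosely here only to keep the block self-contained.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class CorpusStats:
    """Stand-in mirroring the documented fields; not the module's own class."""
    max_slot: int = 0
    max_node: int = 0
    node_types: int = 0
    node_features: int = 0
    edge_features: int = 0
    samples: Any | None = None  # FeatureSamples | None in the documented signature
    error: str | None = None


# Illustrative values only: one successful load and one failed load.
ok = CorpusStats(max_slot=10, max_node=25, node_types=3)
failed = CorpusStats(error="could not load corpus")
```

With all-default fields, a failed load can be represented by setting only `error`, leaving the counts at zero.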
class

FeatureSamples

Sampled feature values for validation.

Attributes

Name          Type                                      Description
edge_samples  dict[str, list[tuple[int, int, Any]]]
node_samples  dict[str, list[tuple[int, Any]]]
text_samples  list[tuple[int, str]]

Methods

__init__(self, node_samples: dict[str, list[tuple[int, Any]]], edge_samples: dict[str, list[tuple[int, int, Any]]], text_samples: list[tuple[int, str]]) -> None
Parameters
  • node_samples: dict[str, list[tuple[int, Any]]]
  • edge_samples: dict[str, list[tuple[int, int, Any]]]
  • text_samples: list[tuple[int, str]]
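
To make the nested types concrete, here is a small sketch that builds an instance by hand. The dataclass definition, the feature names `pos` and `parent`, and all values are illustrative assumptions. Reading the edge tuple as (from node, to node, value) and the other tuples as (node, value) is also an assumption based on the type shapes.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class FeatureSamples:
    """Stand-in with the documented fields; not the module's own class."""
    node_samples: dict[str, list[tuple[int, Any]]]
    edge_samples: dict[str, list[tuple[int, int, Any]]]
    text_samples: list[tuple[int, str]]


# Hypothetical feature names and values, chosen only to show the shapes.
samples = FeatureSamples(
    node_samples={"pos": [(1, "noun"), (2, "verb")]},  # feature -> (node, value)
    edge_samples={"parent": [(2, 1, None)]},           # feature -> (from, to, value)
    text_samples=[(1, "hello"), (2, "world")],         # (node, text)
)
```

If the real class is a dataclass, two instances compare equal field by field, which is one simple way to check that .tf and .cfm loading agree.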
class

ValidationResult

Result of validating a single corpus.

Attributes

Name              Type         Description
cf_mmap_ok        bool
cf_mmap_stats     CorpusStats
cf_ok             bool
cf_stats          CorpusStats
corpus            str
mmap_stats_match  bool         Check that .cfm loading produces the same stats as .tf loading.
samples_match     bool         Check that feature value samples match between .tf and .cfm loading.
stats_match       bool
tf_ok             bool
tf_stats          CorpusStats

Methods

__init__(self, corpus: str, tf_stats: CorpusStats, cf_stats: CorpusStats, cf_mmap_stats: CorpusStats) -> None
Parameters
  • corpus: str
  • tf_stats: CorpusStats
  • cf_stats: CorpusStats
  • cf_mmap_stats: CorpusStats
get_sample_mismatches(self) -> list[str]

Get list of features with mismatched samples.
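
One way such a mismatch list could be computed is to compare the per-feature sample lists from the two loading paths and collect the names that differ. The helper below is a hypothetical standalone sketch, not the method's actual implementation. It takes two plain dictionaries shaped like `node_samples` and treats a feature missing on either side as a mismatch.

```python
from __future__ import annotations


def sample_mismatches(tf_samples: dict[str, list], cf_samples: dict[str, list]) -> list[str]:
    """Return, in sorted order, feature names whose sampled values differ."""
    mismatched = []
    for name in sorted(set(tf_samples) | set(cf_samples)):
        # A feature present on only one side yields None on the other and differs.
        if tf_samples.get(name) != cf_samples.get(name):
            mismatched.append(name)
    return mismatched
```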

Functions

function
clear_caches(tf_path: Path) -> None

Clear Text-Fabric and Context-Fabric cache directories.

Parameters
  • tf_path: Path
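
A cache-clearing step of this kind can be sketched with `shutil.rmtree`. The sketch assumes the caches live in `.tf` and `.cfm` subdirectories directly under `tf_path`; those locations, the helper name `clear_cache_dirs`, and its return value are assumptions for illustration, not the documented behavior.

```python
import shutil
from pathlib import Path


def clear_cache_dirs(tf_path: Path, cache_names=(".tf", ".cfm")) -> list:
    """Delete assumed cache subdirectories under tf_path; return those removed."""
    removed = []
    for name in cache_names:
        cache_dir = tf_path / name
        # Only directories are removed; the .tf feature files are left untouched.
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed.append(cache_dir)
    return removed
```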
function
load_with_context_fabric(tf_path: Path, collect_samples: bool = False) -> CorpusStats

Load corpus with Context-Fabric and return stats.

Parameters
  • tf_path: Path
  • collect_samples: bool = False
function
load_with_text_fabric(tf_path: Path) -> CorpusStats

Load corpus with Text-Fabric and return stats.

Parameters
  • tf_path: Path
function
main()
function
sample_feature_values(api, sample_size: int = 100) -> FeatureSamples

Sample feature values from loaded API for validation. Samples nodes at regular intervals across the corpus to get representative coverage.
Parameters
  • api
  • sample_size: int = 100
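
Sampling at regular intervals can be sketched as picking evenly spaced node numbers between 1 and the highest node. The helper below is a hypothetical illustration of that idea, assuming 1-based node numbering; the actual function may choose nodes differently.

```python
from __future__ import annotations


def regular_sample_nodes(max_node: int, sample_size: int = 100) -> list[int]:
    """Pick up to sample_size node numbers spread evenly over 1..max_node."""
    if max_node <= 0 or sample_size <= 0:
        return []
    if max_node <= sample_size:
        # Small corpus: every node is sampled.
        return list(range(1, max_node + 1))
    # Integer arithmetic keeps the picks distinct and spread over the whole range.
    return [1 + (i * max_node) // sample_size for i in range(sample_size)]
```

For example, `regular_sample_nodes(10, 5)` returns `[1, 3, 5, 7, 9]`.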
function
validate_corpus(corpus_name: str, corpus_dir: Path) -> ValidationResult

Validate a single corpus with both Text-Fabric (TF) and Context-Fabric (CF).

Parameters
  • corpus_name: str
  • corpus_dir: Path