validate
Validate Text-Fabric corpora loading in both Text-Fabric and Context-Fabric.
Tests each corpus with:
1. Text-Fabric loading from .tf files
2. Context-Fabric loading from .tf files (which auto-compiles to .cfm)
3. Context-Fabric loading from .cfm cache
Also samples feature values from both .tf and .cfm loading paths to verify data integrity through the compile/load cycle.
Tests each corpus one at a time to ensure clean memory state and accurate error attribution.
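The one-at-a-time approach can be illustrated with a small loop. This is only a sketch under assumptions: `run_isolated` and the loader callables are hypothetical names, not part of this module, and the real script may isolate runs differently.

```python
from __future__ import annotations

import gc
from typing import Callable


def run_isolated(corpora: dict[str, Callable[[], int]]) -> dict[str, str | None]:
    """Run each loader on its own and record the error (if any) per corpus.

    Hypothetical helper for illustration only.
    """
    errors: dict[str, str | None] = {}
    for name, loader in corpora.items():
        try:
            loader()
            errors[name] = None
        except Exception as exc:
            # The failure is attributed to this corpus and nothing else.
            errors[name] = f"{type(exc).__name__}: {exc}"
        finally:
            # Release objects from this run before the next corpus is loaded.
            gc.collect()
    return errors


def _broken() -> int:
    raise ValueError("bad feature file")


outcome = run_isolated({"good": lambda: 42, "broken": _broken})
```

Because every loader runs inside its own `try` block, a failure in one corpus does not stop the remaining corpora from being checked.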
Usage:
python benchmarks/validate_corpora.py
python benchmarks/validate_corpora.py --corpus bhsa  # Test single corpus
Classes
CorpusStats
Statistics from loading a corpus.
Attributes
| Name | Type | Description |
|---|---|---|
| edge_features | int | — |
| error | str \| None | — |
| max_node | int | — |
| max_slot | int | — |
| node_features | int | — |
| node_types | int | — |
| samples | FeatureSamples \| None | — |
Methods
__init__(self, max_slot: int = 0, max_node: int = 0, node_types: int = 0, node_features: int = 0, edge_features: int = 0, samples: FeatureSamples | None = None, error: str | None = None) → None
Parameters
max_slot: int = 0
max_node: int = 0
node_types: int = 0
node_features: int = 0
edge_features: int = 0
samples: FeatureSamples | None = None
error: str | None = None
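The all-default signature suggests that a failed load can be represented by a mostly empty record carrying only `error`. The sketch below uses a stand-in dataclass that mirrors the documented signature; whether the real class is a dataclass is an assumption, and the counts are made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class CorpusStats:
    """Stand-in mirroring the documented signature (assumed to be a dataclass)."""

    max_slot: int = 0
    max_node: int = 0
    node_types: int = 0
    node_features: int = 0
    edge_features: int = 0
    samples: Any = None  # FeatureSamples | None in the real module
    error: str | None = None


# A successful load fills in the counts (illustrative numbers only).
ok = CorpusStats(max_slot=1000, max_node=2500, node_types=5,
                 node_features=12, edge_features=3)

# A failed load keeps the zero defaults and records the error message.
failed = CorpusStats(error="FileNotFoundError: otype.tf")
```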
FeatureSamples
Sampled feature values for validation.
Attributes
| Name | Type | Description |
|---|---|---|
| edge_samples | dict[str, list[tuple[int, int, Any]]] | — |
| node_samples | dict[str, list[tuple[int, Any]]] | — |
| text_samples | list[tuple[int, str]] | — |
Methods
__init__(self, node_samples: dict[str, list[tuple[int, Any]]], edge_samples: dict[str, list[tuple[int, int, Any]]], text_samples: list[tuple[int, str]]) → None
Parameters
node_samples: dict[str, list[tuple[int, Any]]]
edge_samples: dict[str, list[tuple[int, int, Any]]]
text_samples: list[tuple[int, str]]
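To make the nested types concrete, here is a small stand-in with example data. The meaning of the tuple elements, `(node, value)` for node features, `(from_node, to_node, value)` for edge features and `(node, text)` for text samples, is an assumption inferred from the type shapes, and the feature names are illustrative.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass
class FeatureSamples:
    """Stand-in with the documented fields (assumed to be a dataclass)."""

    node_samples: dict[str, list[tuple[int, Any]]]
    edge_samples: dict[str, list[tuple[int, int, Any]]]
    text_samples: list[tuple[int, str]]


def make_samples() -> FeatureSamples:
    # Illustrative values only; not taken from any real corpus.
    return FeatureSamples(
        node_samples={"lex": [(1, "a"), (2, "b")]},
        edge_samples={"parent": [(1, 10, None)]},
        text_samples=[(1, "alpha ")],
    )


from_tf = make_samples()
from_cfm = make_samples()

# Dataclass equality compares all three fields, which is enough for a
# whole-record integrity check between two loading paths.
same = from_tf == from_cfm
```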
ValidationResult
Result of validating a single corpus.
Attributes
| Name | Type | Description |
|---|---|---|
| cf_mmap_ok | bool | — |
| cf_mmap_stats | CorpusStats | — |
| cf_ok | bool | — |
| cf_stats | CorpusStats | — |
| corpus | str | — |
| mmap_stats_match | bool | Check that .cfm loading produces same stats as .tf loading. |
| samples_match | bool | Check that feature value samples match between .tf and .cfm loading. |
| stats_match | bool | — |
| tf_ok | bool | — |
| tf_stats | CorpusStats | — |
Methods
__init__(self, corpus: str, tf_stats: CorpusStats, cf_stats: CorpusStats, cf_mmap_stats: CorpusStats) → None
Parameters
corpus: str
tf_stats: CorpusStats
cf_stats: CorpusStats
cf_mmap_stats: CorpusStats
get_sample_mismatches(self) → list[str]
Get list of features with mismatched samples.
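One plausible way to compute a list of mismatched features is to compare the sampled values per feature name. The function below is a hypothetical sketch, not the method's actual implementation; it works on plain dictionaries shaped like `node_samples`.

```python
from __future__ import annotations

from typing import Any

Samples = dict[str, list[tuple[Any, ...]]]


def sample_mismatches(tf_samples: Samples, cfm_samples: Samples) -> list[str]:
    """Return sorted feature names whose sampled values differ between two loads.

    A feature present on only one side also counts as a mismatch.
    """
    names = set(tf_samples) | set(cfm_samples)
    return sorted(n for n in names if tf_samples.get(n) != cfm_samples.get(n))


tf_side = {"lex": [(1, "a"), (2, "b")], "sp": [(1, "verb")]}
cfm_side = {"lex": [(1, "a"), (2, "B")], "sp": [(1, "verb")]}
mismatched = sample_mismatches(tf_side, cfm_side)
```

Returning feature names rather than a single boolean keeps the report actionable: a reader can see which feature changed during the compile/load cycle.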
Functions
clear_caches(tf_path: Path) → None
Clear Text-Fabric and Context-Fabric cache directories.
Parameters
tf_path: Path
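A cache-clearing step of this kind might look like the sketch below. The cache directory names `.tf` and `.cfm` and their location directly under `tf_path` are assumptions, and `clear_caches_sketch` is a hypothetical name; the real function may target different paths.

```python
from __future__ import annotations

import shutil
import tempfile
from pathlib import Path

# Assumed cache directory names; the real script may use other locations.
CACHE_DIR_NAMES = (".tf", ".cfm")


def clear_caches_sketch(tf_path: Path) -> list[Path]:
    """Remove assumed cache directories under tf_path and return what was removed."""
    removed: list[Path] = []
    for name in CACHE_DIR_NAMES:
        cache_dir = tf_path / name
        if cache_dir.is_dir():
            shutil.rmtree(cache_dir)
            removed.append(cache_dir)
    return removed


# Demonstration on a throwaway directory so no real corpus is touched.
with tempfile.TemporaryDirectory() as tmp:
    corpus = Path(tmp)
    (corpus / ".tf").mkdir()
    (corpus / "otype.tf").write_text("@node\n")
    removed_names = [p.name for p in clear_caches_sketch(corpus)]
    source_kept = (corpus / "otype.tf").exists()
```

Only the cache directories are deleted; the `.tf` source files stay in place so the next load starts from a cold cache.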
load_with_context_fabric(tf_path: Path, collect_samples: bool = False) → CorpusStats
Load corpus with Context-Fabric and return stats.
Parameters
tf_path: Path
collect_samples: bool = False
load_with_text_fabric(tf_path: Path) → CorpusStats
Load corpus with Text-Fabric and return stats.
Parameters
tf_path: Path
main()
print_summary(results: list[ValidationResult]) → None
Print summary table of all results.
Parameters
results: list[ValidationResult]
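The layout of the summary table is not documented here. As a rough sketch, a summary over the three `*_ok` flags could be formatted as follows; `format_summary`, the column labels and the `demo` corpus are hypothetical.

```python
from __future__ import annotations


def format_summary(rows: list[tuple[str, bool, bool, bool]]) -> str:
    """Format (corpus, tf_ok, cf_ok, cf_mmap_ok) rows as a fixed-width table."""
    header = f"{'corpus':<12} {'TF':<5} {'CF':<5} {'CFM':<5}"
    lines = [header, "-" * len(header)]
    for corpus, tf_ok, cf_ok, cf_mmap_ok in rows:
        marks = ["ok" if flag else "FAIL" for flag in (tf_ok, cf_ok, cf_mmap_ok)]
        lines.append(f"{corpus:<12} {marks[0]:<5} {marks[1]:<5} {marks[2]:<5}")
    return "\n".join(lines)


summary = format_summary([
    ("bhsa", True, True, True),
    ("demo", True, False, False),  # hypothetical failing corpus
])
print(summary)
```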
sample_feature_values(api, sample_size: int = 100) → FeatureSamples
Sample feature values from loaded API for validation.
Parameters
api
sample_size: int = 100
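For samples from two loading paths to be comparable, the same nodes have to be sampled each time. One way to achieve that is deterministic, evenly spaced selection, sketched below. This is an assumption about the approach, not the function's actual logic; `pick_sample_nodes` and the fake feature mapping are hypothetical.

```python
from __future__ import annotations


def pick_sample_nodes(max_node: int, sample_size: int = 100) -> list[int]:
    """Pick up to sample_size evenly spaced node ids from 1..max_node."""
    if max_node <= 0 or sample_size <= 0:
        return []
    if max_node <= sample_size:
        # Small corpus: take every node.
        return list(range(1, max_node + 1))
    step = max_node / sample_size
    return [1 + int(i * step) for i in range(sample_size)]


# A fake node feature standing in for a loaded corpus (illustrative only).
fake_feature = {n: f"w{n}" for n in range(1, 1001)}

nodes = pick_sample_nodes(1000, sample_size=5)
samples = [(n, fake_feature[n]) for n in nodes]
```

Because the selection depends only on `max_node` and `sample_size`, two loads that agree on `max_node` will yield samples for identical node ids.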
validate_corpus(corpus_name: str, corpus_dir: Path) → ValidationResult
Validate a single corpus with both TF and CF.
Parameters
corpus_name: str
corpus_dir: Path