prepare

# Pre-compute data.

For CF to work efficiently, some derived data needs to be pre-computed. The pre-computed data plays a role similar to that of indexes in a database.

Pre-computation is triggered when `cfabric.fabric.Fabric` loads features, and the order and nature of the steps are configured in `cfabric.fabric.PRECOMPUTE`.

The functions in this module implement those tasks.

Functions

function

boundary(info: InfoFunc, error: ErrorFunc, otype: OtypeData, oslots: OslotsData, rank: RankData) → BoundaryData

Computes boundary data.

For each slot, the nodes that start at that slot and the nodes that end at that slot are collected.

Boundary data is used by the API functions `cfabric.locality.Locality.p` and `cfabric.locality.Locality.n`.

Parameters
----------
info: function
    Method to write informational messages to the console.
error: function
    Method to write error messages to the console.
otype: iterable
    The data of the `otype` feature.
oslots: iterable
    The data of the `oslots` feature.
rank: tuple
    The data of the `rank` pre-computation step.

Returns
-------
tuple
    * first: tuple of tuple
      The `n`-th member is the tuple of nodes that start at slot `n`, ordered in *reversed* canonical order (`cfabric.nodes`).
    * last: tuple of tuple
      The `n`-th member is the tuple of nodes that end at slot `n`, ordered in canonical order.

Notes
-----
!!! hint "why reversed canonical order?"
    Just for symmetry.

Parameters

info: InfoFunc
error: ErrorFunc
otype: OtypeData
oslots: OslotsData
rank: RankData
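
As a rough illustration of the boundary data described for `boundary`, the following sketch collects, per slot, the nodes that start and end there. It is not the module's implementation: `boundary_sketch`, the `slots_of` mapping (node to its sorted slots) and the toy `rank` mapping are hypothetical stand-ins for the real `oslots` and `rank` data, and index `0` of each result is assumed to correspond to slot `1`.

```python
from collections import defaultdict


def boundary_sketch(slots_of, rank):
    """Toy boundary computation (hypothetical helper, not the cfabric API).

    slots_of: dict mapping a node to the sorted tuple of slots it occupies.
    rank: dict mapping a node to its position in the canonical order.
    """
    starts = defaultdict(list)
    ends = defaultdict(list)
    for node, slots in slots_of.items():
        starts[slots[0]].append(node)
        ends[slots[-1]].append(node)

    max_slot = max(slots[-1] for slots in slots_of.values())
    slot_range = range(1, max_slot + 1)
    # Starting nodes in reversed canonical order, ending nodes in canonical order.
    first = tuple(
        tuple(sorted(starts[s], key=lambda n: rank[n], reverse=True))
        for s in slot_range
    )
    last = tuple(tuple(sorted(ends[s], key=lambda n: rank[n])) for s in slot_range)
    return first, last


# Assumed toy corpus: two phrases (5, 6) inside one sentence (7) over slots 1..4.
toy_slots = {5: (1, 2), 6: (3, 4), 7: (1, 2, 3, 4)}
toy_rank = {7: 0, 5: 1, 6: 2}
first, last = boundary_sketch(toy_slots, toy_rank)
```

In this toy corpus, slot 1 is the start of both the phrase `5` and the sentence `7`, and slot 4 is the end of both the sentence `7` and the phrase `6`.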

function

characters(info: InfoFunc, error: ErrorFunc, otext: OtextData, tFormats: dict[(str, tuple[(str, ...)])], tFeats: tuple[(str, dict[(int, Any)] | None)] = ()) → CharactersResult

Computes character data.

For each text format, a frequency list of the characters in that format is made.

Parameters
----------
info: function
    Method to write informational messages to the console.
error: function
    Method to write error messages to the console.
otext: iterable
    The data of the `otext` feature.
tFormats: dict
    Dictionary keyed by text format and valued by the tuple of features used in that format.
tFeats: iterable
    Each `tFeat` is the name and the data of a text feature, i.e. a feature used in text formats.

Returns
-------
dict
    Keyed by format and valued by a frequency dict, which is itself keyed by single characters and valued by the frequency of that character in the whole corpus when rendered with that format.

Parameters

info: InfoFunc
error: ErrorFunc
otext: OtextData
tFormats: dict[(str, tuple[(str, ...)])]
tFeats: tuple[(str, dict[(int, Any)] | None)]= ()
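
The shape of the result of `characters` can be pictured with a small sketch. It assumes the text has already been rendered per format; `char_freq_sketch` and the format names `fmt-a` and `fmt-b` are hypothetical, and the real function works from `otext`, `tFormats` and `tFeats` rather than from ready-made strings.

```python
from collections import Counter


def char_freq_sketch(rendered):
    """Count characters per format (hypothetical helper, not the cfabric API).

    rendered: dict mapping a format name to the strings rendered in that format.
    """
    return {fmt: dict(Counter("".join(texts))) for fmt, texts in rendered.items()}


# Assumed toy input: two made-up formats with a few rendered strings each.
freqs = char_freq_sketch({"fmt-a": ["ab ", "ba"], "fmt-b": ["AB"]})
```

The outer dictionary is keyed by format and the inner one by single character, which mirrors the return value described for `characters`.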

function

levDown(info: InfoFunc, error: ErrorFunc, otype: OtypeData, levUp: LevUpData, rank: RankData) → LevDownData

Computes level-down data.

Level-down data is used by the API function `cfabric.locality.Locality.d`. That function computes the embedded nodes of a node by looking them up in the level-down data.

Parameters
----------
info: function
    Method to write informational messages to the console.
error: function
    Method to write error messages to the console.
otype: iterable
    The data of the `otype` feature.
levUp: iterable
    The data of the `levUp` pre-computation step.
rank: tuple
    The data of the `rank` pre-computation step.

Returns
-------
tuple
    The `n`-th member is a tuple of the embedded nodes of `n + maxSlot`. Those tuples are sorted in canonical order (`cfabric.nodes`).

!!! hint "Memory efficiency"
    Slot nodes do not have embedded nodes, so they do not have to occupy space in this tuple. Hence the first member contains the embedded nodes of node `maxSlot + 1`.

!!! caution "Use with care"
    It is not advisable to use this data directly via `C.levDown.data`; it is far better to use the `cfabric.locality.Locality.d` function. Only when every last bit of performance has to be squeezed out might this raw data be worth using.

Parameters

info: InfoFunc
error: ErrorFunc
otype: OtypeData
levUp: LevUpData
rank: RankData
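
One way to picture level-down data is as the inverse of level-up data. The sketch below is an illustration under assumptions, not the module's code: `lev_down_sketch` is a hypothetical helper, `lev_up` and `rank` are plain dictionaries keyed by node, and index `0` of the result is taken to hold the embedded nodes of node `maxSlot + 1`.

```python
def lev_down_sketch(lev_up, rank, max_slot):
    """Invert embedder tuples into embedded-node tuples (hypothetical helper).

    lev_up: dict mapping each node to the tuple of nodes that embed it.
    rank: dict mapping each node to its position in the canonical order.
    max_slot: the highest slot node; slots are skipped in the result.
    """
    down = {}
    for node, ups in lev_up.items():
        for up in ups:
            down.setdefault(up, []).append(node)

    max_node = max(lev_up)
    # Only non-slot nodes get an entry; each entry is sorted canonically.
    return tuple(
        tuple(sorted(down.get(n, ()), key=lambda m: rank[m]))
        for n in range(max_slot + 1, max_node + 1)
    )


# Assumed toy corpus: slots 1..4, phrases 5 and 6, sentence 7.
toy_lev_up = {1: (7, 5), 2: (7, 5), 3: (7, 6), 4: (7, 6), 5: (7,), 6: (7,), 7: ()}
toy_rank = {7: 0, 5: 1, 1: 2, 2: 3, 6: 4, 3: 5, 4: 6}
toy_lev_down = lev_down_sketch(toy_lev_up, toy_rank, max_slot=4)
```

Here `toy_lev_down[0]` belongs to node `5`, which shows the memory-saving offset described in the hint: the four slot nodes take up no space in the tuple.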

function

levUp(info: InfoFunc, error: ErrorFunc, otype: OtypeData, oslots: OslotsData, rank: RankData) → LevUpData

Computes level-up data.

Level-up data is used by the API function `cfabric.locality.Locality.u`. That function computes the embedders of a node by looking them up in the level-up data.

Parameters
----------
info: function
    Method to write informational messages to the console.
error: function
    Method to write error messages to the console.
otype: iterable
    The data of the `otype` feature.
oslots: iterable
    The data of the `oslots` feature.
rank: tuple
    The data of the `rank` pre-computation step.

Returns
-------
tuple
    The `n`-th member is a tuple of the embedder nodes of `n`. Those tuples are sorted in canonical order (`cfabric.nodes`).

Notes
-----
!!! hint "Memory efficiency"
    Many nodes have the same tuple of embedders. Those embedder tuples will be reused for those nodes.

Warnings
--------
It is not advisable to use this data directly via `C.levUp.data`; it is far better to use the `cfabric.locality.Locality.u` function. Only when every last bit of performance has to be squeezed out might this raw data be worth using.

Parameters

info: InfoFunc
error: ErrorFunc
otype: OtypeData
oslots: OslotsData
rank: RankData
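
The idea of embedder tuples, and of reusing identical tuples to save memory, can be sketched as follows. This is a deliberately naive illustration, not how `levUp` is implemented: `lev_up_sketch` and its dictionary inputs are hypothetical, it compares every pair of nodes, and it sidesteps nodes with identical slot sets by using a strict subset test.

```python
def lev_up_sketch(slots_of, rank):
    """Compute embedder tuples with tuple reuse (hypothetical helper).

    slots_of: dict mapping a node to the tuple of slots it occupies.
    rank: dict mapping a node to its position in the canonical order.
    """
    by_rank = sorted(slots_of, key=lambda n: rank[n])
    slot_sets = {n: set(s) for n, s in slots_of.items()}
    interned = {}  # maps each distinct embedder tuple to one shared object
    result = {}
    for node in slots_of:
        # Strict subset: nodes with identical slot sets are left out of this sketch.
        ups = tuple(m for m in by_rank if slot_sets[node] < slot_sets[m])
        result[node] = interned.setdefault(ups, ups)
    return result


# Assumed toy corpus: slots 1..4, phrases 5 and 6, sentence 7.
toy_slots = {1: (1,), 2: (2,), 3: (3,), 4: (4,), 5: (1, 2), 6: (3, 4), 7: (1, 2, 3, 4)}
toy_rank = {7: 0, 5: 1, 1: 2, 2: 3, 6: 4, 3: 5, 4: 6}
toy_lev_up = lev_up_sketch(toy_slots, toy_rank)
```

Slots `1` and `2` share the very same tuple object here, which is the kind of reuse the memory-efficiency hint refers to.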

function

levels(info: InfoFunc, error: ErrorFunc, otype: OtypeData, oslots: OslotsData, otext: OtextData) → LevelsData

Computes level data.

For each node type, compute the average number of slots occupied by its nodes, and order the node types on that.

Parameters
----------
info: function
    Method to write informational messages to the console.
error: function
    Method to write error messages to the console.
otype: iterable
    The data of the `otype` feature.
oslots: iterable
    The data of the `oslots` feature.
otext: iterable
    The data of the `otext` feature.

Returns
-------
tuple
    An ordered tuple, each member with the information of a node type:

    * node type name
    * average number of slots contained in the nodes of this type
    * first node of this type
    * last node of this type

    The order of the tuple is descending by average number of slots per node of that type.

Notes
-----
!!! explanation "Level computation and customization"
    Every node type has a level, defined by the average number of slots that objects of that type occupy. The bigger the average object, the lower the level. Books have the lowest level, words the highest level.

    However, this can be overruled. Suppose you have a node type `phrase` and above it a node type `cluster`, i.e. phrases are contained in clusters, but not vice versa. If all phrases are contained in clusters, and some clusters have more than one phrase, the automatic level ranking of node types works out well. But if clusters only have very small phrases, and the big phrases do not occur in clusters, then the algorithm may assign a lower rank to clusters than to phrases.

    In general, it is too expensive to try to compute the levels in a sophisticated way. To remedy cases where the algorithm assigns wrong levels, you can add a `@levels` and / or `@levelsConstraint` key to the `otext` configuration feature. See `cfabric.text`.

Parameters

info: InfoFunc
error: ErrorFunc
otype: OtypeData
oslots: OslotsData
otext: OtextData
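
The automatic part of the level computation, averaging slots per node type and sorting on that, can be sketched in a few lines. `levels_sketch` and its dictionary inputs are hypothetical, and the sketch ignores the `@levels` and `@levelsConstraint` overrides that the real function reads from `otext`.

```python
def levels_sketch(otype, slots_of):
    """Order node types by average size (hypothetical helper, not the cfabric API).

    otype: dict mapping a node to its type name.
    slots_of: dict mapping a node to the tuple of slots it occupies.
    """
    stats = {}
    for node, typ in otype.items():
        total, count, lo, hi = stats.get(typ, (0, 0, node, node))
        stats[typ] = (
            total + len(slots_of[node]),
            count + 1,
            min(lo, node),
            max(hi, node),
        )
    rows = [(typ, total / count, lo, hi) for typ, (total, count, lo, hi) in stats.items()]
    # Largest average first, matching the descending order described for `levels`.
    return tuple(sorted(rows, key=lambda row: -row[1]))


# Assumed toy corpus: four words, two phrases, one sentence.
toy_otype = {1: "word", 2: "word", 3: "word", 4: "word",
             5: "phrase", 6: "phrase", 7: "sentence"}
toy_slots = {1: (1,), 2: (2,), 3: (3,), 4: (4,), 5: (1, 2), 6: (3, 4), 7: (1, 2, 3, 4)}
toy_levels = levels_sketch(toy_otype, toy_slots)
```

Each row carries the type name, the average number of slots, and the first and last node of that type, in the same order as the bullet list in the return description.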

function

order(info: InfoFunc, error: ErrorFunc, otype: OtypeData, oslots: OslotsData, levels: LevelsData) → OrderData

Computes order data for the canonical ordering.

The canonical ordering between nodes is defined in terms of the slots that nodes contain, and if that is not decisive, the rank of the node type is taken into account, and if that is still not decisive, the node itself is taken into account.

Parameters
----------
info: function
    Method to write informational messages to the console.
error: function
    Method to write error messages to the console.
otype: iterable
    The data of the `otype` feature.
oslots: iterable
    The data of the `oslots` feature.
levels: tuple
    The data of the `levels` pre-computation step.

Returns
-------
tuple
    All nodes, slot and nonslot, in canonical order.

See Also
--------
cfabric.nodes: canonical ordering

Parameters

info: InfoFunc
error: ErrorFunc
otype: OtypeData
oslots: OslotsData
levels: LevelsData
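
The three-stage tie-breaking described for `order` maps naturally onto a composite sort key. The sketch below is only an approximation: `order_sketch` is hypothetical, and the slot comparison it uses (earlier first slot first, then later last slot first, so that embedders precede what they embed) is an assumption that only makes sense for nodes without gaps. The actual slot comparison is the one defined in `cfabric.nodes`.

```python
def order_sketch(otype, slots_of, levels):
    """Sort nodes with a three-part key (hypothetical helper, not the cfabric API).

    otype: dict mapping a node to its type name.
    slots_of: dict mapping a node to the sorted tuple of slots it occupies.
    levels: tuple of rows whose first member is a type name, biggest type first.
    """
    type_level = {row[0]: position for position, row in enumerate(levels)}

    def key(node):
        slots = slots_of[node]
        # 1. slots (assumed comparison), 2. rank of the node type, 3. the node itself.
        return (slots[0], -slots[-1], type_level[otype[node]], node)

    return tuple(sorted(slots_of, key=key))


# Assumed toy corpus: four words, two phrases, one sentence.
toy_otype = {1: "word", 2: "word", 3: "word", 4: "word",
             5: "phrase", 6: "phrase", 7: "sentence"}
toy_slots = {1: (1,), 2: (2,), 3: (3,), 4: (4,), 5: (1, 2), 6: (3, 4), 7: (1, 2, 3, 4)}
toy_levels = (("sentence", 4.0, 7, 7), ("phrase", 2.0, 5, 6), ("word", 1.0, 1, 4))
toy_order = order_sketch(toy_otype, toy_slots, toy_levels)
```

In the toy result the sentence comes first, then the first phrase and its words, then the second phrase and its words.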

function

rank(info: InfoFunc, error: ErrorFunc, otype: OtypeData, order: OrderData) → RankData

Computes rank data.

The rank of a node is its place among the other nodes in the canonical order (see `cfabric.nodes`).

Parameters
----------
info: function
    Method to write informational messages to the console.
error: function
    Method to write error messages to the console.
otype: iterable
    The data of the `otype` feature.
order: tuple
    The data of the `order` pre-computation step.

Returns
-------
tuple
    The ranks of all nodes, slot and nonslot, with respect to the canonical order.

Parameters

info: InfoFunc
error: ErrorFunc
otype: OtypeData
order: OrderData
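
Since the rank of a node is its position in the canonical order, rank data is in effect the inverse of order data. A minimal sketch, assuming nodes are numbered `1..N` and ranks start at `0` (both are assumptions of this example, as is the helper name `rank_sketch`):

```python
def rank_sketch(order):
    """Invert a canonical ordering into ranks (hypothetical helper).

    order: tuple of all nodes 1..N in canonical order.
    Returns a tuple whose member at index node - 1 is the rank of that node.
    """
    ranks = [0] * len(order)
    for position, node in enumerate(order):
        ranks[node - 1] = position
    return tuple(ranks)


# Assumed toy ordering of seven nodes.
toy_rank = rank_sketch((7, 5, 1, 2, 6, 3, 4))
```

With ranks available, comparing two nodes canonically becomes a comparison of two integers, which is what the other pre-computation steps rely on when they sort by `rank`.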

function

sections(info: InfoFunc, error: ErrorFunc, otype: OtypeData, oslots: OslotsData, otext: OtextData, levUp: LevUpData, levDown: LevDownData, levels: LevelsData, sFeats: dict[(int, Any)] = ()) → SectionsResult

Computes section data.

CF datasets may define up to three section levels, roughly corresponding with a volume, a chapter, a paragraph. If the corpus has a richer section structure, it is also possible to define a different, more flexible and more extensive nest of structural sections. See `structure`.

CF must be able to go from sections at one level to the sections at one level lower. It must also be able to map section headings to nodes. For this, the section features are needed, since they contain the section headings.

We also map the sections to sequence numbers and back, at each level, e.g. in the Hebrew Bible `Genesis` is mapped to 1, `Exodus` to 2, etc. We also do this for integer-valued components, and we make sure that the first section at each level gets sequence number `1`.

Parameters
----------
info: function
    Method to write informational messages to the console.
error: function
    Method to write error messages to the console.
otype: iterable
    The data of the `otype` feature.
oslots: iterable
    The data of the `oslots` feature.
otext: iterable
    The data of the `otext` feature.
levUp: tuple
    The data of the `levUp` pre-computation step.
levDown: tuple
    The data of the `levDown` pre-computation step.
levels: tuple
    The data of the `levels` pre-computation step.
sFeats: iterable
    Each `sFeat` is the data of a section feature.

Returns
-------
dict
    We have the following items:

    * `sec1`: Mapping from section-level-1 nodes to mappings from section-level-2 headings to section-level-2 nodes.
    * `sec2`: Mapping from section-level-1 nodes to mappings from section-level-2 headings to mappings from section-level-3 headings to section-level-3 nodes.
    * `seqFromNode`: Mapping from tuples of section nodes to tuples of sequence numbers. Only if there are precisely 3 section levels, otherwise this is an empty dictionary.
    * `nodeFromSeq`: Mapping from tuples of section sequence numbers to tuples of nodes. Only if there are precisely 3 section levels, otherwise this is an empty dictionary.

Warnings
--------
Note that the terms `book`, `chapter`, `verse` are not baked into CF. It is the corpus data, especially the `otext` configuration feature, that spells out the names of the sections.

Parameters

info: InfoFunc
error: ErrorFunc
otype: OtypeData
oslots: OslotsData
otext: OtextData
levUp: LevUpData
levDown: LevDownData
levels: LevelsData
sFeats: dict[(int, Any)]= ()
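
The nesting of `sec1` and `sec2`, and the mapping of headings to sequence numbers, may be easier to see in a small example. The helpers `sections_sketch` and `seq_numbers_sketch`, the row layout and the node numbers below are all hypothetical; the real function derives this information from `otype`, `oslots`, `levUp`, `levDown` and the section features.

```python
def sections_sketch(rows):
    """Build sec1 and sec2 style mappings (hypothetical helper).

    rows: iterable of (level1_node, level2_heading, level2_node,
          level3_heading, level3_node), one row per level-3 section.
    """
    sec1, sec2 = {}, {}
    for n1, h2, n2, h3, n3 in rows:
        sec1.setdefault(n1, {})[h2] = n2
        sec2.setdefault(n1, {}).setdefault(h2, {})[h3] = n3
    return sec1, sec2


def seq_numbers_sketch(headings):
    """Number headings 1, 2, ... in order of first appearance (hypothetical)."""
    seq = {}
    for heading in headings:
        seq.setdefault(heading, len(seq) + 1)
    return seq


# Assumed toy data: one level-1 node (100) with two level-2 sections.
toy_rows = [
    (100, 1, 110, 1, 120),
    (100, 1, 110, 2, 121),
    (100, 2, 111, 1, 122),
]
sec1, sec2 = sections_sketch(toy_rows)
book_seq = seq_numbers_sketch(["Genesis", "Exodus", "Genesis"])
```

Looking up `sec2[100][1][2]` walks from a level-1 node via a level-2 heading and a level-3 heading to a level-3 node, which is the kind of navigation the description of `sections` asks for.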

function

sectionsFromApi(api: Api, sectionTypes: list[str], sectionFeats: list[str]) → SectionsResult | None

Compute sections data using API methods.

This is an alternative to `sections()` that works with the high-level API rather than raw data structures. Used when loading from .cfm format.

Parameters
----------
api : Api
    The CF API object with F, L, Fs attributes
sectionTypes : list
    Section type names, e.g. ['book', 'chapter', 'verse']
sectionFeats : list
    Section feature names, e.g. ['book', 'chapter', 'verse']

Returns
-------
dict
    Same structure as sections(): {sec1, sec2, seqFromNode, nodeFromSeq}

Parameters

api: Api
sectionTypes: list[str]
sectionFeats: list[str]

function

structure(info: InfoFunc, error: ErrorFunc, otype: OtypeData, oslots: OslotsData, otext: OtextData, rank: RankData, levUp: LevUpData, sFeats: dict[(int, Any)] = ()) → StructureResult | tuple[(dict[(Any, Any)], dict[(Any, Any)])]

Computes structure data.

If the corpus has a rich section structure, it is possible to define a flexible and extensive nest of structural sections.

Independent of this, CF datasets may also define up to three section levels, roughly corresponding with a volume, a chapter, a paragraph. See `sections`.

CF must be able to go from sections at one level to the sections at one level lower. It must also be able to map section headings to nodes. For this, the section features are needed, since they contain the section headings.

Parameters
----------
info: function
    Method to write informational messages to the console.
error: function
    Method to write error messages to the console.
otype: iterable
    The data of the `otype` feature.
oslots: iterable
    The data of the `oslots` feature.
otext: iterable
    The data of the `otext` feature.
rank: tuple
    The data of the `rank` pre-computation step.
levUp: tuple
    The data of the `levUp` pre-computation step.
sFeats: iterable
    Each `sFeat` is the data of a section feature.

Returns
-------
tuple
    * `headingFromNode` (Mapping from nodes to section keys)
    * `nodeFromHeading` (Mapping from section keys to nodes)
    * `multiple`
    * `top`
    * `up`
    * `down`

Notes
-----
A section key of a structural node is obtained by going a level up from that node, retrieving the heading of that structural node, then going up again, and so on till a top node is reached. The tuple of headings obtained in this way is the section key.

Parameters

info: InfoFunc
error: ErrorFunc
otype: OtypeData
oslots: OslotsData
otext: OtextData
rank: RankData
levUp: LevUpData
sFeats: dict[(int, Any)]= ()
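
The construction of a section key, as described in the notes for `structure`, amounts to walking up the structural hierarchy and collecting headings. The sketch below makes several assumptions: `section_key_sketch` is a hypothetical helper, `up` is a plain dictionary from a structural node to its parent, the key includes the heading of the starting node itself, and the headings are listed from the top node downwards (the description does not state the order).

```python
def section_key_sketch(node, up, heading):
    """Collect headings from a node up to its top node (hypothetical helper).

    up: dict mapping a structural node to its parent; top nodes are absent.
    heading: dict mapping a structural node to its heading.
    """
    key = []
    current = node
    while current is not None:
        key.append(heading[current])
        current = up.get(current)
    # Assumed orientation: top-level heading first.
    return tuple(reversed(key))


# Assumed toy structure: node 10 is a top node, 20 sits inside it, 30 inside 20.
toy_up = {30: 20, 20: 10}
toy_heading = {10: "A", 20: "2", 30: "5"}
heading_from_node = {n: section_key_sketch(n, toy_up, toy_heading) for n in toy_heading}
node_from_heading = {key: n for n, key in heading_from_node.items()}
```

The two dictionaries at the end play the roles of `headingFromNode` and `nodeFromHeading` in the returned tuple, one being the inverse of the other.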
