string_pool

String pool management for string-valued features.

This module provides efficient storage for string-valued features using integer indices into a shared string pool. This approach minimizes memory usage when many nodes share the same string values.
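The idea can be illustrated with a minimal sketch. It is not the module's own code, and the helper name `intern_strings` is hypothetical: each distinct string is stored once, and every node keeps only a small integer that points at it.

```python
import numpy as np

def intern_strings(values):
    """Hypothetical helper: store each distinct string once.

    Returns (strings, indices), where indices[i] is the position in
    `strings` of the i-th input value.
    """
    seen = {}      # string -> position in the pool (insertion order)
    indices = []
    for v in values:
        if v not in seen:
            seen[v] = len(seen)
        indices.append(seen[v])
    strings = np.array(list(seen), dtype=object)
    return strings, np.array(indices, dtype=np.uint32)

node_values = ["noun", "verb", "noun", "noun", "verb"]
strings, indices = intern_strings(node_values)

# Looking a value up again is a single array index per node.
restored = [strings[i] for i in indices]
```

With five nodes and two distinct values, the pool holds two strings and five 4-byte indices; the saving grows with the number of nodes that share a value.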

Classes

class IntFeatureArray

Integer feature storage. A dense array with a sentinel for missing values.

Attributes

Name	Type	Description
MISSING	—	Sentinel value (-1) indicating that a node has no value.
values	np.ndarray	Array of integer values (dtype=int32); entries equal to MISSING indicate no value.
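A rough sketch of the dense-array-with-sentinel layout follows. The functions `build_dense` and `lookup` are hypothetical stand-ins for `from_dict` and `get`, and the choice to store node `n` at slot `n - 1` is an assumption; the real class may lay out its array differently.

```python
import numpy as np

MISSING = -1  # sentinel value named in the class description

def build_dense(data, max_node):
    """Hypothetical stand-in for from_dict (assumes slot n-1 holds node n)."""
    values = np.full(max_node, MISSING, dtype=np.int32)
    for node, value in data.items():
        if value is not None:
            values[node - 1] = value
    return values

def lookup(values, node):
    """Hypothetical stand-in for get: None when the slot holds the sentinel."""
    v = int(values[node - 1])
    return None if v == MISSING else v

values = build_dense({1: 7, 3: 2, 4: None}, max_node=5)
```

One consequence of a sentinel design is that -1 itself cannot be stored as a real feature value.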

Methods

__init__(self, values: NDArray[]) → None

Initialize an IntFeatureArray.

Parameters

values: NDArray[]

filter_by_value(self, nodes: list[int] | range, value: int) → NDArray[]

Vectorized filter: return nodes where feature equals value.

Parameters

nodes: list[int] | range
value: int
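As an illustration of what "vectorized" can mean here, the sketch below compares a whole slice of the array in one numpy operation instead of looping over nodes. The function name `filter_equal` and the `node - 1` offset are assumptions, not the library's code.

```python
import numpy as np

# Values for nodes 1..5; -1 marks a missing value.
values = np.array([7, -1, 2, 7, -1], dtype=np.int32)

def filter_equal(values, nodes, value):
    """Return the 1-indexed nodes whose stored value equals `value`."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    mask = values[node_arr - 1] == value   # one comparison over all nodes
    return node_arr[mask]

hits = filter_equal(values, range(1, 6), 7)
subset_hits = filter_equal(values, [2, 3, 4], 7)
```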

filter_by_values(self, nodes: list[int] | range, values: set[int]) → NDArray[]

Vectorized filter: return nodes where feature is in values set.

Parameters

nodes: list[int] | range
values: set[int]
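Set membership can be vectorized with `np.isin`. The sketch below is an assumption about how such a filter might look, with the hypothetical name `filter_in`; note that the set is converted to an array first, because `np.isin` does not treat a Python set as a sequence of elements.

```python
import numpy as np

# Values for nodes 1..5; -1 marks a missing value.
values = np.array([7, -1, 2, 7, 5], dtype=np.int32)

def filter_in(values, nodes, wanted):
    """Return the 1-indexed nodes whose stored value is in the set `wanted`."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    # Convert the set to an array before calling np.isin.
    wanted_arr = np.fromiter(wanted, dtype=np.int32, count=len(wanted))
    mask = np.isin(values[node_arr - 1], wanted_arr)
    return node_arr[mask]

hits = filter_in(values, range(1, 6), {2, 5})
```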

filter_greater_than(self, nodes: list[int] | range, threshold: int) → NDArray[]

Vectorized filter: return nodes where value > threshold.

Parameters

nodes: list[int] | range
threshold: int

filter_has_value(self, nodes: list[int] | range) → NDArray[]

Vectorized filter: return nodes that have any value.

Parameters

nodes: list[int] | range

filter_less_than(self, nodes: list[int] | range, threshold: int) → NDArray[]

Vectorized filter: return nodes where value < threshold.

Parameters

nodes: list[int] | range
threshold: int

filter_missing_value(self, nodes: list[int] | range) → NDArray[]

Vectorized filter: return nodes that have no value.

Parameters

nodes: list[int] | range

from_dict(cls, data: dict[int, int | None], max_node: int) → IntFeatureArray

Build from node->int dict.

Parameters

cls
data: dict[int, int | None]
max_node: int

get(self, node: int) → int | None

Get int value for node (1-indexed).

Parameters

node: int

get_frequency_counts(self) → dict[int, int]

Get frequency counts of all values using vectorized numpy operations.
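One plausible way to count values without a Python loop over nodes is `np.unique` with `return_counts=True`. The sketch below is an assumption; in particular, whether the real method excludes the MISSING sentinel from the result is not stated here, so this version drops it explicitly.

```python
import numpy as np

MISSING = -1
values = np.array([7, -1, 2, 7, -1, 7], dtype=np.int32)

# Assumption: missing entries are not counted as a value.
present = values[values != MISSING]
uniq, counts = np.unique(present, return_counts=True)

# Convert numpy scalars to plain ints for a clean dict.
freq = {int(u): int(c) for u, c in zip(uniq, counts)}
```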

items(self) → Iterator[tuple[int, int]]

Iterate over (node, value) pairs efficiently using numpy.

load(cls, path: str, mmap_mode: str = 'r') → IntFeatureArray

Load from .npy file.

Parameters

cls
path: str
mmap_mode: str = 'r'
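The `mmap_mode` parameter matches the argument of the same name on `np.load`, so loading presumably memory-maps the file rather than reading it fully into RAM. The sketch below shows that round trip with plain numpy calls; the file name is illustrative and this is not the class's own implementation.

```python
import os
import tempfile
import numpy as np

values = np.array([7, -1, 2], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature.npy")   # illustrative file name
    np.save(path, values)

    # mmap_mode="r" maps the file read-only; pages are read on demand.
    loaded = np.load(path, mmap_mode="r")
    is_memmap = isinstance(loaded, np.memmap)
    roundtrip = loaded.tolist()

    del loaded  # release the mapping before the directory is removed
```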

save(self, path: str) → None

Save to .npy file.

Parameters

path: str

to_dict(self) → dict[int, int]

Convert to dict efficiently.

class StringPool

Efficient string storage with integer indices. Uses numpy object arrays, which support copy-on-write sharing.

Attributes

Name	Type	Description
indices	np.ndarray	Per-node index into the strings array (dtype=uint32); MISSING_STR_INDEX indicates no value.
strings	np.ndarray	Array of unique strings (dtype=object).

Methods

__init__(self, strings: NDArray[], indices: NDArray[]) → None

Initialize a StringPool.

Parameters

strings: NDArray[]
indices: NDArray[]

filter_by_value(self, nodes: list[int] | range, value: str) → NDArray[]

Vectorized filter: return nodes where feature equals value.

Parameters

nodes: list[int] | range
value: str
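With a string pool, an equality filter does not need to compare strings per node: the value can be resolved to its pool index once, after which the filter is an integer comparison. The sketch below illustrates that idea under stated assumptions: `filter_equal_str` is a hypothetical name, the sentinel is assumed to be the largest uint32, and node `n` is assumed to live at slot `n - 1`.

```python
import numpy as np

strings = np.array(["noun", "verb"], dtype=object)

# Assumed sentinel; the real MISSING_STR_INDEX may have a different value.
MISSING_STR_INDEX = np.iinfo(np.uint32).max
indices = np.array([0, 1, MISSING_STR_INDEX, 0], dtype=np.uint32)

def filter_equal_str(strings, indices, nodes, value):
    """Return the 1-indexed nodes whose string equals `value`."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    # Resolve the string to its pool index once.
    index_of = {s: i for i, s in enumerate(strings)}
    idx = index_of.get(value)
    if idx is None:
        return node_arr[:0]   # value not in the pool: nothing can match
    mask = indices[node_arr - 1] == idx   # integer comparison only
    return node_arr[mask]

hits = filter_equal_str(strings, indices, range(1, 5), "noun")
no_hits = filter_equal_str(strings, indices, range(1, 5), "adj")
```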

filter_by_values(self, nodes: list[int] | range, values: set[str]) → NDArray[]

Vectorized filter: return nodes where feature is in values set.

Parameters

nodes: list[int] | range
values: set[str]

filter_has_value(self, nodes: list[int] | range) → NDArray[]

Vectorized filter: return nodes that have any value.

Parameters

nodes: list[int] | range

filter_missing_value(self, nodes: list[int] | range) → NDArray[]

Vectorized filter: return nodes that have no value.

Parameters

nodes: list[int] | range

from_dict(cls, data: dict[int, str], max_node: int) → StringPool

Build string pool from node->string dict.

Parameters

cls
data: dict[int, str]
max_node: int

get(self, node: int) → str | None

Get string value for node (1-indexed).

Parameters

node: int

get_frequency_counts(self) → dict[str, int]

Get frequency counts of all values using vectorized numpy operations.
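Because every node stores an integer index, string frequencies can be obtained by counting indices and mapping the counts back to the pool. The following is a sketch of that approach using `np.bincount`, not a description of the actual method body; the sentinel value and the decision to skip missing entries are assumptions.

```python
import numpy as np

strings = np.array(["noun", "verb", "adj"], dtype=object)

# Assumed sentinel; the real MISSING_STR_INDEX may have a different value.
MISSING_STR_INDEX = np.iinfo(np.uint32).max
indices = np.array([0, 1, MISSING_STR_INDEX, 0, 0], dtype=np.uint32)

# Drop missing entries, then count how often each pool index occurs.
present = indices[indices != MISSING_STR_INDEX].astype(np.int64)
counts = np.bincount(present, minlength=len(strings))

# Map counts back to strings, skipping pool entries that no node uses.
freq = {str(strings[i]): int(c) for i, c in enumerate(counts) if c > 0}
```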

get_value_index(self, value: str) → int | None

Get the internal index for a string value.

Parameters

value: str

items(self) → Iterator[tuple[int, str]]

Iterate over (node, value) pairs efficiently using numpy.

load(cls, path_prefix: str, mmap_mode: str = 'r') → StringPool

Load from files.

Parameters

cls
path_prefix: str
mmap_mode: str = 'r'

save(self, path_prefix: str) → None

Save to {path_prefix}_strings.npy and {path_prefix}_idx.npy.

Parameters

path_prefix: str
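The two-file layout can be sketched with plain numpy calls. This is an assumption about how such a pair of files might be written and read back, not the class's own code: object arrays are pickled by `np.save`, so reading them needs `allow_pickle=True` and they cannot be memory-mapped, while the uint32 index array can be.

```python
import os
import tempfile
import numpy as np

strings = np.array(["noun", "verb"], dtype=object)
indices = np.array([0, 1, 0], dtype=np.uint32)

with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "pos")   # illustrative path_prefix

    # File names follow the pattern given in the save() description.
    np.save(f"{prefix}_strings.npy", strings, allow_pickle=True)
    np.save(f"{prefix}_idx.npy", indices)

    # Object arrays are unpickled in full; the index array is mapped.
    loaded_strings = np.load(f"{prefix}_strings.npy", allow_pickle=True)
    loaded_idx = np.load(f"{prefix}_idx.npy", mmap_mode="r")

    restored = [loaded_strings[i] for i in loaded_idx.tolist()]
    del loaded_idx  # release the mapping before the directory is removed
```

Only load pickled arrays from files you trust, since unpickling can execute arbitrary code.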

to_dict(self) → dict[int, str]

Convert to dict efficiently.
