string_pool
String pool management for string-valued features.
This module provides efficient storage for string-valued features using integer indices into a shared string pool. This approach minimizes memory usage when many nodes share the same string values.
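The idea of indexing into a shared pool can be illustrated with a small sketch. This is not the module's implementation: the sample data, the `lookup` helper, and the use of `-1` as the "no value" marker are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical data: node number -> string value (nodes are 1-indexed).
data = {1: "noun", 2: "verb", 3: "noun", 5: "noun"}
max_node = 5

# Assumption: -1 marks a node without a value.
MISSING = -1

# Each distinct string is stored exactly once in the pool ...
strings = np.array(sorted(set(data.values())), dtype=object)
position = {s: i for i, s in enumerate(strings)}

# ... and every node holds only a small integer pointing into the pool.
indices = np.full(max_node, MISSING, dtype=np.int32)
for node, value in data.items():
    indices[node - 1] = position[value]


def lookup(node):
    """Return the string for a 1-indexed node, or None if it has no value."""
    idx = indices[node - 1]
    return None if idx == MISSING else strings[idx]
```

With three nodes sharing the value "noun", the string itself is stored once and the nodes only carry a 4-byte index each.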
Classes
IntFeatureArray
Integer feature storage.
Attributes
| Name | Type | Description |
|---|---|---|
| MISSING | — | — |
| values | — | — |
Methods
__init__(self, values: NDArray[]) → None
Initialize an IntFeatureArray.
Parameters
values: NDArray[]
filter_by_value(self, nodes: list[int] | range, value: int) → NDArray[]
Vectorized filter: return nodes where feature equals value.
Parameters
nodes: list[int] | range
value: int
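A vectorized equality filter of this kind might look roughly like the sketch below. It is a standalone function, not the class method, and it assumes that the value for node `n` sits at array position `n - 1` and that `-1` marks a missing value; neither detail is confirmed by this page.

```python
import numpy as np

MISSING = -1  # assumed sentinel for "no value"
values = np.array([3, 7, MISSING, 3, 9], dtype=np.int32)  # nodes 1..5


def filter_by_value(nodes, value):
    """Return the nodes (1-indexed) whose stored value equals `value`."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    # One fancy-indexing step and one comparison replace a Python loop.
    mask = values[node_arr - 1] == value
    return node_arr[mask]
```

Both a `range` and a plain list are accepted, since `np.asarray` converts either into an integer array.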
filter_by_values(self, nodes: list[int] | range, values: set[int]) → NDArray[]
Vectorized filter: return nodes where feature is in values set.
Parameters
nodes: list[int] | range
values: set[int]
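Set membership can be vectorized with `np.isin`. The sketch below is an assumption about how such a filter could be written, using the same hypothetical layout (position `n - 1` for node `n`, `-1` for missing) rather than the module's actual code.

```python
import numpy as np

MISSING = -1  # assumed sentinel for "no value"
values = np.array([3, 7, MISSING, 3, 9], dtype=np.int32)  # nodes 1..5


def filter_by_values(nodes, wanted):
    """Return the nodes (1-indexed) whose stored value is in `wanted`."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    # np.isin expects an array-like; a set is converted to a list first
    # because NumPy would otherwise treat it as a single object.
    wanted_arr = np.array(list(wanted), dtype=values.dtype)
    mask = np.isin(values[node_arr - 1], wanted_arr)
    return node_arr[mask]
```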
filter_greater_than(self, nodes: list[int] | range, threshold: int) → NDArray[]
Vectorized filter: return nodes where value > threshold.
Parameters
nodes: list[int] | range
threshold: int
filter_has_value(self, nodes: list[int] | range) → NDArray[]
Vectorized filter: return nodes that have any value.
Parameters
nodes: list[int] | range
filter_less_than(self, nodes: list[int] | range, threshold: int) → NDArray[]
Vectorized filter: return nodes where value < threshold.
Parameters
nodes: list[int] | range
threshold: int
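Threshold filters interact with the missing-value marker: if the marker is a small number, a naive `<` comparison would also select nodes that have no value. The sketch below shows one way to guard against that. The sentinel `-1`, the array layout, and the explicit masking are all assumptions; this page does not say how the real method treats missing values.

```python
import numpy as np

MISSING = -1  # assumed sentinel for "no value"
values = np.array([3, 7, MISSING, 3, 9], dtype=np.int32)  # nodes 1..5


def filter_less_than(nodes, threshold):
    """Return nodes (1-indexed) with a present value below `threshold`."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    selected = values[node_arr - 1]
    # Exclude the sentinel explicitly so node 3 is not reported as "< 5".
    mask = (selected != MISSING) & (selected < threshold)
    return node_arr[mask]
```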
filter_missing_value(self, nodes: list[int] | range) → NDArray[]
Vectorized filter: return nodes that have no value.
Parameters
nodes: list[int] | range
from_dict(cls, data: dict[int, int | None], max_node: int) → IntFeatureArray
Build from node->int dict.
Parameters
cls
data: dict[int, int | None]
max_node: int
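Building a dense array from a sparse node->int mapping could be sketched as follows. The function here returns a bare NumPy array rather than an `IntFeatureArray`, and the `-1` sentinel and `int32` dtype are assumptions.

```python
import numpy as np

MISSING = -1  # assumed sentinel for "no value"


def from_dict(data, max_node):
    """Build a dense array of length `max_node` from a node -> int dict."""
    values = np.full(max_node, MISSING, dtype=np.int32)
    for node, value in data.items():
        # Nodes are 1-indexed; None entries stay marked as missing.
        if value is not None:
            values[node - 1] = value
    return values
```

Nodes that do not appear in the dict, and nodes mapped to `None`, end up with the same marker.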
get(self, node: int) → int | None
Get int value for node (1-indexed).
Parameters
node: int
get_frequency_counts(self) → dict[int, int]
Get frequency counts of all values using vectorized numpy operations.
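Counting values without a Python loop is typically done with `np.unique(..., return_counts=True)`. The sketch below assumes a `-1` sentinel that is excluded from the counts; whether the real method excludes missing values is not stated here.

```python
import numpy as np

MISSING = -1  # assumed sentinel for "no value"
values = np.array([3, 7, MISSING, 3, 9], dtype=np.int32)


def get_frequency_counts(arr):
    """Return {value: count} for all present values in `arr`."""
    present = arr[arr != MISSING]
    uniq, counts = np.unique(present, return_counts=True)
    # Convert NumPy scalars to plain ints so the dict is easy to serialize.
    return {int(v): int(c) for v, c in zip(uniq, counts)}


freq = get_frequency_counts(values)
```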
items(self) → Iterator[tuple[int, int]]
Iterate over (node, value) pairs efficiently using numpy.
load(cls, path: str, mmap_mode: str = 'r') → IntFeatureArray
Load from .npy file.
Parameters
cls
path: str
mmap_mode: str = 'r'
save(self, path: str) → None
Save to .npy file.
Parameters
path: str
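The `mmap_mode` parameter of `load` matches the parameter of the same name on `np.load`, so a save/load round trip presumably resembles the sketch below. The file name is hypothetical, and the sketch works on a bare array rather than an `IntFeatureArray`.

```python
import os
import tempfile

import numpy as np

values = np.array([3, 7, -1, 3, 9], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature.npy")  # hypothetical file name
    np.save(path, values)

    # mmap_mode="r" maps the file read-only instead of reading it into RAM.
    loaded = np.load(path, mmap_mode="r")
    is_memmap = isinstance(loaded, np.memmap)
    roundtrip = loaded.tolist()

    # Release the mapping before the temporary directory is removed.
    del loaded
```

Memory mapping means only the pages that are actually touched get read from disk, which helps when the array covers many nodes.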
to_dict(self) → dict[int, int]
Convert to dict efficiently.
StringPool
Efficient string storage with integer indices.
Attributes
| Name | Type | Description |
|---|---|---|
| indices | — | — |
| strings | — | — |
Methods
__init__(self, strings: NDArray[], indices: NDArray[]) → None
Initialize a StringPool.
Parameters
strings: NDArray[]
indices: NDArray[]
filter_by_value(self, nodes: list[int] | range, value: str) → NDArray[]
Vectorized filter: return nodes where feature equals value.
Parameters
nodes: list[int] | range
value: str
filter_by_values(self, nodes: list[int] | range, values: set[str]) → NDArray[]
Vectorized filter: return nodes where feature is in values set.
Parameters
nodes: list[int] | range
values: set[str]
filter_has_value(self, nodes: list[int] | range) → NDArray[]
Vectorized filter: return nodes that have any value.
Parameters
nodes: list[int] | range
filter_missing_value(self, nodes: list[int] | range) → NDArray[]
Vectorized filter: return nodes that have no value.
Parameters
nodes: list[int] | range
from_dict(cls, data: dict[int, str], max_node: int) → StringPool
Build string pool from node->string dict.
Parameters
cls
data: dict[int, str]
max_node: int
get(self, node: int) → str | None
Get string value for node (1-indexed).
Parameters
node: int
get_frequency_counts(self) → dict[str, int]
Get frequency counts of all values using vectorized numpy operations.
get_value_index(self, value: str) → int | None
Get the internal index for a string value.
Parameters
value: str
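A value-to-index lookup is what lets a string filter run as an integer comparison: the string is translated once, and the per-node work is done on the index array. The sketch below illustrates that idea with standalone functions. The dict-based lookup, the `-1` sentinel, and the array layout are assumptions, not the module's confirmed internals.

```python
import numpy as np

MISSING = -1  # assumed sentinel for "no value"
strings = np.array(["adj", "noun", "verb"], dtype=object)  # the pool
indices = np.array([1, 2, MISSING, 1, 0], dtype=np.int32)  # nodes 1..5

# Hypothetical helper table: string -> position in the pool.
_index_of = {s: i for i, s in enumerate(strings)}


def get_value_index(value):
    """Return the pool index of `value`, or None if it is not in the pool."""
    return _index_of.get(value)


def filter_by_value(nodes, value):
    """Return the nodes (1-indexed) whose string value equals `value`."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    idx = get_value_index(value)
    if idx is None:
        # Unknown string: no node can match, so skip the array scan.
        return node_arr[:0]
    return node_arr[indices[node_arr - 1] == idx]
```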
items(self) → Iterator[tuple[int, str]]
Iterate over (node, value) pairs efficiently using numpy.
load(cls, path_prefix: str, mmap_mode: str = 'r') → StringPool
Load from files.
Parameters
cls
path_prefix: str
mmap_mode: str = 'r'
save(self, path_prefix: str) → None
Save to {path_prefix}_strings.npy and {path_prefix}_idx.npy.
Parameters
path_prefix: str
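The two-file layout named above can be sketched with plain `np.save` and `np.load` calls. The helper names `save_pool` and `load_pool` are hypothetical, as is the choice of a fixed-width unicode dtype for the strings (chosen here because object arrays cannot be memory-mapped); the module may store its strings differently.

```python
import os
import tempfile

import numpy as np

strings = np.array(["adj", "noun", "verb"])  # fixed-width unicode dtype
indices = np.array([1, 2, -1, 1, 0], dtype=np.int32)  # -1: assumed "missing"


def save_pool(path_prefix, strings, indices):
    """Write the pool and the per-node indices to two .npy files."""
    np.save(f"{path_prefix}_strings.npy", strings)
    np.save(f"{path_prefix}_idx.npy", indices)


def load_pool(path_prefix, mmap_mode="r"):
    """Load both arrays, memory-mapped by default."""
    loaded_strings = np.load(f"{path_prefix}_strings.npy", mmap_mode=mmap_mode)
    loaded_indices = np.load(f"{path_prefix}_idx.npy", mmap_mode=mmap_mode)
    return loaded_strings, loaded_indices


with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "pos")  # hypothetical prefix
    save_pool(prefix, strings, indices)
    files = sorted(os.listdir(tmp))

    pool, idx = load_pool(prefix)
    restored = [None if i < 0 else str(pool[i]) for i in idx]

    # Release the mappings before the temporary directory is removed.
    del pool, idx
```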
to_dict(self) → dict[int, str]
Convert to dict efficiently.