Documentation

string_pool

String pool management for string-valued features.

This module provides efficient storage for string-valued features using integer indices into a shared string pool. This approach minimizes memory usage when many nodes share the same string values.
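The idea can be illustrated with a minimal standalone sketch. It does not use this module's classes; the sample data, the `lookup` dict, and the sentinel value chosen for `MISSING_STR_INDEX` are assumptions made for illustration only.

```python
import numpy as np

# Hypothetical per-node string values; None marks a node without a value.
raw = ["noun", "verb", "noun", None, "verb", "noun"]

# Store each distinct string once; every node then holds only a small integer.
strings = np.array(sorted({s for s in raw if s is not None}), dtype=object)
lookup = {s: i for i, s in enumerate(strings)}

# Assumed sentinel for "no value"; the module's actual constant may differ.
MISSING_STR_INDEX = np.iinfo(np.uint32).max
indices = np.array(
    [lookup[s] if s is not None else MISSING_STR_INDEX for s in raw],
    dtype=np.uint32,
)

# Decode back to strings to confirm the round trip.
decoded = [strings[i] if i != MISSING_STR_INDEX else None for i in indices]
```

Here six nodes share two stored strings, so the per-node cost is one 4-byte index rather than a separate string object.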

Classes

class

IntFeatureArray

Integer feature storage.

Dense array with a sentinel for missing values. The values attribute is an np.ndarray of integer values (dtype=int32); MISSING (-1) indicates no value.
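A rough sketch of the sentinel layout follows. The `get` helper is a standalone illustration, not the class method, and the mapping of node n to offset n - 1 is an assumption based on the 1-indexed node numbering mentioned for `get`.

```python
import numpy as np

MISSING = -1  # sentinel value stated in the class description

# Nodes 1..5; nodes 2 and 4 have no value.
values = np.array([7, MISSING, 3, MISSING, 7], dtype=np.int32)


def get(values, node):
    """Return the value for a 1-indexed node, or None when it is missing."""
    v = int(values[node - 1])  # assumption: node n is stored at offset n - 1
    return None if v == MISSING else v
```

One consequence of a sentinel design is that -1 itself cannot be stored as a real feature value.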

Attributes

  • MISSING: sentinel (-1) marking a node with no value
  • values: np.ndarray of integer values (dtype=int32)

Methods

__init__(self, values: NDArray[]) -> None

Initialize an IntFeatureArray.

Parameters
  • values: NDArray[]
filter_by_value(self, nodes: list[int] | range, value: int) -> NDArray[]

Vectorized filter: return nodes where feature equals value.

Parameters
  • nodes: list[int] | range
  • value: int
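A vectorized equality filter of this kind might look roughly like the sketch below. It is a standalone function over a plain array, assuming node n is stored at offset n - 1; the library's actual implementation may differ.

```python
import numpy as np

values = np.array([7, -1, 3, -1, 7], dtype=np.int32)  # nodes 1..5


def filter_by_value(values, nodes, value):
    """Return the nodes (1-indexed) whose stored value equals `value`."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    mask = values[node_arr - 1] == value  # one comparison over all nodes
    return node_arr[mask]


matches = filter_by_value(values, range(1, 6), 7)
```

The comparison runs once over the whole selection instead of once per node in a Python loop.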
filter_by_values(self, nodes: list[int] | range, values: set[int]) -> NDArray[]

Vectorized filter: return nodes where feature is in values set.

Parameters
  • nodes: list[int] | range
  • values: set[int]
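Set membership can be vectorized with `np.isin`, as in this hedged standalone sketch (the function and sample data are illustrative, not the library's code).

```python
import numpy as np

values = np.array([7, -1, 3, -1, 7, 9], dtype=np.int32)  # nodes 1..6


def filter_by_values(values, nodes, wanted):
    """Return the nodes (1-indexed) whose value is a member of `wanted`."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    # np.isin expects an array-like; a Python set must be converted first.
    mask = np.isin(values[node_arr - 1], list(wanted))
    return node_arr[mask]


result = filter_by_values(values, range(1, 7), {3, 9})
```

Passing the set directly to `np.isin` would treat it as a single object rather than a collection, which is why it is converted to a list.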
filter_greater_than(self, nodes: list[int] | range, threshold: int) -> NDArray[]

Vectorized filter: return nodes where value > threshold.

Parameters
  • nodes: list[int] | range
  • threshold: int
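With a -1 sentinel, a threshold comparison needs care: a threshold below -1 would otherwise match nodes that have no value. The sketch below masks missing entries explicitly. Whether the library does the same is an assumption, not something this page states.

```python
import numpy as np

MISSING = -1
values = np.array([7, MISSING, 3, MISSING, 7], dtype=np.int32)  # nodes 1..5


def filter_greater_than(values, nodes, threshold):
    """Return nodes whose value exceeds `threshold`, skipping missing ones."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    vals = values[node_arr - 1]
    mask = (vals != MISSING) & (vals > threshold)
    return node_arr[mask]
```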
filter_has_value(self, nodes: list[int] | range) -> NDArray[]

Vectorized filter: return nodes that have any value.

Parameters
  • nodes: list[int] | range
filter_less_than(self, nodes: list[int] | range, threshold: int) -> NDArray[]

Vectorized filter: return nodes where value < threshold.

Parameters
  • nodes: list[int] | range
  • threshold: int
filter_missing_value(self, nodes: list[int] | range) -> NDArray[]

Vectorized filter: return nodes that have no value.

Parameters
  • nodes: list[int] | range
from_dict(cls, data: dict[int, int | None], max_node: int) -> IntFeatureArray

Build from node->int dict.

Parameters
  • cls
  • data: dict[int, int | None]
  • max_node: int
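Building the dense array from a sparse dict could be sketched as follows. This standalone function returns a bare array rather than an IntFeatureArray, and it assumes that `max_node` sets the array length and that None entries are stored as the sentinel.

```python
import numpy as np

MISSING = -1


def from_dict(data, max_node):
    """Build a dense int32 array of length max_node from a node -> int dict."""
    values = np.full(max_node, MISSING, dtype=np.int32)  # start all-missing
    for node, v in data.items():
        if v is not None:
            values[node - 1] = v  # assumption: node n is stored at offset n - 1
    return values


dense = from_dict({1: 7, 3: 3, 4: None}, max_node=5)
```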
get(self, node: int) -> int | None

Get int value for node (1-indexed).

Parameters
  • node: int
get_frequency_counts(self) -> dict[int, int]

Get frequency counts of all values using vectorized numpy operations.
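One plausible way to count values with vectorized numpy operations is `np.unique` with `return_counts=True`. The sketch below excludes the sentinel from the counts, which is an assumption about the intended behavior.

```python
import numpy as np

MISSING = -1
values = np.array([7, MISSING, 3, MISSING, 7], dtype=np.int32)

# Drop missing entries, then count each distinct value in one pass.
present = values[values != MISSING]
uniq, counts = np.unique(present, return_counts=True)
freq = {int(u): int(c) for u, c in zip(uniq, counts)}
```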

items(self) -> Iterator[tuple[int, int]]

Iterate over (node, value) pairs efficiently using numpy.

load(cls, path: str, mmap_mode: str = 'r') -> IntFeatureArray

Load from .npy file.

Parameters
  • cls
  • path: str
  • mmap_mode: str = 'r'
save(self, path: str) -> None

Save to .npy file.

Parameters
  • path: str
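The underlying numpy round trip can be sketched with `np.save` and `np.load`. The file name is hypothetical, and this shows only the numpy calls, not the class methods themselves.

```python
import os
import tempfile

import numpy as np

values = np.array([7, -1, 3, -1, 7], dtype=np.int32)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "feature.npy")  # hypothetical file name
    np.save(path, values)

    # mmap_mode="r" maps the file read-only instead of reading it all into memory.
    loaded = np.load(path, mmap_mode="r")
    is_memmap = isinstance(loaded, np.memmap)
    restored = np.array(loaded)  # copy out before the temporary file is removed
    del loaded
```

With a memory-mapped load, pages are read from disk on demand, which keeps startup cheap for large arrays.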
to_dict(self) -> dict[int, int]

Convert to dict efficiently.

class

StringPool

Efficient string storage with integer indices.

Uses numpy object arrays, which support copy-on-write sharing. The strings attribute is an np.ndarray of unique strings (dtype=object). The indices attribute is an np.ndarray holding each node's index into the strings array (dtype=uint32); MISSING_STR_INDEX indicates no value.

Attributes

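  • indices: np.ndarray of per-node indices into strings (dtype=uint32); MISSING_STR_INDEX indicates no value
  • strings: np.ndarray of unique strings (dtype=object)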

Methods

__init__(self, strings: NDArray[], indices: NDArray[]) -> None

Initialize a StringPool.

Parameters
  • strings: NDArray[]
  • indices: NDArray[]
filter_by_value(self, nodes: list[int] | range, value: str) -> NDArray[]

Vectorized filter: return nodes where feature equals value.

Parameters
  • nodes: list[int] | range
  • value: str
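For string features, a filter can resolve the string to its pool index once and then compare integers, so no per-node string comparison is needed. The following is a standalone sketch under that assumption; the sentinel value and the node-to-offset mapping are illustrative choices.

```python
import numpy as np

MISSING_STR_INDEX = np.iinfo(np.uint32).max  # assumed sentinel value
strings = np.array(["noun", "verb"], dtype=object)
indices = np.array([0, 1, 0, MISSING_STR_INDEX, 1], dtype=np.uint32)  # nodes 1..5


def filter_by_value(strings, indices, nodes, value):
    """Return the nodes (1-indexed) whose string feature equals `value`."""
    node_arr = np.asarray(nodes, dtype=np.int64)
    hits = np.flatnonzero(strings == value)  # position of value in the pool
    if hits.size == 0:
        return node_arr[:0]  # value not in the pool: nothing can match
    return node_arr[indices[node_arr - 1] == hits[0]]
```

A value absent from the pool short-circuits to an empty result without touching the per-node indices.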
filter_by_values(self, nodes: list[int] | range, values: set[str]) -> NDArray[]

Vectorized filter: return nodes where feature is in values set.

Parameters
  • nodes: list[int] | range
  • values: set[str]
filter_has_value(self, nodes: list[int] | range) -> NDArray[]

Vectorized filter: return nodes that have any value.

Parameters
  • nodes: list[int] | range
filter_missing_value(self, nodes: list[int] | range) -> NDArray[]

Vectorized filter: return nodes that have no value.

Parameters
  • nodes: list[int] | range
from_dict(cls, data: dict[int, str], max_node: int) -> StringPool

Build string pool from node->string dict.

Parameters
  • cls
  • data: dict[int, str]
  • max_node: int
get(self, node: int) -> str | None

Get string value for node (1-indexed).

Parameters
  • node: int
get_frequency_counts(self) -> dict[str, int]

Get frequency counts of all values using vectorized numpy operations.

get_value_index(self, value: str) -> int | None

Get the internal index for a string value.

Parameters
  • value: str
items(self) -> Iterator[tuple[int, str]]

Iterate over (node, value) pairs efficiently using numpy.

load(cls, path_prefix: str, mmap_mode: str = 'r') -> StringPool

Load from files.

Parameters
  • cls
  • path_prefix: str
  • mmap_mode: str = 'r'
save(self, path_prefix: str) -> None

Save to {path_prefix}_strings.npy and {path_prefix}_idx.npy.

Parameters
  • path_prefix: str
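The two-file layout can be sketched with plain numpy calls. The prefix is hypothetical, and how the library itself reads the files back is not stated here; the sketch only shows what numpy requires for each dtype.

```python
import os
import tempfile

import numpy as np

strings = np.array(["noun", "verb"], dtype=object)
indices = np.array([0, 1, 0, 1], dtype=np.uint32)

with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "pos")  # hypothetical path_prefix
    np.save(f"{prefix}_strings.npy", strings)  # object arrays are pickled
    np.save(f"{prefix}_idx.npy", indices)

    # Object arrays need allow_pickle=True and cannot be memory-mapped;
    # the fixed-width index array can be.
    loaded_strings = np.load(f"{prefix}_strings.npy", allow_pickle=True)
    loaded_idx = np.load(f"{prefix}_idx.npy", mmap_mode="r")
    decoded = [loaded_strings[i] for i in loaded_idx]
    del loaded_idx
```

Because `allow_pickle=True` can execute arbitrary code while unpickling, files loaded this way should come from a trusted source.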
to_dict(self) -> dict[int, str]

Convert to dict efficiently.