API Reference

This page contains the complete API documentation for LotusFilter.

class lotf.CutoffTable(X, index, epsilon, batch_size=1000, verbose=True)[source]

Bases: object

Cutoff table for diversity-aware filtering of search results. The wrapper class for \(\{\mathcal{L}_n\}_{n=1}^N\) in the paper.

Precomputes neighbor lists within epsilon distance for efficient filtering. Used to remove similar items from search results to increase diversity.

Note

This class takes X and a faiss index as arguments, but they are only used for constructing the table and are not stored. Therefore, this class consumes only the memory required for the table after construction.

__init__(X, index, epsilon, batch_size=1000, verbose=True)[source]

Initialize CutoffTable.

Parameters:
  • X (ndarray) – 2D numpy array of dataset vectors

  • index – Faiss index for range search constructed with X. E.g., faiss.IndexFlatL2 or faiss.IndexHNSWFlat

  • epsilon (float) – distance threshold

  • batch_size (int) – batch size for large datasets

  • verbose (bool) – whether to print progress information

filter(dists, ids, final_k)[source]

Filter search results to increase diversity (L3-L7 in Algorithm 2 in the paper).

Takes candidate search results and removes similar items based on precomputed neighbor lists, selecting the most diverse final_k results.

Parameters:
  • dists (ndarray) – 2D array of candidate distances, shape (Nq, candidate_k). Typically, this is the output of a search by faiss.

  • ids (ndarray) – 2D array of candidate IDs, shape (Nq, candidate_k). Typically, this is the output of a search by faiss.

  • final_k (int) – Number of diverse results to return

Returns:

  • diverse_dists: 2D array of diverse distances, shape (Nq, final_k)

  • diverse_ids: 2D array of diverse IDs, shape (Nq, final_k)

Return type:

tuple

classmethod from_neighbor_lists(neighbor_lists, epsilon=None, N=None, D=None)[source]

Create CutoffTable directly from precomputed neighbor lists.

Useful for parameter optimization where neighbor lists are reused. Since this function is used inside optimize_epsilon, end users generally do not need to use this.

Parameters:
  • neighbor_lists (list) – list of numpy arrays, precomputed neighbor lists for each data point

  • epsilon (float | None) – distance threshold used to create the neighbor lists

  • N (int | None) – number of data points

  • D (int | None) – dimensionality of data points

Returns:

New instance created from the neighbor lists

Return type:

CutoffTable

lotf.div_score(dists, ids, X, lam)[source]

Compute diversity score for a search result (function \(f\) from Eq 2 in the paper).

Parameters:
  • dists (ndarray) – 1D array of distances from query to its top-k neighbors (squared L2 distances from faiss search). Shape: (topk,)

  • ids (ndarray) – 1D array of neighbor IDs (corresponding to the distances from faiss search). Shape: (topk,)

  • X (ndarray) – 2D array containing the original dataset vectors. Shape: (N, D)

  • lam (float) – weight parameter for the diversity term (0 = pure NN, 1 = pure diversity)

Returns:

  • total_score: Combined diversity-aware score = (1-lam) * similarity_term + lam * diversity_term

  • similarity_term: Average similarity score

  • diversity_term: Average diversity score (negative value)

Return type:

tuple

Example:

import numpy as np
X = np.random.rand(10, 3).astype(np.float32)
q = np.random.rand(1, 3).astype(np.float32)
topk = 3
dists = np.sum((q - X)**2, axis=1)
ids = np.argsort(dists)[:topk]
dists = dists[ids]

import lotf
score, sim_term, div_term = lotf.div_score(dists, ids, X, lam=0.5)
lotf.optimize_epsilon(X, Xq, index, candidate_k, final_k, lam, epsilon_max=None, verbose=True, num_iter=5, batch_size=1000, coarse_divisions=10, fine_divisions=100)[source]

Optimize the epsilon parameter for Cutoff Table.

This function uses iterative grid search (bracketing) to find the optimal epsilon value that minimizes the combined diversity score (Sec. 6 and Sec. B). It performs a coarse-to-fine search, progressively narrowing the search range around the best found parameters.

Parameters:
  • X (ndarray) – Database vectors of shape (N, D). This X can slightly differ from the X that the search actually runs on

  • Xq (ndarray) – Query vectors of shape (Nq, D). Since queries are usually unavailable during this hyperparameter search, one can use the training data or a subset of X.

  • index – Faiss index for similarity search

  • candidate_k (int) – Number of candidates to retrieve from Faiss

  • final_k (int) – Final number of diverse results to return

  • lam (float) – Balance parameter between similarity (1-lam) and diversity (lam)

  • epsilon_max (float | None) – Maximum epsilon to search. If None, computed automatically

  • verbose (bool) – Whether to print progress information

  • num_iter (int) – Number of optimization iterations (coarse-to-fine refinement)

  • batch_size (int) – Batch size for neighbor list construction

  • coarse_divisions (int) – Number of grid points in non-final iterations

  • fine_divisions (int) – Number of grid points in the final iteration

Returns:

Best parameters found with keys:
  • epsilon: Optimal epsilon value

  • div_score: Best diversity score achieved

  • L: Number of cutoff table entries for optimal epsilon

Return type:

dict

lotf.backend()[source]

Get information about available backends.

Shows which backend implementations are being used:

  • div_score: Uses Faiss for pairwise distance computation if available, otherwise falls back to NumPy

  • filter: Uses boost::unordered_flat_map/set if available for faster filtering, otherwise uses std::unordered_map/set

Returns:

Backend information formatted as multi-line string

Return type:

str

Example:

print(lotf.backend())
> div_score: Faiss
> filter: boost::unordered_flat_map/set
lotf.clean_invalid_values(vec)[source]

Remove invalid and extreme values from a numeric array.

Faiss search results sometimes contain problematic values that can cause numerical issues in downstream computations. This function filters out:

  • NaN (Not a Number) values

  • Infinite values (+inf, -inf)

  • Extremely large outliers (absolute value > 1e10)

Parameters:

vec (ndarray) – 1D numpy array of numeric values to clean

Return type:

ndarray

Returns:

Cleaned array containing only valid, finite values within reasonable range

Perform range search in batches to handle large datasets efficiently.

This function processes large datasets by splitting them into smaller batches, performing range search on each batch, and then properly combining the results. This approach helps manage memory usage while maintaining identical results to a single large range search operation.

Parameters:
  • index – Faiss index object with range_search capability

  • X (ndarray) – 2D numpy array of query vectors, shape (N, D)

  • thresh (float) – maximum squared L2 distance threshold for neighbors

  • bs (int) – batch size (number of queries to process per batch)

Returns:

(lims, dists, ids) matching Faiss range_search format
  • lims: 1D array of cumulative neighbor counts, length N+1, starts with 0

  • dists: 1D array of all distances (flattened across all queries)

  • ids: 1D array of all neighbor IDs (flattened across all queries)

Return type:

tuple

Example:

# These produce identical results:
lims1, dists1, ids1 = index.range_search(x=X, thresh=0.5)
lims2, dists2, ids2 = batch_range_search(index, X, thresh=0.5, bs=100)
assert np.array_equal(lims1, lims2)
assert np.array_equal(dists1, dists2)
assert np.array_equal(ids1, ids2)

Note

For very large datasets, this function prevents memory issues that could occur with direct range_search calls on the full dataset.

lotf.flatten_with_offsets(list_of_list)[source]

Flatten a list of lists (or np arrays) into a single vector with offset information.

Parameters:

list_of_list (list) – A list of lists (or np arrays), e.g., [[1, 2, 3], [4, 5, 6, 7], [8, 9]]

Returns:

  • flattened: A single vector, e.g., np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])

  • offsets: Cumulative sum of the lengths of the lists/arrays, e.g., np.array([0, 3, 7, 9])

Return type:

tuple

Example:

flattened, offsets = flatten_with_offsets([[1, 2, 3], [4, 5, 6, 7], [8, 9]])
print(flattened)  # Output: [1 2 3 4 5 6 7 8 9]
print(offsets)    # Output: [0 3 7 9]
lotf.build_neighbor_lists(X, index, epsilon, with_dist, batch_size, verbose)[source]

Build raw neighbor lists (\(\{\mathcal{L}_n\}_{n=1}^N\) in the paper). This function is used inside CutoffTable. End users don’t need to call this usually.

Parameters:
  • X (ndarray) – 2D numpy array of dataset vectors

  • index – Faiss index for range search. Should be created with X.

  • epsilon (float) – distance threshold

  • with_dist (bool) – whether to return distances

  • batch_size (int) – batch size for large datasets

  • verbose (bool) – whether to print progress information

Returns:

  • neighbor_lists: list of np.array, \(\{\mathcal{L}_n\}_{n=1}^N\) in the paper

  • dist_lists: list of np.array or None, distances corresponding to neighbor_lists if with_dist=True

Return type:

tuple