API Reference¶
This page contains the complete API documentation for LotusFilter.
- class lotf.CutoffTable(X, index, epsilon, batch_size=1000, verbose=True)[source]¶
Bases:
object
Cutoff table for diversity-aware filtering of search results. The wrapper class for \(\{\mathcal{L}_n\}_{n=1}^N\) in the paper.
Precomputes neighbor lists within epsilon distance for efficient filtering. Used to remove similar items from search results to increase diversity.
Note
This class takes
X
and a faiss index as arguments, but they are only used for constructing the table and are not stored. Therefore, this class consumes only the memory required for the table after construction.- __init__(X, index, epsilon, batch_size=1000, verbose=True)[source]¶
Initialize CutoffTable.
- Parameters:
- filter(dists, ids, final_k)[source]¶
Filter search results to increase diversity (L3-L7 in Algorithm 2 in the paper).
Takes candidate search results and removes similar items based on precomputed neighbor lists, selecting the most diverse
final_k
results.- Parameters:
- Returns:
diverse_dists
: 2D array of diverse distances, shape (Nq
,final_k
)diverse_ids
: 2D array of diverse IDs, shape (Nq
,final_k
)
- Return type:
- classmethod from_neighbor_lists(neighbor_lists, epsilon=None, N=None, D=None)[source]¶
Create CutoffTable directly from precomputed neighbor lists.
Useful for parameter optimization where neighbor lists are reused. Since this function is used inside optimize_epsilon, end users generally do not need to use this.
- Parameters:
- Returns:
New instance created from the neighbor lists
- Return type:
- lotf.div_score(dists, ids, X, lam)[source]¶
Compute diversity score for a search result (function \(f\) from Eq 2 in the paper).
- Parameters:
dists (
ndarray
) – 1D array of distances from query to its top-k neighbors (squared L2 distances from faiss search). Shape: (topk
,)ids (
ndarray
) – 1D array of neighbor IDs (corresponding to the distances from faiss search). Shape: (topk
,)X (
ndarray
) – 2D array containing the original dataset vectors. Shape: (N
,D
)lam (
float
) – weight parameter for the diversity term (0 = pure NN, 1 = pure diversity)
- Returns:
total_score: Combined diversity-aware score
= (1-lam) * similarity_term + lam * diversity_term
similarity_term: Average similarity score
diversity_term: Average diversity score (negative value)
- Return type:
Example:
import numpy as np X = np.random.rand(10, 3).astype(np.float32) q = np.random.rand(1, 3).astype(np.float32) topk = 3 dists = np.sum((q - X)**2, axis=1) ids = np.argsort(dists)[:topk] dists = dists[ids] import lotf score, sim_term, div_term = lotf.div_score(dists, ids, X, lam=0.5)
- lotf.optimize_epsilon(X, Xq, index, candidate_k, final_k, lam, epsilon_max=None, verbose=True, num_iter=5, batch_size=1000, coarse_divisions=10, fine_divisions=100)[source]¶
Optimize the epsilon parameter for Cutoff Table.
This function uses iterative grid search (bracketing) to find the optimal epsilon value that minimizes the combined diversity score (Sec. 6 and Sec. B). It performs a coarse-to-fine search, progressively narrowing the search range around the best found parameters.
- Parameters:
X (
ndarray
) – Database vectors of shape (N
,D
). ThisX
can slightly differ from theX
that the search actually runs onXq (
ndarray
) – Query vectors of shape (Nq
,D
). Since queries are usually unavailable during this hyperparameter search, one can use the training data or a subset ofX
.index – Faiss index for similarity search
candidate_k (
int
) – Number of candidates to retrieve from Faissfinal_k (
int
) – Final number of diverse results to returnlam (
float
) – Balance parameter between similarity (1-lam
) and diversity (lam
)epsilon_max (
float
|None
) – Maximum epsilon to search. If None, computed automaticallyverbose (
bool
) – Whether to print progress informationnum_iter (
int
) – Number of optimization iterations (coarse-to-fine refinement)batch_size (
int
) – Batch size for neighbor list constructioncoarse_divisions (
int
) – Number of grid points in non-final iterationsfine_divisions (
int
) – Number of grid points in the final iteration
- Returns:
- Best parameters found with keys:
epsilon
: Optimal epsilon valuediv_score
: Best diversity score achievedL
: Number of cutoff table entries for optimal epsilon
- Return type:
- lotf.backend()[source]¶
Get information about available backends.
Shows which backend implementations are being used:
div_score: Uses Faiss for pairwise distance computation if available, otherwise falls back to NumPy
filter: Uses boost::unordered_flat_map/set if available for faster filtering, otherwise uses std::unordered_map/set
- Returns:
Backend information formatted as multi-line string
- Return type:
Example:
print(lotf.backend()) > div_score: Faiss > filter: boost::unordered_flat_map/set
- lotf.clean_invalid_values(vec)[source]¶
Remove invalid and extreme values from a numeric array.
Faiss search results sometimes contain problematic values that can cause numerical issues in downstream computations. This function filters out:
NaN (Not a Number) values
Infinite values (+inf, -inf)
Extremely large outliers (absolute value > 1e10)
- lotf.batch_range_search(index, X, thresh, bs)[source]¶
Perform range search in batches to handle large datasets efficiently.
This function processes large datasets by splitting them into smaller batches, performing range search on each batch, and then properly combining the results. This approach helps manage memory usage while maintaining identical results to a single large range search operation.
- Parameters:
- Returns:
- (lims, dists, ids) matching Faiss range_search format
lims: 1D array of cumulative neighbor counts, length N+1, starts with 0
dists: 1D array of all distances (flattened across all queries)
ids: 1D array of all neighbor IDs (flattened across all queries)
- Return type:
Example:
# These produce identical results: lims1, dists1, ids1 = index.range_search(x=X, thresh=0.5) lims2, dists2, ids2 = batch_range_search(index, X, thresh=0.5, bs=100) assert np.array_equal(lims1, lims2) assert np.array_equal(dists1, dists2) assert np.array_equal(ids1, ids2)
Note
For very large datasets, this function prevents memory issues that could occur with direct range_search calls on the full dataset.
- lotf.flatten_with_offsets(list_of_list)[source]¶
Flatten a list of lists (or np arrays) into a single vector with offset information.
- Parameters:
list_of_list (
list
) – A list of lists (or np arrays), e.g.,[[1, 2, 3], [4, 5, 6, 7], [8, 9]]
- Returns:
flattened: A single vector, e.g.,
np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
offsets: Cumulative sum of the lengths of the lists/arrays, e.g.,
np.array([0, 3, 7, 9])
- Return type:
Example:
flattened, offsets = flatten_with_offsets([[1, 2, 3], [4, 5, 6, 7], [8, 9]]) print(flattened) # Output: [1 2 3 4 5 6 7 8 9] print(offsets) # Output: [0 3 7 9]
- lotf.build_neighbor_lists(X, index, epsilon, with_dist, batch_size, verbose)[source]¶
Build raw neighbor lists (\(\{\mathcal{L}_n\}_{n=1}^N\) in the paper). This function is used inside
CutoffTable
. End users don’t need to call this usually.- Parameters:
- Returns:
neighbor_lists: list of np.array, \(\{\mathcal{L}_n\}_{n=1}^N\) in the paper
dist_lists: list of np.array or None, distances corresponding to
neighbor_lists
ifwith_dist=True
- Return type: