API Reference¶
This page contains the complete API documentation for LotusFilter.
- class lotf.CutoffTable(X, index, epsilon, batch_size=1000, verbose=True)[source]¶
- Bases: - object- Cutoff table for diversity-aware filtering of search results. The wrapper class for \(\{\mathcal{L}_n\}_{n=1}^N\) in the paper. - Precomputes neighbor lists within epsilon distance for efficient filtering. Used to remove similar items from search results to increase diversity. - Note - This class takes - Xand a faiss index as arguments, but they are only used for constructing the table and are not stored. Therefore, this class consumes only the memory required for the table after construction.- __init__(X, index, epsilon, batch_size=1000, verbose=True)[source]¶
- Initialize CutoffTable. - Parameters:
 
 - filter(dists, ids, final_k)[source]¶
- Filter search results to increase diversity (L3-L7 in Algorithm 2 in the paper). - Takes candidate search results and removes similar items based on precomputed neighbor lists, selecting the most diverse - final_kresults.- Parameters:
- Returns:
- diverse_dists: 2D array of diverse distances, shape (- Nq,- final_k)
- diverse_ids: 2D array of diverse IDs, shape (- Nq,- final_k)
 
- Return type:
 
 - classmethod from_neighbor_lists(neighbor_lists, epsilon=None, N=None, D=None)[source]¶
- Create CutoffTable directly from precomputed neighbor lists. - Useful for parameter optimization where neighbor lists are reused. Since this function is used inside optimize_epsilon, end users generally do not need to use this. - Parameters:
- Returns:
- New instance created from the neighbor lists 
- Return type:
 
 
- lotf.div_score(dists, ids, X, lam)[source]¶
- Compute diversity score for a search result (function \(f\) from Eq 2 in the paper). - Parameters:
- dists ( - ndarray) – 1D array of distances from query to its top-k neighbors (squared L2 distances from faiss search). Shape: (- topk,)
- ids ( - ndarray) – 1D array of neighbor IDs (corresponding to the distances from faiss search). Shape: (- topk,)
- X ( - ndarray) – 2D array containing the original dataset vectors. Shape: (- N,- D)
- lam ( - float) – weight parameter for the diversity term (0 = pure NN, 1 = pure diversity)
 
- Returns:
- total_score: Combined diversity-aware score - = (1-lam) * similarity_term + lam * diversity_term
- similarity_term: Average similarity score 
- diversity_term: Average diversity score (negative value) 
 
- Return type:
 - Example: - import numpy as np X = np.random.rand(10, 3).astype(np.float32) q = np.random.rand(1, 3).astype(np.float32) topk = 3 dists = np.sum((q - X)**2, axis=1) ids = np.argsort(dists)[:topk] dists = dists[ids] import lotf score, sim_term, div_term = lotf.div_score(dists, ids, X, lam=0.5) 
- lotf.optimize_epsilon(X, Xq, index, candidate_k, final_k, lam, epsilon_max=None, verbose=True, num_iter=5, batch_size=1000, coarse_divisions=10, fine_divisions=100)[source]¶
- Optimize the epsilon parameter for Cutoff Table. - This function uses iterative grid search (bracketing) to find the optimal epsilon value that minimizes the combined diversity score (Sec. 6 and Sec. B). It performs a coarse-to-fine search, progressively narrowing the search range around the best found parameters. - Parameters:
- X ( - ndarray) – Database vectors of shape (- N,- D). This- Xcan slightly differ from the- Xthat the search actually runs on
- Xq ( - ndarray) – Query vectors of shape (- Nq,- D). Since queries are usually unavailable during this hyperparameter search, one can use the training data or a subset of- X.
- index – Faiss index for similarity search 
- candidate_k ( - int) – Number of candidates to retrieve from Faiss
- final_k ( - int) – Final number of diverse results to return
- lam ( - float) – Balance parameter between similarity (- 1-lam) and diversity (- lam)
- epsilon_max ( - float|- None) – Maximum epsilon to search. If None, computed automatically
- verbose ( - bool) – Whether to print progress information
- num_iter ( - int) – Number of optimization iterations (coarse-to-fine refinement)
- batch_size ( - int) – Batch size for neighbor list construction
- coarse_divisions ( - int) – Number of grid points in non-final iterations
- fine_divisions ( - int) – Number of grid points in the final iteration
 
- Returns:
- Best parameters found with keys:
- epsilon: Optimal epsilon value
- div_score: Best diversity score achieved
- L: Number of cutoff table entries for optimal epsilon
 
 
- Return type:
 
- lotf.backend()[source]¶
- Get information about available backends. - Shows which backend implementations are being used: - div_score: Uses Faiss for pairwise distance computation if available, otherwise falls back to NumPy 
- filter: Uses boost::unordered_flat_map/set if available for faster filtering, otherwise uses std::unordered_map/set 
 - Returns:
- Backend information formatted as multi-line string 
- Return type:
 - Example: - print(lotf.backend()) > div_score: Faiss > filter: boost::unordered_flat_map/set 
- lotf.clean_invalid_values(vec)[source]¶
- Remove invalid and extreme values from a numeric array. - Faiss search results sometimes contain problematic values that can cause numerical issues in downstream computations. This function filters out: - NaN (Not a Number) values 
- Infinite values (+inf, -inf) 
- Extremely large outliers (absolute value > 1e10) 
 
- lotf.batch_range_search(index, X, thresh, bs)[source]¶
- Perform range search in batches to handle large datasets efficiently. - This function processes large datasets by splitting them into smaller batches, performing range search on each batch, and then properly combining the results. This approach helps manage memory usage while maintaining identical results to a single large range search operation. - Parameters:
- Returns:
- (lims, dists, ids) matching Faiss range_search format
- lims: 1D array of cumulative neighbor counts, length N+1, starts with 0 
- dists: 1D array of all distances (flattened across all queries) 
- ids: 1D array of all neighbor IDs (flattened across all queries) 
 
 
- Return type:
 - Example: - # These produce identical results: lims1, dists1, ids1 = index.range_search(x=X, thresh=0.5) lims2, dists2, ids2 = batch_range_search(index, X, thresh=0.5, bs=100) assert np.array_equal(lims1, lims2) assert np.array_equal(dists1, dists2) assert np.array_equal(ids1, ids2) - Note - For very large datasets, this function prevents memory issues that could occur with direct range_search calls on the full dataset. 
- lotf.flatten_with_offsets(list_of_list)[source]¶
- Flatten a list of lists (or np arrays) into a single vector with offset information. - Parameters:
- list_of_list ( - list) – A list of lists (or np arrays), e.g.,- [[1, 2, 3], [4, 5, 6, 7], [8, 9]]
- Returns:
- flattened: A single vector, e.g., - np.array([1, 2, 3, 4, 5, 6, 7, 8, 9])
- offsets: Cumulative sum of the lengths of the lists/arrays, e.g., - np.array([0, 3, 7, 9])
 
- Return type:
 - Example: - flattened, offsets = flatten_with_offsets([[1, 2, 3], [4, 5, 6, 7], [8, 9]]) print(flattened) # Output: [1 2 3 4 5 6 7 8 9] print(offsets) # Output: [0 3 7 9] 
- lotf.build_neighbor_lists(X, index, epsilon, with_dist, batch_size, verbose)[source]¶
- Build raw neighbor lists (\(\{\mathcal{L}_n\}_{n=1}^N\) in the paper). This function is used inside - CutoffTable. End users don’t need to call this usually.- Parameters:
- Returns:
- neighbor_lists: list of np.array, \(\{\mathcal{L}_n\}_{n=1}^N\) in the paper 
- dist_lists: list of np.array or None, distances corresponding to - neighbor_listsif- with_dist=True
 
- Return type: