agatha.construct.knn_util module

agatha.construct.knn_util.add_points_to_index(records, init_index_path, batch_size, output_path)

Loads an initial index, adds the given partition of records to it, and writes the result to output_path.

Return type

Path
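A minimal usage sketch, assuming records is an in-memory partition of id/embedding dictionaries like those handled elsewhere in this module; the record schema, file names, and batch size below are illustrative assumptions, not the pipeline's actual values.

```python
from pathlib import Path
from agatha.construct import knn_util

# Hypothetical partition; the real records come from the embedding step, and
# the exact schema (numeric "id" hash plus "embedding" vector) is an assumption.
records = [
    {"id": 1266712699, "embedding": [0.1] * 768},
    {"id": 4077136506, "embedding": [0.2] * 768},
]

partial_index_path = knn_util.add_points_to_index(
    records=records,
    init_index_path=Path("scratch/init.index"),   # index from train_initial_index
    batch_size=1024,
    output_path=Path("scratch/partial_000.index"),
)
```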

agatha.construct.knn_util.get_faiss_index_initializer(faiss_index_path, index_name='final')
Return type

Tuple[str, Callable[[], Any]]
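The returned pair appears to be a (key, initializer) intended for the pipeline's process-global preloading scheme. The sketch below simply unpacks the pair and calls the initializer directly, which is an assumption about how it is meant to be consumed; the index path is illustrative.

```python
from pathlib import Path
from agatha.construct import knn_util

# Unpack the (key, initializer) pair. Calling the initializer by hand is an
# assumption; in the real pipeline it is presumably registered so that each
# worker loads the index once.
index_key, load_index = knn_util.get_faiss_index_initializer(
    faiss_index_path=Path("scratch/final.index"),
    index_name="final",
)
faiss_index = load_index()
```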

agatha.construct.knn_util.merge_index(init_index_path, partial_idx_paths, final_index_path)
Return type

Path
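A sketch of folding the per-partition indices written by add_points_to_index back into a single index; the directory layout and file names are illustrative assumptions.

```python
from pathlib import Path
from agatha.construct import knn_util

# Collect the partial indices produced by add_points_to_index (names are
# illustrative) and merge them into one final index.
partial_paths = sorted(Path("scratch").glob("partial_*.index"))

final_path = knn_util.merge_index(
    init_index_path=Path("scratch/init.index"),
    partial_idx_paths=partial_paths,
    final_index_path=Path("scratch/final.index"),
)
```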

agatha.construct.knn_util.nearest_neighbors_network_from_index(hash_and_embedding, hash2name_db, batch_size, num_neighbors, faiss_index_name='final', weight=1.0)

Queries the faiss index with each embedding and maps the resulting neighbor hashes back to names via the inverted index (hash2name_db), yielding nearest-neighbor network edges.

Return type

Iterable[str]
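A hedged sketch of a call to this function. The shape of hash_and_embedding (a partition of id/embedding records), the hash2name_db path, and the parameter values are all illustrative assumptions.

```python
from pathlib import Path
from agatha.construct import knn_util

# Placeholder data standing in for the real id/embedding partition; the
# schema is an assumption, as is the path to the hash -> name database.
hash_and_embedding = [
    {"id": 1266712699, "embedding": [0.1] * 768},
    {"id": 4077136506, "embedding": [0.2] * 768},
]

edges = knn_util.nearest_neighbors_network_from_index(
    hash_and_embedding=hash_and_embedding,
    hash2name_db=Path("scratch/hash2name.sqlite3"),
    batch_size=1024,
    num_neighbors=25,
    faiss_index_name="final",
    weight=1.0,
)
for edge in edges:
    print(edge)  # each yielded string describes one nearest-neighbor edge
```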

agatha.construct.knn_util.to_hash_and_embedding(records, id_field='id', embedding_field='embedding')
Return type

Tuple[ndarray, ndarray]
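A small sketch showing how a partition of records is split into parallel arrays of id hashes and embeddings; the record values are illustrative.

```python
from agatha.construct import knn_util

records = [
    {"id": 1266712699, "embedding": [0.1, 0.2, 0.3, 0.4]},
    {"id": 4077136506, "embedding": [0.5, 0.6, 0.7, 0.8]},
]

hashes, embeddings = knn_util.to_hash_and_embedding(records)
# hashes is a 1-D array of the "id" values; embeddings is a 2-D array with one
# row per record (here 2 x 4).
```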

agatha.construct.knn_util.train_distributed_knn(hash_and_embedding, batch_size, num_centroids, num_probes, num_quantizers, bits_per_quantizer, training_sample_prob, shared_scratch_dir, final_index_path, id_field='id', embedding_field='embedding')

Computing all of the embeddings up front and then performing KNN is a problem for memory. Instead, we compute embeddings in batches and use them in Faiss to reduce their dimensionality and process them appropriately.

I’m so sorry this one function has to do so much…

Parameters

hash_and_embedding: bag of hash values and embedding values
text_field: input text field that we embed
id_field: output id field we use to store number hashes
batch_size: number of sentences per batch
num_centroids: number of voronoi cells in the approximate nearest-neighbor index
num_probes: number of cells to consider when querying
num_quantizers: number of sub-vectors to discretize
bits_per_quantizer: bits per sub-vector
shared_scratch_dir: location to store intermediate results
training_sample_prob: chance that a point is used for training

Returns

The path from which you can load the resulting FAISS index

Return type

Path
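A parameter-by-parameter sketch of a call. The docstring above describes hash_and_embedding as a bag, so a toy dask bag stands in for the real embedding output; every numeric value and path below is illustrative rather than a recommendation.

```python
from pathlib import Path
import dask.bag as db
from agatha.construct import knn_util

# Toy bag standing in for the real id/embedding records; schema and all
# parameter values are illustrative assumptions.
hash_and_embedding = db.from_sequence(
    [{"id": i, "embedding": [0.01 * i] * 768} for i in range(1000)],
    npartitions=4,
)

final_index = knn_util.train_distributed_knn(
    hash_and_embedding=hash_and_embedding,
    batch_size=1024,
    num_centroids=2048,
    num_probes=16,
    num_quantizers=32,        # 768 / 32 = 24 values per sub-vector
    bits_per_quantizer=8,
    training_sample_prob=0.01,
    shared_scratch_dir=Path("scratch/knn"),
    final_index_path=Path("scratch/final.index"),
)
```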

agatha.construct.knn_util.train_initial_index(training_data, num_centroids, num_probes, num_quantizers, bits_per_quantizer, output_path)

Computes index using method from: https://hal.inria.fr/inria-00514462v2/document

Vector dimensionality must be a multiple of num_quantizers. Input vectors are “chunked” into num_quantizers sub-components, and each chunk is reduced to a bits_per_quantizer-bit code. L2 distances are then computed between these quantized representations.

For instance, a scibert embedding is 768-dimensional. If num_quantizers=32 and bits_per_quantizer=8, then each vector is split into subcomponents of only 24 values, and these are further reduced to an 8-bit value. The result is that we’re only using 1/3 of a bit per value in the input.

When constructing the index, we use quantization along with the L2 metric to perform k-means, building a voronoi diagram over the training data. This partitions the search space in order to make inference faster. num_centroids determines the number of voronoi cells a point can fall into, while num_probes determines the number of nearby cells considered at query time. More centroids therefore means faster but less accurate inference; more probes means the opposite: slower but more accurate queries.

According to: https://github.com/facebookresearch/faiss/wiki/Faiss-indexes, we should select #centroids on the order of sqrt(n).

Choosing an index is hard: https://github.com/facebookresearch/faiss/wiki/Index-IO,-index-factory,-cloning-and-hyper-parameter-tuning

Return type

None
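To make the IVF+PQ construction described above concrete, here is a generic faiss sketch (not Agatha’s own code) that builds the same kind of index; the parameter values and the random training data are illustrative.

```python
import numpy as np
import faiss

dim = 768                # e.g. a SciBERT embedding
num_centroids = 2048     # voronoi cells; on the order of sqrt(n)
num_probes = 16          # cells searched per query
num_quantizers = 32      # sub-vectors per embedding (768 / 32 = 24 values each)
bits_per_quantizer = 8   # each sub-vector becomes one 8-bit code

# IVF (coarse voronoi partition) over a product-quantized representation.
coarse_quantizer = faiss.IndexFlatL2(dim)
index = faiss.IndexIVFPQ(
    coarse_quantizer, dim, num_centroids, num_quantizers, bits_per_quantizer
)
index.nprobe = num_probes

# Train on random stand-in data for the sampled embeddings, then add and query.
training_sample = np.random.rand(100_000, dim).astype("float32")
index.train(training_sample)
index.add(training_sample)
distances, neighbor_ids = index.search(training_sample[:5], 10)
```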