The reader of this article should be familiar with the basics of Python programming, machine learning, and building knowledge-based applications.
The following architectures were experimented with on a 100k data set, on a low-end laptop with an i5 processor and 8 GB of RAM:
- Clustering of data points using K-Means with GloVe vectors and FastText embeddings
- FAISS (Facebook AI Similarity Search)
- DiskANN: disk-based approximate nearest neighbor search
a. Clustering of data points using K-Means with GloVe vector embeddings
To use the K-Means clustering algorithm, we used the elbow method to determine an appropriate value for the number of clusters. The elbow method gave 200 clusters on a reduced 30k data set, as sketched below.
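As an illustration, the elbow computation can be sketched as follows, assuming `embeddings` is the matrix of document vectors (the variable names are illustrative, not taken from the original code):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# embeddings: (n_documents, embedding_dim) matrix of document vectors (assumed to exist)
inertias = []
candidate_ks = range(50, 401, 50)
for k in candidate_ks:
    km = KMeans(n_clusters=k, init="k-means++", random_state=42)
    km.fit(embeddings)
    inertias.append(km.inertia_)   # within-cluster sum of squares

# Plot k vs. inertia and look for the "elbow" where the curve flattens out
plt.plot(list(candidate_ks), inertias, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("inertia")
plt.show()
```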
A simple keyword search in this case returned search results in 0.07 seconds.
However, when semantic search was attempted for words that are not in glove_vectors, zero vectors were returned and the cosine similarity computation failed.
FastText was therefore used to obtain embeddings for words not found in the GloVe vocabulary; a minimal sketch of this fallback is shown below.
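A minimal sketch of this fallback, assuming a loaded GloVe lookup `glove_vectors` (word to vector) and a gensim FastText model `fasttext_model`; both names are illustrative:

```python
import numpy as np

def embed_sentence(tokens, glove_vectors, fasttext_model, dim=300):
    """Average word vectors; fall back to FastText for out-of-vocabulary words.

    Assumes the GloVe and FastText vectors have the same dimensionality `dim`.
    """
    vectors = []
    for token in tokens:
        if token in glove_vectors:
            vectors.append(glove_vectors[token])
        else:
            # FastText can build a vector for unseen words from character n-grams
            vectors.append(fasttext_model.wv[token])
    if not vectors:
        return np.zeros(dim)
    return np.mean(vectors, axis=0)
```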
With this approach, a silhouette score of 0.05 was measured on the 30k data set. Because the silhouette score is close to zero, the samples lie very close to the decision boundaries of neighboring clusters, so searching within a cluster may not work well for these data points. This motivated moving on to the next architecture, FAISS.
b. FAISS (Facebook AI Similarity Search)
FAISS is a nearest neighbor search library built by Facebook Research for billion-scale data sets, and has been reported to be 8.5x faster than the previous state-of-the-art methods for nearest neighbor search.
For similarity search over billions of vectors, the following operations are required:
1) Given a query vector, find the list of vectors that are its nearest neighbors by Euclidean distance.
2) Given a query vector, find the list of vectors that have the highest inner-product score with it.
Semantic similarity search basically involves two operations: indexing and search.
a) Indexing: The FAISS library is built around an Index object. The list of vectors is first preprocessed so that it can be searched faster. The dimension of the vectors is determined by the sentence-transformer model chosen to encode them. In our case, we chose a pre-trained BERT network from the sentence-transformers library, modified with Siamese and triplet network structures to derive semantically meaningful sentence embeddings that can be compared with cosine similarity. Detailed code for creating the index is available on GitHub.
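A minimal sketch of this indexing step might look as follows; the model name and variable names are illustrative, and the article's exact code is in the GitHub repository:

```python
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

# SBERT-style model producing 768-dimensional sentence embeddings (assumed model name)
model = SentenceTransformer("bert-base-nli-mean-tokens")

documents = [
    "How do I merge two dictionaries in Python?",
    "What is the difference between a list and a tuple?",
]

# Encode the corpus; FAISS expects float32 vectors
encoded_documents = np.asarray(model.encode(documents), dtype="float32")

d = encoded_documents.shape[1]   # embedding dimension (768 here)
index = faiss.IndexFlatL2(d)     # flat (exact) index over L2 distance
index.add(encoded_documents)     # store the vectors in the index
print(index.ntotal)              # number of indexed vectors
```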
We can serialize and deserialize the index for transfer over the network.
Before using a serialized index stored in a pickle file, read the bytes back from the file and then deserialize them into an index.
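A sketch of this serialization round trip, assuming the `index` object from the sketch above:

```python
import pickle
import faiss

# Serialize the FAISS index into a byte array and pickle it to disk
serialized = faiss.serialize_index(index)
with open("faiss_index.pkl", "wb") as f:
    pickle.dump(serialized, f)

# Later (or on another machine): read the bytes back and rebuild the index
with open("faiss_index.pkl", "rb") as f:
    serialized = pickle.load(f)
index = faiss.deserialize_index(serialized)
```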
There are two ways to create an index: a CPU index and a GPU index. The difference is in the resources used to execute the index-building code; the former uses CPU resources, while the latter uses GPU resources.
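For example, a CPU index can be copied to a GPU roughly like this (a sketch; it requires the faiss-gpu build):

```python
import faiss

d = 768
cpu_index = faiss.IndexFlatL2(d)   # index built and executed with CPU resources

res = faiss.StandardGpuResources()                      # allocate GPU resources
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)   # same index, executed on GPU 0
```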
b) Search: A k-nearest neighbor search can be performed on the constructed index: for a given query vector, find its k nearest neighbors in the corpus.
1. Plain brute-force search with GPU
Calling the knn function on the query_vector and the encoded_document vectors returned results in 0.1 seconds on the GPU. In this case the index was built with IndexFlatL2, a simple L2 (Euclidean) distance computation, and it did not work well for this data.
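The search call itself might look like the following sketch; `knn` here is an illustrative helper rather than a FAISS function, and `model`/`index` are assumed from the earlier indexing sketch:

```python
import time
import numpy as np

def knn(query, model, index, k=5):
    """Illustrative helper: encode the query and search the FAISS index."""
    query_vector = np.asarray(model.encode([query]), dtype="float32")
    distances, ids = index.search(query_vector, k)   # k-nearest-neighbor search
    return distances[0], ids[0]

start = time.time()
distances, ids = knn("jQuery extensions for the event", model, index, k=5)
print(f"search took {time.time() - start:.2f} s")
print(ids, distances)
```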
2. Building an index with a flat inner product
The index was constructed using IndexFlatIP, which computes the inner product (equivalent to cosine similarity when the vectors are L2-normalized), and this was found to give good results on this data.
Both CPU and GPU resources were tested (via Google Colab). The search portion of the code remains the same for both.
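A sketch of the inner-product variant, reusing the illustrative `model` and `encoded_documents` from the earlier sketch; the vectors are L2-normalized first so that the inner product corresponds to cosine similarity:

```python
import faiss
import numpy as np

d = encoded_documents.shape[1]

# Normalize in place so that inner product == cosine similarity
faiss.normalize_L2(encoded_documents)

index_ip = faiss.IndexFlatIP(d)
index_ip.add(encoded_documents)

query_vector = np.asarray(model.encode(["jQuery extensions for the event"]), dtype="float32")
faiss.normalize_L2(query_vector)
scores, ids = index_ip.search(query_vector, 5)   # highest cosine similarity first
```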
3. Locality Sensitive Hashing (LSH)
The index is constructed by passing the dimension and the number of bits as arguments; in this study d = 768 and nbits = 1024 were used. Selecting nbits < 1024 affected the quality of the returned results, so at least 128 bytes were needed to represent each document vector. Instead of computing projections, the calculations are optimized using random rotations of the vectors. This is also an exhaustive search and returned the k nearest neighbors in 0.01 seconds.
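A sketch of the LSH index with the dimensions used in this study (reusing the illustrative `encoded_documents` and `query_vector` from above):

```python
import faiss

d, nbits = 768, 1024                  # 1024 bits = 128 bytes per vector
index_lsh = faiss.IndexLSH(d, nbits)
index_lsh.train(encoded_documents)    # trains the (optional) thresholds/rotation
index_lsh.add(encoded_documents)
distances, ids = index_lsh.search(query_vector, 5)
```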
If we want to reduce the memory footprint of the index and its storage, we need to apply a lossy compression technique called product quantization. This trades accuracy for memory.
After trying a ProductQuantizer with an inverted file index (IVF), the index built for a larger set of 10 million data points was close to 3 GB; the data could not be loaded into RAM and the Colab session crashed. In addition, with the IVF indexing format we can only obtain approximate distances.
An index that uses the product quantizer has fairly low accuracy, so re-ranking with exact distance calculations is required, which is an expensive operation with limited RAM.
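A sketch of an IVF + product quantization index of the kind described above; the nlist, m and nbits values are illustrative, not the study's exact settings:

```python
import faiss

d = 768
nlist = 1024   # number of inverted lists (Voronoi cells), illustrative value
m = 64         # number of sub-quantizers; d must be divisible by m
nbits = 8      # bits per sub-quantizer code

quantizer = faiss.IndexFlatL2(d)                       # coarse quantizer for the IVF
index_ivfpq = faiss.IndexIVFPQ(quantizer, d, nlist, m, nbits)

index_ivfpq.train(encoded_documents)                   # PQ/IVF must be trained before adding
index_ivfpq.add(encoded_documents)

index_ivfpq.nprobe = 10                                # how many cells to visit at search time
distances, ids = index_ivfpq.search(query_vector, 5)   # distances are approximate
```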
Comparison of latency across different architectures
Comparison of different search results
For the query string "jQuery extensions for the event": Model 1 was built using clustering initialized with k-means++; the query_vector was predicted to belong to a cluster, and finding the five nearest neighbors within that cluster yielded the following results.
Our clustering model clearly failed in this case, which was already reflected in the previously calculated silhouette score and the overlapping decision boundaries of the points. To confirm that the clustering model failed, we plotted a scatterplot of two clusters, 52 and 54; the points overlap heavily.
Model 3: DiskANN (disk-based approximate nearest neighbors). Sample nodes were fed in, and the pruned node output was obtained with a limited list of nodes during the code unit test. See the example output at the end of the next section.
From the results above, we found that IndexFlatIP performed well in our case study compared to brute force or LSH. Although LSH returned results in 0.01 seconds, the IndexFlatIP results were more relevant to the query. Therefore, we deployed the CPU version of FAISS using IndexFlatIP on Streamlit; a demo of the application is available in the Deployments section.
c. DiskANN: Fast, accurate billion-point nearest neighbor search on a single node
This is an approximate nearest neighbor search. The main advantage of this approach is that it can index, store, and search a billion-point data set using 64 GB of RAM and an SSD.
We changed the data used in the previous approaches: instead of a one-question-to-one-accepted-answer mapping, we used a one-question-to-many-answers mapping to construct a directed acyclic graph, as described in the research paper.
This work was presented by the Microsoft Research team at the NeurIPS conference. The algorithm offers faster retrieval, faster indexing, and greater scalability.
The Vamana algorithm performs a greedy search followed by robust pruning of the nodes; a simplified sketch of the greedy search step is shown below.
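The following is a much-simplified sketch of the greedy search step, assuming the graph is stored as an adjacency dictionary and `points` maps node ids to vectors; it illustrates the idea and is not the author's GitHub implementation:

```python
import heapq
import numpy as np

def greedy_search(graph, points, start, query, k=5, search_list_size=25):
    """Greedy best-first search over a Vamana-style graph.

    graph:  dict mapping node id -> list of neighbor ids
    points: dict mapping node id -> numpy vector
    start:  id of the entry node (e.g. the medoid of the data set)
    """
    def dist(node):
        return float(np.linalg.norm(points[node] - query))

    candidates = {start: dist(start)}   # current search list: node -> distance
    visited = set()

    while True:
        # pick the closest candidate that has not been expanded yet
        unvisited = [n for n in candidates if n not in visited]
        if not unvisited:
            break
        current = min(unvisited, key=candidates.get)
        visited.add(current)

        # add the neighbors of the expanded node to the candidate list
        for neighbor in graph[current]:
            if neighbor not in candidates:
                candidates[neighbor] = dist(neighbor)

        # keep only the search_list_size closest candidates
        candidates = dict(heapq.nsmallest(search_list_size,
                                          candidates.items(),
                                          key=lambda kv: kv[1]))

    nearest = heapq.nsmallest(k, candidates.items(), key=lambda kv: kv[1])
    # the visited set is what robust pruning operates on during index construction
    return [node for node, _ in nearest], visited
```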
The code for this algorithm is implemented in Python, and the complete code is available at the GitHub link.
It has been tested with sample nodes. A simple unit test was performed for the greedy_search and robust_prune functions. Parallel processing tools were used to take advantage of multiprocessing in Python.
Streamlit is a lightweight framework used to create data applications. The Streamlit library is written entirely in Python. It is based on three simple principles: 1. Python scripting, 2. widget/component interaction, 3. instant deployment.
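A minimal sketch of how such a search app can be wired together in Streamlit; the file names and model are assumptions, not the deployed application's code:

```python
# app.py : run with `streamlit run app.py` (st.cache_resource requires a recent Streamlit)
import pickle
import faiss
import numpy as np
import streamlit as st
from sentence_transformers import SentenceTransformer

@st.cache_resource
def load_resources():
    model = SentenceTransformer("bert-base-nli-mean-tokens")   # assumed model name
    with open("faiss_index.pkl", "rb") as f:                   # assumed file names
        index = faiss.deserialize_index(pickle.load(f))
    with open("documents.pkl", "rb") as f:                     # original texts
        documents = pickle.load(f)
    return model, index, documents

model, index, documents = load_resources()

st.title("Semantic search demo")
query = st.text_input("Enter your query")

if query:
    query_vector = np.asarray(model.encode([query]), dtype="float32")
    faiss.normalize_L2(query_vector)   # needed for an IndexFlatIP / cosine index
    scores, ids = index.search(query_vector, 5)
    for score, idx in zip(scores[0], ids[0]):
        st.write(f"{score:.3f} | {documents[idx]}")
```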
The CPU version of the FAISS semantic search model was used to build a data application in Streamlit. Below is the data set used for the EDA and the word cloud.
Exploratory data analysis of the data
A link to a demo recording of the simple Streamlit application can be found below.
11. Conclusions and future work
It can be concluded that FAISS works well for data sets up to roughly 100k records, but beyond that RAM again becomes a constraint. DiskANN clearly surpasses these architectures for very large data sets in terms of index structure, SSD storage, and retrieval.
I plan to try the following ideas on this topic:
- Hyperparameter tuning of the DiskANN algorithm code.
- Using other fields in the data set for search and computing comparison metrics.
The full source code is available at the GitHub link.
If you have any questions or suggestions about any of the above, please reach out to me on LinkedIn.