Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Use IndexInput to load the graph files for Native Index #1951

Closed
navneet1v opened this issue Aug 12, 2024 · 5 comments
Closed

[FEATURE] Use IndexInput to load the graph files for Native Index #1951

navneet1v opened this issue Aug 12, 2024 · 5 comments
Assignees
Labels
enhancement Features Introduces a new unit of functionality that satisfies a requirement indexing-improvements This label should be attached to all the github issues which will help improving the indexing time.

Comments

@navneet1v
Copy link
Collaborator

navneet1v commented Aug 12, 2024

Description

Currently K-NN plugin for native engines(Faiss and Nmslib) creates a separate graph file(in codec) to build and store the k-NN index at segment level. This file is tracked by Lucene for a segment but while reading the file k-NN plugin relies on FSDirectory to get the full path of the k-NN index at segment level and then use Native libs api to load the index in memory.

The above behavior causes few problems:

  1. If the Directory implementation changes from Opensearch side from FSDirectory to any other directory we need changes in k-NN plugin to ensure k-NN search is working correctly. With writable warm tier and Searchable snapshot there are new directory implementation present or will come.
  2. Even if the LeafReaderContext concrete implementation changes then also k-NN specific changes needs to be done. ref GH issue: [BUG] KNN Queries Fails when segment replication is enabled and replicas has deleted docs #1807, Fix: Fixed LeafReaders casting errors to SegmentReaders when segment repli… #1808
  3. Another issue which can be solved: [BUG] Knn Doc ingestion throws ClassCastException when used with OS plugin implementing IndexStorePlugin #1228 use of IndexStorePlugin

Solution

The solution I am proposing here is rather than relying on path of the file, k-NN plugin should use IndexInput to read the file. This new reading behavior also needs to be integrated with Faiss/Nmslib lib. In Faiss, I see they provide an interface IOReader which can be used to load the contents of the file. If k-NN plugin implements the interface and then underneath if it uses IndexInput to read the file this will avoid the problems mentioned above.

Some deep-dive I did suggest that IndexInput provides a way to read byte and Faiss just asks for n bytes anytime it wants to read anything.
Ref: https://github.com/facebookresearch/faiss/blob/df0dea6c6d8951056763dc03528b3973c6ba26e2/faiss/impl/index_read.cpp#L531
Ref: https://github.com/facebookresearch/faiss/blob/c0052c15336a57f7068a7d098d5ce5b6234a2d70/faiss/impl/io_macros.h#L17-L28
Ref Lucene: https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/store/DataInput.java#L58

I have not done any deep-dive on Nmslib.

Indexing

I also see that on indexing while writing the native index file we use the FSDirectory, if we do similar changes for writing the native index file, we can also remove the dependency of FSDirectory from write path too. Ref:

indexPath = Paths.get(((FSDirectory) (FilterDirectory.unwrap(state.directory))).getDirectory().toString(), engineFileName)

@navneet1v navneet1v moved this from Backlog to Backlog (Hot) in Vector Search RoadMap Aug 12, 2024
@navneet1v navneet1v added Features Introduces a new unit of functionality that satisfies a requirement and removed untriaged labels Aug 12, 2024
@jmazanec15
Copy link
Member

I really like this idea. Im wondering how we can minimize overhead of copying between native and java layers though.

@navneet1v
Copy link
Collaborator Author

I really like this idea. Im wondering how we can minimize overhead of copying between native and java layers though.

Yes that is an interesting question. I think its one time operation, during the graph loads. But yes we can think more once we have a working solution.

@0ctopus13prime
Copy link
Collaborator

I will work on few plausible solutions regarding this with problem definition where we can start discussions real quick.
Thank you.

@navneet1v
Copy link
Collaborator Author

Initial RFC for this feature has been added here: #2033

@navneet1v navneet1v added the indexing-improvements This label should be attached to all the github issues which will help improving the indexing time. label Sep 14, 2024
@navneet1v
Copy link
Collaborator Author

Closing out this GH issue as the feature is now merged in 2.18 and is getting released with 2.18.

@github-project-automation github-project-automation bot moved this from Backlog (Hot) to ✅ Done in Vector Search RoadMap Oct 25, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Features Introduces a new unit of functionality that satisfies a requirement indexing-improvements This label should be attached to all the github issues which will help improving the indexing time.
Projects
Status: Done
Development

No branches or pull requests

3 participants