[BUG] Insufficient number of hits for nested knn queries with efficient filter #2347

Open
CorentinLimier opened this issue Dec 20, 2024 · 20 comments
Labels: bug (Something isn't working), v2.19.0

Comments

@CorentinLimier

CorentinLimier commented Dec 20, 2024

Hello 👋

What is the bug?

We use an index to store text documents for semantic search. Because the text is long, we split it into paragraph-sized chunks and embed each chunk with the all-MiniLM-L6-v2 model. Each chunk embedding is stored in a nested field of the document.
Each document also has an accountId attribute that we filter on at query time (efficient filtering).

Then we run approximate knn queries with lucene hnsw.

Based on the documentation, I expect a knn query on this nested field with an efficient filter to return at least n hits, where n is the minimum of k and the number of documents that match the efficient filter.

But for some specific input vectors or query_text values, we get fewer than n hits, and sometimes even 0. With the same filter and a different query, we get the correct n hits.

We have two other indices without a nested field (only one vector per document) that use the same efficient filter, and they work as expected.

This seems similar to #2222 or #2339, except that the efficient filter here is as simple as a term filter.

How can one reproduce the bug?

The error happens only on specific queries, so it is hard to reproduce.

Here is the mapping of the index :

{
  "knowledge-index": {
    "mappings": {
      "properties": {
        "accountId": {
          "type": "keyword"
        },
        "id": {
          "type": "keyword"
        },
        "metadata": {
          "type": "text"
        },
        "metadataEmbedding": {
          "type": "nested",
          "properties": {
            "knn": {
              "type": "knn_vector",
              "dimension": 384,
              "method": {
                "engine": "lucene",
                "space_type": "l2",
                "name": "hnsw",
                "parameters": {}
              }
            }
          }
        },
        "timestamp": {
          "type": "date"
        }
      }
    }
  }
}

Here is the query :

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "from": 0,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "score_mode": "max",
      "path": "metadataEmbedding",
      "query": {
        "neural": {
          "metadataEmbedding.knn": {
            "query_text": "<query_text>",
            "model_id": "9QxR8YsBSCN1wquQEH2b",
            "k": <k>,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}

For k = 38, I get 6 hits

  "hits": {
    "total": {
      "value": 6,
      "relation": "eq"
    },
    "max_score": 0.50342417,

But for k = 1000 I get 32 hits, and for k = 10000 (the maximum value) 232 hits.

For another query_text value, the results differ: the number of hits always equals k (or 232, the number of documents that match the filter).

I get the same results when I first convert the text to a vector and pass the vector directly, without the neural clause:

POST /_plugins/_ml/_predict/text_embedding/9QxR8YsBSCN1wquQEH2b
{
  "text_docs":[ "<query_text>"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "size": 5,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "path": "metadataEmbedding",
      "query": {
        "knn": {
          "metadataEmbedding.knn": {
            "vector": [
                 ....
             ],
            "k": 38,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}
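
To run the raw-vector query above for many values of k and many accounts, the request body can be generated rather than edited by hand. This is a minimal sketch: the function name `build_nested_knn_query` is hypothetical, and the field names are taken from the mapping and query posted above. It only builds the JSON body and does not send it anywhere.

```python
from typing import Any, Dict, List, Optional


def build_nested_knn_query(
    vector: List[float],
    k: int,
    account_id: str,
    size: Optional[int] = None,
) -> Dict[str, Any]:
    """Build the search body for a nested knn query with an efficient filter.

    Mirrors the raw-vector query posted in this issue; nothing is sent to a cluster.
    """
    body: Dict[str, Any] = {
        "_source": {"excludes": ["metadataEmbedding"]},
        "query": {
            "nested": {
                "path": "metadataEmbedding",
                "query": {
                    "knn": {
                        "metadataEmbedding.knn": {
                            "vector": vector,
                            "k": k,
                            # The filter sits inside the knn clause
                            # (efficient filtering), not around it.
                            "filter": {"term": {"accountId": account_id}},
                        }
                    }
                },
            }
        },
    }
    if size is not None:
        # Only set size when requested, as in the original queries.
        body["size"] = size
    return body


if __name__ == "__main__":
    import json

    # A placeholder vector of the right dimension (384), not a real embedding.
    print(json.dumps(build_nested_knn_query([0.0] * 384, 38, "<account_id>", size=5))[:80])
```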

What is the expected behavior?

Getting n hits, where n is the minimum of k and the number of documents that match the efficient filter.

What is your host/environment?

  • OpenSearch version: 2.17.1

Do you have any additional context?

Here is the result of GET /_plugins/_knn/stats?pretty on the node :

{
      "max_distance_query_with_filter_requests": 0,
      "graph_memory_usage_percentage": 0,
      "graph_query_requests": 0,
      "graph_memory_usage": 0,
      "cache_capacity_reached": false,
      "load_success_count": 0,
      "training_memory_usage": 0,
      "indices_in_cache": {},
      "script_query_errors": 0,
      "hit_count": 0,
      "knn_query_requests": 2215,
      "total_load_time": 0,
      "miss_count": 0,
      "min_score_query_requests": 0,
      "knn_query_with_filter_requests": 2215,
      "training_memory_usage_percentage": 0,
      "max_distance_query_requests": 0,
      "lucene_initialized": true,
      "graph_index_requests": 0,
      "faiss_initialized": false,
      "load_exception_count": 0,
      "training_errors": 0,
      "min_score_query_with_filter_requests": 0,
      "eviction_count": 0,
      "nmslib_initialized": false,
      "script_compilations": 1,
      "script_query_requests": 2,
      "graph_stats": {
        "refresh": {
          "total_time_in_millis": 0,
          "total": 0
        },
        "merge": {
          "current": 0,
          "total": 0,
          "total_time_in_millis": 0,
          "current_docs": 0,
          "total_docs": 0,
          "total_size_in_bytes": 0,
          "current_size_in_bytes": 0
        }
      },
      "graph_query_errors": 0,
      "indexing_from_model_degraded": false,
      "graph_index_errors": 0,
      "training_requests": 0,
      "script_compilation_errors": 0
    },

Any idea what the issue could be here? Am I right to expect k hits for nested fields with an efficient filter?

Thanks for your help.

CorentinLimier added the bug (Something isn't working) and untriaged labels on Dec 20, 2024
@heemin32
Collaborator

heemin32 commented Dec 20, 2024

Further investigation into the code may be required, but based on the reported issue, it appears that post-filtering is being utilized instead of efficient filtering internally.

@heemin32
Collaborator

heemin32 commented Dec 20, 2024

@CorentinLimier Could you try the same query without using neural to see if this is a knn issue or a neural issue?

@CorentinLimier
Author

CorentinLimier commented Dec 20, 2024

@heemin32 Thanks for your help !

About this :

Further investigation into the code may be required, but based on the reported issue, it appears that post-filtering is being utilized instead of efficient filtering internally.

Indeed, that could be it, but what's odd is that for some other vectors (with the exact same query, index and filter) we do get the correct number of hits, which would be highly unlikely with post-filtering. So would it use post-filtering for some vectors and efficient filtering for others?

Could you try the same query without using neural to see if this is knn issue or neural issue?

I did run a query without neural, with the same result (posted in my initial message). Or did you mean something other than the two queries posted?

Here is the query without neural:

POST /_plugins/_ml/_predict/text_embedding/9QxR8YsBSCN1wquQEH2b
{
  "text_docs":[ "<query_text>"],
  "return_number": true,
  "target_response": ["sentence_embedding"]
}

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "size": 5,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "path": "metadataEmbedding",
      "query": {
        "knn": {
          "metadataEmbedding.knn": {
            "vector": [
                 ....
             ],
            "k": 38,
            "filter": {
              "term": {
                "accountId": "<account_id>"
              }
            }
          }
        }
      }
    }
  }
}

Results are exactly the same for each value of k: 6 hits for k = 38, 32 hits for k = 1000, all hits for k = 10000.

Thanks a lot 🙏

@heemin32
Collaborator

heemin32 commented Dec 20, 2024

Oh, you were talking about the value in hits. Could you also check whether it returns only that number of documents when size is set to the same value as k, to see whether this is an issue with the hits count or with the search results as well?

Also, it would be nice if you could provide reproducible steps with a smaller data set.

@CorentinLimier
Author

@heemin32

I do get the same number of documents returned as the hits value when size = k.

Ex :

GET /knowledge-index/_search?preference=_primary&explain=true&request_cache=false
{
  "size": 38,
  "_source": {
    "excludes": [
      "metadataEmbedding"
    ]
  },
  "query": {
    "nested": {
      "path": "metadataEmbedding",
      "query": {
        "knn": {
          "metadataEmbedding.knn": {
            "vector": [
               ...
            ],
            "k": 38,
            "filter": {
              "term": {
                "accountId": "..."
              }
            }
          }
        }
      }
    }
  }
}

Returns 6 hits (instead of k=38) and 6 documents (instead of size=38).
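
The check above (hits total versus documents actually returned versus k) can be applied to any search response. A minimal sketch, assuming the standard `hits.total.value` and `hits.hits` layout shown earlier in this issue; the helper name `summarize_hits` is hypothetical, and the sample response is a stub shaped like the reported result, not real output.

```python
from typing import Any, Dict


def summarize_hits(response: Dict[str, Any], k: int) -> Dict[str, Any]:
    """Compare the reported hit total, the returned documents, and k."""
    hits = response["hits"]
    total = hits["total"]["value"]   # value reported under hits.total
    returned = len(hits["hits"])     # documents actually present in the response
    return {
        "total": total,
        "returned": returned,
        "k": k,
        # Only a problem if at least k documents match the filter.
        "short_of_k": total < k,
    }


if __name__ == "__main__":
    # Stub response mirroring the reported case: 6 hits for k = 38, size = 38.
    stub = {"hits": {"total": {"value": 6, "relation": "eq"}, "hits": [{}] * 6}}
    print(summarize_hits(stub, k=38))
```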

Also, it would be nice if you could provide reproducible steps with a smaller data set.

I will try to create a reproducible example; I understand that it will help. I first wanted to make sure that my issue was not a mistake on my part or a misunderstanding of the docs. From your understanding of the mapping and queries, do we agree that I should expect k documents even with nested fields and efficient filtering?

Thanks a lot

@heemin32
Collaborator

The exact hit count is a little more complex than the minimum of k and max doc. It is the sum, over all segments, of (the minimum of k and max doc in that segment). Still, 6 hits for k = 38 and 32 hits for k = 1000 is not expected behavior.

It could be an issue with the lucene engine. Could you test it with the faiss engine if possible?
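
The per-segment rule described above can be written out as a small calculation. This is only a sketch of the arithmetic: the function name is hypothetical, "max doc in a segment" is read here as the number of filter-matching documents in that segment (an assumption), and the segment sizes in the example are made up, since the actual segment layout of the index is not given in this thread.

```python
from typing import List


def expected_hit_count(k: int, matching_docs_per_segment: List[int]) -> int:
    """Sum over segments of min(k, matching documents in that segment)."""
    if k < 0 or any(n < 0 for n in matching_docs_per_segment):
        raise ValueError("k and per-segment counts must be non-negative")
    return sum(min(k, n) for n in matching_docs_per_segment)


if __name__ == "__main__":
    # Hypothetical layouts for 232 matching documents and k = 38.
    print(expected_hit_count(38, [232]))       # one segment: 38
    print(expected_hit_count(38, [100, 132]))  # two large segments: 76
    print(expected_hit_count(38, [10, 222]))   # one small segment: 10 + 38 = 48
```

Under this rule the count can exceed k when there are several segments, but it should not fall below min(k, total matching documents).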

@CorentinLimier
Author

CorentinLimier commented Dec 20, 2024

@heemin32 Ok, I will try with faiss (by changing the engine attribute in the mapping, right?). I will probably be able to do so in early January. I will try to create a reproducible example as well.

Does it make sense for this lucene issue to occur only with nested structures?

Thank you very much, I will give you more details once I am able to work on it 🙏
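
For reference, switching engines as discussed above amounts to changing one value in the knn_vector method definition. The sketch below rebuilds the mappings posted at the top of this issue with the engine as a parameter; the function name is hypothetical, and only the two engines mentioned in this thread are accepted. Index settings and the steps for creating and repopulating the index are not shown in the thread and are left out here.

```python
from typing import Any, Dict


def knowledge_index_mappings(engine: str = "lucene") -> Dict[str, Any]:
    """Return the mappings from this issue with the knn engine swapped."""
    if engine not in {"lucene", "faiss"}:
        raise ValueError(f"unsupported engine for this sketch: {engine!r}")
    return {
        "properties": {
            "accountId": {"type": "keyword"},
            "id": {"type": "keyword"},
            "metadata": {"type": "text"},
            "metadataEmbedding": {
                "type": "nested",
                "properties": {
                    "knn": {
                        "type": "knn_vector",
                        "dimension": 384,
                        "method": {
                            "engine": engine,  # the only value that changes
                            "space_type": "l2",
                            "name": "hnsw",
                            "parameters": {},
                        },
                    }
                },
            },
            "timestamp": {"type": "date"},
        }
    }


if __name__ == "__main__":
    method = knowledge_index_mappings("faiss")["properties"]["metadataEmbedding"]
    print(method["properties"]["knn"]["method"]["engine"])
```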

@heemin32
Collaborator

Does it make sense to have this lucene issue only for nested structures ?

Yes

@navneet1v
Collaborator

@buddharajusahil can you please take a look into this issue.

@buddharajusahil
Contributor

Sure @navneet1v , please assign me this task.

@CorentinLimier
Author

CorentinLimier commented Dec 30, 2024

Hello,

I didn't manage to create a reproducible example with a smaller dataset.
You can find my attempt here, but everything works as expected in this example:
tentative.txt

I am not sure whether it's a volume issue, and it is quite difficult to reproduce since in production it works as expected for some vectors and not for others.

My next attempt will be to use the faiss engine, as suggested by @heemin32, and see whether I see differences in production. We monitor the number of results for each request, so I will be able to see whether the number of results is closer to what I expect.

If you have any idea what the root cause of the bug could be, and can help me reproduce the issue with a smaller dataset, I would be happy to help. The same goes if I can provide more details on the current situation in production.

Thanks

@buddharajusahil
Contributor

Hi @CorentinLimier I think this is related to an overall problem with efficient filtering. Can you try one thing: retry this attempt, but with a shard count of 2 specified in the index settings. Then try running the search multiple times; I believe you will get inconsistent results.

@CorentinLimier
Author

@buddharajusahil I get consistent results even with 2 shards with this sample of data.

In production, we actually have only one shard, but also one replica.

I also tried these settings with this sample of data and couldn't reproduce :/

@CorentinLimier
Author

CorentinLimier commented Jan 6, 2025

We switched the engine from lucene to faiss and the results are far better.
With lucene, here are some searches on random words with k = 38:

Searching passion in account account1: 34
Searching jovial in account account1: 38
Searching keen in account account1: 38
Searching knowledge in account account1: 38
Searching motivation in account account1: 38
Searching open-mindedly in account account1: 38
Searching park in account account1: 38
Searching read in account account1: 38
Searching village in account account1: 38
Searching yearning in account account1: 38
Searching collaboration in account account2: 0
Searching dance in account account2: 0
Searching focus in account account2: 0
Searching innovation in account account2: 0
Searching queen in account account2: 0
Searching trustworthy in account account2: 0
Searching clarity in account account2: 38
Searching effort in account account3: 0
Searching endurance in account account3: 0
Searching freedom in account account3: 0
Searching jovially in account account3: 0
Searching strong in account account3: 0
Searching sun in account account3: 0
Searching virtue in account account3: 0
Searching guidance in account account3: 38
Searching climb in account account4: 38
Searching enjoy in account account4: 38
Searching giraffe in account account4: 38
Searching innocently in account account4: 38
Searching loudly in account account4: 38
Searching reliability in account account4: 38
Searching unique in account account4: 38
Searching warmly in account account4: 38
Searching yearly in account account4: 38
Searching eagerly in account account5: 0
Searching faithfully in account account5: 0
Searching graceful in account account5: 0
Searching graceful in account account5: 0
Searching kinetic in account account5: 0
Searching mindfulness in account account5: 0
Searching quiet in account account5: 0
Searching rarely in account account5: 0
Searching resilience in account account5: 0
Searching respect in account account5: 0
Searching truly in account account5: 0
Searching unity in account account5: 0
Searching quiet in account account6: 2
Searching garden in account account6: 38
Searching justice in account account6: 38
Searching youth in account account6: 38
Searching zeal in account account6: 38
Searching house in account account7: 38
Searching kindness in account account7: 38
Searching merrily in account account7: 38
Searching navigate in account account7: 38
Searching opportunity in account account7: 38
Searching optimistic in account account7: 38
Searching orchestra in account account7: 38
Searching steadily in account account7: 38
Searching carefully in account account8: 38
Searching clarity in account account8: 38
Searching create in account account8: 38
Searching energy in account account8: 38
Searching fiercely in account account8: 38
Searching gather in account account8: 38
Searching iceberg in account account8: 38
Searching lion in account account8: 38
Searching night in account account8: 38
Searching open-mindedly in account account8: 38
Searching open-mindedly in account account8: 38
Searching opportunity in account account8: 38
Searching resilient in account account8: 38
Searching respectfully in account account8: 38
Searching steadily in account account8: 38
Searching teamwork in account account8: 38
Searching teamwork in account account8: 38
Searching upliftment in account account8: 38
Searching vibrant in account account8: 38
Searching waterfall in account account8: 38
Searching wholeheartedly in account account8: 38
Searching wholeheartedly in account account8: 38
Searching forest in account account9: 38
Searching kinetic in account account9: 38
Searching openly in account account9: 38
Searching patience in account account9: 38
Searching practice in account account9: 38
Searching teach in account account9: 38
Searching unicorn in account account9: 38
Searching wonderful in account account9: 38
Searching yearly in account account9: 38
Searching absolutely in account account10: 38
Searching adventure in account account10: 38
Searching collaboration in account account10: 38
Searching eagerly in account account10: 38
Searching graceful in account account10: 38
Searching jewelry in account account10: 38
Searching magical in account account10: 38
Searching steadily in account account10: 38
Searching understanding in account account10: 38
Searching uplift in account account10: 38

Note that for account1, passion returned 34 results while the other words returned 38. Searches on account3 worked only for the word guidance. For account6, the search on quiet returned only 2 results.
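
Shortfalls like the ones noted above can be pulled out of such a log automatically. A minimal sketch, assuming every line has the exact form `Searching <word> in account <account>: <count>` as in the listing; the function name `find_shortfalls` is hypothetical, and the sample lines are copied from the lucene results.

```python
import re
from collections import defaultdict
from typing import Dict, List, Tuple

# Matches lines such as "Searching passion in account account1: 34".
LINE_RE = re.compile(
    r"^Searching (?P<word>\S+) in account (?P<account>\S+): (?P<count>\d+)$"
)


def find_shortfalls(lines: List[str], k: int) -> Dict[str, List[Tuple[str, int]]]:
    """Group the searches that returned fewer than k results by account."""
    shortfalls: Dict[str, List[Tuple[str, int]]] = defaultdict(list)
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a result line
        count = int(match.group("count"))
        if count < k:
            shortfalls[match.group("account")].append((match.group("word"), count))
    return dict(shortfalls)


if __name__ == "__main__":
    sample = [
        "Searching passion in account account1: 34",
        "Searching jovial in account account1: 38",
        "Searching dance in account account2: 0",
        "Searching quiet in account account6: 2",
    ]
    for account, misses in find_shortfalls(sample, k=38).items():
        print(account, misses)
```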

With faiss :

Searching knowledge in account account1: 38
Searching open-mindedly in account account1: 38
Searching park in account account1: 38
Searching read in account account1: 38
Searching village in account account1: 38
Searching jovial in account account1: 76
Searching keen in account account1: 76
Searching motivation in account account1: 76
Searching passion in account account1: 76
Searching yearning in account account1: 76
Searching collaboration in account account2: 18
Searching trustworthy in account account2: 18
Searching clarity in account account2: 56
Searching dance in account account2: 56
Searching focus in account account2: 56
Searching innovation in account account2: 56
Searching queen in account account2: 56
Searching freedom in account account3: 38
Searching effort in account account3: 76
Searching endurance in account account3: 76
Searching guidance in account account3: 76
Searching jovially in account account3: 76
Searching strong in account account3: 76
Searching sun in account account3: 76
Searching virtue in account account3: 76
Searching climb in account account4: 266
Searching enjoy in account account4: 266
Searching giraffe in account account4: 266
Searching innocently in account account4: 266
Searching loudly in account account4: 266
Searching reliability in account account4: 266
Searching unique in account account4: 266
Searching warmly in account account4: 266
Searching yearly in account account4: 266
Searching eagerly in account account5: 38
Searching faithfully in account account5: 38
Searching graceful in account account5: 38
Searching graceful in account account5: 38
Searching kinetic in account account5: 38
Searching mindfulness in account account5: 38
Searching quiet in account account5: 38
Searching rarely in account account5: 38
Searching resilience in account account5: 38
Searching respect in account account5: 38
Searching truly in account account5: 38
Searching unity in account account5: 38
Searching garden in account account6: 76
Searching justice in account account6: 76
Searching quiet in account account6: 76
Searching youth in account account6: 76
Searching zeal in account account6: 76
Searching house in account account7: 114
Searching kindness in account account7: 114
Searching merrily in account account7: 114
Searching navigate in account account7: 114
Searching opportunity in account account7: 114
Searching optimistic in account account7: 114
Searching orchestra in account account7: 114
Searching steadily in account account7: 114
Searching carefully in account account8: 76
Searching clarity in account account8: 76
Searching create in account account8: 76
Searching energy in account account8: 76
Searching fiercely in account account8: 76
Searching gather in account account8: 76
Searching iceberg in account account8: 76
Searching lion in account account8: 76
Searching night in account account8: 76
Searching open-mindedly in account account8: 76
Searching open-mindedly in account account8: 76
Searching opportunity in account account8: 76
Searching resilient in account account8: 76
Searching respectfully in account account8: 76
Searching steadily in account account8: 76
Searching teamwork in account account8: 76
Searching teamwork in account account8: 76
Searching upliftment in account account8: 76
Searching vibrant in account account8: 76
Searching waterfall in account account8: 76
Searching wholeheartedly in account account8: 76
Searching wholeheartedly in account account8: 76
Searching forest in account account9: 76
Searching kinetic in account account9: 76
Searching openly in account account9: 76
Searching patience in account account9: 76
Searching practice in account account9: 76
Searching teach in account account9: 76
Searching unicorn in account account9: 76
Searching wonderful in account account9: 76
Searching yearly in account account9: 76
Searching absolutely in account account10: 38
Searching adventure in account account10: 38
Searching collaboration in account account10: 38
Searching eagerly in account account10: 38
Searching graceful in account account10: 38
Searching jewelry in account account10: 38
Searching magical in account account10: 38
Searching steadily in account account10: 38
Searching understanding in account account10: 38
Searching uplift in account account10: 38

Note that I also had issues with faiss on account2, where I got only 18 results for searches on trustworthy and collaboration, but it seems to be very rare to get fewer than k results (I understand that we often get more because faiss provides k * segments * shards).

Note also the distribution of the number of documents per account:
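
Read against the assumption stated above (up to k hits per segment), each faiss count can be split into groups that supplied a full k plus a remainder. This is a sketch of that arithmetic only: the helper name is hypothetical, and the actual number of segments in the index is not given in this thread.

```python
from typing import Tuple


def split_by_k(hit_count: int, k: int) -> Tuple[int, int]:
    """Return (number of full groups of k, remainder) for a hit count."""
    if k <= 0:
        raise ValueError("k must be positive")
    if hit_count < 0:
        raise ValueError("hit_count must be non-negative")
    return divmod(hit_count, k)


if __name__ == "__main__":
    # Counts taken from the faiss results listed above, with k = 38.
    for count in (18, 38, 56, 76, 114, 266):
        full, rest = split_by_k(count, 38)
        print(f"{count}: {full} x 38 + {rest}")
```

For k = 38 this gives 76 = 2 x 38, 114 = 3 x 38, 266 = 7 x 38 and 56 = 1 x 38 + 18, which would be consistent with some segments holding fewer than k matching chunks, while 18 alone stays below k.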

account1: 3758
account2: 4498
account3: 4501
account4: 22358
account5: 845
account6: 1151
account7: 6841
account8: 711
account9: 1477
account10: 1409

I thought at first that accounts with a small number of documents would be the ones impacted, but that does not seem to be the case.

I am still trying to reproduce with dummy data, so far without success.

@heemin32
Collaborator

heemin32 commented Jan 6, 2025

This issue might be related with #2359?

@CorentinLimier
Author

@heemin32 in my case I only have one shard, but maybe issue #2359 also applies to shard segments ?

@heemin32
Collaborator

heemin32 commented Jan 8, 2025

@heemin32 in my case I only have one shard, but maybe issue #2359 also applies to shard segments ?

@buddharajusahil could you answer this question?

@buddharajusahil
Contributor

@CorentinLimier @heemin32 I don't think this is the same issue as #2359 . I believe that issue can only occur in a multi shard setup, not on single shard.

@buddharajusahil
Contributor

Hi @CorentinLimier do you know if this problem only started arising in version 2.17.1? Also, for the queries that produce incorrect results, are the results consistent, i.e. do you get the same result every time? Also, have you tried different filters to see if that produces different results for the same query text?

Appreciate the info thus far!

@CorentinLimier
Author

CorentinLimier commented Jan 10, 2025

@buddharajusahil Thank you for your answer !

Also, on these certain queries produce incorrect results, are they consistent and get the same result every time

Yes, at least for a certain amount of time, but since we index new documents every day I believe it can change. Without indexing new documents, I got the same results for the same queries executed multiple times, even with request_cache set to false.

Also have you tried different filters to see if that produces different results for the same query text?

Yes, some query_text values worked fine with some filters (producing results) and not with others (0 results where we expect some). In the example provided above, collaboration gives 38 results on account10 and none on account2.

do you know if this problem only started arising in version 2.17.1

I don't know if it started with 2.17.1 or if we had it with a prior version.

But actually, upgrading to 2.18.0 seems to fix the issue. The only condition is that we reindex the documents (if we just upgrade the version without dropping the index, the issue remains). Of course, I had previously tried recreating the index from scratch on 2.17, and the issue remained with that version.

So I wonder if the issue on 2.17 is on the hnsw side at indexing time 🤔

Will keep an eye on this since we will upgrade our production cluster to 2.18.0 next week

Thanks

Projects
Status: 2.19.0
Development

No branches or pull requests

4 participants