Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MB-58901: Introduce support for BM25 scoring #2113

Open
wants to merge 26 commits into
base: master
Choose a base branch
from

Conversation

Thejas-bhat
Copy link
Member

@Thejas-bhat Thejas-bhat commented Dec 6, 2024

Introducing support for BM25 scoring

Key stats necessary for the scoring

  • fieldLength - the number of terms in a field within a doc.
  • avgDocLength - the average of terms in a field across all the docs in the index.
  • totalDocs - total number of docs in an index.

Introduces a mechanism to maintain consistent scoring in a situation where the index is partitioned as a bleve.IndexAlias. This is achieved using the existing preSearch mechanism where the first phase of the entire search involves fetching the above mentioned stats, aggregating them and redistributing back to the bleve indexes which would use them while calculating the score for a hit.

Implementation wise, the user needs to explicitly mention BM25 as the scoring mechanism at indexMapping.ScoringModel level to actually use this scoring mechanism. This parameter is a global setting, i.e. when performing a search on multiple fields, all the fields are scored with the same scoring model.
The storage layer exposes an API which returns the number of terms in a field's term dictionary which is used to compute the avgDocLength. At the indexing layer, we check if the queried field supports BM25 scoring and if consistent scoring is availed. This is followed by fetching the stats either from the local bleve index or from a context (in the case where we're availing the consistent scoring) to compute the actual score.

Note: The scoring is highly dependent on the size of an individual bleve index's termDictionary (specific to a field) so there can be some discrepancies especially given that each index is further composed of multiple 'segments'. However in large scale use cases these discrepancies can be quite small and don't affect the order of the doc hits - in which case the user may choose to avoid this altogether.

@Thejas-bhat Thejas-bhat force-pushed the bm25-refactor branch 5 times, most recently from 4b626d0 to 45efde1 Compare December 12, 2024 10:39
Base automatically changed from presearchRefactor to master December 17, 2024 08:52
@Thejas-bhat Thejas-bhat changed the title WIP: BM25 scoring MB-58901: Introduce support for BM25 scoring Jan 6, 2025
@Thejas-bhat Thejas-bhat marked this pull request as ready for review January 6, 2025 11:44
@abhinavdangeti abhinavdangeti added this to the v2.5.0 milestone Jan 6, 2025
@abhinavdangeti
Copy link
Member

@Thejas-bhat it seems you'll need to push up the Cardinality() api to all zap versions.

Copy link
Member

@abhinavdangeti abhinavdangeti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Thejas-bhat Make sure to pull the latest from origin/bm25-refactor before you push more commits here :)

mapping/index.go Outdated Show resolved Hide resolved
search/searcher/search_term.go Outdated Show resolved Hide resolved
search/util.go Show resolved Hide resolved
search/util.go Outdated Show resolved Hide resolved
search/util.go Outdated Show resolved Hide resolved
@abhinavdangeti
Copy link
Member

@Thejas-bhat a thought regarding the "global scoring" code path for bm25 - what is the default behavior in elastic?

Would you add a couple of GO benchmark test to differentiate between bm25 with and without global scoring and record these numbers within the commit message^ - trying to decide whether to enable "global scoring" by default.

search/scorer/scorer_term.go Outdated Show resolved Hide resolved
search/scorer/scorer_term.go Show resolved Hide resolved
search/searcher/search_term.go Outdated Show resolved Hide resolved
search/searcher/search_term.go Outdated Show resolved Hide resolved
@Thejas-bhat
Copy link
Member Author

@Thejas-bhat a thought regarding the "global scoring" code path for bm25 - what is the default behavior in elastic?

Would you add a couple of GO benchmark test to differentiate between bm25 with and without global scoring and record these numbers within the commit message^ - trying to decide whether to enable "global scoring" by default.

By default, elastic disables the feature. It'll be a bit difficult to benchmark this at golang unit level over here, because the latency is mainly visible when the index alias has multiple shards and each of which is spread across multiple nodes.

Copy link
Member

@abhinavdangeti abhinavdangeti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor refactor suggestion, looks good to me otherwise.

@@ -485,6 +485,8 @@ func (i *indexImpl) preSearch(ctx context.Context, req *SearchRequest, reader in
}

var fts search.FieldTermSynonymMap
var count uint64
fieldCardinality := make(map[string]int)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets change this to ..

var fieldCardinality map[string]int

.. and make only if isBM25Enabled(..) == true in line 499 below.

@@ -578,6 +604,14 @@ func (i *indexImpl) SearchInContext(ctx context.Context, req *SearchRequest) (sr
}
skipSynonymCollector = true
}
skipKNNCollector = true
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this right? Something we missed?

// preSearchRequired checks if preSearch is required and returns a boolean flag
// It only allocates the preSearchFlags struct if necessary
func preSearchRequired(req *SearchRequest, m mapping.IndexMapping) (*preSearchFlags, error) {
func isBM25Enabled(m mapping.IndexMapping) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's refactor this to scoringModel(..) which returns the scoring model to use instead.

@@ -605,6 +639,21 @@ func (i *indexImpl) SearchInContext(ctx context.Context, req *SearchRequest) (sr
ctx = context.WithValue(ctx, search.FieldTermSynonymMapKey, fts)
}

scoringModelCallback := func() string {
if isBM25Enabled(i.m) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related to other commend, let's update this method to return the scoring model to use.
To establish if it's bm25, you can add the extra check - scoringModel() == index.BM25Scoring.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants