MB-58901: Introduce support for BM25 scoring #2113

Thejas-bhat · 2024-12-06T06:50:35Z

Introducing support for BM25 scoring

Key stats necessary for the scoring

fieldLength - the number of terms in a field within a doc.
avgDocLength - the average number of terms in a field across all the docs in the index.
totalDocs - the total number of docs in an index.
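
For reference, here is a minimal sketch of how these three stats feed a textbook Okapi BM25 score. It is not necessarily the exact expression used in this PR: the function name `bm25Score` and the constants `k1 = 1.2` and `b = 0.75` are assumptions (commonly used defaults), chosen only for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// Assumed tuning constants: commonly used BM25 defaults, not taken from this PR.
const (
	k1 = 1.2
	b  = 0.75
)

// bm25Score returns the textbook Okapi BM25 score of a single term for a
// single doc. termFreq is the term's frequency in the doc's field and
// docFreq is the number of docs containing the term.
func bm25Score(termFreq, docFreq, totalDocs uint64, fieldLength int, avgDocLength float64) float64 {
	if totalDocs == 0 || avgDocLength == 0 {
		return 0 // nothing indexed yet, so there is nothing to score against
	}
	n := float64(docFreq)
	// Inverse document frequency: rarer terms contribute more.
	idf := math.Log(1 + (float64(totalDocs)-n+0.5)/(n+0.5))
	tf := float64(termFreq)
	// Length normalisation: fields longer than average are penalised.
	norm := 1 - b + b*float64(fieldLength)/avgDocLength
	return idf * tf * (k1 + 1) / (tf + k1*norm)
}

func main() {
	short := bm25Score(2, 10, 1000, 50, 100)
	long := bm25Score(2, 10, 1000, 200, 100)
	fmt.Printf("short field: %.4f, long field: %.4f\n", short, long)
}
```

With the same term frequency, the doc with the shorter field scores higher, which is why fieldLength and avgDocLength have to be tracked in addition to totalDocs.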

Introduces a mechanism to keep scoring consistent when the index is partitioned behind a bleve.IndexAlias. It uses the existing preSearch mechanism: the first phase of the search fetches the stats mentioned above from each partition, aggregates them, and redistributes the aggregate to the bleve indexes, which then use it when calculating the score for a hit.
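
A rough sketch of the aggregation step, under stated assumptions: the types `partitionStats` and `globalStats` and the function `aggregate` are hypothetical names, and avgDocLength is assumed to be the summed field cardinality divided by the summed doc count. The real preSearch plumbing in bleve is not reproduced here.

```go
package main

import "fmt"

// partitionStats is a hypothetical per-partition summary collected in the
// first (preSearch) phase.
type partitionStats struct {
	docCount         uint64
	fieldCardinality map[string]int // field name -> number of terms in the field
}

// globalStats is the hypothetical aggregate sent back to every partition.
type globalStats struct {
	totalDocs    uint64
	avgDocLength map[string]float64 // field name -> average terms per doc
}

// aggregate sums the per-partition stats and derives one average per field.
func aggregate(parts []partitionStats) globalStats {
	out := globalStats{avgDocLength: make(map[string]float64)}
	totals := make(map[string]int)
	for _, p := range parts {
		out.totalDocs += p.docCount
		for field, n := range p.fieldCardinality {
			totals[field] += n
		}
	}
	if out.totalDocs == 0 {
		return out // avoid dividing by zero on an empty alias
	}
	for field, n := range totals {
		out.avgDocLength[field] = float64(n) / float64(out.totalDocs)
	}
	return out
}

func main() {
	parts := []partitionStats{
		{docCount: 100, fieldCardinality: map[string]int{"title": 500}},
		{docCount: 300, fieldCardinality: map[string]int{"title": 2700}},
	}
	g := aggregate(parts)
	fmt.Println(g.totalDocs, g.avgDocLength["title"])
}
```

In this example the two partitions would locally see averages of 5 and 9 terms per doc for the same field, while the aggregate is 8. Scoring every partition against the aggregate is what keeps scores comparable across the alias.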

Implementation-wise, the user must explicitly set BM25 as the scoring model at the indexMapping.ScoringModel level to use it. This parameter is a global setting, i.e. when a search spans multiple fields, all of them are scored with the same scoring model.
The storage layer exposes an API that returns the number of terms in a field's term dictionary, which is used to compute avgDocLength. At the indexing layer, we check whether the queried field supports BM25 scoring and whether consistent scoring is enabled. The stats are then fetched either from the local bleve index or from a context (when consistent scoring is in use) to compute the actual score.
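
The "local index or context" choice could look roughly like the sketch below. All names here (`bm25Stats`, `statsKey`, `newSearchContext`, `resolveStats`) are hypothetical and are not the identifiers used in this PR; the sketch only shows the fallback pattern using the standard `context` package.

```go
package main

import (
	"context"
	"fmt"
)

// bm25Stats is a hypothetical bundle of the stats needed for scoring.
type bm25Stats struct {
	totalDocs    uint64
	avgDocLength float64
}

// statsKey is a hypothetical, unexported context key type.
type statsKey struct{}

// newSearchContext builds the context for the scoring phase. When global
// stats were aggregated in an earlier phase they are attached to the
// context; otherwise a plain background context is returned.
func newSearchContext(global *bm25Stats) context.Context {
	if global == nil {
		return context.Background()
	}
	return context.WithValue(context.Background(), statsKey{}, *global)
}

// resolveStats prefers the stats carried in ctx (consistent scoring) and
// falls back to the stats of the local index.
func resolveStats(ctx context.Context, local bm25Stats) bm25Stats {
	if s, ok := ctx.Value(statsKey{}).(bm25Stats); ok {
		return s
	}
	return local
}

func main() {
	local := bm25Stats{totalDocs: 100, avgDocLength: 5}
	global := bm25Stats{totalDocs: 400, avgDocLength: 8}

	fmt.Println(resolveStats(newSearchContext(nil), local))     // local stats
	fmt.Println(resolveStats(newSearchContext(&global), local)) // aggregated stats
}
```

Keeping the fallback in one place means the scorer itself does not need to know whether consistent scoring was requested.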

Note: The score depends heavily on the size of an individual bleve index's termDictionary (specific to a field), so there can be some discrepancies between partitions, especially given that each index is further composed of multiple 'segments'. However, in large-scale use cases these discrepancies can be small enough not to affect the order of the doc hits, in which case the user may choose to skip consistent scoring altogether.

abhinavdangeti · 2025-01-07T15:28:56Z

@Thejas-bhat it seems you'll need to push up the Cardinality() API to all zap versions.

abhinavdangeti

@Thejas-bhat Make sure to pull the latest from origin/bm25-refactor before you push more commits here :)

mapping/index.go

search/searcher/search_term.go

search/util.go

abhinavdangeti · 2025-01-09T22:11:20Z

@Thejas-bhat a thought regarding the "global scoring" code path for bm25 - what is the default behavior in elastic?

Would you add a couple of Go benchmark tests to differentiate between bm25 with and without global scoring, and record the numbers in the commit message? I'm trying to decide whether to enable "global scoring" by default.

search/scorer/scorer_term.go

search/searcher/search_term.go

Thejas-bhat · 2025-01-10T07:24:50Z

> @Thejas-bhat a thought regarding the "global scoring" code path for bm25 - what is the default behavior in elastic?
>
> Would you add a couple of Go benchmark tests to differentiate between bm25 with and without global scoring, and record the numbers in the commit message? I'm trying to decide whether to enable "global scoring" by default.

By default, elastic disables the feature. It'll be a bit difficult to benchmark this at the Go unit-test level here, because the latency is mainly visible when the index alias has multiple shards, each of which is spread across multiple nodes.

search/scorer/scorer_term.go

search/searcher/search_term.go

search/util.go

abhinavdangeti

Minor refactor suggestion, looks good to me otherwise.

abhinavdangeti · 2025-01-13T17:38:32Z

index_impl.go

@@ -485,6 +485,8 @@ func (i *indexImpl) preSearch(ctx context.Context, req *SearchRequest, reader in
 	}

 	var fts search.FieldTermSynonymMap
+	var count uint64
+	fieldCardinality := make(map[string]int)


Let's change this to

var fieldCardinality map[string]int

and call make only if isBM25Enabled(..) == true, at line 499 below.
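
A small standalone sketch of the suggested pattern, assuming a hypothetical helper `collectCardinality` and a stand-in `count` callback (neither is part of the PR). It relies on the Go guarantee that reading from a nil map is safe, so callers need no extra nil check when BM25 is off.

```go
package main

import "fmt"

// collectCardinality mirrors the suggestion: the map stays nil (no
// allocation) unless BM25 is enabled. count is a stand-in for whatever
// returns the cardinality of a field.
func collectCardinality(bm25Enabled bool, fields []string, count func(string) int) map[string]int {
	var fieldCardinality map[string]int // nil until actually needed
	if bm25Enabled {
		fieldCardinality = make(map[string]int, len(fields))
		for _, f := range fields {
			fieldCardinality[f] = count(f)
		}
	}
	return fieldCardinality
}

func main() {
	// Stand-in counter: uses the field name's length as a fake cardinality.
	count := func(field string) int { return len(field) }

	off := collectCardinality(false, []string{"title"}, count)
	on := collectCardinality(true, []string{"title"}, count)

	// Reads on a nil map are safe and yield zero values; only writes panic.
	fmt.Println(off == nil, len(off), off["title"])
	fmt.Println(on["title"])
}
```

The one thing to watch for with this change is that any write to the map must stay inside the BM25-enabled branch, since writing to a nil map panics.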

abhinavdangeti · 2025-01-13T17:39:29Z

index_impl.go

@@ -578,6 +604,14 @@ func (i *indexImpl) SearchInContext(ctx context.Context, req *SearchRequest) (sr
 					}
 					skipSynonymCollector = true
 				}
+				skipKNNCollector = true


Is this right? Something we missed?

abhinavdangeti · 2025-01-13T17:40:58Z

index_alias_impl.go

-// preSearchRequired checks if preSearch is required and returns a boolean flag
-// It only allocates the preSearchFlags struct if necessary
-func preSearchRequired(req *SearchRequest, m mapping.IndexMapping) (*preSearchFlags, error) {
+func isBM25Enabled(m mapping.IndexMapping) bool {


Let's refactor this to scoringModel(..) which returns the scoring model to use instead.

abhinavdangeti · 2025-01-13T17:42:20Z

index_impl.go

@@ -605,6 +639,21 @@ func (i *indexImpl) SearchInContext(ctx context.Context, req *SearchRequest) (sr
 		ctx = context.WithValue(ctx, search.FieldTermSynonymMapKey, fts)
 	}

+	scoringModelCallback := func() string {
+		if isBM25Enabled(i.m) {


Related to the other comment, let's update this method to return the scoring model to use.
To establish whether it's bm25, you can add the extra check: scoringModel() == index.BM25Scoring.
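
Something along these lines, as a self-contained sketch only. The `indexMapping` struct is a hypothetical stand-in for mapping.IndexMapping, the constant values "tfidf" and "bm25" are assumed, and the fallback to TF-IDF for an unset or unrecognised value is an assumption based on BM25 being opt-in.

```go
package main

import "fmt"

// Stand-ins for the scoring model constants; the string values are assumed.
const (
	TFIDFScoring = "tfidf"
	BM25Scoring  = "bm25"
)

// indexMapping is a hypothetical stand-in for mapping.IndexMapping.
type indexMapping struct {
	ScoringModel string
}

// scoringModel returns the model to use for the index. Anything other than
// an explicit BM25 setting falls back to TF-IDF (assumed default).
func scoringModel(m *indexMapping) string {
	if m != nil && m.ScoringModel == BM25Scoring {
		return BM25Scoring
	}
	return TFIDFScoring
}

func main() {
	m := &indexMapping{ScoringModel: BM25Scoring}

	// The former isBM25Enabled(m) check becomes a comparison at the call site.
	if scoringModel(m) == BM25Scoring {
		fmt.Println("using bm25")
	}
	fmt.Println(scoringModel(&indexMapping{}))
}
```

Returning the model rather than a boolean keeps the call sites unchanged in spirit and leaves room for additional scoring models later.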

Thejas-bhat force-pushed the bm25-refactor branch from 2b54a8d to 738dfe1 Compare December 6, 2024 06:51

Thejas-bhat force-pushed the presearchRefactor branch from 8b10cdf to d58474f Compare December 6, 2024 06:54

Thejas-bhat force-pushed the bm25-refactor branch 5 times, most recently from 4b626d0 to 45efde1 Compare December 12, 2024 10:39

Base automatically changed from presearchRefactor to master December 17, 2024 08:52

metonymic-smokey and others added 16 commits January 2, 2025 11:00

hacky start

bbe4ae7

use ctx in term srch

a679009

field cardinality temp save

2d8a43d

average doc length stat for a field

52b1768

bm25 scoring first implementation

42082f8

notes and keep the default tf-idf stuff

a52bd49

bug fixes and BM25 UT pass

36159b6

making bm25 presearch (i.e. global scoring) optional

f3424b5

field mapping to capture type of scoring; bm25 by default

d393616

bug fixes, unit test fixes

55e63fd

cleanup/refactor

04e1e72

bug fixes

ab58975

fix scatter-gather path

dbed957

bug fixes after merge conflict resolution

52e318d

score explanation

36db386

default similarity config for an index

e83cca0

Thejas-bhat force-pushed the bm25-refactor branch from f385ba6 to e83cca0 Compare January 6, 2025 07:16

cleanup

a643a3b

Thejas-bhat changed the title ~~WIP: BM25 scoring~~ MB-58901: Introduce support for BM25 scoring Jan 6, 2025

Thejas-bhat marked this pull request as ready for review January 6, 2025 11:44

Thejas-bhat requested review from abhinavdangeti and metonymic-smokey January 6, 2025 11:45

Thejas-bhat requested review from CascadingRadium and Likith101 January 6, 2025 11:45

abhinavdangeti added this to the v2.5.0 milestone Jan 6, 2025

Thejas-bhat and others added 2 commits January 7, 2025 17:52

keeping scoring as an index level config for consistency

b5a7c9b

Upgrade bleve_index_api, scorch_segment_api, zapx

7c4873c

Bump up zapx's v11, v12, v13, v14, v15 on account of interface change

12c2c72

abhinavdangeti requested changes Jan 8, 2025

mapping/index.go (outdated)

search/searcher/search_term.go (outdated)

search/util.go

search/util.go (outdated)

search/util.go (outdated)

Thejas-bhat added 3 commits January 9, 2025 12:23

code comments and handling edge case

ce537e6

unit tests fix

79bd0c1

cleanup?

8cdb525

abhinavdangeti requested changes Jan 9, 2025

search/scorer/scorer_term.go (outdated)

search/scorer/scorer_term.go

search/searcher/search_term.go (outdated)

search/searcher/search_term.go (outdated)

code comment, exposing the multipliers to be made configurable

d478f4f

metonymic-smokey reviewed Jan 10, 2025

search/scorer/scorer_term.go

metonymic-smokey reviewed Jan 10, 2025

search/searcher/search_term.go

metonymic-smokey reviewed Jan 10, 2025

search/util.go

Thejas-bhat added 2 commits January 13, 2025 10:20

update score explanation, code cleanup

eaca63a

update links

fbd4ed8

abhinavdangeti reviewed Jan 13, 2025
