-
Notifications
You must be signed in to change notification settings - Fork 687
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
MB-58901: Introduce support for BM25 scoring #2113
base: master
Are you sure you want to change the base?
Conversation
2b54a8d
to
738dfe1
Compare
8b10cdf
to
d58474f
Compare
4b626d0
to
45efde1
Compare
f385ba6
to
e83cca0
Compare
@Thejas-bhat it seems you'll need to push up the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Thejas-bhat Make sure to pull the latest from origin/bm25-refactor
before you push more commits here :)
@Thejas-bhat a thought regarding the "global scoring" code path for bm25 - what is the default behavior in elastic? Would you add a couple of GO benchmark test to differentiate between bm25 with and without global scoring and record these numbers within the commit message^ - trying to decide whether to enable "global scoring" by default. |
By default, elastic disables the feature. It'll be a bit difficult to benchmark this at golang unit level over here, because the latency is mainly visible when the index alias has multiple shards and each of which is spread across multiple nodes. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor refactor suggestion, looks good to me otherwise.
@@ -485,6 +485,8 @@ func (i *indexImpl) preSearch(ctx context.Context, req *SearchRequest, reader in | |||
} | |||
|
|||
var fts search.FieldTermSynonymMap | |||
var count uint64 | |||
fieldCardinality := make(map[string]int) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lets change this to ..
var fieldCardinality map[string]int
.. and make
only if isBM25Enabled(..) == true
in line 499 below.
@@ -578,6 +604,14 @@ func (i *indexImpl) SearchInContext(ctx context.Context, req *SearchRequest) (sr | |||
} | |||
skipSynonymCollector = true | |||
} | |||
skipKNNCollector = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this right? Something we missed?
// preSearchRequired checks if preSearch is required and returns a boolean flag | ||
// It only allocates the preSearchFlags struct if necessary | ||
func preSearchRequired(req *SearchRequest, m mapping.IndexMapping) (*preSearchFlags, error) { | ||
func isBM25Enabled(m mapping.IndexMapping) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's refactor this to scoringModel(..)
which returns the scoring model to use instead.
@@ -605,6 +639,21 @@ func (i *indexImpl) SearchInContext(ctx context.Context, req *SearchRequest) (sr | |||
ctx = context.WithValue(ctx, search.FieldTermSynonymMapKey, fts) | |||
} | |||
|
|||
scoringModelCallback := func() string { | |||
if isBM25Enabled(i.m) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related to other commend, let's update this method to return the scoring model to use.
To establish if it's bm25, you can add the extra check - scoringModel() == index.BM25Scoring
.
Introducing support for BM25 scoring
Key stats necessary for the scoring
Introduces a mechanism to maintain consistent scoring in a situation where the index is partitioned as a
bleve.IndexAlias
. This is achieved using the existing preSearch mechanism where the first phase of the entire search involves fetching the above mentioned stats, aggregating them and redistributing back to the bleve indexes which would use them while calculating the score for a hit.Implementation wise, the user needs to explicitly mention BM25 as the scoring mechanism at
indexMapping.ScoringModel
level to actually use this scoring mechanism. This parameter is a global setting, i.e. when performing a search on multiple fields, all the fields are scored with the same scoring model.The storage layer exposes an API which returns the number of terms in a field's term dictionary which is used to compute the
avgDocLength
. At the indexing layer, we check if the queried field supports BM25 scoring and if consistent scoring is availed. This is followed by fetching the stats either from the local bleve index or from a context (in the case where we're availing the consistent scoring) to compute the actual score.Note: The scoring is highly dependent on the size of an individual bleve index's termDictionary (specific to a field) so there can be some discrepancies especially given that each index is further composed of multiple 'segments'. However in large scale use cases these discrepancies can be quite small and don't affect the order of the doc hits - in which case the user may choose to avoid this altogether.