Add BM25 and TFIDF Scoring to the text index #1688
base: master
Conversation
…adapted unit tests. Missing e2e tests.
Commit doesn't contain all changes necessary for the pull request yet.
…x. This is done by passing the words and docs files as strings, and then building the text index as normal. A basic test exists (TODO: add more edge-case tests) and e2e testing is fixed.
…re still unstable because of the way nofContexts are counted. Implemented new, more refined tests.
…o the wordsFileContent and docsFileContent strings. Now you can clearly see which lines are added, and writing tests is cleaner.
…in the wordsFileContent and docsFileContent as pair contentsOfWordsFileAndDocsFile
…est the scores. Problems that still need to be addressed: no compression for float scores, no way to set parameters for BM25, unclear variable names (a code-wide problem in the text index classes), and the hard-to-understand calculation of the BM25 scores, since some words are not in the wordsFile.tsv but end up in the vocabulary through being added from literals. Since they appear in the docsFile.tsv and in the vocabulary, scores are calculated for some word-document combinations that can't be retrieved, which slightly changes the scoring. Another feature to implement is testing the BM25 scoring against fixed scoring tables and, from that, calculating useful parameters for datasets.
… counting how often a word occurs, TF-IDF, and BM25. They can be chosen when building the index. Further improvements still needed: compression of the scores (right now, regardless of the scoring method, the scores are saved as uncompressed floats), and the e2e tests don't check scores yet.
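The three scoring variants the commit describes (plain occurrence counting, TF-IDF, BM25) can be sketched as below. This is an illustrative stand-in, not the PR's actual code: the function names, the `ScoringParams` struct, and the default `k1`/`b` values are assumptions (the PR itself notes the BM25 parameters are not yet configurable).

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical BM25 free parameters; common defaults, assumed here.
struct ScoringParams {
  double k1 = 1.2;
  double b = 0.75;
};

// Classic TF-IDF: term frequency times inverse document frequency.
inline double tfIdf(double termFreq, size_t numDocs, size_t docsWithWord) {
  return termFreq * std::log(static_cast<double>(numDocs) /
                             static_cast<double>(docsWithWord));
}

// Okapi BM25 with the usual +0.5 smoothing in the IDF component.
inline double bm25(double termFreq, size_t numDocs, size_t docsWithWord,
                   double docLength, double avgDocLength,
                   const ScoringParams& p = {}) {
  double idf = std::log((static_cast<double>(numDocs) -
                         static_cast<double>(docsWithWord) + 0.5) /
                            (static_cast<double>(docsWithWord) + 0.5) +
                        1.0);
  double norm =
      termFreq + p.k1 * (1.0 - p.b + p.b * docLength / avgDocLength);
  return idf * (termFreq * (p.k1 + 1.0)) / norm;
}
```

Note how BM25 saturates in the term frequency while TF-IDF grows linearly; this is the main behavioral difference a score-checking e2e test would observe.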
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

@@            Coverage Diff             @@
##           master    #1688      +/-   ##
==========================================
- Coverage   89.86%   89.78%    -0.08%
==========================================
  Files         389      392        +3
  Lines       37308    37490      +182
  Branches     4204     4228       +24
==========================================
+ Hits        33527    33662      +135
- Misses       2485     2518       +33
- Partials     1296     1310       +14
A first round of reviews, mostly on the structure of the code (code duplication, what should go where).
Thoroughly read all my comments. In particular, I suggest several refactorings that can be part of smaller PRs that prepare this PR, to make the testing and reviewing simpler and more precise.
src/index/IndexImpl.Text.cpp
Outdated
for (auto word : absl::StrSplit(lineView, LiteralsTokenizationDelimiter{},
                                absl::SkipEmpty{})) {
  auto wordNormalized = localeManager.getLowercaseUtf8(word);
I think we also have this logic multiple times, so make this a function: for (const auto& normalizedWord : tokenizeAndNormalizeTextLine(lineView, localeManager)), and reuse it in the addWordsFromLiteral code.
A good way is to implement this and the docsfile parser in a separate PR that prepares this PR, as they can also be used in the old code.
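The suggested factoring could look roughly like the following. The helper name comes from the review comment above; the tokenization and normalization are simplified ASCII stand-ins for `LiteralsTokenizationDelimiter` and `LocaleManager::getLowercaseUtf8`, so this is a sketch of the shape, not QLever's real implementation.

```cpp
#include <cctype>
#include <string>
#include <string_view>
#include <vector>

// Simplified stand-in for the proposed tokenizeAndNormalizeTextLine helper:
// split on non-alphanumeric characters (in place of the real delimiter
// class) and lowercase ASCII (in place of UTF-8 aware normalization).
std::vector<std::string> tokenizeAndNormalizeTextLine(std::string_view line) {
  std::vector<std::string> result;
  std::string current;
  for (char c : line) {
    if (std::isalnum(static_cast<unsigned char>(c))) {
      current.push_back(
          static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
    } else if (!current.empty()) {
      // Delimiter hit: emit the word collected so far (empty tokens skipped,
      // mirroring absl::SkipEmpty).
      result.push_back(std::move(current));
      current.clear();
    }
  }
  if (!current.empty()) result.push_back(std::move(current));
  return result;
}
```

Both the wordsfile parser and the literal-handling code could then iterate over this one helper's result.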
For example, you can use absl::StrSplit(...) | ql::views::transform(normalize) as the implementation of this function.
The piping doesn't work for absl::StrSplit because (as far as I found) the Splitter element returned by the function can't be lazily transformed into a views/ranges object. My solution right now was to use the InputRangeMixin class from Iterators.h to make a class that does the iterating and normalizing lazily. This can also be seen in the separate PR (but it is also already merged into this PR).
vector<T> IndexImpl::readUncomprList(size_t nofElements, off_t from) const {
  LOG(DEBUG) << "Reading uncompressed list from disk...\n";
  LOG(TRACE) << "NofElements: " << nofElements << ", from: " << from;
  T* list = new T[nofElements];
  textIndexFile_.read(list, sizeof(T) * nofElements, from);
  vector<T> output(list, list + nofElements);
  delete[] list;
  return output;
}
Please:
- Check for code duplication (I have a feeling that readCompressedList can first call readUncompressedList and then do the compression).
- Don't use raw new and delete. Maybe read into a vector directly, or use a unique_ptr (which also supports arrays).
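A sketch of the reviewer's second suggestion: reading directly into a vector removes the raw new[]/delete[] pair and the extra copy. The file read is stubbed here with an in-memory buffer and an element offset, since textIndexFile_ isn't available outside QLever; the real code would pass output.data() to textIndexFile_.read.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical reworked readUncompressedList: the vector owns the buffer,
// so no manual memory management is needed.
template <typename T>
std::vector<T> readUncompressedList(const T* file, size_t nofElements,
                                    size_t fromElement) {
  std::vector<T> output(nofElements);
  // Real code would be:
  //   textIndexFile_.read(output.data(), sizeof(T) * nofElements, from);
  std::memcpy(output.data(), file + fromElement, sizeof(T) * nofElements);
  return output;
}
```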
The function became useless in the text-index-compression branch. Therefore it will be removed once the branch is merged into this branch.
… breaking changes. Also finished outsourcing the Scoring
…Also added the TextScoringEnum files to get the text scoring metric enum from a string and the other way around.
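A string/enum round trip like the one the TextScoringEnum files provide can be sketched as below. This is hypothetical: the actual enum name, its values, and the string spellings in the PR may differ.

```cpp
#include <stdexcept>
#include <string>
#include <string_view>

// Assumed enum of the available text scoring metrics.
enum class TextScoringMetric { EXPLICIT, TFIDF, BM25 };

// Enum -> string, e.g. for logging the configured metric.
std::string getTextScoringMetricAsString(TextScoringMetric metric) {
  switch (metric) {
    case TextScoringMetric::TFIDF:
      return "tf-idf";
    case TextScoringMetric::BM25:
      return "bm25";
    default:
      return "explicit";
  }
}

// String -> enum, e.g. for parsing an index-builder option.
TextScoringMetric getTextScoringMetricFromString(std::string_view name) {
  if (name == "tf-idf") return TextScoringMetric::TFIDF;
  if (name == "bm25") return TextScoringMetric::BM25;
  if (name == "explicit") return TextScoringMetric::EXPLICIT;
  throw std::runtime_error("Unknown text scoring metric: " +
                           std::string(name));
}
```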
Conformance check passed ✅ No test result changes.
Quality Gate passed.
While building the text index, one can define the scoring metric to use. During index building, the chosen metric determines how the score is calculated. During retrieval, the calculated scores are returned and can be used to rank documents containing the search words by relevance.