
Add BM25 and TFIDF Scoring to the text index #1688

Open
wants to merge 52 commits into base: master

Conversation

Flixtastic
Contributor

When building the text index, one can choose the scoring metric to use. During index building, the chosen metric then determines how the score is calculated. At retrieval time, the calculated scores are returned and can be used to rank documents containing the search words by relevance.
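For reference, a sketch of the classic Okapi BM25 formula that such a scoring mode typically implements. The function name, the parameter defaults k1 = 1.2 and b = 0.75, and the exact IDF smoothing are assumptions for illustration, not necessarily what this PR implements:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative BM25 term score for one (word, document) pair.
// NOTE: names and defaults are assumptions, not taken from the PR.
double bm25Score(double termFrequency, double docLength, double avgDocLength,
                 std::size_t numDocs, std::size_t docFrequency,
                 double k1 = 1.2, double b = 0.75) {
  // Smoothed inverse document frequency: rare words get a higher weight.
  double idf = std::log(
      (static_cast<double>(numDocs) - static_cast<double>(docFrequency) + 0.5) /
          (static_cast<double>(docFrequency) + 0.5) +
      1.0);
  // Saturating term-frequency component, normalized by document length.
  double norm = termFrequency * (k1 + 1.0) /
                (termFrequency + k1 * (1.0 - b + b * docLength / avgDocLength));
  return idf * norm;
}
```

The score grows with term frequency (with diminishing returns controlled by k1), rewards rarer words, and penalizes documents longer than average (controlled by b).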

Flixtastic and others added 30 commits July 12, 2024 03:12
Commit doesn't contain all changes necessary for the pull request yet.
…x. This is done by passing the words file and docs file as strings and then building the text index as normal. A basic test exists (TODO: add more edge-case tests) and e2e testing is fixed.
…re still unstable because of the way nofContexts is counted. Implemented new, more refined tests.
…o the wordsFileContent and docsFileContent strings. Now you can clearly see which lines are added, and writing tests is cleaner.
…in the wordsFileContent and docsFileContent as the pair contentsOfWordsFileAndDocsFile
Flixtastic and others added 5 commits December 8, 2024 20:56
…est the scores. Problems that still need to be addressed: no compression for float scores, no way to set the BM25 parameters, unclear variable names (a code-wide problem in the text index classes), and the hard-to-follow calculation of the BM25 scores, since some words are not in wordsFile.tsv but end up in the vocabulary through being added from literals. Because they appear in docsFile.tsv and in the vocabulary, scores are calculated for some combinations that can't be retrieved, which slightly changes the scoring. Another feature to implement is testing the BM25 scoring against reference scoring tables and deriving useful parameters for datasets from that.
… counting how often a word occurs, TF-IDF, and BM25. They can be selected when building the index. Further improvements still need to be made: compression of the scores, since right now, regardless of which scoring method is used, the scores are saved as uncompressed floats. Also, e2e tests don't check scores yet.
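For orientation, a minimal sketch of the plain TF-IDF variant mentioned above. The function name and the log-based IDF form are illustrative assumptions; the PR's exact formula isn't shown in this thread:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative TF-IDF score for one (word, document) pair:
// term frequency weighted by how rare the word is across all documents.
// NOTE: name and formula are assumptions, not taken from the PR.
double tfidfScore(double termFrequency, std::size_t numDocs,
                  std::size_t docFrequency) {
  return termFrequency *
         std::log(static_cast<double>(numDocs) /
                  static_cast<double>(docFrequency));
}
```

A word occurring in every document gets IDF log(1) = 0 and thus contributes nothing, while rare words are weighted up.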

codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 82.37410% with 49 lines in your changes missing coverage. Please review.

Project coverage is 89.78%. Comparing base (acb6633) to head (c1d763d).

Files with missing lines Patch % Lines
src/index/TextScoring.cpp 78.30% 19 Missing and 4 partials ⚠️
src/index/IndexImpl.Text.cpp 81.81% 12 Missing and 4 partials ⚠️
src/index/Index.cpp 62.50% 3 Missing ⚠️
src/index/TextScoring.h 66.66% 2 Missing and 1 partial ⚠️
src/parser/WordsAndDocsFileParser.cpp 94.00% 2 Missing and 1 partial ⚠️
src/index/IndexImpl.h 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1688      +/-   ##
==========================================
- Coverage   89.86%   89.78%   -0.08%     
==========================================
  Files         389      392       +3     
  Lines       37308    37490     +182     
  Branches     4204     4228      +24     
==========================================
+ Hits        33527    33662     +135     
- Misses       2485     2518      +33     
- Partials     1296     1310      +14     


Member

@joka921 joka921 left a comment


A first round of reviews, mostly on the structure of the code (code duplication, what should go where).
Please read all my comments thoroughly. In particular, I suggest several refactorings that can go into smaller PRs that prepare this one, to make the testing and reviewing simpler and more precise.

src/index/IndexImpl.h Outdated Show resolved Hide resolved
src/index/IndexImpl.h Outdated Show resolved Hide resolved
src/index/IndexImpl.Text.cpp Outdated Show resolved Hide resolved
Comment on lines 165 to 167
for (auto word : absl::StrSplit(lineView, LiteralsTokenizationDelimiter{},
absl::SkipEmpty{})) {
auto wordNormalized = localeManager.getLowercaseUtf8(word);
Member


I think we have this logic multiple times as well; make it a function: for (const auto& normalizedWord : tokenizeAndNormalizeTextLine(lineView, localeManager)).

And reuse it in the addWordsFromLiteral code.
A good approach is to implement this and the docsfile parser in a separate PR that prepares this PR, as they can also be used in the old code.
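A minimal, std-only sketch of the interface the suggested helper could have. The real code would split with absl::StrSplit and normalize via the LocaleManager's getLowercaseUtf8; this version splits on non-alphanumeric bytes and lowercases ASCII only, so it is purely illustrative:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>
#include <vector>

// Illustrative stand-in for the proposed tokenizeAndNormalizeTextLine helper.
// ASSUMPTION: splitting on non-alphanumeric bytes and ASCII lowercasing
// approximate the delimiter-based split and UTF-8 normalization of the PR.
std::vector<std::string> tokenizeAndNormalizeTextLine(std::string_view line) {
  std::vector<std::string> result;
  std::string current;
  for (unsigned char c : line) {
    if (std::isalnum(c)) {
      // Build the current token in normalized (lowercased) form.
      current.push_back(static_cast<char>(std::tolower(c)));
    } else if (!current.empty()) {
      // Delimiter reached: emit the token (empty tokens are skipped).
      result.push_back(std::move(current));
      current.clear();
    }
  }
  if (!current.empty()) result.push_back(std::move(current));
  return result;
}
```

Factoring this out gives one place to keep tokenization and normalization consistent between the words-file parser and addWordsFromLiteral.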

Member


You can use for example absl::StrSplit(....) | ql::views::transform(normalize) as the implementation of this function.

Contributor Author


The piping doesn't work for absl::StrSplit because (as far as I found) the Splitter object returned by the function can't be lazily transformed into a views or ranges object. My solution right now is to use the InputRangeMixin class from Iterators.h to build a class that does the iterating and normalizing lazily. This can also be seen in the separate PR (which is also already merged into this PR).

src/index/IndexImpl.Text.cpp Outdated Show resolved Hide resolved
src/index/IndexImpl.Text.cpp Show resolved Hide resolved
Comment on lines +1107 to +1115
vector<T> IndexImpl::readUncomprList(size_t nofElements, off_t from) const {
LOG(DEBUG) << "Reading uncompressed list from disk...\n";
LOG(TRACE) << "NofElements: " << nofElements << ", from: " << from;
T* list = new T[nofElements];
textIndexFile_.read(list, sizeof(T) * nofElements, from);
vector<T> output(list, list + nofElements);
delete[] list;
return output;
}
Member


Please

  1. Check for code duplication (I have a feeling that readCompressedList can first call readUncompressedList and then do the decompression).
  2. Don't use raw new and delete. Maybe read into a vector directly, or use a unique_ptr (which also supports arrays).
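A sketch of suggestion 2: reading directly into a std::vector so no manual new/delete is needed. The stream-based signature is an assumption for a self-contained example; the PR reads through the textIndexFile_ member instead:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Illustrative rewrite of readUncomprList without raw new/delete:
// the vector owns the buffer, so there is nothing to free manually
// and no leak if read() throws.
// ASSUMPTION: std::istream stands in for the PR's file wrapper.
template <typename T>
std::vector<T> readUncompressedList(std::istream& file, std::size_t nofElements,
                                    std::streamoff from) {
  std::vector<T> output(nofElements);
  file.seekg(from);
  file.read(reinterpret_cast<char*>(output.data()),
            static_cast<std::streamsize>(sizeof(T) * nofElements));
  return output;
}
```

This also removes the extra copy in the original code, which first read into a heap array and then copied it into the returned vector.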

Contributor Author


The function became useless in the text-index-compression branch. Therefore it will be removed once the branch is merged into this branch.

src/index/TextMetaData.h Outdated Show resolved Hide resolved
src/index/TextMetaData.h Outdated Show resolved Hide resolved
src/index/Vocabulary.cpp Outdated Show resolved Hide resolved
@sparql-conformance
sonarqubecloud bot commented Jan 9, 2025
