
Add BM25 and TFIDF Scoring to the text index #1688

Open
wants to merge 52 commits into base: master

Conversation

Flixtastic
Contributor

When building the text index, one can choose the scoring metric to use. During index building, the chosen metric then determines how the score is calculated. At retrieval time, the calculated scores are returned and can be used to rank documents containing the search words by relevance.
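For reference, a sketch of the classic Okapi BM25 formula that such a scoring mode typically implements. The function name, the parameter defaults k1 = 1.2 and b = 0.75, and the exact IDF smoothing are assumptions for illustration, not necessarily what this PR implements:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative BM25 term score for one (word, document) pair.
// NOTE: names and defaults are assumptions, not taken from the PR.
double bm25Score(double termFrequency, double docLength, double avgDocLength,
                 std::size_t numDocs, std::size_t docFrequency,
                 double k1 = 1.2, double b = 0.75) {
  // Smoothed inverse document frequency: rare words get a higher weight.
  double idf = std::log(
      (static_cast<double>(numDocs) - static_cast<double>(docFrequency) + 0.5) /
          (static_cast<double>(docFrequency) + 0.5) +
      1.0);
  // Saturating term-frequency component, normalized by document length.
  double norm = termFrequency * (k1 + 1.0) /
                (termFrequency + k1 * (1.0 - b + b * docLength / avgDocLength));
  return idf * norm;
}
```

The score grows with term frequency (with diminishing returns controlled by k1), rewards rarer words, and penalizes documents longer than average (controlled by b).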

Flixtastic and others added 30 commits July 12, 2024 03:12
Commit doesn't contain all changes necessary for the pull request yet.
…x. This is done by passing the words file and docs file as strings and then building the text index as normal. A basic test exists (TODO: add more edge-case tests) and e2e testing is fixed.
…re still unstable because of the way nofContexts is counted. Implemented new, more refined tests.
…o the wordsFileContent and docsFileContent strings. Now you can clearly see which lines are added, and writing tests is cleaner.
…in the wordsFileContent and docsFileContent as the pair contentsOfWordsFileAndDocsFile
Flixtastic and others added 5 commits December 8, 2024 20:56
…est the scores. Problems that still need to be addressed: no compression for float scores, no way to set the BM25 parameters, unclear variable names (a code-wide problem in the text index classes), and the hard-to-follow calculation of the BM25 scores, since some words are not in wordsFile.tsv but end up in the vocabulary through being added from literals. Because they appear in docsFile.tsv and in the vocabulary, scores are calculated for some combinations that can't be retrieved, which slightly changes the scoring. Another feature to implement is testing the BM25 scoring against reference scoring tables and deriving useful parameters for datasets from that.
… counting how often a word occurs, TF-IDF, and BM25. They can be selected when building the index. Further improvements still need to be made: compression of the scores, since right now, regardless of which scoring method is used, the scores are saved as uncompressed floats. Also, e2e tests don't check scores yet.
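For orientation, a minimal sketch of the plain TF-IDF variant mentioned above. The function name and the log-based IDF form are illustrative assumptions; the PR's exact formula isn't shown in this thread:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Illustrative TF-IDF score for one (word, document) pair:
// term frequency weighted by how rare the word is across all documents.
// NOTE: name and formula are assumptions, not taken from the PR.
double tfidfScore(double termFrequency, std::size_t numDocs,
                  std::size_t docFrequency) {
  return termFrequency *
         std::log(static_cast<double>(numDocs) /
                  static_cast<double>(docFrequency));
}
```

A word occurring in every document gets IDF log(1) = 0 and thus contributes nothing, while rare words are weighted up.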

codecov bot commented Dec 17, 2024

Codecov Report

Attention: Patch coverage is 82.37410% with 49 lines in your changes missing coverage. Please review.

Project coverage is 89.78%. Comparing base (acb6633) to head (c1d763d).

Files with missing lines Patch % Lines
src/index/TextScoring.cpp 78.30% 19 Missing and 4 partials ⚠️
src/index/IndexImpl.Text.cpp 81.81% 12 Missing and 4 partials ⚠️
src/index/Index.cpp 62.50% 3 Missing ⚠️
src/index/TextScoring.h 66.66% 2 Missing and 1 partial ⚠️
src/parser/WordsAndDocsFileParser.cpp 94.00% 2 Missing and 1 partial ⚠️
src/index/IndexImpl.h 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1688      +/-   ##
==========================================
- Coverage   89.86%   89.78%   -0.08%     
==========================================
  Files         389      392       +3     
  Lines       37308    37490     +182     
  Branches     4204     4228      +24     
==========================================
+ Hits        33527    33662     +135     
- Misses       2485     2518      +33     
- Partials     1296     1310      +14     


Member

@joka921 joka921 left a comment


A first round of reviews, mostly on the structure of the code (code duplication, what should go where).
Please read all my comments thoroughly. In particular, I suggest several refactorings that can go into smaller PRs that prepare this one, to make the testing and reviewing simpler and more precise.

src/index/IndexImpl.h Outdated Show resolved Hide resolved
src/index/IndexImpl.h Outdated Show resolved Hide resolved
src/index/IndexImpl.Text.cpp Outdated Show resolved Hide resolved
Comment on lines 165 to 167
for (auto word : absl::StrSplit(lineView, LiteralsTokenizationDelimiter{},
absl::SkipEmpty{})) {
auto wordNormalized = localeManager.getLowercaseUtf8(word);
Member


I think we have this logic multiple times as well; make it a function: for (const auto& normalizedWord : tokenizeAndNormalizeTextLine(lineView, localeManager)).

And reuse it in the addWordsFromLiteral code.
A good approach is to implement this and the docsfile parser in a separate PR that prepares this PR, as they can also be used in the old code.
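A minimal, std-only sketch of the interface the suggested helper could have. The real code would split with absl::StrSplit and normalize via the LocaleManager's getLowercaseUtf8; this version splits on non-alphanumeric bytes and lowercases ASCII only, so it is purely illustrative:

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <string_view>
#include <vector>

// Illustrative stand-in for the proposed tokenizeAndNormalizeTextLine helper.
// ASSUMPTION: splitting on non-alphanumeric bytes and ASCII lowercasing
// approximate the delimiter-based split and UTF-8 normalization of the PR.
std::vector<std::string> tokenizeAndNormalizeTextLine(std::string_view line) {
  std::vector<std::string> result;
  std::string current;
  for (unsigned char c : line) {
    if (std::isalnum(c)) {
      // Build the current token in normalized (lowercased) form.
      current.push_back(static_cast<char>(std::tolower(c)));
    } else if (!current.empty()) {
      // Delimiter reached: emit the token (empty tokens are skipped).
      result.push_back(std::move(current));
      current.clear();
    }
  }
  if (!current.empty()) result.push_back(std::move(current));
  return result;
}
```

Factoring this out gives one place to keep tokenization and normalization consistent between the words-file parser and addWordsFromLiteral.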

Member


You can use for example absl::StrSplit(....) | ql::views::transform(normalize) as the implementation of this function.

Contributor Author


The piping doesn't work for absl::StrSplit because (as far as I found) the Splitter object returned by the function can't be lazily transformed into a views or ranges object. My solution right now is to use the InputRangeMixin class from Iterators.h to build a class that does the iterating and normalizing lazily. This can also be seen in the separate PR (which is also already merged into this PR).

src/index/IndexImpl.Text.cpp Outdated Show resolved Hide resolved
src/index/IndexImpl.Text.cpp Show resolved Hide resolved
Comment on lines +1107 to +1115
vector<T> IndexImpl::readUncomprList(size_t nofElements, off_t from) const {
LOG(DEBUG) << "Reading uncompressed list from disk...\n";
LOG(TRACE) << "NofElements: " << nofElements << ", from: " << from;
T* list = new T[nofElements];
textIndexFile_.read(list, sizeof(T) * nofElements, from);
vector<T> output(list, list + nofElements);
delete[] list;
return output;
}
Member


Please

  1. Check for code duplication (I have a feeling that readCompressedList can first call readUncompressedList and then do the decompression).
  2. Don't use raw new and delete. Maybe read into a vector directly, or use a unique_ptr (which also supports arrays).
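A sketch of suggestion 2: reading directly into a std::vector so no manual new/delete is needed. The stream-based signature is an assumption for a self-contained example; the PR reads through the textIndexFile_ member instead:

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <sstream>
#include <vector>

// Illustrative rewrite of readUncomprList without raw new/delete:
// the vector owns the buffer, so there is nothing to free manually
// and no leak if read() throws.
// ASSUMPTION: std::istream stands in for the PR's file wrapper.
template <typename T>
std::vector<T> readUncompressedList(std::istream& file, std::size_t nofElements,
                                    std::streamoff from) {
  std::vector<T> output(nofElements);
  file.seekg(from);
  file.read(reinterpret_cast<char*>(output.data()),
            static_cast<std::streamsize>(sizeof(T) * nofElements));
  return output;
}
```

This also removes the extra copy in the original code, which first read into a heap array and then copied it into the returned vector.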

Contributor Author


The function became useless in the text-index-compression branch. Therefore it will be removed once the branch is merged into this branch.

src/index/TextMetaData.h Outdated Show resolved Hide resolved
src/index/TextMetaData.h Outdated Show resolved Hide resolved
src/index/Vocabulary.cpp Outdated Show resolved Hide resolved
@sparql-conformance
sonarqubecloud bot commented Jan 9, 2025
