Add BM25 and TFIDF Scoring to the text index #1688

Open: wants to merge 52 commits into base: master
52 commits
ea9d39c
ql:contains-word now can show the respective word-score.
Flixtastic Jul 12, 2024
30736ef
Fixed tests and formatted files.
Flixtastic Jul 12, 2024
e752db8
New formatting for Word Score Variables. Changed where necessary and …
Flixtastic Jul 27, 2024
4ef4d93
Merge branch 'ad-freiburg:master' into master
Flixtastic Jul 27, 2024
d52063f
Merge branch 'ad-freiburg:master' into master
Flixtastic Jul 29, 2024
c6fe0c6
Merge branch 'master' of github.com:Flixtastic/qlever.
Flixtastic Jul 29, 2024
d0b9ee8
Added getWordSCoreVariable for std::string_view
Flixtastic Jul 29, 2024
2eade97
Merge branch 'ad-freiburg:master' into master
Flixtastic Sep 23, 2024
595cb57
Merge branch 'ad-freiburg:master' into master
Flixtastic Oct 4, 2024
b4c8c3b
Merge branch 'ad-freiburg:master' into master
Flixtastic Oct 26, 2024
72e5d64
Merge branch 'ad-freiburg:master' into master
Flixtastic Nov 12, 2024
d8f9df4
Merge branch 'ad-freiburg:master' into master
Flixtastic Nov 15, 2024
29511c6
Made it possible to construct query execution contexts with text inde…
Flixtastic Nov 15, 2024
3855978
Merge branch 'ad-freiburg:master' into master
Flixtastic Nov 17, 2024
6021401
Reduced usage of column copying in TextIndexScanForWord.cpp
Flixtastic Nov 17, 2024
d9701ae
Merge branch 'ad-freiburg:master' into master
Flixtastic Nov 19, 2024
5f0ce01
Merge branch 'ad-freiburg:master' into master
Flixtastic Dec 3, 2024
e2c47cf
Merge branch 'ad-freiburg:master' into master
Flixtastic Dec 3, 2024
e6a0cf7
Merge branch 'ad-freiburg:master' into master
Flixtastic Dec 4, 2024
ed9fbda
Changed the counting of nofNonLiterals to nofLiterals. Some methods a…
Flixtastic Dec 4, 2024
5ad3d8f
Merge branch 'ad-freiburg:master' into master
Flixtastic Dec 4, 2024
af6bd64
Merge branch 'ad-freiburg:master' into master
Flixtastic Dec 5, 2024
56ea531
Cleaned up the filtering in TextIndexScanForWord::computeResult and c…
Flixtastic Dec 5, 2024
e1e12e9
renamed nofLiterals to nofLiteralsInTextIndex
Flixtastic Dec 5, 2024
017588c
Removed redundant method getWordScoreVariable
Flixtastic Dec 5, 2024
46666d0
added method appendEscapedWord to escape special chars in Variables
Flixtastic Dec 5, 2024
f36f189
Added two function in the TextIndexScanTestHelpers.h to add content t…
Flixtastic Dec 5, 2024
c62a7e6
Added tests for Scores. Also commented tests and refined them
Flixtastic Dec 5, 2024
89f0b27
Changed the getQec function and the respective makeTestIndex to take …
Flixtastic Dec 5, 2024
058e8ed
Merge branch 'ad-freiburg:master' into master
Flixtastic Dec 6, 2024
68fe453
Implemented BM25 with set parameters. Also implemented functions to t…
Flixtastic Dec 8, 2024
978d817
Added the possibilty of choosing between 3 Scoring Metrics. Those are…
Flixtastic Dec 9, 2024
556fed4
Changed the way of keeping track of the scores to a nested hashmap.
Flixtastic Dec 10, 2024
d06231b
Merge branch 'bm25-branch'
Flixtastic Dec 16, 2024
38425dc
Merge branch 'ad-freiburg:master' into master
Flixtastic Dec 17, 2024
3b83918
Merge branch 'ad-freiburg:master' into master
Flixtastic Dec 25, 2024
b93fde4
Extra classes for Words- and Docsfile parsing
Flixtastic Dec 28, 2024
9c40084
Added method to tokenize and normalize at the same time.
Flixtastic Dec 28, 2024
a4e9509
Merge branch 'words-and-docs-file-parsing'
Flixtastic Dec 28, 2024
c365935
Added the tokenization to the ql_utility namespace
Flixtastic Dec 28, 2024
dfff837
Merge branch 'words-and-docs-file-parsing'
Flixtastic Dec 28, 2024
92a4874
Revert "Merge branch 'words-and-docs-file-parsing'"
Flixtastic Dec 28, 2024
0f8e65d
Used the custom InputRangeMixin to lazily tokenize and normalize word…
Flixtastic Jan 2, 2025
320455c
Small formatting
Flixtastic Jan 2, 2025
d25b107
Seperate files for the scoring of texts
Flixtastic Jan 2, 2025
12cc197
Now saving the ScoringMetric in the Settings and reverted other index…
Flixtastic Jan 2, 2025
ce79911
Further refined the text scoring methods to remove code duplication. …
Flixtastic Jan 2, 2025
b164d55
Merge branch 'ad-freiburg:master' into master
Flixtastic Jan 4, 2025
3d02d84
Removed unnecessary function getRawId
Flixtastic Jan 2, 2025
dad2d35
Merge branch 'ad-freiburg:master' into master
Flixtastic Jan 6, 2025
2f8ed2d
Merge branch 'ad-freiburg:master' into master
Flixtastic Jan 8, 2025
c1d763d
Merge branch 'ad-freiburg:master' into master
Flixtastic Jan 9, 2025
2 changes: 1 addition & 1 deletion src/global/Id.h
@@ -12,7 +12,7 @@
#include "util/Exception.h"

using Id = ValueId;
typedef uint16_t Score;
using Score = float;

// TODO<joka921> Make the following ID and index types strong.
using ColumnIndex = uint64_t;
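
The change of `Score` from `uint16_t` to `float` is what allows fractional TF-IDF and BM25 values to be stored at all. A small sketch of the difference, where the aliases `OldScore` and `NewScore` and the helper `toOldScore` are names made up for this illustration and do not appear in the PR:

```cpp
#include <cstdint>

// Hypothetical aliases mirroring the old and new definitions of `Score`.
using OldScore = std::uint16_t;
using NewScore = float;

// Converting a non-negative fractional score to the old integer type
// truncates towards zero, so the fractional part is lost.
inline OldScore toOldScore(NewScore s) { return static_cast<OldScore>(s); }
```

With the old type, scores such as 1.2 and 1.9 would both be stored as 1, so rankings based on them could not be distinguished.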
1 change: 1 addition & 0 deletions src/global/IndexTypes.h
@@ -16,3 +16,4 @@ using LocalVocabIndex = const LocalVocabEntry*;
using TextRecordIndex = ad_utility::TypedIndex<uint64_t, "TextRecordIndex">;
using WordVocabIndex = ad_utility::TypedIndex<uint64_t, "WordVocabIndex">;
using BlankNodeIndex = ad_utility::TypedIndex<uint64_t, "BlankNodeIndex">;
using DocumentIndex = ad_utility::TypedIndex<uint64_t, "DocumentIndex">;
2 changes: 1 addition & 1 deletion src/index/CMakeLists.txt
@@ -6,5 +6,5 @@ add_library(index
DocsDB.cpp FTSAlgorithms.cpp
PrefixHeuristic.cpp CompressedRelation.cpp
PatternCreator.cpp ScanSpecification.cpp
DeltaTriples.cpp LocalVocabEntry.cpp)
DeltaTriples.cpp LocalVocabEntry.cpp TextScoring.cpp TextScoringEnum.cpp)
qlever_target_link_libraries(index util parser vocabulary ${STXXL_LIBRARIES})
17 changes: 14 additions & 3 deletions src/index/Index.cpp
@@ -25,9 +25,10 @@
}

// ____________________________________________________________________________
void Index::addTextFromContextFile(const std::string& contextFile,
bool addWordsFromLiterals) {
pimpl_->addTextFromContextFile(contextFile, addWordsFromLiterals);
void Index::buildTextIndexFile(
const std::pair<std::string, std::string>& wordsAndDocsFile,
bool addWordsFromLiterals) {
pimpl_->buildTextIndexFile(wordsAndDocsFile, addWordsFromLiterals);
}

// ____________________________________________________________________________
@@ -215,6 +216,16 @@
return pimpl_->setNumTriplesPerBatch(numTriplesPerBatch);
}

// ____________________________________________________________________________
void Index::setScoringMetricsUsedInSettings(TextScoringMetric scoringMetric) {
return pimpl_->setScoringMetricsUsedInSettings(scoringMetric);
}

// ____________________________________________________________________________
void Index::setBM25ParmetersUsedInSettings(float b, float k) {
return pimpl_->setBM25ParametersInSettings(b, k);
}

[Codecov / codecov/patch] Check warning on line 227 in src/index/Index.cpp: added lines #L225 - L227 were not covered by tests.

// ____________________________________________________________________________
const std::string& Index::getTextName() const { return pimpl_->getTextName(); }

13 changes: 10 additions & 3 deletions src/index/Index.h
@@ -13,6 +13,7 @@
#include "index/InputFileSpecification.h"
#include "index/Permutation.h"
#include "index/StringSortComparator.h"
#include "index/TextScoringEnum.h"
#include "index/Vocabulary.h"
#include "parser/TripleComponent.h"
#include "util/CancellationHandle.h"
@@ -61,7 +62,8 @@ class Index {
// Stores the index of the entity of each result.
vector<Id> eids_;
// Stores for each result how often an entity
// appears in its associated TextRecord.
// appears in its associated TextRecord. [[OLD DEFINITION]]
// Now scores BM25 scores for all words that are in the vocabulary
vector<Score> scores_;
};

@@ -94,8 +96,9 @@ class Index {

// Add a text index to a complete KB index. First read the given context
// file (if file name not empty), then add words from literals (if true).
void addTextFromContextFile(const std::string& contextFile,
bool addWordsFromLiterals);
void buildTextIndexFile(
const std::pair<std::string, std::string>& wordsAndDocsFile,
bool addWordsFromLiterals);

// Build docsDB file from given file (one text record per line).
void buildDocsDB(const std::string& docsFile);
@@ -206,6 +209,10 @@ class Index {

void setNumTriplesPerBatch(uint64_t numTriplesPerBatch);

void setScoringMetricsUsedInSettings(TextScoringMetric scoringMetric);

void setBM25ParmetersUsedInSettings(float b, float k);

const std::string& getTextName() const;

const std::string& getKbName() const;
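
The header `index/TextScoringEnum.h` is not shown in this excerpt. As a rough sketch of how a string-to-metric mapping in the spirit of `getTextScoringMetricFromString` might look, assuming an enum with one value for each of the strings "count", "tf-idf" and "bm25" named in the index builder's help text. The enum `ScoringMetricSketch`, its enumerators, the function `parseScoringMetric`, and the decision to throw on unknown input are all hypothetical:

```cpp
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical enum; the real one is declared in index/TextScoringEnum.h.
enum class ScoringMetricSketch { Count, TfIdf, Bm25 };

// Map the command-line strings "count", "tf-idf" and "bm25" to the enum.
// Throwing on unknown input is an assumption of this sketch; the PR may
// instead fall back to a default.
inline ScoringMetricSketch parseScoringMetric(std::string_view s) {
  if (s == "count") return ScoringMetricSketch::Count;
  if (s == "tf-idf") return ScoringMetricSketch::TfIdf;
  if (s == "bm25") return ScoringMetricSketch::Bm25;
  throw std::invalid_argument("Unknown scoring metric: " + std::string(s));
}
```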
30 changes: 28 additions & 2 deletions src/index/IndexBuilderMain.cpp
@@ -155,6 +155,7 @@ int main(int argc, char** argv) {
string textIndexName;
string kbIndexName;
string settingsFile;
string scoringMetric = "count";
std::vector<string> filetype;
std::vector<string> inputFile;
std::vector<string> defaultGraphs;
@@ -164,6 +165,8 @@ int main(int argc, char** argv) {
bool keepTemporaryFiles = false;
bool onlyPsoAndPos = false;
bool addWordsFromLiterals = false;
float bScoringParam = 0.75;
float kScoringParam = 1.75;
std::optional<ad_utility::MemorySize> stxxlMemory;
std::optional<ad_utility::MemorySize> parserBufferSize;
optind = 1;
@@ -214,6 +217,16 @@ int main(int argc, char** argv) {
add("add-text-index,A", po::bool_switch(&onlyAddTextIndex),
"Only build the text index. Assumes that a knowledge graph index with "
"the same `index-basename` already exists.");
add("set-bm25-b-param", po::value(&bScoringParam),
"Sets the b param in the BM25 scoring metric. This has to be between "
"(including) 0 and 1. The default is 0.75.");
add("set-bm25-k-param", po::value(&kScoringParam),
"Sets the k param in the BM25 scoring metric. This has to be greater "
"than or equal to 0. The default is 1.75.");
add("set-scoring-metric,S", po::value(&scoringMetric),
"Sets the scoring metric used. Options are \"count\" for count, "
"\"tf-idf\" for tf idf "
"and \"bm25\" for bm25. The default is count.");

// Options for the knowledge graph index.
add("settings-file,s", po::value(&settingsFile),
@@ -245,6 +258,14 @@ int main(int argc, char** argv) {
return EXIT_SUCCESS;
}
po::notify(optionsMap);
if (kScoringParam < 0) {
throw std::invalid_argument("The value of bm25-k must be >= 0");
}
if (bScoringParam < 0 || bScoringParam > 1) {
throw std::invalid_argument(
"The value of bm25-b must be between and "
"including 0 and 1");
}
} catch (const std::exception& e) {
std::cerr << "Error in command-line argument: " << e.what() << '\n';
std::cerr << boostOptions << '\n';
@@ -330,8 +351,13 @@ int main(int argc, char** argv) {
index.createFromFiles(fileSpecifications);
}

if (!wordsfile.empty() || addWordsFromLiterals) {
index.addTextFromContextFile(wordsfile, addWordsFromLiterals);
if ((!wordsfile.empty() && !docsfile.empty()) || addWordsFromLiterals) {
index.setScoringMetricsUsedInSettings(
getTextScoringMetricFromString(scoringMetric));
index.setBM25ParmetersUsedInSettings(bScoringParam, kScoringParam);
index.buildTextIndexFile(
std::pair<std::string, std::string>{wordsfile, docsfile},
addWordsFromLiterals);
}

if (!docsfile.empty()) {
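
For orientation, the two non-trivial metrics selectable via `set-scoring-metric` can be sketched as below. This is a minimal illustration of the textbook TF-IDF and Okapi BM25 formulas, not the code in `TextScoring.cpp`, which is not part of this excerpt. The function names `idf`, `tfIdf` and `bm25`, the base-2 logarithm, and all parameter names are assumptions made for the example; only the ranges of `b` and `k` and their defaults (0.75 and 1.75) are taken from the option descriptions above.

```cpp
#include <cmath>
#include <cstddef>

// Hypothetical helper: inverse document frequency of a term that occurs in
// `docFreq` of `numDocs` documents (docFreq > 0). The base-2 logarithm is an
// assumption of this sketch.
inline float idf(std::size_t numDocs, std::size_t docFreq) {
  return std::log2(static_cast<float>(numDocs) / static_cast<float>(docFreq));
}

// Textbook TF-IDF: term frequency within the document times the IDF.
inline float tfIdf(float termFreq, std::size_t numDocs, std::size_t docFreq) {
  return termFreq * idf(numDocs, docFreq);
}

// Textbook Okapi BM25 for a single term, assuming termFreq > 0.
// `b` in [0, 1] controls document-length normalization (0 disables it),
// `k` >= 0 controls how quickly repeated occurrences saturate.
inline float bm25(float termFreq, std::size_t numDocs, std::size_t docFreq,
                  float docLength, float avgDocLength, float b = 0.75f,
                  float k = 1.75f) {
  float lengthNorm = 1.0f - b + b * (docLength / avgDocLength);
  return idf(numDocs, docFreq) * termFreq * (k + 1.0f) /
         (termFreq + k * lengthNorm);
}
```

In this sketch, setting `k` to 0 reduces BM25 to the plain IDF of the term, and setting `b` to 0 makes the score independent of document length, which is one way to read the range checks on the two parameters.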