chore(embeddings): use framework embeddings, refactor ai providers #143
const wikipedia = new WikipediaTool({
  filters: { minPageNameSimilarity: 0.25, excludeOthersOnExactMatch: false },
  output: { maxSerializedLength: MAX_CONTENT_LENGTH_CHARS }
maxSerializedLength is not affecting the markdown output.
I removed the restrictions on max content length from markdown as well as the simplified extraction:
extraction: { fields: { markdown: {} } },
Previously, table extraction was disabled and the output was truncated to 25k characters because embeddings were slow. That issue has mostly been resolved, so we can include this data again.
This also brings our implementation closer to the framework defaults.
The maxSerializedLength property affects only the serialized output, which contains the markdown output, and it works correctly.
const instance = new WikipediaTool({
output: {
maxSerializedLength: 100,
},
});
const response = await instance.run({
query: "ice hockey",
});
expect(response.getTextContent()).toHaveLength(100);
But getTextContent aggregates text from all documents, which is not what we want here, because we add additional information about the source to each chunk.
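To illustrate the distinction, here is a minimal sketch of limiting and labelling text per document rather than on one aggregated string. The `SourceDocument` shape, the `toChunksWithSource` helper, and the `Source:` prefix are all hypothetical and are not taken from the framework or from this PR.

```typescript
// Hypothetical stand-in for a single search result; not the framework's type.
interface SourceDocument {
  title: string;
  markdown: string;
}

// Caps each document separately, then prefixes it with its source,
// so a limit applies per chunk rather than to one aggregated string.
function toChunksWithSource(docs: SourceDocument[], maxCharsPerDoc: number): string[] {
  return docs.map((doc) => `Source: ${doc.title}\n${doc.markdown.slice(0, maxCharsPerDoc)}`);
}

const chunks = toChunksWithSource(
  [
    { title: "Ice hockey", markdown: "Ice hockey is a team sport played on ice." },
    { title: "Hockey puck", markdown: "A hockey puck is a disk made of vulcanized rubber." },
  ],
  10,
);
```

With an aggregated string, a single length limit could consume the whole budget on the first document and drop the source labels of later ones.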
Just a few suggestions. Good work 👍🏻
query: input.question,
documents: output.results.flatMap((document, idx) =>
  Array.from(
    splitString(document.fields.markdown as string, {
This is weird, why isn't markdown a string already? If it can be something else, we must handle it or fail.
Also, is markdown a good input type for splitString? 🤔 Splitting tags might cause problems.
It is typed as unknown, but we already have this in the examples:
That might be just for simplicity. We need to make sure it is type-safe here, with a typeof check that throws otherwise as a guard if necessary.
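A minimal sketch of such a guard, assuming the field really is typed as unknown. The `assertString` helper and the `ResultLike` shape are hypothetical names introduced for illustration, not part of the framework or of this PR.

```typescript
// Hypothetical helper: narrows an `unknown` value to `string` or fails loudly.
function assertString(value: unknown, fieldName: string): string {
  if (typeof value !== "string") {
    throw new TypeError(`Expected "${fieldName}" to be a string, got ${typeof value}`);
  }
  return value;
}

// Hypothetical stand-in for a search result whose fields are typed as `unknown`.
interface ResultLike {
  fields: Record<string, unknown>;
}

const result: ResultLike = {
  fields: { markdown: "# Ice hockey\n\nIce hockey is a team sport." },
};

// The value is narrowed by a runtime check instead of an unchecked `as string` cast.
const markdown: string = assertString(result.fields.markdown, "markdown");
```

Compared with the cast, the failure happens at the point where the assumption is made, with a message naming the field, rather than somewhere inside the splitting code.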
Please make sure the markdown handling is safe at runtime, otherwise LGTM 👍
Actually, it could be
It is still typed as unknown though, the typecast is just not safe there.
@Tomas2D is this fix correct?
Signed-off-by: Radek Ježek <[email protected]>
BREAKING CHANGE: unification of LLM_BACKEND, EMBEDDING_BACKEND -> AI_BACKEND (open for discussion)
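One way a consumer could surface this rename at startup is sketched below. Only the three variable names come from this PR; the `resolveBackend` helper, the error wording, and the example value `"ollama"` are assumptions made for illustration.

```typescript
type Env = Record<string, string | undefined>;

// Hypothetical helper: reads the unified AI_BACKEND variable and, if it is
// missing, points at any legacy variables that are still set.
function resolveBackend(env: Env): string {
  const backend = env.AI_BACKEND;
  if (!backend) {
    const legacy = ["LLM_BACKEND", "EMBEDDING_BACKEND"].filter((key) => env[key] !== undefined);
    const hint = legacy.length > 0 ? ` (found legacy ${legacy.join(", ")}; rename to AI_BACKEND)` : "";
    throw new Error(`AI_BACKEND is not set${hint}`);
  }
  return backend;
}

// "ollama" is only an example value, not a statement about supported backends.
const backend = resolveBackend({ AI_BACKEND: "ollama" });
```

Passing the environment in as a plain record keeps the sketch independent of any particular runtime; in Node it would typically be called with `process.env`.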