Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

chore(embeddings): use framework embeddings, refactor ai providers #143

Merged
merged 3 commits into from
Jan 8, 2025

Conversation

jezekra1
Copy link
Contributor

@jezekra1 jezekra1 commented Dec 20, 2024

BREAKING CHANGE:

unification of LLM_BACKEND, EMBEDDING_BACKEND -> AI_BACKEND

open for discussion

@jezekra1 jezekra1 requested a review from a team as a code owner December 20, 2024 14:17
@jezekra1 jezekra1 force-pushed the use-framework-embedding branch from 25173d7 to b72da10 Compare December 20, 2024 14:18
@jezekra1 jezekra1 added the env label Dec 20, 2024
@jezekra1 jezekra1 force-pushed the use-framework-embedding branch from b72da10 to 4e37f54 Compare December 20, 2024 14:21
.env.example Show resolved Hide resolved
@jezekra1 jezekra1 force-pushed the use-framework-embedding branch from 4e37f54 to a2ba5d8 Compare December 20, 2024 15:04

const wikipedia = new WikipediaTool({
filters: { minPageNameSimilarity: 0.25, excludeOthersOnExactMatch: false },
output: { maxSerializedLength: MAX_CONTENT_LENGTH_CHARS }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maxSerializedLength is not affecting the markdown output.

I removed the restrictions on max content length from markdown as well as the simplified extraction:

extraction: { fields: { markdown: {} } },

Previously we had table extraction disabled and the output was truncated to 25k characters due to slow embeddings, however the issue has mostly been resolved and we can include this data again.

Also this makes our implementation more aligned with framework defaults.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The maxSerializedLength property affects only the serialized output, which contains the markdown output and it works correctly.

    const instance = new WikipediaTool({
      output: {
        maxSerializedLength: 100,
      },
    });
    const response = await instance.run({
      query: "ice hockey",
    });
    expect(response.getTextContent()).toHaveLength(100);

Source: https://github.com/i-am-bee/bee-agent-framework/blob/6332b0c3a6cb82310fdcef7f576c4a5a6e2d40fd/src/tools/search/wikipedia.ts#L100-L111

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But getTextContent aggregates text from all documents, which is not what we want here, because we add aditional information about the source to each chunk

@jezekra1 jezekra1 force-pushed the use-framework-embedding branch from a2ba5d8 to fab2e95 Compare December 20, 2024 15:14
Copy link
Contributor

@Tomas2D Tomas2D left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few suggestions. Good work 👍🏻


const wikipedia = new WikipediaTool({
filters: { minPageNameSimilarity: 0.25, excludeOthersOnExactMatch: false },
output: { maxSerializedLength: MAX_CONTENT_LENGTH_CHARS }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The maxSerializedLength property affects only the serialized output, which contains the markdown output and it works correctly.

    const instance = new WikipediaTool({
      output: {
        maxSerializedLength: 100,
      },
    });
    const response = await instance.run({
      query: "ice hockey",
    });
    expect(response.getTextContent()).toHaveLength(100);

Source: https://github.com/i-am-bee/bee-agent-framework/blob/6332b0c3a6cb82310fdcef7f576c4a5a6e2d40fd/src/tools/search/wikipedia.ts#L100-L111

src/runs/execution/tools/file-search-tool.ts Outdated Show resolved Hide resolved
src/runs/execution/provider.ts Outdated Show resolved Hide resolved
.env.example Show resolved Hide resolved
query: input.question,
documents: output.results.flatMap((document, idx) =>
Array.from(
splitString(document.fields.markdown as string, {
Copy link
Contributor

@pilartomas pilartomas Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is weird, why isn't markdown a string already? If it can be something else, we must handle it or fail.

Copy link
Contributor

@pilartomas pilartomas Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, is markdown good input type for splitString? 🤔 Splitting tags might cause problems.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@pilartomas pilartomas Jan 2, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That might be just for simplicity. We need to make sure it is type-safe here. With a typeof else throw guard if necessary.

@jezekra1 jezekra1 requested a review from pilartomas January 2, 2025 14:50
Copy link
Contributor

@pilartomas pilartomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please make sure the markdown handling is safe at runtime, otherwise LGTM 👍

@jezekra1 jezekra1 force-pushed the use-framework-embedding branch from 34d7a8d to 0eaa1a2 Compare January 6, 2025 08:05
@jezekra1
Copy link
Contributor Author

jezekra1 commented Jan 6, 2025

Please make sure the markdown handling is safe at runtime, otherwise LGTM 👍

Actually, it could be null (src), the document will now be skipped in such case.

@pilartomas
Copy link
Contributor

It is still types as unknown though, the typecast is just not safe there.

@jezekra1
Copy link
Contributor Author

jezekra1 commented Jan 6, 2025

It is still types as unknown though, the typecast is just not safe there.

@Tomas2D is this fix correct?

i-am-bee/bee-agent-framework#267

@jezekra1 jezekra1 force-pushed the use-framework-embedding branch from d2eae5b to 162a708 Compare January 7, 2025 12:21
Signed-off-by: Radek Ježek <[email protected]>
@jezekra1 jezekra1 force-pushed the use-framework-embedding branch from 162a708 to a20b6af Compare January 8, 2025 12:27
@jezekra1 jezekra1 requested a review from pilartomas January 8, 2025 12:27
@jezekra1 jezekra1 merged commit b611ab9 into main Jan 8, 2025
6 checks passed
@jezekra1 jezekra1 deleted the use-framework-embedding branch January 8, 2025 13:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants