LLM_EXTRACT_TEXT implementation #18435
How do you know you are extracting a PDF file? How about adding a constant "file type" or "extractor type" argument to the function?
An extractor type argument might be better, in case we end up with several kinds of extractors for the same file type (or one extractor for many file types).
Please add a test.
Extractor type arg added.
Unit and BVT tests have been added.
Please add result verification to the BVT test.
What type of PR is this?
Which issue(s) this PR fixes:
issue #18664
What this PR does / why we need it:
As part of our document LLM support, we are introducing the LLM_EXTRACT_TEXT function. This function extracts text from a PDF file and writes the extracted text to a specified text file. The extractor type is specified by the third argument.
Usage:
llm_extract_text(<input PDF datalink>, <output text file datalink>, <extractor type string>);
Return Value: A boolean indicating whether the extraction and writing process was successful.
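To illustrate the three-argument shape and the boolean return described above, here is a minimal Python sketch. It is not the implementation in this PR: the `EXTRACTORS` registry, the `extract_text` name, and the `"plain"` extractor type are all hypothetical, and a plain-text stand-in replaces PDF parsing so the sketch has no dependencies. It also assumes failures are reported as `False`; the actual function may raise an error instead.

```python
import tempfile
from pathlib import Path
from typing import Callable, Dict

# Hypothetical registry: extractor type string -> function turning raw bytes into text.
# The real function targets PDF input; "plain" is a stand-in for illustration only.
EXTRACTORS: Dict[str, Callable[[bytes], str]] = {
    "plain": lambda raw: raw.decode("utf-8"),
}


def extract_text(input_path: str, output_path: str, extractor_type: str) -> bool:
    """Extract text from input_path and write it to output_path.

    Returns True when extraction and writing both succeed, False otherwise
    (an assumption of this sketch, not confirmed behavior of LLM_EXTRACT_TEXT).
    """
    extractor = EXTRACTORS.get(extractor_type.lower())
    if extractor is None:
        return False  # unknown extractor type
    try:
        text = extractor(Path(input_path).read_bytes())
        Path(output_path).write_text(text, encoding="utf-8")
    except (OSError, UnicodeDecodeError):
        return False  # unreadable input, undecodable content, or unwritable output
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        src = Path(tmp) / "in.txt"
        dst = Path(tmp) / "out.txt"
        src.write_text("hello", encoding="utf-8")
        print(extract_text(str(src), str(dst), "plain"))
```

Keying the extractor on a string, as suggested in review, lets several extractors coexist for one file type without changing the function signature.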
Note:
Example SQL:
Example return: