Ospp/new llm extract text #19725

charleschile · 2024-10-31T14:23:50Z

User description

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_EXTRACT_TEXT function. This function extracts text from PDF files and writes the extracted text to a specified text file, extractor type can be specified by the third argument.

Usage: llm_extract_text(<input PDF datalink>, <output text file datalink>, <extractor type string>);

Return Value: A boolean indicating whether the extraction and writing process was successful.

Note:

Both the input and output paths must be absolute paths.

Example SQL:

select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");

Example return:

mysql> select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| llm_extract_text(cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4 as datalink), cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt as datalink), pdf)   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| true                                                                                                                                                                                                                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.10 sec)

PR Type

Enhancement, Tests

Description

Implemented the LLMExtractText function to extract text from PDF files and write it to a specified text file.
Added unit tests to verify the functionality of LLMExtractText, including both valid and invalid test cases.
Registered the new function ID and updated the list of supported built-in functions.
Added a new dependency on the github.com/ledongthuc/pdf package for PDF processing.
Included SQL test cases and results to validate the integration of the new function.

Changes walkthrough 📝

Relevant files

Enhancement

func_llm.go Implement `LLMExtractText` function for PDF text extraction pkg/sql/plan/function/func_llm.go Implemented `LLMExtractText` function to extract text from PDF files. Added error handling for file type validation. Utilized `pdf` package for reading PDF content. Integrated file writing with retry logic.	+178/-0
function_id.go Register `LLM_EXTRACT_TEXT` function ID pkg/sql/plan/function/function_id.go Registered `LLM_EXTRACT_TEXT` function ID. Updated function ID map with new function.	+3/-0
list_builtIn.go Add `LLM_EXTRACT_TEXT` to built-in functions pkg/sql/plan/function/list_builtIn.go Added `LLM_EXTRACT_TEXT` to supported built-in functions. Defined overloads for the new function.	+23/-2

Tests

func_llm_test.go Add unit tests for `LLMExtractText` function pkg/sql/plan/function/func_llm_test.go Added unit tests for `LLMExtractText` function. Included test cases for both valid and invalid inputs. Utilized `testify` for assertions.	+127/-0
func_llm_extract_file.result Add test results for `LLMExtractText` test/distributed/cases/function/func_llm_extract_file.result Added test results for `LLMExtractText` function. Verified successful text extraction from PDF files.	+6/-0
func_llm_extract_file.sql Add SQL test cases for `LLMExtractText` test/distributed/cases/function/func_llm_extract_file.sql Added SQL test cases for `LLMExtractText` function. Included tests for extracting text from PDF files.	+2/-0

Dependencies

go.mod `Add PDF package dependency` go.mod Added `github.com/ledongthuc/pdf` dependency.	+1/-0
go.sum `Update checksum for PDF package` go.sum Updated checksum for `github.com/ledongthuc/pdf`.	+2/-0

💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

qodo-merge-pro · 2024-10-31T14:24:30Z

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 PR contains tests
🔒 Security concerns Potential file path manipulation: The function accepts file paths as input, which could potentially lead to unauthorized access to files if not properly validated. Ensure that the file paths are properly sanitized and access is restricted to authorized locations only.
⚡ Recommended focus areas for review Error Handling The function returns a generic error in some cases, which might not provide enough information for debugging. Consider using more specific error types or adding more context to the errors. Resource Management The function opens a PDF file but defers closing it inside the readPdfToString function. Consider moving the defer statement outside of this function to ensure the file is closed even if an error occurs. Test Coverage The test cases don't cover all possible scenarios, such as handling of large files or various error conditions. Consider adding more comprehensive test cases.

qodo-merge-pro · 2024-10-31T14:25:17Z

PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

PR Code Suggestions ✨

Explore these optional code suggestions:

Category	Suggestion	Score
Enhancement	Expand test cases to cover more diverse scenarios and edge cases Consider adding more diverse test cases to cover edge cases and potential error scenarios, such as processing very large PDFs, handling PDFs with complex layouts, or dealing with corrupted PDF files. pkg/sql/plan/function/func_llm_test.go [32-58] regularCases := []struct { info string input []string output []string extractorType []string wants []bool }{ { info: "test encode - simple text", input: []string{ fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/MODocs1.pdf?offset=0&size=4", GetFilePath()), fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4", GetFilePath()), + fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/large_file.pdf?offset=0&size=4", GetFilePath()), + fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/complex_layout.pdf?offset=0&size=4", GetFilePath()), + fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/corrupted.pdf?offset=0&size=4", GetFilePath()), }, output: []string{ fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/MODocs1.txt", GetFilePath()), fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/example.txt", GetFilePath()), + fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/large_file.txt", GetFilePath()), + fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/complex_layout.txt", GetFilePath()), + fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/corrupted.txt", GetFilePath()), }, extractorType: []string{ + "pdf", + "pdf", + "pdf", "pdf", "pdf", }, wants: []bool{ true, true, + true, + true, + false, }, }, } Apply this suggestion Suggestion importance[1-10]: 8 Why: Expanding test cases to include edge cases and diverse scenarios would significantly improve the robustness and reliability of the function by ensuring it handles various inputs correctly. This is a valuable enhancement to the testing suite.	8
	Add context parameter for improved cancellation and timeout handling Consider adding a context parameter to the `extractTextFromPdfAndWriteToFile` function to allow for better cancellation and timeout handling, especially for large PDF files or slow file operations. pkg/sql/plan/function/func_llm.go [90-95] -func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc process.Process) (bool, error) { +func extractTextFromPdfAndWriteToFile(ctx context.Context, pdfPath string, txtPath string, proc process.Process) (bool, error) { // read PDF to string - content, err := readPdfToString(pdfPath) + content, err := readPdfToString(ctx, pdfPath) if err != nil { return false, moerr.NewInvalidInputNoCtxf("Invalid PDF input.") } Apply this suggestion Suggestion importance[1-10]: 7 Why: Adding a context parameter to the function would enhance its ability to handle cancellations and timeouts, especially for operations that might take a long time. This is a relevant and practical improvement for robustness.	7
	Improve error handling for null pages in PDF processing Consider implementing a more robust error handling mechanism in the `readPdfToString` function. Instead of silently continuing when encountering a null page, it might be better to log a warning or return a partial error. pkg/sql/plan/function/func_llm.go [155-159] for pageIndex := 1; pageIndex <= totalPage; pageIndex++ { p := r.Page(pageIndex) if p.V.IsNull() { + log.Warnf("Encountered null page at index %d", pageIndex) continue } Apply this suggestion Suggestion importance[1-10]: 6 Why: Logging a warning when encountering a null page in a PDF is a reasonable enhancement for better debugging and monitoring. However, it is a minor improvement and does not address a critical issue.	6

💡 Need additional feedback ? start a PR chat

charleschile and others added 13 commits August 30, 2024 21:52

LLM_EXTRACT_TEXT implementation

8a76a0b

Merge branch 'main' into ospp/LLM_EXTRACT_TEXT

76f2764

Merge branch 'matrixorigin:main' into ospp/LLM_EXTRACT_TEXT

b805a99

fix bug about Write() only works when a file does not exist

f619224

fix bug about Only pdf file supported

373d39d

add extractor type arg

11b1dcc

add unit tests

f02d9be

add BVT tests

f12a9b1

Merge branch 'matrixorigin:main' into ospp/LLM_EXTRACT_TEXT

4b487dd

Merge branch 'matrixorigin:main' into ospp/LLM_EXTRACT_TEXT

227e8e2

Merge branch 'main' into ospp/LLM_EXTRACT_TEXT

5498d07

fix bug

16580c7

Merge branch 'main' into ospp/LLM_EXTRACT_TEXT

590c1e1

charleschile requested review from m-schen, aressu1985, heni02 and zhangxu19830126 as code owners October 31, 2024 14:23

charleschile had a problem deploying to ci October 31, 2024 14:24 — with GitHub Actions Failure

charleschile temporarily deployed to ci October 31, 2024 14:24 — with GitHub Actions Inactive

charleschile had a problem deploying to ci October 31, 2024 14:24 — with GitHub Actions Failure

charleschile temporarily deployed to ci October 31, 2024 14:24 — with GitHub Actions Inactive

mergify bot added the kind/feature label Oct 31, 2024

qodo-merge-pro bot added the Review effort [1-5]: 4 label Oct 31, 2024

matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Oct 31, 2024

mergify bot mentioned this pull request Oct 31, 2024

LLM_EXTRACT_TEXT implementation #18435

Merged

7 tasks

aressu1985 approved these changes Nov 6, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ospp/new llm extract text #19725

Ospp/new llm extract text #19725

charleschile commented Oct 31, 2024 •

edited by qodo-merge-pro bot

Loading

qodo-merge-pro bot commented Oct 31, 2024

qodo-merge-pro bot commented Oct 31, 2024

Ospp/new llm extract text #19725

Are you sure you want to change the base?

Ospp/new llm extract text #19725

Conversation

charleschile commented Oct 31, 2024 • edited by qodo-merge-pro bot Loading

User description

What type of PR is this?

Which issue(s) this PR fixes:

What this PR does / why we need it:

PR Type

Description

Changes walkthrough 📝

qodo-merge-pro bot commented Oct 31, 2024

PR Reviewer Guide 🔍

qodo-merge-pro bot commented Oct 31, 2024

PR Code Suggestions ✨

charleschile commented Oct 31, 2024 •

edited by qodo-merge-pro bot

Loading