Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ospp/new llm extract text #19725

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

charleschile
Copy link
Contributor

@charleschile charleschile commented Oct 31, 2024

User description

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #18664

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_EXTRACT_TEXT function. This function extracts text from PDF files and writes the extracted text to a specified text file, extractor type can be specified by the third argument.

Usage: llm_extract_text(<input PDF datalink>, <output text file datalink>, <extractor type string>);

Return Value: A boolean indicating whether the extraction and writing process was successful.

Note:

  • Both the input and output paths must be absolute paths.

Example SQL:

select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");

Example return:

mysql> select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| llm_extract_text(cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4 as datalink), cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt as datalink), pdf)   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| true                                                                                                                                                                                                                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.10 sec)

PR Type

Enhancement, Tests


Description

  • Implemented the LLMExtractText function to extract text from PDF files and write it to a specified text file.
  • Added unit tests to verify the functionality of LLMExtractText, including both valid and invalid test cases.
  • Registered the new function ID and updated the list of supported built-in functions.
  • Added a new dependency on the github.com/ledongthuc/pdf package for PDF processing.
  • Included SQL test cases and results to validate the integration of the new function.

Changes walkthrough 📝

Relevant files
Enhancement
func_llm.go
Implement `LLMExtractText` function for PDF text extraction

pkg/sql/plan/function/func_llm.go

  • Implemented LLMExtractText function to extract text from PDF files.
  • Added error handling for file type validation.
  • Utilized pdf package for reading PDF content.
  • Integrated file writing with retry logic.
  • +178/-0 
    function_id.go
    Register `LLM_EXTRACT_TEXT` function ID                                   

    pkg/sql/plan/function/function_id.go

  • Registered LLM_EXTRACT_TEXT function ID.
  • Updated function ID map with new function.
  • +3/-0     
    list_builtIn.go
    Add `LLM_EXTRACT_TEXT` to built-in functions                         

    pkg/sql/plan/function/list_builtIn.go

  • Added LLM_EXTRACT_TEXT to supported built-in functions.
  • Defined overloads for the new function.
  • +23/-2   
    Tests
    func_llm_test.go
    Add unit tests for `LLMExtractText` function                         

    pkg/sql/plan/function/func_llm_test.go

  • Added unit tests for LLMExtractText function.
  • Included test cases for both valid and invalid inputs.
  • Utilized testify for assertions.
  • +127/-0 
    func_llm_extract_file.result
    Add test results for `LLMExtractText`                                       

    test/distributed/cases/function/func_llm_extract_file.result

  • Added test results for LLMExtractText function.
  • Verified successful text extraction from PDF files.
  • +6/-0     
    func_llm_extract_file.sql
    Add SQL test cases for `LLMExtractText`                                   

    test/distributed/cases/function/func_llm_extract_file.sql

  • Added SQL test cases for LLMExtractText function.
  • Included tests for extracting text from PDF files.
  • +2/-0     
    Dependencies
    go.mod
    Add PDF package dependency                                                             

    go.mod

    • Added github.com/ledongthuc/pdf dependency.
    +1/-0     
    go.sum
    Update checksum for PDF package                                                   

    go.sum

    • Updated checksum for github.com/ledongthuc/pdf.
    +2/-0     

    💡 PR-Agent usage: Comment /help "your question" on any pull request to receive relevant information

    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Reviewer Guide 🔍

    Here are some key observations to aid the review process:

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 PR contains tests
    🔒 Security concerns

    Potential file path manipulation:
    The function accepts file paths as input, which could potentially lead to unauthorized access to files if not properly validated. Ensure that the file paths are properly sanitized and access is restricted to authorized locations only.

    ⚡ Recommended focus areas for review

    Error Handling
    The function returns a generic error in some cases, which might not provide enough information for debugging. Consider using more specific error types or adding more context to the errors.

    Resource Management
    The function opens a PDF file but defers closing it inside the readPdfToString function. Consider moving the defer statement outside of this function to ensure the file is closed even if an error occurs.

    Test Coverage
    The test cases don't cover all possible scenarios, such as handling of large files or various error conditions. Consider adding more comprehensive test cases.

    @matrix-meow matrix-meow added the size/M Denotes a PR that changes [100,499] lines label Oct 31, 2024
    @mergify mergify bot mentioned this pull request Oct 31, 2024
    7 tasks
    Copy link

    PR-Agent was enabled for this repository. To continue using it, please link your git user with your CodiumAI identity here.

    PR Code Suggestions ✨

    Explore these optional code suggestions:

    CategorySuggestion                                                                                                                                    Score
    Enhancement
    Expand test cases to cover more diverse scenarios and edge cases

    Consider adding more diverse test cases to cover edge cases and potential error
    scenarios, such as processing very large PDFs, handling PDFs with complex layouts,
    or dealing with corrupted PDF files.

    pkg/sql/plan/function/func_llm_test.go [32-58]

     regularCases := []struct {
         info          string
         input         []string
         output        []string
         extractorType []string
         wants         []bool
     }{
         {
             info: "test encode - simple text",
             input: []string{
                 fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/MODocs1.pdf?offset=0&size=4", GetFilePath()),
                 fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4", GetFilePath()),
    +            fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/large_file.pdf?offset=0&size=4", GetFilePath()),
    +            fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/complex_layout.pdf?offset=0&size=4", GetFilePath()),
    +            fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/corrupted.pdf?offset=0&size=4", GetFilePath()),
             },
             output: []string{
                 fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/MODocs1.txt", GetFilePath()),
                 fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/example.txt", GetFilePath()),
    +            fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/large_file.txt", GetFilePath()),
    +            fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/complex_layout.txt", GetFilePath()),
    +            fmt.Sprintf("file://%s/../../../../test/distributed/resources/llm_test/extract_text/corrupted.txt", GetFilePath()),
             },
             extractorType: []string{
    +            "pdf",
    +            "pdf",
    +            "pdf",
                 "pdf",
                 "pdf",
             },
             wants: []bool{
                 true,
                 true,
    +            true,
    +            true,
    +            false,
             },
         },
     }
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: Expanding test cases to include edge cases and diverse scenarios would significantly improve the robustness and reliability of the function by ensuring it handles various inputs correctly. This is a valuable enhancement to the testing suite.

    8
    Add context parameter for improved cancellation and timeout handling

    Consider adding a context parameter to the extractTextFromPdfAndWriteToFile function
    to allow for better cancellation and timeout handling, especially for large PDF
    files or slow file operations.

    pkg/sql/plan/function/func_llm.go [90-95]

    -func extractTextFromPdfAndWriteToFile(pdfPath string, txtPath string, proc *process.Process) (bool, error) {
    +func extractTextFromPdfAndWriteToFile(ctx context.Context, pdfPath string, txtPath string, proc *process.Process) (bool, error) {
         // read PDF to string
    -    content, err := readPdfToString(pdfPath)
    +    content, err := readPdfToString(ctx, pdfPath)
         if err != nil {
             return false, moerr.NewInvalidInputNoCtxf("Invalid PDF input.")
         }
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: Adding a context parameter to the function would enhance its ability to handle cancellations and timeouts, especially for operations that might take a long time. This is a relevant and practical improvement for robustness.

    7
    Improve error handling for null pages in PDF processing

    Consider implementing a more robust error handling mechanism in the readPdfToString
    function. Instead of silently continuing when encountering a null page, it might be
    better to log a warning or return a partial error.

    pkg/sql/plan/function/func_llm.go [155-159]

     for pageIndex := 1; pageIndex <= totalPage; pageIndex++ {
         p := r.Page(pageIndex)
         if p.V.IsNull() {
    +        log.Warnf("Encountered null page at index %d", pageIndex)
             continue
         }
    • Apply this suggestion
    Suggestion importance[1-10]: 6

    Why: Logging a warning when encountering a null page in a PDF is a reasonable enhancement for better debugging and monitoring. However, it is a minor improvement and does not address a critical issue.

    6

    💡 Need additional feedback ? start a PR chat

    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Labels
    kind/feature Review effort [1-5]: 4 size/M Denotes a PR that changes [100,499] lines
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    3 participants