Feature request: Create Textract Middleware #46

Open
HQarroum opened this issue Jul 30, 2024 · 1 comment
Assignees
Labels
new-middleware A label associated with a new middleware. triage

Comments

@HQarroum
Contributor

HQarroum commented Jul 30, 2024

Use case

Implement a middleware that exposes the Textract capabilities within a Lakechain document processing pipeline.

Solution/User Experience

Below is a tentative API design for this middleware.

Table data extraction.
Input(s) : PDF, Images
Output(s) : 'markdown' and/or 'text' and/or 'excel' and/or 'csv' and/or 'html'

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    // Accepted values: 'markdown', 'text', 'excel', 'csv', 'html'.
    .withOutputType('markdown')
    // Defines whether a document will be created for each table,
    // or whether to group them all in one document.
    .withGroupOutput(false)
    .build())
  .build();
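
To make the `withOutputType` and `withGroupOutput` options more concrete, here is a minimal sketch of how extracted tables might be rendered and grouped. The names `Table`, `renderTable`, and `renderTables` are hypothetical and not part of Lakechain or Textract; only two of the proposed output types are covered, and the first row is assumed to be the header.

```typescript
// Hypothetical in-memory shape for a table extracted from a document.
type Table = { rows: string[][] };

// Subset of the output types listed in the proposal.
type OutputType = 'markdown' | 'csv';

// Quotes a CSV field only when it contains a quote, comma or newline.
function escapeCsv(value: string): string {
  return /[",\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

// Escapes pipes so that they do not break Markdown table cells.
function escapeMd(value: string): string {
  return value.replace(/\|/g, '\\|');
}

// Renders a single table; the first row is treated as the header (assumption).
function renderTable(table: Table, type: OutputType): string {
  if (type === 'csv') {
    return table.rows.map((row) => row.map(escapeCsv).join(',')).join('\n');
  }
  const [header = [], ...body] = table.rows;
  const line = (cells: string[]) => `| ${cells.map(escapeMd).join(' | ')} |`;
  const separator = `| ${header.map(() => '---').join(' | ')} |`;
  return [line(header), separator, ...body.map(line)].join('\n');
}

// groupOutput = true yields one document holding every table,
// groupOutput = false yields one document per table.
function renderTables(tables: Table[], type: OutputType, groupOutput: boolean): string[] {
  const rendered = tables.map((t) => renderTable(t, type));
  return groupOutput ? [rendered.join('\n\n')] : rendered;
}
```

With `withGroupOutput(false)`, a document containing three tables would then produce three output documents under this reading of the option.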

Key value pair extraction.
Input(s) : PDF, Images
Output(s) : 'json' | 'csv'

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new KvExtractionTask.Builder()
    // Accepted values: 'json', 'csv'.
    .withOutputType('json')
    .build())
  .build();
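
As an illustration of the two proposed output types, the sketch below serializes extracted key-value pairs to JSON or CSV. `KeyValue` and `renderKeyValues` are hypothetical names, and the `key,value` CSV header is an assumption rather than something the proposal specifies.

```typescript
// Hypothetical shape for a key-value pair extracted from a form.
type KeyValue = { key: string; value: string };

// Serializes the pairs in the requested output type.
function renderKeyValues(pairs: KeyValue[], type: 'json' | 'csv'): string {
  if (type === 'json') {
    return JSON.stringify(pairs);
  }
  // Every CSV field is quoted, and embedded quotes are doubled.
  const quote = (s: string) => `"${s.replace(/"/g, '""')}"`;
  const rows = pairs.map((p) => `${quote(p.key)},${quote(p.value)}`);
  return ['key,value', ...rows].join('\n');
}
```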

Visualize task.
Input(s) : PDF, Images
Output(s) : One or multiple images

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ImageVisualizationTask.Builder()
    .withCheckboxes(true)
    .withKeyValues(true)
    .withTables(true)
    .withSearch('rent', { top_k: 10 })
    .build())
  .build();

Expense analysis.
Input(s) : PDF, Images
Output(s) : CSV

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new ExpenseAnalysisTask.Builder()
    .withOutputType('csv')
    .build())
  .build();

ID Analysis.
Input(s) : PDF, Images
Output(s) : JSON, CSV

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new IdAnalysisTask.Builder()
    // Accepted values: 'json', 'csv'.
    .withOutputType('json')
    .build())
  .build();

Layout Analysis.
Input(s) : PDF, Images
Output(s) : PDF, Images + Metadata
Exports layout information in a structured way in the document metadata.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new LayoutAnalysisTask.Builder()
    .build())
  .build();
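
One possible way to export layout information "in a structured way" is to keep only the layout blocks returned by Textract and flatten them into a small metadata object. The sketch below assumes this approach; `LayoutElement` and `toLayoutMetadata` are hypothetical, the `Block` type is a reduced subset of the Textract block structure, and the resulting metadata shape is not one defined by Lakechain.

```typescript
// Reduced subset of a Textract block (only the fields used here).
type BoundingBox = { Left: number; Top: number; Width: number; Height: number };
type Block = {
  BlockType: string;
  Page?: number;
  Geometry?: { BoundingBox?: BoundingBox };
};

// Hypothetical metadata entry describing one layout element.
type LayoutElement = { type: string; page: number; box: BoundingBox | null };

// Keeps the blocks whose type starts with LAYOUT_ (e.g. LAYOUT_TITLE,
// LAYOUT_TABLE) and maps them to a flat, serializable structure.
function toLayoutMetadata(blocks: Block[]): { layout: LayoutElement[] } {
  const prefix = 'LAYOUT_';
  const layout = blocks
    .filter((b) => b.BlockType.startsWith(prefix))
    .map((b) => ({
      type: b.BlockType.slice(prefix.length).toLowerCase(),
      // Assumption: default to the first page when no page is given.
      page: b.Page ?? 1,
      box: b.Geometry?.BoundingBox ?? null,
    }));
  return { layout };
}
```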

Alternative solutions

No response

@HQarroum HQarroum added triage new-middleware A label associated with a new middleware. labels Jul 30, 2024
@HQarroum HQarroum self-assigned this Jul 30, 2024
@HQarroum HQarroum moved this to Planned in Project Lakechain Jul 30, 2024
@HQarroum
Contributor Author

HQarroum commented Jul 30, 2024

Adding support for optional custom text linearization functions, as per @mrtj's comment.

Text linearization function.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    .withOutputType('text')
    .withLinearizationFunction(new TextLinearizationFunction.Builder()
      .withKeyPrefix('<key>')
      .withKeySuffix('</key>')
      .withValuePrefix('<value>')
      .withValueSuffix('</value>')
      .build())
    .build())
  .build();
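
To illustrate what such a linearization function could do with these options, here is a minimal sketch that wraps each key and value in the configured prefix and suffix. `LinearizationOptions` and `linearizeKeyValues` are hypothetical names, and the single space between key and value, as well as the one-pair-per-line layout, are assumptions.

```typescript
// Hypothetical options mirroring the builder methods in the proposal.
type LinearizationOptions = {
  keyPrefix: string;
  keySuffix: string;
  valuePrefix: string;
  valueSuffix: string;
};

type Pair = { key: string; value: string };

// Emits one line per key-value pair, wrapping keys and values with the
// configured markers.
function linearizeKeyValues(pairs: Pair[], opts: LinearizationOptions): string {
  return pairs
    .map((p) =>
      `${opts.keyPrefix}${p.key}${opts.keySuffix} ` +
      `${opts.valuePrefix}${p.value}${opts.valueSuffix}`)
    .join('\n');
}

// Usage with the markers from the proposal.
const linearized = linearizeKeyValues(
  [{ key: 'Rent', value: '1200' }],
  { keyPrefix: '<key>', keySuffix: '</key>', valuePrefix: '<value>', valueSuffix: '</value>' },
);
// linearized === '<key>Rent</key> <value>1200</value>'
```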

HTML linearization function.

const textract = new TextractProcessor.Builder()
  .withScope(this)
  .withIdentifier('Trigger')
  .withCacheStorage(cache)
  .withTask(new TableExtractionTask.Builder()
    .withOutputType('text')
    .withLinearizationFunction(new HtmlLinearizationFunction.Builder()
      .withTableCellHeaderPrefix('<td>')
      .withTableCellHeaderSuffix('</td>')
      .withKeyPrefix('<key>')
      .withKeySuffix('</key>')
      .withValuePrefix('<value>')
      .withValueSuffix('</value>')
      .build())
    .build())
  .build();
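
The table cell header options could be applied in the same spirit. The sketch below wraps each header cell of a table in the configured markers; `HeaderOptions` and `linearizeHeaderRow` are hypothetical names, and concatenating the cells without a separator is an assumption.

```typescript
// Hypothetical options mirroring withTableCellHeaderPrefix/Suffix.
type HeaderOptions = {
  tableCellHeaderPrefix: string;
  tableCellHeaderSuffix: string;
};

// Wraps every header cell with the configured prefix and suffix.
function linearizeHeaderRow(headers: string[], opts: HeaderOptions): string {
  return headers
    .map((h) => `${opts.tableCellHeaderPrefix}${h}${opts.tableCellHeaderSuffix}`)
    .join('');
}

// Usage with the markers from the proposal.
const headerRow = linearizeHeaderRow(
  ['Item', 'Price'],
  { tableCellHeaderPrefix: '<td>', tableCellHeaderSuffix: '</td>' },
);
// headerRow === '<td>Item</td><td>Price</td>'
```

Passing `'<th>'` and `'</th>'` instead would yield the element HTML conventionally uses for header cells.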

@mrtj does this look good to you?

@HQarroum HQarroum moved this from Planned to In progress in Project Lakechain Jul 30, 2024
Projects
Status: In progress