[Feature][s3] Add support for reading all document types supported by Apache Tika, and Excel #1918

Closed
3 tasks done
libailin opened this issue Aug 27, 2024 · 0 comments · Fixed by #1919
Labels
feature-request this is a feature requests on the product

Comments

@libailin
Contributor

libailin commented Aug 27, 2024

Search before asking

  • I have searched the issues and found no similar feature request.

Description

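Add support for reading all document types supported by Apache Tika.
Add support for reading Excel files.

The two groups of parameters (Tika and Excel) cannot be used at the same time.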

Use case

CREATE TABLE source
(
    content String,
    metadata String
) WITH (
    'connector' = 's3-x',
    'assessKey' = 'xxx',
    'secretKey' = 'xxx',
    'bucket' = 'di-test',
    'objects' = '["/pdf-source/20240528/.*"]',
    'endpoint' = 'http://10.x.x.x',
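    -- Whether to enable chunking; default false
    'tika-use-extract' = 'true'
    -- Chunk size; the default of -1 disables chunking and extracts the full content
    ,'tika-chunk-size' = '40'
    -- Overlap ratio between chunks, 0-100
    ,'tika-overlap-ratio' = '0'
    -- Disable injecting the bucket name as an endpoint prefix; default false. Set to true when the endpoint is a domain name
    ,'disableBucketNameInEndpoint' = 'true'
    -- Regular expression used to match objects
    ,'objectsRegex' = '.*\.doc'
    -- Read Excel files
    ,'use-excel-format' = 'true'
    -- Column indexes in the Excel sheet to read
    ,'column-index'='0,1,3'
    -- Worksheets in the Excel file to read
    ,'sheet-no'='0,2'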
);


CREATE TABLE sink
(
    content String,
    metadata String
) WITH (
      'connector' = 'stream-x',
      'print' = 'true'
      );

INSERT INTO sink SELECT * FROM source;
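
The description says the Tika options and the Excel options cannot be used together, while the statement above sets both on a single table. A minimal sketch of how the two modes might be declared as separate source tables follows. The option keys and values are taken from the use case above; the table names `tika_source` and `excel_source`, the `/excel-source/.*` path, the `.*\.xlsx` pattern, and the three-column Excel schema (which assumes each declared column maps in order to one entry of `column-index`) are illustrative assumptions that the issue does not confirm.

```sql
-- Tika mode only: extract document content and metadata, optionally chunked.
CREATE TABLE tika_source
(
    content String,
    metadata String
) WITH (
    'connector' = 's3-x',
    'assessKey' = 'xxx',
    'secretKey' = 'xxx',
    'bucket' = 'di-test',
    'objects' = '["/pdf-source/20240528/.*"]',
    'endpoint' = 'http://10.x.x.x',
    'disableBucketNameInEndpoint' = 'true',
    'objectsRegex' = '.*\.doc',
    'tika-use-extract' = 'true',
    'tika-chunk-size' = '40',
    'tika-overlap-ratio' = '0'
);

-- Excel mode only: no tika-* options are set on this table.
-- Assumption: col_0, col_1, col_3 map positionally to column-index 0, 1, 3.
CREATE TABLE excel_source
(
    col_0 String,
    col_1 String,
    col_3 String
) WITH (
    'connector' = 's3-x',
    'assessKey' = 'xxx',
    'secretKey' = 'xxx',
    'bucket' = 'di-test',
    'objects' = '["/excel-source/.*"]',   -- hypothetical path
    'endpoint' = 'http://10.x.x.x',
    'disableBucketNameInEndpoint' = 'true',
    'objectsRegex' = '.*\.xlsx',          -- hypothetical pattern
    'use-excel-format' = 'true',
    'column-index' = '0,1,3',
    'sheet-no' = '0,2'
);
```

Keeping one mode per table makes it obvious which option group applies to each source and avoids relying on whatever precedence the connector would apply if both groups were present.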

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!

Code of Conduct

@libailin libailin added the feature-request this is a feature requests on the product label Aug 27, 2024
libailin added a commit to libailin/chunjun that referenced this issue Aug 28, 2024
libailin added a commit to libailin/chunjun that referenced this issue Sep 20, 2024
@libailin libailin changed the title from "[Feature][s3] Add support for reading all document types supported by Apache Tika" to "[Feature][s3] Add support for reading all document types supported by Apache Tika, and Excel" Sep 20, 2024
libailin added a commit to libailin/chunjun that referenced this issue Sep 20, 2024
…ents supported by Apache Tika, read excel format

[Feature-DTStack#1918][s3] Add support for reading all types of documents supported by Apache Tika, read excel format
zoudaokoulife pushed a commit that referenced this issue Sep 24, 2024
…pported by Apache Tika, read excel format

[Feature-#1918][s3] Add support for reading all types of documents supported by Apache Tika, read excel format