Operators are a collection of basic processes that assist in data modification, cleaning, filtering, deduplication, etc. We support a wide range of data sources and file formats, and allow for flexible extension to custom datasets.
算子 (Operator) 是协助数据修改、清理、过滤、去重等基本流程的集合。我们支持广泛的数据来源和文件格式,并支持对自定义数据集的灵活扩展。
This page offers a basic description of the operators (OPs) in Data-Juicer. Users can refer to the API documentation for the specific parameters of each operator, and can read and run the unit tests (tests/ops/...) for examples of operator-wise usage as well as the effect of each operator when applied to built-in test data samples. In addition, you can try using an agent to automatically route to suitable OPs and call them; e.g., refer to Agentic Filters of DJ and Agentic Mappers of DJ.
本页面提供了算子 (OP) 的基本描述。用户可以参考 API 文档了解每个算子的具体参数,并且可以查看、运行单元测试 (tests/ops/...),来体验各算子的用法示例以及每个算子作用于内置测试数据样本时的效果。此外,还可以尝试使用 agent 自动路由并调用合适的算子,例如,参考 Agentic Filters of DJ、Agentic Mappers of DJ。
The operators in Data-Juicer are categorized into 7 types. Data-Juicer 中的算子分为以下 7 种类型。
Type 类型 | Number 数量 | Description 描述 |
---|---|---|
aggregator | 4 | Aggregates batched samples, e.g. into a summary or conclusion. 对批量样本进行汇总,如得出总结或结论。 |
deduplicator | 10 | Detects and removes duplicate samples. 识别、删除重复样本。 |
filter | 45 | Filters out low-quality samples. 过滤低质量样本。 |
formatter | 9 | Discovers, loads, and canonicalizes source data. 发现、加载、规范化原始数据。 |
grouper | 3 | Groups samples into batched samples. 将样本分组,每一组组成一个批量样本。 |
mapper | 75 | Edits and transforms samples. 对数据样本进行编辑和转换。 |
selector | 5 | Selects top samples based on ranking. 基于排序选取高质量样本。 |
All the specific operators are listed below, each featured with several capability tags. 下面列出所有具体算子,每种算子都通过多个标签来注明其主要功能。
- Modality Tags
  - 🔤Text: process text data specifically. 专用于处理文本。
  - 🏞Image: process image data specifically. 专用于处理图像。
  - 📣Audio: process audio data specifically. 专用于处理音频。
  - 🎬Video: process video data specifically. 专用于处理视频。
  - 🔮Multimodal: process multimodal data. 用于处理多模态数据。
- Resource Tags
  - 💻CPU: only requires CPU resource. 只需要 CPU 资源。
  - 🚀GPU: requires GPU/CUDA resource as well. 额外需要 GPU/CUDA 资源。
- Usability Tags
  - 🔴Alpha: alpha version OP. Only the basic OP implementation is finished. 表示 alpha 版本算子。只完成了基础的算子实现。
  - 🟡Beta: beta version OP. On top of the alpha version, unit tests for this OP have been added. 表示 beta 版本算子。基于 alpha 版本,添加了算子的单元测试。
  - 🟢Stable: stable version OP. On top of the beta version, DJ-related OP optimizations (e.g. model management, batched processing, OP fusion, ...) have been added to this OP. 表示 stable 版本算子。基于 beta 版本,完善了DJ相关的算子优化项(如模型管理,批处理,算子融合等)。
- Model Tags
  - 🔗API: equipped with API-based models (e.g. ChatGPT, GPT-4o). 支持基于 API 调用模型(如 ChatGPT,GPT-4o)。
  - 🌊vLLM: equipped with models supported by vLLM. 支持基于 vLLM 进行模型推理。
  - 🧩HF: equipped with models from HuggingFace Hub. 支持来自于 HuggingFace Hub 的模型。
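
The tags above can be used to shortlist operators mechanically. The following is a minimal sketch, assuming rows are copied from the tables on this page in the `name | tags | description | code | tests |` layout; `parse_row` and `select_ops` are hypothetical helpers written for this illustration and are not part of Data-Juicer. Descriptions are omitted in the sample rows for brevity.

```python
# Hypothetical helpers for illustration only; not part of Data-Juicer.
ROWS = """\
document_deduplicator | 🔤Text 💻CPU 🟢Stable | (description omitted) | code | tests |
image_aesthetics_filter | 🏞Image 🚀GPU 🧩HF 🟢Stable | (description omitted) | code | tests |
ray_document_deduplicator | 🔤Text 💻CPU 🔴Alpha | (description omitted) | code | - |
"""


def parse_row(line):
    """Split one table row into (operator name, set of tags)."""
    cells = [cell.strip() for cell in line.split("|")]
    name, tags = cells[0], cells[1]
    return name, set(tags.split())


def select_ops(rows, required_tags):
    """Return the names of operators that carry every required tag."""
    selected = []
    for line in rows.strip().splitlines():
        name, tags = parse_row(line)
        if required_tags <= tags:  # subset test: all required tags present
            selected.append(name)
    return selected


cpu_stable = select_ops(ROWS, {"💻CPU", "🟢Stable"})
print(cpu_stable)
```

Matching on whole tag tokens (rather than substrings) keeps a query for `💻CPU` from accidentally matching an unrelated tag.
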
Operator 算子 | Tags 标签 | Description 描述 | Source code 源码 | Unit tests 单测样例 |
---|---|---|---|---|
entity_attribute_aggregator | 💻CPU 🔗API 🟢Stable | Returns a conclusion about a given entity's attribute, drawn from a set of documents. 根据一组文档,返回给定实体的某个属性的结论。 | code | tests |
meta_tags_aggregator | 💻CPU 🔗API 🟢Stable | Merge similar meta tags to one tag. 将类似的元标记合并到一个标记。 | code | tests |
most_relavant_entities_aggregator | 💻CPU 🔗API 🟢Stable | Extract entities closely related to a given entity from some texts, and sort them in descending order of importance. 从一些文本中提取与给定实体密切相关的实体,并按重要性的降序对它们进行排序。 | code | tests |
nested_aggregator | 🔤Text 💻CPU 🔗API 🟢Stable | To respect input-length limits, aggregates contents in a nested manner, a given number of samples at a time. 考虑到输入长度的限制,每次对给定数量的样本内容进行嵌套聚合。 | code | tests |
Operator 算子 | Tags 标签 | Description 描述 | Source code 源码 | Unit tests 单测样例 |
---|---|---|---|---|
document_deduplicator | 🔤Text 💻CPU 🟢Stable | Deduplicator to deduplicate samples at document-level using exact matching. Deduplicator使用精确匹配在文档级别删除重复的样本。 | code | tests |
document_minhash_deduplicator | 🔤Text 💻CPU 🟢Stable | Deduplicator to deduplicate samples at document-level using MinHashLSH. Deduplicator使用MinHashLSH在文档级别删除重复的样本。 | code | tests |
document_simhash_deduplicator | 🔤Text 💻CPU 🟢Stable | Deduplicator to deduplicate samples at document-level using SimHash. Deduplicator使用SimHash在文档级别对样本进行重复数据删除。 | code | tests |
image_deduplicator | 🏞Image 💻CPU 🟢Stable | Deduplicator to deduplicate samples at document-level using exact matching of images between documents. Deduplicator使用文档之间的图像精确匹配在文档级别删除重复的样本。 | code | tests |
ray_basic_deduplicator | 💻CPU 🔴Alpha | Backend for deduplicator. deduplicator的后端。 | code | - |
ray_bts_minhash_deduplicator | 🔤Text 💻CPU 🔴Alpha | A distributed implementation of Union-Find with load balancing. 具有负载平衡的Union-Find的分布式实现。 | code | - |
ray_document_deduplicator | 🔤Text 💻CPU 🔴Alpha | Deduplicator to deduplicate samples at document-level using exact matching. Deduplicator使用精确匹配在文档级别删除重复的样本。 | code | - |
ray_image_deduplicator | 🏞Image 💻CPU 🔴Alpha | Deduplicator to deduplicate samples at document-level using exact matching of images between documents. Deduplicator使用文档之间的图像精确匹配在文档级别删除重复的样本。 | code | - |
ray_video_deduplicator | 🎬Video 💻CPU 🔴Alpha | Deduplicator to deduplicate samples at document-level using exact matching of videos between documents. Deduplicator使用文档之间的视频精确匹配在文档级别删除重复的样本。 | code | - |
video_deduplicator | 🎬Video 💻CPU 🟢Stable | Deduplicator to deduplicate samples at document-level using exact matching of videos between documents. Deduplicator使用文档之间的视频精确匹配在文档级别删除重复的样本。 | code | tests |
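
To make the distinction between exact and near-duplicate matching concrete, here is a minimal sketch in plain Python. It is not Data-Juicer's implementation: `exact_dedup` and `jaccard` are hypothetical helpers, samples are assumed to be dicts with a `text` field, and SHA-256 is an arbitrary choice of hash. The Jaccard similarity over word shingles shown here is the quantity that MinHash-based methods estimate without comparing every pair directly.

```python
import hashlib


def exact_dedup(samples, key="text"):
    """Keep the first sample for each distinct text, compared by hash digest."""
    seen, kept = set(), []
    for sample in samples:
        digest = hashlib.sha256(sample[key].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept


def jaccard(text_a, text_b, n=2):
    """Jaccard similarity between the word n-gram (shingle) sets of two texts."""
    def shingles(text):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(max(len(words) - n + 1, 1))}

    set_a, set_b = shingles(text_a), shingles(text_b)
    return len(set_a & set_b) / len(set_a | set_b)


docs = [
    {"text": "the quick brown fox"},
    {"text": "the quick brown fox"},
    {"text": "the quick brown dog"},
]
unique_docs = exact_dedup(docs)
similarity = jaccard(docs[0]["text"], docs[2]["text"])
print(len(unique_docs), similarity)
```

Exact matching removes only the verbatim copy; the third document survives even though it shares half of its shingles with the first, which is the case near-duplicate methods are meant to catch.
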
Operator 算子 | Tags 标签 | Description 描述 | Source code 源码 | Unit tests 单测样例 |
---|---|---|---|---|
alphanumeric_filter | 🔤Text 💻CPU 🧩HF 🟢Stable | Filter to keep samples with alphabet/numeric ratio within a specific range. 过滤器保持样品与字母/数字的比例在一个特定的范围内。 | code | tests |
audio_duration_filter | 📣Audio 💻CPU 🟢Stable | Keep data samples whose audios' durations are within a specified range. 保留音频持续时间在指定范围内的数据样本。 | code | tests |
audio_nmf_snr_filter | 📣Audio 💻CPU 🟢Stable | Keep data samples whose audios' SNRs (computed based on NMF) are within a specified range. 保留音频的snr (根据NMF计算) 在指定范围内的数据样本。 | code | tests |
audio_size_filter | 📣Audio 💻CPU 🟢Stable | Keep data samples whose audio size (in bytes/KB/MB/...) is within a specific range. 保留音频大小 (以字节/KB/MB/... 为单位) 在特定范围内的数据样本。 | code | tests |
average_line_length_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with average line length within a specific range. 过滤器,用于保留平均行长度在特定范围内的样本。 | code | tests |
character_repetition_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with char-level n-gram repetition ratio within a specific range. 过滤器将具有char级n-gram重复比率的样本保持在特定范围内。 | code | tests |
flagged_words_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with flagged-word ratio less than a specific max value. 过滤以保持标记词比率小于特定最大值的样本。 | code | tests |
image_aesthetics_filter | 🏞Image 🚀GPU 🧩HF 🟢Stable | Filter to keep samples with aesthetics scores within a specific range. 过滤以保持美学分数在特定范围内的样品。 | code | tests |
image_aspect_ratio_filter | 🏞Image 💻CPU 🟢Stable | Filter to keep samples with image aspect ratio within a specific range. 过滤器,以保持样本的图像纵横比在特定范围内。 | code | tests |
image_face_count_filter | 🏞Image 💻CPU 🟢Stable | Filter to keep samples with the number of faces within a specific range. 过滤器,用于保留人脸数量在特定范围内的样本。 | code | tests |
image_face_ratio_filter | 🏞Image 💻CPU 🟢Stable | Filter to keep samples with face area ratios within a specific range. 过滤器,用于保留人脸面积比在特定范围内的样本。 | code | tests |
image_nsfw_filter | 🏞Image 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose images have low nsfw scores. 过滤器保留图像具有低nsfw分数的样本。 | code | tests |
image_pair_similarity_filter | 🏞Image 🚀GPU 🧩HF 🟢Stable | Filter to keep image pairs with similarities between images within a specific range. 过滤器将图像之间具有相似性的图像对保持在特定范围内。 | code | tests |
image_shape_filter | 🏞Image 💻CPU 🟢Stable | Filter to keep samples with image shape (w, h) within specific ranges. 过滤器保持样品的图像形状 (w,h) 在特定范围内。 | code | tests |
image_size_filter | 🏞Image 💻CPU 🟢Stable | Keep data samples whose image size (in bytes/KB/MB/...) is within a specific range. 保留图像大小 (以字节/KB/MB/... 为单位) 在特定范围内的数据样本。 | code | tests |
image_text_matching_filter | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose matching score between image and text is within a specific range. 过滤器,用于保留图像和文本之间的匹配分数在特定范围内的样本。 | code | tests |
image_text_similarity_filter | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose similarity between image and text is within a specific range. 过滤器,用于保留图像和文本之间的相似度在特定范围内的样本。 | code | tests |
image_watermark_filter | 🏞Image 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose images, with high probability, have no watermark. 过滤器,用于保留图像大概率没有水印的样本。 | code | tests |
language_id_score_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples in a specific language with confidence score larger than a specific min value. 过滤器以保留置信度得分大于特定最小值的特定语言的样本。 | code | tests |
maximum_line_length_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with maximum line length within a specific range. 过滤器将最大行长度的样本保持在特定范围内。 | code | tests |
perplexity_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with perplexity score less than a specific max value. 过滤以保留困惑度分数小于特定最大值的样本。 | code | tests |
phrase_grounding_recall_filter | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose recall in locating, within the images, the phrases extracted from the text is within a specified range. 过滤器,用于保留从文本中提取的短语在图像中的定位召回率在指定范围内的样本。 | code | tests |
special_characters_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with special-char ratio within a specific range. 过滤器将具有特殊字符比率的样品保持在特定范围内。 | code | tests |
specified_field_filter | 💻CPU 🟢Stable | Filter based on specified field information. 根据指定的字段信息进行筛选。 | code | tests |
specified_numeric_field_filter | 💻CPU 🟢Stable | Filter based on specified numeric field information. 根据指定的数值字段信息进行筛选。 | code | tests |
stopwords_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with stopword ratio larger than a specific min value. 过滤以保持停止词比率大于特定最小值的样本。 | code | tests |
suffix_filter | 💻CPU 🟢Stable | Filter to keep samples with specified suffix. 过滤器以保留具有指定后缀的样本。 | code | tests |
text_action_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep texts that contain actions. 过滤器,用于保留包含动作的文本。 | code | tests |
text_entity_dependency_filter | 🔤Text 💻CPU 🟢Stable | Identify entities in the text that are independent of other tokens, and filter them. 识别文本中与其他 token 独立的实体,并对其进行过滤。 | code | tests |
text_length_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with total text length within a specific range. 过滤以保持文本总长度在特定范围内的样本。 | code | tests |
text_pair_similarity_filter | 🔤Text 🚀GPU 🧩HF 🟢Stable | Filter to keep text pairs with similarities between texts within a specific range. 过滤器将文本之间具有相似性的文本对保留在特定范围内。 | code | tests |
token_num_filter | 🔤Text 💻CPU 🧩HF 🟢Stable | Filter to keep samples with total token number within a specific range. 过滤器,用于保留 token 总数在特定范围内的样本。 | code | tests |
video_aesthetics_filter | 🎬Video 🚀GPU 🧩HF 🟢Stable | Filter to keep data samples with aesthetics scores for specified frames in the videos within a specific range. 过滤器将视频中指定帧的美学得分数据样本保留在特定范围内。 | code | tests |
video_aspect_ratio_filter | 🎬Video 💻CPU 🟢Stable | Filter to keep samples with video aspect ratio within a specific range. 过滤器将视频纵横比的样本保持在特定范围内。 | code | tests |
video_duration_filter | 🎬Video 💻CPU 🟢Stable | Keep data samples whose videos' durations are within a specified range. 保留视频持续时间在指定范围内的数据样本。 | code | tests |
video_frames_text_similarity_filter | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose similarity between sampled video frame images and text is within a specific range. 过滤器,用于保留采样视频帧图像和文本之间的相似度在特定范围内的样本。 | code | tests |
video_motion_score_filter | 🎬Video 💻CPU 🟢Stable | Filter to keep samples with video motion scores within a specific range. 过滤器将视频运动分数的样本保持在特定范围内。 | code | tests |
video_motion_score_raft_filter | 🎬Video 🚀GPU 🟢Stable | Filter to keep samples with video motion scores within a specified range. 过滤器将视频运动分数的样本保持在指定范围内。 | code | tests |
video_nsfw_filter | 🎬Video 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose videos have low nsfw scores. 过滤器以保留其视频具有低nsfw分数的样本。 | code | tests |
video_ocr_area_ratio_filter | 🎬Video 🚀GPU 🟢Stable | Keep data samples whose detected text area ratios for specified frames in the video are within a specified range. 保留检测到的视频中指定帧的文本面积比率在指定范围内的数据样本。 | code | tests |
video_resolution_filter | 🎬Video 💻CPU 🟢Stable | Keep data samples whose videos' resolutions are within a specified range. 保留视频分辨率在指定范围内的数据样本。 | code | tests |
video_tagging_from_frames_filter | 🎬Video 🚀GPU 🟢Stable | Filter to keep samples whose videos contain the given tags. 过滤器以保留其视频包含给定标签的样本。 | code | tests |
video_watermark_filter | 🎬Video 🚀GPU 🧩HF 🟢Stable | Filter to keep samples whose videos, with high probability, have no watermark. 过滤器,用于保留视频大概率没有水印的样本。 | code | tests |
word_repetition_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with word-level n-gram repetition ratio within a specific range. 过滤器将单词级n-gram重复比率的样本保持在特定范围内。 | code | tests |
words_num_filter | 🔤Text 💻CPU 🟢Stable | Filter to keep samples with total words number within a specific range. 过滤器,以保持总字数在特定范围内的样本。 | code | tests |
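
Most filters above share one pattern: compute a statistic per sample, then keep the sample if the statistic lies within a range. The sketch below illustrates that pattern for an alphanumeric ratio in plain Python. It is an assumption-laden illustration rather than Data-Juicer's code: `alnum_ratio`, `keep_in_range`, and the thresholds `0.25` and `1.0` are hypothetical, and the real operators' parameter names should be taken from the API documentation.

```python
def alnum_ratio(text):
    """Fraction of characters in the text that are letters or digits."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def keep_in_range(value, min_value, max_value):
    """Range check used to decide whether a sample is kept."""
    return min_value <= value <= max_value


samples = [
    {"text": "Hello world 123"},
    {"text": "!!! ??? ###"},
    {"text": ""},
]

# Thresholds are illustrative, not recommended defaults.
kept = [s for s in samples if keep_in_range(alnum_ratio(s["text"]), 0.25, 1.0)]
print(kept)
```

Separating the statistic from the range check makes it possible to inspect the computed values before committing to thresholds.
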
Operator 算子 | Tags 标签 | Description 描述 | Source code 源码 | Unit tests 单测样例 |
---|---|---|---|---|
csv_formatter | 🟢Stable | The class is used to load and format csv-type files. 类用于加载和格式化csv类型的文件。 | code | tests |
empty_formatter | 🟢Stable | The class is used to create empty data. 类用于创建空数据。 | code | tests |
json_formatter | 🔴Alpha | The class is used to load and format json-type files. 类用于加载和格式化json类型的文件。 | code | - |
local_formatter | 🟢Stable | The class is used to load a dataset from local files or local directory. 类用于从本地文件或本地目录加载数据集。 | code | tests |
mixture_formatter | 🟢Stable | The class mixes multiple datasets by randomly selecting samples from every dataset and merging them, and then exports the merged dataset as a new mixed dataset. 该类通过从每个数据集中随机选择样本并合并它们来混合多个数据集,然后将合并的数据集导出为新的混合数据集。 | code | tests |
parquet_formatter | 🟢Stable | The class is used to load and format parquet-type files. 该类用于加载和格式化 parquet 类型的文件。 | code | tests |
remote_formatter | 🟢Stable | The class is used to load a dataset from repository of huggingface hub. 该类用于从huggingface hub的存储库加载数据集。 | code | tests |
text_formatter | 🔴Alpha | The class is used to load and format text-type files. 类用于加载和格式化文本类型文件。 | code | - |
tsv_formatter | 🟢Stable | The class is used to load and format tsv-type files. 该类用于加载和格式化tsv类型的文件。 | code | tests |
Operator 算子 | Tags 标签 | Description 描述 | Source code 源码 | Unit tests 单测样例 |
---|---|---|---|---|
key_value_grouper | 🔤Text 💻CPU 🟢Stable | Group samples into batched samples according to the values of given keys. 根据给定键中的值将样本分组为批量样本。 | code | tests |
naive_grouper | 💻CPU 🟢Stable | Group all samples to one batched sample. 将所有样品分组为一批样品。 | code | tests |
naive_reverse_grouper | 💻CPU 🟢Stable | Split batched samples to samples. 将批处理的样品拆分为样品。 | code | tests |
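
The three groupers can be pictured with a small plain-Python sketch. It assumes, purely for illustration, that a batched sample is a dict whose values are lists with one entry per original sample; the function names mirror the operator names but are hypothetical stand-ins, not Data-Juicer's API.

```python
from collections import defaultdict


def to_batched(members):
    """Turn a list of samples into one batched sample (a dict of lists)."""
    return {key: [sample[key] for sample in members] for key in members[0]}


def key_value_group(samples, key):
    """Group samples sharing the same value under `key` into batched samples."""
    groups = defaultdict(list)
    for sample in samples:
        groups[sample[key]].append(sample)
    return [to_batched(members) for members in groups.values()]


def naive_group(samples):
    """Group all samples into a single batched sample."""
    return [to_batched(samples)]


def naive_reverse_group(batched_samples):
    """Split batched samples back into individual samples."""
    restored = []
    for batched in batched_samples:
        size = len(next(iter(batched.values())))
        restored.extend({k: v[i] for k, v in batched.items()} for i in range(size))
    return restored


samples = [
    {"text": "a", "lang": "en"},
    {"text": "b", "lang": "zh"},
    {"text": "c", "lang": "en"},
]
by_lang = key_value_group(samples, "lang")
round_trip = naive_reverse_group(naive_group(samples))
print(by_lang)
```

Grouping followed by the reverse grouper restores the original samples, which is why the two are typically placed around an aggregation step.
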
Operator 算子 | Tags 标签 | Description 描述 | Source code 源码 | Unit tests 单测样例 |
---|---|---|---|---|
audio_ffmpeg_wrapped_mapper | 📣Audio 💻CPU 🟢Stable | Simple wrapper for FFmpeg audio filters. FFmpeg音频滤波器的简单包装。 | code | tests |
calibrate_qa_mapper | 🔤Text 💻CPU 🔗API 🟢Stable | Mapper to calibrate question-answer pairs based on reference text. 映射器基于参考文本校准问题-答案对。 | code | tests |
calibrate_query_mapper | 💻CPU 🟢Stable | Mapper to calibrate query in question-answer pairs based on reference text. 映射器基于参考文本校准问答对中的查询。 | code | tests |
calibrate_response_mapper | 💻CPU 🟢Stable | Mapper to calibrate response in question-answer pairs based on reference text. 映射器基于参考文本校准问答对中的响应。 | code | tests |
chinese_convert_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to convert Chinese between Traditional Chinese, Simplified Chinese and Japanese Kanji. 映射器在繁体中文,简体中文和日语汉字之间转换中文。 | code | tests |
clean_copyright_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to clean copyright comments at the beginning of the text samples. 映射器清理文本样本开头的版权注释。 | code | tests |
clean_email_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to clean email in text samples. 映射器清理文本样本中的电子邮件。 | code | tests |
clean_html_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to clean html code in text samples. 映射器来清理文本示例中的html代码。 | code | tests |
clean_ip_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to clean ipv4 and ipv6 address in text samples. 映射器以清除文本示例中的ipv4和ipv6地址。 | code | tests |
clean_links_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to clean links like http/https/ftp in text samples. 映射器来清理链接,如文本示例中的http/https/ftp。 | code | tests |
dialog_intent_detection_mapper | 💻CPU 🔗API 🟢Stable | Mapper to generate user's intent labels in dialog. 映射器生成对话中用户的意图标签。 | code | tests |
dialog_sentiment_detection_mapper | 💻CPU 🔗API 🟢Stable | Mapper to generate user's sentiment labels in dialog. 映射器生成对话中用户的情绪标签。 | code | tests |
dialog_sentiment_intensity_mapper | 💻CPU 🔗API 🟢Stable | Mapper to predict user's sentiment intensity (from -5 to 5 in default prompt) in dialog. 映射器预测对话中用户的情绪强度 (在默认提示中从-5到5)。 | code | tests |
dialog_topic_detection_mapper | 💻CPU 🔗API 🟢Stable | Mapper to generate user's topic labels in dialog. 映射器生成对话中用户的主题标签。 | code | tests |
expand_macro_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to expand macro definitions in the document body of Latex samples. Mapper来扩展Latex示例文档主体中的宏定义。 | code | tests |
extract_entity_attribute_mapper | 🔤Text 💻CPU 🔗API 🟢Stable | Extract attributes for given entities from the text. 从文本中提取给定实体的属性。 | code | tests |
extract_entity_relation_mapper | 🔤Text 💻CPU 🔗API 🟢Stable | Extract entities and relations in the text for knowledge graph. 提取知识图谱的文本中的实体和关系。 | code | tests |
extract_event_mapper | 🔤Text 💻CPU 🔗API 🟢Stable | Extract events and relevant characters in the text. 提取文本中的事件和相关人物。 | code | tests |
extract_keyword_mapper | 🔤Text 💻CPU 🔗API 🟢Stable | Generate keywords for the text. 为文本生成关键字。 | code | tests |
extract_nickname_mapper | 🔤Text 💻CPU 🔗API 🟢Stable | Extract nickname relationship in the text. 提取文本中的昵称关系。 | code | tests |
extract_support_text_mapper | 🔤Text 💻CPU 🔗API 🟢Stable | Extract support sub text for a summary. 提取摘要的支持子文本。 | code | tests |
fix_unicode_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to fix unicode errors in text samples. 映射器修复文本示例中的unicode错误。 | code | tests |
generate_qa_from_examples_mapper | 🚀GPU 🌊vLLM 🧩HF 🟢Stable | Mapper to generate question and answer pairs from examples. 映射器从示例生成问题和答案对。 | code | tests |
generate_qa_from_text_mapper | 🔤Text 🚀GPU 🌊vLLM 🧩HF 🟢Stable | Mapper to generate question and answer pairs from text. 映射器从文本生成问题和答案对。 | code | tests |
image_blur_mapper | 🏞Image 💻CPU 🟢Stable | Mapper to blur images. 映射器来模糊图像。 | code | tests |
image_captioning_from_gpt4v_mapper | 🔮Multimodal 💻CPU 🔴Alpha | Mapper to generate samples whose texts are generated based on gpt-4-vision and the image. 映射器生成样本,其文本基于 gpt-4-vision 和图像生成。 | code | - |
image_captioning_mapper | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Mapper to generate samples whose captions are generated based on another model and the figure. 映射器生成样本,其标题是基于另一个模型和图生成的。 | code | tests |
image_diffusion_mapper | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Generate image by diffusion model. 通过扩散模型生成图像。 | code | tests |
image_face_blur_mapper | 🏞Image 💻CPU 🟢Stable | Mapper to blur faces detected in images. 映射器模糊图像中检测到的人脸。 | code | tests |
image_segment_mapper | 🏞Image 🚀GPU 🟢Stable | Perform segment-anything on images and return the bounding boxes. 在图像上执行segment-anything并返回边界框。 | code | tests |
image_tagging_mapper | 🏞Image 🚀GPU 🟢Stable | Mapper to generate image tags. 映射器生成图像标签。 | code | tests |
mllm_mapper | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Mapper to use MLLMs for visual question answering tasks. Mapper使用MLLMs进行视觉问答任务。 | code | tests |
nlpaug_en_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to simply augment samples in English based on nlpaug library. 映射器基于nlpaug库简单地增加英语样本。 | code | tests |
nlpcda_zh_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to simply augment samples in Chinese based on nlpcda library. 基于nlpcda库的映射器可以简单地增加中文样本。 | code | tests |
optimize_qa_mapper | 🚀GPU 🌊vLLM 🧩HF 🟢Stable | Mapper to optimize question-answer pairs. 映射器来优化问题-答案对。 | code | tests |
optimize_query_mapper | 🚀GPU 🟢Stable | Mapper to optimize query in question-answer pairs. 映射器来优化问答对中的查询。 | code | tests |
optimize_response_mapper | 🚀GPU 🟢Stable | Mapper to optimize response in question-answer pairs. 映射器来优化问答对中的响应。 | code | tests |
pair_preference_mapper | 🔤Text 💻CPU 🔗API 🟢Stable | Mapper to construct paired preference samples. 映射器来构造成对的偏好样本。 | code | tests |
punctuation_normalization_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to normalize unicode punctuations to English punctuations in text samples. 映射器将文本示例中的unicode标点规范化为英文标点。 | code | tests |
python_file_mapper | 💻CPU 🟢Stable | Mapper for executing Python function defined in a file. Mapper用于执行文件中定义的Python函数。 | code | tests |
python_lambda_mapper | 💻CPU 🟢Stable | Mapper for executing Python lambda function on data samples. 用于对数据示例执行Python lambda函数的映射器。 | code | tests |
query_intent_detection_mapper | 🚀GPU 🧩HF 🟢Stable | Mapper to predict user's intent label in query. 映射器预测查询中用户的意图标签。 | code | tests |
query_sentiment_detection_mapper | 🚀GPU 🧩HF 🟢Stable | Mapper to predict user's sentiment label ('negative', 'neutral' and 'positive') in query. 映射器预测查询中用户的情绪标签 (“负面”,“中性” 和 “正面”)。 | code | tests |
query_topic_detection_mapper | 🚀GPU 🧩HF 🟢Stable | Mapper to predict user's topic label in query. 映射器预测查询中用户的主题标签。 | code | tests |
relation_identity_mapper | 🔤Text 💻CPU 🔗API 🟢Stable | Identify the relation between two entities in the text. 确定文本中两个实体之间的关系。 | code | tests |
remove_bibliography_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to remove bibliography at the end of documents in Latex samples. 映射器删除Latex样本中文档末尾的参考书目。 | code | tests |
remove_comments_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to remove comments in different kinds of documents. 映射器删除不同类型的文档中的注释。 | code | tests |
remove_header_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to remove headers at the beginning of documents in Latex samples. 映射器删除Latex示例中文档开头的标题。 | code | tests |
remove_long_words_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to remove long words within a specific range. 映射器删除特定范围内的长词。 | code | tests |
remove_non_chinese_character_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to remove non-Chinese characters in text samples. 映射器删除文本样本中的非中文字符。 | code | tests |
remove_repeat_sentences_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to remove repeat sentences in text samples. 映射器删除文本样本中的重复句子。 | code | tests |
remove_specific_chars_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to clean specific chars in text samples. 映射器来清理文本样本中的特定字符。 | code | tests |
remove_table_text_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to remove table texts from text samples. 映射器从文本样本中删除表文本。 | code | tests |
remove_words_with_incorrect_substrings_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to remove words with incorrect substrings. 映射器删除不正确的子字符串的单词。 | code | tests |
replace_content_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to replace all content in the text that matches a specific regular expression pattern with a designated replacement string. 映射程序将文本中与特定正则表达式模式匹配的所有内容替换为指定的替换字符串。 | code | tests |
sdxl_prompt2prompt_mapper | 🔮Multimodal 🚀GPU 🟢Stable | Generate pairs of similar images by the SDXL model. 通过SDXL模型生成相似图像对。 | code | tests |
sentence_augmentation_mapper | 🔤Text 🚀GPU 🧩HF 🟢Stable | Mapper to augment sentences. 映射器来增加句子。 | code | tests |
sentence_split_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to split text samples to sentences. 映射器将文本样本拆分为句子。 | code | tests |
text_chunk_mapper | 🔤Text 💻CPU 🔗API 🟢Stable | Split input text to chunks. 将输入文本拆分为块。 | code | tests |
video_captioning_from_audio_mapper | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Mapper to caption a video according to its audio streams based on Qwen-Audio model. 映射器根据基于qwen-audio模型的音频流为视频添加字幕。 | code | tests |
video_captioning_from_frames_mapper | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Mapper to generate samples whose captions are generated based on an image-to-text model and sampled video frames. 映射器生成样本,其字幕是基于图像到文本模型和采样的视频帧生成的。 | code | tests |
video_captioning_from_summarizer_mapper | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Mapper to generate video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...). 映射器通过总结几种生成的文本 (来自视频/音频/帧的字幕,来自音频/帧的标签,...) 来生成视频字幕。 | code | tests |
video_captioning_from_video_mapper | 🔮Multimodal 🚀GPU 🧩HF 🟢Stable | Mapper to generate samples whose captions are generated based on a video-to-text model and sampled video frame. 映射器生成样本,其字幕是基于视频到文本模型和采样的视频帧生成的。 | code | tests |
video_extract_frames_mapper | 🔮Multimodal 💻CPU 🟢Stable | Mapper to extract frames from video files according to specified methods. 映射器根据指定的方法从视频文件中提取帧。 | code | tests |
video_face_blur_mapper | 🎬Video 💻CPU 🟢Stable | Mapper to blur faces detected in videos. 映射器模糊在视频中检测到的人脸。 | code | tests |
video_ffmpeg_wrapped_mapper | 🎬Video 💻CPU 🟢Stable | Simple wrapper for FFmpeg video filters. FFmpeg视频过滤器的简单包装。 | code | tests |
video_remove_watermark_mapper | 🎬Video 💻CPU 🟢Stable | Remove watermarks from given regions of videos. 删除视频给定区域中的水印。 | code | tests |
video_resize_aspect_ratio_mapper | 🎬Video 💻CPU 🟢Stable | Mapper to resize videos by aspect ratio. 映射器按纵横比调整视频大小。 | code | tests |
video_resize_resolution_mapper | 🎬Video 💻CPU 🟢Stable | Mapper to resize video resolution. 映射器来调整视频分辨率。 | code | tests |
video_split_by_duration_mapper | 🔮Multimodal 💻CPU 🟢Stable | Mapper to split video by duration. 映射器按持续时间分割视频。 | code | tests |
video_split_by_key_frame_mapper | 🔮Multimodal 💻CPU 🟢Stable | Mapper to split video by key frame. 映射器按关键帧分割视频。 | code | tests |
video_split_by_scene_mapper | 🔮Multimodal 💻CPU 🟢Stable | Mapper to cut videos into scene clips. 映射器将视频剪切成场景剪辑。 | code | tests |
video_tagging_from_audio_mapper | 🎬Video 🚀GPU 🧩HF 🟢Stable | Mapper to generate video tags from audio streams extracted from videos using the Audio Spectrogram Transformer. 映射器使用 Audio Spectrogram Transformer 从视频中提取的音频流生成视频标签。 | code | tests |
video_tagging_from_frames_mapper | 🎬Video 🚀GPU 🟢Stable | Mapper to generate video tags from frames extracted from videos. 映射器从视频中提取的帧生成视频标签。 | code | tests |
whitespace_normalization_mapper | 🔤Text 💻CPU 🟢Stable | Mapper to normalize different kinds of whitespaces to whitespace ' ' (0x20) in text samples. 映射器,用于将文本样本中的不同类型的空白字符标准化为空格 ' ' (0x20)。 | code | tests |
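
A mapper edits each sample and returns the edited sample. As a minimal sketch of that idea, the plain-Python function below replaces Unicode space separators with an ordinary space (0x20). It is not the implementation of `whitespace_normalization_mapper`: the choice of the Unicode category `Zs` as "different kinds of whitespace" and the helper names are assumptions made for this example.

```python
import unicodedata


def normalize_whitespace(text):
    """Replace every Unicode space separator (category Zs) with ' ' (0x20).

    Newlines and tabs are not in category Zs, so they are left untouched.
    """
    return "".join(" " if unicodedata.category(ch) == "Zs" else ch for ch in text)


def apply_mapper(samples, fn, key="text"):
    """Apply a text-editing function to one field of every sample."""
    return [{**sample, key: fn(sample[key])} for sample in samples]


samples = [{"text": "a\u00a0b\u3000c\nd", "id": 1}]
mapped = apply_mapper(samples, normalize_whitespace)
print(mapped)
```

Unlike a filter, the mapper never drops a sample; the number of samples going in equals the number coming out, and untouched fields such as `id` pass through unchanged.
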
Operator 算子 | Tags 标签 | Description 描述 | Source code 源码 | Unit tests 单测样例 |
---|---|---|---|---|
frequency_specified_field_selector | 💻CPU 🟢Stable | Selector to select samples based on the sorted frequency of specified field. 选择器根据指定字段的排序频率选择样本。 | code | tests |
random_selector | 💻CPU 🟢Stable | Selector to randomly select samples. 选择器来随机选择样本。 | code | tests |
range_specified_field_selector | 💻CPU 🟢Stable | Selector to select a range of samples based on the sorted specified field value from smallest to largest. 选择器根据从最小到最大的排序指定字段值选择样本范围。 | code | tests |
tags_specified_field_selector | 💻CPU 🟢Stable | Selector to select samples based on the tags of specified field. 选择器根据指定字段的标签选择样本。 | code | tests |
topk_specified_field_selector | 💻CPU 🟢Stable | Selector to select top samples based on the sorted specified field value. 选择器根据已排序的指定字段值选择顶部样本。 | code | tests |
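
Selectors operate on the dataset as a whole rather than on one sample at a time: they rank samples by a field and keep a slice of the ranking. The sketch below shows a top-k and a rank-range selection in plain Python. The function and parameter names (`topk_select`, `range_select`, `lower_rank`, `upper_rank`) are hypothetical and chosen for this example; consult the API documentation for the real operators' parameters.

```python
def topk_select(samples, field, k, reverse=True):
    """Keep the k samples ranked first by `field` (largest first by default)."""
    return sorted(samples, key=lambda s: s[field], reverse=reverse)[:k]


def range_select(samples, field, lower_rank, upper_rank):
    """Sort by `field` from smallest to largest and keep ranks [lower, upper)."""
    ordered = sorted(samples, key=lambda s: s[field])
    return ordered[lower_rank:upper_rank]


samples = [
    {"id": "a", "score": 0.2},
    {"id": "b", "score": 0.9},
    {"id": "c", "score": 0.5},
    {"id": "d", "score": 0.7},
]
top2 = topk_select(samples, "score", k=2)
middle = range_select(samples, "score", lower_rank=1, upper_rank=3)
print([s["id"] for s in top2], [s["id"] for s in middle])
```

Because the result depends on the ranking of all samples, a selector generally needs the field to be present on every sample before it runs, for example computed by an earlier filter or mapper.
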
We welcome contributions of new operators. Please refer to the How-to Guide for Developers.
我们欢迎社区贡献新的算子,具体请参考开发者指南。