+ add documents about OP docs update and fusible OP development. (#121)

* + add documents about OP docs update and fusible OP development.
* * replace the Chinese strings in the English docs

modelscope · Dec 12, 2023 · 48c081e
1 parent 66eda06
commit 48c081e

Showing 2 changed files with 272 additions and 13 deletions.
diff --git a/docs/DeveloperGuide.md b/docs/DeveloperGuide.md
@@ -2,7 +2,8 @@
 
 * [How-to Guide for Developers](#how-to-guide-for-developers)
    * [Coding Style](#coding-style)
-   * [Build your own ops](#build-your-own-ops)
+   * [Build your own OPs](#build-your-own-ops)
+      * [(Optional) Make your OP fusible](#optional-make-your-op-fusible)
    * [Build your own configs](#build-your-own-configs)
       * [Fruitful config sources & Type hints](#fruitful-config-sources--type-hints)
       * [Hierarchical configs and helps](#hierarchical-configs-and-helps)
@@ -35,22 +36,22 @@ dependencies of pre-commit are consistent with the project configuration
 (which can be completed through `pre-commit clean` and `pre-commit install`); 
 and ② execute `pre-commit run --all-files` before push.
 
-## Build your own ops
+## Build your own OPs
 
-- Data-Juicer allows everybody to build their own ops.
-- Before implementing a new op, please refer to [Operators](Operators.md) to avoid unnecessary duplication.
+- Data-Juicer allows everybody to build their own OPs.
+- Before implementing a new OP, please refer to [Operators](Operators.md) to avoid unnecessary duplication.
 - Assuming we want to add a new Filter operator called "TextLengthFilter" to get corpus of expected text length, we can follow these steps to build it.
 
-1. (Optional) Add a new StatsKeys in `data_juicer/utils/constant.py` to store the statistical variable of the new op.
+1. (Optional) Add a new StatsKeys in `data_juicer/utils/constant.py` to store the statistical variable of the new OP.
 
 ```python
 class StatsKeys(object):
     ...              # other keys
     text_len = 'text_len'
 ```
 
-2. Create a new op file `text_length_filter.py` in the corresponding `data_juicer/ops/filter/` directory as follows.
-   - Because it's a Filter op, so the new op needs to inherit from the basic `Filter` class in the `base_op.py`, and be decorated with `OPERATORS` to register itself automatically.
+2. Create a new OP file `text_length_filter.py` in the corresponding `data_juicer/ops/filter/` directory as follows.
+   - Because it's a Filter OP, the new OP needs to inherit from the basic `Filter` class in `base_op.py` and be decorated with `OPERATORS` to register itself automatically.
 
 ```python
 import sys
@@ -103,27 +104,27 @@ class TextLengthFilter(Filter):
             return False
 ```
 
-3. After implemention, add it to the op dictionary in the `__init__.py` file in `data_juicer/ops/filter/` directory.
+3. After implementation, add it to the OP dictionary in the `__init__.py` file in the `data_juicer/ops/filter/` directory.
 
 ```python
-from . import (...,              # other ops
-               text_length_filter)  # import this new op module
+from . import (...,              # other OPs
+               text_length_filter)  # import this new OP module
 ```
 
-4. Now you can use this new op with custom arguments in your own config files!
+4. Now you can use this new OP with custom arguments in your own config files!
 
 ```yaml
 # other configs
 ...
 
 # process configs
 process:
-  - text_length_filter:  # add this op to your process list and set the parameters
+  - text_length_filter:  # add this OP to your process list and set the parameters
       min_len: 10
       max_len: 1000
 ```
 
-5. (Strongly Recommend) It's better to add corresponding tests for your own ops. For `TextLengthFilter` above, you would like to add `test_text_length_filter.py` into `tests/ops/filter/` directory as below.
+5. (Strongly Recommended) Add corresponding tests for your own OPs. For `TextLengthFilter` above, add `test_text_length_filter.py` to the `tests/ops/filter/` directory as below.
 
 ```python
 import unittest
@@ -141,6 +142,142 @@ class TextLengthFilterTest(unittest.TestCase):
         pass
 ```
 
+6. (Strongly Recommended) To help other users find and use the new OP, we also need to add its information to
+the corresponding documents, including the following docs:
+   1. `configs/config_all.yaml`: this complete config file lists all OPs and their arguments, serving as an
+   important reference for all available OPs. Therefore, after adding the new OP, we need to add it to the process
+   list (grouped by OP type and sorted in alphabetical order):
+
+   ```yaml
+   ...
+   - stopwords_filter:                                       # filter text with stopword ratio smaller than a specific min value
+       lang: en                                                # consider stopwords in what language
+       tokenization: false                                     # whether to use model to tokenize documents
+       min_ratio: 0.3                                          # the min ratio to filter text
+       stopwords_dir: ./assets                                 # directory to store stopwords dictionaries
+       use_words_aug: false                                    # whether to augment words, especially for Chinese and Vietnamese
+       words_aug_group_sizes: [2]                              # the group size of words to augment
+       words_aug_join_char: ""                                 # the join char between words to augment
+   - text_length_filter:                                     # filter text with length out of specific range
+       min_len: 10                                             # the min length of filter range
+       max_len: 10000                                          # the max length of filter range
+   - token_num_filter:                                       # filter text with total token number out of specific range
+       hf_tokenizer: EleutherAI/pythia-6.9b-deduped            # name of used Hugging Face tokenizer
+       min_num: 10                                             # the min number of filter range
+       max_num: 10000                                          # the max number of filter range
+   ...
+   ```
+
+   2. `docs/Operators.md`: this doc maintains categorized lists of available OPs. We can add the new OP's information to the list
+   for the corresponding OP type (sorted in alphabetical order). At the same time, in the Overview section at the top of this doc,
+   we also need to update the number of OPs for the corresponding OP type:
+
+   ```markdown
+   ## Overview
+   ...
+   | [ Filter ]( #filter )             |   21 (+1 HERE)   | Filters out low-quality samples                 |
+   ...
+   ## Filter <a name="filter"/>
+   ...
+   | suffix_filter                  | General | en, zh | Keeps samples with specified suffixes                                                      |
+   | text_length_filter             | General | en, zh | Keeps samples with total text length within the specified range                            |
+   | token_num_filter               | General | en, zh | Keeps samples with token count within the specified range                                  |
+   ...
+   ```
+
+   3. `docs/Operators_ZH.md`: this doc is the Chinese version of `docs/Operators.md` (item 2 above), so we need to update the Chinese content at
+   the same positions.
+
+### (Optional) Make your OP fusible
+
+- If the new OP computes intermediate variables that other existing OPs also compute, it can be
+added to the fusible OPs to accelerate the whole data processing with OP fusion (e.g. both `word_num_filter`
+and `word_repetition_filter` need to split the input text into words).
+- When OP fusion is enabled, these repeated computations and intermediate variables can be shared through the `context` between
+OPs, thus reducing repeated calculations.
+- OPs that contain common intermediate variables can be made fusible through the following steps (taking `word_num_filter` as an example):
+
+1. (Optional) If a new intermediate variable is generated in the new OP, we need to add this new intermediate variable name to 
+the `InterVars` class in `utils/constant.py`. In general, we need to add a prefix `DEFAULT_PREFIX` before the name.
+
+```python
+class InterVars(object):
+    # text
+    lines = DEFAULT_PREFIX + 'lines'
+    words = DEFAULT_PREFIX + 'words'  # add the new intermediate variable here
+    ...
+```
+
+2. (Optional) We need to define a registry group in `ops/op_fusion.py` for the new intermediate variable from step 1, and add
+this registry group to the list that stores all registry groups of intermediate variables. This lets the OP fusion module
+track the OPs involving these intermediate variables.
+
+```python
+...
+# Type of intermediate vars
+# text
+INTER_LINES = Registry(InterVars.lines)
+INTER_WORDS = Registry(InterVars.words)  # define registry group for the new intermediate variable
+
+# images
+LOADED_IMAGES = Registry(InterVars.loaded_images)
+
+# all
+ALL_INTER_VARS = [INTER_LINES, INTER_WORDS, LOADED_IMAGES]  # and add it to the registry group list
+...
+```
+
+3. Before the OP class definition that involves the intermediate variable, register this OP in the registry group corresponding
+to this intermediate variable, indicating that the intermediate variable may be calculated and used in this OP.
+
+```python
+...
+@OPERATORS.register_module(OP_NAME)
+@INTER_WORDS.register_module(OP_NAME)  # register this new OP into the registry group
+class WordNumFilter(Filter):
+...
+```
+
+4. In the calculation process of this intermediate variable of the new OP, we can modify the calculation logic to:
+   1. If the argument `context` is True, OP fusion is enabled, so we first try to get the value of this intermediate variable
+   from `context`, where a previous OP may already have stored it.
+   2. If this intermediate variable doesn't exist in the `context`, this OP is the first to calculate it,
+   so we need to define a unique key and use it to store the intermediate variable in the `context` for subsequent OPs after
+   it's calculated by this new OP.
+   3. If the argument `context` is False, just follow the normal calculation process.
+
+```python
+# before modification
+...
+tokenizer = get_model(self.model_key,
+                      lang=self.lang,
+                      model_type='sentencepiece')
+words = get_words_from_document(
+    sample[self.text_key],
+    token_func=tokenizer.encode_as_pieces if tokenizer else None)
+...        
+
+# after modification
+...
+words_key = f'{InterVars.words}-{self.model_key}'
+if context and words_key in sample[Fields.context]:
+    # get the value of intermediate variable from context directly
+    words = sample[Fields.context][words_key]
+else:
+    # normal calculation process
+    tokenizer = get_model(self.model_key,
+                          lang=self.lang,
+                          model_type='sentencepiece')
+    words = get_words_from_document(
+        sample[self.text_key],
+        token_func=tokenizer.encode_as_pieces if tokenizer else None)
+    if context:
+        # After calculating the intermediate variable for the first time,
+        # store it in the context for subsequent OPs.
+        sample[Fields.context][words_key] = words
+...
+```
+
 ## Build your own configs
 - We provide easy configuration based on [jsonargparse](https://github.com/omni-us/jsonargparse/) to reduce cost for boilerplate codes.
 

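The context-sharing logic from step 4 of the fusible OP section can be sketched in a self-contained form. This is only an illustration under stated assumptions, not Data-Juicer's implementation: `CONTEXT_FIELD` and `WORDS_PREFIX` are hypothetical stand-ins for `Fields.context` and `InterVars.words`, `split_words` stands in for the real tokenizer, and `CountingTokenizer` exists only to show how often the tokenizer actually runs.

```python
from typing import Callable, Dict, List

CONTEXT_FIELD = "__context__"  # hypothetical stand-in for Fields.context
WORDS_PREFIX = "__words__"     # hypothetical stand-in for InterVars.words


def split_words(text: str) -> List[str]:
    """Toy tokenizer standing in for the real word-splitting helper."""
    return text.split()


class CountingTokenizer:
    """Wraps a tokenizer function and counts how often it is invoked."""

    def __init__(self, func: Callable[[str], List[str]]):
        self.func = func
        self.calls = 0

    def __call__(self, text: str) -> List[str]:
        self.calls += 1
        return self.func(text)


def get_words(sample: Dict, tokenizer, model_key: str, context: bool = False):
    """Return the words of a sample, reusing the shared context if enabled."""
    words_key = f"{WORDS_PREFIX}-{model_key}"
    if context and words_key in sample[CONTEXT_FIELD]:
        # A previous OP already computed the words: reuse them.
        return sample[CONTEXT_FIELD][words_key]
    # Normal calculation process.
    words = tokenizer(sample["text"])
    if context:
        # First computation: store the result for subsequent OPs.
        sample[CONTEXT_FIELD][words_key] = words
    return words


def word_num(sample: Dict, tokenizer, context: bool = False) -> int:
    """Toy stat of a word-count filter."""
    return len(get_words(sample, tokenizer, "toy", context))


def word_repetition_ratio(sample: Dict, tokenizer, context: bool = False) -> float:
    """Toy stat of a word-repetition filter: share of repeated words."""
    words = get_words(sample, tokenizer, "toy", context)
    return 1 - len(set(words)) / len(words) if words else 0.0


tokenizer = CountingTokenizer(split_words)
sample = {"text": "a b a c", CONTEXT_FIELD: {}}
num = word_num(sample, tokenizer, context=True)
ratio = word_repetition_ratio(sample, tokenizer, context=True)
print(num, ratio, tokenizer.calls)
```

With `context=True` the second OP finds the words under the shared key and skips tokenization; with `context=False` each OP tokenizes on its own and the context stays empty. Including the model key in the cache key keeps OPs that use different tokenizers from reading each other's results.
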
diff --git a/docs/DeveloperGuide_ZH.md b/docs/DeveloperGuide_ZH.md
@@ -3,6 +3,7 @@
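 * [Developer Guide](#开发者指南)
    * [Coding Style](#编码规范)
    * [Build your own OPs](#构建自己的算子)
+      * [(Optional) Make your OP fusible](#可选使新算子可以进行算子融合)
    * [Build your own configs](#构建自己的配置)
       * [Fruitful config sources & Type hints](#丰富的配置源和类型提示)
       * [Hierarchical configs and helps](#层次化的配置和帮助)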
@@ -137,6 +138,127 @@ class TextLengthFilterTest(unittest.TestCase):
         pass
 ```
 
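+6. (Strongly Recommended) To make the new OP easier for other users to use, we also need to add its information to the corresponding documents, specifically:
+   1. `configs/config_all.yaml`: this complete config file keeps a list of all OPs and their arguments and serves as an important reference for the available OPs. Therefore, after adding a new OP, add it to the process list in this file (grouped by OP type and sorted alphabetically):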
+
+   ```yaml
+   ...
+   - stopwords_filter:                                       # filter text with stopword ratio smaller than a specific min value
+       lang: en                                                # consider stopwords in what language
+       tokenization: false                                     # whether to use model to tokenize documents
+       min_ratio: 0.3                                          # the min ratio to filter text
+       stopwords_dir: ./assets                                 # directory to store stopwords dictionaries
+       use_words_aug: false                                    # whether to augment words, especially for Chinese and Vietnamese
+       words_aug_group_sizes: [2]                              # the group size of words to augment
+       words_aug_join_char: ""                                 # the join char between words to augment
+   - text_length_filter:                                     # filter text with length out of specific range
+       min_len: 10                                             # the min length of filter range
+       max_len: 10000                                          # the max length of filter range
+   - token_num_filter:                                       # filter text with total token number out of specific range
+       hf_tokenizer: EleutherAI/pythia-6.9b-deduped            # name of used Hugging Face tokenizer
+       min_num: 10                                             # the min number of filter range
+       max_num: 10000                                          # the max number of filter range
+   ...
+   ```
+
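+   2. `docs/Operators.md`: this doc maintains categorized lists of the available OPs. We can add the new OP's information to the list for the corresponding OP type (sorted alphabetically). At the same time, in the Overview section at the top of the doc, we also need to update the number of available OPs for that type: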
+
+   ```markdown
+   ## Overview
+   ...
+   | [ Filter ]( #filter )             |   21 (+1 HERE)   | Filters out low-quality samples                 |
+   ...
+   ## Filter <a name="filter"/>
+   ...
+   | suffix_filter                  | General | en, zh | Keeps samples with specified suffixes                                                      |
+   | text_length_filter             | General | en, zh | Keeps samples with total text length within the specified range                            |
+   | token_num_filter               | General | en, zh | Keeps samples with token count within the specified range                                  |
+   ...
+   ```
+
+   3. `docs/Operators_ZH.md`: this doc is the Chinese version of `docs/Operators.md` (item 2 above), so the Chinese content at the same positions needs to be updated.
+
+### (Optional) Make your OP fusible
+
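+- If part of the intermediate-variable computation in our new OP duplicates that of existing OPs, the new OP can be added to the fusible OPs so that OP fusion accelerates data processing (e.g. both `word_num_filter` and `word_repetition_filter` need to split the input text into words).
+- When OP fusion is enabled, these repeated computations and intermediate variables can be shared through the `context` between OPs, reducing repeated calculations.
+- OPs that contain common intermediate variables can be made fusible through the following steps (taking the `word_num_filter` OP as an example).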
+
+1. (Optional) If the new OP produces a new intermediate variable, add the new variable's name to the `InterVars` class in `utils/constant.py`. Usually the name needs the `DEFAULT_PREFIX` prefix.
+
+```python
+class InterVars(object):
+    # text
+    lines = DEFAULT_PREFIX + 'lines'
+    words = DEFAULT_PREFIX + 'words'  # add the new intermediate variable here
+    ...
+```
+
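+2. (Optional) The new intermediate variable added in step 1 also needs a registry group defined for it in `ops/op_fusion.py`, and that group must be added to the list holding all registry groups, so that the OP fusion module can track the OPs involving these intermediate variables.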
+
+```python
+...
+# Type of intermediate vars
+# text
+INTER_LINES = Registry(InterVars.lines)
+INTER_WORDS = Registry(InterVars.words)  # define a registry group for the new intermediate variable
+
+# images
+LOADED_IMAGES = Registry(InterVars.loaded_images)
+
+# all
+ALL_INTER_VARS = [INTER_LINES, INTER_WORDS, LOADED_IMAGES]  # and add it to the registry group list
+...
+```
+
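+3. Before the definition of an OP that involves this intermediate variable, register the OP in the registry group corresponding to the variable, indicating that the OP may calculate and use it.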
+
+```python
+...
+@OPERATORS.register_module(OP_NAME)
+@INTER_WORDS.register_module(OP_NAME)  # register this OP into the registry group
+class WordNumFilter(Filter):
+...
+```
+
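+4. Where the OP calculates this intermediate variable, the calculation logic can be changed to:
+   1. If the argument `context` is True, OP fusion is enabled, so first try to get from `context` the value of this intermediate variable already calculated by a previous OP.
+   2. If the intermediate variable does not exist in `context`, this OP is the first to calculate it; after the calculation, define a unique key and store the value in `context` under it for subsequent OPs.
+   3. If the argument `context` is False, follow the normal calculation process.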
+
+```python
+# before modification
+...
+tokenizer = get_model(self.model_key,
+                      lang=self.lang,
+                      model_type='sentencepiece')
+words = get_words_from_document(
+    sample[self.text_key],
+    token_func=tokenizer.encode_as_pieces if tokenizer else None)
+...        
+
+# after modification
+...
+words_key = f'{InterVars.words}-{self.model_key}'
+if context and words_key in sample[Fields.context]:
+    # get the value of the intermediate variable from context directly
+    words = sample[Fields.context][words_key]
+else:
+    # normal calculation process
+    tokenizer = get_model(self.model_key,
+                          lang=self.lang,
+                          model_type='sentencepiece')
+    words = get_words_from_document(
+        sample[self.text_key],
+        token_func=tokenizer.encode_as_pieces if tokenizer else None)
+    if context:
+        # after calculating the intermediate variable for the first time, store it in the context for subsequent OPs
+        sample[Fields.context][words_key] = words
+...
+```
+
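+- At this point, once OP fusion is enabled, this OP can automatically be fused with other OPs and share their common intermediate variables, reducing repeated calculations and speeding up the overall data processing.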
+
 ## Build your own configs
 
 - We provide easy configuration based on [jsonargparse](https://github.com/omni-us/jsonargparse/) to reduce the cost of boilerplate code.
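
The registry-group idea from steps 2 and 3 of the fusible OP section can be illustrated with a small standalone sketch. Everything here is an assumption made for illustration: this `Registry` class, the placeholder OP classes, and the `fusible_groups` helper are hypothetical and do not reproduce the actual code in `ops/op_fusion.py`. The sketch only shows how registering an OP under an intermediate-variable group makes it possible to find which OPs in a process list could share that variable.

```python
from typing import Dict, List


class Registry:
    """Minimal name-to-class registry (hypothetical, for illustration)."""

    def __init__(self, name: str):
        self.name = name
        self.modules: Dict[str, type] = {}

    def register_module(self, module_name: str):
        """Return a class decorator that records the class under a name."""
        def decorator(cls: type) -> type:
            if module_name in self.modules:
                raise KeyError(f"{module_name} already in {self.name}")
            self.modules[module_name] = cls
            return cls
        return decorator


OPERATORS = Registry("operators")
INTER_WORDS = Registry("__words__")  # group for the 'words' variable
ALL_INTER_VARS = [INTER_WORDS]       # list of all intermediate-variable groups


@OPERATORS.register_module("word_num_filter")
@INTER_WORDS.register_module("word_num_filter")
class WordNumFilter:
    """Placeholder OP that needs the split words."""


@OPERATORS.register_module("word_repetition_filter")
@INTER_WORDS.register_module("word_repetition_filter")
class WordRepetitionFilter:
    """Placeholder OP that also needs the split words."""


@OPERATORS.register_module("text_length_filter")
class TextLengthFilter:
    """Placeholder OP that uses no shared intermediate variable."""


def fusible_groups(process_list: List[str]) -> Dict[str, List[str]]:
    """Map each intermediate variable to the listed OPs that share it."""
    groups = {}
    for group in ALL_INTER_VARS:
        members = [name for name in process_list if name in group.modules]
        if len(members) > 1:  # sharing only pays off with two or more OPs
            groups[group.name] = members
    return groups


groups = fusible_groups(
    ["text_length_filter", "word_num_filter", "word_repetition_filter"])
print(groups)
```

Stacking the two decorators registers the same class in both the general OP registry and the intermediate-variable group, so an OP opts in to fusion simply by adding one decorator line.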