Docs/bad data exhibition (#195)

* + add "bad" data exhibition
* + add these two new docs in the doc list in the readme
* * minor modification + test the hidden anchor
* + test the hidden anchor
* + test the hidden anchor
* + test the hidden anchor
* + add involved OPs for the exhibition
* + add MMC4 to "Bad" Data Exhibition
* * fix bugs for dj_to_mmc4/mmc4_to_dj tools: there might be multiple images matching to the same sentence
* * pip install before increase swapfile
* + add disk space checking logs
* * try to allocate swapfile on /mnt
* * try to store models in /mnt
* * try to store models in /mnt with sudo
* * try to store models in /mnt with sudo
* * try to store models in /mnt with sudo
* * restore the unit test process
modelscope · Feb 7, 2024 · 43be23f
1 parent 77c7559
commit 43be23f

Showing 9 changed files with 749 additions and 36 deletions.
diff --git a/.github/workflows/unit-test.yml b/.github/workflows/unit-test.yml
@@ -15,31 +15,40 @@ jobs:
 
     steps:
     - uses: actions/checkout@v3
+    - name: Check disk space
+      run: |
+        df -h
     - name: Cache data-juicer assets and models
       uses: actions/cache@v3
       with:
         path: ~/.cache/data_juicer
         key: dj-assets-models
+    - name: Check disk space
+      run: |
+        df -h
     - name: Set up Python 3.8
       uses: actions/setup-python@v3
       with:
         python-version: "3.8"
         cache: 'pip'
         cache-dependency-path: 'environments/**_requires.txt'
-    - name: Increase swapfile
+    - name: Check disk space
       run: |
         df -h
-        free -h
-        sudo swapoff -a
-        sudo fallocate -l 12G /swapfile
-        sudo chmod 600 /swapfile
-        sudo mkswap /swapfile
-        sudo swapon /swapfile
-        sudo swapon --show
     - name: Install dependencies
       run: |
         python -m pip install --upgrade pip
         pip install -v -e .[all]
+    - name: Increase swapfile
+      run: |
+        df -h
+        free -h
+        sudo swapoff -a
+        sudo fallocate -l 12G /mnt/swapfile
+        sudo chmod 600 /mnt/swapfile
+        sudo mkswap /mnt/swapfile
+        sudo swapon /mnt/swapfile
+        sudo swapon --show
     - name: Run the test
       run: |
         python tests/run.py
diff --git a/README.md b/README.md
@@ -300,6 +300,7 @@ docker exec -it <container_id> bash
 - [Operator Zoo](docs/Operators.md) | [算子库](docs/Operators_ZH.md)
 - [Configs](configs/README.md) | [配置系统](configs/README_ZH.md)
 - [Developer Guide](docs/DeveloperGuide.md) | [开发者指南](docs/DeveloperGuide_ZH.md)
+- ["Bad" Data Exhibition](docs/BadDataExhibition.md) | [“坏”数据展览](docs/BadDataExhibition_ZH.md)
 - Dedicated Toolkits | 专用工具箱
   - [Quality Classifier](tools/quality_classifier/README.md) | [质量分类器](tools/quality_classifier/README_ZH.md)
   - [Auto Evaluation](tools/evaluator/README.md) | [自动评测](tools/evaluator/README_ZH.md)

diff --git a/README_ZH.md b/README_ZH.md
@@ -277,6 +277,7 @@ docker exec -it <container_id> bash
 * [Operator Zoo](docs/Operators.md) | [算子库](docs/Operators_ZH.md)
 * [Configs](configs/README.md) | [配置系统](configs/README_ZH.md)
 * [Developer Guide](docs/DeveloperGuide.md) | [开发者指南](docs/DeveloperGuide_ZH.md)
+* ["Bad" Data Exhibition](docs/BadDataExhibition.md) | [“坏”数据展览](docs/BadDataExhibition_ZH.md)
 * Dedicated Toolkits | 专用工具箱
   * [Quality Classifier](tools/quality_classifier/README.md) | [质量分类器](tools/quality_classifier/README_ZH.md)
   * [Auto Evaluation](tools/evaluator/README.md) | [自动评测](tools/evaluator/README_ZH.md)

diff --git a/configs/config_all.yaml b/configs/config_all.yaml
@@ -192,7 +192,7 @@ process:
       lang: en                                                # compute perplexity in what language
       max_ppl: 1500                                           # the max perplexity score to filter text
   - phrase_grounding_recall_filter:                         # filter samples according to the locating recall of phrases extracted from text in the images.
-      hf_clip: openai/clip-vit-base-patch32                   # name of used Hugging Face Owl-ViT
+      hf_owlvit: openai/clip-vit-base-patch32                   # name of used Hugging Face Owl-ViT
       min_recall: 0.1                                         # the min phrase grounding recall of filter range
       max_recall: 1.0                                         # the max phrase grounding recall of filter range
       horizontal_flip: false                                  # flip image horizontally (left to right).

diff --git a/data_juicer/utils/mm_utils.py b/data_juicer/utils/mm_utils.py
@@ -69,6 +69,7 @@ def load_images(paths):
 def load_image(path):
     img_feature = Image()
     img = img_feature.decode_example(img_feature.encode_example(path))
+    img = img.convert('RGB')
     return img
 
 

diff --git a/docs/BadDataExhibition.md b/docs/BadDataExhibition.md
diff --git a/docs/BadDataExhibition_ZH.md b/docs/BadDataExhibition_ZH.md
diff --git a/tools/multimodal/data_juicer_format_to_target_format/dj_to_mmc4.py b/tools/multimodal/data_juicer_format_to_target_format/dj_to_mmc4.py
@@ -233,31 +233,32 @@ def main(
 
                     # remove possible image_special_token and update
                     # matched_text_index for corresponding image_info
-                    found_image = False
-                    if sent.startswith(image_special_token):
+                    found_image_num = 0
+                    while sent.startswith(image_special_token):
                         sent = sent[len(image_special_token):].strip()
-                        found_image = True
+                        found_image_num += 1
                         if sent.startswith(sent_seperator):
                             sent = sent[len(sent_seperator):].strip()
-                    elif sent.endswith(image_special_token):
+                    while sent.endswith(image_special_token):
                         sent = sent[:-len(image_special_token)].strip()
-                        found_image = True
+                        found_image_num += 1
                         if sent.endswith(sent_seperator):
                             sent = sent[:-len(sent_seperator)].strip()
                     sentences.append(sent)
-                    if found_image:
-                        if curr_image_idx < len(image_infos):
-                            image_infos[curr_image_idx][
-                                'matched_text_index'] = text_idx
-                            curr_image_idx += 1
-                        else:
-                            # if there are extra images, just skip them and
-                            # report a warning
-                            logger.warning(f'Sample with line number '
-                                           f'[{line_num}] contains unaligned '
-                                           f'numbers of images and image '
-                                           f'tokens. Please check and retry '
-                                           f'if needed.')
+                    if found_image_num > 0:
+                        for _ in range(found_image_num):
+                            if curr_image_idx < len(image_infos):
+                                image_infos[curr_image_idx][
+                                    'matched_text_index'] = text_idx
+                                curr_image_idx += 1
+                            else:
+                                # if there are extra images, just skip them and
+                                # report a warning
+                                logger.warning(f'Sample with line number '
+                                               f'[{line_num}] contains '
+                                               f'unaligned numbers of images '
+                                               f'and image tokens. Please '
+                                               f'check and retry if needed.')
 
                 # convert image_name to relative paths
                 if convert_to_relative_paths:
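
The hunk above replaces a single `if`/`elif` check with two `while` loops so that several image tokens attached to one sentence are all counted. A minimal standalone sketch of that logic follows. The function names `strip_image_tokens` and `align_images`, the default token `<__dj__image>`, and the single-space separator are assumptions made for illustration, not the tool's actual interface. The sketch also guards against an empty separator, which the diff does not do.

```python
def strip_image_tokens(sent, image_special_token='<__dj__image>',
                       sent_seperator=' '):
    """Strip leading/trailing image tokens; return (sentence, token count)."""
    found_image_num = 0
    # Leading tokens: keep stripping until none is left at the front.
    while sent.startswith(image_special_token):
        sent = sent[len(image_special_token):].strip()
        found_image_num += 1
        # Extra guard (not in the diff): skip an empty separator.
        if sent_seperator and sent.startswith(sent_seperator):
            sent = sent[len(sent_seperator):].strip()
    # Trailing tokens: same idea from the end of the sentence.
    while sent.endswith(image_special_token):
        sent = sent[:-len(image_special_token)].strip()
        found_image_num += 1
        if sent_seperator and sent.endswith(sent_seperator):
            sent = sent[:-len(sent_seperator)].strip()
    return sent, found_image_num


def align_images(raw_sentences, image_infos):
    """Assign matched_text_index to images in order of token appearance.

    Returns the cleaned sentences and the number of tokens that had no
    image left to match (the case where the tool logs a warning).
    """
    sentences = []
    curr_image_idx = 0
    unaligned = 0
    for text_idx, raw in enumerate(raw_sentences):
        sent, found_image_num = strip_image_tokens(raw)
        sentences.append(sent)
        for _ in range(found_image_num):
            if curr_image_idx < len(image_infos):
                image_infos[curr_image_idx]['matched_text_index'] = text_idx
                curr_image_idx += 1
            else:
                unaligned += 1
    return sentences, unaligned


infos = [{'image_name': 'a.jpg'}, {'image_name': 'b.jpg'},
         {'image_name': 'c.jpg'}]
raw = ['<__dj__image> <__dj__image> a cat', 'a dog <__dj__image>']
sents, unaligned = align_images(raw, infos)
```

With the old `if`/`elif` version, the first sentence would have kept its second token in the text and only one image would have been matched to it.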

diff --git a/tools/multimodal/source_format_to_data_juicer_format/mmc4_to_dj.py b/tools/multimodal/source_format_to_data_juicer_format/mmc4_to_dj.py
@@ -210,24 +210,30 @@ def main(
                 img_idx = 0
                 new_sents = []
                 for sent_idx, sent in enumerate(sentences):
-                    if img_idx < len(image_infos) and image_infos[img_idx][
+                    # find the matched sentence of the current image
+                    image_num_this_sent = 0
+                    while img_idx < len(image_infos) and image_infos[img_idx][
                             'matched_text_index'] == sent_idx:
-                        # find the matched sentence of the current image,
-                        # insert a image_special_token to specific position.
+                        image_num_this_sent += 1
+                        img_idx += 1
+
+                    if image_num_this_sent > 0:
+                        # insert several image_special_tokens to specific
+                        # position.
+                        image_special_tokens = sent_seperator.join(
+                            [image_special_token] * image_num_this_sent)
                         if image_special_token_insert_pos == 'before':
-                            sent = image_special_token + sent_seperator + sent
+                            sent = image_special_tokens + sent_seperator + sent
                         elif image_special_token_insert_pos == 'after':
-                            sent += sent_seperator + image_special_token
+                            sent += sent_seperator + image_special_tokens
                         else:
                             if random.random() < 0.5:
                                 # before
-                                sent = image_special_token + sent_seperator \
+                                sent = image_special_tokens + sent_seperator \
                                        + sent
                             else:
                                 # after
-                                sent += sent_seperator + image_special_token
-                        # check the next img_idx
-                        img_idx += 1
+                                sent += sent_seperator + image_special_tokens
                     new_sents.append(sent)
 
                 join_sep = f' {eoc_special_token}{sent_seperator}'
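
The reverse direction, shown in the hunk above, first counts every image whose `matched_text_index` equals the current sentence index and then inserts that many tokens at once. The following is a rough sketch of that step in isolation. The function name `insert_image_tokens`, the default token, and the separator value are hypothetical. It also assumes `image_infos` is ordered by `matched_text_index`, which the sequential scan in the diff appears to rely on.

```python
import random


def insert_image_tokens(sentences, image_infos,
                        image_special_token='<__dj__image>',
                        sent_seperator=' ', insert_pos='before'):
    """Attach one image token per matched image to each sentence."""
    img_idx = 0
    new_sents = []
    for sent_idx, sent in enumerate(sentences):
        # Count all consecutive images matched to this sentence.
        image_num_this_sent = 0
        while (img_idx < len(image_infos)
               and image_infos[img_idx]['matched_text_index'] == sent_idx):
            image_num_this_sent += 1
            img_idx += 1

        if image_num_this_sent > 0:
            tokens = sent_seperator.join(
                [image_special_token] * image_num_this_sent)
            if insert_pos == 'before':
                sent = tokens + sent_seperator + sent
            elif insert_pos == 'after':
                sent = sent + sent_seperator + tokens
            elif random.random() < 0.5:
                # Any other value: pick a side at random, as in the diff.
                sent = tokens + sent_seperator + sent
            else:
                sent = sent + sent_seperator + tokens
        new_sents.append(sent)
    return new_sents


infos = [{'matched_text_index': 0}, {'matched_text_index': 0},
         {'matched_text_index': 1}]
before = insert_image_tokens(['a cat', 'a dog'], infos, insert_pos='before')
after = insert_image_tokens(['a cat', 'a dog'], infos, insert_pos='after')
```

Before this change, the `if` consumed only one image per sentence, so a second image matched to the same sentence stalled `img_idx` and every later image lost its token.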