Docs/bad data exhibition (#195)
* + add "bad" data exhibition

* + add these two new docs in the doc list in the readme

* * minor modification
+ test the hidden anchor

* + test the hidden anchor

* + test the hidden anchor

* + test the hidden anchor

* + add involved OPs for the exhibition

* + add MMC4 to "Bad" Data Exhibition

* * fix bugs for dj_to_mmc4/mmc4_to_dj tools: there might be multiple images matching to the same sentence

* * pip install before increase swapfile

* + add disk space checking logs

* * try to allocate swapfile on /mnt

* * try to store models in /mnt

* * try to store models in /mnt with sudo

* * try to store models in /mnt with sudo

* * try to store models in /mnt with sudo

* * restore the unit test process
HYLcool authored Feb 7, 2024
1 parent 77c7559 commit 43be23f
Showing 9 changed files with 749 additions and 36 deletions.
25 changes: 17 additions & 8 deletions .github/workflows/unit-test.yml
@@ -15,31 +15,40 @@ jobs:

steps:
- uses: actions/checkout@v3
- name: Check disk space
run: |
df -h
- name: Cache data-juicer assets and models
uses: actions/cache@v3
with:
path: ~/.cache/data_juicer
key: dj-assets-models
- name: Check disk space
run: |
df -h
- name: Set up Python 3.8
uses: actions/setup-python@v3
with:
python-version: "3.8"
cache: 'pip'
cache-dependency-path: 'environments/**_requires.txt'
- name: Increase swapfile
- name: Check disk space
run: |
df -h
free -h
sudo swapoff -a
sudo fallocate -l 12G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
sudo swapon --show
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install -v -e .[all]
- name: Increase swapfile
run: |
df -h
free -h
sudo swapoff -a
sudo fallocate -l 12G /mnt/swapfile
sudo chmod 600 /mnt/swapfile
sudo mkswap /mnt/swapfile
sudo swapon /mnt/swapfile
sudo swapon --show
- name: Run the test
run: |
python tests/run.py
1 change: 1 addition & 0 deletions README.md
@@ -300,6 +300,7 @@ docker exec -it <container_id> bash
- [Operator Zoo](docs/Operators.md) | [算子库](docs/Operators_ZH.md)
- [Configs](configs/README.md) | [配置系统](configs/README_ZH.md)
- [Developer Guide](docs/DeveloperGuide.md) | [开发者指南](docs/DeveloperGuide_ZH.md)
- ["Bad" Data Exhibition](docs/BadDataExhibition.md) | [“坏”数据展览](docs/BadDataExhibition_ZH.md)
- Dedicated Toolkits | 专用工具箱
- [Quality Classifier](tools/quality_classifier/README.md) | [质量分类器](tools/quality_classifier/README_ZH.md)
- [Auto Evaluation](tools/evaluator/README.md) | [自动评测](tools/evaluator/README_ZH.md)
1 change: 1 addition & 0 deletions README_ZH.md
@@ -277,6 +277,7 @@ docker exec -it <container_id> bash
* [Operator Zoo](docs/Operators.md) | [算子库](docs/Operators_ZH.md)
* [Configs](configs/README.md) | [配置系统](configs/README_ZH.md)
* [Developer Guide](docs/DeveloperGuide.md) | [开发者指南](docs/DeveloperGuide_ZH.md)
* ["Bad" Data Exhibition](docs/BadDataExhibition.md) | [“坏”数据展览](docs/BadDataExhibition_ZH.md)
* Dedicated Toolkits | 专用工具箱
* [Quality Classifier](tools/quality_classifier/README.md) | [质量分类器](tools/quality_classifier/README_ZH.md)
* [Auto Evaluation](tools/evaluator/README.md) | [自动评测](tools/evaluator/README_ZH.md)
2 changes: 1 addition & 1 deletion configs/config_all.yaml
@@ -192,7 +192,7 @@ process:
lang: en # compute perplexity in what language
max_ppl: 1500 # the max perplexity score to filter text
- phrase_grounding_recall_filter: # filter samples according to the locating recall of phrases extracted from text in the images.
hf_clip: openai/clip-vit-base-patch32 # name of used Hugging Face Owl-ViT
hf_owlvit: openai/clip-vit-base-patch32 # name of used Hugging Face Owl-ViT
min_recall: 0.1 # the min phrase grounding recall of filter range
max_recall: 1.0 # the max phrase grounding recall of filter range
horizontal_flip: false # flip image horizontally (left to right).
1 change: 1 addition & 0 deletions data_juicer/utils/mm_utils.py
@@ -69,6 +69,7 @@ def load_images(paths):
def load_image(path):
img_feature = Image()
img = img_feature.decode_example(img_feature.encode_example(path))
img = img.convert('RGB')
return img


350 changes: 350 additions & 0 deletions docs/BadDataExhibition.md

Large diffs are not rendered by default.

344 changes: 344 additions & 0 deletions docs/BadDataExhibition_ZH.md

Large diffs are not rendered by default.

37 changes: 19 additions & 18 deletions tools/multimodal/data_juicer_format_to_target_format/dj_to_mmc4.py
@@ -233,31 +233,32 @@ def main(

# remove possible image_special_token and update
# matched_text_index for corresponding image_info
found_image = False
if sent.startswith(image_special_token):
found_image_num = 0
while sent.startswith(image_special_token):
sent = sent[len(image_special_token):].strip()
found_image = True
found_image_num += 1
if sent.startswith(sent_seperator):
sent = sent[len(sent_seperator):].strip()
elif sent.endswith(image_special_token):
while sent.endswith(image_special_token):
sent = sent[:-len(image_special_token)].strip()
found_image = True
found_image_num += 1
if sent.endswith(sent_seperator):
sent = sent[:-len(sent_seperator)].strip()
sentences.append(sent)
if found_image:
if curr_image_idx < len(image_infos):
image_infos[curr_image_idx][
'matched_text_index'] = text_idx
curr_image_idx += 1
else:
# if there are extra images, just skip them and
# report a warning
logger.warning(f'Sample with line number '
f'[{line_num}] contains unaligned '
f'numbers of images and image '
f'tokens. Please check and retry '
f'if needed.')
if found_image_num > 0:
for _ in range(found_image_num):
if curr_image_idx < len(image_infos):
image_infos[curr_image_idx][
'matched_text_index'] = text_idx
curr_image_idx += 1
else:
# if there are extra images, just skip them and
# report a warning
logger.warning(f'Sample with line number '
f'[{line_num}] contains '
f'unaligned numbers of images '
f'and image tokens. Please '
f'check and retry if needed.')

# convert image_name to relative paths
if convert_to_relative_paths:
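
The hunk above replaces a single `if`/`elif` check with `while` loops, so that a sentence carrying several image tokens advances `curr_image_idx` once per token. The following is a minimal standalone sketch of that counting logic, not the tool's actual code: the function names, the `<image>` token, and the single-space separator are assumptions chosen for illustration, and the sketch returns a count of unmatched tokens where the tool logs a warning.

```python
def strip_image_tokens(sent, image_special_token="<image>", sent_seperator=" "):
    """Strip leading and trailing image tokens from one sentence.

    Returns the cleaned sentence and the number of tokens removed.
    The default token and separator are illustrative assumptions;
    the separator must be non-empty.
    """
    found_image_num = 0
    sent = sent.strip()
    # A sentence may start with several tokens, so loop instead of
    # checking only once.
    while sent.startswith(image_special_token):
        sent = sent[len(image_special_token):].strip()
        found_image_num += 1
        if sent.startswith(sent_seperator):
            sent = sent[len(sent_seperator):].strip()
    # Same treatment for tokens appended after the sentence.
    while sent.endswith(image_special_token):
        sent = sent[:-len(image_special_token)].strip()
        found_image_num += 1
        if sent.endswith(sent_seperator):
            sent = sent[:-len(sent_seperator)].strip()
    return sent, found_image_num


def assign_matched_text_index(sentences, image_infos):
    """Attach each image to the sentence whose tokens referenced it.

    Images are consumed in order; tokens beyond the available images
    are counted and returned instead of being logged.
    """
    curr_image_idx = 0
    unmatched = 0
    cleaned = []
    for text_idx, sent in enumerate(sentences):
        sent, found_image_num = strip_image_tokens(sent)
        cleaned.append(sent)
        for _ in range(found_image_num):
            if curr_image_idx < len(image_infos):
                image_infos[curr_image_idx]["matched_text_index"] = text_idx
                curr_image_idx += 1
            else:
                unmatched += 1
    return cleaned, unmatched
```

With the old single check, the second token of a sentence such as `<image> <image> a cat and a dog` would stay in the text and the second image would be matched to a later sentence or not at all.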
24 changes: 15 additions & 9 deletions tools/multimodal/source_format_to_data_juicer_format/mmc4_to_dj.py
@@ -210,24 +210,30 @@ def main(
img_idx = 0
new_sents = []
for sent_idx, sent in enumerate(sentences):
if img_idx < len(image_infos) and image_infos[img_idx][
# find the matched sentence of the current image
image_num_this_sent = 0
while img_idx < len(image_infos) and image_infos[img_idx][
'matched_text_index'] == sent_idx:
# find the matched sentence of the current image,
# insert a image_special_token to specific position.
image_num_this_sent += 1
img_idx += 1

if image_num_this_sent > 0:
# insert several image_special_tokens to specific
# position.
image_special_tokens = sent_seperator.join(
[image_special_token] * image_num_this_sent)
if image_special_token_insert_pos == 'before':
sent = image_special_token + sent_seperator + sent
sent = image_special_tokens + sent_seperator + sent
elif image_special_token_insert_pos == 'after':
sent += sent_seperator + image_special_token
sent += sent_seperator + image_special_tokens
else:
if random.random() < 0.5:
# before
sent = image_special_token + sent_seperator \
sent = image_special_tokens + sent_seperator \
+ sent
else:
# after
sent += sent_seperator + image_special_token
# check the next img_idx
img_idx += 1
sent += sent_seperator + image_special_tokens
new_sents.append(sent)

join_sep = f' {eoc_special_token}{sent_seperator}'
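
The reverse direction shown in this hunk can be sketched in the same way: count every image whose `matched_text_index` points at the current sentence, then insert that many tokens in one go. This is a simplified illustration rather than the tool's implementation. The function name, the `<image>` token, and the default separator are assumptions, and, like the loop above, it assumes `image_infos` is ordered by `matched_text_index`.

```python
import random


def insert_image_tokens(sentences, image_infos,
                        image_special_token="<image>",
                        sent_seperator=" ",
                        insert_pos="before"):
    """Insert one image token per matched image into each sentence.

    insert_pos is 'before', 'after', or anything else for a random
    choice per sentence. Token and separator defaults are illustrative.
    """
    img_idx = 0
    new_sents = []
    for sent_idx, sent in enumerate(sentences):
        # Consume every image matched to this sentence, not only the first.
        image_num_this_sent = 0
        while (img_idx < len(image_infos)
               and image_infos[img_idx]["matched_text_index"] == sent_idx):
            image_num_this_sent += 1
            img_idx += 1

        if image_num_this_sent > 0:
            tokens = sent_seperator.join(
                [image_special_token] * image_num_this_sent)
            if insert_pos == "before":
                place_before = True
            elif insert_pos == "after":
                place_before = False
            else:
                place_before = random.random() < 0.5
            if place_before:
                sent = tokens + sent_seperator + sent
            else:
                sent = sent + sent_seperator + tokens
        new_sents.append(sent)
    return new_sents
```

Before this change, `img_idx` advanced by at most one per sentence, so a second image matched to the same sentence would never equal a later `sent_idx` and all following images would lose their tokens.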