Reduce C4 load time with subset from Nota's S3 #15

Open. Wants to merge 2 commits into base: main.

bokyeong1015 commented Sep 13, 2024

Purpose of This PR

Speed up loading of the C4 data, which is used as calibration data for GPTQ.

  • C4 load time on an Intel(R) Xeon(R) Gold 6326 CPU @ 2.90GHz: reduced from 1784 s to 1 s.

Changes in src/dataset.py

  • Removed the slow resolution of the data file en/c4-train.00000-of-01024.json.gz.
  • Instead, a subset of the first C4-train split is now downloaded from Nota's S3 bucket:
    • 256 randomly selected long sequences (length > 3072).
    • To see the text data, please visit this link.
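The subset described above could be prepared and consumed roughly as follows. This is a minimal sketch, not the code in src/dataset.py: the helper names `fetch_once` and `select_long_samples`, the seed, and the `length_fn` parameter are all hypothetical, and the PR does not say whether "length" counts tokens or characters, so the length measure is left pluggable. The S3 URL is not given in this description, so none is hard-coded.

```python
import os
import random
import urllib.request


def fetch_once(url: str, dest: str) -> str:
    """Download `url` to `dest` only if `dest` does not exist yet.

    Hypothetical helper: later runs reuse the local copy and skip the
    network entirely.
    """
    if not os.path.exists(dest):
        os.makedirs(os.path.dirname(dest) or ".", exist_ok=True)
        urllib.request.urlretrieve(url, dest)
    return dest


def select_long_samples(texts, n_samples=256, min_len=3072, seed=0, length_fn=len):
    """Randomly pick `n_samples` items whose length exceeds `min_len`.

    `length_fn` is an assumption: pass a tokenizer-based counter if the
    threshold is meant in tokens rather than characters.
    """
    long_texts = [t for t in texts if length_fn(t) > min_len]
    if len(long_texts) < n_samples:
        raise ValueError(
            f"only {len(long_texts)} items longer than {min_len}, need {n_samples}"
        )
    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    return rng.sample(long_texts, n_samples)
```

Selecting the 256 sequences once and publishing the result means each user downloads a small file instead of resolving and filtering the full first split on every run.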

Reproducibility Check

| Model Size | PPL-Wiki2 | PPL-PTB | ACC-AVG | BoolQ | PIQA | HellaSwag | WinoGrande | ARC-e | ARC-c | OBQA |
|---|---|---|---|---|---|---|---|---|---|---|
| 5.5B (Paper) | 15.1 | 59.3 | 60.6 | 69.7 | 75.9 | 68.9 | 63.9 | 68.5 | 38.5 | 38.6 |
| 5.5B (PR #11) | 15.2 | 59.9 | 60.9 | 69.7 | 75.5 | 69.1 | 65.0 | 69.0 | 39.0 | 39.0 |
| 5.5B (This PR) | 15.2 | 59.4 | 60.8 | 70.4 | 75.2 | 69.1 | 64.3 | 68.4 | 39.2 | 39.2 |
| 3.7B (Paper) | 16.6 | 61.5 | 57.1 | 63.8 | 74.5 | 62.7 | 61.0 | 65.8 | 34.2 | 37.8 |
| 3.7B (PR #11) | 16.5 | 60.8 | 56.9 | 64.3 | 74.2 | 62.4 | 61.2 | 65.7 | 33.0 | 37.2 |
| 3.7B (This PR) | 16.6 | 61.9 | 56.7 | 63.1 | 74.5 | 62.6 | 60.9 | 65.6 | 33.9 | 36.6 |
| 2.7B (Paper) | 17.7 | 64.7 | 54.6 | 61.9 | 73.1 | 58.4 | 58.8 | 62.5 | 31.8 | 35.6 |
| 2.7B (PR #11) | 17.7 | 64.5 | 54.5 | 62.1 | 73.3 | 58.7 | 57.1 | 61.4 | 30.3 | 38.2 |
| 2.7B (This PR) | 17.7 | 65.2 | 54.4 | 61.5 | 73.0 | 58.1 | 58.0 | 62.4 | 30.6 | 36.8 |
| 1.5B (Paper) | 21.4 | 80.0 | 48.5 | 48.9 | 70.1 | 48.8 | 54.1 | 55.7 | 26.8 | 35.0 |
| 1.5B (PR #11) | 21.1 | 80.1 | 48.8 | 52.5 | 70.1 | 48.5 | 54.6 | 56.2 | 27.4 | 32.6 |
| 1.5B (This PR) | 21.3 | 80.0 | 48.5 | 50.9 | 69.9 | 48.7 | 54.2 | 55.9 | 27.0 | 32.8 |
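As a quick sanity check on the numbers above, ACC-AVG appears to be the unweighted mean of the seven task accuracies, rounded to one decimal. The PR does not state this definition, so treat it as an assumption; the sketch below only confirms that it matches two of the 5.5B rows. The dictionary and function names are illustrative.

```python
from statistics import mean

# The seven zero-shot tasks that follow ACC-AVG in the results above.
TASKS = ["BoolQ", "PIQA", "HellaSwag", "WinoGrande", "ARC-e", "ARC-c", "OBQA"]

# Values copied from the 5.5B rows.
paper_5_5b = {
    "BoolQ": 69.7, "PIQA": 75.9, "HellaSwag": 68.9, "WinoGrande": 63.9,
    "ARC-e": 68.5, "ARC-c": 38.5, "OBQA": 38.6,
}
this_pr_5_5b = {
    "BoolQ": 70.4, "PIQA": 75.2, "HellaSwag": 69.1, "WinoGrande": 64.3,
    "ARC-e": 68.4, "ARC-c": 39.2, "OBQA": 39.2,
}


def acc_avg(scores):
    """Unweighted mean over TASKS, rounded to one decimal (assumed definition)."""
    return round(mean(scores[task] for task in TASKS), 1)


print(acc_avg(paper_5_5b))    # reported ACC-AVG for 5.5B (Paper): 60.6
print(acc_avg(this_pr_5_5b))  # reported ACC-AVG for 5.5B (This PR): 60.8
```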
