In Chapter 3 (Biomedical Concept Alignment Data) of the paper "LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day", it is mentioned that "We sample 600K image-text pairs from PMC-15M".
However, the actual data (llava_med_alignment_500k.json) provided in the GitHub repository only contains 500K pairs. Where did the remaining 100K pairs go?
Total entries: 467710
Present images: 467336
Missing images: 374
in the "500k" file. and as demonstrated above, the script used to download fails to fetch a portion of the articles which results in further missing images. and the script is extremely slow even with parallelizing to 200 threads : ) took like 4 days.