biomedical concept alignment data #85

hddbang · 2024-07-24T07:23:10Z

In Chapter 3 (Biomedical Concept Alignment Data) of the paper "LLaVA-Med: Training a Large Language-and-Vision
Assistant for Biomedicine in One Day", the authors state: "We sample 600K image-text pairs from PMC-15M".

However, the actual data (llava_med_alignment_500k.json) provided in the GitHub repository only contains 500k pairs. Where did the remaining 100k pairs go?

alyakin314 · 2024-08-12T16:35:15Z

Furthermore, the "500k" file actually contains only:

Total entries: 467710
Present images: 467336
Missing images: 374

And as demonstrated above, the download script fails to fetch a portion of the articles, which results in further missing images. The script is also extremely slow: even parallelized to 200 threads, it took about 4 days : )
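For anyone who wants to run the same check on their own download, here is a minimal sketch of how counts like the ones above could be produced. It assumes the JSON file is a list of objects, that each object stores its image filename under an `image` key, and that the images sit in a single directory. The key name, the directory layout, and the `count_images` helper are all assumptions for illustration, not taken from the repository.

```python
import json
from pathlib import Path


def count_images(json_path, image_dir, image_key="image"):
    """Count entries in the alignment file and how many of their images exist on disk.

    Assumption: the file is a JSON list of dicts, and each dict holds the
    image filename (relative to image_dir) under `image_key`.
    """
    entries = json.loads(Path(json_path).read_text(encoding="utf-8"))
    image_dir = Path(image_dir)

    # An image counts as present only if a regular file exists at the expected path.
    present = sum(1 for entry in entries if (image_dir / entry[image_key]).is_file())

    return {
        "total": len(entries),
        "present": present,
        "missing": len(entries) - present,
    }


# Example call (both paths are hypothetical):
# stats = count_images("llava_med_alignment_500k.json", "data/images")
# print(f"Total entries: {stats['total']}")
# print(f"Present images: {stats['present']}")
# print(f"Missing images: {stats['missing']}")
```

If the entries in your copy use a different field for the image filename, pass it as `image_key`.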
