bioresources should be standard text files, version controlled, directly readable, etc. #743
Comments
I've been dreaming about this for a long time! Looking at the files with vi has worked for me without decompression, but comparing versions for diffs is really a pain. I think the only issues are the file size limit and the fact that, at the level of interacting with the repo itself, things would get more bulky and a bit slower (if there are large diffs being carried around in the git history).
This was my call at the time because of file size limits in GitHub. If we can uncompress the files and still push them, I am all in favor!
I did a quick test: the repo's size is 778 MB with gzipped files and 992 MB without compression. This is barely under the 1 GB limit of the free tier. However, I am not sure whether pushing unzipped files will accumulate sizes due to versioning, or whether the quota is computed from the size of HEAD. I will fork it and test on my personal account. Another option is to use GitHub's Large File Storage (LFS) and pay $5 a month. That gives us 50 GB of storage for the repo.
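For anyone who wants to reproduce the size comparison without actually decompressing anything, here is a minimal sketch (the directory layout is hypothetical) that streams each `.gz` file to measure its decompressed size and sums both footprints:

```python
import gzip
import os


def uncompressed_size(path: str) -> int:
    """Return the decompressed size of a gzip file in bytes, streamed in chunks."""
    total = 0
    with gzip.open(path, "rb") as fh:
        while chunk := fh.read(1 << 20):  # 1 MiB at a time; avoids loading the file whole
            total += len(chunk)
    return total


def repo_footprint(root: str) -> tuple[int, int]:
    """Sum on-disk (gzipped) vs. decompressed sizes for all .gz files under root."""
    gz_total, plain_total = 0, 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            if name.endswith(".gz"):
                full = os.path.join(dirpath, name)
                gz_total += os.path.getsize(full)
                plain_total += uncompressed_size(full)
    return gz_total, plain_total
```

Running `repo_footprint("bioresources")` (path assumed) would give the two numbers to compare against the quota.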
Ah, I wasn't aware of a per-repo limit. Are you sure? https://docs.github.com/en/github/managing-large-files/what-is-my-disk-quota and https://stackoverflow.com/questions/38768454/repository-size-limits-for-github-com mention other numbers. When I checked the LFS possibility before, there was a troublesome data transfer limit to worry about. Even if there is enough space, moving the data back and forth might still be a problem.
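For reference, if LFS were adopted, the tracking rules would live in a `.gitattributes` file checked into the repo; something like the following (the `kb/*.tsv` pattern is hypothetical, just for illustration):

```
kb/*.tsv filter=lfs diff=lfs merge=lfs -text
```

This is what `git lfs track "kb/*.tsv"` writes. Note that LFS-tracked files are fetched from a separate store, so the bandwidth quota mentioned above applies to every clone and fetch, not just pushes.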
You're right @kwalcock. It is a recommended size. I did the test in my personal fork. Uncompressed all the Of course, this will require some refactoring of
There are probably good reasons that the bioresources are stored in gzip files, but maybe it's time to revisit them. It is incredibly difficult (for people spoiled by large hard drives and fast network connections, etc.) to do very useful things with them like observe how they have changed over time or even just read them. Only one of the files, uniprot-proteins.tsv, expands to a size larger than the 100MB limit that GitHub imposes. Although there are probably other repercussions, it's just a text file and could be easily split into two parts. If need be, there are ways to create gzip files for deployment during the packaging process. The files we have in kb/ner aren't very large, so it seems like that shouldn't be necessary. It would be so great if they were just there like all the other files.
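Since only uniprot-proteins.tsv exceeds GitHub's 100 MB hard limit, splitting it at a line boundary would be enough. A minimal sketch (the `.partN` naming scheme is hypothetical; a real script would pick names the loading code expects):

```python
def split_tsv(path: str, max_bytes: int = 50 * 1024 * 1024) -> list[str]:
    """Split a TSV into numbered parts, each at most max_bytes,
    breaking only at line boundaries so every row stays intact."""
    parts: list[str] = []
    out = None
    written = 0
    part_num = 0
    with open(path, "rb") as src:
        for line in src:
            # Start a new part before the first line, or when this line would overflow.
            if out is None or written + len(line) > max_bytes:
                if out is not None:
                    out.close()
                part_num += 1
                name = f"{path}.part{part_num}"
                out = open(name, "wb")
                parts.append(name)
                written = 0
            out.write(line)
            written += len(line)
    if out is not None:
        out.close()
    return parts
```

Concatenating the parts in order reproduces the original file byte for byte, so the loading code would only need to read two files instead of one.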