Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shuffle protein data on first load and then store in memmap (on disk) instead of in memory. #54

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

georgeamccarthy
Copy link
Owner

@georgeamccarthy georgeamccarthy commented Aug 10, 2021

PR type

  • 🏆 Enhancements
  • 🏠 Internal

Purpose

  • Shuffles proteins on first index.
  • Uses memmap to store proteins instead of keeping them in memory due to large size (~50 MB+ for ~100k proteins).

Why?

  • Currently embeddings are computed sequentially with increasing PDB ID. This makes embedding a fraction of the dataset index favour lower PDB ID which makes evaluating metric similarity less meaningful.
  • Memmap is preferred for large doc arrays.
  • Helps avoid memory issues on small deployment server.

Feedback required over

  • A quick pair of 👀 on the code

Mentions

Future work

  • Currently using pandas to shuffle the data. One could use the jina built in .shuffle (see cookbook). However I couldn't get this working properly.

References

  • Previous meeting with Jina AI devs.

Legal

@georgeamccarthy
Copy link
Owner Author

Added a feature to log number of culled proteins.

@fissoreg
Copy link
Collaborator

Future work

  • Currently using pandas to shuffle the data. One could use the jina built in .shuffle (see cookbook). However I couldn't get this working properly.

Apparently the shuffle method is a recent addition: jina-ai/serve@2302e45

It will work if you upgrade:

pip install --upgrade jina

@georgeamccarthy
Copy link
Owner Author

Great find! TODO :)

@fissoreg fissoreg requested review from fissoreg and removed request for fissoreg August 14, 2021 09:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants