Updates to rasp-data-loading.ipynb #6
Hi Tom, thanks so much for looking at the example. I am a little busy at the moment preparing for my PhD defense in a week. After that I will have more time to look at things. I just wanted to ask whether it would be helpful to have a larger sample of data?
No worries, my update isn't really close to being done yet. I'm going to run through the entire training next (hopefully this weekend).
A larger dataset (something that doesn't fit in memory on a single machine) would be interesting, but no rush on that.
Is there a convenient way for me to share the dataset with you (several hundred GB)? I currently do not have a good option.
Absolutely we can host the data!
On Mar 8, 2019, Noah D Brenowitz wrote:
Maybe this is something that Pangeo would consider hosting. What do you think @jhamman @rabernat?
Otherwise, you could write a function to make a mock dataset with the same variable names and shapes etc.
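For the mock-dataset option Noah mentions, something along these lines could work (just a sketch; the variable names, shapes, and the `sample`/`lev` dimension labels below are placeholders, not the real file's schema):

```python
import numpy as np
import xarray as xr

def make_mock_dataset(n_samples=1024, n_lev=30, seed=0):
    """Build a small synthetic dataset with the same structure as the real one.

    The variable names and shapes here are placeholders; swap in the actual
    names and dimensions from the real training files.
    """
    rng = np.random.default_rng(seed)
    return xr.Dataset(
        {
            "var2d_a": (("sample", "lev"), rng.standard_normal((n_samples, n_lev))),
            "var2d_b": (("sample", "lev"), rng.standard_normal((n_samples, n_lev))),
            "var1d_a": (("sample",), rng.standard_normal(n_samples)),
        },
        coords={"sample": np.arange(n_samples), "lev": np.arange(n_lev)},
    )

# Write it to disk so the notebook's loading code can be exercised end to end.
make_mock_dataset().to_netcdf("mock_training_data.nc")
```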
I'm playing with the example from #2. See https://nbviewer.jupyter.org/gist/TomAugspurger/f23c5342bef938a120b83a11d1cae077 for the updates.
On this subset, the dask + xarray overhead relative to h5py is about 2x, which I think is pretty encouraging. It seems like it'll be common to make a pre-processing pass over the data before writing it back to disk in a form that's friendly to the deep learning framework. In this case, the 2x overhead is for processing a single sample at a time. With a little effort, we'll be able to process batches of samples at once, which I suspect will give us better parallelism.
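Roughly the batched-reading pattern I have in mind (just a sketch, not the notebook's generator; the `sample` dimension name, the variable list, and the default batch size are assumptions):

```python
import xarray as xr

def iterate_batches(path, variables, batch_size=512):
    """Lazily stream batches of samples with xarray + dask.

    Assumes the file has a "sample" dimension shared by all requested
    variables; the names here are placeholders for the real ones.
    """
    ds = xr.open_dataset(path, chunks={"sample": batch_size})[variables]
    n_samples = ds.sizes["sample"]
    for start in range(0, n_samples, batch_size):
        # .load() materializes only this slice, so dask can overlap the
        # read for one batch with downstream work on the previous one.
        yield ds.isel(sample=slice(start, start + batch_size)).load()
```

Each yielded Dataset covers one batch; converting it to whatever arrays the deep learning framework wants would happen in the same loop.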
Before I get too much further, can an xarray user check my work in https://nbviewer.jupyter.org/gist/TomAugspurger/f23c5342bef938a120b83a11d1cae077#XArray-based-Generator?
I also haven't done any real profiling yet, beyond glancing at the scheduler dashboard. We're getting good parallel reads, and computation overlaps with reading. But since we're only processing a single sample right now, there isn't much room for parallelism yet.
Thanks for the very clear examples @raspstephan.