Updates to rasp-data-loading.ipynb #6

Open
TomAugspurger opened this issue Mar 7, 2019 · 5 comments
TomAugspurger (Member) commented Mar 7, 2019

I'm playing with the example from #2. See https://nbviewer.jupyter.org/gist/TomAugspurger/f23c5342bef938a120b83a11d1cae077 for the updates.

On this subset, it seems like the dask + xarray overhead over h5py is about 2x. I think this is pretty encouraging. It seems like it'll be common to make a pre-processing pass over the data to do a bunch of stuff before writing the data back to disk in a form that's friendly to the deep learning framework. In this case, the overhead is 2x for a single sample. With a little effort, we'll be able to process batches of samples at once, which I suspect will give us better parallelism.

Before I get too much further, can an xarray user check my work in https://nbviewer.jupyter.org/gist/TomAugspurger/f23c5342bef938a120b83a11d1cae077#XArray-based-Generator?

import xarray as xr


class DataGenerator2(DataGenerator):  # DataGenerator is defined earlier in the notebook

    def __getitem__(self, index):
        # Index arrays for one sample.
        time, lat, lon = self.get_indices(index)
        # DataArray indexers sharing the 'z' dim trigger pointwise
        # (vectorized) selection rather than an outer product.
        subset = self.ds.isel(time=xr.DataArray(time, dims='z'),
                              lat=xr.DataArray(lat, dims='z'),
                              lon=xr.DataArray(lon, dims='z'))
        # Stack the selected variables along a new 'lev' dimension.
        X = xr.concat(subset[self.input_vars].to_array(), dim='lev')
        y = xr.concat(subset[self.output_vars].to_array(), dim='lev')

        return X, y
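
For anyone checking that indexing logic, here is a tiny self-contained sketch (the toy dataset is made up for illustration, not the real data) showing that xr.DataArray indexers sharing a 'z' dimension make isel select pointwise rather than taking an outer product:

    import numpy as np
    import xarray as xr

    # Toy dataset just for illustration.
    ds = xr.Dataset(
        {'t': (('time', 'lat', 'lon'), np.arange(24).reshape(2, 3, 4))})

    time = xr.DataArray([0, 1], dims='z')
    lat = xr.DataArray([1, 2], dims='z')
    lon = xr.DataArray([0, 3], dims='z')

    # Pointwise selection: the result has a single 'z' dim of length 2,
    # i.e. the points (0, 1, 0) and (1, 2, 3) -- not a 2x2x2 cube.
    sel = ds.isel(time=time, lat=lat, lon=lon)
    print(sel['t'].dims, sel['t'].values)  # ('z',) [ 4 23]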

I also haven't done any real profiling yet, beyond glancing at the scheduler dashboard. We're getting good parallel reading and computation overlapping with reading. But since we're just processing a single sample right now, there isn't too much room for parallelism yet.
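
To make the batching idea above concrete, a hypothetical sketch of a batched variant; get_indices_batch and batch_size are assumptions for illustration, not part of the notebook:

    class BatchedDataGenerator(DataGenerator2):

        def __getitem__(self, index):
            # Assumed helper: get_indices_batch returns one flat set of
            # index arrays covering batch_size samples at once, so a single
            # vectorized isel call gives dask more work per task.
            time, lat, lon = self.get_indices_batch(index, self.batch_size)
            subset = self.ds.isel(time=xr.DataArray(time, dims='z'),
                                  lat=xr.DataArray(lat, dims='z'),
                                  lon=xr.DataArray(lon, dims='z'))
            X = xr.concat(subset[self.input_vars].to_array(), dim='lev')
            y = xr.concat(subset[self.output_vars].to_array(), dim='lev')
            return X, y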

Thanks for the very clear examples @raspstephan.

raspstephan (Collaborator) commented:

Hi Tom,

Thanks so much for looking at the example. I am a little busy at the moment preparing for my PhD defense in a week. After that, I will have more time to look at things.

I just wanted to ask whether it would be helpful to have a larger sample of data.

TomAugspurger (Member, Author) commented Mar 8, 2019 via email

raspstephan (Collaborator) commented:

Is there a convenient way for me to share the dataset with you (several hundred GB)? I currently do not have a good option.

nbren12 (Collaborator) commented Mar 8, 2019

Maybe this is something that Pangeo would consider hosting. What do you think @jhamman @rabernat?

Otherwise, you could write a function to make a mock dataset with the same variable names and shapes etc.
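
Something along these lines, for instance; the variable names and dimension sizes below are placeholders, to be swapped for the ones in the real files:

    import numpy as np
    import xarray as xr

    def make_mock_dataset(n_time=100, n_lat=64, n_lon=128, n_lev=30, seed=0):
        """Build a random dataset mimicking the structure of the real one.

        Variable names, dims, and sizes here are placeholders only.
        """
        rng = np.random.default_rng(seed)
        dims = ('time', 'lev', 'lat', 'lon')
        shape = (n_time, n_lev, n_lat, n_lon)
        data_vars = {
            name: (dims, rng.standard_normal(shape).astype('float32'))
            for name in ['var_a', 'var_b', 'var_c']  # placeholder names
        }
        coords = {
            'time': np.arange(n_time),
            'lev': np.arange(n_lev),
            'lat': np.linspace(-90, 90, n_lat),
            'lon': np.linspace(0, 360, n_lon, endpoint=False),
        }
        return xr.Dataset(data_vars, coords=coords)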

rabernat (Member) commented Mar 9, 2019 via email
