Updates to rasp-data-loading.ipynb #6

Open
TomAugspurger opened this issue Mar 7, 2019 · 5 comments
TomAugspurger (Member) commented Mar 7, 2019

I'm playing with the example from #2. See https://nbviewer.jupyter.org/gist/TomAugspurger/f23c5342bef938a120b83a11d1cae077 for the updates.

On this subset, it seems like the dask + xarray overhead over h5py is about 2x. I think this is pretty encouraging. It seems like it'll be common to make a pre-processing pass over the data to do a bunch of stuff before writing the data back to disk in a form that's friendly to the deep learning framework. In this case, the overhead is 2x for a single sample. With a little effort, we'll be able to process batches of samples at once, which I suspect will give us better parallelism.

Before I get too much further, can an xarray user check my work in https://nbviewer.jupyter.org/gist/TomAugspurger/f23c5342bef938a120b83a11d1cae077#XArray-based-Generator?

import xarray as xr


class DataGenerator2(DataGenerator):  # DataGenerator is defined earlier in the notebook

    def __getitem__(self, index):
        # Index arrays for one sample.
        time, lat, lon = self.get_indices(index)
        # DataArray indexers sharing the 'z' dim trigger pointwise
        # (vectorized) selection rather than an outer product.
        subset = self.ds.isel(time=xr.DataArray(time, dims='z'),
                              lat=xr.DataArray(lat, dims='z'),
                              lon=xr.DataArray(lon, dims='z'))
        # Stack the selected variables along a new 'lev' dimension.
        X = xr.concat(subset[self.input_vars].to_array(), dim='lev')
        y = xr.concat(subset[self.output_vars].to_array(), dim='lev')

        return X, y
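
For anyone checking that indexing logic, here is a tiny self-contained sketch (the toy dataset is made up for illustration, not the real data) showing that xr.DataArray indexers sharing a 'z' dimension make isel select pointwise rather than taking an outer product:

    import numpy as np
    import xarray as xr

    # Toy dataset just for illustration.
    ds = xr.Dataset(
        {'t': (('time', 'lat', 'lon'), np.arange(24).reshape(2, 3, 4))})

    time = xr.DataArray([0, 1], dims='z')
    lat = xr.DataArray([1, 2], dims='z')
    lon = xr.DataArray([0, 3], dims='z')

    # Pointwise selection: the result has a single 'z' dim of length 2,
    # i.e. the points (0, 1, 0) and (1, 2, 3) -- not a 2x2x2 cube.
    sel = ds.isel(time=time, lat=lat, lon=lon)
    print(sel['t'].dims, sel['t'].values)  # ('z',) [ 4 23]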

I also haven't done any real profiling yet, beyond glancing at the scheduler dashboard. We're getting good parallel reading and computation overlapping with reading. But since we're just processing a single sample right now, there isn't too much room for parallelism yet.
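
To make the batching idea above concrete, a hypothetical sketch of a batched variant; get_indices_batch and batch_size are assumptions for illustration, not part of the notebook:

    class BatchedDataGenerator(DataGenerator2):

        def __getitem__(self, index):
            # Assumed helper: get_indices_batch returns one flat set of
            # index arrays covering batch_size samples at once, so a single
            # vectorized isel call gives dask more work per task.
            time, lat, lon = self.get_indices_batch(index, self.batch_size)
            subset = self.ds.isel(time=xr.DataArray(time, dims='z'),
                                  lat=xr.DataArray(lat, dims='z'),
                                  lon=xr.DataArray(lon, dims='z'))
            X = xr.concat(subset[self.input_vars].to_array(), dim='lev')
            y = xr.concat(subset[self.output_vars].to_array(), dim='lev')
            return X, y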

Thanks for the very clear examples @raspstephan.

raspstephan (Collaborator) commented:

Hi Tom,

Thanks so much for looking at the example. I am a little busy at the moment preparing for my PhD defense in a week. After that, I will have more time to look at things.

I just wanted to ask whether it would be helpful to have a larger sample of data.

TomAugspurger (Member, Author) commented Mar 8, 2019 via email

raspstephan (Collaborator) commented:

Is there a convenient way for me to share the dataset with you (several hundred GB)? I currently do not have a good option.

nbren12 (Collaborator) commented Mar 8, 2019

Maybe this is something that Pangeo would consider hosting. What do you think @jhamman @rabernat?

Otherwise, you could write a function to make a mock dataset with the same variable names and shapes etc.
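
Something along these lines, for instance; the variable names and dimension sizes below are placeholders, to be swapped for the ones in the real files:

    import numpy as np
    import xarray as xr

    def make_mock_dataset(n_time=100, n_lat=64, n_lon=128, n_lev=30, seed=0):
        """Build a random dataset mimicking the structure of the real one.

        Variable names, dims, and sizes here are placeholders only.
        """
        rng = np.random.default_rng(seed)
        dims = ('time', 'lev', 'lat', 'lon')
        shape = (n_time, n_lev, n_lat, n_lon)
        data_vars = {
            name: (dims, rng.standard_normal(shape).astype('float32'))
            for name in ['var_a', 'var_b', 'var_c']  # placeholder names
        }
        coords = {
            'time': np.arange(n_time),
            'lev': np.arange(n_lev),
            'lat': np.linspace(-90, 90, n_lat),
            'lon': np.linspace(0, 360, n_lon, endpoint=False),
        }
        return xr.Dataset(data_vars, coords=coords)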

rabernat (Member) commented Mar 9, 2019 via email
