Skip to content
This repository has been archived by the owner on Dec 22, 2022. It is now read-only.

Removing units from Quantity #65

Open
nbren12 opened this issue Apr 9, 2021 · 10 comments
Open

Removing units from Quantity #65

nbren12 opened this issue Apr 9, 2021 · 10 comments

Comments

@nbren12
Copy link
Contributor

nbren12 commented Apr 9, 2021

The idea of having a special units metadata is well motivated. In a world where all (or even most) computations preserved units metadata, it would make sense to keep this metadata around. However, in practice most python code doesn't do this, so this required piece of "metadata" often raises errors and leads to workarounds like this: ai2cm/fv3net#1145 (review).

Also, my understanding is that Quantity mostly represents arrays of numbers that can be passed around by MPI. This concern is different from the concern of enforcing metadata requirements and "physical correctness". Can we do away with the units attribute or possibly make it optional or make a QuantityWithUnits subclass of Quantity?

@mcgibbon
Copy link
Collaborator

mcgibbon commented May 4, 2021

However, in practice most python code doesn't do this

Any xarray user code can do this by defining the units expected in its interfaces, and setting the appropriate units metadata on the output. The idea of including this attribute was to encourage such user code. What you described as a "workaround" isn't a workaround in my opinion, it's an important part of the code which defines the units interface as code.

In the code you linked, I would argue the code to supply units that was added should have been added whether it was using Quantity or only using DataArray before writing to netcdf.

my understanding is that Quantity mostly represents arrays of numbers that can be passed around by MPI

Quantity is an array combined with the metadata you need to understand its purpose and use in the model. The one piece of information still missing (which @ofuhrer has advocated to add, and we still might) is something like a standard_name attribute, instead of relying on a dictionary key for that information.

In the case of MPI, the communicator is also responsible for passing metadata, e.g. scattering this units attribute when doing scatter operations.

The docs outline (and discourage) that if you really don't want to manage units, you can set units to an empty string or other "unknown" value. If we really need to in fv3net, we can define a constructor function for Quantity that does this. I'd discourage this in favor of writing code that preserves unit information in our runfiles.

@nbren12
Copy link
Contributor Author

nbren12 commented May 4, 2021

I would argue the code to supply units that was added should have been added whether it was using Quantity or only using DataArray before writing to netcdf.

I agree that units could be enforced before I/O to provide full metadata for downstream users.

However, the conversion from numpy array/DataArray to Quantity is

  1. the main/only (?) place where units requirements are imposed
  2. the least likely place to result in a units error.

Quantity does not support units-aware arithmetic/indexing/etc so any interesting computation will eventually lose its units metadata by converting to numpy or xarray. Therefore, units are only currently checked for operations like get_state/state_state and halo updates that do not change units, which doesn't seem worth raising an error for.

The feature would have more benefit if Quantity supported units aware arithmetic, but currently it mainly raises false positives.

@mcgibbon
Copy link
Collaborator

mcgibbon commented May 4, 2021

I'd be kind of excited to add units-aware arithmetic support, it's always been a long-term goal. The hardest part is coming up with an API set that's small enough to maintain and keep backwards-compatible but large enough to cover most of our use cases. Early on we didn't have enough experience to determine this, but maybe we could now.

@ofuhrer
Copy link
Contributor

ofuhrer commented May 4, 2021

I think there is a middle ground between dumping units entirely (which is really annoying when inspecting fields or doing I/O) and ensuring that they are always up to date with overloading arithmetic operations between Quantity's and doing units checking.
In my mind, the main reason for having Quantity was having a wrapper around raw arrays that contain this sort of meta-data and to provide a wrapper for different storage backends (e.g. numpy, cupy, gt4py, and apple-pie), otherwise we would just use raw data pointers for MPI.

@nbren12
Copy link
Contributor Author

nbren12 commented May 4, 2021

raw arrays that contain this sort of meta-data and to provide a wrapper for different storage backends (e.g. numpy, cupy, gt4py, and apple-pie),

To me the "and" in this sentence indicates that these features maybe should be more decoupled in the code than they currently are, so that the limitations/assumptions of one do not limit the usability of the other.

@nbren12
Copy link
Contributor Author

nbren12 commented May 4, 2021

To me, a layered strategy where there is a quantity w/o metadata, and a quantity w/ metadata would be the best of both worlds. The former could be used for low-level model internals, and the latter for I/O, etc.

@mcgibbon
Copy link
Collaborator

mcgibbon commented May 5, 2021

To me, a layered strategy where there is a quantity w/o metadata, and a quantity w/ metadata would be the best of both worlds.

This is already supported - any user package (including the prognostic run) can make a pretty small Quantity subclass, or even a factory function, that automatically sets units to "unknown".

The former could be used for low-level model internals, and the latter for I/O, etc.

Part of the purpose of Quantity is to help debug high-level model internals. Low-level model internals use quantity.data or quantity.storage.

@ofuhrer
Copy link
Contributor

ofuhrer commented May 5, 2021

Just wanted to write the same thing. While the single responsibility principle is a good one, here I would actually think that just not setting the units is a fine approach instead of having two separate concepts.

@nbren12
Copy link
Contributor Author

nbren12 commented May 5, 2021

This is already supported - any user package (including the prognostic run) can make a pretty small Quantity subclass, or even a factory function, that automatically sets units to "unknown".

Would this work with set_state in the wrapper? Does that routine make sure that the units match the values in the json files?

@mcgibbon
Copy link
Collaborator

mcgibbon commented May 5, 2021

Currently yes. If we ever added unit validation in the wrapper set_state, e.g. to automatically convert units, I would expect we at least offer a validate_units=True keyword argument that can be used to disable it.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants