Experiment with polars #472

Draft, wants to merge 28 commits into base: main

@ecomodeller ecomodeller commented Nov 29, 2024

Rationale

  • Polars optimizes expressions. For example, residuals are calculated in almost every metric, and this repeated work will be optimized.
  • In many cases Polars has a more readable API, and in-place mutation is not possible.
  • Polars is strict: data type conversions must be made explicit, which avoids surprises.

This is an experiment.

Some conclusions:

  • Polars is strict about column order and data types in order to be fast!
  • Pandas and its weird MultiIndex are deeply intertwined in modelskill.
  • It is not easier to apply a list of Python functions to a Polars dataframe than it is to a pandas one.

But if you buy into polars, it is actually quite readable:

import polars as pl

# Column expressions shared by all metrics
obs = pl.col("obs_val")
mod = pl.col("mod_val")
diff = obs - mod

named_metrics: dict[str, pl.Expr] = {
    "bias": diff.mean().alias("bias"),
    "rmse": diff.pow(2).mean().sqrt().alias("rmse"),
    # TODO add more ...
}

# `metrics` holds the requested metric names, `by` the grouping column(s)
sel_metrics = [named_metrics[metric] for metric in metrics]

res = df.group_by(by).agg(*sel_metrics)

And the aggregated dataframe looks like this in the simplest case:

shape: (1, 3)
┌─────────────┬──────────┬──────────┐
│ observation ┆ rmse     ┆ bias     │
│ ---         ┆ ---      ┆ ---      │
│ cat         ┆ f64      ┆ f64      │
╞═════════════╪══════════╪══════════╡
│ alti        ┆ 0.111143 ┆ 0.064629 │
└─────────────┴──────────┴──────────┘

ecomodeller commented Dec 10, 2024

Remaining:

  • Temporal aggregation by frequency (cc.skill(by="freq:D"))
  • Temporal aggregation by date (cc.skill(by="dt:month"))
  • Directional metrics
  • Custom metrics
  • Quantities
    image

@ecomodeller

The skill table output with Polars includes the data types but lacks the nice display of the MultiIndex.
image

One option is to convert back to pandas for display and styling; the other is to use great_tables.
image
