Challenge 23 - Using Machine Learning to Emulate the Earth’s Surface #12

Open
RubenRT7 opened this issue Feb 20, 2024 · 22 comments
Labels
ECMWF New feature or request Machine Learning Machine learning for Earth Sciences applications

@RubenRT7

RubenRT7 commented Feb 20, 2024

Challenge 23 - Using Machine Learning to Emulate the Earth’s Surface

Stream 2 - Machine Learning for Earth Sciences applications

Goal

Evaluating and improving the performance of ECMWF’s current land surface Machine Learning model prototype.

Mentors and skills

  • Mentors: Ewan Pinnington, Christoph Herbert, Patricia de Rosnay, Peter Weston, Sébastien Garrigues, Souhail Boussetta, David Fairbairn (all ECMWF)
  • Skills required:
    • Experience with Python programming
    • Expertise in statistical analysis
    • Background in Earth Sciences or related fields
    • Some experience with Machine Learning desirable

Challenge description

Machine Learning (ML) is becoming increasingly important for numerical weather prediction (NWP), and ML-based models have reached forecast scores similar to or better than those of state-of-the-art physical models. ECMWF has intensified its activities in the application of ML models for atmospheric forecasting and developed the Artificial Intelligence/Integrated Forecasting System (AIFS). To harness the potential of ML for land modelling and data assimilation activities at ECMWF, a first ML emulator prototype has been developed (Pinnington et al., AMS Annual Meeting 2024). The ML model was trained on the "offline" ECMWF Land Surface Modelling System (ECLand) using a preselected ML training database. The current prototype is based on model increments, without introducing further temporal constraints, and provides a cheap alternative to physical models. It opens up many application possibilities, such as the optimization of model parameters and the generation of cost-effective ensembles and land surface initial conditions for NWP.
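
For readers new to increment-based emulation, here is a minimal sketch of how such a model could be stepped forward in time. It assumes that "based on model increments" means the model predicts the change in the land state over one time step, which is then added to the current state. The names `rollout` and `toy_increment` are hypothetical, and `toy_increment` is a made-up stand-in, not the trained ECMWF prototype.

```python
from typing import Callable, List, Sequence


def rollout(
    initial_state: Sequence[float],
    forcings: Sequence[Sequence[float]],
    predict_increment: Callable[[Sequence[float], Sequence[float]], Sequence[float]],
) -> List[List[float]]:
    """Step a state forward by repeatedly adding predicted increments."""
    states = [list(initial_state)]
    for forcing in forcings:
        current = states[-1]
        # The model only sees the current state and forcing (no extra
        # temporal constraint); it returns the change over one step.
        increment = predict_increment(current, forcing)
        states.append([s + d for s, d in zip(current, increment)])
    return states


def toy_increment(state: Sequence[float], forcing: Sequence[float]) -> List[float]:
    # Hypothetical stand-in for a trained model: relax each state variable
    # 10% of the way towards the forcing value.
    return [0.1 * (f - s) for s, f in zip(state, forcing)]


# One state variable (e.g. a temperature in K), three steps of constant forcing.
trajectory = rollout([280.0], [[290.0], [290.0], [290.0]], toy_increment)
```

Because each step feeds on the previous prediction, errors can accumulate along the rollout, which is one reason a targeted evaluation of long time series is useful.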

So far, a qualitative comparison between ECLand-based and emulated fields has been performed on a subset of sites, which revealed that the time series of land variables match well in terms of dynamic range and general trend behaviour. However, more targeted evaluation is required to assess the performance of the land emulator prototype. The aim is to understand the model's capabilities in reproducing the ECLand spatial and temporal patterns and its performance evaluated against in-situ observations.

Scope of the challenge:

The successful team will have the opportunity to contribute to the current efforts of the coupled assimilation and modelling teams in evaluating and improving the ML emulator prototype. The training database and model fields will be available in Zarr format on the European Weather Cloud. More information on the emulator can be found here: ec-land-emulator-git.pptx

What we offer:

• Advanced Python skills: packages Xarray, Zarr, Dask, PyTorch
• Advancing first-of-its-kind land ML prototype
• Tools for land model verification (LANDVER package)

The following steps are proposed to be carried out by the candidate(s) as part of the challenge:

• Comparison between emulated and ECLand variables: evaluation regarding different soil and vegetation types; capability of capturing the diurnal cycle and seasonal variability, revealing patterns of differences and similarities

• Assessment of the performance of the ML emulator: validation with in-situ soil temperature, soil moisture and surface flux observations using the land verification software (LANDVER) or possibly other ground-based observations (e.g. snow) using different verification metrics (correlation, RMSE)

• Testing the benefit of introducing time-varying Leaf Area Index (LAI): Apply the ML emulator using time-varying LAI as an input and assess the performance against ECLand which uses a fixed vegetation climatology

• Extension: selection of input features and target variables for model training; hyperparameter tuning and updating architecture; retraining of the ML model to improve selected variables, e.g. snow cover fraction, against observations and/or reanalysis.
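
As an illustration of the first step above, the sketch below averages emulator-minus-ECLand differences by hour of day (diurnal cycle) and by month (seasonal variability). It uses plain Python to stay dependency-free; the function name `mean_difference_by` and all values are made up for illustration and do not come from the challenge data.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import fmean


def mean_difference_by(times, emulated, reference, key):
    """Average (emulated - reference) within groups defined by key(time)."""
    groups = defaultdict(list)
    for t, e, r in zip(times, emulated, reference):
        groups[key(t)].append(e - r)
    return {k: fmean(v) for k, v in sorted(groups.items())}


# Synthetic 6-hourly series covering two days (values are invented).
start = datetime(2022, 1, 1)
times = [start + timedelta(hours=6 * i) for i in range(8)]
ecland = [270.0, 269.0, 274.0, 272.0, 271.0, 270.0, 275.0, 273.0]
emulated = [270.5, 269.0, 273.0, 272.0, 271.5, 270.0, 274.0, 273.0]

# Mean difference per hour of day and per calendar month.
diurnal_bias = mean_difference_by(times, emulated, ecland, key=lambda t: t.hour)
seasonal_bias = mean_difference_by(times, emulated, ecland, key=lambda t: t.month)
```

The same grouping function could be reused with a key that returns a soil or vegetation class per site, to stratify the comparison by surface type.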

@EsperanzaCuartero changed the title from "Challenge 12 - Using Machine Learning to Emulate the Earth’s Surface" to "Challenge 11 - Using Machine Learning to Emulate the Earth’s Surface" on Feb 22, 2024
@EsperanzaCuartero added the Machine Learning Machine learning for Earth Sciences applications label on Feb 22, 2024
@EsperanzaCuartero changed the title from "Challenge 11 - Using Machine Learning to Emulate the Earth’s Surface" to "Challenge 23 - Using Machine Learning to Emulate the Earth’s Surface" on Feb 23, 2024
@amozaffari

Would it be possible to share the slide from Pinnington et al., AMS Annual Meeting 2024? Thanks! 🙏

@yikuizh

yikuizh commented Mar 7, 2024

Hi,
We are very interested in the challenge. To prepare the proposal, could you please share more information about the model of Pinnington et al. (AMS Annual Meeting 2024), such as the neural network structure, the inputs and outputs, and the temporal and spatial scales?
Thank you very much.

@RubenRT7 RubenRT7 added the ECMWF New feature or request label Mar 7, 2024
@chris-herb

chris-herb commented Mar 7, 2024 via email

@pinnstorm

pinnstorm commented Mar 7, 2024

Have uploaded the slides here too for convenience 🙂!
ec-land-emulator-git.pptx

@pinnstorm

Please find the slides above with some more of this info 🙂. Thanks!

@yikuizh

yikuizh commented Mar 8, 2024

Thanks a lot!

@yikuizh

yikuizh commented Mar 8, 2024

Hi,
I have another question about the main topic of this challenge. I noticed from the challenge description that three steps concern validation and only one concerns model development. Should validation of the current model be our main focus? Or are those steps just for reference, leaving us free to choose our emphasis when preparing the proposal?
Thank you very much.
Yikui Zhang

@pinnstorm

Hi Yikui!
Thanks for your question. The challenge is flexible. We have specified more validation steps because we thought this would be more achievable within the scope of the Code4Earth challenge and would also be very useful for ongoing activities at ECMWF. However, if you are already confident with the technologies used for model training and with developing MLPs in general, then there is definitely scope to focus more on model development and iteration too.
Thanks,
Ewan

@tfohrmann

Hi,
we are currently thinking about ways to do the verification, but are wondering what functionality the LANDVER package has? Is it used to bring the in-situ data into a format that can be compared to the model data? Does it already compute some statistics that can be used for verification?
Thanks,
Till

@pinnstorm

Hi Till!
Yes, the LANDVER package includes in-situ observations of soil moisture, soil temperature and surface fluxes, which are compared to model fields from the closest model grid point. It calculates lots of statistics such as RMSE, MAE and correlation, and produces Taylor diagrams and bar charts of the results. We also have some model fields already processed, and it would be good to compare the emulator with these in the first instance to judge how well it mimics the full physical model and which fields it struggles to reproduce. Other novel sources of verification are welcome, and using other observations or packages is fine too! The emulator currently predicts soil moisture, soil temperature, 2m temperature, 2m dewpoint, skin temperature and snow cover fraction (with the possibility of extending to additional flux variables quite easily).
Thanks,
Ewan
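
To make the verification idea above concrete, here is a minimal, dependency-free sketch of matching a station to its closest grid point and computing RMSE, MAE and Pearson correlation. This is not the LANDVER implementation or its API; the function names, coordinates and soil moisture values are all hypothetical.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    earth_radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * earth_radius_km * math.asin(math.sqrt(a))


def nearest_grid_index(station, grid_points):
    """Index of the (lat, lon) grid point closest to the station."""
    return min(range(len(grid_points)),
               key=lambda i: haversine_km(*station, *grid_points[i]))


def rmse(model, obs):
    return math.sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(obs))


def mae(model, obs):
    return sum(abs(m - o) for m, o in zip(model, obs)) / len(obs)


def pearson_r(model, obs):
    n = len(obs)
    mean_m, mean_o = sum(model) / n, sum(obs) / n
    cov = sum((m - mean_m) * (o - mean_o) for m, o in zip(model, obs))
    var_m = sum((m - mean_m) ** 2 for m in model)
    var_o = sum((o - mean_o) ** 2 for o in obs)
    return cov / math.sqrt(var_m * var_o)


# Hypothetical station and grid (lat, lon), plus invented soil moisture series.
grid_points = [(50.0, 8.0), (50.25, 8.25), (50.5, 8.5)]
station = (50.3, 8.2)
idx = nearest_grid_index(station, grid_points)

obs = [0.20, 0.25, 0.30, 0.28]
model_at_station = [0.22, 0.24, 0.33, 0.27]
scores = {
    "rmse": rmse(model_at_station, obs),
    "mae": mae(model_at_station, obs),
    "r": pearson_r(model_at_station, obs),
}
```

Running the same scores once for ECLand and once for the emulator at each station would show where the emulator departs from the physical model relative to the observations.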

@thisisrohan

Hi all, the link to the slides appears to be broken, could a fresh one please be added?
Thanks,
Rohan

@chris-herb

chris-herb commented Mar 13, 2024 via email

@pinnstorm

Thanks for spotting this Rohan, I have updated the link to the slides in the challenge description and my previous comment too 🙏

@amozaffari

Hi, thank you for your quick response. I have a question regarding testing the impact of time-varying LAI. Will ECMWF provide a time-varying LAI map to be fed into the emulator? Additionally, do we need to adapt the emulator to accept time-varying LAI as input, or is it already capable of doing so?

@chris-herb

Hi Amirpasha, maps of time-varying LAI will be provided. The current emulator is trained using fixed LAI, but it would be interesting to see the benefits of applying the current model or training a new model using time-varying LAI.

@amozaffari

Thanks @chris-herb 🙏

@SamMajumder

Hello mentors and fellow participants,

I am interested in this challenge and would like to participate. I am a first-time participant in the Code4Earth challenge, and I am particularly interested in this project. I have a couple of specific questions regarding the submission process.

Do I independently start developing a proposal for this project and contact any of the mentors along the way if I have questions?

Do I need to run my proposal by the mentors of this project, prior to the final submission?

Any insight is greatly appreciated! I look forward to participating and all the best everyone!! :)

Sambadi

@chris-herb

chris-herb commented Mar 19, 2024 via email

@yikuizh

yikuizh commented Mar 21, 2024

Hi
We have some more questions here about the model and the dataset:

  1. Are the ML emulator outputs already available to use, or do we need to run the ML model ourselves before we can do the validation?
  2. Has the LAI already been used in ECLand as well? If so, is it a climatology or dynamic LAI?
  3. What is the spatial and temporal resolution of the ECLand model that the emulator has used?
  4. We would like to ask what might be the rationale for evaluating the ML model against observation data? In this case, in our opinion, it would make more sense to only compare the ECLand and Emulator output as the ML model is only trained to emulate the ECLand model rather than the real-world observations. I am not sure if our understanding is correct about this point.

Thank you very much for your help!
Kind Regards
Yikui

@pinnstorm

Hi Yikui!

Thanks for the questions 🙂 . In response:

  • There will be ML emulator outputs ready to use, but the model will also be set up so that the candidate can perform additional runs as required during the project.
  • Yes, ECLand uses a climatological LAI, and the emulator is trained on the ECLand run with the climatological LAI values. The emulator is trained to account for the effect of LAI varying in time, so we can also run it with LAI that is not climatological.
  • We have trained the initial emulator at Tco399 (~30 km) spatial resolution with a time step of 6 hours.
  • You make a very good point here, and the main aim is indeed to compare the emulator to the ECLand model output. Additionally, comparing to observations allows us to judge whether the emulator is appropriate even if it isn't exactly mimicking the ECLand model in certain locations. As we can also re-run the emulator quite simply with time-varying LAI (or with other climatological variables tweaked), we can then compare back to observations to see what impact this might have. The current emulator also has the scope to be "fine-tuned" towards observations, depending on progress in the project. That said, the main aim is a thorough comparison to the ECLand model output, which will be provided for the project, and if the secondary aim of including observations cannot be met, this will still be sufficient.

I hope this helps and do let us know if you have any more questions!
Thanks,
Ewan
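
To illustrate the difference between a climatological and a time-varying LAI input, the sketch below builds a monthly climatology from a multi-year series and shows how far each individual value departs from it. The sample values and function names are made up for illustration; they are not ECLand or emulator inputs.

```python
from collections import defaultdict
from statistics import fmean


def monthly_climatology(samples):
    """samples: iterable of (year, month, lai). Returns {month: mean LAI}."""
    by_month = defaultdict(list)
    for _year, month, lai in samples:
        by_month[month].append(lai)
    return {m: fmean(values) for m, values in sorted(by_month.items())}


def anomalies(samples, climatology):
    """Departure of each time-varying value from its monthly climatology."""
    return [(year, month, lai - climatology[month]) for year, month, lai in samples]


# Invented time-varying LAI values for June and July of three years.
lai_samples = [
    (2020, 6, 3.0), (2020, 7, 3.5),
    (2021, 6, 2.0), (2021, 7, 3.5),
    (2022, 6, 4.0), (2022, 7, 2.0),
]

climatology = monthly_climatology(lai_samples)
lai_anomalies = anomalies(lai_samples, climatology)
```

Years with large anomalies are the ones where swapping the climatological input for the time-varying one would be expected to change the emulator output the most.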

@yikuizh

yikuizh commented Apr 5, 2024

Dear mentors,
We have a question about the length of the validation dataset. As shown on page 3 of the slides, it seems that the training dataset of the emulator covers 2018 to 2021, while only 2022 was used as the testing (or validation) dataset. Does this mean that we will only have the 2022 dataset from ECLand to validate the emulator?
Thank you very much for your help!
Kind Regards
Yikui

@pinnstorm

Hi Yikui!

Good question! We have a dataset created from 2010-2023, so we can retrain a version of the emulator leaving more years for validation within this period.

Thanks 🙂
Ewan
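
A chronological hold-out over the 2010-2023 period could be sketched as follows. The number of validation years (three here) and the function name `split_years` are assumptions for illustration only; the thread does not specify how the split will be made.

```python
def split_years(first_year, last_year, n_validation_years):
    """Split an inclusive year range into training years and held-out final years."""
    years = list(range(first_year, last_year + 1))
    if not 0 < n_validation_years < len(years):
        raise ValueError("n_validation_years must leave at least one training year")
    return years[:-n_validation_years], years[-n_validation_years:]


# Assumed example: hold out the last three years of the 2010-2023 dataset.
train_years, validation_years = split_years(2010, 2023, 3)
```

Holding out whole years at the end of the record keeps the validation period free of any time steps seen during training.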
