Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nex to bin speedup #714

Merged
merged 2 commits into from
Dec 19, 2023
Merged

Conversation

JoshCu
Copy link
Contributor

@JoshCu JoshCu commented Dec 19, 2023

Mostly Resolves performance issues mentioned in #713

Changes

nex_files_to_binary

  • Restructured operations to perform most of the file manipulation with bash commands instead of expensive python operations

Old pseudocode

for each nexus file
    read as csv
    format and pivot data
    for each timestep
        open the parquet for that timestep
        write one entry to the parquet
        close the parquet

New pseudocode

for each nexusfile
    bash command to split csv as plaintext into timestep files
for each timestepfile
    read timestepfile csv
    convert timestepfile to parquet
    save parquetfile

The data is still being written the same amount of times, but it's mostly plaintext appends which bypasses the expensive parquet IO until the final pass.

Testing

data I used
python -m nwm_routing -V4 config/ngen.yaml

  1. follow instructions here to download some test data and a docker file capable of running t-route
  2. go through guide.sh until it asks you to if you want to run serial or parallel, keep the terminal open at this step
  3. Open a new terminal, exec into the container
  4. Find the python file that needs updating with find / -type f -name "AbstractNetwork.py"
  5. open that file and update it to the changes in this pr
  6. go back to guide.sh terminal and run in parallel

Screenshots

Comparison of just file IO sections

image

master test

image

PR Changes test

image

Notes

  • Tests were performed on a sata ssd with ~450MB/s IO
  • Those last two tests are using the NGIAB image so the last 3 times are including ngen
  • The Forcing array construction times are what this effects
  • This still is executing sequentially, further improvement could be had by performing the bash commands in parallel and the parquet conversion in parallel
  • Even more performance could be gained from saving the temp csv files as larger parquet files with multiple timesteps, but that's a much larger change

@shorvath-noaa shorvath-noaa merged commit b502211 into NOAA-OWP:master Dec 19, 2023
4 checks passed
@JoshCu JoshCu deleted the nex_to_bin_speedup branch September 3, 2024 21:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants