Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Echometrics improvements early 2023 #22

Closed
wants to merge 37 commits into from

Conversation

jr3cermak
Copy link

Initiate PR with additional updates forthcoming. This is primarily to check results of upstream CI tests before proceeding with additional work.

  • Rename pseudogram to echogram
  • Rename ecometrics to echometrics
  • Ensure all tests pass before proceeding with additional updates
  • convertDBD.sh
    • Better match conditions for input files (rt[tbd] or delayed mode[EBD]) to decoder script
    • Remove use of slocum dbd2asc decoder in favor of dbdreader (next)
    • Move echogram files separately from dba_extension files since they normally do not exist
  • Update templates to track updated names

 * Rename pseudogram to echogram
 * Rename ecometrics to echometrics
 * Ensure all tests pass before proceeding with additional updates
 * convertDBD.sh
   * Better match conditions for input files (rt[tbd] or delayed mode[EBD]) to decoder script
   * Remove use of slocum dbd2asc decoder in favor of dbdreader (next)
   * Move echogram files separately from dba_extension files since they normally do not exist
 * Update templates to track updated names
@jr3cermak
Copy link
Author

CI updates in progress via #20.

Tests within docker are ok and matches independent testing.

==== 27 passed, 8 deselected, 1 xpassed, 3034 warnings in 267.18s (0:04:27) ====

@kwilcox
Copy link
Collaborator

kwilcox commented Mar 21, 2023

@jr3cermak Can you rebase? CI changes are merged

@kwilcox
Copy link
Collaborator

kwilcox commented Mar 21, 2023

Can we completely replace convertDBD.sh (and the slocum binaries) with functionality from dbdreader at this point?

@jr3cermak
Copy link
Author

It would be great to skip the ascii conversion stage and promote important bits into the 2nd stage (netcdf) code. We would then have access to direct data frames instead having to read them from intermediate files. I am not sure dbdreader can completely replace all of what the slocum binaries can do. For straight ascii conversion, yes. For our data processing pipeline, we have moved completely away from the slocum binaries. The slocum binaries ascii conversion does not provide enough floating point precision to decode the embedded echograms which is why dbdreader is necessary. After these updates (other PRs will follow), we can take a deeper dive into the ascii and convertDBD.sh code and see what can be done.

@kwilcox
Copy link
Collaborator

kwilcox commented Mar 21, 2023

How about an intermediary format other than the ascii that currently exists (like 1->N parquet files)? I would like to keep a table-like serialization for other types of processing, distribution, analysis, etc. The netCDF format was really meant as the format to submit to the IOOS Glider DAC and is always going to be lossy.

@jr3cermak
Copy link
Author

Switching to parquet should be fine. Any shift will be require an initial lift to get started. I think I get it now. You are also not enthralled with netcdf as the unifying backend. I had hopes seeing activity at xpublish. So, the goal is to do away with convertDBD.sh and the ascii part but make the converter more useful to serve better purpose than just throwing the ascii part away. Right now the general process is DBD->convertDBD.sh/dbdreader->ascii->netcdf->ERDDAP->data portal. The goal is DBD->parquet/dbdreader->parquet storage->netcdf. The data portal would at some point begin to pull from parquet storage. Is there a xpublish type framework that would sit on top of parquet as xpublish is envisioned for xarray/zarr?

@kwilcox
Copy link
Collaborator

kwilcox commented Mar 22, 2023

Sounds like we are on the same page!

*.*bd files 
    -> dbdreader
        -> parquet files 
            -> profile netCDF files (via pocean) for glider dac
            -> backend analysis/viz (static plots, etc.)
            -> xpublish
                -> frontend analysis/viz (dynamic plots, etc.)

xpublish can sit on top of parquet files the same way it can sit upon xarray datasets... with a little plugin magic. I have a proof of concept that uses duckdb on top of parquet file served through xpublish and it is pretty nice for a quick API, that is the direction I'll be heading.

@jr3cermak
Copy link
Author

Splendid. I will let you know when this PR finished and ready to go. Then we can look into pyarrow a bit more.

    Documentation

     * Update README.pdf for echotools

    General Notes

     * Rename driver script
     * Remove unneeded scripts
     * Most of the updates are within echotools processing
     * Rename as we go: pseudogram => echogram
     * Rename as we go: echosounder => echogram
     * Hopefully not breaking any APIs in the process
     * Set title to {segment} for echograms
     * Note deprecated arguments
     * Remove debug print statement
     * Allow user supplied arguments to override deployment.conf
       arguments
     * Print help menu if a data type is not provided
     * Convert limits to python lists if provided
     * Use segment for filenames if "default" is part of the
       filename
     * Default plot type is "binned"
     * Fix up "time" axis for echogram plots
     * Allow deployment.conf to update arguments not provided on
       the command line

   Specific

     * tests: TestEchoMetricsTwo: echometrics2; test_slocum.py
       * Fix date in deployment.json for echometrics2
       * Replace with modern example
       * Includes test of teledyne module and creation
         of numpy, pandas and xarray data frames from
         echogram to provide an example of how it can
         start to be leveraged from within the GUTILS
         framework.

     * gutils: slocum/__init__.py
       * move setting of enable_image to a better spot
       * Update calls to convertDBD.sh to match API changes
       * Update calls to decodePseudogram.py => processEchograms.py

     * gutils: tests/resources/slocum/*
       * Eco => Echo
       * Update extra_kwargs to match API changes/updates
       * Rename as we go: pseudogram => echogram
       * Rename as we go: echosounder => echogram
       * Add more tests to Echodroid/Echometrics tests to
         attempt to capture backend failures
       * Remove some some unneeded files

     * echotools: teledyne.py
       * specify matplotlib backend to potentially make
         faster; defaults tend to trigger X11 layer
       * rename [pP]seudo => [eE]cho
       * Check arguments for None
       * default plot type => pcolormesh => binned
       * set plot facecolors to lightgrey
       * recalculate range bin size if defaults are changed
       * fix an inventory filtering bug
       * add a function to create groups of DBD files based on
         a raw DBD file listing
       * a deployment directory is not required if not writing
         netcdf files
       * a template file is required when writing netcdf files
       * add a function that overrides key arguments if a
         deployment file is read.  Mainly applies to echograms.
       * allow plot types to be specified in a comma delimited
         list or with "all"
       * rearrange plotting code so all three types can be
         produced
       * ensure the color bar is consistent between the three
         plotting techniques
       * Rename as we go: pseudogram => echogram
       * Rename as we go: echosounder => echogram

     * echotools: processEchograms.py
       * Rename as we go: pseudogram => echogram
       * Rename as we go: echosounder => echogram
       * Always look for a deployment configuration file
       * Update arguments based on deployment configuration file
@jr3cermak jr3cermak mentioned this pull request Mar 25, 2023
6 tasks
@jr3cermak
Copy link
Author

This PR is good to go anytime. gutils/tests/test_slocum.py::TestEchoMetricsTwo::test_echogram has code that will read the echogram and put the profile into a dataframe: numpy, pandas and xarray. Initial starting point for migrating away from convertDBD.sh. The teledyne.py module has morphed a few times as it has encountered various sources of code. It can probably use an overhaul when we get closer to tackling the convertDBD.sh script and friends later. In an earlier life, it also moved away from the slocum binaries. Class functions still lurk in there even after converting to the dbdreader module. Progress.

 * slocum/__init__.py
   * Set default plot type to 'binned'
   * Emit a log message for debugging
 * convertDBD.sh
   * Remove other debugging materials
 * echotools/teledyne.py
   * Leave default cmap selection as ascii and convert to
     actual colormap when requested
   * remove debugging material
   * if the date/time ticks are days that line up
     with 00:00:00, omit the time portion of the string
   * More migration of Pesudo => Echo
   * Bug/Enhancement: echogram is now properly time
     adjusted for "combo" and "egram" glider modes
 * Ability to create timeseries views of echogram profiles
 * Add new default product plot type "profile"
 * Update examples to use default product type
kwilcox and others added 9 commits June 16, 2023 10:49
* Reduce testing resource size and allow testing in Docker image
* Update dependencies and Docker image version
* Pseudogram -> Echogram
* Whoops, the binary utilities still require focal
* dbdreader to the rescue... eventually!
* Run the watch tests in Docker CI testing
 * gutils/__init__.py
   * set logging level
 * gutils/filters.py
   * pass extra_kwargs as optional argument to reader_class
   * extra_kwargs now part of the class, remove from extras()
 * gutils/nc.py
   * Do not emit debug log message for orphans when the set is empty
   * Do not cross fill variables (is this needed for the DAC?)
 * gutils/slocum/__init__.py
   * Add datetime, warnings, dbdreader and pyarrow.parquet to needed modules
   * Reduce logging output for dbdreader and matplotlib
   * Add optional kwargs to SlocumReader()
   * Add flag to enabling parquet processing
   * Set processing mode based on ascii_file
   * Use internal kwargs to class
   * Let routine fail gracefully if parquet processing is enabled but
     unable to find appropriate modules
   * Fix: echometrics data was not being propogated to _extra_*.nc du
     to restrictive time filter.
   * Be sure to drop echogram_sv from main profile data
   * All echo data now stored in _extras.
   * Add block_read() to perform same task as dbd2asc
   * Add read_header() to fetch certain metadata from dbd
   * Adapt convert() routine to work as ascii or parquet
   * Updates to ascii processing for echograms
 * gutils/slocum/bin/convertDbds.sh
   * Add two arguments for echogram processing via ascii method
   * Add some proper defaults
 * Update gutils/slocum/echotools/teledyne.py large amount of
   upstream fixes and improvements.
 * pytest
   * New default: echograms produce two image files for each profile
   * echometrics5 and echometrics7 use ascii method
   * echometrics6 and echometrics8 use parquet method
 * gutils/slocum/echotools/teledyne.py now flake8 compliant
 * gutils/nc.py
   * move imports to top
   * add debug flag to get at bugs at deep levels
 * gutils/tests/test_slocum.py
   * update answers with bug fixes and updates
 * setup.cfg
   * remove exception for echotools
jr3cermak added 17 commits June 24, 2023 13:46
 * __init__.py
   * SciPy 1.13.0 warnings
 * nc.py
   * Skip writing extras file for echograms that are short
 * slocum/__init__.py
   * enable deeper debugging
 * slocum/bin/convertDbds.sh
   * enforce grep to use ascii on binary files
 * slocum/echotools/teledyne.py
   * allow PYTHONWARNINGS to override coded warnings
   * utilize dbdreader skip_inital_line feature
   * numpy datetime conversion warning
   * flake8 file matching update
   * ensure file descriptors are released
   * merge duplicate time coordinate data from DBD
   * flake8 updates
 * improve handling when using the parquet format
 * use glider terms to refer to data file sources (glider -> flight)
 * use python classes instead of [] and ()
 * set output directory appropriately when argument is not set
 * add metadata to data dict for tracking of created files
 * add env var GUTILS_PYTEST_NOTEARDOWN to tests to defer removal of ascii and netcdf directories during testing (debugging)
 * fix checks in a few tests with recent improvements to echotools processing
@jr3cermak
Copy link
Author

Finally ready to move forward with updates to echometrics processing and the addition of parquet as intermediate storage. The pytests all pass when run manually.

2023-08-28 14:06:24,221 - gutils.slocum - INFO - Converted usf-bass-2016-253-0-4.sbd,usf-bass-2016-253-0-4.tbd to usf_bass_2016_253_0_4_sbd.dat
2023-08-28 14:06:24,273 - gutils.slocum - INFO - Converted usf-bass-2016-253-0-5.sbd,usf-bass-2016-253-0-5.tbd to usf_bass_2016_253_0_5_sbd.dat
2023-08-28 14:06:24,425 - gutils.slocum - INFO - Converted usf-bass-2016-253-0-6.sbd,usf-bass-2016-253-0-6.tbd to usf_bass_2016_253_0_6_sbd.dat
PASSED
gutils/tests/test_watch.py::TestWatchClasses::test_gutils_netcdf_to_erddap_watch PASSED

================================================================================== warnings summary ==================================================================================
gutils/tests/test_nc.py: 3030 warnings
gutils/tests/test_slocum.py: 870 warnings
  /home/cermak/miniconda3/envs/gutils_py3_9/lib/python3.9/site-packages/compliance_checker/suite.py:185: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    latest_version = str(max(StrictVersion(v) for v in version_nums))

gutils/tests/test_slocum.py::TestEchoMetricsSix::test_echogram
gutils/tests/test_slocum.py::TestEchoMetricsSix::test_echogram
gutils/tests/test_slocum.py::TestEchoMetricsSix::test_echogram
gutils/tests/test_slocum.py::TestEchoMetricsSix::test_echogram
gutils/tests/test_slocum.py::TestEchoMetricsSix::test_echogram
  /home/cermak/miniconda3/envs/gutils_py3_9/lib/python3.9/site-packages/pyarrow/pandas_compat.py:354: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if _pandas_api.is_sparse(col):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
============================================================== 38 passed, 1 xpassed, 3905 warnings in 440.84s (0:07:20) ==============================================================

Workflow tests are not working.

Looks like the build process needs to know about dbdreader. Utilization of dbdreader will completely replace reliance on the x86 slocum binaries for decoding.

There is an odd dependency failure via conda/Docker:

E   ImportError: /lib/x86_64-linux-gnu/libstdc++.so.6: version `GLIBCXX_3.4.30' not found (required by /opt/conda/lib/python3.9/site-packages/scipy/fft/_pocketfft/pypocketfft.cpython-39-x86_64-linux-gnu.so)

This update has a hook to process glider data into parquet intermediate files. The parquet enabled processing a shade faster than the ascii method.

This update references issue #12, #24 and #26.

 - slocum/__init__.py
   - Final alignment of data types: rt, rtd, delayed
 - bin/convertDbds.sh
   - platform compatability fix for unit tests (see below)
 - tests/test_slocum.py tests/test_watch.py
   - do additional clean up of cache files after test completes

Unit tests failed on different platforms due to either magic
or operational diffences of file.  Testing for data or ASCII
allows all the unit test to successfully pass.

-    ftype=$(file $dbdSource | grep data);
+    ftype=$(file $dbdSource | grep 'data\|ASCII');
@jr3cermak
Copy link
Author

This update to the PR includes a small fix that was discovered when trying to run unit tests on other platforms with different versions of OS and modules. Some small differences in magic or the way file operates, the shell must test for data or ASCII to allow unit tests to pass.

@jr3cermak
Copy link
Author

From our standpoint, this is ready for implementation. Not sure what needs to be done to fix it in the workflow unit tests. Unit tests pass when run manually on all our platforms we have been testing.

@VeckoTheGecko VeckoTheGecko mentioned this pull request Dec 7, 2023
3 tasks
@jr3cermak jr3cermak closed this Apr 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants