Add support for checking missing sources with extra-data-validate #280
Also removed the `suffix` argument from progress_bar() and `nproblems` from RunValidator.progress().
This uses a user-defined threshold of trains with missing data to decide which sources to report.
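As a rough illustration of the threshold idea (not the code in this PR), the check could look something like the sketch below. The function name, the `train_counts` mapping and the meaning of `threshold` as a fraction between 0 and 1 are all assumptions made for the example.

```python
def sources_missing_data(train_counts, n_trains, threshold):
    """Return {source: fraction of trains without data} for sources over the threshold.

    train_counts: hypothetical mapping of source name -> number of trains with data.
    n_trains: total number of trains in the run.
    threshold: tolerated fraction of trains without data (0.0 to 1.0), chosen by the user.
    """
    if n_trains <= 0:
        # Nothing to compare against, so nothing can be reported.
        return {}

    report = {}
    for source, n_with_data in train_counts.items():
        missing = 1 - n_with_data / n_trains
        if missing > threshold:
            report[source] = missing
    return report
```

The threshold is a required argument here on purpose, so the sketch does not suggest any particular default.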
Thanks James. To answer some of your questions:
It is not critical, but it is important
Probably not. In general, all keys in a pipeline data source come from the same data hash, so it's probably safe to assume that if one is written, all keys from the same hash are too. The MHz detectors are a special case: they have 4 sub-sections ( now, I'm not sure this is the right place for this check. Currently, the purpose of extra-data-validate is to check that a file is not broken, i.e. that it respects the format definition. Missing data is not a file problem and is actually something "normal", as probably almost all runs will have missing data for one or more data sources. I also know that people tend to leave some instruments selected in the DAQ even if they aren't used, so you get no data for them all the time.
Also, for your information, Robert and I are working on extending how we inform people about data issues. We will have a DnD to brainstorm on that idea when Robert is back. Finding a way to actively notify beam staff about these problems is a great use case (is that a reasonable timeframe for you?)
There are also a couple of ways already available to "solve" this issue:
e.g. something like: $ lsxfel ./r0001 --detail "*:*" | grep "data for" -B2 | sed -n 1~2p
- HED_EXP_IC1/ADC/1:channel_0.output
data for 0 trains (0.00%), up to 0 entries per train
- HED_EXP_IC1/ADC/1:channel_1.output
data for 0 trains (0.00%), up to 0 entries per train
- HED_EXP_IC1/ADC/1:channel_2.output
data for 2734 trains (100.00%), up to 1 entries per train
...
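If the same check is wanted from a script instead of a shell pipeline, the output above can be parsed with a few lines of Python. This is only a sketch: it assumes the text looks like the excerpt above (a `- source` line followed by a `data for N trains (P%)` line), and the function names are made up for the example.

```python
import re

# Matches e.g. "data for 2734 trains (100.00%), up to 1 entries per train"
DATA_LINE = re.compile(r"data for (\d+) trains \(([\d.]+)%\)")


def parse_lsxfel_detail(text):
    """Map each source name to (trains with data, percentage of trains)."""
    results = {}
    source = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("- "):
            # A new source entry starts here.
            source = line[2:].strip()
            continue
        match = DATA_LINE.search(line)
        if match and source is not None:
            results[source] = (int(match.group(1)), float(match.group(2)))
            source = None
    return results


def sources_below(text, min_percent):
    """List sources whose share of trains with data is below min_percent."""
    parsed = parse_lsxfel_detail(text)
    return sorted(s for s, (_, pct) in parsed.items() if pct < min_percent)
```

Feeding it the three entries shown above with `min_percent=100.0` would pick out the two channels with data for 0 trains.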
|
maybe @takluyver has a different opinion? |
I think the option of quickly checking for this is very useful, but as Thomas suggested, it is not universally an error condition for data to be missing on a given train. On the contrary, there are devices that, either intentionally or by their nature, emit data at less than 10 Hz. That's not a problem with the feature itself, though, only with giving it a finite default value. It should be disabled by default and easy to turn on, even permanently for the run validator if an instrument so chooses. On the more technical side, I think there are plenty of assumptions you can include to reduce the number of walked keys:
|
I think there's a practical problem with this: as I understand it, one of the easiest ways for data to go missing is when the source is not configured in the DAQ. In that case, 100% of the data is missing, but the source name isn't there in the files, so this check wouldn't see a problem. My inclination is to agree with @tmichela that we should keep the scope of extra-data-validate limited to 'are these files correctly structured?', and missing data is not an issue with the file structure. But from a user's perspective, I know the difference between a DAQ bug and a bug or a misconfiguration somewhere else in Karabo is not important, and the 'validation' that a lot of people want is 'is my data being saved?' So I don't want to get too hung up on the principle of this. I wonder if the subjective question of 'is enough of the data that I care about present in these files?' is better answered by a separate tool than the objective one extra-data-validate tries to answer. This is roughly what I was going for with the percentages in lsxfel, but that's still distinctly clumsy. |
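To make the gap concrete, here is a small sketch of the two situations. Nothing like this exists in the PR; the function names and the idea of passing in an explicit list of expected sources are assumptions for illustration only.

```python
def sources_without_data(counts):
    """Sources that are present in the files but have data for zero trains.

    counts: hypothetical mapping of source name -> number of trains with data,
    built only from what is found in the files.
    """
    return sorted(source for source, n in counts.items() if n == 0)


def sources_not_in_files(expected, counts):
    """Sources someone expected to be recorded but which never appear in the files.

    expected: hypothetical list of source names supplied by the user.
    """
    return sorted(set(expected) - set(counts))
```

A check of the first kind can only ever report sources that made it into the files; a source left out of the DAQ configuration would only show up in the second kind, which needs outside knowledge of what should have been recorded.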
That makes sense, and indeed I didn't think of cases where a device doesn't send data at 10 Hz (one example I can think of would be the Shimadzu camera at SPB).
Hmm, that's not really the problem I'm trying to solve. If a source is missing from the DAQ then that's definitely an operator error. This validation is only looking for cases where the DAQ is configured correctly and data should have been written, but it wasn't. That's a more insidious problem to discover because technically the bad sources are present in the files, so
I don't have a strong opinion on that. Personally I would lean towards keeping it in So my takeaways are:
Does that sound good? I think it's simpler if the feature stays in |
Yes, that sounds reasonable. However, since we are starting to work on a project that is meant to handle any kind of data validity issue (and data loss definitely falls within this scope), I would prefer not to add that right now, and to wait until we have a clearer plan for how this project will handle data issues and inform users about them. What about this?
|
Yep yep, that sounds good to me 👍 |
This adds support for checking that sources in runs are not missing from more than some percentage of trains. MID noticed that occasionally certain devices (particularly cameras and fastADCs) would be missing from runs, and this wasn't noticed until later when they tried to analyze the data.
The goal is for this to be run automatically by the RunValidator device, perhaps with a setting to change the missing data threshold. You can test it with this run, which is missing a bunch of data from the AGIPD and some cameras: `/gpfs/exfel/exp/MID/202221/p003210/raw/r0008`.
Couple of things I'm not too sure about:
`FileValidator`
to report the data counts for each source/key it contains.
`<source> is missing data for <n> trains`
instead of listing every key. But maybe pipeline data is handled differently, e.g. for the big detectors?
(BTW it's probably easiest to review each commit individually, they're fairly atomic)