-
Notifications
You must be signed in to change notification settings - Fork 45
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for tracking row-based provenance #1113
Comments
ProblemI'd like to state the problem concisely before exploring solutions:
Does that capture it? I'll refer to these as 'the version problem' and 'the update problem' as I think out loud below. SolutionsHidden fieldsRow-wise insertion/update information could track everything
Data density could be mitigated by being selective in which tables had these fields. Merge tables or file tables ( File as sparse monitoringAs Sam mentioned, spyglass version in nwb files already provides a sparse solution for the version problem, but it may be sufficiently dense to delete all cases in between file saves. Is an additional tool needed to compare across all files from the same session? Is an additional field needed for user?
Expanding built-in log useThe mixin would allow us to intercept
Custom log tableIf row-wise is too dense for version tracking, can we keep track of when a user updates their environment, and when their environment is editable, using snippets provided by Ryan in #1087. A new 'user history' table loads whenever spyglass is imported. It checks the user's current spyglass version and/or conda environment against their last entry, and inserts a row if different from last time
A Permissions changesIf we're worried about 'update', we can disable it for non-admin either in SQL or in the python interface.
Replication toolIf we're worried about whether or not a given downstream key/file could be exactly replicated, we could instead focus on doing that, rather that data tracking possible errors. Given a downstream key, we can already export a vertical slice of the database, with all paramset table pairings. I spent some time exploring such a tool in #1057 with the idea of running the same analysis on a different session. We could fully solve the version problem for a given analysis by instead rerunning the same session(s).
Hypothetical goal: The production database a staging ground; to finalize analyses means a dockerized from-scratch rerun General thoughtsI propose adding a line to the builtin log for I'm in favor of sparser methods of protecting against the version issue until we see cases of definite red flags where we retroactively wished we had access to row-wise logging such as this. My gut says that adding the user history log, and adding created-by user to analysis files, is enough to cross reference for who made which file when. Not a full solve of the version problem, but perhaps an adequate record to prevent full reprocessing |
For common tables, users may accidentally (or intentionally) modify a row created by another user. So it is useful to know who was the last person who modified a row and when. Similarly, it is useful to know who created the row in the first place.
A couple types of actions (declare table, drop table, delete from table) are tracked in the DataJoint hidden
~log
table that is part of each database (e.g.,common_nwbfile
), but those are limited to table-wide actions and deletions, and it may be hard to parse for everyday users.There are many ways to do this, but since we care only about the latest state of each row, one common way seems to be to add columns to the data tables about the latest state:
created_by
orcreation_user
(user) (with or without IP address)created_on
orcreated_time
orcreation_timestamp
(datetime)last_modified_by
(user)last_modified_on
orlast_modified_time
orlast_modification_timestamp
(datetime)last_modified_reason
(string, e.g., "database migration" or "fix typo") (not sure if useful)last_modified_spyglass_version
(string) (not sure if useful)spyglass_environment
- a text-based dump of the virtual environment, host name, and operating system, but I think that is overkill and would result in a bloated database. we have thought about this for NWB and are settling on version of the primary software package as being good enough for our initial attempt at tracking provenance)This is now doable in the latest version of DataJoint (not released yet?) without cluttering the common display of tables (i.e., the columns are "hidden" but can be queried/displayed).
Alternative: we could create a history table for every data table and add a foreign key from the data table to the history table. I'm not sure if this separation adds any value now that we can have hidden columns.
Concern: Both approaches will increase the size of the database. Is it worthwhile?
It would be nice if this were built into DataJoint as mysql triggers, but until then, we could add the values every time we call
populate
.@CBroz1 noted:
Questions:
The text was updated successfully, but these errors were encountered: