exp run: new code (or any untracked files) can't be queued [QA] #5816
Replies: 14 comments
-
@pmrowla's reply (#5593 (comment)): The issue with creating a new dvc.yaml file is the same as creating any new code file for experiments. In order for any new (completely untracked) files to be included in the queued experiment, you must at least stage them with git add (but they do not need to be committed) (it looks like this is not currently documented, see: #5029). For workspace runs since the files are present in the workspace itself, this issue doesn't show up. The alternative would be for us to explicitly include all of the untracked files in our experiment git commits. Doing so will likely result in us git-committing objects which do not belong in git (like large binary files, or things like venv dirs or pycache dirs that the user forgot to git ignore), or files which the user was explicitly not tracking in DVC or git for a reason (like authentication credentials). @dberenbaum's reply (#5593 (comment)): Files not tracked by Git will only be reflected in workspace experiments and not in temp experiments. Workspace experiments will use whatever is in the workspace, but temp experiments copy files to the temp dir by checking them out with Git. In this case, I think it's unclear what the expected behavior is. |
Beta Was this translation helpful? Give feedback.
-
Yes, I see @pmrowla. We could also just copy all the workspace contents to the tmp dir without ever committing untracked files, but then the resulting exp commit could be missing files needed to reproduce its results. Tricky! I guess queued/temp runs are fundamentally different from workspace runs in some ways. Should def. document this...
I don't think it's just a Maybe adding an option like |
Beta Was this translation helpful? Give feedback.
-
A related edge case in this discussion - do we capture the |
Beta Was this translation helpful? Give feedback.
-
btw, not sure what is the problem with this phrase specifically? it looks this is a duplicate/related to this one #5800 - same problem pretty much |
Beta Was this translation helpful? Give feedback.
-
We do not, the issue here is similar to where we do not respect the local config when doing certain commands which use our erepo clone functionality.
And yes, this is the same issue. We can copy untracked files (and things like the local config) directly into the temp directory, but that adds the question of how to "stage" this copy for multiple queued runs. Some options here would be:
One other potential thing that I came up with at the time of that early PR was #4452 (comment)
This would fix the "storing multiple copies of large binary files" issue, but the downside would be that these untracked files end up cached by DVC until the next time the user runs Finally, none of these solutions would directly address the issue in #5800, where the user has gitignored files that also need to be present in the workspace in order to run their pipeline. In this case, it's credentials, but in theory, stuff like (properly gitignored) venvs also needs to be copied over as well. This issue is also related to cloud execution: iterative/enhancement-proposals#3 Running experiments on remote machines or in CI will also require making sure that these files exist on the remote machine (i.e. the same issue as making sure they exist in the local temp directory) |
Beta Was this translation helpful? Give feedback.
-
What could that affect? Remotes shouldn't be a problem. Wrong cache link types is the only significant problem I can think of rn. (Do we need a separate issue for this?)
True @shcheklein, prob. not the specific text we'll need to update. Code changes (including to dvc.yaml) do get picked up by queued exps. I updated this issues' title for clarity.
I don't think it's the same problem (maybe same solution thugh?). I'm suggesting a new option to copy untracked files into tmp dirs (which can currently be done, mostly, by git-staging them, see #5029), but that EXCLUDES git/dvc-ignored ones to my mind. In #5800 they're requesting certain git-ignored files to be copied over as well. |
Beta Was this translation helpful? Give feedback.
-
I like the direction of this thinking @pmrowla. How about just copying untracked files to tmp exp dirs (IMO preferably with a new
➕. More of a reason to figure this out (at some point). |
Beta Was this translation helpful? Give feedback.
-
Excluding the gitignore issue in #5800, one option is to leave as is and document. If we're only focused on the narrow issue here of new code or other files that should be Git-tracked, users ought to be able to work around this limitation pretty easily with EDIT: I missed that this has already been proposed in #5029 . Do we need a docs issue created? Can we convert this ticket into a discussion until there are some other actions decided? |
Beta Was this translation helpful? Give feedback.
-
Yep, documentation for #5029 was missed. If we're not addressing this issue soon, I hope that can be done for now.
Sure. And this issue is basically about revisiting the current behavior at this point. Summarizing my reasons:
Proposed actions:
|
Beta Was this translation helpful? Give feedback.
-
You already have to
I still disagree with this, IMO the user should have to be specific about what files they want to stage, and we should not provide an option to blanket add all untracked files. |
Beta Was this translation helpful? Give feedback.
-
This is the only situation where DVC bases anything off staged-only files AFAIK. Git-committing yes, as we use Git revisions (for versioning); Git-adding by itself, not that I know of.
Sure. The UI can be different e.g. some sort of exp.config file (a manifest like you suggested), or new field in dvc.yaml to list which untracked files to include in the exp. I wouldn't call it "staging" or "adding" though, since ideally we never want those in Git. I don't think we should use Git at all here. |
Beta Was this translation helpful? Give feedback.
-
I guess I don't really see the difference between "requiring users to run 2 git commands first in order for some DVC functionality to work" and "requiring users to run 1 git command first in order for some DVC functionality to work". e:
is the now established workflow for using
Sure, this is true. And experiments is a new feature and is the only situation where DVC is trying to execute a pipeline outside of the user's workspace. Since experiments is brand new, there is no established workflow for experiments at all. If we want to tell users that the workflow for using experiments is
we can, because experiments is new and we can set the "standard practice" workflow to whatever we want it to be. If we wanted to, we could even document that using |
Beta Was this translation helpful? Give feedback.
-
I think it's not about Git commands. We integrate with Git versions ("revisions") regardless of what Git interface was used to create them. I don't think we should integrate with the Git index ("staging area").
We could. But a workflow like I could be wrong. Let's make this a discussion and invite others to weight in? For now BTW, #5800 (comment) seems like it's becoming the same issue indeed (a request along the line of @pmrowla's exp manifest/cache idea and to my new option suggestion, I think). UPDATE: Migrated to a discussion! ⚔️ |
Beta Was this translation helpful? Give feedback.
-
Looks like we have a user endorsing this change idea (#5800 (comment)).
@dberenbaum yeah I guess a special part of the cache for sensitive info that should never leave the local env could have several uses. DVC Vault |
Beta Was this translation helpful? Give feedback.
-
UPDATE: For summary jump to #5816 (comment).
In the docs we say "Before running an experiment, you'll probably want to make modifications such as data and code updates, or hyperparameter tuning." Is this the intended behavior?
Because, if we see dvc.yaml as code — and I think we do as we one of DVC's principles is pipeline codification — then this statement isn't completely true when it comes to queueing experiments.
Specifically, if you create (or modify) dvc.yaml between queued runs and then try to
--run-all
, you get errors. Example:It works with regular experiments though, which can access all workspace files (not in a tmp dir).
Beta Was this translation helpful? Give feedback.
All reactions