[DOCS] Clarify difference in function between data/external and data/raw #136
Comments
In practice, what we usually do is

Also, with respect to the processing pipeline, I've seen both A and B in practice. It just depends on how much processing

For example, if we want to add geographic regions to a dataset, we need shapefiles for those regions. Often these come from third-party sources. We usually put these in

On the other hand, we've done projects where we do something like comparing published country-level poverty rates to those calculated from a survey. In this case, the raw data in the survey gets aggregated to country-level estimates during

TL;DR: I would recommend everything in

Does that help answer your question?
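The "add geographic regions to a dataset" case can be sketched roughly as follows. This is a minimal illustration, not the maintainers' actual workflow: it uses a plain CSV lookup as a stand-in for shapefiles, and it assumes the third-party lookup sits in data/external, the original dump sits in data/raw, and derived output goes to a separate data/interim directory. The function name, file names, and column names are all hypothetical.

```python
import csv
from pathlib import Path


def add_regions(project_root: Path) -> Path:
    """Join raw survey rows with a third-party region lookup.

    Hypothetical layout (assumed for illustration only):
      data/raw/survey.csv        original dump, read-only here
      data/external/regions.csv  third-party lookup, read-only here
      data/interim/...           derived output written by this step
    """
    data = Path(project_root) / "data"
    raw_file = data / "raw" / "survey.csv"
    lookup_file = data / "external" / "regions.csv"
    out_file = data / "interim" / "survey_with_regions.csv"

    # Build a country -> region mapping from the third-party file.
    with lookup_file.open(newline="") as f:
        region_by_country = {row["country"]: row["region"] for row in csv.DictReader(f)}

    # Read the raw rows; the raw file itself is never modified.
    with raw_file.open(newline="") as f:
        rows = list(csv.DictReader(f))

    # Attach a region to each row; unknown countries get an empty string.
    for row in rows:
        row["region"] = region_by_country.get(row["country"], "")

    # Write the enriched rows to a new file outside data/raw and data/external.
    out_file.parent.mkdir(parents=True, exist_ok=True)
    fieldnames = list(rows[0].keys()) if rows else ["country", "region"]
    with out_file.open("w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return out_file
```

Calling `add_regions(Path("."))` from the project root would leave both input files untouched and produce one new derived file, so the step can be rerun from the original inputs at any time.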
Thanks for the detailed response! That certainly clears up a lot. As I understand it, then, the difference between
Like I said: humble. (It would appear – in this analysis at least – that 'project-specificity' is the winning dimension.) In any case, I found the following sentence to be particularly helpful, since it really struck a chord.
It's such a simple idea, but I've lost count of how many times I've done the exact opposite, only to be confronted with the ugly consequences when revisiting the project months later. While every use case is different, I would almost suggest incorporating it into the documentation somehow.
Changed title and clarified that this is an easy doc fix.
Based on the descriptions in the 'Directory structure' section of the documentation, there doesn't seem to be a clear-cut criterion for choosing between data/external and data/raw in those cases where the original data dump originates from a third-party source, i.e., fulfills the conditions for inclusion in either directory.

What sort of criteria do you apply in such ambiguous cases?
On a related note, where does data/external fit in your 'mental model' of the preprocessing pipeline? (Pick one)