[DOCS] Clarify difference in function between data/external and data/raw #136

Open
Tracked by #246
a3lem opened this issue Aug 3, 2018 · 3 comments

Comments

@a3lem

a3lem commented Aug 3, 2018

Based on the <- descriptions in the 'Directory structure' section of the documentation, there doesn't seem to be a clear-cut criterion for choosing between data/external and data/raw when the original data dump comes from a third-party source, i.e., when it fulfills the conditions for inclusion in either directory.

What sort of criteria do you apply in such ambiguous cases?

On a related note, where does data/external fit in your 'mental model' of the preprocessing pipeline? (Pick one)

  • A
raw ---> interim ---> processed ---+
                                   |
                                   +---> [ analysis ]
                                   |
                      external ----+
  • B
raw -------+
           |
           +--> interim ---> processed ---> [ analysis ]
           |
external --+
@pjbull
Member

pjbull commented Aug 3, 2018

In practice, we usually put data in raw unless there is a clear use case for external. We generally don't restrict raw to one dataset, which means we could put everything in raw. That said, we're often asked to look at a "primary" dataset. Over the course of the project, we find other data sources that are relevant or that we want to consider including. Storing those in external means at the end of the project we know which data sources were "provided" vs. which ones we found elsewhere.

Also, with respect to the processing pipeline, I've seen both A and B in practice. It just depends on how much processing external needs for your particular analysis.

For example, if we want to add geographic regions to a dataset, we need shapefiles for those regions. Often these come from third-party sources. We usually put these in external, and they get used early on in the pipeline to augment the raw data (option B). We may then select only specific regions as part of interim -> processed, and these feed into the analysis.
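To make the option B flow concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the record fields, the district-to-region lookup, and the function names are hypothetical, and a dictionary lookup stands in for the spatial join that real shapefiles would need.

```python
# Hypothetical records standing in for files under data/raw and data/external.
raw_records = [
    {"household_id": 1, "district": "D01", "income": 1200},
    {"household_id": 2, "district": "D02", "income": 800},
    {"household_id": 3, "district": "D09", "income": 950},
]
# Stand-in for third-party region data: district code -> region name.
external_regions = {"D01": "North", "D02": "South"}


def augment_with_regions(records, region_lookup):
    """Join the external lookup onto the raw records (the early step in option B)."""
    return [
        {**row, "region": region_lookup.get(row["district"])}
        for row in records
    ]


def select_regions(records, wanted):
    """Keep only the regions of interest (the interim -> processed step)."""
    return [row for row in records if row["region"] in wanted]


interim = augment_with_regions(raw_records, external_regions)
processed = select_regions(interim, {"North"})
```

Records whose district is missing from the lookup get a region of None here, so they drop out at the selection step instead of raising an error.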

On the other hand, we've done projects where we do something like comparing published country-level poverty rates to those calculated from a survey. In this case, the raw data in the survey gets aggregated to country-level estimates during interim -> processed. We then directly compare these to the external datasets, which feed straight into the analysis (option A).

TL;DR: I would recommend everything in raw unless there is a clear internal or "primary" dataset.
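A rough sketch of the option A flow, again in plain Python. The survey rows, the published rates, and the function names are all made up for illustration, and the aggregation is a simple unweighted share, whereas real survey estimates would normally use sampling weights.

```python
from collections import defaultdict

# Hypothetical survey rows standing in for a file under data/raw.
survey_rows = [
    {"country": "AA", "poor": True},
    {"country": "AA", "poor": False},
    {"country": "BB", "poor": True},
    {"country": "BB", "poor": True},
]
# Made-up published rates standing in for a file under data/external.
published_rates = {"AA": 0.40, "BB": 0.90}


def country_poverty_rates(rows):
    """Aggregate raw rows to country-level rates (the interim -> processed step)."""
    counts = defaultdict(lambda: [0, 0])  # country -> [poor count, total count]
    for row in rows:
        counts[row["country"]][0] += int(row["poor"])
        counts[row["country"]][1] += 1
    return {country: poor / total for country, (poor, total) in counts.items()}


def compare_rates(estimated, published):
    """Difference between survey-based and published rates (the analysis step)."""
    return {
        country: estimated[country] - published[country]
        for country in estimated
        if country in published
    }


processed = country_poverty_rates(survey_rows)
differences = compare_rates(processed, published_rates)
```

The external data is only touched in compare_rates, which is what distinguishes this flow from option B.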

Does that help answer your question?

@a3lem
Author

a3lem commented Aug 3, 2018

Thanks for the detailed response! That certainly clears up a lot.

As I understand it, then, the difference between external and raw doesn't relate so much to the question of where the data comes from but more to a (variable) combination of the data's 'function' within and 'specificity' to the project, in addition to the data's origin. Here's a truly humble attempt at visualizing what I mean: =p

             Function?
+-----------------------------------+
|    Central     |    Supporting    |
+----------------+------------------+-----+
|  raw                raw           | Yes |
|                                   |-----|  Project-specific?
|  external           external      | No  |
+-----------------------------------+-----+

Like I said: humble. (It would appear – in this analysis at least – that 'project-specificity' is the winning dimension.)
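Read as a rule of thumb, the table collapses to a single check. A minimal sketch, where the function name and the boolean flag are hypothetical and the rule is only this reading of the table, not anything the project template prescribes:

```python
def choose_data_dir(project_specific: bool) -> str:
    """Pick a data directory from the table above.

    Whether the data is central or supporting does not change the outcome,
    so only project-specificity is taken as input.
    """
    return "data/raw" if project_specific else "data/external"
```

For instance, choose_data_dir(False) would send a generic third-party lookup table to data/external regardless of how central it is to the analysis.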

In any case, I found the following sentence to be particularly helpful, since it really struck a chord.

Storing those in external means at the end of the project we know which data sources were "provided" vs. which ones we found elsewhere.

It's such a simple idea, but I've lost count of how many times I've done the exact opposite, only to be confronted with the ugly consequences when revisiting the project months later.

While every use case is different, I would almost suggest incorporating it into the documentation somehow.

@isms isms changed the title Difference in function between data/external and data/raw? [DOCS] Clarify difference in function between data/external and data/raw Jan 29, 2019
@isms
Contributor

isms commented Jan 29, 2019

Changed title and clarified that this is an easy doc fix.

@pjbull pjbull mentioned this issue Aug 2, 2022
49 tasks