[DOCS] Clarify difference in function between data/external and data/raw #136

a3lem · 2018-08-03T10:15:28Z

Based on the '<-' descriptions in the 'Directory structure' section of the documentation, there doesn't seem to be a clear-cut criterion for choosing between data/external and data/raw in cases where the original data dump originates from a third-party source, i.e., fulfills the conditions for inclusion in either directory.

What sort of criteria do you apply in such ambiguous cases?

On a related note, where does data/external fit in your 'mental model' of the preprocessing pipeline? (Pick one)

A

raw ---> interim ---> processed ---+
                                   |
                                   +---> [ analysis ]
                                   |
                      external ----+

B

raw -------+
           |
           +--> interim ---> processed ---> [ analysis ]
           |
external --+


pjbull · 2018-08-03T15:15:57Z

In practice, we usually default to raw unless there is a clear use case for external. We generally don't restrict raw to one dataset, which means we could put everything in raw. That said, often we're asked to look at a "primary" dataset. Over the course of the project, we find other data sources that are relevant or that we want to look at including. Storing those in external means at the end of the project we know which data sources were "provided" vs. which ones we found elsewhere.

Also, WRT the processing pipeline, I've seen both A and B in practice. It just depends on how much processing external needs for your particular analysis.

For example, if we want to add geographic regions to a dataset, we need shapefiles for those regions. Often these come from third-party sources. We usually put these in external, and they get used early on in the pipeline to augment the raw data (option B). We then may select only specific regions as part of interim -> processed, and these feed into the analysis.
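To make the option B flow concrete, here is a minimal sketch. It uses in-memory stand-ins instead of real files and a plain lookup table instead of shapefiles; the record fields, the `external_regions` mapping, and both function names are made up for illustration and are not part of the project template.

```python
# Hypothetical stand-ins for files that would live in data/raw and data/external.
raw_records = [
    {"household_id": 1, "district": "D1", "income": 120},
    {"household_id": 2, "district": "D2", "income": 80},
    {"household_id": 3, "district": "D9", "income": 95},
]
external_regions = {"D1": "North", "D2": "South"}  # district -> region lookup


def augment_with_region(records, region_lookup):
    """raw + external -> interim: attach a region to every raw record.

    Districts missing from the lookup get region None rather than being dropped.
    """
    return [{**rec, "region": region_lookup.get(rec["district"])} for rec in records]


def select_regions(records, wanted):
    """interim -> processed: keep only the regions of interest."""
    return [rec for rec in records if rec["region"] in wanted]


interim = augment_with_region(raw_records, external_regions)
processed = select_regions(interim, wanted={"North"})
```

Here external enters before interim, so everything downstream already carries the region.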

On the other hand, we've done projects where we do something like comparing published country-level poverty rates to those calculated from a survey. In this case, the raw data in the survey gets aggregated to country-level estimates during interim -> processed. We then directly compare these to the external datasets, which feed straight into the analysis (option A).
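A comparable sketch for the option A flow, again with invented in-memory data: the survey rows, the `external_published` rates, and the function names are assumptions for illustration only. The raw survey is aggregated on its own, and the external figures are only touched at the comparison step.

```python
from collections import defaultdict

# Hypothetical stand-ins for a survey in data/raw and published rates in data/external.
raw_survey = [
    {"country": "AA", "is_poor": True},
    {"country": "AA", "is_poor": False},
    {"country": "BB", "is_poor": True},
    {"country": "BB", "is_poor": True},
]
external_published = {"AA": 0.40, "BB": 0.90}


def poverty_rate_by_country(rows):
    """raw -> processed: share of respondents flagged as poor, per country."""
    counts = defaultdict(lambda: [0, 0])  # country -> [poor, total]
    for row in rows:
        counts[row["country"]][0] += int(row["is_poor"])
        counts[row["country"]][1] += 1
    return {country: poor / total for country, (poor, total) in counts.items()}


def compare(estimated, published):
    """analysis: estimated minus published rate, for countries present in both."""
    return {c: estimated[c] - published[c] for c in estimated if c in published}


processed = poverty_rate_by_country(raw_survey)
differences = compare(processed, external_published)
```

Nothing from external is needed until `compare`, which is why it can skip the interim and processed stages entirely.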

TL;DR: I would recommend everything in raw unless there is a clear internal or "primary" dataset.

Does that help answer your question?

a3lem · 2018-08-03T20:04:46Z

Thanks for the detailed response! That certainly clears up a lot.

As I understand it, then, the difference between external and raw doesn't relate so much to the question of where the data comes from but more to a (variable) combination of the data's 'function' within and 'specificity' to the project, in addition to the data's origin. Here's a truly humble attempt at visualizing what I mean: =p

             Function?
+-----------------------------------+
|    Central     |    Supporting    |
+----------------+------------------+-----+
|  raw                raw           | Yes |
|                                   |-----|  Project-specific?
|  external           external      | No  |
+-----------------------------------+-----+

Like I said: humble. (It would appear – in this analysis at least – that 'project-specificity' is the winning dimension.)
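Written out as code, the matrix above collapses to a one-line rule. This is only a sketch of the table as drawn, not an official rule from the project; the function name and both parameters are made up for illustration.

```python
def choose_data_dir(project_specific: bool, central: bool = False) -> str:
    """Pick a directory according to the matrix above.

    `central` (central vs. supporting function) is accepted only to mirror
    the table; in this reading it never changes the outcome.
    """
    return "data/raw" if project_specific else "data/external"


# Enumerate all four cells of the matrix.
cells = {
    (specific, central): choose_data_dir(specific, central)
    for specific in (True, False)
    for central in (True, False)
}
```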

In any case, I found the following sentence to be particularly helpful, since it really struck a chord.

> Storing those in external means at the end of the project we know which data sources were "provided" vs. which ones we found elsewhere.

It's such a simple idea, but I've lost count of how many times I've done the exact opposite, only to be confronted with the ugly consequences when revisiting the project months later.

While every use case is different, I would almost suggest incorporating it into the documentation somehow.

isms · 2019-01-29T17:58:07Z

Changed title and clarified that this is an easy doc fix.

isms changed the title ~~Difference in function between data/external and data/raw?~~ [DOCS] Clarify difference in function between data/external and data/raw Jan 29, 2019

isms added gg-easy docs labels Jan 29, 2019

pjbull mentioned this issue Aug 2, 2022

[WIP] Version 2 #246

Merged

49 tasks
