diff --git a/.nojekyll b/.nojekyll index 1f9130b..58766e1 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -d7a0f323 \ No newline at end of file +fa893d75 \ No newline at end of file diff --git a/Class Notebooks/01A_EDA.html b/Class Notebooks/01A_EDA.html index 983bb6d..863c62e 100644 --- a/Class Notebooks/01A_EDA.html +++ b/Class Notebooks/01A_EDA.html @@ -8,7 +8,7 @@ -
Welcome to the first workbook for Module 2 of this course, covering Exploratory Data Analysis, or EDA.
In class, we learned that EDA is the process of examining your data to:
The timeline for completing these workbooks will be given on the training website and communicated to you in class. Unlike the Foundations Module workbooks, these workbooks should be completed as homework after we have discussed the material in class.
This workbook will cover both SQL and R coding concepts, so we need to set up our environment to connect to the proper database and load R packages that extend the base R environment. Typically, throughout these workbooks, we use SQL for the majority of data exploration and creation of the analytic frame, and then read that analytic frame into R for the descriptive analysis and visualization.
Note: If you aren’t concerned with the technical setup of this workbook, please feel free to skip ahead to the next section, WI PROMIS data: ds_wi_dwd.promis.
For this code to work, you need to have an .Renviron file in your user folder (i.e. U:\John.Doe.P00002) containing your username and password, where John.Doe.P00002 is replaced with your username and xxxxxxxxxx is replaced with your password (both still in quotes!). The setup of this code is nearly identical to that required in the Foundations Module workspace - however, DBUSER should now end with .T00111 instead of .T00112.
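As a rough sketch of what this setup looks like in practice, the code below assumes the .Renviron file defines DBUSER and DBPASSWD, and that the connection is made with the DBI and odbc packages against a Redshift DSN; the exact driver, DSN name, and connection arguments used in the ADRF may differ from what is shown here.

# load packages beyond base R (DBI/odbc for the database connection, tidyverse for analysis)
library(DBI)
library(odbc)
library(tidyverse)

# pull the credentials stored in the .Renviron file in your user folder
dbuser <- Sys.getenv("DBUSER")
dbpasswd <- Sys.getenv("DBPASSWD")

# connect to Redshift - "Redshift" is a placeholder DSN; use the one provided in class
con <- dbConnect(odbc::odbc(), dsn = "Redshift", uid = dbuser, pwd = dbpasswd)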
WI PROMIS data: ds_wi_dwd.promis
The primary dataset we will use in this class is the Wisconsin PROMIS data. The PROMIS (Program for Measuring Insured Unemployed Statistics) data, stored on Redshift as ds_wi_dwd.promis, provides information on unemployment insurance claimants in Wisconsin. Specifically, according to the LAUS Extraction Guide, the data includes “initial claims, additional initial claims, and continued claims that were either new or updated.”
Using the data dictionary, identify one or more further variables from the PROMIS data that might be relevant to your group’s analysis. Think through what these variables “should” look like, as well as what issues might arise. Working individually or with your group, examine the distribution of these variables. Document any EDA-related concerns and findings in your team’s project template. Brainstorm about what might be causing these issues and how they could impact your analysis.
ds_wi_dwd.ui_wage
We’re now going to apply these same EDA concepts to a second dataset, Wisconsin’s UI wage records, which are stored on Redshift as ds_wi_dwd.ui_wage.
We will keep the narrative of our exploration far briefer in this section. You are encouraged to read through the following output and think about how it pertains to the discussions that we had above.
As with the PROMIS data, use the data dictionary to identify one or more further variables from the UI wage records that might be relevant to your group’s analysis. Think through what these variables “should” look like, as well as what issues might arise. Working individually or with your group, examine the distribution of these variables. Document any EDA-related concerns and findings in your project template. Brainstorm about what might be causing these issues and how they could impact your analysis.
The workbook provides a structure for you to start your EDA process on the data within the scope of your project. The data coverage and row definition for the two primary datasets in this training are available, allowing you to focus on evaluating the distribution of variables potentially relevant to your analysis. The data coverage is particularly essential for project ideas linking the two datasets, as you will want to select a set of years, quarters, and weeks that are available in both datasets.
As you evaluate variable distributions, you can start by repurposing the code in these sections. There are code snippets for distributions of numeric, time-based, and categorical variables that may be appropriate depending on the type of column you are interested in exploring.
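To make this concrete, here is a hedged sketch of the kind of snippet you might repurpose, with a categorical variable tabulated in SQL and a numeric variable summarized in R; the column names gender and weekly_benefit_amount are placeholders, so substitute variables from the data dictionary that matter for your project.

# distribution of a categorical variable (column name is a placeholder)
qry <- "
SELECT gender, COUNT(*) AS n
FROM ds_wi_dwd.promis
GROUP BY gender
ORDER BY n DESC
"
dbGetQuery(con, qry)

# distribution of a numeric variable, summarized after reading into R
qry <- "
SELECT weekly_benefit_amount
FROM ds_wi_dwd.promis
"
wba <- dbGetQuery(con, qry)
summary(wba$weekly_benefit_amount)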
In doing so, as recommended in the checkpoints, note your findings in your team’s project template. As your project progresses, it will be helpful to look back at these notes, especially in thinking through how to most accurately and best communicate your team’s final product to an external audience. Ultimately, the EDA process is an essential step in the project development lifecycle, as it provides helpful contextual information on the variables you may choose to use (or not use) in your analysis.
For all of these steps, remember not to take notes or discuss exact results outside the ADRF. Instead, create notes or output inside the ADRF, and store them either in your U: drive or in your team’s folder on the P: drive. When discussing results with your team, remember to speak broadly, and instead direct them to look at specific findings within the ADRF. And, as always, feel free to reach out to the Coleridge team if you have any questions as you get used to this workflow!
AR EDA Notebook (link to come)
@@ -946,7 +958,7 @@Our next notebook in Module 2 will build off the EDA concepts discussed in the first one, extending the years, quarters, and weeks as part of the data coverage component to a method rooted in a specific moment in time - cross-section analysis. A cross-section allows us to look at a slice of our data in time so we can evaluate the stock of observations, just at that particular snapshot. Through the remainder of the class notebooks, we will apply each topic to the same focused research topic, all aimed at better understanding unemployment to reemployment pathways for a specific set of claimants receiving benefits after COVID-imposed restrictions were lifted in Wisconsin.
Composing a cross-section enables a broad understanding of volume and, in this context, claimant composition. Especially as a workforce board, it can be immensely useful to understand the common characteristics of those receiving UI benefits, regardless of benefit duration, particularly when evaluating workforce alignment scenarios to identify promising job matches between prospective employees and employers.
Cross-section analyses are limited in providing a deep understanding of experiences over time, though, because they track stocks of observations at certain points in time rather than following observations consistently throughout a time period. A different analysis method is more appropriate for a longitudinal study, one that we will introduce in the next notebook. At a minimum, even for those intending to evaluate claimant experiences longitudinally, cross-sections can provide important context.
Here, we will reintroduce the code required to set up our environment to connect to the proper database and load certain packages. If you aren’t concerned with the technical setup of this workbook, please feel free to skip ahead to the next section, Cross-section.
For this code to work, you need to have an .Renviron file in your user folder (i.e. U:\John.Doe.P00002) containing your username and password.
Even though we will eventually build out a longitudinal study for claimants starting to receive UI benefits after COVID-related restrictions ended in the state, starting with a cross-sectional analysis will help us better understand the dynamics of the entire set of individuals receiving UI benefits at this time. Here, we aim to evaluate this stock of claimants in a variety of ways:
If you think a cross-sectional analysis would be helpful for your group project, identify variables, or combinations of variables, you’d like to look into after developing your cross-section. Working individually or with your group, if you end up developing a cross-section, examine the distribution of these variables. Document any concerns and findings in your team’s project template, and think about how you may want to contextualize these findings within your overall project.
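If it helps to see the shape of such a query, the sketch below pulls a single-week snapshot of claimants from the PROMIS data; the week_ending_date and person_id columns and the example date are assumptions, not the actual variable names.

# cross-section: all claim records for one reference week (names and date are placeholders)
qry <- "
SELECT *
FROM ds_wi_dwd.promis
WHERE week_ending_date = '2022-03-26'
"
cross_section <- dbGetQuery(con, qry)

# number of distinct claimants in the snapshot
n_distinct(cross_section$person_id)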
This workbook applies the concepts of a cross-sectional analysis to Wisconsin’s PROMIS data and discusses some of the considerations and potential of such an investigation. Even if your team’s ultimate plan is to perform a longitudinal analysis, a cross-sectional approach may be useful. If your team deems it appropriate to develop a cross-section, you are encouraged to repurpose as much code as possible in developing your initial snapshot and subsequent descriptive analysis.
As you work through your project, it is recommended that you add your thoughts and findings to your team’s project template in the ADRF.
Tian Lou, & Dave McQuown. (2021, March 8). Data Exploration for Cross-sectional Analysis using Illinois Unemployment Insurance Data. Zenodo. https://doi.org/10.5281/zenodo.4588936
Tian Lou, & Dave McQuown. (2021, March 8). Data Visualization using Illinois Unemployment Insurance Data. Zenodo. https://doi.org/10.5281/zenodo.4589040
Census NAICS codes. https://www.census.gov/naics/
@@ -828,12 +840,12 @@Welcome to the second notebook of Module 2 of this course! Here, we will begin the process of cohort creation for our research topic spanning the entire series of class notebooks, which is focused on better understanding unemployment to reemployment pathways for a specific set of claimants receiving benefits after COVID-imposed restrictions were lifted in Wisconsin.
Previously, we applied a cross-sectional analysis to the PROMIS data, which allowed us to better understand the volume of individuals interacting with the unemployment insurance (UI) system at a specific moment in time. Since cross-sections are restricted to particular snapshots and do not account for shocks, though, they are limited in providing a framework for tracking experiences over time.
A separate method is more appropriate for a longitudinal analysis: cohort analysis. In creating a cohort, we will denote a reference point where each member of our cohort experienced a common event - this could be entry into a program, exit from a program, or any other shared experience across a set of observations. With this setup, we can better understand and compare the experiences of those encountering the same policies and economic shocks at the same time, especially across different subgroups.
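As a loose illustration of what an anchor point can look like in code, the sketch below selects individuals whose benefit year began in the last week of Q1 2022, the shared event used later in this notebook series; person_id and benefit_year_start are hypothetical column names, and the exact dates would come from your own cohort decisions.

# define the cohort by a common anchor point (column names and dates are placeholders)
qry <- "
SELECT DISTINCT person_id, benefit_year_start
FROM ds_wi_dwd.promis
WHERE benefit_year_start BETWEEN '2022-03-20' AND '2022-03-26'
"
cohort <- dbGetQuery(con, qry)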
@@ -307,8 +319,8 @@This notebook is concerned with the first step, as we will walk through the decision rules we will use to define a cohort from the raw microdata aimed at helping us answer our research question. The following notebooks will leverage this initial cohort as we build out the rest of the analysis.
Here, we will reintroduce the code required to set up our environment to connect to the proper database and load certain packages. If you aren’t concerned with the technical setup of this workbook, please feel free to skip ahead to the next section, Defining a Cohort.
For this code to work, you need to have an .Renviron file in your user folder (i.e. U:\John.Doe.P00002) containing your username and password.
Before writing code for creating and exploring our cohort, it’s crucial to think through the decisions from a data literacy standpoint. Again, the key idea here is to define a set of individuals with a consistent “anchor point” in the data so we can follow them longitudinally.
First, we have to think through the underlying set of observations we want to track over time and where they exist. Fundamentally, this ties back to identifying our original population of interest.
Now that we have developed our cohort decisions, we can start building out the cohort. We will do this in two steps:
We will cover this calculation, and others, in the upcoming longitudinal analysis notebook.
This workbook covers the conceptual approach for developing an appropriate cohort aimed at informing a specific research topic. As you work with your group, you should be thinking about the decision rules applied in this notebook and their potential pertinence to your research project. Once you define your cohort, you are encouraged to conduct a basic exploration of key subgroups before progressing with your longitudinal analysis, paying close attention to the subgroup counts.
Given that the data application decisions are not finalized, you can expect to receive an update on the translation of these cohort restrictions to the PROMIS data next class.
Tian Lou, & Dave McQuown. (2021, March 8). Data Exploration for Cohort Analysis using Illinois Unemployment Insurance Data. Zenodo. https://doi.org/10.5281/zenodo.4589024
AR Creating a Cohort Notebook (link to come)
@@ -859,12 +871,12 @@Welcome to our third notebook of this module! In this notebook, we will demonstrate how to leverage the results of record linkage and dimensional data modeling to build out an analytic frame necessary for a longitudinal cohort analysis.
In the last notebook, we learned that the first step of cohort analysis is to define its uniting “anchor point” in time, limiting observations to the initial cross-section. While this first step is essential, it doesn’t allow us to follow these individuals over time - which is, after all, the whole point of cohort analysis!
To harness the true power of a cohort analysis, we need to perform some sort of record linkage. As indicated by its name, record linkage is the process of identifying and linking all records - data points - which pertain to the entities of interest in the analysis. Broadly, record linkage allows us to follow our population of interest over time and across different data sources. Remember that at the end of the previous notebook, we started the record linkage process by joining our cohort cross-section back to the full PROMIS dataset to identify additional observations for our cohort members.
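Conceptually, that linkage step is a join of the cohort back to the full claims table; a minimal sketch, assuming the cohort has been saved as a table (my_cohort here is hypothetical) in the tr_wi_2023 training schema and that person_id is the shared identifier, might look like this.

# pull every PROMIS record for cohort members (table and column names are placeholders)
qry <- "
SELECT p.*
FROM ds_wi_dwd.promis p
JOIN tr_wi_2023.my_cohort c
  ON p.person_id = c.person_id
"
cohort_promis <- dbGetQuery(con, qry)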
@@ -301,8 +313,8 @@Here, we will reintroduce the code required to set up our environment to connect to the proper database and load certain packages. If you aren’t concerned with the technical setup of this workbook, please feel free to skip ahead to the next section, Redefining our Cohort.
For this code to work, you need to have an .Renviron file in your user folder (i.e. U:\John.Doe.P00002) containing your username and password.
Although in the last notebook we took a first pass at constructing the cohort for our analysis, we also left you with the following caveat:
“Given that the data application decisions are not finalized, you can expect to receive an update on the translation of these cohort restrictions to the PROMIS data next class.”
@@ -413,8 +425,8 @@Checkpoint
Given our new knowledge of the PROMIS data, do you need to go back and redefine your team’s cohort? Refer to the updated data dictionary, which has a new column, “Face Validity”, providing additional information for each variable.
Now that we have reassembled our cohort based on our new understanding of the PROMIS data, we can revisit the record linkage process with the eventual aim of constructing our final analytic frame.
When attempting to link records, however, there are many potential issues which could arise. For example:
Before proceeding with the rest of our analysis, we will explore this data frame to ensure we understand its construction and how we can best leverage it moving forward.
First, let’s evaluate the distribution of total observations for each member of our cohort in the data model.
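A sketch of that check is shown below: count the fact-table rows per cohort member and then summarize the counts. The fact table name wi_fact_weekly and the person_key column are assumptions standing in for the actual data model objects, and in practice you would first restrict to your own cohort members.

# rows in the fact table per person (object names are placeholders)
qry <- "
SELECT person_key, COUNT(*) AS n_obs
FROM tr_wi_2023.wi_fact_weekly
GROUP BY person_key
"
obs_per_person <- dbGetQuery(con, qry)

# distribution of observation counts across the cohort
summary(obs_per_person$n_obs)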
Are all of the variables you need for your team’s research project available in the data model? Discuss with your team and note any gaps.
In this notebook, we demonstrated how to apply the newly-created class data model to a longitudinal study with an already-developed cohort. Think through the new questions that this linked data model allows you to explore and how they relate to your team’s project. Refer back to the list of tables in our data model, and begin devising a plan for how each may contribute to your analysis. As you are doing so, take special care to think through the varied grains of the benefit and wage data in the fact table, as well as our new findings about the PROMIS data in general, and how they may impact your work in addressing your research question.
We will further this narrative in the next notebook as we continue to develop our descriptive analysis, shifting our focus to the benefit- and employment-based measures we can create using our new analytic frame to develop findings to inform our guiding research topic.
AR 2022 Record Linkage Notebook, Robert McGough, Nishav Mainali, Benjamin Feder, Josh Edelmann (Link to come)
@@ -899,12 +911,12 @@Welcome to Notebook 4 of Module 2! Up to this point in the course, most of our work with the Wisconsin data has been focused on data preparation and project scoping, culminating in the development of our analytic frame in last week’s notebook. In this notebook, we will bridge the gap between this project scoping work and the actual process of longitudinal analysis by developing the measures that will serve as our primary outcomes of interest.
As you’ve learned, when we are analyzing administrative data not developed for research purposes, it is important to create new measures that will help us answer our policy-relevant questions. When we say “measure”, we usually mean a person-level variable that we can use to compare outcomes for individuals in our cohort. Creating measures at the person level allows us to compare outcomes for different subgroups of individuals based on their characteristics and experiences.
Here, we will demonstrate how to create several measures to describe our cohort members’ UI experience and subsequent workforce outcomes. While your group may choose to generate different measures based on your research question, the code displayed here should provide a good starting place for thinking about how to best create and analyze person-level measures.
As in previous notebooks, we will reintroduce the code required to set up our environment to connect to the proper database and load certain packages. If you aren’t concerned with the technical setup of this workbook, please feel free to skip ahead to the next section, Loading our analytic frame.
For this code to work, you need to have an .Renviron file in your user folder (i.e. U:\John.Doe.P00002) containing your username and password.
We can recreate our analytic frame dataset from the prior notebook by using SQL joins to filter the fact table to only include our cohort members.
qry <- "
@@ -357,8 +369,8 @@ qry Loading our analytic frame
For further details about the analytic frame, please refer back to notebook 03_record_linkage.
The first set of measures we will construct is aimed at capturing aspects of our cohort members’ experience with the UI benefit system. Again, each of these measures is person-level - for each measure, we want to distill the wealth of information available in our analytic frame into a single outcome per individual that we can compare across subgroups of our cohort.
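For instance, one such benefit-side measure might be the total number of weeks each person claimed benefits; a rough sketch, assuming the analytic frame loaded above has a person_key identifier and a claimed_week indicator (both placeholder names), is below.

# one person-level measure: total weeks claimed (column names are placeholders)
weeks_claimed <- analytic_frame %>%
  group_by(person_key) %>%
  summarize(total_weeks_claimed = sum(claimed_week == 1, na.rm = TRUE), .groups = "drop")

head(weeks_claimed)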
Since our analytic frame also includes variables describing employment experiences, we can develop measures focused on our cohort’s past and future employment relative to the benefit year in question.
Conveniently, because our cohort definition identifies individuals who started their benefit year in the last week of Q1 2022, any employment in subsequent rows (remember to aggregate by quarter!) reflects employment post-UI entry. In these examples, we will restrict the employment data to within three quarters of UI program entry.
To do so, we will create a handy reference table below, which will also track the quarter relative to entry.
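A sketch of what that reference table could look like, assuming UI entry in Q1 2022 and a three-quarter follow-up window, is shown here; the relative_quarter value of 0 marks the entry quarter.

# reference table mapping calendar quarters to quarters relative to UI entry
rel_quarters <- tibble(
  year = 2022,
  quarter = 1:4,
  relative_quarter = 0:3
)

rel_quarters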
@@ -759,12 +771,12 @@Hopefully, by this point in the notebook, you have been inspired to apply some of these measures to your own cohort and overall project. You are encouraged to use the base code available in this notebook, and adapt and apply it to your own work. In the realm of unemployment to reemployment trajectories, there is a wealth of potential measures that can be created by linking the PROMIS and UI wage records, and we encourage you to think through the different ways you might be able to create new measures and proxies to help answer your primary research question.
AR Measurement Notebook (link to come)
WI 2023 Record Linkage Notebook, Roy McKenzie, Benjamin Feder, Joshua Edelmann (citation to be added)
@@ -1008,12 +1020,12 @@Welcome to Notebook 5 of Module 2! At this point in our notebook series, we have built out our descriptive analysis and are now thinking about our findings and how to appropriately convey them. For outputs deemed best displayed in an image, we may have started on some initial plots in ggplot2, largely relying on its base functionality. Here, we will show you different ways you can leverage the powerful ggplot2 package to create presentation- and publication-quality data visualizations from our descriptive analysis. We will also discuss different visualization options based on the type of the analysis.
We will cover the following visualizations in this notebook:
As in previous notebooks, we will reintroduce the code required to set up our environment to connect to the proper database and load certain packages. If you aren’t concerned with the technical setup of this workbook, please feel free to skip ahead to the next section, Loading our analytic frame.
For this code to work, you need to have an .Renviron file in your user folder (i.e. U:\John.Doe.P00002) containing your username and password.
As we did in the previous notebook, we can recreate our analytic frame by using SQL joins to filter the fact table to only include our cohort members.
qry <- "
@@ -344,8 +356,8 @@ qry Loading our analytic frame
analytic_frame <- dbGetQuery(con, qry)
This initial section is quite technically-focused. If you’d like, you can skip to the Density plot subsection.
Recall the structure of traditional ggplot2 syntax:
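In outline, a ggplot2 call starts from a data frame, maps variables to aesthetics with aes(), adds one or more geoms, and finishes with optional labels and themes, each layered on with +; the variable names in the sketch below are placeholders.

# generic ggplot2 structure: data + aesthetics + geom(s) + labels/theme
ggplot(data = analytic_frame, aes(x = some_variable)) +
  geom_histogram() +
  labs(
    title = "Descriptive title that highlights the finding",
    x = "Variable of interest",
    y = "Count"
  ) +
  theme_minimal()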
Of your findings, which ones are most suitable to visualization? Why? Are there additional updates you would like to make to any of these plots?
Although this notebook is quite technical and focused on final outputs, it can still be useful as you are producing your descriptive analysis. In particular, this notebook provides a variety of display options, and you should think about the best choice and design for exhibiting your findings. You can start by creating the base plot and thinking about an ideal title, so you can adjust aspects of the graph to highlight your findings for the audience. At a minimum, it will be helpful for the business-oriented members of your team if you reuse the ggsave() code and save preliminary plots early and often, so they can provide their input on the direction of the analysis.
Additionally, we recommend revisiting this notebook as you begin preparing to export your final tables and graphs from the ADRF, so you can apply layering updates to ensure your exports are ready for your final presentation and report. There are many other ggplot2 layer aspects we did not cover in this notebook; thankfully, there are many open-source posts and examples for you to draw from as well.
Kamil Slowikowski (2021). ggrepel: Automatically Position Non-Overlapping Text Labels with ‘ggplot2’. R package version 0.9.1. https://CRAN.R-project.org/package=ggrepel
Pedersen, T. L. (2022, August 24). Make your ggplot2 extension package understand the new linewidth aesthetic [web log]. Retrieved July 28, 2023, from https://www.tidyverse.org/blog/2022/08/ggplot2-3-4-0-size-to-linewidth/.
Tian Lou, & Dave McQuown. (2021, March 8). Data Visualization using Illinois Unemployment Insurance Data. Zenodo. https://doi.org/10.5281/zenodo.4589040
@@ -1120,12 +1132,12 @@Investigating the demand side of the labor market can help us understand the different types of employers within it. The majority of the research on labor market outcomes emphasizes the role of the employee (labor market supply). While this is important, understanding the employer’s role is also critical for developing employment outcomes.
In the previous notebooks, we used descriptive statistics to analyze employment outcomes for our cohort. The goal of this notebook is now to demonstrate how we can leverage descriptive statistics for the purpose of characterizing labor demand and better contextualizing opportunities for employment by job sector. This will allow us to understand the types of employers individuals in our cohort are employed by and their relationship to our outcome measures, as well as recognize in-demand industry trends in Wisconsin.
As in previous notebooks, we will reintroduce the code required to set up our environment to connect to the proper database and load certain packages. If you aren’t concerned with the technical setup of this workbook, please feel free to skip ahead to the next section, Employer-side Analysis.
For this code to work, you need to have an .Renviron file in your user folder (i.e. U:\John.Doe.P00002) containing your username and password.
An individual in our cohort may have multiple employers of focus - their previous one(s) before claiming UI benefits, and subsequent one(s) upon reemployment. Here, we will provide separate examples focusing on these different employers, and their relationship with some of the outcome measures developed in the Measurement workbook.
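As one illustration of the employer-side view, the sketch below tallies cohort members by the industry sector of their post-entry employment; the naics_sector and relative_quarter columns are assumed names, and your analytic frame may store this information differently.

# cohort members by industry sector of post-UI-entry employment (column names are placeholders)
analytic_frame %>%
  filter(relative_quarter > 0, !is.na(naics_sector)) %>%
  distinct(person_key, naics_sector) %>%
  count(naics_sector, sort = TRUE)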
Shifting gears, with demand, we can also look at the quantity of job openings by employer characteristic. There are many sources for tracking job postings, one of which is Opportunity Insights’ job postings data from Lightcast, which was formerly known as Burning Glass Technologies.
We can see that the outlook appears to be relatively positive for those previously in manufacturing, for example, in terms of future job availability.
This notebook is all about potential analyses - if you work through the concepts covered in the previous notebook, your project should be more than good enough. However, if you feel intrigued by the possibility of including either one of these types of analyses, whether it is of employer characteristics or job postings, we encourage you to use it to supplement your analysis.
At the very least, even if you don’t incorporate this work into your project, we hope you are inspired to consider a demand-focused analysis in the future, either on its own or as a supplement to one focusing on potential employees.
Garner, Maryah, Nunez, Allison, Mian, Rukhshan, & Feder, Benjamin. (2022). Characterizing Labor Demand with Descriptive Analysis using Indiana’s Temporary Assistance for Needy Families Data and UI Wage Data. https://doi.org/10.5281/zenodo.7459656
Job postings data from Lightcast, aggregated by Opportunity Insights.
“The Economic Impacts of COVID-19: Evidence from a New Public Database Built Using Private Sector Data”, by Raj Chetty, John Friedman, Nathaniel Hendren, Michael Stepner, and the Opportunity Insights Team. November 2020. Available at: https://opportunityinsights.org/wp-content/uploads/2020/05/tracker_paper.pdf
@@ -985,12 +997,12 @@This workbook provides information on how to prepare research output for disclosure control. It outlines how to prepare different kinds of outputs before submitting an export request and gives an overview of the information needed for disclosure review. Please read through the entire workbook because it will separately discuss different types of outputs that will be flagged in the disclosure review process.
We will apply the Wisconsin export rules to the following files in this workbook:
When exporting results, there are 3 items to be concerned with:
Export file(s): this is the file you wish to export. This file needs to be disclosure-proofed; we will eventually walk through those steps in this notebook, first introducing them to you in the next section
As in previous workbooks, we will reintroduce the code required to set up our environment to connect to the proper database and load certain packages. If you are not concerned with the technical setup of this workbook, please feel free to skip ahead to the next section, Loading our analytic frame.
Since we will be adapting tables and visuals we have created in past notebooks that mostly relied on the same underlying analytic frame, we will recreate it and read it into R first.
qry <- "
@@ -485,8 +497,8 @@ qry Loading our analytic frame
analytic_frame <- dbGetQuery(con, qry)
The first file we will prepare for export is a table containing future claims by employment growth created in the Characterizing Demand notebook. In reality, the output development and disclosure review preparation are done in tandem. However, for simplicity, we will do this in separate steps, as we have already generated the initial output file.
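The heart of disclosure proofing a count-based table is checking every cell against the required minimum size; the sketch below suppresses small cells, assuming a hypothetical threshold of 10 and a table with a person-count column, so be sure to check the actual Wisconsin export rules and your table's structure before reusing it.

# suppress counts below an assumed minimum cell size (threshold and names are placeholders)
min_cell <- 10

export_table_redacted <- export_table %>%
  mutate(n_people = ifelse(n_people < min_cell, NA, n_people))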
Our second file to export is a bar plot showing the exit counts by week for our cohort in 2022. We initially created this bar plot in the Visualization notebook.
We will remind you of how to save this final plot at the end of the notebook.
Our third file to prepare for export will build off of the line plot from the Visualization notebook. The line plot in that notebook depicted average wages over time; here, we are going to pivot slightly and show median wages over time.
We’ll save this figure at the end of the notebook.
For our final export file we will be disclosure-proofing the heatmap from the visualization notebook, which displays counties by their UI claim rate at a specific point in time.
Note that with the redaction rules, the counties with the five highest claim rates are slightly different than those noted prior to applying the disclosure controls.
In this section, we provide examples of different techniques for exporting our presentation-ready plots. We can use ggsave() to save our visuals in png, jpeg, and pdf formats without losing quality, demonstrating saving as each file type on the final plots.
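A brief sketch of that pattern, assuming a finished plot object named final_plot and placeholder file names, is below; dimensions and resolution are illustrative.

# save the same presentation-ready plot in several formats (names and sizes are placeholders)
ggsave("exit_counts_by_week.png", plot = final_plot, width = 8, height = 5, dpi = 300)
ggsave("exit_counts_by_week.jpeg", plot = final_plot, width = 8, height = 5, dpi = 300)
ggsave("exit_counts_by_week.pdf", plot = final_plot, width = 8, height = 5)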
This notebook may appear to be overwhelming, but the majority of the code has been copied from previous notebooks to recreate the final tables and graphs. Focus your attention on the disclosure rules and procedures applied to each output, as this provides useful information and code techniques to apply to a variety of outputs. We recommend saving all output early so your team members can provide a fresh set of eyes on all the final files to ensure that all rules have been appropriately applied.
Additionally, we recommend revisiting this notebook as you begin disclosure proofing your final tables and graphs so you can ensure your exports are ready for your final presentation and report.
VDC 2022 Presentation Preparation Notebook, Joshua Edelmann and Benjamin Feder (citation to be added)
WI 2023 Characterizing Labor Demand Notebook, Roy McKenzie, Benjamin Feder (citation to be added)
WI 2023 Data Visualization Notebook, Corey Sparks, Benjamin Feder, Roy McKenzie, and Joshua Edelmann (citation to be added)
@@ -1323,12 +1335,12 @@This supplemental notebook covers record linkage and creating a linked data model to facilitate longitudinal analyses.
Analyses involving administrative data often require:
This notebook will introduce and demonstrate some helpful techniques for linking administrative data while mediating the above issues. The output of the notebook should provide a flexible and performant framework that meets the needs of most projects and can be easily customized to include additional variables or characteristics.
The linked data assets documented in this notebook have already been completely created and loaded in the tr_wi_2023 schema as tables beginning with a “wi” prefix. This notebook will not create or load duplicative copies of the linked dataset, but rather cover the techniques used to construct and load the model and hopefully serve as a resource to use when building future linked datasets.
Here, we will reintroduce the code required to set up our environment to connect to the proper database and load certain packages. If you aren’t concerned with the technical setup of this workbook, please feel free to skip ahead to the next section, Record linkage and Dimensional Modeling.
For this code to work, you need to have an .Renviron file in your user folder (i.e. U:\John.Doe.P00002) containing your username and password.
Record linkage is an important component of almost any analysis - unless you have a fictitious, perfectly curated dataset with no messiness or missing variables - and it is especially important when linking administrative records. Unlike survey data, which allows for perfectly selected variables with some potential for messiness, administrative data is tailored to administrative purposes, not academic ones. That means that we will not have all of the variables we ideally want, and it also means that the data can be messy (either missing responses or containing variables that we may not quite understand or have at our disposal). While we may not directly address missing responses (more on indirectly addressing this in the inference lecture), we can enrich our data set by pulling in relevant information from other sources.
To facilitate easy and performant analysis of very large record sets (quarterly wages, PROMIS file), we will be formatting the data in a dimensional model. This type of model:
Unlike reference data that is consistent across states (NAICS, SOC), master data refer to the unique collection of persons, employers, or households served by each state. A state can have many different references to the same real-world entity, and mastering is the process of assembling a set that has one member (record) for each unique instance of an entity in the real world.
This master record can merge attributes from multiple sources, resulting in a “golden record” with a higher completeness than is available in individual sources. When multiple references to the same entity have different values, those differences are resolved through a process called survivorship in which decisions are made about which value to keep (most recent, most frequent, highest quality source, etc.).
In our example, due to the messy nature of administrative data, there are individuals whose gender, race, ethnicity, and birth date values change over time, and even within the same case. First, let’s check how many individuals this concerns.
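One way to frame that check is to count people whose demographic values take more than one distinct value across their records; the SQL sketch below assumes placeholder column names (person_id, gender, race, ethnicity, birth_date) rather than the true PROMIS fields.

# individuals with inconsistent demographic values across records (column names are placeholders)
qry <- "
SELECT COUNT(*) AS n_inconsistent
FROM (
  SELECT person_id
  FROM ds_wi_dwd.promis
  GROUP BY person_id
  HAVING COUNT(DISTINCT gender) > 1
      OR COUNT(DISTINCT race) > 1
      OR COUNT(DISTINCT ethnicity) > 1
      OR COUNT(DISTINCT birth_date) > 1
) AS changed
"
dbGetQuery(con, qry)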
@@ -561,8 +573,8 @@The fact table stores the actual observations (facts) of interest. Since this table often contains large numbers of records, it will ideally be comprised of a small number of bytes per row and primarily consist of indexed foreign keys to dimension tables and observation-specific measures. This allows for storage of large records sets with low storage cost and high query performance (extremely helpful for supporting dashboards).
In this example, the fact table is at the grain of one row per person per week. We will create a record for every week between the first and last observations of a person for both the employment and PROMIS data sets, regardless of employment or PROMIS participation in a given week. These “missing” observation weeks are materialized because unemployment and non-participation may be just as interesting for some analyses, and longitudinal analysis benefits from consistent representation across time periods of a consistent grain.
Some of our cohort members have observations for multiple employers in a single quarter. Since our unit of analysis is the person, not the person-employer combination, we need to resolve these one-to-many relationships into a single observation while retaining the information pertinent to analysis. In this example, the primary employer and associated wages were identified and recorded based on the employer with the largest wages in the quarter. In order to minimize loss of potentially relevant information, the total wages and number of employers are also included on each observation.
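One common way to express that logic is with window functions: rank each person's employers within a quarter by wages, keep the top-ranked row as the primary employer, and carry along the quarterly totals; the sketch below uses placeholder column names for the UI wage records.

# primary employer per person-quarter, defined by largest wages (column names are placeholders)
qry <- "
SELECT person_id, year, quarter,
       ui_account AS primary_employer,
       wages      AS primary_wages,
       total_wages,
       n_employers
FROM (
  SELECT person_id, year, quarter, ui_account, wages,
         SUM(wages)   OVER (PARTITION BY person_id, year, quarter) AS total_wages,
         COUNT(*)     OVER (PARTITION BY person_id, year, quarter) AS n_employers,
         ROW_NUMBER() OVER (PARTITION BY person_id, year, quarter ORDER BY wages DESC) AS wage_rank
  FROM ds_wi_dwd.ui_wage
) ranked
WHERE wage_rank = 1
"
primary_employment <- dbGetQuery(con, qry)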
@@ -690,8 +702,8 @@McGough, R., et.al., Spring 2022 Applied Data Analytics Training, Arkansas Work-Based Learning to Workforce Outcomes, Linked Dataset Construction for Longitudinal Analysis
Abowd, et. al., The LEHD Infrastructure Files and the Creation of the Quarterly Workforce Indicators, 2006 (https://lehd.ces.census.gov/doc/technical_paper/tp-2006-01.pdf).
Kimball, R., & Ross, M. (2019). The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling, Ed. Wiley.
@@ -936,12 +948,12 @@This supplemental notebook provides a demonstration of how we can build employer-level characteristics, at the yearly grain, from the Unemployment Insurance (UI) wage records dataset. Our final output from this notebook is a permanent table with employer-level information aggregated to the calendar year for each employer with at least 5 employees in Wisconsin that appears in its UI wage records.
We will start by loading necessary packages not readily available in the base R setup.
@@ -314,8 +326,8 @@Introduction
For this code to work, you need to have an .Renviron file in your user folder (i.e. U:\John.Doe.P00002) containing your username and password.
We will define each employer as a unique ui_account value in the UI wage records, developing the following measures for each ui_account:
Firm characteristics
Now that we have our aggregations and growth rates calculated, we will combine these into a single yearly aggregation table. We need to create our start and end strings of the query and then we will paste these strings together.
string = "
@@ -876,8 +888,8 @@ string Aggregation to the Calendar Year
# DBI::dbExecute(con, qry)
Feder, Benjamin, Garner, Maryah, Nunez, Allison, & Mian, Rukhshan. (2022, December 19). Creating Supplemental Employment Measures using Indiana’s Unemployment Insurance Wage Records. Zenodo. https://doi.org/10.5281/zenodo.7459730
@@ -1120,12 +1132,12 @@This supplemental notebook focuses on linking the NAICS-employer crosswalk with the data model, particularly the fact table. A similar procedure can be followed for matching the crosswalk with the UI wage records table.
We will start by loading necessary packages not readily available in the base R setup.
@@ -303,8 +315,8 @@Introduction
For this code to work, you need to have an .Renviron file in your user folder (i.e. U:\John.Doe.P00002) containing your username and password.
Before linking the crosswalk with additional employer information available in other tables, it is helpful to identify any potential discrepancies that may affect the quality of the linkage. Here, we will investigate the columns we plan to use in our join to ensure consistency between the sources.
Employer information is available in all three data sources - PROMIS, UI Wage Records, and of course, the NAICS crosswalk. We’ll start with the data we’re already using.
Note that there are some UI account numbers in the crosswalk with more than 6 digits, excluding leading zeroes. Although it would be theoretically possible to simply link on the last six digits of the UI account numbers, there may be different employers with the same last six digits, resulting in inaccurate joins. Therefore, the UI account numbers with more than six digits will not join to any of the employer-level information in the other files. That being said, the employer information in the PROMIS data already contains NAICS codes, so we’re really just focused on joining to the employers in the wage records.
Given the information we’ve learned about the various columns storing employer UI account numbers in the different tables, the recommended process for joining the information in the crosswalk table to the fact (and UI wage records too) is as follows:
Cast ui_account_root_number to an integer (a sketch of this join follows below).
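To illustrate that first step, the sketch below casts the crosswalk's ui_account_root_number to an integer inside the join condition; the crosswalk table name, the naics_code column, and the wage-record column names are assumptions, so adjust them to the actual objects in the schema.

# join the NAICS crosswalk to the UI wage records, casting the account number first
# (table and column names other than ui_account_root_number are placeholders)
qry <- "
SELECT w.*, x.naics_code
FROM ds_wi_dwd.ui_wage w
JOIN ds_wi_dwd.naics_crosswalk x
  ON w.ui_account = CAST(x.ui_account_root_number AS INTEGER)
"
wages_with_naics <- dbGetQuery(con, qry)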