- `etl` - Main project that contains the session enrichment logic.
- `etl-local` - Auxiliary project to run jobs in Spark Standalone mode.
- `data-generator` - Project that generates input logs for `SessionLogsEnrichmentJob`.
- `me.vitaly.etl.jobs.SessionLogsEnrichmentJob` - the main job, which takes raw logs and previously handled session logs as input and saves new session logs in directories partitioned by year, month, and day.
- `me.vitaly.etl.runners.SessionLogsEnrichmentJobRunner` - the runner of `SessionLogsEnrichmentJob`, which:
  - Validates that the input logs have not already been processed.
  - Calculates the files to process based on configs and input parameters.
  - Marks files as processed after the job is finished.
- `me.vitaly.etl.jobs.SessionLogsEnrichmentJobTest` - parameterized unit tests that check common and edge cases.
- `etl/src/main/resources/application.conf` - the configuration file.
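
The three runner steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: `ProcessedFilesRegistry`, `runEnrichment`, the `inputDir` parameter, and the in-memory set are all hypothetical names and assumptions, and the real runner presumably reads its settings from `application.conf` and persists the processed markers somewhere durable.

```kotlin
// Hypothetical in-memory registry of processed files (an assumption for this sketch).
class ProcessedFilesRegistry {
    private val processed = mutableSetOf<String>()

    fun isProcessed(path: String): Boolean = path in processed

    fun markProcessed(paths: Collection<String>) {
        processed.addAll(paths)
    }
}

// Hypothetical runner: selects files, validates them, runs the job, marks them as processed.
fun runEnrichment(
    inputDir: String,                 // stands in for a value taken from the configs
    candidates: List<String>,         // stands in for the input parameters
    registry: ProcessedFilesRegistry,
    job: (List<String>) -> Unit
): List<String> {
    // Calculate the files to process based on the config value and the inputs.
    val toProcess = candidates.filter { it.startsWith(inputDir) }

    // Validate that none of the selected files has already been processed.
    val alreadyDone = toProcess.filter { registry.isProcessed(it) }
    require(alreadyDone.isEmpty()) { "Already processed: $alreadyDone" }

    // Run the job, and mark the files as processed only after it has finished.
    job(toProcess)
    registry.markProcessed(toProcess)
    return toProcess
}
```

Marking the files only after `job` returns means a failed run leaves them unmarked, so they can be retried.
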
- Run `me.vitaly.etl.generator.DataGeneratorKt.main` from the `data-generator` project to generate logs to `data/raw/year=2021/month=04/day=14`. Note that this is a naive generator without any session logic.
- Run `me.vitaly.etl.local.SessionLogsLocalRunnerKt.main` from the `etl-local` project. It runs `SessionLogsEnrichmentJob` using the config `etl/src/main/resources/application.conf` and the data from the `data` folder on Spark Standalone. The results should be saved to `data/processed/year=2021/month=04/day=13`.
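
The `year=.../month=.../day=...` layout used in the paths above can be derived from a date as in the sketch below. The `partitionPath` helper is a hypothetical name introduced only for illustration; the project may build these paths differently.

```kotlin
import java.time.LocalDate

// Hypothetical helper: builds a partition directory path with zero-padded month and day.
fun partitionPath(root: String, date: LocalDate): String {
    val month = date.monthValue.toString().padStart(2, '0')
    val day = date.dayOfMonth.toString().padStart(2, '0')
    return "$root/year=${date.year}/month=$month/day=$day"
}

fun main() {
    // prints data/raw/year=2021/month=04/day=14
    println(partitionPath("data/raw", LocalDate.of(2021, 4, 14)))
}
```
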