
2317/Copy data from grants-db into the opportunity table(s) in the analytics db #3228

Merged Dec 18, 2024 · 64 commits
6c79061
rough func to copy data between grants-db and analytics-db
babebe Dec 6, 2024
f23baea
make command
babebe Dec 9, 2024
f0a7513
add to data_migration_blueprint
babebe Dec 9, 2024
d27d289
change analytics db listening port
babebe Dec 9, 2024
f4506c9
add analytics db properties to api local.env
babebe Dec 9, 2024
df979ec
add public schema to Schemas class
babebe Dec 9, 2024
335173d
update copy_data_from_grants_db_to_s3
babebe Dec 9, 2024
7b6387e
test files
babebe Dec 11, 2024
1419aae
create command to run opp data func
babebe Dec 12, 2024
4c93f0c
create s3 config
babebe Dec 12, 2024
4263a16
function to read and write opp data
babebe Dec 12, 2024
855a948
delete old func
babebe Dec 12, 2024
984d15d
rm change on api
babebe Dec 12, 2024
e01a461
rm schema change
babebe Dec 12, 2024
c81ad44
rm command
babebe Dec 12, 2024
443b6fa
cleanup
babebe Dec 12, 2024
740cf3d
move test files
babebe Dec 13, 2024
a2e350f
move to constants
babebe Dec 13, 2024
3492eee
add aws fixitures
babebe Dec 13, 2024
f1bdc67
cleanup
babebe Dec 13, 2024
dba6cc7
add s3 env vars
babebe Dec 13, 2024
35fc857
update s3 config
babebe Dec 13, 2024
5daf912
clean up
babebe Dec 13, 2024
ffbe7c5
fixture for test db
babebe Dec 14, 2024
508a435
wrapper func for etldb
babebe Dec 14, 2024
e4eaf35
cleanup
babebe Dec 16, 2024
4bbebc5
add moto dependecy
babebe Dec 16, 2024
ac4f968
cleanup
babebe Dec 16, 2024
e312509
update test
babebe Dec 16, 2024
d56b3d6
updated poetry.lock
babebe Dec 16, 2024
3c23fe3
merge main
babebe Dec 16, 2024
6b480e3
move moto to dev dep
babebe Dec 16, 2024
82c13fd
clean up
babebe Dec 16, 2024
5f90e16
no need for schema spec
babebe Dec 16, 2024
8505b0c
simplify main function, remove wrapper func
babebe Dec 16, 2024
2aebe5d
edit doc string
babebe Dec 16, 2024
5a2efef
clean up
babebe Dec 16, 2024
5264900
add package smart-open
babebe Dec 16, 2024
1c45c41
rm patch host
babebe Dec 16, 2024
b12032e
adding doc strings
babebe Dec 16, 2024
a56db5b
fix formats
babebe Dec 16, 2024
36aceba
Merge branch 'main' of https://github.com/HHS/simpler-grants-gov into…
babebe Dec 16, 2024
c147488
make format lint fix
babebe Dec 17, 2024
218b649
revert db_host
babebe Dec 17, 2024
4c5f216
cleanup
babebe Dec 17, 2024
46b3e6c
Merge branch 'main' of https://github.com/HHS/simpler-grants-gov into…
babebe Dec 17, 2024
e98cdd7
fix schema mpatching
babebe Dec 17, 2024
3e63ef5
format
babebe Dec 17, 2024
5244876
update path prefix
babebe Dec 17, 2024
ed56b1f
added LOAD_OPPORTUNITY_DATA_FILE_PATH and removed bucket and path prefix
babebe Dec 17, 2024
a7a0425
replace generic s3 config to specif to opp data and move to main code
babebe Dec 17, 2024
03a0de7
update upload path
babebe Dec 17, 2024
d7477f6
format
babebe Dec 17, 2024
deb6980
update with fixiture to del records after each test
babebe Dec 17, 2024
c4a8089
fix sql query
babebe Dec 17, 2024
ab220aa
Merge branch 'main' of https://github.com/HHS/simpler-grants-gov into…
babebe Dec 17, 2024
34dbdb9
rm dup test
babebe Dec 17, 2024
13ab8db
Revert "rm dup test"
babebe Dec 17, 2024
df3148d
format lint
babebe Dec 17, 2024
0341699
dbhost
babebe Dec 17, 2024
398b7ab
Merge branch 'main' of https://github.com/HHS/simpler-grants-gov into…
babebe Dec 18, 2024
6fb60da
rm fixture add func to trunc records first
babebe Dec 18, 2024
690e424
Merge branch 'main' of https://github.com/HHS/simpler-grants-gov into…
babebe Dec 18, 2024
16216e7
Merge branch 'main' of https://github.com/HHS/simpler-grants-gov into…
babebe Dec 18, 2024
6 changes: 6 additions & 0 deletions analytics/Makefile
@@ -150,6 +150,12 @@ db-migrate:
	$(POETRY) analytics etl db_migrate
	@echo "====================================================="

opportunity-load:
	@echo "=> Ingesting opportunity data into the database"
	@echo "====================================================="
	$(POETRY) analytics etl opportunity-load
	@echo "====================================================="

gh-transform-and-load:
	@echo "=> Transforming and loading GitHub data into the database"
	@echo "====================================================="
6 changes: 6 additions & 0 deletions analytics/local.env
@@ -52,3 +52,9 @@ LOG_FORMAT=human-readable

# Change the message length for the human readable formatter
# LOG_HUMAN_READABLE_FORMATTER__MESSAGE_WIDTH=50

############################
# S3
############################

LOAD_OPPORTUNITY_DATA_FILE_PATH=s3://local-opportunities/public-extracts
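As a note for reviewers: the loader reads this variable and appends `<table>.csv` for each table it copies. A minimal sketch of the URI construction (scheme written lower-case, since `s3://` is the conventional form; bucket and prefix taken from the local example above):

```python
import os

# Mirrors the local.env value above.
os.environ["LOAD_OPPORTUNITY_DATA_FILE_PATH"] = "s3://local-opportunities/public-extracts"

base_path = os.environ["LOAD_OPPORTUNITY_DATA_FILE_PATH"]
table = "opportunity"  # one of the five opportunity tables
uri = f"{base_path}/{table}.csv"  # what smart_open.open() ultimately receives
```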
194 changes: 193 additions & 1 deletion analytics/poetry.lock

Large diffs are not rendered by default.

4 changes: 3 additions & 1 deletion analytics/pyproject.toml
@@ -22,10 +22,11 @@ python = "~3.13"
slack-sdk = "^3.23.0"
typer = { extras = ["all"], version = "^0.15.0" }
sqlalchemy = "^2.0.30"
psycopg = ">=3.0.7"
pydantic-settings = "^2.3.4"
boto3 = "^1.35.56"
boto3-stubs = "^1.35.56"
psycopg = "^3.2.3"
smart-open = "^7.0.5"

[tool.poetry.group.dev.dependencies]
black = "^24.0.0"
@@ -36,6 +37,7 @@ pytest-cov = "^5.0.0"
ruff = "^0.8.0"
safety = "^3.0.0"
types-requests = "^2.32.0.20241016"
moto = "^5.0.22"

[build-system]
build-backend = "poetry.core.masonry.api"
9 changes: 9 additions & 0 deletions analytics/src/analytics/cli.py
@@ -18,6 +18,9 @@
from analytics.etl.utils import load_config
from analytics.integrations import etldb, slack
from analytics.integrations.db import PostgresDbClient
from analytics.integrations.extracts.load_opportunity_data import (
    extract_copy_opportunity_data,
)
from analytics.logs import init as init_logging
from analytics.logs.app_logger import init_app
from analytics.logs.ecs_background_task import ecs_background_task
@@ -301,3 +304,9 @@ def transform_and_load(

    # finish
    print("transform and load is done")


@etl_app.command(name="opportunity-load")
def load_opportunity_data() -> None:
    """Grab data from the S3 bucket and load it into the opportunity tables."""
    extract_copy_opportunity_data()
6 changes: 6 additions & 0 deletions analytics/src/analytics/integrations/extracts/__init__.py
@@ -0,0 +1,6 @@
"""
We use this package to load opportunity data from s3.

It extracts CSV files from S3 bucket and loads the records into respective
opportunity tables.
"""
101 changes: 101 additions & 0 deletions analytics/src/analytics/integrations/extracts/constants.py
@@ -0,0 +1,101 @@
"""Holds all constant values."""

from enum import StrEnum


class OpportunityTables(StrEnum):
    """Opportunity tables that are copied over to the analytics database."""

    LK_OPPORTUNITY_STATUS = "lk_opportunity_status"
    LK_OPPORTUNITY_CATEGORY = "lk_opportunity_category"
    OPPORTUNITY = "opportunity"
    OPPORTUNITY_SUMMARY = "opportunity_summary"
    CURRENT_OPPORTUNITY_SUMMARY = "current_opportunity_summary"


LK_OPPORTUNITY_STATUS_COLS = (
    "OPPORTUNITY_STATUS_ID",
    "DESCRIPTION",
    "CREATED_AT",
    "UPDATED_AT",
)

LK_OPPORTUNITY_CATEGORY_COLS = (
    "OPPORTUNITY_CATEGORY_ID",
    "DESCRIPTION",
    "CREATED_AT",
    "UPDATED_AT",
)

OPPORTUNITY_COLS = (
    "OPPORTUNITY_ID",
    "OPPORTUNITY_NUMBER",
    "OPPORTUNITY_TITLE",
    "AGENCY_CODE",
    "OPPORTUNITY_CATEGORY_ID",
    "CATEGORY_EXPLANATION",
    "IS_DRAFT",
    "REVISION_NUMBER",
    "MODIFIED_COMMENTS",
    "PUBLISHER_USER_ID",
    "PUBLISHER_PROFILE_ID",
    "CREATED_AT",
    "UPDATED_AT",
)

OPPORTUNITY_SUMMARY_COLS = (
    "OPPORTUNITY_SUMMARY_ID",
    "OPPORTUNITY_ID",
    "SUMMARY_DESCRIPTION",
    "IS_COST_SHARING",
    "IS_FORECAST",
    "POST_DATE",
    "CLOSE_DATE",
    "CLOSE_DATE_DESCRIPTION",
    "ARCHIVE_DATE",
    "UNARCHIVE_DATE",
    "EXPECTED_NUMBER_OF_AWARDS",
    "ESTIMATED_TOTAL_PROGRAM_FUNDING",
    "AWARD_FLOOR",
    "AWARD_CEILING",
    "ADDITIONAL_INFO_URL",
    "ADDITIONAL_INFO_URL_DESCRIPTION",
    "FORECASTED_POST_DATE",
    "FORECASTED_CLOSE_DATE",
    "FORECASTED_CLOSE_DATE_DESCRIPTION",
    "FORECASTED_AWARD_DATE",
    "FORECASTED_PROJECT_START_DATE",
    "FISCAL_YEAR",
    "REVISION_NUMBER",
    "MODIFICATION_COMMENTS",
    "FUNDING_CATEGORY_DESCRIPTION",
    "APPLICANT_ELIGIBILITY_DESCRIPTION",
    "AGENCY_CODE",
    "AGENCY_NAME",
    "AGENCY_PHONE_NUMBER",
    "AGENCY_CONTACT_DESCRIPTION",
    "AGENCY_EMAIL_ADDRESS",
    "AGENCY_EMAIL_ADDRESS_DESCRIPTION",
    "IS_DELETED",
    "CAN_SEND_MAIL",
    "PUBLISHER_PROFILE_ID",
    "PUBLISHER_USER_ID",
    "UPDATED_BY",
    "CREATED_BY",
    "CREATED_AT",
    "UPDATED_AT",
    "VERSION_NUMBER",
)

CURRENT_OPPORTUNITY_SUMMARY_COLS = (
    "OPPORTUNITY_ID",
    "OPPORTUNITY_SUMMARY_ID",
    "OPPORTUNITY_STATUS_ID",
    "CREATED_AT",
    "UPDATED_AT",
)

MAP_TABLES_TO_COLS: dict[OpportunityTables, tuple[str, ...]] = {
    OpportunityTables.LK_OPPORTUNITY_STATUS: LK_OPPORTUNITY_STATUS_COLS,
    OpportunityTables.LK_OPPORTUNITY_CATEGORY: LK_OPPORTUNITY_CATEGORY_COLS,
    OpportunityTables.OPPORTUNITY: OPPORTUNITY_COLS,
    OpportunityTables.OPPORTUNITY_SUMMARY: OPPORTUNITY_SUMMARY_COLS,
    OpportunityTables.CURRENT_OPPORTUNITY_SUMMARY: CURRENT_OPPORTUNITY_SUMMARY_COLS,
}
79 changes: 79 additions & 0 deletions analytics/src/analytics/integrations/extracts/load_opportunity_data.py
@@ -0,0 +1,79 @@
# pylint: disable=invalid-name, line-too-long
"""Loads opportunity tables with opportunity data from S3."""

import logging
import os
from contextlib import ExitStack

import smart_open # type: ignore[import]
from pydantic import Field
from pydantic_settings import BaseSettings
from sqlalchemy import Connection

from analytics.integrations.etldb.etldb import EtlDb
from analytics.integrations.extracts.constants import (
MAP_TABLES_TO_COLS,
OpportunityTables,
)

logger = logging.getLogger(__name__)


class LoadOpportunityDataFileConfig(BaseSettings):
    """Configure S3 properties for opportunity data."""

    load_opportunity_data_file_path: str | None = Field(
        default=None,
        alias="LOAD_OPPORTUNITY_DATA_FILE_PATH",
    )


def extract_copy_opportunity_data() -> None:
    """Instantiate EtlDb and call helpers to truncate and insert data in one transaction."""
    etldb_conn = EtlDb()

    with etldb_conn.connection() as conn, conn.begin():
        _truncate_opportunity_table_records(conn)

        _fetch_insert_opportunity_data(conn)

    logger.info("Extract opportunity data completed successfully")


def _truncate_opportunity_table_records(conn: Connection) -> None:
    """Truncate existing records from all opportunity tables."""
    cursor = conn.connection.cursor()
    schema = os.environ["DB_SCHEMA"]
    for table in OpportunityTables:
        cursor.execute(f"TRUNCATE TABLE {schema}.{table} CASCADE")
    logger.info("Truncated all records from all opportunity tables")


def _fetch_insert_opportunity_data(conn: Connection) -> None:
    """Stream opportunity tables from S3 and insert the records into the database."""
    s3_config = LoadOpportunityDataFileConfig()

    cursor = conn.connection.cursor()
    for table in OpportunityTables:
        logger.info("Copying data for table: %s", table)

        columns = MAP_TABLES_TO_COLS.get(table, ())
        query = f"""
            COPY {os.getenv('DB_SCHEMA')}.{table} ({', '.join(columns)})
            FROM STDIN WITH (FORMAT CSV, DELIMITER ',', QUOTE '"', HEADER)
        """

        with ExitStack() as stack:
            file = stack.enter_context(
                smart_open.open(
                    f"{s3_config.load_opportunity_data_file_path}/{table}.csv",
                    "r",
                ),
            )
            copy = stack.enter_context(cursor.copy(query))

            while data := file.read():
                copy.write(data)

        logger.info("Successfully loaded data for table: %s", table)
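A reviewer-facing sketch of the streaming pattern in this file, with the S3 file object and the psycopg `copy` context replaced by in-memory stand-ins so the chunked pump loop can be seen in isolation:

```python
import io


def stream_copy(src, sink, chunk_size: int = 8192) -> None:
    """Pump a readable source into a writable sink chunk by chunk,
    mirroring the `while data := file.read(): copy.write(data)` loop."""
    while data := src.read(chunk_size):
        sink.write(data)


csv_text = 'OPPORTUNITY_ID,OPPORTUNITY_TITLE\n1,"Example grant"\n'
sink = io.StringIO()  # stand-in for cursor.copy(...)
stream_copy(io.StringIO(csv_text), sink)  # stand-in for smart_open.open(...)
```

Streaming in bounded chunks keeps memory flat regardless of extract size, which is why the loop reads repeatedly instead of calling `file.read()` once.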
18 changes: 18 additions & 0 deletions analytics/src/analytics/integrations/extracts/s3_config.py
@@ -0,0 +1,18 @@
"""Configuration for S3."""

import boto3
import botocore


def get_s3_client(
    session: boto3.Session | None = None,
    boto_config: botocore.config.Config | None = None,
) -> botocore.client.BaseClient:
    """Return an S3 client."""
    if boto_config is None:
        boto_config = botocore.config.Config(signature_version="s3v4")

    if session is not None:
        return session.client("s3", config=boto_config)

    return boto3.client("s3", config=boto_config)