
GTC-2618 Allow appending layers to existing tables #525

Merged Jul 1, 2024 (25 commits)

Commits
ebfe539
Merge pull request #499 from wri/develop
solomon-negusse May 1, 2024
086ffc7
Disable expire_s3_objects call in delete_all_assets (used by delete_v…
danscales May 7, 2024
2d2a5be
Merge pull request #511 from wri/gtc-2708d
danscales May 8, 2024
6ce63f9
Merge pull request #520 from wri/develop
jterry64 May 16, 2024
3d1cd5b
Make name field optional in User model
jterry64 May 16, 2024
fcbf56a
Merge pull request #521 from wri/gtc-2833/optional_name
jterry64 May 16, 2024
218b11f
GTC-2618 Allow appends of GPKG layers to table
manukala6 May 20, 2024
c9ee67d
increase production fargate task size
solomon-negusse May 27, 2024
65c496c
Add tiles_info endpoint
dmannarino May 22, 2024
0219171
GTC-2618 Fix tests
manukala6 Jun 5, 2024
5d3817d
Merge branch 'develop' into feature/vector_layer_appends
manukala6 Jun 5, 2024
cda9016
GTC-2618 Fix tests
manukala6 Jun 6, 2024
ebf854c
GTC-2618 Fix test creation options
manukala6 Jun 10, 2024
ddb66d8
GTC-2618 Fix test creation options
manukala6 Jun 11, 2024
13d1145
GTC-2618 Add exception for datapump appends
manukala6 Jun 12, 2024
a5509f1
GTC-2618 Remove missing layers test
manukala6 Jun 13, 2024
4edd3b4
Merge branch 'develop' into feature/vector_layer_appends
manukala6 Jun 13, 2024
4e52f20
GTC-2618 Clean up version append
manukala6 Jun 21, 2024
957023e
GTC-2618 Remove source_driver requirement
manukala6 Jun 24, 2024
0608757
GTC-2618 Add tests for append
manukala6 Jun 26, 2024
b414ea1
GTC-2618 Simplify append logic
manukala6 Jun 26, 2024
5013b66
GTC-2618 Update tests_v2
manukala6 Jun 27, 2024
5c478d1
Merge branch 'develop' into feature/vector_layer_appends
manukala6 Jun 27, 2024
63ae482
Merge branch 'develop' into feature/vector_layer_appends
manukala6 Jul 1, 2024
f75a142
Merge branch 'develop' into feature/vector_layer_appends
manukala6 Jul 1, 2024
2 changes: 1 addition & 1 deletion app/models/pydantic/authentication.py
@@ -16,7 +16,7 @@ class SignUpRequestIn(StrictBaseModel):

class User(BaseModel):
id: str
name: str
name: Optional[str]
email: EmailStr
createdAt: datetime
role: str
6 changes: 2 additions & 4 deletions app/models/pydantic/creation_options.py
@@ -226,10 +226,8 @@ class VectorSourceCreationOptions(StrictBaseModel):
def validate_source_uri(cls, v, values, **kwargs):
if values.get("source_driver") == VectorDrivers.csv:
assert len(v) >= 1, "CSV sources require at least one input file"
else:
assert (
len(v) == 1
), "Non-CSV vector sources require one and only one input file"
elif values.get("source_driver") in [VectorDrivers.esrijson, VectorDrivers.shp, VectorDrivers.geojson_seq, VectorDrivers.geojson]:
assert (len(v) == 1), "GeoJSON and ESRI Shapefile vector sources require one and only one input file"
Member:

I think you're intending for this change to allow multiple input files for GPKG, but is that actually supported in the code? Maybe I'm looking in the wrong place, but in vector_source_assets.py it looks like we currently only make use of the first source_uri (perhaps because of the issues that might arise from specifying layers for multiple files?): https://github.com/wri/gfw-data-api/blob/master/app/tasks/vector_source_assets.py#L254

Member Author:

This wasn't for allowing multiple input files, rather its for updating the version creation options after the append operation is successful (there would be more than 1 source_uri in this case)

Member:

Okay, I see. But would it have the side effect of now allowing multiple GPKG sources to be specified, though those after the first will be silently ignored?

return v


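For reference, the per-driver file-count rule in the validator above can be restated as a standalone function. This is a sketch only: the driver strings below are assumptions standing in for the `VectorDrivers` enum values used in the real code.

```python
from typing import List

# Hypothetical driver names; the actual code compares against the
# VectorDrivers enum, not plain strings.
SINGLE_FILE_DRIVERS = {"ESRIJSON", "ESRI Shapefile", "GeoJSONSeq", "GeoJSON"}


def validate_source_uri(source_driver: str, source_uri: List[str]) -> List[str]:
    if source_driver == "CSV":
        assert len(source_uri) >= 1, "CSV sources require at least one input file"
    elif source_driver in SINGLE_FILE_DRIVERS:
        assert (
            len(source_uri) == 1
        ), "GeoJSON and ESRI Shapefile vector sources require one and only one input file"
    # Other drivers (e.g. GPKG) fall through unchecked, so a version whose
    # creation options accumulate multiple URIs after an append still validates.
    return source_uri
```

This is exactly the side effect the reviewer asks about: GPKG now skips the single-file assertion entirely.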
12 changes: 10 additions & 2 deletions app/models/pydantic/versions.py
@@ -1,7 +1,8 @@
from typing import List, Optional, Tuple, Union

from pydantic import BaseModel, Field
from pydantic import BaseModel, Field, validator

from ..enum.creation_options import VectorDrivers
from ..enum.versions import VersionStatus
from .base import BaseRecord, StrictBaseModel
from .creation_options import SourceCreationOptions
@@ -59,7 +60,14 @@ class VersionUpdateIn(StrictBaseModel):

class VersionAppendIn(StrictBaseModel):
source_uri: List[str]

source_driver: Optional[VectorDrivers] = Field(
None, description="Driver of source file. Must be an OGR driver"
)
layers: Optional[List[str]] = Field(
None,
description="List of layer names to append to version. "
"If not set, all layers in source_uri will be appended.",
)
Member:

It looks to me like if layers is not specified it will be assumed that there is only a layer named like the file (though I may be looking in the wrong place in the code): https://github.com/wri/gfw-data-api/blob/master/app/tasks/vector_source_assets.py#L209-L214

Member Author:

Good catch, I corrected the description. Layer names are required for .GDB and .GPKG, otherwise the file name is used as the layer name.


class VersionResponse(Response):
data: Version
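Given the `VersionAppendIn` model above, a request body for the append endpoint would look roughly like the following (bucket, file, and layer names here are hypothetical; `source_driver` must be a `VectorDrivers` value):

```json
{
  "source_uri": ["s3://my-bucket/new_data.gpkg"],
  "source_driver": "GPKG",
  "layers": ["layer_a", "layer_b"]
}
```

Per the author's comment, `layers` is required for .GDB and .GPKG sources; for other drivers the file name is used as the layer name when `layers` is omitted.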
28 changes: 25 additions & 3 deletions app/routes/assets/asset.py
@@ -25,12 +25,12 @@

# from fastapi import APIRouter, BackgroundTasks, Depends, HTTPException, Path, status
from fastapi.responses import ORJSONResponse
from starlette.responses import JSONResponse
from starlette.responses import JSONResponse, RedirectResponse

from app.models.pydantic.responses import Response
from app.settings.globals import API_URL
from ..datasets.downloads import _get_presigned_url

from ...authentication.token import is_admin
from ...crud import assets
from ...crud import metadata as metadata_crud
from ...crud import tasks
@@ -67,7 +67,7 @@
delete_static_vector_tile_cache_assets,
)
from ...utils.paginate import paginate_collection
from ...utils.path import infer_srid_from_grid
from ...utils.path import infer_srid_from_grid, split_s3_path
from ..assets import asset_response
from ..tasks import paginated_tasks_response, tasks_response

@@ -310,6 +310,28 @@ async def get_extent(asset_id: UUID = Path(...)):
return ExtentResponse(data=extent)


@router.get(
"/{asset_id}/tiles_info",
response_class=RedirectResponse,
tags=["Assets"],
status_code=307,
)
async def get_tiles_info(asset_id: UUID = Path(...)):
asset: ORMAsset = await assets.get_asset(asset_id)

if asset.asset_type != AssetType.raster_tile_set:
raise HTTPException(
status_code=status.HTTP_400_BAD_REQUEST,
detail="Tiles information only available for raster tile sets"
)

bucket, asset_key = split_s3_path(asset.asset_uri)
tiles_geojson_key = asset_key.replace("{tile_id}.tif", "tiles.geojson")
presigned_url = await _get_presigned_url(bucket, tiles_geojson_key)

return RedirectResponse(url=presigned_url)
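The key derivation in this endpoint is a plain string substitution on the asset URI. A minimal sketch, assuming `split_s3_path` (whose body is not shown in this diff) returns a `(bucket, key)` pair for an `s3://` URI; the example URI is hypothetical:

```python
from urllib.parse import urlparse


def split_s3_path(s3_path: str):
    # Assumed behavior of app.utils.path.split_s3_path: split an s3:// URI
    # into its bucket and key components.
    parsed = urlparse(s3_path)
    return parsed.netloc, parsed.path.lstrip("/")


bucket, asset_key = split_s3_path(
    "s3://gfw-data-lake/some_dataset/v1/raster/epsg-4326/10/40000/default/gdal-geotiff/{tile_id}.tif"
)
# Raster tile set URIs end in a {tile_id}.tif template; the tiles.geojson
# index lives alongside the tiles at the same prefix.
tiles_geojson_key = asset_key.replace("{tile_id}.tif", "tiles.geojson")
```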


@router.get(
"/{asset_id}/stats",
response_class=ORJSONResponse,
40 changes: 37 additions & 3 deletions app/routes/datasets/versions.py
@@ -13,6 +13,7 @@
from copy import deepcopy
from typing import Any, Dict, List, Optional, Sequence, Tuple, Union, cast
from urllib.parse import urlparse
import fiona

from fastapi import (
APIRouter,
@@ -211,8 +212,7 @@ async def update_version(
"/{dataset}/{version}/append",
response_class=ORJSONResponse,
tags=["Versions"],
response_model=VersionResponse,
deprecated=True,
response_model=VersionResponse
)
async def append_to_version(
*,
@@ -240,13 +240,36 @@ async def append_to_version(
# For the background task, we only need the new source uri from the request
input_data = {"creation_options": deepcopy(default_asset.creation_options)}
input_data["creation_options"]["source_uri"] = request.source_uri

# If source_driver is "text", this is a datapump request
if input_data["creation_options"]["source_driver"] != "text":
manukala6 marked this conversation as resolved.
# Verify that source_driver is not None
if input_data["creation_options"]["source_driver"] is None:
solomon-negusse (Member), Jun 17, 2024:

The pydanic model for appending should be updated to take in creation_options (or update_options), right? That'd be the right place to validate input like so:

raise HTTPException(
status_code=400,
detail="Source driver must be specified for non-datapump requests."
)

# Append the new layers to the existing ones
if input_data["creation_options"].get("layers") is None: # ERROR: layers is not defined
Member:

Maybe meant this to be:
if input_data["creation_options"].get("layers") is None and request.layers is not None?

Another way to simplify the whole logic could be:

if request.layers:
    if input_data["creation_options"].get("layers") is None:
        input_data["creation_options"]["layers"] = request.layers
    else:
        input_data["creation_options"]["layers"] += request.layers

Another alternative is to set the default layers value to [], but it would be good to check that doesn't introduce breaking changes.

Member Author:

Thanks for the input, I refactored it similarly.

input_data["creation_options"]["layers"] = request.layers
elif request.layers is not None:
input_data["creation_options"]["layers"] += request.layers
else:
input_data["creation_options"]["layers"] = request.layers
Collaborator:

I don't understand this one. The original dataset does have a layers value, so why are you then replacing input_data["creation_options"]["layers"] with the request.layers value which is equal to None? It seems like you want to either have an error if request.layers is None or leave the current layers the same, right? Or can you add a comment here on why you are setting layers to None?

Member Author:

input_data["creation_options"] is the body for the append POST request, so we want to exclude the layers from the version creation options for that operation. Ln 268-76 are updating the creation options object for the version itself. I added comments to reflect this.
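The thread above converges on merging the request's layers into the stored creation options only when both sides are well-defined. One way to express that merge as a dependency-free helper (a sketch; the actual route mutates the creation-options dicts in place rather than calling a function like this):

```python
from typing import List, Optional


def merge_layers(
    existing: Optional[List[str]], requested: Optional[List[str]]
) -> Optional[List[str]]:
    # No new layers requested: keep whatever the version already has.
    if requested is None:
        return existing
    # Version had no explicit layers: the request's layers become the list.
    if existing is None:
        return list(requested)
    # Both present: append, never add None to a list (the bug flagged above).
    return existing + requested
```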


background_tasks.add_task(
append_default_asset, dataset, version, input_data, default_asset.asset_id
)

# We now want to append the new uris to the existing ones and update the asset
update_data = {"creation_options": deepcopy(default_asset.creation_options)}
update_data["creation_options"]["source_uri"] += request.source_uri
update_data["creation_options"]["source_uri"] += request.source_uri # ERROR: only one source_uri is allowed
manukala6 marked this conversation as resolved.
if input_data["creation_options"].get("layers") is not None:
if update_data["creation_options"]["layers"] is not None:
update_data["creation_options"]["layers"] += request.layers
Collaborator:

If request.layers is None (which you check for explicitly at line 256 above), then you will get a Python error here when you try add None (which is not a list) to an existing list.

Member Author:

I refactored this section to account for this. None is now not being added to a list.

else:
update_data["creation_options"]["layers"] = request.layers
manukala6 marked this conversation as resolved.
await assets.update_asset(default_asset.asset_id, **update_data)

version_orm: ORMVersion = await versions.get_version(dataset, version)
@@ -536,6 +559,17 @@ async def _version_response(

return VersionResponse(data=Version(**data))

#def _verify_layer_exists(source_uri: List[str], layers: List[str]) -> None:
#    with fiona.open(source_uri[0].replace("s3://", "/vsizip//vsis3/"), "r") as src:
#        available_layers = src.layer_names
#    for layer in layers:
#        if layer not in available_layers:
#            raise HTTPException(
#                status_code=400,
#                detail=f"Layer {layer} not found in source file."
#            )
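The commented-out helper above shadows its own loop variable, so the membership check could never fail. A working equivalent factors the check out of the I/O: in the route, the available layers would come from something like `fiona.listlayers(source_uri[0].replace("s3://", "/vsizip//vsis3/"))`, while the comparison itself needs no file access (a sketch; the real code would raise `fastapi.HTTPException` rather than `ValueError`):

```python
from typing import List


def verify_layers_exist(
    available_layers: List[str], requested_layers: List[str]
) -> None:
    # available_layers: layer names actually present in the source file
    # requested_layers: layer names the append request asks for
    missing = [l for l in requested_layers if l not in available_layers]
    if missing:
        # Stand-in for HTTPException(status_code=400, detail=...) to keep
        # this sketch dependency-free.
        raise ValueError(f"Layers not found in source file: {missing}")
```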
manukala6 marked this conversation as resolved.

def _verify_source_file_access(sources: List[str]) -> None:

2 changes: 2 additions & 0 deletions terraform/vars/terraform-production.tfvars
@@ -5,6 +5,8 @@ rw_api_url = "https://api.resourcewatch.org"
desired_count = 2
auto_scaling_min_capacity = 2
auto_scaling_max_capacity = 15
fargate_cpu = 2048
fargate_memory = 4096
Collaborator:

I guess this is OK, but somehow you are including hotfixes from master into your change?

lambda_analysis_workspace = "default"
key_pair = "dmannarino_gfw"
new_relic_license_key_arn = "arn:aws:secretsmanager:us-east-1:401951483516:secret:newrelic/license_key-CyqUPX"
3 changes: 3 additions & 0 deletions tests/__init__.py
@@ -49,6 +49,9 @@
SHP_NAME = "test.shp.zip"
SHP_PATH = os.path.join(os.path.dirname(__file__), "fixtures", SHP_NAME)

GPKG_NAME = "test.gpkg.zip"
GPKG_PATH = os.path.join(os.path.dirname(__file__), "fixtures", GPKG_NAME)

BUCKET = "test-bucket"
PORT = 9000

3 changes: 3 additions & 0 deletions tests/conftest.py
@@ -48,6 +48,8 @@
GEOJSON_NAME2,
GEOJSON_PATH,
GEOJSON_PATH2,
GPKG_NAME,
GPKG_PATH,
PORT,
SHP_NAME,
SHP_PATH,
@@ -308,6 +310,7 @@ def copy_fixtures():
s3_client.upload_file(CSV2_PATH, BUCKET, CSV2_NAME)
s3_client.upload_file(TSV_PATH, BUCKET, TSV_NAME)
s3_client.upload_file(SHP_PATH, BUCKET, SHP_NAME)
s3_client.upload_file(GPKG_PATH, BUCKET, GPKG_NAME)
s3_client.upload_file(APPEND_TSV_PATH, BUCKET, APPEND_TSV_NAME)

# upload a separate for each row so we can test running large numbers of sources in parallel
Binary file added tests/fixtures/test.gpkg.zip
5 changes: 3 additions & 2 deletions tests/routes/datasets/test_versions.py
@@ -7,7 +7,7 @@

from app.settings.globals import S3_ENTRYPOINT_URL
from app.utils.aws import get_s3_client
from tests import BUCKET, DATA_LAKE_BUCKET, SHP_NAME
from tests import BUCKET, DATA_LAKE_BUCKET, SHP_NAME, GPKG_NAME
from tests.conftest import FAKE_INT_DATA_PARAMS
from tests.tasks import MockCloudfrontClient
from tests.utils import (
@@ -326,7 +326,8 @@ async def test_invalid_source_uri(async_client: AsyncClient):

# Test appending to a version that exists
response = await async_client.post(
f"/dataset/{dataset}/{version}/append", json={"source_uri": source_uri}
f"/dataset/{dataset}/{version}/append",
json={"source_uri": source_uri, "source_driver": "ESRI Shapefile"}
)
assert response.status_code == 400
assert response.json()["status"] == "failed"