Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a function to pick important features by grid #2001

Merged
merged 17 commits into from
Dec 10, 2021

Conversation

iandees
Copy link
Member

@iandees iandees commented Nov 17, 2021

For #1999

This adds a new transform function that filters point features by placing them into a grid and then only including the most important features in each grid cell.

TODO:

  • Update tests
  • Update docs
  • How do we decide what is important in each grid cell?
  • How do we want to handle stuff that isn't a point and doesn't match the items_matching? Should they be removed entirely or passed along without modification?

@nvkelso
Copy link
Member

nvkelso commented Nov 18, 2021

How do we decide what is important in each grid cell?

This should be a config, but for locality points in the places layer it should be ORDER DESC min_zoom, collision_rank, population.

  • min_zoom captures the feature's kind_detail value rank ordered of OSM place=* tags.
  • collision_rank uses population_rank buckets and makes sure capitals (country and region) collide out other localities in the same population_rank bucket
  • population is the raw OSM population value, or backfilled estimate... so if features are tied in the first 2 sorts (a lot will), population is the final arbitrator.

How do we want to handle stuff that isn't a point and doesn't match the items_matching? Should they be removed entirely or passed along without modification?

I imagine this to limit it's work to subset of features based on property matching, like items_matching: { kind: locality }

I'm fine to limit it to point geometries.

If the filter conditions aren't met, then there the features aren't touched (meaning they stay in tiles passed along without modification).

@iandees
Copy link
Member Author

iandees commented Nov 19, 2021

Status on this:

  • The keep_n_features_gridded function is written and has unit tests to show that it culls dense points, keeping the top N places in each bucket.
    • Features that don't match items_matching will pass through unchanged
    • Features that are not points will pass through unchanged
    • The top places are determined by sorting the value of the properties named in sorting_keys

Three major pieces remain:

  • The sorting is entirely "descending", but we need to support both ascending and descending. In other words, we want to sort by min_zoom ASC (min_zoom of 4.0 is better than min_zoom of 13.0) and then by population DESC (bigger cities are better than small cities). This is more complicated than my single call to sorted() allows.
    • I was thinking that if the sorting_keys element has a - (minus symbol) in front of the name we could treat it as DESC otherwise default to ASC.
    • There is a failing unit test (test_points_keep_1_multisort_minzoom) that covers this case.
  • I wasn't able to craft a change to the YAML that represents what Nathaniel's asking for yet.
  • I wasn't able to add integration tests using data from Overpass yet.

@nvkelso
Copy link
Member

nvkelso commented Nov 19, 2021

I wasn't able to add integration tests using data from Overpass yet.

Test data can be found in:

Which has 375 populated points that cover the area of the 7/113/50 tile (in 512 px coordinates) gathered from Overpass and processed in QGIS, copied out as GeoJSON, and reformatted with some regex to fit the expected dsl format.

This is around 1/2 of the features I see in the actual tile itself, possibly because each feature is being duplicated in the vector tile? For example, I see two Tokyo's in the tile MVT each with a different ID but with otherwise same properties. Not sure where that is coming from, but the upside should be we remove the dup's with this grid function anyhow.

But we should probably also sort on ID at the end to make it predictable which one is kept.

Pattern for how to write the test in:

@nvkelso
Copy link
Member

nvkelso commented Nov 19, 2021

I wasn't able to craft a change to the YAML that represents what Nathaniel's asking for yet.

Probably we'd add new logic to queries.yaml just above where we rank neighbourhoods to exercise the new function:

I imagine the syntax would be similar to what we do for keep_n_features:

This is around 1/2 of the features I see in the actual tile itself, possibly because each feature is being duplicated in the vector tile? For example, I see two Tokyo's in the tile MVT each with a different ID but with otherwise same properties. Not sure where that is coming from, but the upside should be we remove the dup's with this grid function anyhow.

There's also a function to remove duplicate features we might want to run before running the grid function? Though that is indiscriminate within a search radius and the gridding will likely take care of it for us anyhow.

@nvkelso
Copy link
Member

nvkelso commented Nov 21, 2021

The sorting is entirely "descending", but we need to support both ascending and descending. In other words, we want to sort by min_zoom ASC (min_zoom of 4.0 is better than min_zoom of 13.0) and then by population DESC (bigger cities are better than small cities). This is more complicated than my single call to sorted() allows.
I was thinking that if the sorting_keys element has a - (minus symbol) in front of the name we could treat it as DESC otherwise default to ASC.

The intercut and overlap functions already assume an order, and take a reverse boolean when it should be sorted the other direction (eg ASC versus DESC):

  - fn: vectordatasource.transform.intercut
    params:
      base_layer: roads
      cutting_layer: landuse
      attribute: kind
      target_attribute: landuse_kind
      cutting_attrs: { sort_key: 'sort_rank', reverse: True }

@iandees
Copy link
Member Author

iandees commented Dec 2, 2021

Ok, was able to make some progress on this today.

  • The keep_n_features_gridded function now supports reversible multi-sort for numeric property values. It will raise an exception if it encounters a value that is non-numeric when trying to reverse-sort it. The syntax for the sorting_keys config matches the cutting_attrs example in this comment
  • I added a configuration to queries.yaml that filters dense locality kinds in the places layer. I used values that seemed to make sense to me, but we can definitely change that.
  • I added an integration test using the data provided above (thanks!). It tests that at zoom 8 we filter out places and that Tokyo wins out over nearby, smaller features. I also check that the same nearby place is present on a more zoomed-in tile.

@nvkelso
Copy link
Member

nvkelso commented Dec 3, 2021

Can you post a new tile here, that matches the tile coords in #1999 (comment) please?

@iandees
Copy link
Member Author

iandees commented Dec 3, 2021

Here's an image based on the integration test data you gave me:

image

Red dots: Data from the integration test (sourced here) querying 8/227/100 before adding keep_n_features_gridded to queries.yaml.

Green dots: The same data and tile after adding keep_n_features_gridded to queries.yaml.

Basically, the red dots in the screenshot here are the ones that are thinned out. Green dots remain in the tile after calling keep_n_features_gridded.

@iandees
Copy link
Member Author

iandees commented Dec 3, 2021

Something is probably off on my bucket calculation here, because this is what it looks like when I use grid_size: 2:

image

There should be a maximum of 2 x 2 = 4 points here. Blue square is the boundary of the 8/227/100 tile.

@iandees
Copy link
Member Author

iandees commented Dec 6, 2021

Ah, it's correctly outputting 4 kind: locality features (based on the "items_matching" clause) and including 3 kind: region features that didn't match.

@iandees
Copy link
Member Author

iandees commented Dec 6, 2021

Based on this and some other fiddling, I think the code is doing the right thing, we just need to tweak the queries.yaml to match what we want the output to look like. Let me know what would make iterating on that easier, @nvkelso.

@nvkelso
Copy link
Member

nvkelso commented Dec 6, 2021

Good find on the region points being mixed in there with the locality points!

What we do for the other layers is a progressive generalization per zoom... so at zoom x it might be grid size of Y, like with 3 transform configs:

zoom 8 start, zoom 9 end, grid size 4
zoom 9 start, zoom 11 end, grid size 8
zoom 11 start, zoom 13 end, grid size 16

Adjusted to taste.

We might also want to play with different size multiples, like 3 instead of 4:

zoom 8 start, zoom 9 end, grid size 3
zoom 9 start, zoom 11 end, grid size 6
zoom 11 start, zoom 13 end, grid size 12

@iandees
Copy link
Member Author

iandees commented Dec 6, 2021

Here are two sets of 4 GeoJSONs in a ZIP zooming into Tokyo from 8/227/100 to 11/1819/806. One is for grid sizes multiples of 4, another for multiples of 3. My intuition is saying that the multiple of 4 lets too many through, and 3 is slightly better.

Note that they are still in epsg:3857, but at least they have the properties visible in QGIS now.

@nvkelso
Copy link
Member

nvkelso commented Dec 7, 2021

Reviewing the samples (there is confusion around 512 px tile sizes versus 256 here, beware!) , I propose:

don't forget zoom 7 for 256px logical for when OSM comes in!
zoom 8 start, zoom 9 end, grid size 4 (is this in 512 space?)
zoom 9 start, zoom 12 end, grid size 6
zoom 10 start, zoom 12 end, grid size 8
no zoom 12 config, because by that time it seems like the locality features are well spaced out
consider neighbourhood gridding at higher zooms, see below

For the examples, we start out with, and end up with:

zoom 512 coord orig_places new_places orig_localities new_localities
8 8-227-100 207 19 202 16
9 9-454-201 102 35 100 34
10 10-909-403 24 21 *23 21
11 11-1819-806 23 **9 9 9

Now what tile size does this config assume? I think it's targeting 512 pixel sized tiles (please confirm)... but it should be 1/2'd for 256 px tiles (since the dimensions are 1/2 in width and height)? But Tilezen is 256 logical pizel size, so should all those be offset by 1 and halved by default, and if the tilesize is 512 then the grid multiplied by 2?

When I review 512 sized zooms 8, 9, and 10 I see a mix of region and locality points. But when I review zoom 11, I only see locality points after the transform, all the neighbourhood points are being dropped? Or is that the way the tests were constructed?

I reviewed Tokyo "zoom 10" (512 logical zoom or zoom 11 at 256 px) being the last zoom seems fine. Same in Paris, London, New York, and other regions. But in Beijing there are dense locality points at 512 px zoom 11 so I think keep the config to logical zoom 11 (stop zoom 12).

* NOTE: This should be 24 but it's 23 instead. Rounding problems vis-a-vis the test data?
** NOTE: This should probably be be 23 with 13 neighbourhoods. Probably because the test data doesn't include neighbourhoods?

The 2nd note raises a question if we should also thin the neighbourhood placetype at zoom 11 (ahem 12).

We already add a tile_kind_rank but looks like we don't then drop features but we aren't keep_n_features where Bubble Wrap only uses the first 8. Alternatively the data should be cleaned up in Japan... reviewing USA and Europe this isn't a problem because the Who's On First data has been curated quite a bit more there. To guard against no curation (like in Japan) I can see adding:

Either keep_n_features (but really the max_items should vary a little by zoom):

  - fn: vectordatasource.transform.keep_n_features
    params:
      source_layer: places
      start_zoom: 12
      end_zoom: 15
      items_matching: { kind: [neighbourhood, microhood, macrohood] }
      max_items: 8

Or better yet grid thinning like (modify grid_size to taste):

  - fn: vectordatasource.transform.keep_n_features_gridded
    params:
      source_layer: places
      start_zoom: 12
      end_zoom: 15
      items_matching: { kind: [neighbourhood, microhood, macrohood] }
      max_items: 1
      grid_size: 8
      sorting_keys:
        - { sort_key: 'min_zoom', reverse: False }
        - { sort_key: 'collision_rank', reverse: False }
        - { sort_key: 'population', reverse: True }
        - { sort_key: 'id', reverse: True }

@@ -3166,8 +3169,8 @@ def keep_n_features_gridded(ctx):
return None

minx, miny, maxx, maxy = ctx.unpadded_bounds
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By here do we know if we're in a 256 or 512 px tile and can adjust a grid_width or grid_height multiplier accordingly?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, there's no concept of pixel size here as far as I know. It's just processing a bounding box of vector data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wacky (future) idea: seems like you could even compute the features to keep in for multiple tile sizes and tag them as such. We could include this stuff in our layer output but then weed it out in tilequeue where we know what size we're cutting.

@iandees
Copy link
Member Author

iandees commented Dec 10, 2021

Ok, I have a tileserver running locally to compare like-for-like tiles. I asked tileserver for /512/all/x/y/z.mvt in all these cases:

coord with thinning without thinning
7/113/50 image image
8/227/100 image image
9/454/201 image image
10/909/403 image image
11/1819/806 image image

All MVTs: Archive.zip

I noticed that these "512/all/z/x/y.mvt" tiles are 4x bigger on the screen in QGIS than the same tiles requested from Tapalcatl with "512/all/z/x/y.mvt" (same geographic area, though), so I'm not really sure what's going on size-wise. At least we can compare before/after within tileserver.

Another thing to note is that the "poi" layer is quite busy with transit stations starting at z10 and then at z11 with other OSM-sourced POI. We might want to consider thinning those out later, too.

@nvkelso
Copy link
Member

nvkelso commented Dec 10, 2021

Another thing to note is that the "poi" layer is quite busy with transit stations starting at z10 and then at z11 with other OSM-sourced POI. We might want to consider thinning those out later, too.

Let's track that separately with #2033.

@nvkelso
Copy link
Member

nvkelso commented Dec 10, 2021

I noticed that these "512/all/z/x/y.mvt" tiles are 4x bigger on the screen in QGIS than the same tiles requested from Tapalcatl with "512/all/z/x/y.mvt" (same geographic area, though), so I'm not really sure what's going on size-wise. At least we can compare before/after within tileserver.

If you rename the files so they lead with the z-x-y component and then follow on with the description then they'll load in real world locations for you, btw.

Eg: 8-227-100-thinned-512.mvt

Copy link
Member

@nvkelso nvkelso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I confirmed the width numbers are working for these 512 px tiles in QGIS.

This might make problems for people still using 256 px tile config, so leave a note about that in the source code and call it a day?

@nvkelso
Copy link
Member

nvkelso commented Dec 10, 2021

@travisgrigsby please also take a look and leave a PR review.

self.assertEqual("test_shape_2", output_features[2][2])
self.assertEqual("test_shape_3", output_features[3][2])

def test_fail_on_non_integer_reverse_sort_key(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

great tests. Really illustrate what this does

Copy link
Contributor

@tgrigsby-sc tgrigsby-sc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is, for the level of interesting that this is accomplishing, some of the easiest to read code ever :D Nice work.

v = props.get(k['sort_key'])

if v is None:
values.append(v)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think I understand why you're adding none to the list but maybe I'll get it in a second

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh I see - it's the value of the property. If it's none, then that value goes into the list for sorting.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the idea is to include the None to sort on if it's there.

# Sort the features in each bucket and pick the top items to include in the output
for features_in_bucket in buckets.values():
sorted_features = sorted(features_in_bucket, key=sorting_values_for_feature)
new_features.extend(sorted_features[:max_items])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice use of pythonisms to make this super easy to read and understand

end_zoom: 13
items_matching: { kind: locality }
max_items: 1
grid_width: 12
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I take it all these values for grid width and start/end zoom were arrived at experimentally? Did you try anything like reducing the width but increasing max_items to get a more organic look?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, I did a bit of experimentation based on the suggestions Nathaniel made above and these values seemed to look best. At least in the really dense area around Tokyo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't try more than one item per bucket. That and wider-than-taller buckets (since labels are wider than taller) are two things we could try in the future I think.

@@ -3166,8 +3169,8 @@ def keep_n_features_gridded(ctx):
return None

minx, miny, maxx, maxy = ctx.unpadded_bounds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wacky (future) idea: seems like you could even compute the features to keep in for multiple tile sizes and tag them as such. We could include this stuff in our layer output but then weed it out in tilequeue where we know what size we're cutting.

@iandees
Copy link
Member Author

iandees commented Dec 10, 2021

This might make problems for people still using 256 px tile config, so leave a note about that in the source code and call it a day?

I added a comment to the new function in ab0d108.

@iandees
Copy link
Member Author

iandees commented Dec 10, 2021

The failing tests don't seem to be related to this PR, so I'm going to merge.

@iandees iandees merged commit 18bf48c into master Dec 10, 2021
@iandees iandees deleted the dees/keep_n_features_gridded branch December 10, 2021 20:57
@peitili peitili mentioned this pull request Dec 13, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants