Adding a function to pick important features by grid #2001

iandees · 2021-11-17T22:14:09Z

For #1999

This adds a new transform function that filters point features by placing them into a grid and then only including the most important features in each grid cell.

TODO:

Update tests
Update docs
How do we decide what is important in each grid cell?
How do we want to handle stuff that isn't a point and doesn't match the items_matching? Should they be removed entirely or passed along without modification?

nvkelso · 2021-11-18T00:03:28Z

How do we decide what is important in each grid cell?

This should be a config, but for locality points in the places layer it should be ORDER DESC min_zoom, collision_rank, population.

min_zoom captures the feature's kind_detail value rank ordered of OSM place=* tags.
collision_rank uses population_rank buckets and makes sure capitals (country and region) collide out other localities in the same population_rank bucket
population is the raw OSM population value, or backfilled estimate... so if features are tied in the first 2 sorts (a lot will), population is the final arbitrator.

How do we want to handle stuff that isn't a point and doesn't match the items_matching? Should they be removed entirely or passed along without modification?

I imagine this to limit it's work to subset of features based on property matching, like items_matching: { kind: locality }

I'm fine to limit it to point geometries.

If the filter conditions aren't met, then there the features aren't touched (meaning they stay in tiles passed along without modification).

iandees · 2021-11-19T00:15:17Z

Status on this:

The keep_n_features_gridded function is written and has unit tests to show that it culls dense points, keeping the top N places in each bucket.
- Features that don't match items_matching will pass through unchanged
- Features that are not points will pass through unchanged
- The top places are determined by sorting the value of the properties named in sorting_keys

Three major pieces remain:

The sorting is entirely "descending", but we need to support both ascending and descending. In other words, we want to sort by min_zoom ASC (min_zoom of 4.0 is better than min_zoom of 13.0) and then by population DESC (bigger cities are better than small cities). This is more complicated than my single call to sorted() allows.
- I was thinking that if the sorting_keys element has a - (minus symbol) in front of the name we could treat it as DESC otherwise default to ASC.
- There is a failing unit test (test_points_keep_1_multisort_minzoom) that covers this case.
I wasn't able to craft a change to the YAML that represents what Nathaniel's asking for yet.
I wasn't able to add integration tests using data from Overpass yet.

nvkelso · 2021-11-19T07:19:22Z

I wasn't able to add integration tests using data from Overpass yet.

Test data can be found in:

https://gist.github.com/nvkelso/ef38d80b11d0bfd009555524d97af4cf#file-tokyo-area-locality-text-fixture

Which has 375 populated points that cover the area of the 7/113/50 tile (in 512 px coordinates) gathered from Overpass and processed in QGIS, copied out as GeoJSON, and reformatted with some regex to fit the expected dsl format.

This is around 1/2 of the features I see in the actual tile itself, possibly because each feature is being duplicated in the vector tile? For example, I see two Tokyo's in the tile MVT each with a different ID but with otherwise same properties. Not sure where that is coming from, but the upside should be we remove the dup's with this grid function anyhow.

But we should probably also sort on ID at the end to make it predictable which one is kept.

Pattern for how to write the test in:

https://github.com/tilezen/vector-datasource/blob/master/integration-test/981-remove-unstyled-ne-places.py

nvkelso · 2021-11-19T07:26:52Z

I wasn't able to craft a change to the YAML that represents what Nathaniel's asking for yet.

Probably we'd add new logic to queries.yaml just above where we rank neighbourhoods to exercise the new function:

https://github.com/tilezen/vector-datasource/blob/master/queries.yaml#L974-L980

I imagine the syntax would be similar to what we do for keep_n_features:

https://github.com/tilezen/vector-datasource/blob/master/queries.yaml#L1415-L1422

This is around 1/2 of the features I see in the actual tile itself, possibly because each feature is being duplicated in the vector tile? For example, I see two Tokyo's in the tile MVT each with a different ID but with otherwise same properties. Not sure where that is coming from, but the upside should be we remove the dup's with this grid function anyhow.

There's also a function to remove duplicate features we might want to run before running the grid function? Though that is indiscriminate within a search radius and the gridding will likely take care of it for us anyhow.

https://github.com/tilezen/vector-datasource/blob/master/queries.yaml#L947-L953

nvkelso · 2021-11-21T22:46:25Z

The sorting is entirely "descending", but we need to support both ascending and descending. In other words, we want to sort by min_zoom ASC (min_zoom of 4.0 is better than min_zoom of 13.0) and then by population DESC (bigger cities are better than small cities). This is more complicated than my single call to sorted() allows.
I was thinking that if the sorting_keys element has a - (minus symbol) in front of the name we could treat it as DESC otherwise default to ASC.

The intercut and overlap functions already assume an order, and take a reverse boolean when it should be sorted the other direction (eg ASC versus DESC):

  - fn: vectordatasource.transform.intercut
    params:
      base_layer: roads
      cutting_layer: landuse
      attribute: kind
      target_attribute: landuse_kind
      cutting_attrs: { sort_key: 'sort_rank', reverse: True }

iandees · 2021-12-02T23:09:44Z

Ok, was able to make some progress on this today.

The keep_n_features_gridded function now supports reversible multi-sort for numeric property values. It will raise an exception if it encounters a value that is non-numeric when trying to reverse-sort it. The syntax for the sorting_keys config matches the cutting_attrs example in this comment
I added a configuration to queries.yaml that filters dense locality kinds in the places layer. I used values that seemed to make sense to me, but we can definitely change that.
I added an integration test using the data provided above (thanks!). It tests that at zoom 8 we filter out places and that Tokyo wins out over nearby, smaller features. I also check that the same nearby place is present on a more zoomed-in tile.

nvkelso · 2021-12-03T00:08:55Z

Can you post a new tile here, that matches the tile coords in #1999 (comment) please?

iandees · 2021-12-03T20:26:59Z

Here's an image based on the integration test data you gave me:

Red dots: Data from the integration test (sourced here) querying 8/227/100 before adding keep_n_features_gridded to queries.yaml.

Green dots: The same data and tile after adding keep_n_features_gridded to queries.yaml.

Basically, the red dots in the screenshot here are the ones that are thinned out. Green dots remain in the tile after calling keep_n_features_gridded.

iandees · 2021-12-03T21:09:05Z

Something is probably off on my bucket calculation here, because this is what it looks like when I use grid_size: 2:

There should be a maximum of 2 x 2 = 4 points here. Blue square is the boundary of the 8/227/100 tile.

iandees · 2021-12-06T17:59:16Z

Ah, it's correctly outputting 4 kind: locality features (based on the "items_matching" clause) and including 3 kind: region features that didn't match.

iandees · 2021-12-06T18:00:52Z

Based on this and some other fiddling, I think the code is doing the right thing, we just need to tweak the queries.yaml to match what we want the output to look like. Let me know what would make iterating on that easier, @nvkelso.

nvkelso · 2021-12-06T18:39:44Z

Good find on the region points being mixed in there with the locality points!

What we do for the other layers is a progressive generalization per zoom... so at zoom x it might be grid size of Y, like with 3 transform configs:

zoom 8 start, zoom 9 end, grid size 4
zoom 9 start, zoom 11 end, grid size 8
zoom 11 start, zoom 13 end, grid size 16

Adjusted to taste.

We might also want to play with different size multiples, like 3 instead of 4:

zoom 8 start, zoom 9 end, grid size 3
zoom 9 start, zoom 11 end, grid size 6
zoom 11 start, zoom 13 end, grid size 12

iandees · 2021-12-06T22:13:23Z

Here are two sets of 4 GeoJSONs in a ZIP zooming into Tokyo from 8/227/100 to 11/1819/806. One is for grid sizes multiples of 4, another for multiples of 3. My intuition is saying that the multiple of 4 lets too many through, and 3 is slightly better.

Note that they are still in epsg:3857, but at least they have the properties visible in QGIS now.

nvkelso · 2021-12-07T05:44:03Z

Reviewing the samples (there is confusion around 512 px tile sizes versus 256 here, beware!) , I propose:

don't forget zoom 7 for 256px logical for when OSM comes in!
zoom 8 start, zoom 9 end, grid size 4 (is this in 512 space?)
zoom 9 start, zoom 12 end, grid size 6
zoom 10 start, zoom 12 end, grid size 8
no zoom 12 config, because by that time it seems like the locality features are well spaced out
consider neighbourhood gridding at higher zooms, see below

For the examples, we start out with, and end up with:

zoom	512 coord	orig_places	new_places	orig_localities	new_localities
8	8-227-100	207	19	202	16
9	9-454-201	102	35	100	34
10	10-909-403	24	21	*23	21
11	11-1819-806	23	**9	9	9

Now what tile size does this config assume? I think it's targeting 512 pixel sized tiles (please confirm)... but it should be 1/2'd for 256 px tiles (since the dimensions are 1/2 in width and height)? But Tilezen is 256 logical pizel size, so should all those be offset by 1 and halved by default, and if the tilesize is 512 then the grid multiplied by 2?

When I review 512 sized zooms 8, 9, and 10 I see a mix of region and locality points. But when I review zoom 11, I only see locality points after the transform, all the neighbourhood points are being dropped? Or is that the way the tests were constructed?

I reviewed Tokyo "zoom 10" (512 logical zoom or zoom 11 at 256 px) being the last zoom seems fine. Same in Paris, London, New York, and other regions. But in Beijing there are dense locality points at 512 px zoom 11 so I think keep the config to logical zoom 11 (stop zoom 12).

* NOTE: This should be 24 but it's 23 instead. Rounding problems vis-a-vis the test data?
** NOTE: This should probably be be 23 with 13 neighbourhoods. Probably because the test data doesn't include neighbourhoods?

The 2nd note raises a question if we should also thin the neighbourhood placetype at zoom 11 (ahem 12).

We already add a tile_kind_rank but looks like we don't then drop features but we aren't keep_n_features where Bubble Wrap only uses the first 8. Alternatively the data should be cleaned up in Japan... reviewing USA and Europe this isn't a problem because the Who's On First data has been curated quite a bit more there. To guard against no curation (like in Japan) I can see adding:

Either keep_n_features (but really the max_items should vary a little by zoom):

  - fn: vectordatasource.transform.keep_n_features
    params:
      source_layer: places
      start_zoom: 12
      end_zoom: 15
      items_matching: { kind: [neighbourhood, microhood, macrohood] }
      max_items: 8

Or better yet grid thinning like (modify grid_size to taste):

  - fn: vectordatasource.transform.keep_n_features_gridded
    params:
      source_layer: places
      start_zoom: 12
      end_zoom: 15
      items_matching: { kind: [neighbourhood, microhood, macrohood] }
      max_items: 1
      grid_size: 8
      sorting_keys:
        - { sort_key: 'min_zoom', reverse: False }
        - { sort_key: 'collision_rank', reverse: False }
        - { sort_key: 'population', reverse: True }
        - { sort_key: 'id', reverse: True }

nvkelso · 2021-12-08T01:11:48Z

vectordatasource/transform.py

@@ -3166,8 +3169,8 @@ def keep_n_features_gridded(ctx):
        return None

    minx, miny, maxx, maxy = ctx.unpadded_bounds


By here do we know if we're in a 256 or 512 px tile and can adjust a grid_width or grid_height multiplier accordingly?

No, there's no concept of pixel size here as far as I know. It's just processing a bounding box of vector data.

wacky (future) idea: seems like you could even compute the features to keep in for multiple tile sizes and tag them as such. We could include this stuff in our layer output but then weed it out in tilequeue where we know what size we're cutting.

iandees · 2021-12-10T00:16:36Z

Ok, I have a tileserver running locally to compare like-for-like tiles. I asked tileserver for /512/all/x/y/z.mvt in all these cases:

coord	with thinning	without thinning
7/113/50
8/227/100
9/454/201
10/909/403
11/1819/806

All MVTs: Archive.zip

I noticed that these "512/all/z/x/y.mvt" tiles are 4x bigger on the screen in QGIS than the same tiles requested from Tapalcatl with "512/all/z/x/y.mvt" (same geographic area, though), so I'm not really sure what's going on size-wise. At least we can compare before/after within tileserver.

Another thing to note is that the "poi" layer is quite busy with transit stations starting at z10 and then at z11 with other OSM-sourced POI. We might want to consider thinning those out later, too.

nvkelso · 2021-12-10T00:47:13Z

Another thing to note is that the "poi" layer is quite busy with transit stations starting at z10 and then at z11 with other OSM-sourced POI. We might want to consider thinning those out later, too.

Let's track that separately with #2033.

nvkelso · 2021-12-10T01:01:26Z

I noticed that these "512/all/z/x/y.mvt" tiles are 4x bigger on the screen in QGIS than the same tiles requested from Tapalcatl with "512/all/z/x/y.mvt" (same geographic area, though), so I'm not really sure what's going on size-wise. At least we can compare before/after within tileserver.

If you rename the files so they lead with the z-x-y component and then follow on with the description then they'll load in real world locations for you, btw.

Eg: 8-227-100-thinned-512.mvt

nvkelso

I confirmed the width numbers are working for these 512 px tiles in QGIS.

This might make problems for people still using 256 px tile config, so leave a note about that in the source code and call it a day?

nvkelso · 2021-12-10T01:04:30Z

@travisgrigsby please also take a look and leave a PR review.

tgrigsby-sc · 2021-12-10T02:20:31Z

test/test_transform.py

+        self.assertEqual("test_shape_2", output_features[2][2])
+        self.assertEqual("test_shape_3", output_features[3][2])
+
+    def test_fail_on_non_integer_reverse_sort_key(self):


great tests. Really illustrate what this does

tgrigsby-sc

This is, for the level of interesting that this is accomplishing, some of the easiest to read code ever :D Nice work.

tgrigsby-sc · 2021-12-10T02:26:56Z

vectordatasource/transform.py

+            v = props.get(k['sort_key'])
+
+            if v is None:
+                values.append(v)


I don't think I understand why you're adding none to the list but maybe I'll get it in a second

oh I see - it's the value of the property. If it's none, then that value goes into the list for sorting.

Yep, the idea is to include the None to sort on if it's there.

tgrigsby-sc · 2021-12-10T03:04:59Z

vectordatasource/transform.py

+    # Sort the features in each bucket and pick the top items to include in the output
+    for features_in_bucket in buckets.values():
+        sorted_features = sorted(features_in_bucket, key=sorting_values_for_feature)
+        new_features.extend(sorted_features[:max_items])


nice use of pythonisms to make this super easy to read and understand

tgrigsby-sc · 2021-12-10T03:11:26Z

queries.yaml

+      end_zoom: 13
+      items_matching: { kind: locality }
+      max_items: 1
+      grid_width: 12


I take it all these values for grid width and start/end zoom were arrived at experimentally? Did you try anything like reducing the width but increasing max_items to get a more organic look?

Yep, I did a bit of experimentation based on the suggestions Nathaniel made above and these values seemed to look best. At least in the really dense area around Tokyo.

I didn't try more than one item per bucket. That and wider-than-taller buckets (since labels are wider than taller) are two things we could try in the future I think.

tgrigsby-sc · 2021-12-10T03:17:46Z

vectordatasource/transform.py

@@ -3166,8 +3169,8 @@ def keep_n_features_gridded(ctx):
        return None

    minx, miny, maxx, maxy = ctx.unpadded_bounds


wacky (future) idea: seems like you could even compute the features to keep in for multiple tile sizes and tag them as such. We could include this stuff in our layer output but then weed it out in tilequeue where we know what size we're cutting.

iandees · 2021-12-10T20:30:50Z

This might make problems for people still using 256 px tile config, so leave a note about that in the source code and call it a day?

I added a comment to the new function in ab0d108.

iandees · 2021-12-10T20:56:52Z

The failing tests don't seem to be related to this PR, so I'm going to merge.

iandees added 6 commits November 17, 2021 16:10

Adding a function to pick important features by grid

19f7184

Undo import optimization

ba251f3

Missed a newline

99a5944

Add a tiny bit of comment about the bucketing

88a232f

Rename integration test for collision

17b4184

Sort buckets and take the top instead of bucketing to a single feature

d6d3980

iandees added 3 commits November 18, 2021 09:56

Pass through non-point and features that don't match items_matching

9635901

Add some unit tests

49a13cd

Add a couple more unit tests

2a05f60

iandees added 3 commits December 1, 2021 17:14

Trying integration test

f86f5be

Support multi-sort and reversing

58a831b

Make integration test work

44a3d80

iandees added 3 commits December 2, 2021 17:11

Merge branch 'master' into dees/keep_n_features_gridded

514a549

Fix trailing whitespace

a4291c7

Resolve pre-commit yaml check

23786bf

Add support for grid_height/grid_width as separate values

d8b7dbe

nvkelso reviewed Dec 8, 2021

View reviewed changes

nvkelso mentioned this pull request Dec 10, 2021

Thin stations from pois layer using new grid function #2033

Open

nvkelso reviewed Dec 10, 2021

View reviewed changes

tgrigsby-sc reviewed Dec 10, 2021

View reviewed changes

tgrigsby-sc approved these changes Dec 10, 2021

View reviewed changes

iandees mentioned this pull request Dec 10, 2021

Switch to pulling way_area from tags hstore instead of assuming it to be a column #2034

Open

Add a note about 256 vs 512 sizing

ab0d108

iandees merged commit 18bf48c into master Dec 10, 2021

iandees deleted the dees/keep_n_features_gridded branch December 10, 2021 20:57

peitili mentioned this pull request Dec 13, 2021

fix linter on master #2036

Merged

nvkelso mentioned this pull request Dec 15, 2021

Cull features from places layer based on grid #1999

Closed

iandees mentioned this pull request Jan 3, 2022

Make locality thinning grids smaller #2040

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Adding a function to pick important features by grid #2001

Adding a function to pick important features by grid #2001

iandees commented Nov 17, 2021

nvkelso commented Nov 18, 2021

iandees commented Nov 19, 2021

nvkelso commented Nov 19, 2021 •

edited

Loading

nvkelso commented Nov 19, 2021 •

edited

Loading

nvkelso commented Nov 21, 2021

iandees commented Dec 2, 2021

nvkelso commented Dec 3, 2021

iandees commented Dec 3, 2021 •

edited

Loading

iandees commented Dec 3, 2021 •

edited

Loading

iandees commented Dec 6, 2021

iandees commented Dec 6, 2021

nvkelso commented Dec 6, 2021 •

edited

Loading

iandees commented Dec 6, 2021

nvkelso commented Dec 7, 2021 •

edited

Loading

nvkelso Dec 8, 2021

iandees Dec 8, 2021

tgrigsby-sc Dec 10, 2021

iandees commented Dec 10, 2021

nvkelso commented Dec 10, 2021

nvkelso commented Dec 10, 2021

nvkelso left a comment

nvkelso commented Dec 10, 2021

tgrigsby-sc Dec 10, 2021

tgrigsby-sc left a comment

tgrigsby-sc Dec 10, 2021

tgrigsby-sc Dec 10, 2021

iandees Dec 10, 2021

tgrigsby-sc Dec 10, 2021

tgrigsby-sc Dec 10, 2021

iandees Dec 10, 2021

iandees Dec 10, 2021

tgrigsby-sc Dec 10, 2021

iandees commented Dec 10, 2021

iandees commented Dec 10, 2021

		@@ -3166,8 +3169,8 @@ def keep_n_features_gridded(ctx):
		return None

		minx, miny, maxx, maxy = ctx.unpadded_bounds

Adding a function to pick important features by grid #2001

Adding a function to pick important features by grid #2001

Conversation

iandees commented Nov 17, 2021

nvkelso commented Nov 18, 2021

iandees commented Nov 19, 2021

nvkelso commented Nov 19, 2021 • edited Loading

nvkelso commented Nov 19, 2021 • edited Loading

nvkelso commented Nov 21, 2021

iandees commented Dec 2, 2021

nvkelso commented Dec 3, 2021

iandees commented Dec 3, 2021 • edited Loading

iandees commented Dec 3, 2021 • edited Loading

iandees commented Dec 6, 2021

iandees commented Dec 6, 2021

nvkelso commented Dec 6, 2021 • edited Loading

iandees commented Dec 6, 2021

nvkelso commented Dec 7, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iandees commented Dec 10, 2021

nvkelso commented Dec 10, 2021

nvkelso commented Dec 10, 2021

nvkelso left a comment

Choose a reason for hiding this comment

nvkelso commented Dec 10, 2021

Choose a reason for hiding this comment

tgrigsby-sc left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

iandees commented Dec 10, 2021

iandees commented Dec 10, 2021

nvkelso commented Nov 19, 2021 •

edited

Loading

nvkelso commented Nov 19, 2021 •

edited

Loading

iandees commented Dec 3, 2021 •

edited

Loading

iandees commented Dec 3, 2021 •

edited

Loading

nvkelso commented Dec 6, 2021 •

edited

Loading

nvkelso commented Dec 7, 2021 •

edited

Loading