Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rnet_breakup_vertices is way too slow for real network data #416

Closed
agila5 opened this issue Jul 8, 2020 · 42 comments
Closed

rnet_breakup_vertices is way too slow for real network data #416

agila5 opened this issue Jul 8, 2020 · 42 comments
Milestone

Comments

@agila5
Copy link
Collaborator

agila5 commented Jul 8, 2020

I think that the solution for the complex example showed in #412 and luukvdmeer/sfnetworks#67 is using the function stplanr::rnet_breakup_vertices before creating the sfnetwork object. The structural_properties c:4597 :Couldn't reach some vertices error is the same error as per the "roundabout" example and the underlying reason is that the algorithm behind sfnetworks and stplanr could fail with OSM data (@Robinlovelace Section 4 of the RJ paper).

Unfortunately, at the moment, the ideas are good but the code is way too slow and I need to rewrite it as soon as possible.

@agila5
Copy link
Collaborator Author

agila5 commented Aug 28, 2020

A small update here.

# packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(sfnetworks)
library(mapview)

# data
download.file("https://github.com/ropensci/stplanr/releases/download/0.6.1/Example.zip", "Example.zip")
unzip("Example.zip")
roads = st_read("Example/Complete Roads/complete_roads.shp", quiet = TRUE, stringsAsFactors = FALSE) %>% 
  st_cast("LINESTRING")
#> Warning in st_cast.sf(., "LINESTRING"): repeating attributes for all sub-
#> geometries for which they may not be constant
roads_trf = st_transform(roads, 4674)

# POINTS
from = c(-49.95058, -24.77502)
to = c(-49.91084, -24.75200)
points <- st_sfc(st_point(from), st_point(to), crs = 4674)
bbox_points = st_as_sfc(st_bbox(points))
mapview(roads_trf[bbox_points, ], zcol = "osm_id") + points
#> although coordinates are longitude/latitude, st_intersects assumes that they are planar

There is clearly a path between the two points but the problem is that if we build the road network starting from the boundary points of each LINESTRING object (as in stplanr or in sfnetworks), then there is no connection between yellow, green and blue roads (since they don't share any point in the boundaries).

AFAIK, if we are working with OSM data (or similar structures) and we are creating road network objects considering only the boundary points for each LINESTRING (and this is the sfnetwork approach atm), then we should "modify the input data in two cases:
a) There exists at least one POINT which is shared by two or more LINESTRINGs and it is always an "internal" point. See, for example, the bicycleway example in the RJ paper.
b) There exists at least one POINT which is a boundary point for one linestring and an "internal" point for another linestring. See, for example, the roundabout example in the RJ paper

The following code is taken from stplanr::rnet_breakup_vertices and sfnetworks:::get_boundary_points

# Extract the coordinates for each POINT
roads_coordinates <- sf::st_coordinates(roads_trf)

# Select the boundary POINTS
L1_index = ncol(roads_coordinates)
first_pair = !duplicated(roads_coordinates[, L1_index])
last_pair = !duplicated(roads_coordinates[, L1_index], fromLast = TRUE)
idxs = first_pair | last_pair; rm(first_pair); rm(last_pair)
internal_points = roads_coordinates[!idxs, ]
duplicated_internal_points = internal_points[duplicated(internal_points[, c(1, 2)]), ]
duplicated_internal_points_sfc = st_cast(
  st_sfc(st_multipoint(duplicated_internal_points[, -L1_index]), crs = st_crs(roads_trf)), 
  "POINT"
)

Now I would like to apply lwgeom::st_split dividing roads_trf according to the points in duplicated_internal_points_sfc, but the problem is that this approach takes forever for medium/large networks. I think that there should be a better approach since lwgeom::st_split is probably too general for this use-case. Ideas?

@Robinlovelace
Copy link
Member

Breaking the linestrings down into matrices with the sfheaders and re-building them is an option to tackle scenario a) above. For scenario b) I think v.clean from GRASS, as mentioned in the sDNA docs, may be the best solution I know of. There are likely other solutions.

I just discovered a paper on this topic: https://doi.org/10.1038/sdata.2016.46

This makes me think: it's worth doing a deep dive into this and perhaps a package/paper/other documenting an open, accessible, efficient and reproducible solutions to this important and seemingly common problem.

@Robinlovelace
Copy link
Member

It sounds like the break tool in v.clean would solve problem b), right?:

The break tool breaks lines/boundaries at intersections and it also breaks lines/boundaries forming a collapsed loop. For example, 0.0;1.0;0.0 is broken at 1.0.

@agila5
Copy link
Collaborator Author

agila5 commented Aug 29, 2020

Hi! I think that when we discussed break tool from v.clean in Leeds we found that it was creating artificial nodes at the intersection points between two or more roads at different levels (like overpasses). For example, if you look at Figure 1 in the paper you linked (btw: thank you very much, it seems very useful), I think that v.clean would create a new node at the "not an intersection" point. Moreover, I spent 1h trying to run rgrass7 on windows with no success... Do you know if it's possible to specify which points should be used for breaking the network? I checked the docs, and it looks like it's an automatic procedure splitting at all intersection points (up to a threshold).

I like the idea of Rcpp code on the coordinates matrix + sfheaders that you mentioned. I hope I will test it in the next days.

@Robinlovelace
Copy link
Member

Here's another reprex, not progress but the source code overline() could be of use:

library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
library(stplanr)
plot(st_geometry(rnet_roundabout), lwd = 2, col = rainbow(nrow(rnet_roundabout)))
boundary_points <- st_geometry(line2points(rnet_roundabout))
points_cols <- rep(rainbow(nrow(rnet_roundabout)), each = 2)
plot(boundary_points, pch = 16, add = TRUE, col = points_cols)

# Clean the roundabout example.
rnet_roundabout_clean <- rnet_breakup_vertices(rnet_roundabout)
#> Splitting rnet object at the shared boundary points.
plot(st_geometry(rnet_roundabout_clean), lwd = 2, col = rainbow(nrow(rnet_roundabout_clean)))

rnet_roundabout$variable = 1:nrow(rnet_roundabout)
system.time(rnet_roundabout_clean <- rnet_breakup_vertices(rnet_roundabout))
#> Splitting rnet object at the shared boundary points.
#>    user  system elapsed 
#>   0.012   0.000   0.013
system.time(rnet_overline <- overline(rnet_roundabout, attrib = "variable"))
#>    user  system elapsed 
#>   0.054   0.000   0.054
plot(st_geometry(rnet_overline), lwd = 2, col = rainbow(nrow(rnet_overline)))

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
rnet_agg = rnet_roundabout_clean %>%
  group_by()

sl <- routes_fast_sf
sl$All <- flowlines$All
rnet <- overline(sl = sl, attrib = "All")
plot(rnet, lwd = 3)

sl_broken = rnet_breakup_vertices(sl)
#> Splitting rnet object at the shared internal points.
sl_agg = sl_broken %>%
  group_by(sf::st_as_text(geometry)) %>%
  summarise(All = sum(All))
#> `summarise()` ungrouping output (override with `.groups` argument)
plot(sl_agg["All"], lwd = 3)

mapview::mapview(sl_agg["All"], lwd = 3)

sl_broken = rnet_breakup_vertices(sl, breakup_internal_vertex_matches = FALSE)

sl_agg = sl_broken %>%
  group_by(sf::st_as_text(geometry)) %>%
  summarise(All = sum(All))
#> `summarise()` ungrouping output (override with `.groups` argument)
plot(sl_agg["All"], lwd = 3)

mapview::mapview(sl_agg["All"], lwd = 3)

system.time({
  sl_broken = rnet_breakup_vertices(sl, breakup_internal_vertex_matches = FALSE)

  sl_agg = sl_broken %>%
    group_by(sf::st_as_text(geometry)) %>%
    summarise(All = sum(All))
})
#> `summarise()` ungrouping output (override with `.groups` argument)
#>    user  system elapsed 
#>   0.276   0.008   0.290

system.time(rnet <- overline(sl = sl, attrib = "All"))
#>    user  system elapsed 
#>   0.093   0.004   0.100

Created on 2020-11-05 by the reprex package (v0.3.0)

@agila5
Copy link
Collaborator Author

agila5 commented Nov 13, 2020

Hi! I worked a little bit on this problem today, and I defined a new implementation of rnet_breakup_vertices that it's much much faster than current approach (I will add more details later if you want). The only (small) drawback is that it uses data.table and sfheaders functions and, at the moment, those packages are not "Imported" (data.table is just "Suggested"). Would you mind adding these new dependencies?

@agila5
Copy link
Collaborator Author

agila5 commented Nov 15, 2020

The following (super long) reprex presents the new approach. It's not perfect (since the first and last parts of that code could be translated to Rcpp), but that's good enough for every OSM data I will ever analyse. The next steps (after some reviews) are 1) more testings and 2) thinking about Z and M dimensions.

# packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(stplanr)
library(osmextract)
library(data.table)
library(sfheaders)
library(sfnetworks)
library(tmap)

First I need to define new auxiliary functions (that won't be exported). This is just a faster (and simpler) version of line2points, inspired by sfnetworks:::linestring_boundary_points

get_boundary_points <- function(rnet) {
  # 1a. extract coordinates
  coordinates <- sf::st_coordinates(rnet)
  
  # 1b. Find index of L1 column
  L1_index <- ncol(coordinates)
  
  # 1c. Remove colnames
  coordinates <- unname(coordinates)
  
  # 2. Find idxs of first and last coordinate (i.e. the boundary points)
  first_pair <- !duplicated(coordinates[, L1_index])
  last_pair <- !duplicated(coordinates[, L1_index], fromLast = TRUE)
  idxs <- first_pair | last_pair
  
  # 3. Extract idxs and rebuild sfc
  pairs <- coordinates[idxs, ]
  boundary_points <- sf::st_cast(
    sf::st_sfc(
      sf::st_multipoint(pairs[, -L1_index]),
      crs = sf::st_crs(rnet)
    ),
    "POINT"
  )
  boundary_points
}

The following function is a faster version of lwgeom::st_split which works only using an sf object with LINESTRING geometry and a MULTIPOINT sfc

my_st_split <- function(rnet, points) {
  # Extract coordinates from rnet and points
  # rnet must be an sf data.frame with LINESTRING geometry (no ZM) while points must be an
  # sfc with MULTIPOINT geometry
  rnet_coordinates <- data.table(st_coordinates(rnet))
  points_coordinates <- data.table(st_coordinates(points))
  
  # Check Z/M dimensions
  if (ncol(rnet_coordinates) > 3L) {
    warning("The Z/M dimensions will be lost.", call. = FALSE, immediate. = TRUE)
  }
  
  # I cannot split a LINESTRING at a boundary point (since there is nothing "at
  # the other side" of the boundary point), so I need to find ID of all shared
  # internal points
  id_not_boundary <- rnet_coordinates[duplicated(L1) & duplicated(L1, fromLast = TRUE), which = TRUE]
  id_shared <- rnet_coordinates[points_coordinates, on = c("X", "Y"), which = TRUE]
  id <- intersect(id_not_boundary, id_shared)
  
  # Now I need to duplicate the coordinates of the shared internal points (since
  # they will become boundary points for the new LINESTRING(s)). I use sort
  # since I want to preserve the order within the points.
  rnet_coordinates <- rnet_coordinates[sort(c(1:.N, id))]
  
  # I will build the new sfc LINESTRING using sfheaders::sfc_linestring so I
  # need to build an ID for each new LINESTRING. The ID for the new
  # LINESTRING(s) is created starting from the L1 column in rnet_coordinates and
  # incrementing the id by 1 at each break point.
  break_id <- double(nrow(rnet_coordinates))
  break_id[id + 1:length(id)] <- 1L # The ID is shifted by 1 at each break point
  break_id <- cumsum(break_id)
  rnet_coordinates$linestring_id <- rnet_coordinates$L1 + break_id
  
  # Now I can create the new LINESTRING sfc
  new_linestring <- sfc_linestring(
    obj = rnet_coordinates,
    x = "X",
    y = "Y",
    linestring_id = "linestring_id"
  )
  
  # Exclude Z and/or M dimension
  attr(new_linestring, "z_range") <- NULL
  attr(new_linestring, "m_range") <- NULL
  
  # Determine the ID(s) of the old LINESTRING(s) (i.e. I need to create an sf
  # object with the new geometry column and the old fields)
  old_rnet_id <- rnet_coordinates[, list(L1 = unique(L1)), by = linestring_id]$L1
  
  # Create the new object:
  st_sf(
    st_drop_geometry(rnet)[old_rnet_id, , drop = FALSE],
    geometry = new_linestring,
    crs = st_crs(rnet),
    agr = st_agr(rnet),
    precision = st_precision(rnet)
  )
}

Define the new breakup function (the one that will be exported)

new_breakup <- function(rnet, verbose = FALSE) {
  # A few safety checks
  if (!inherits(rnet, "sf")) {
    stop("At the moment this function was tested only for sf objects.", call. = FALSE)
  }
  if (!all(st_is(rnet, "LINESTRING"))) {
    stop("The input sf object must have a LINESTRING geometry.")
  }
  if (!is.null(st_z_range(rnet)) || !is.null(st_m_range(rnet))) {
    warning("The Z/M dimensions will be lost.", call. = FALSE, immediate. = TRUE)
  }
  # Check if we need to rebuild tbl_df at the end
  rebuild_tbl <- FALSE
  if (inherits(rnet, "tbl_df")) {
    rebuild_tbl <- TRUE
  }
  
  # 1 - Find points that are both boundary points and internal points
  internal_points <- setkey(data.table(st_coordinates(line2vertices(rnet))))
  internal_points <- unique(internal_points) # I don't need duplicated pairs of coordinates
  if (verbose) {
    message("Extracted the internal points")
  }
  boundary_points <- setkey(data.table(st_coordinates(get_boundary_points(rnet))))
  boundary_points <- unique(boundary_points)
  if (verbose) {
    message("Extracted the boundary points")
  }
  
  shared_internal_points_data_table <- boundary_points[
    na.omit(boundary_points[internal_points, which = TRUE]),
  ]
  
  if (nrow(shared_internal_points_data_table) > 0) {
    shared_internal_points <- st_sfc(st_multipoint(
      x = as.matrix(shared_internal_points_data_table)
    ), crs = st_crs(rnet))
    
    # 2 - Split at shared internal points
    if (length(shared_internal_points) > 0) {
      message("Splitting the input object at the shared internal points.")
      rnet <- my_st_split(rnet, shared_internal_points)
    }
  }
  
  # 3 - Find duplicated internal points
  internal_points <- st_geometry(line2vertices(rnet))
  duplicated_internal_points <- internal_points[duplicated(internal_points)]
  
  if (verbose) {
    message("Extracted the duplicated internal points")
  }
  
  # 4 - Split at duplicated internal points
  if (length(duplicated_internal_points) > 0) {
    message("Splitting rnet object at the duplicated internal points.")
    rnet <- my_st_split(rnet, duplicated_internal_points)
  }
  
  # 5 - Maybe we need to rebuild the tbl_df structure
  if (rebuild_tbl) {
    rnet <- st_as_sf(dplyr::as_tibble(rnet))
  }
  
  rnet
}

Test roundabout example:

new_rnet_rounabout <- new_breakup(rnet_roundabout)
#> Splitting the input object at the shared internal points.

# plot
set.seed(1)
par(mar = rep(0, 4))
plot(st_geometry(new_rnet_rounabout), col = sample(sf.colors(17, categorical = TRUE)), lwd = 3)


Test overpass example

new_rnet_overpass <- new_breakup(rnet_overpass)
#> Splitting the input object at the shared internal points.
plot(st_geometry(new_rnet_overpass), col = sf.colors(9, categorical = TRUE), lwd = 3)


Test cycleway intersection

new_rnet_intersection <- new_breakup(rnet_cycleway_intersection)
#> Splitting rnet object at the duplicated internal points.
plot(st_geometry(new_rnet_intersection), col = sf.colors(4, categorical = TRUE), lwd = 3) 


Unfortunately it's quite difficult to compare the old and new results since the sfg are built in different orders. For example,

old_roundabout <- rnet_breakup_vertices(rnet_roundabout)
#> Splitting rnet object at the shared boundary points.
all.equal(old_roundabout, new_rnet_rounabout)
#>  [1] "Component \"geometry\": Component 1: Attributes: < Component 2: Mean relative difference: 0.3333333 >"
#>  [2] "Component \"geometry\": Component 1: Numeric: lengths (6, 4) differ"                                  
#>  [3] "Component \"geometry\": Component 2: Attributes: < Component 2: Mean relative difference: 1 >"        
#>  [4] "Component \"geometry\": Component 2: Numeric: lengths (4, 8) differ"                                  
#>  [5] "Component \"geometry\": Component 3: Mean relative difference: 5.901751e-06"                          
#>  [6] "Component \"geometry\": Component 4: Attributes: < Component 2: Mean relative difference: 0.25 >"     
#>  [7] "Component \"geometry\": Component 4: Numeric: lengths (8, 10) differ"                                 
#>  [8] "Component \"geometry\": Component 5: Attributes: < Component 2: Mean relative difference: 0.2 >"      
#>  [9] "Component \"geometry\": Component 5: Numeric: lengths (10, 8) differ"                                 
#> [10] "Component \"geometry\": Component 6: Attributes: < Component 2: Mean relative difference: 0.5 >"      
#> [11] "Component \"geometry\": Component 6: Numeric: lengths (8, 4) differ"                                  
#> [12] "Component \"geometry\": Component 7: Attributes: < Component 2: Mean relative difference: 0.5 >"      
#> [13] "Component \"geometry\": Component 7: Numeric: lengths (4, 6) differ"                                  
#> [14] "Component \"geometry\": Component 8: Mean relative difference: 5.782618e-06"                          
#> [15] "Component \"geometry\": Component 9: Mean relative difference: 4.591716e-06"

but

unlist(st_equals(old_roundabout, new_rnet_rounabout))
#>  [1]  9  1  2  3  4  5  6  7  8 10 11 12 13 14 15 16 17

which means that they are the same objects but with different orders. If I consider a more realistic example:

rnet <- oe_get(
  place = "Isle of Wight",
  quiet = FALSE,
  query = "
  SELECT osm_id, highway, z_order, other_tags, geometry
  FROM 'lines'
  WHERE highway IN (
   'primary', 'primary_link',
   'secondary', 'secondary_link',
   'tertiary', 'tertiary_link',
   'unclassified', 'residential', 'service'
  )
  "
)
#> The input place was matched with: Isle of Wight
#> The chosen file was already detected in the download directory. Skip downloading.
#> The corresponding gpkg file was already detected. Skip vectortranslate operations.
#> Reading layer `lines' from data source `C:\Users\Utente\Documents\osm_data\geofabrik_isle-of-wight-latest.gpkg' using driver `GPKG'
#> Simple feature collection with 11678 features and 4 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -1.583422 ymin: 50.57568 xmax: -1.070221 ymax: 50.76731
#> geographic CRS: WGS 84

then the results are extremely better using the new approach

system.time(rnet_old <- rnet_breakup_vertices(rnet))
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#>    user  system elapsed 
#>   24.11    0.25   24.95
system.time(rnet_new <- new_breakup(rnet))
#> Splitting the input object at the shared internal points.
#> Splitting rnet object at the duplicated internal points.
#>    user  system elapsed 
#>    3.40    0.01    3.47

The two results are not exactly the same since the new approach fixes a bug:

rnet_small <- rnet[789, ]
rnet_old_temp <- rnet_breakup_vertices(rnet_small)
#> Splitting rnet object at the shared boundary points.
plot(st_geometry(rnet_old_temp), col = sf.colors(nrow(rnet_old_temp)), lwd = 2)

rnet_new_temp <- new_breakup(rnet_small)
#> Splitting the input object at the shared internal points.
plot(st_geometry(rnet_new_temp), col = sf.colors(nrow(rnet_new_temp)), lwd = 2)

Now I can also fix an old bug in the interaction between stplanr and sfnetworks (see luukvdmeer/sfnetworks#67 (comment))

download.file("https://github.com/ropensci/stplanr/releases/download/0.6.1/Example.zip", "Example.zip")
unzip("Example.zip")

# sfnetworks approach ---------------------------------------
roads = st_read("Example/Complete Roads/complete_roads.shp", quiet = TRUE) %>% 
  st_cast("LINESTRING")
#> Warning in st_cast.sf(., "LINESTRING"): repeating attributes for all sub-
#> geometries for which they may not be constant
points = st_read("Example/Points/Points.shp", quiet = TRUE)
roads_trf = st_transform(roads, st_crs(points))

Now I need to "pre-process" roads_trf using new_breakup(). Please notice that

nrow(roads_trf)
#> [1] 305608

and that the old approach takes more than 2h

system.time(roads_trf <- new_breakup(roads_trf, verbose = TRUE))
#> Extracted the internal points
#> Extracted the boundary points
#> Splitting the input object at the shared internal points.
#> Extracted the duplicated internal points
#> Splitting rnet object at the duplicated internal points.
#>    user  system elapsed 
#>  111.30    3.54  115.26

Moreover, please notice that now

nrow(roads_trf)
#> [1] 813502

# calculate the sfnetwork object
system.time({
  net <- as_sfnetwork(roads_trf, directed = FALSE) %>%
    activate("edges") %>%
    dplyr::mutate(weight = edge_length())
})
#>    user  system elapsed 
#>  107.89    3.12  111.47

# routing
from = c(-49.95058, -24.77502)
to = c(-49.91084, -24.75200)
p1 = sf::st_as_sf(data.frame(x = from[1], y = from[2]), coords = c("x", "y"), crs = sf::st_crs(net))
p2 = sf::st_as_sf(data.frame(x = to[1], y = to[2]), coords = c("x", "y"), crs = sf::st_crs(net))

and finally

r <- tidygraph::convert(net, to_spatial_shortest_paths, p1, p2)
#> although coordinates are longitude/latitude, st_nearest_points assumes that they are planar
#> although coordinates are longitude/latitude, st_nearest_points assumes that they are planar

# plot
tm_shape(r %>% activate("edges") %>% st_as_sf(), ext = 1.25) + 
  tm_lines() + 
tm_shape(p1) + 
  tm_dots(col = "red", size = 1) + 
tm_shape(p2) + 
  tm_dots(col = "blue", size = 1)

Created on 2020-11-15 by the reprex package (v0.3.0)

Session info
devtools::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.6.3 (2020-02-29)
#>  os       Windows 10 x64              
#>  system   x86_64, mingw32             
#>  ui       RTerm                       
#>  language (EN)                        
#>  collate  Italian_Italy.1252          
#>  ctype    Italian_Italy.1252          
#>  tz       Europe/Berlin               
#>  date     2020-11-15                  
#> 
#> - Packages -------------------------------------------------------------------
#>  package      * version    date       lib source                              
#>  abind          1.4-5      2016-07-21 [1] CRAN (R 3.6.0)                      
#>  assertthat     0.2.1      2019-03-21 [1] CRAN (R 3.6.0)                      
#>  backports      1.2.0      2020-11-02 [1] CRAN (R 3.6.3)                      
#>  base64enc      0.1-3      2015-07-28 [1] CRAN (R 3.6.0)                      
#>  callr          3.5.1      2020-10-13 [1] CRAN (R 3.6.3)                      
#>  class          7.3-17     2020-04-26 [1] CRAN (R 3.6.3)                      
#>  classInt       0.4-3      2020-04-07 [1] CRAN (R 3.6.3)                      
#>  cli            2.1.0      2020-10-12 [1] CRAN (R 3.6.3)                      
#>  codetools      0.2-16     2018-12-24 [2] CRAN (R 3.6.3)                      
#>  crayon         1.3.4      2017-09-16 [1] CRAN (R 3.6.0)                      
#>  crosstalk      1.1.0.1    2020-03-13 [1] CRAN (R 3.6.3)                      
#>  curl           4.3        2019-12-02 [1] CRAN (R 3.6.1)                      
#>  data.table   * 1.13.2     2020-10-19 [1] CRAN (R 3.6.3)                      
#>  DBI            1.1.0      2019-12-15 [1] CRAN (R 3.6.3)                      
#>  desc           1.2.0      2018-05-01 [1] CRAN (R 3.6.0)                      
#>  devtools       2.3.2      2020-09-18 [1] CRAN (R 3.6.3)                      
#>  dichromat      2.0-0      2013-01-24 [1] CRAN (R 3.6.0)                      
#>  digest         0.6.27     2020-10-24 [1] CRAN (R 3.6.3)                      
#>  dplyr          1.0.2      2020-08-18 [1] CRAN (R 3.6.3)                      
#>  e1071          1.7-4      2020-10-14 [1] CRAN (R 3.6.3)                      
#>  ellipsis       0.3.1      2020-05-15 [1] CRAN (R 3.6.3)                      
#>  evaluate       0.14       2019-05-28 [1] CRAN (R 3.6.0)                      
#>  fansi          0.4.1      2020-01-08 [1] CRAN (R 3.6.2)                      
#>  foreign        0.8-75     2020-01-20 [2] CRAN (R 3.6.3)                      
#>  fs             1.5.0      2020-07-31 [1] CRAN (R 3.6.3)                      
#>  generics       0.1.0      2020-10-31 [1] CRAN (R 3.6.3)                      
#>  geosphere      1.5-10     2019-05-26 [1] CRAN (R 3.6.1)                      
#>  glue           1.4.2      2020-08-27 [1] CRAN (R 3.6.3)                      
#>  highr          0.8        2019-03-20 [1] CRAN (R 3.6.0)                      
#>  htmltools      0.5.0      2020-06-16 [1] CRAN (R 3.6.3)                      
#>  htmlwidgets    1.5.2      2020-10-03 [1] CRAN (R 3.6.3)                      
#>  httr           1.4.2      2020-07-20 [1] CRAN (R 3.6.3)                      
#>  igraph         1.2.6      2020-10-06 [1] CRAN (R 3.6.3)                      
#>  KernSmooth     2.23-18    2020-10-29 [1] CRAN (R 3.6.3)                      
#>  knitr          1.30       2020-09-22 [1] CRAN (R 3.6.3)                      
#>  lattice        0.20-41    2020-04-02 [1] CRAN (R 3.6.3)                      
#>  leafem         0.1.3      2020-07-26 [1] CRAN (R 3.6.3)                      
#>  leaflet        2.0.3      2019-11-16 [1] CRAN (R 3.6.3)                      
#>  leafsync       0.1.0      2019-03-05 [1] CRAN (R 3.6.1)                      
#>  lifecycle      0.2.0      2020-03-06 [1] CRAN (R 3.6.2)                      
#>  lwgeom         0.2-5      2020-06-12 [1] CRAN (R 3.6.3)                      
#>  magrittr       1.5        2014-11-22 [1] CRAN (R 3.6.3)                      
#>  maptools       1.0-2      2020-08-24 [1] CRAN (R 3.6.3)                      
#>  memoise        1.1.0      2017-04-21 [1] CRAN (R 3.6.0)                      
#>  mime           0.9        2020-02-04 [1] CRAN (R 3.6.2)                      
#>  osmextract   * 0.1.0      2020-11-12 [1] local                               
#>  pillar         1.4.6      2020-07-10 [1] CRAN (R 3.6.3)                      
#>  pkgbuild       1.1.0      2020-07-13 [1] CRAN (R 3.6.3)                      
#>  pkgconfig      2.0.3      2019-09-22 [1] CRAN (R 3.6.1)                      
#>  pkgload        1.1.0      2020-05-29 [1] CRAN (R 3.6.3)                      
#>  png            0.1-7      2013-12-03 [1] CRAN (R 3.6.0)                      
#>  prettyunits    1.1.1      2020-01-24 [1] CRAN (R 3.6.2)                      
#>  processx       3.4.4      2020-09-03 [1] CRAN (R 3.6.3)                      
#>  ps             1.4.0      2020-10-07 [1] CRAN (R 3.6.3)                      
#>  purrr          0.3.4      2020-04-17 [1] CRAN (R 3.6.3)                      
#>  R6             2.5.0      2020-10-28 [1] CRAN (R 3.6.3)                      
#>  raster         3.3-13     2020-07-17 [1] CRAN (R 3.6.3)                      
#>  RColorBrewer   1.1-2      2014-12-07 [1] CRAN (R 3.6.0)                      
#>  Rcpp           1.0.5      2020-07-06 [1] CRAN (R 3.6.3)                      
#>  remotes        2.2.0      2020-07-21 [1] CRAN (R 3.6.3)                      
#>  rgeos          0.5-5      2020-09-07 [1] CRAN (R 3.6.3)                      
#>  rlang          0.4.8      2020-10-08 [1] CRAN (R 3.6.3)                      
#>  rmarkdown      2.5        2020-10-21 [1] CRAN (R 3.6.3)                      
#>  rprojroot      1.3-2      2018-01-03 [1] CRAN (R 3.6.0)                      
#>  sessioninfo    1.1.1      2018-11-05 [1] CRAN (R 3.6.0)                      
#>  sf           * 0.9-6      2020-09-13 [1] CRAN (R 3.6.3)                      
#>  sfheaders    * 0.3.0      2020-10-31 [1] CRAN (R 3.6.3)                      
#>  sfnetworks   * 0.3.1      2020-10-30 [1] local                               
#>  sp             1.4-4      2020-10-07 [1] CRAN (R 3.6.3)                      
#>  stars          0.4-4      2020-08-19 [1] Github (r-spatial/stars@b7b54c8)    
#>  stplanr      * 0.8.0.9000 2020-11-10 [1] local                               
#>  stringi        1.4.6      2020-02-17 [1] CRAN (R 3.6.2)                      
#>  stringr        1.4.0      2019-02-10 [1] CRAN (R 3.6.0)                      
#>  testthat       3.0.0      2020-10-31 [1] CRAN (R 3.6.3)                      
#>  tibble         3.0.4      2020-10-12 [1] CRAN (R 3.6.3)                      
#>  tidygraph      1.2.0      2020-05-12 [1] CRAN (R 3.6.3)                      
#>  tidyr          1.1.2      2020-08-27 [1] CRAN (R 3.6.3)                      
#>  tidyselect     1.1.0      2020-05-11 [1] CRAN (R 3.6.3)                      
#>  tmap         * 3.2        2020-09-15 [1] CRAN (R 3.6.3)                      
#>  tmaptools      3.1        2020-07-12 [1] Github (mtennekes/tmaptools@947f3bd)
#>  units          0.6-7      2020-06-13 [1] CRAN (R 3.6.3)                      
#>  usethis        1.6.3      2020-09-17 [1] CRAN (R 3.6.3)                      
#>  vctrs          0.3.4      2020-08-29 [1] CRAN (R 3.6.3)                      
#>  viridisLite    0.3.0      2018-02-01 [1] CRAN (R 3.6.0)                      
#>  withr          2.3.0      2020-09-22 [1] CRAN (R 3.6.3)                      
#>  xfun           0.19       2020-10-30 [1] CRAN (R 3.6.3)                      
#>  XML            3.99-0.3   2020-01-20 [1] CRAN (R 3.6.2)                      
#>  xml2           1.3.2      2020-04-23 [1] CRAN (R 3.6.3)                      
#>  yaml           2.2.1      2020-02-01 [1] CRAN (R 3.6.2)                      
#> 
#> [1] C:/Users/Utente/Documents/R/win-library/3.6
#> [2] C:/Program Files/R/R-3.6.3/library

@Robinlovelace
Copy link
Member

Hi @agila5 many thanks for the idea and detailed reprex. I think your idea has (fast) legs! Plan to pick up on this when the 'dust has settled' around some other projects/ideas but overall I think this is a good way forward. On a related note... I have found a way to split all linestrings into a maximum length using qgisprocess: https://github.com/paleolimbot/qgisprocess

Catch up soon!

@agila5
Copy link
Collaborator Author

agila5 commented Nov 18, 2020

Hi!

Plan to pick up on this when the 'dust has settled' around some other projects/ideas but overall I think this is a good way forward.

Awesome, thanks! IMO the idea is good, and I've already started using it. I will update my previous comment if I find any bug/problem.

On a related note... I have found a way to split all linestrings into a maximum length using qgisprocess: https://github.com/paleolimbot/qgisprocess

I also checked the qgisprocess project since QGIS implements an algorithm to split a line layer using a point layer (see here). In the end, I abandoned that approach since 1) I've never used QGIS and 2) it may take forever to get permissions to install QGIS on the university server. I will be happy to see a comparison.

@Robinlovelace
Copy link
Member

Robinlovelace commented Nov 21, 2020

Heads-up @agila5, I've been doing some experiments and have found a function for getting boundary points that is 10 times faster than the sf-based one you demonstrate above, thanks again to sfheaders:

# alternative
get_boundary_points3 <- function(rnet) {
  coordinates <- sfheaders::sf_to_df(rnet)
  # names(coordinates)
  head(coordinates)
  coordinates <- as.matrix(coordinates[c("x", "y", "linestring_id")])
  L1_index <- ncol(coordinates)
  coordinates <- unname(coordinates)
  first_pair <- !duplicated(coordinates[, L1_index])
  last_pair <- !duplicated(coordinates[, L1_index], fromLast = TRUE)
  idxs <- first_pair | last_pair
  pairs <- coordinates[idxs, ]
  boundary_points <- sfheaders::sf_point(
    pairs[, -L1_index]
  )
  sf::st_crs(boundary_points) <- sf::st_crs(rnet)
  boundary_points
}

Full reprex below:

# Aim: test rnet_breakup components


# setup -------------------------------------------------------------------

remotes::install_github("itsleeds/osmextract")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'osmextract' from a github remote, the SHA1 (c8f08f09) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(osmextract)
#> Data (c) OpenStreetMap contributors, ODbL 1.0. https://www.openstreetmap.org/copyright.
#> Any product made from OpenStreetMap must cite OSM as the data source.
#> Geofabrik data are taken from https://download.geofabrik.de/
#> For usage details of bbbike data see https://download.bbbike.org/osm/
#> OpenStreetMap_fr data are taken from http://download.openstreetmap.fr/
library(stplanr)
rnet <- oe_get(
  place = "Isle of Wight",
  quiet = FALSE,
  query = "
  SELECT osm_id, highway, z_order, other_tags, geometry
  FROM 'lines'
  WHERE highway IN (
   'primary', 'primary_link',
   'secondary', 'secondary_link',
   'tertiary', 'tertiary_link',
   'unclassified', 'residential', 'service'
  )
  "
)
#> The input place was matched with: Isle of Wight
#> The chosen file was already detected in the download directory. Skip downloading.
#> The corresponding gpkg file was already detected. Skip vectortranslate operations.
#> Reading layer `lines' from data source `/mnt/57982e2a-2874-4246-a6fe-115c199bc6bd/data/osm/geofabrik_isle-of-wight-latest.gpkg' using driver `GPKG'
#> Simple feature collection with 11475 features and 4 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -1.583473 ymin: 50.57568 xmax: -1.070221 ymax: 50.76731
#> geographic CRS: WGS 84

system.time(rnet_old <- rnet_breakup_vertices(rnet)) # ~10s
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#>    user  system elapsed 
#>   9.431   0.032   9.463

# basic checks
nrow(rnet_old) / nrow(rnet) # 1.7
#> [1] 1.747538
sum(sf::st_length(rnet_old)) / sum(sf::st_length(rnet)) # 1
#> 1 [1]

# andrea's new boundary points function:
get_boundary_points <- function(rnet) {
  # 1a. extract coordinates
  coordinates <- sf::st_coordinates(rnet)
  
  # 1b. Find index of L1 column
  L1_index <- ncol(coordinates)
  
  # 1c. Remove colnames
  coordinates <- unname(coordinates)
  
  # 2. Find idxs of first and last coordinate (i.e. the boundary points)
  first_pair <- !duplicated(coordinates[, L1_index])
  last_pair <- !duplicated(coordinates[, L1_index], fromLast = TRUE)
  idxs <- first_pair | last_pair
  
  # 3. Extract idxs and rebuild sfc
  pairs <- coordinates[idxs, ]
  boundary_points <- sf::st_cast(
    sf::st_sfc(
      sf::st_multipoint(pairs[, -L1_index]),
      crs = sf::st_crs(rnet)
    ),
    "POINT"
  )
  boundary_points
}

# alternative
get_boundary_points2 <- function(rnet) {
  coordinates <- sfheaders::sf_to_df(rnet)
  # names(coordinates)
  head(coordinates)
  coordinates <- as.matrix(coordinates[c("x", "y", "linestring_id")])
  L1_index <- ncol(coordinates)
  coordinates <- unname(coordinates)
  first_pair <- !duplicated(coordinates[, L1_index])
  last_pair <- !duplicated(coordinates[, L1_index], fromLast = TRUE)
  idxs <- first_pair | last_pair
  pairs <- coordinates[idxs, ]
  boundary_points <- sf::st_cast(
    sf::st_sfc(
      sf::st_multipoint(pairs[, -L1_index]),
      crs = sf::st_crs(rnet)
    ),
    "POINT"
  )
  boundary_points
}

# alternative
get_boundary_points3 <- function(rnet) {
  coordinates <- sfheaders::sf_to_df(rnet)
  # names(coordinates)
  head(coordinates)
  coordinates <- as.matrix(coordinates[c("x", "y", "linestring_id")])
  L1_index <- ncol(coordinates)
  coordinates <- unname(coordinates)
  first_pair <- !duplicated(coordinates[, L1_index])
  last_pair <- !duplicated(coordinates[, L1_index], fromLast = TRUE)
  idxs <- first_pair | last_pair
  pairs <- coordinates[idxs, ]
  boundary_points <- sfheaders::sf_point(
    pairs[, -L1_index]
  )
  sf::st_crs(boundary_points) <- sf::st_crs(rnet)
  boundary_points
}

system.time(bp1 <- get_boundary_points(rnet)) # 0.3s
#>    user  system elapsed 
#>   0.277   0.000   0.277
system.time(bp2 <- get_boundary_points2(rnet)) # 0.3s
#>    user  system elapsed 
#>   0.367   0.000   0.367
system.time(bp3 <- get_boundary_points3(rnet)) # 10 times faster!
#>    user  system elapsed 
#>   0.034   0.000   0.034

identical(bp1, bp2)
#> [1] TRUE
identical(sf::st_coordinates(bp1), sf::st_coordinates(bp3))
#> [1] TRUE

Created on 2020-11-21 by the reprex package (v0.3.0)

@Robinlovelace
Copy link
Member

Heads-up @agila5 I've implemented the function as rnet_junction() here: #443

Experiment at this stage but thinking this could be a good step towards fixing this long standing (and generally long) issue!

@Robinlovelace
Copy link
Member

Update @agila5, I think we can make lots of savings, it will be substantially faster just by replacing line2points with rnet_boundary_points (or maybe even rnet_boundary_unique). Based on the latest stplanr version on master, I'd be very grateful if you could try running this script: https://github.com/ropensci/stplanr/blob/master/data-raw/ad-hoc-tests/test-rnet-breakup-timing1.R

I've not tried your linestring specific implementation of st_split() but think that with that and these faster point functions we can fix this issue, and make the code more modular and easy-to-maintain.

See also ad-hoc tests in #443

@agila5
Copy link
Collaborator Author

agila5 commented Nov 28, 2020

Hi Robin! I run (part of that) code on my laptop and I think there is a problem with rnet_boundary_points:

# dev packages
remotes::install_github("ropensci/stplanr")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'stplanr' from a github remote, the SHA1 (59562f89) has not changed since last install.
#>   Use `force = TRUE` to force installation

# packages
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(stplanr)

# data
rnet = trafficalmr::tc_data_osm

# output
bpline2points <- line2points(rnet)
boundary_points <- rnet_boundary_points(rnet)

# plot
par(mar = rep(0, 4))
plot(rnet$geometry)
plot(boundary_points, add = TRUE, cex = 0.75)

# nothing happens, why? This is clearly not the intended result, right? 
boundary_points
#> Simple feature collection with 998 features and 0 fields
#> geometry type:  POINT
#> dimension:      XYZ
#> bbox:           xmin: 1 ymin: -0.116214 xmax: 499 ymax: -0.1021283
#> z_range:        zmin: 51.50323 zmax: 51.50875
#> geographic CRS: WGS 84
#> First 10 features:
#>                          geometry
#> 1  POINT Z (1 -0.113894 51.50631)
#> 2  POINT Z (1 -0.1127652 51.50...
#> 3  POINT Z (2 -0.104463 51.50714)
#> 4  POINT Z (2 -0.1043284 51.50...
#> 5  POINT Z (3 -0.112129 51.50674)
#> 6  POINT Z (3 -0.1113433 51.50...
#> 7  POINT Z (4 -0.105887 51.50862)
#> 8  POINT Z (4 -0.1076651 51.50...
#> 9  POINT Z (5 -0.1141145 51.50...
#> 10 POINT Z (5 -0.1140915 51.50...

# old output
plot(rnet$geometry)
plot(bpline2points$geometry, add = TRUE, cex = 0.75)

Created on 2020-11-28 by the reprex package (v0.3.0)

If you want I can check the problem in the afternoon.

I've not tried your linestring specific implementation of st_split() but think that with that and these faster point functions we can fix this issue, and make the code more modular and easy-to-maintain.

Great! So you are happy for a PR that add and tests the new approach?

@Robinlovelace
Copy link
Member

Robinlovelace commented Nov 29, 2020

You are of course right, it was an unfinished function as documented in #450, just fixed.

I think the boundary point functions are now finished and fast, as shown in the updated reprex below, suggesting we can already double the speed of rnet_breakup_vertices(). And in answer to your question @agila5

Great! So you are happy for a PR that add and tests the new approach?

Sorry for the slow reply, yes please!

Code for a reprex (key bits in a reprex and will push in a commit below):

remotes::install_github("ropensci/stplanr")

library(sf)
library(stplanr)
# rnet = stplanr::rnet_roundabout
# rnet = stplanr::osm_net_example
rnet = trafficalmr::tc_data_osm

rnet_boundary_rnet_breakup <- function(rnet) {
  rnet_nodes <- sf::st_geometry(line2points(rnet))
  rnet_internal_vertexes <- sf::st_geometry(line2vertices(rnet))
  unique_rnet_nodes <- do.call("c", unique(rnet_nodes))
  unique_rnet_internal_vertexes <- do.call("c", unique(rnet_internal_vertexes))
  rbind_nodes_internal_vertexes <- rbind(unique_rnet_nodes, unique_rnet_internal_vertexes)
  index_intersection_points <- duplicated(rbind_nodes_internal_vertexes)
    intersection_points <- sf::st_as_sf(
      data.frame(rbind_nodes_internal_vertexes[index_intersection_points, , drop = FALSE]),
      coords = c("x_coords", "y_coords"),
      crs = sf::st_crs(rnet)
    )
}

`%dtIN%` <- function(y, x) {
  tmp = data.table::rbindlist(list(x,y))
  len_ = nrow(x)
  tmp[, idx := any(.I <= len_) & .N > 1L, by=names(tmp)]
  tail(tmp$idx, nrow(y))
}

rnet_boundary_rnet_breakup2 <- function(rnet) {
  coordinates <- sfheaders::sf_to_df(rnet)
  first_pair <- !duplicated(coordinates[["sfg_id"]])
  last_pair <- !duplicated(coordinates[["sfg_id"]], fromLast = TRUE)
  idxs <- first_pair | last_pair
  pairs <- unique(coordinates[idxs, c("x", "y")])
  i <- coordinates[!idxs, c("x", "y")]
  # find 'internal intersections'
  i_dup <- duplicated(i)
  i_u <- unique(i)
  # this stage can be made a bit faster with data.table:
  # https://stackoverflow.com/questions/23971160
  # i_in_pairs = interaction(i_u) %in% interaction(pairs) # very slow!
  if(requireNamespace("data.table", quietly = TRUE))
  i_in_pairs <- i_u %dtIN% pairs
  p_sf <- rbind(i_u[i_in_pairs, ], i[i_dup, ])
  p <- sfheaders::sf_point(unique(p_sf))
  sf::st_crs(p) <- sf::st_crs(rnet)
  p
}

p1 = rnet_boundary_rnet_breakup(rnet)
p2 = rnet_boundary_rnet_breakup2(rnet)

nrow(p1)
nrow(p2)
plot(p1)
plot(p2, cex = 2, add = TRUE)

bpline2points <- line2points(rnet)
boundary_points <- rnet_boundary_points(rnet)
boundary_points2 <- rnet_duplicated_vertices(rnet)
boundary_points_rnb <- rnet_boundary_rnet_breakup(rnet)
boundary_points_rnb2 <- rnet_boundary_rnet_breakup2(rnet)
plot(rnet$geometry)
plot(boundary_points, add = TRUE, cex = 0.5) # too many
plot(boundary_points2, add = TRUE)
plot(boundary_points_rnb, col = "red", add = TRUE, cex = 3)

nrow(bpline2points)
nrow(boundary_points)
nrow(boundary_points2)
nrow(boundary_points_rnb)

identical(boundary_points2$geometry, boundary_points_rnb$geometry)
summary(boundary_points2$geometry %in% boundary_points_rnb$geometry)

bench::mark(check = FALSE,
  l2p = {line2points(rnet)},
  rbp = {rnet_boundary_points(rnet)},
  rdv = {rnet_duplicated_vertices(rnet)},
  rnb = {rnet_boundary_rnet_breakup(rnet)},
  rnb2 = {rnet_boundary_rnet_breakup2(rnet)},
  rnbv = {rnet_breakup_vertices(rnet)}
)

@Robinlovelace
Copy link
Member

bench::mark(check = FALSE,
  l2p = {line2points(rnet)},
  rbp = {rnet_boundary_points(rnet)},
  rdv = {rnet_duplicated_vertices(rnet)},
  rnb = {rnet_boundary_rnet_breakup(rnet)},
  rnb2 = {rnet_boundary_rnet_breakup2(rnet)},
  rnbv = {rnet_breakup_vertices(rnet)}
)
#> # A tibble: 6 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 l2p         32.81ms   35.1ms     28.4   477.23KB     5.68
#> 2 rbp          1.24ms   1.39ms    694.    422.35KB    10.0 
#> 3 rdv         13.39ms  14.09ms     70.5     1.49MB     3.92
#> 4 rnb         45.19ms  46.86ms     21.2     1.05MB     5.79
#> 5 rnb2         4.89ms    5.4ms    180.    716.48KB     5.94
#> 6 rnbv       191.64ms 195.81ms      5.12    7.51MB     6.83

Robinlovelace added a commit that referenced this issue Nov 29, 2020
@agila5
Copy link
Collaborator Author

agila5 commented Nov 29, 2020

Hi! I didn't check all the details yet but I think that the message is to use rnet_boundary_points if I simply need to extract the boundary points for a line, right?

Sorry for the slow reply, yes please!

Ok, great! I will add something here and in sfnetworks repo after lunch!

@Robinlovelace
Copy link
Member

Hi! I didn't check all the details yet but I think that the message is to use rnet_boundary_points if I simply need to extract the boundary points for a line, right?

Yes. If you want to avoid duplicate points use rnet_boundary_unique(). I'm not sure what to do about the duplication of line2points() which I think is now essentially the same as rnet_boundary_points() but that's not a high priority. Priority: speed-up rnet_breakup_vertices() which is certainly possible!

One other thing: do you think it's worth breaking the function into smaller exported pieces, such as:

`%dtIN%` <- function(y, x) {
tmp = data.table::rbindlist(list(x,y))
len_ = nrow(x)
tmp[, idx := any(.I <= len_) & .N > 1L, by=names(tmp)]
tail(tmp$idx, nrow(y))
}
rnet_boundary_rnet_breakup2 <- function(rnet) {
coordinates <- sfheaders::sf_to_df(rnet)
first_pair <- !duplicated(coordinates[["sfg_id"]])
last_pair <- !duplicated(coordinates[["sfg_id"]], fromLast = TRUE)
idxs <- first_pair | last_pair
pairs <- unique(coordinates[idxs, c("x", "y")])
i <- coordinates[!idxs, c("x", "y")]
# find 'internal intersections'
i_dup <- duplicated(i)
i_u <- unique(i)
# this stage can be made a bit faster with data.table:
# https://stackoverflow.com/questions/23971160
# i_in_pairs = interaction(i_u) %in% interaction(pairs) # very slow!
if(requireNamespace("data.table", quietly = TRUE))
i_in_pairs <- i_u %dtIN% pairs
p_sf <- rbind(i_u[i_in_pairs, ], i[i_dup, ])
p <- sfheaders::sf_point(unique(p_sf))
sf::st_crs(p) <- sf::st_crs(rnet)
p
}

I think so!

@Robinlovelace
Copy link
Member

Robinlovelace commented Nov 29, 2020

Clarification, I think the code chunk shown in the test code above can be used, after the fast version of the function rnet_boundary_rnet_breakup() is in the package and exported, to replace this:

stplanr/R/rnet-clean.R

Lines 100 to 121 in 8f055db

rnet_nodes <- sf::st_geometry(line2points(rnet))
rnet_internal_vertexes <- sf::st_geometry(line2vertices(rnet))
# For the first part of the procedure I don't need duplicated nodes or
# duplicated vertexes so I can extract their unique values
unique_rnet_nodes <- do.call("c", unique(rnet_nodes))
unique_rnet_internal_vertexes <- do.call("c", unique(rnet_internal_vertexes))
# Intersection between nodes and internal vertexes
# The following code is the same as
# intersection_point <- sf::st_intersection(unique_rnet_nodes, unique_rnet_internal_vertexes)
# but faster since we are dealing only with points
rbind_nodes_internal_vertexes <- rbind(unique_rnet_nodes, unique_rnet_internal_vertexes)
index_intersection_points <- duplicated(rbind_nodes_internal_vertexes)
if (any(index_intersection_points)) {
intersection_points <- sf::st_as_sf(
data.frame(rbind_nodes_internal_vertexes[index_intersection_points, , drop = FALSE]),
coords = c("x_coords", "y_coords"),
crs = sf::st_crs(rnet)
)

With something like this:

    intersection_points <- rnet_boundary_rnet_breakup(rnet)

@Robinlovelace
Copy link
Member

Ok, great! I will add something here and in sfnetworks repo after lunch!

Almost worth creating a new package called rnets or similar to reduce duplication but happy with that approach for not to get things moving 🚀

agila5 pushed a commit to agila5/stplanr that referenced this issue Nov 29, 2020
@luukvdmeer
Copy link

If we can get 10 times faster boundary points function in sfnetworks would be great @agila5 @Robinlovelace ! (Note that on the develop branch of sfnetworks that function is now renamed to linestring_boundary_points)

@Robinlovelace
Copy link
Member

Robinlovelace commented Dec 2, 2020

Heads-up @luukvdmeer this is the faster code, pretty quick. BTW I've had doubts about the name rnet_boundary_points() and wonder if line_boundary_points() or even line_boundaries() would be better. Thoughts @agila5 ?

rnet_boundary_points <- function(rnet) {
pairs <- rnet_boundary_df(rnet)
pairs_xyz <- pairs[names(pairs) %in% c("x", "y", "z")]
boundary_points <- sfheaders::sf_point(pairs_xyz)
sf::st_crs(boundary_points) <- sf::st_crs(rnet)
boundary_points
}
#' @rdname rnet_boundary_points
#' @export
rnet_boundary_df <- function(rnet) {
stopifnot(requireNamespace("sfheaders", quietly = TRUE))
coordinates <- sfheaders::sf_to_df(rnet)
first_pair <- !duplicated(coordinates[["sfg_id"]])
last_pair <- !duplicated(coordinates[["sfg_id"]], fromLast = TRUE)
idxs <- first_pair | last_pair
pairs <- coordinates[idxs, ]
pairs
}

@luukvdmeer
Copy link

Great! I will test this in sfnetworks and show results here. If you give permission I will include it also in the new version afterwards.

I might change coordinates <- sfheaders::sf_to_df(rnet) to coordinates <- sfheaders::sfc_to_df(sf::st_geometry(rnet)) such that it works for both sf and sfc objects. In sfnetworks this is important to allow creating a network from and sfc object. Not sure if it also makes sense in stplanr.

@luukvdmeer
Copy link

Using this makes constructing an sfnetwork from the Roxel dataset twice as fast, very nice! (20 milliseconds instead of 40 milliseconds)

@mpadge
Copy link
Member

mpadge commented Dec 2, 2020

Thanks for all contributions to this really interesting issue. Just one note via reprex to indicate the OSM ID values are intended to be the definitive guide for nodal intersections, and they exist in many cases, including stplanr::osm_net_example. Using them enables even more efficient identification of boundary points. (These lines follow on from previous code by @Robinlovelace). The following shows R code to clarify what is being done, followed by C++ code to show how to gain even more efficiency.

library(stplanr)
library(sf)
rnet = stplanr::osm_net_example

# define:
# rnet_boundary_rnet_breakup
# `%dtIN%`
# rnet_boundary_rnet_breakup2


boundary_points_by_id_r <- function (rnet) {
    ids <- lapply (rnet$geometry, function (i) rownames (as.matrix (i)))
    ends <- vapply (ids, function (i) i [c (1, length (i))],
                    character (2), USE.NAMES = FALSE)
    ends <- unique (as.vector (ends))

    mids <- lapply (ids, function (i) i [-c (1, length (i))])
    mids <- unname (do.call (c, mids))
    mid_is_end <- mids [mids %in% ends]
    mids <- table (mids)
    mids_dup <- names (mids) [which (mids > 1)]

    ids <- unique (c (mid_is_end, mids_dup))

    xy <- do.call (rbind, lapply (rnet$geometry, function (i) as.matrix (i)))
    xy <- xy [match (ids, rownames (xy)), ]
    p <- sfheaders::sf_point (xy)
    sf::st_crs(p) <- sf::st_crs(rnet)
    return (p)
}

library(Rcpp)
cppFunction('
Rcpp::CharacterVector getids (Rcpp::List g) {
    const int n = g.size ();

    std::unordered_set <std::string> endset, midset, middupset;
    Rcpp::List mids (n);
    for (Rcpp::NumericMatrix i: g) {
        Rcpp::List dimnames = i.attr ("dimnames");
        Rcpp::CharacterVector rownames = dimnames (0);

        endset.emplace (rownames (0));
        endset.emplace (rownames (rownames.size () - 1));

        rownames.erase (0);
        rownames.erase (rownames.size () - 1);
        for (auto j: rownames)
        {
            std::string js = Rcpp::as <std::string> (j);
            if (midset.find (js) != midset.end ())
                middupset.emplace (j);
            midset.emplace (js);
        }
    }

    std::unordered_set <std::string> mid_is_end;
    for (auto i: midset) {
        if (endset.find (i) != endset.end ())
            mid_is_end.emplace (i);
    }

    std::unordered_set <std::string> res;
    for (auto i: mid_is_end)
        res.emplace (i);
    for (auto i: middupset)
        res.emplace (i);

    return Rcpp::wrap (res);
    }
')

boundary_points_by_id_cpp <- function (rnet) {
    ids <- getids (rnet$geometry)
    xy <- do.call (rbind, lapply (rnet$geometry, function (i) as.matrix (i)))
    xy <- xy [match (ids, rownames (xy)), ]
    p <- sfheaders::sf_point (xy)
    sf::st_crs(p) <- sf::st_crs(rnet)
    return (p)
}

pref <- rnet_boundary_rnet_breakup2 (rnet)
pr <- boundary_points_by_id_r (rnet)
pc <- boundary_points_by_id_cpp (rnet)
nrow (pref); nrow (pr); nrow (pc)
#> [1] 55
#> [1] 55
#> [1] 55

bench::mark(check = FALSE,
  l2p = {line2points(rnet)},
  rbp = {rnet_boundary_points(rnet)},
  rdv = {rnet_duplicated_vertices(rnet)},
  rnb = {rnet_boundary_rnet_breakup(rnet)},
  rnb2 = {rnet_boundary_rnet_breakup2(rnet)},
  rnbv = {rnet_breakup_vertices(rnet)},
  rrid = {boundary_points_by_id_r(rnet)},
  rcid = {boundary_points_by_id_cpp(rnet)},
  time_unit = "ms")
#> # A tibble: 8 x 6
#>   expression    min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 l2p         5.83   6.10      153.         NA     6.36
#> 2 rbp         0.435  0.471    2098.         NA     6.20
#> 3 rdv         2.99   3.15      313.         NA     2.03
#> 4 rnb         9.84  10.2        95.1        NA     6.48
#> 5 rnb2        2.10   2.34      405.         NA     4.12
#> 6 rnbv       42.9   44.2        22.6        NA     5.02
#> 7 rrid        1.66   1.80      539.         NA     6.24
#> 8 rcid        0.878  0.936    1021.         NA     4.22

Created on 2020-12-02 by the reprex package (v0.3.0)

Both versions there are pretty hacky, and the C++ could likely be made even more efficient. Main point I wanted to make is that when a new boundary_vertices function like this is incorporated, I'd suggest it would be important to examine whether the underlying data retain OSM IDs as rownames of the sf geometries, and to use those if so via a function like boundary_points_by_id, and only otherwise switch to using coordinates. (And even then, the coordinates version could easily be C++-ed for extra gains - let me know, and I'll be happy to PR some code.)

I've been needing a good home for a function like this too, so would be very happy to have it housed in stplanr 😄

@Robinlovelace
Copy link
Member

If you give permission I will include it also in the new version afterwards.

Very happy with that @luukvdmeer, thanks!

@Robinlovelace
Copy link
Member

Robinlovelace commented Dec 2, 2020

Many thanks for chipping in @mpadge, impressive CPP implementation. Confirmed: decent speeded-up version of rnet_boundary_rnet_breakup2. Full reprex tested below:

#>   expression    min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 l2p         5.53   6.33      159.    96.33KB     5.77
#> 2 rbp         0.423  0.463    2098.    85.59KB     6.26
#> 3 rdv         3.25   3.51      282.   346.93KB     2.04
#> 4 rnb        10.1   10.9        90.7  276.41KB     6.64
#> 5 rnb2        2.04   2.23      440.   187.54KB     6.34
#> 6 rnbv       38.4   40.0        25.1    1.29MB     4.56
#> 7 rrid        1.70   1.89      524.    88.64KB     6.29
#> 8 rcid        0.740  0.840    1169.    82.91KB     4.11
# Aim: try to speed-up rnet_breakup_vertices

remotes::install_github("ropensci/stplanr")
#> Using github PAT from envvar GITHUB_PAT
#> Downloading GitHub repo ropensci/stplanr@HEAD
#> 
#>      checking for file ‘/tmp/Rtmpwfdfrp/remotes103c22f776dc/ropensci-stplanr-8f055db/DESCRIPTION’ ...  ✔  checking for file ‘/tmp/Rtmpwfdfrp/remotes103c22f776dc/ropensci-stplanr-8f055db/DESCRIPTION’
#>   ─  preparing ‘stplanr’:
#>      checking DESCRIPTION meta-information ...  ✔  checking DESCRIPTION meta-information
#>      ─  cleaning src
#>   ─  checking for LF line-endings in source and make files and shell scripts
#>   ─  checking for empty or unneeded directories
#>   ─  building ‘stplanr_0.8.0.9000.tar.gz’
#>      
#> 
#> Installing package into '/home/robin/R/x86_64-pc-linux-gnu-library/4.0'
#> (as 'lib' is unspecified)

library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
library(stplanr)
# rnet = stplanr::rnet_roundabout
# rnet = stplanr::osm_net_example
rnet = trafficalmr::tc_data_osm

rnet_boundary_rnet_breakup <- function(rnet) {
  rnet_nodes <- sf::st_geometry(line2points(rnet))
  rnet_internal_vertexes <- sf::st_geometry(line2vertices(rnet))
  unique_rnet_nodes <- do.call("c", unique(rnet_nodes))
  unique_rnet_internal_vertexes <- do.call("c", unique(rnet_internal_vertexes))
  rbind_nodes_internal_vertexes <- rbind(unique_rnet_nodes, unique_rnet_internal_vertexes)
  index_intersection_points <- duplicated(rbind_nodes_internal_vertexes)
    intersection_points <- sf::st_as_sf(
      data.frame(rbind_nodes_internal_vertexes[index_intersection_points, , drop = FALSE]),
      coords = c("x_coords", "y_coords"),
      crs = sf::st_crs(rnet)
    )
}

`%dtIN%` <- function(y, x) {
  tmp = data.table::rbindlist(list(x,y))
  len_ = nrow(x)
  tmp[, idx := any(.I <= len_) & .N > 1L, by=names(tmp)]
  tail(tmp$idx, nrow(y))
}

rnet_boundary_rnet_breakup2 <- function(rnet) {
  coordinates <- sfheaders::sf_to_df(rnet)
  first_pair <- !duplicated(coordinates[["sfg_id"]])
  last_pair <- !duplicated(coordinates[["sfg_id"]], fromLast = TRUE)
  idxs <- first_pair | last_pair
  pairs <- unique(coordinates[idxs, c("x", "y")])
  i <- coordinates[!idxs, c("x", "y")]
  # find 'internal intersections'
  i_dup <- duplicated(i)
  i_u <- unique(i)
  # this stage can be made a bit faster with data.table:
  # https://stackoverflow.com/questions/23971160
  # i_in_pairs = interaction(i_u) %in% interaction(pairs) # very slow!
  if(requireNamespace("data.table", quietly = TRUE))
  i_in_pairs <- i_u %dtIN% pairs
  p_sf <- rbind(i_u[i_in_pairs, ], i[i_dup, ])
  p <- sfheaders::sf_point(unique(p_sf))
  sf::st_crs(p) <- sf::st_crs(rnet)
  p
}

p1 = rnet_boundary_rnet_breakup(rnet)
p2 = rnet_boundary_rnet_breakup2(rnet)

nrow(p1)
#> [1] 203
nrow(p2)
#> [1] 253
plot(p1)
plot(p2, cex = 2, add = TRUE)

bpline2points <- line2points(rnet)
boundary_points <- rnet_boundary_points(rnet)
boundary_points2 <- rnet_duplicated_vertices(rnet)
boundary_points_rnb <- rnet_boundary_rnet_breakup(rnet)
boundary_points_rnb2 <- rnet_boundary_rnet_breakup2(rnet)
plot(rnet$geometry)
plot(boundary_points, add = TRUE, cex = 0.5) # too many
plot(boundary_points2, add = TRUE)
plot(boundary_points_rnb, col = "red", add = TRUE, cex = 3)

nrow(bpline2points)
#> [1] 998
nrow(boundary_points)
#> [1] 998
nrow(boundary_points2)
#> [1] 523
nrow(boundary_points_rnb)
#> [1] 203

identical(boundary_points2$geometry, boundary_points_rnb$geometry)
#> [1] FALSE
summary(boundary_points2$geometry %in% boundary_points_rnb$geometry)
#>    Mode   FALSE    TRUE 
#> logical     320     203

bench::mark(check = FALSE,
  l2p = {line2points(rnet)},
  rbp = {rnet_boundary_points(rnet)},
  rdv = {rnet_duplicated_vertices(rnet)},
  rnb = {rnet_boundary_rnet_breakup(rnet)},
  rnb2 = {rnet_boundary_rnet_breakup2(rnet)},
  rnbv = {rnet_breakup_vertices(rnet)}
)
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 6 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 l2p         30.82ms  31.93ms     29.5   477.23KB     7.88
#> 2 rbp          1.31ms   1.42ms    613.    422.35KB     9.99
#> 3 rdv         13.36ms   13.6ms     70.5     1.49MB     3.92
#> 4 rnb          42.3ms  45.39ms     21.9     1.05MB     5.96
#> 5 rnb2         5.02ms    5.2ms    177.    716.48KB     7.94
#> 6 rnbv       180.59ms 180.83ms      5.48    7.47MB     5.48

library(profvis)
profvis({
  coordinates <- sfheaders::sf_to_df(trafficalmr::tc_data_osm)
  first_pair <- !duplicated(coordinates[["sfg_id"]])
  last_pair <- !duplicated(coordinates[["sfg_id"]], fromLast = TRUE)
  idxs <- first_pair | last_pair
  pairs <- unique(coordinates[idxs, c("x", "y")])
  i <- coordinates[!idxs, c("x", "y")]
  # find 'internal intersections'
  i_dup <- duplicated(i)
  i_u <- unique(i)
  # this stage can be made a bit faster with data.table:
  # https://stackoverflow.com/questions/23971160
  # i_in_pairs = interaction(i_u) %in% interaction(pairs) # very slow!
  tmp <- data.table::rbindlist(list(i_u, pairs))
  len_ <- nrow(i_u)
  tmp[, idx := any(.I <= len_) & .N > 1L, by=names(tmp)]
  i_in_pairs <- tail(tmp$idx, nrow(pairs))
  p_sf <- rbind(i_u[i_in_pairs, ], i[i_dup, ])
  p <- sfheaders::sf_point(unique(p_sf))
  sf::st_crs(p) <- sf::st_crs(rnet)
  p
})
#> Error in parse_rprof(prof_output, expr_source): No parsing data available. Maybe your function was too fast?


rnet_breakup_vertices2 <- function(rnet, breakup_internal_vertex_matches = TRUE) {
  rnet_nodes <- sf::st_geometry(line2points(rnet))
  rnet_internal_vertexes <- sf::st_geometry(line2vertices(rnet))

  # For the first part of the procedure I don't need duplicated nodes or
  # duplicated vertexes so I can extract their unique values
  unique_rnet_nodes <- do.call("c", unique(rnet_nodes))
  unique_rnet_internal_vertexes <- do.call("c", unique(rnet_internal_vertexes))

  # Intersection between nodes and internal vertexes
  # The following code is the same as
  # intersection_point <- sf::st_intersection(unique_rnet_nodes, unique_rnet_internal_vertexes)
  # but faster since we are dealing only with points

  rbind_nodes_internal_vertexes <- rbind(unique_rnet_nodes, unique_rnet_internal_vertexes)
  index_intersection_points <- duplicated(rbind_nodes_internal_vertexes)

  if (any(index_intersection_points)) {
    intersection_points <- sf::st_as_sf(
      data.frame(rbind_nodes_internal_vertexes[index_intersection_points, , drop = FALSE]),
      coords = c("x_coords", "y_coords"),
      crs = sf::st_crs(rnet)
    )

    message("Splitting rnet object at the shared boundary points.")
    rnet_breakup_collection <- lwgeom::st_split(rnet, intersection_points$geometry)
    rnet_clean <- sf::st_collection_extract(rnet_breakup_collection, "LINESTRING")
  } else {
    rnet_clean <- rnet
  }

  # Split again at the duplicated internal vertexes
  rnet_internal_vertexes_duplicated <- rnet_internal_vertexes[duplicated(rnet_internal_vertexes)]

  if (length(rnet_internal_vertexes_duplicated) > 0 & breakup_internal_vertex_matches) {
    message("Splitting rnet object at the shared internal points.")
    rnet_breakup_collection <- lwgeom::st_split(rnet_clean, rnet_internal_vertexes_duplicated)
    rnet_clean <- sf::st_collection_extract(rnet_breakup_collection, "LINESTRING")
  }

  rnet_clean
}

plot(st_geometry(rnet_roundabout), lwd = 2, col = rainbow(nrow(rnet_roundabout)))

</details>

bench::mark(f1 = {rnet_roundabout_clean <- rnet_breakup_vertices(rnet_roundabout)})
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared boundary points.
#> # A tibble: 1 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 f1           8.65ms   8.83ms      110.     216KB     4.25


boundary_points <- st_geometry(line2points(rnet_roundabout))
points_cols <- rep(rainbow(nrow(rnet_roundabout)), each = 2)
plot(boundary_points, pch = 16, add = TRUE, col = points_cols)

library(stplanr)
library(sf)
rnet = stplanr::osm_net_example

# define:
# rnet_boundary_rnet_breakup
# `%dtIN%`
# rnet_boundary_rnet_breakup2


boundary_points_by_id_r <- function (rnet) {
  ids <- lapply (rnet$geometry, function (i) rownames (as.matrix (i)))
  ends <- vapply (ids, function (i) i [c (1, length (i))],
                  character (2), USE.NAMES = FALSE)
  ends <- unique (as.vector (ends))
  
  mids <- lapply (ids, function (i) i [-c (1, length (i))])
  mids <- unname (do.call (c, mids))
  mid_is_end <- mids [mids %in% ends]
  mids <- table (mids)
  mids_dup <- names (mids) [which (mids > 1)]
  
  ids <- unique (c (mid_is_end, mids_dup))
  
  xy <- do.call (rbind, lapply (rnet$geometry, function (i) as.matrix (i)))
  xy <- xy [match (ids, rownames (xy)), ]
  p <- sfheaders::sf_point (xy)
  sf::st_crs(p) <- sf::st_crs(rnet)
  return (p)
}

library(Rcpp)
cppFunction('
Rcpp::CharacterVector getids (Rcpp::List g) {
    const int n = g.size ();

    std::unordered_set <std::string> endset, midset, middupset;
    Rcpp::List mids (n);
    for (Rcpp::NumericMatrix i: g) {
        Rcpp::List dimnames = i.attr ("dimnames");
        Rcpp::CharacterVector rownames = dimnames (0);

        endset.emplace (rownames (0));
        endset.emplace (rownames (rownames.size () - 1));

        rownames.erase (0);
        rownames.erase (rownames.size () - 1);
        for (auto j: rownames)
        {
            std::string js = Rcpp::as <std::string> (j);
            if (midset.find (js) != midset.end ())
                middupset.emplace (j);
            midset.emplace (js);
        }
    }

    std::unordered_set <std::string> mid_is_end;
    for (auto i: midset) {
        if (endset.find (i) != endset.end ())
            mid_is_end.emplace (i);
    }

    std::unordered_set <std::string> res;
    for (auto i: mid_is_end)
        res.emplace (i);
    for (auto i: middupset)
        res.emplace (i);

    return Rcpp::wrap (res);
    }
')

boundary_points_by_id_cpp <- function (rnet) {
  ids <- getids (rnet$geometry)
  xy <- do.call (rbind, lapply (rnet$geometry, function (i) as.matrix (i)))
  xy <- xy [match (ids, rownames (xy)), ]
  p <- sfheaders::sf_point (xy)
  sf::st_crs(p) <- sf::st_crs(rnet)
  return (p)
}

pref <- rnet_boundary_rnet_breakup2 (rnet)
pr <- boundary_points_by_id_r (rnet)
pc <- boundary_points_by_id_cpp (rnet)
nrow (pref); nrow (pr); nrow (pc)
#> [1] 55
#> [1] 55
#> [1] 55
#> [1] 55
#> [1] 55
#> [1] 55

bench::mark(check = FALSE,
            l2p = {line2points(rnet)},
            rbp = {rnet_boundary_points(rnet)},
            rdv = {rnet_duplicated_vertices(rnet)},
            rnb = {rnet_boundary_rnet_breakup(rnet)},
            rnb2 = {rnet_boundary_rnet_breakup2(rnet)},
            rnbv = {rnet_breakup_vertices(rnet)},
            rrid = {boundary_points_by_id_r(rnet)},
            rcid = {boundary_points_by_id_cpp(rnet)},
            time_unit = "ms")
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> # A tibble: 8 x 6
#>   expression    min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr>  <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 l2p         5.53   6.33      159.    96.33KB     5.77
#> 2 rbp         0.423  0.463    2098.    85.59KB     6.26
#> 3 rdv         3.25   3.51      282.   346.93KB     2.04
#> 4 rnb        10.1   10.9        90.7  276.41KB     6.64
#> 5 rnb2        2.04   2.23      440.   187.54KB     6.34
#> 6 rnbv       38.4   40.0        25.1    1.29MB     4.56
#> 7 rrid        1.70   1.89      524.    88.64KB     6.29
#> 8 rcid        0.740  0.840    1169.    82.91KB     4.11

Created on 2020-12-02 by the reprex package (v0.3.0)

@Robinlovelace
Copy link
Member

Main point I wanted to make is that when a new boundary_vertices function like this is incorporated, I'd suggest it would be important to examine whether the underlying data retain OSM IDs as rownames of the sf geometries, and to use those if so via a function like boundary_points_by_id, and only otherwise switch to using coordinates. (And even then, the coordinates version could easily be C++-ed for extra gains - let me know, and I'll be happy to PR some code.)

I'm up for that but suggest waiting until #451 by @agila5 is merged first. No harm in having parallel PRs but with line2vertices() already duplicated, the possibility of renaming these functions line_bound*() and a desire to tidy-up stplanr worth waiting until that is in. Will provide a stronger basis for a concise, reproducible benchmark also.

@Robinlovelace
Copy link
Member

But overall very keen for more super fast CPP code in there using auto and everything in this package!

@Robinlovelace
Copy link
Member

Regarding your point about OSM ids and boundary points @mpadge, interesting one. Looking into it with this:

mapview::mapview(rnet) +
  mapview::mapview(pc)

image

@Robinlovelace
Copy link
Member

I can see a point there but no OSM id:

image

@Robinlovelace
Copy link
Member

The closes OSM element I can find is this: https://www.openstreetmap.org/node/7592587253

@agila5
Copy link
Collaborator Author

agila5 commented Dec 2, 2020

Hi everyone and thank you very much for your inputs. I plan to check everything as soon as possible (Saturday probably). Just one note: the previous benchmarks should be adjusted since you are comparing different ideas (i.e. extract the boundary points, extract the "internal" points, breaking up a line string).

@luukvdmeer
Copy link

luukvdmeer commented Dec 3, 2020

If I may also contribute here ;)

rnet_boundary_rnet_breakup2 could also be written without the data.table dependency (which I would prefer for sfnetworks).

library(sf)
#> Linking to GEOS 3.7.1, GDAL 2.2.2, PROJ 4.9.2
library(sfheaders)

p1 = st_point(c(0, 1))
p2 = st_point(c(1, 1))
p3 = st_point(c(3, 1))
p4 = st_point(c(4, 1))
p5 = st_point(c(3, 2))
p6 = st_point(c(3, 0))
p7 = st_point(c(4, 3))
p8 = st_point(c(4, 2))
p9 = st_point(c(4, 0))

l1 = st_sfc(st_linestring(c(p1, p2, p3, p4)), crs = 4326)
l2 = st_sfc(st_linestring(c(p5, p3, p6)), crs = 4326)
l3 = st_sfc(st_linestring(c(p7, p8, p4, p9)), crs = 4326)

points = st_sfc(st_multipoint(c(p1, p2, p3, p4, p5, p6, p7, p8, p9)))
lines = c(l1, l2, l3)
lines = st_as_sf(lines)
par(mar = c(1, 1, 1, 1))
plot(lines, col = rainbow(nrow(lines)))
plot(points, pch = 20, add = TRUE)

get_split_points = function(x, unique = FALSE) {
  # Get linestring vertices.
  df = sfc_to_df(st_geometry(x))
  coords = df[names(df) %in% c("x", "y", "z", "m")]
  # Find which of the vertices are boundaries.
  is_startpoint = !duplicated(df[["sfg_id"]])
  is_endpoint = !duplicated(df[["sfg_id"]], fromLast = TRUE)
  is_boundary = is_startpoint | is_endpoint
  # Find which of the vertices occur more than one.
  is_duplicate_desc = duplicated(coords)
  is_duplicate_asc = duplicated(coords, fromLast = TRUE)
  is_multiple = is_duplicate_desc | is_duplicate_asc
  # Split points are those vertices that:
  # --> 1) Have duplicates, and
  # --> 2) Are not boundaries.
  split_idxs = is_multiple & !is_boundary
  # Subset coordinates.
  split_coords = coords[split_idxs, ]
  if (unique) split_coords = unique(split_coords)
  # Rebuild sf structure.
  split_sf = sf_point(split_coords)
  st_crs(split_sf) = st_crs(x)
  split_sf
}

split_points_1 = get_split_points(lines, unique = TRUE)
split_points_1
#> Simple feature collection with 2 features and 0 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 3 ymin: 1 xmax: 4 ymax: 1
#> CRS:            EPSG:4326
#>      geometry
#> 1 POINT (3 1)
#> 2 POINT (4 1)
plot(lines, col = rainbow(nrow(lines)))
plot(points, pch = 20, add = TRUE)
plot(split_points_1, pch = 8, col = "red", add = TRUE)

`%dtIN%` <- function(y, x) {
  tmp = data.table::rbindlist(list(x,y))
  len_ = nrow(x)
  tmp[, idx := any(.I <= len_) & .N > 1L, by=names(tmp)]
  tail(tmp$idx, nrow(y))
}

rnet_boundary_rnet_breakup2 <- function(rnet) {
  coordinates <- sfheaders::sf_to_df(rnet)
  first_pair <- !duplicated(coordinates[["sfg_id"]])
  last_pair <- !duplicated(coordinates[["sfg_id"]], fromLast = TRUE)
  idxs <- first_pair | last_pair
  pairs <- unique(coordinates[idxs, c("x", "y")])
  i <- coordinates[!idxs, c("x", "y")]
  # find 'internal intersections'
  i_dup <- duplicated(i)
  i_u <- unique(i)
  # this stage can be made a bit faster with data.table:
  # https://stackoverflow.com/questions/23971160
  # i_in_pairs = interaction(i_u) %in% interaction(pairs) # very slow!
  if(requireNamespace("data.table", quietly = TRUE))
    i_in_pairs <- i_u %dtIN% pairs
  p_sf <- rbind(i_u[i_in_pairs, ], i[i_dup, ])
  p <- sfheaders::sf_point(unique(p_sf))
  sf::st_crs(p) <- sf::st_crs(rnet)
  p
}

split_points_2 = rnet_boundary_rnet_breakup2(lines)
split_points_2
#> Simple feature collection with 2 features and 0 fields
#> geometry type:  POINT
#> dimension:      XY
#> bbox:           xmin: 3 ymin: 1 xmax: 4 ymax: 1
#> CRS:            EPSG:4326
#>      geometry
#> 1 POINT (4 1)
#> 2 POINT (3 1)

plot(lines, col = rainbow(nrow(lines)))
plot(points, pch = 20, add = TRUE)
plot(split_points_2, pch = 8, col = "red", add = TRUE)

bench::mark(
  check = FALSE,
  rnb2 = {rnet_boundary_rnet_breakup2(lines)},
  gsp = {get_split_points(lines)},
  time_unit = "ms"
)
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 rnb2       1.13   1.20       774.  100.78KB     24.5
#> 2 gsp        0.270  0.282     3326.    4.98KB     28.0

Created on 2020-12-03 by the reprex package (v0.3.0)

@Robinlovelace
Copy link
Member

bench::mark(
  check = FALSE,
  rnb2 = {rnet_boundary_rnet_breakup2(lines)},
  gsp = {get_split_points(lines)},
  time_unit = "ms"
)
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 rnb2       1.08   1.12       836.  100.78KB     24.5
#> 2 gsp        0.275  0.286     3327.    4.98KB     27.9

You made it 4 times faster again @luukvdmeer? 🤯 amazing if so, not double checked but very exciting if so, many thanks for sharing!

@mpadge
Copy link
Member

mpadge commented Dec 3, 2020

One more thought: All of the code using vertex coordinates relies on applying equality operations to floating point numbers (mostly via unique and duplicated calls), which is of course a strict no-no in any language. The good news is that you could simply go ahead and implement as is, then I could convert to the C++ version and extend to use of coordinates rather than (OSM) vertex IDs by adding a single line to construct vertex IDs by rounding floating point coordinates to specified precision, then submitting to same routine regardless. The routine would then - appropriately and necessarily - have an additional tolerance parameter to define numeric equality.

@luukvdmeer
Copy link

Here an extension of my previous post: a function to subdivide lines by shared internal vertices. Hence, an alternative to stplanr::rnet_breakup_vertices. Loosely based on the my_st_split from @agila5 . It is already much faster than stplanr::rnet_breakup_vertices, especially if you choose to drop the linestring attributes after splitting, but probably further improvement is very well possible. More checks needed also, if you like the idea (ping @agila5)

Of course @mpadge makes a very important point about the use of duplicated on floating point numbers. Interested to see the C++ implementation. Another way to work around this would be by using st_equals on a geometry column instead of duplicated on two coordinate columns, see the second implementation of the subdivide_lines function. This does slow the function down, but for a reason. But I assume it can be made much faster.

library(sf)
#> Linking to GEOS 3.7.1, GDAL 2.2.2, PROJ 4.9.2
library(sfheaders)

p1 = st_point(c(0, 1))
p2 = st_point(c(1, 1))
p3 = st_point(c(3, 1))
p4 = st_point(c(4, 1))
p5 = st_point(c(3, 2))
p6 = st_point(c(3, 0))
p7 = st_point(c(4, 3))
p8 = st_point(c(4, 2))
p9 = st_point(c(4, 0))

l1 = st_sfc(st_linestring(c(p1, p2, p3, p4)), crs = 4326)
l2 = st_sfc(st_linestring(c(p5, p3, p6)), crs = 4326)
l3 = st_sfc(st_linestring(c(p7, p8, p4, p9)), crs = 4326)

points = st_sfc(st_multipoint(c(p1, p2, p3, p4, p5, p6, p7, p8, p9)))
lines = c(l1, l2, l3)
lines = st_as_sf(lines)
lines$foo = letters[1:nrow(lines)]
lines
#> Simple feature collection with 3 features and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 0 ymin: 0 xmax: 4 ymax: 3
#> CRS:            EPSG:4326
#>                                x foo
#> 1 LINESTRING (0 1, 1 1, 3 1, ...   a
#> 2     LINESTRING (3 2, 3 1, 3 0)   b
#> 3 LINESTRING (4 3, 4 2, 4 1, ...   c

par(mar = c(1, 1, 1, 1))
plot(st_geometry(lines), col = rainbow(nrow(lines)))
plot(points, pch = 20, add = TRUE)

subdivision_1 = stplanr::rnet_breakup_vertices(lines)
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
subdivision_1
#> Simple feature collection with 6 features and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 0 ymin: 0 xmax: 4 ymax: 3
#> CRS:            EPSG:4326
#>     foo                          x
#> 1     a LINESTRING (0 1, 1 1, 3 1)
#> 1.1   a      LINESTRING (3 1, 4 1)
#> 2     b      LINESTRING (3 2, 3 1)
#> 2.1   b      LINESTRING (3 1, 3 0)
#> 3     c LINESTRING (4 3, 4 2, 4 1)
#> 3.1   c      LINESTRING (4 1, 4 0)

plot(st_geometry(subdivision_1), col = rainbow(nrow(subdivision_1)))
plot(points, pch = 20, add = TRUE)

subdivide_lines_v1 = function(x, drop_attrs = FALSE) {
  # Get linestring vertices.
  df = sfc_to_df(st_geometry(x))
  coords = df[names(df) %in% c("x", "y", "z", "m")]
  # Find which of the vertices are boundaries.
  is_startpoint = !duplicated(df[["sfg_id"]])
  is_endpoint = !duplicated(df[["sfg_id"]], fromLast = TRUE)
  is_boundary = is_startpoint | is_endpoint
  # Find which of the vertices occur more than one.
  is_duplicate_desc = duplicated(coords)
  is_duplicate_asc = duplicated(coords, fromLast = TRUE)
  is_multiple = is_duplicate_desc | is_duplicate_asc
  # Split points are those vertices that:
  # --> 1) Have duplicates, and
  # --> 2) Are not boundaries.
  is_split = is_multiple & !is_boundary
  # Duplicate coordinates of split points.
  reps = rep(1L, nrow(coords)) # By default store each point once.
  reps[is_split] = 2L # Store split points twice.
  new_coords = data.frame(lapply(coords, function(i) rep(i, reps)))
  # Update linestring indices.
  # --> First duplicate original indices at each split point.
  # --> Then increment those accordingly at each split point.
  dup_idxs = rep(df$linestring_id, reps)
  incs = integer(nrow(new_coords)) # By default don't increment.
  incs[which(is_split) + 1:sum(is_split)] = 1L # Increment by 1 after each split.
  new_idxs = dup_idxs + cumsum(incs)
  new_coords$linestring_id = new_idxs
  # Rebuild lines.
  new_lines = sfc_linestring(new_coords, linestring_id = "linestring_id")
  st_crs(new_lines) = st_crs(x)
  # Restore original line attributes.
  # Duplicate attributes within splitted lines.
  if (drop_attrs) {
    x_new = st_as_sf(new_lines)
  } else {
    orig_idxs = dup_idxs[!duplicated(new_idxs)]
    x_new = x[orig_idxs, ]
    st_geometry(x_new) = new_lines
  }
  x_new
}

subdivision_2 = subdivide_lines_v1(lines)
subdivision_2
#> Simple feature collection with 6 features and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 0 ymin: 0 xmax: 4 ymax: 3
#> CRS:            EPSG:4326
#>                              x foo
#> 1   LINESTRING (0 1, 1 1, 3 1)   a
#> 1.1      LINESTRING (3 1, 4 1)   a
#> 2        LINESTRING (3 2, 3 1)   b
#> 2.1      LINESTRING (3 1, 3 0)   b
#> 3   LINESTRING (4 3, 4 2, 4 1)   c
#> 3.1      LINESTRING (4 1, 4 0)   c

plot(st_geometry(subdivision_2), col = rainbow(nrow(subdivision_2)))
plot(points, pch = 20, add = TRUE)

subdivision_3 = subdivide_lines_v1(lines, drop_attrs = TRUE)
subdivision_3
#> Simple feature collection with 6 features and 0 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 0 ymin: 0 xmax: 4 ymax: 3
#> CRS:            EPSG:4326
#>                            x
#> 1 LINESTRING (0 1, 1 1, 3 1)
#> 2      LINESTRING (3 1, 4 1)
#> 3      LINESTRING (3 2, 3 1)
#> 4      LINESTRING (3 1, 3 0)
#> 5 LINESTRING (4 3, 4 2, 4 1)
#> 6      LINESTRING (4 1, 4 0)

plot(st_geometry(subdivision_3), col = rainbow(nrow(subdivision_3)))
plot(points, pch = 20, add = TRUE)

subdivide_lines_v2 = function(x, drop_attrs = FALSE) {
  # Get linestring vertices.
  df = sfc_to_df(st_geometry(x))
  coords = df[names(df) %in% c("x", "y", "z", "m")]
  # Find which of the vertices are boundaries.
  is_startpoint = !duplicated(df[["sfg_id"]])
  is_endpoint = !duplicated(df[["sfg_id"]], fromLast = TRUE)
  is_boundary = is_startpoint | is_endpoint
  # Find which of the vertices occur more than one.
  # is_duplicate_desc = duplicated(coords)
  # is_duplicate_asc = duplicated(coords, fromLast = TRUE)
  # is_multiple = is_duplicate_desc | is_duplicate_asc
  is_multiple = lengths(st_equals(sf_cast(x, "POINT"))) > 1
  # Split points are those vertices that:
  # --> 1) Have duplicates, and
  # --> 2) Are not boundaries.
  is_split = is_multiple & !is_boundary
  # Duplicate coordinates of split points.
  reps = rep(1L, nrow(coords)) # By default store each point once.
  reps[is_split] = 2L # Store split points twice.
  new_coords = data.frame(lapply(coords, function(i) rep(i, reps)))
  # Update linestring indices.
  # --> First duplicate original indices at each split point.
  # --> Then increment those accordingly at each split point.
  dup_idxs = rep(df$linestring_id, reps)
  incs = integer(nrow(new_coords)) # By default don't increment.
  incs[which(is_split) + 1:sum(is_split)] = 1L # Increment by 1 after each split.
  new_idxs = dup_idxs + cumsum(incs)
  new_coords$linestring_id = new_idxs
  # Rebuild lines.
  new_lines = sfc_linestring(new_coords, linestring_id = "linestring_id")
  st_crs(new_lines) = st_crs(x)
  # Restore original line attributes.
  # Duplicate attributes within splitted lines.
  if (drop_attrs) {
    x_new = st_as_sf(new_lines)
  } else {
    orig_idxs = dup_idxs[!duplicated(new_idxs)]
    x_new = x[orig_idxs, ]
    st_geometry(x_new) = new_lines
  }
  x_new
}

subdivision_4 = subdivide_lines_v2(lines)
subdivision_4
#> Simple feature collection with 6 features and 1 field
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: 0 ymin: 0 xmax: 4 ymax: 3
#> CRS:            EPSG:4326
#>                              x foo
#> 1   LINESTRING (0 1, 1 1, 3 1)   a
#> 1.1      LINESTRING (3 1, 4 1)   a
#> 2        LINESTRING (3 2, 3 1)   b
#> 2.1      LINESTRING (3 1, 3 0)   b
#> 3   LINESTRING (4 3, 4 2, 4 1)   c
#> 3.1      LINESTRING (4 1, 4 0)   c

plot(st_geometry(subdivision_4), col = rainbow(nrow(subdivision_4)))
plot(points, pch = 20, add = TRUE)

bench::mark(
  check = FALSE,
  rbv = {stplanr::rnet_breakup_vertices(lines)},
  sl1 = {subdivide_lines_v1(lines)},
  sl2 = {subdivide_lines_v2(lines)},
  sl1d = {subdivide_lines_v1(lines, drop_attrs = TRUE)},
  sl2d = {subdivide_lines_v2(lines, drop_attrs = TRUE)},
  time_unit = "ms"
)
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> # A tibble: 5 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 rbv        8.70   9.60       102.   96.88KB     18.4
#> 2 sl1        1.01   1.05       909.   10.61KB     17.1
#> 3 sl2        2.24   2.38       400.   71.24KB     15.3
#> 4 sl1d       0.552  0.575     1629.    5.25KB     17.0
#> 5 sl2d       1.77   1.97       498.   65.88KB     15.2

Created on 2020-12-03 by the reprex package (v0.3.0)

@Robinlovelace
Copy link
Member

Great additional points @luukvdmeer. Just one follow-up in terms of benchmarking, I think @agila5's benchmark on a large real dataset should be used to ensure that it scales and ensure it works for edge cases that can be found on a citywide network. Would be interested to see benchmarks comparing with this starter for 10 from @agila5 in #451:

remotes::install_github("ropensci/stplanr", "6ae27d0b81370fe9b5838fe879cc983db929d55c")

# get data
iow <- osmextract::oe_get("Isle of Wight", quiet = TRUE)
nrow(iow)
#> [1] 45422

# current approach
system.time(({
  iow_breakup <- stplanr::rnet_breakup_vertices(iow)
}))
#>    user  system elapsed 
#>    7.11    0.66    8.73
#> [1] 80957

@luukvdmeer
Copy link

Of course, totally agree!

@luukvdmeer
Copy link

Output is the same and execution times similar. I think also the concept of the implementation is very similar.

Just wanted to share this version without data.table dependency, which I would like to avoid if possible in sfnetworks. Feel free to take ideas from it!

# remotes::install_github("ropensci/stplanr", "6ae27d0b81370fe9b5838fe879cc983db929d55c")

library(stplanr)
library(sf)
#> Linking to GEOS 3.7.1, GDAL 2.2.2, PROJ 4.9.2
library(sfheaders)

subdivide_lines_v1 = function(x, drop_attrs = FALSE) {
  # Get linestring vertices.
  df = sfc_to_df(st_geometry(x))
  coords = df[names(df) %in% c("x", "y", "z", "m")]
  # Find which of the vertices are boundaries.
  is_startpoint = !duplicated(df[["sfg_id"]])
  is_endpoint = !duplicated(df[["sfg_id"]], fromLast = TRUE)
  is_boundary = is_startpoint | is_endpoint
  # Find which of the vertices occur more than one.
  is_duplicate_desc = duplicated(coords)
  is_duplicate_asc = duplicated(coords, fromLast = TRUE)
  is_multiple = is_duplicate_desc | is_duplicate_asc
  # Split points are those vertices that:
  # --> 1) Have duplicates, and
  # --> 2) Are not boundaries.
  is_split = is_multiple & !is_boundary
  # Duplicate coordinates of split points.
  reps = rep(1L, nrow(coords)) # By default store each point once.
  reps[is_split] = 2L # Store split points twice.
  new_coords = data.frame(lapply(coords, function(i) rep(i, reps)))
  # Update linestring indices.
  # --> First duplicate original indices at each split point.
  # --> Then increment those accordingly at each split point.
  dup_idxs = rep(df$linestring_id, reps)
  incs = integer(nrow(new_coords)) # By default don't increment.
  incs[which(is_split) + 1:sum(is_split)] = 1L # Increment by 1 after each split.
  new_idxs = dup_idxs + cumsum(incs)
  new_coords$linestring_id = new_idxs
  # Rebuild lines.
  new_lines = sfc_linestring(new_coords, linestring_id = "linestring_id")
  st_crs(new_lines) = st_crs(x)
  # Restore original line attributes.
  # Duplicate attributes within splitted lines.
  if (drop_attrs) {
    x_new = st_as_sf(new_lines)
  } else {
    orig_idxs = dup_idxs[!duplicated(new_idxs)]
    x_new = x[orig_idxs, ]
    st_geometry(x_new) = new_lines
  }
  x_new
}

iow = osmextract::oe_get("Isle of Wight", quiet = TRUE)
#> Warning: st_crs<- : replacing crs does not reproject data; use st_transform for
#> that
nrow(iow)
#> [1] 45421
system.time({iow_breakup_1 = rnet_breakup_vertices(iow)})
#> Splitting the input object at the internal points that are also boundaries.
#> Splitting rnet object at the duplicated internal points.
#>    user  system elapsed 
#>   3.481   0.052   2.705
system.time({iow_breakup_2 = subdivide_lines_v1(iow)})
#>    user  system elapsed 
#>   2.066   0.000   2.066

iow_breakup_1
#> Simple feature collection with 80955 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -5.401978 ymin: 43.35489 xmax: -0.175775 ymax: 50.89601
#> CRS:            4326
#> First 10 features:
#>    osm_id          name     highway waterway aerialway barrier man_made z_order
#> 1     413      Lane End residential     <NA>      <NA>    <NA>     <NA>       3
#> 2     413      Lane End residential     <NA>      <NA>    <NA>     <NA>       3
#> 3     415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>       6
#> 4     415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>       6
#> 5     415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>       6
#> 6     415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>       6
#> 7     415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>       6
#> 8     415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>       6
#> 9     415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>       6
#> 10    415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>       6
#>                                                                                               other_tags
#> 1                                    "lit"=>"yes","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph"
#> 2                                    "lit"=>"yes","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph"
#> 3  "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 4  "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 5  "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 6  "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 7  "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 8  "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 9  "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 10 "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#>                          geometry
#> 1  LINESTRING (-1.083348 50.68...
#> 2  LINESTRING (-1.083194 50.68...
#> 3  LINESTRING (-1.083348 50.68...
#> 4  LINESTRING (-1.084021 50.68...
#> 5  LINESTRING (-1.084318 50.68...
#> 6  LINESTRING (-1.084632 50.68...
#> 7  LINESTRING (-1.084889 50.68...
#> 8  LINESTRING (-1.085483 50.68...
#> 9  LINESTRING (-1.086317 50.68...
#> 10 LINESTRING (-1.086771 50.68...
iow_breakup_2
#> Simple feature collection with 80955 features and 9 fields
#> geometry type:  LINESTRING
#> dimension:      XY
#> bbox:           xmin: -5.401978 ymin: 43.35489 xmax: -0.175775 ymax: 50.89601
#> CRS:            4326
#> First 10 features:
#>     osm_id          name     highway waterway aerialway barrier man_made
#> 1      413      Lane End residential     <NA>      <NA>    <NA>     <NA>
#> 1.1    413      Lane End residential     <NA>      <NA>    <NA>     <NA>
#> 2      415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>
#> 2.1    415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>
#> 2.2    415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>
#> 2.3    415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>
#> 2.4    415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>
#> 2.5    415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>
#> 2.6    415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>
#> 2.7    415 Foreland Road   secondary     <NA>      <NA>    <NA>     <NA>
#>     z_order
#> 1         3
#> 1.1       3
#> 2         6
#> 2.1       6
#> 2.2       6
#> 2.3       6
#> 2.4       6
#> 2.5       6
#> 2.6       6
#> 2.7       6
#>                                                                                                other_tags
#> 1                                     "lit"=>"yes","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph"
#> 1.1                                   "lit"=>"yes","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph"
#> 2   "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 2.1 "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 2.2 "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 2.3 "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 2.4 "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 2.5 "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 2.6 "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#> 2.7 "lit"=>"yes","ref"=>"B3395","lanes"=>"2","surface"=>"asphalt","maxspeed"=>"30 mph","sidewalk"=>"both"
#>                           geometry
#> 1   LINESTRING (-1.083348 50.68...
#> 1.1 LINESTRING (-1.083194 50.68...
#> 2   LINESTRING (-1.083348 50.68...
#> 2.1 LINESTRING (-1.084021 50.68...
#> 2.2 LINESTRING (-1.084318 50.68...
#> 2.3 LINESTRING (-1.084632 50.68...
#> 2.4 LINESTRING (-1.084889 50.68...
#> 2.5 LINESTRING (-1.085483 50.68...
#> 2.6 LINESTRING (-1.086317 50.68...
#> 2.7 LINESTRING (-1.086771 50.68...

Created on 2020-12-03 by the reprex package (v0.3.0)

@Robinlovelace
Copy link
Member

Very interesting @luukvdmeer, seems they are pretty close. Both more than fast enough to close this issue. Reduced data.table dependency is definite advantage of your approach 👍

From what I can tell your implementation not only has fewer dependencies but is 30% faster.

2.705 / 2.066
#> [1] 1.309293

Created on 2020-12-03 by the reprex package (v0.3.0)

@agila5
Copy link
Collaborator Author

agila5 commented Dec 3, 2020

Hi everyone! I finally read all the messages and I think that the conclusion is that we should prefer @luukvdmeer's approach. IMO the code is quite clear (even if I think there should be more comments in the final version of the function). Thanks for helping us here.

I tested both implementations using a variety of road networks (namely Isle of Wight with approx 45k edges, West Yorkshire with approximately 180k edges and Greater London with approximately 500k edges) and, in all cases, subdivide_lines_v1 outperformed my implementation with identical results (which makes my both happy and extremely frustrated since I spent way too many weekends on this project 😆). Anyway, if you agree and you give permission, I think we should create a PR both here and in sfnetworks repo that adds this new approach.

@luukvdmeer I think that you could use the same function to create a new morpher for luukvdmeer/sfnetworks#73 and refactor to_spatial_dense_graph() (see here). Happy to work on a PR after you add the new function.

Main point I wanted to make is that when a new boundary_vertices function like this is incorporated, I'd suggest it would be important to examine whether the underlying data retain OSM IDs as rownames of the sf geometries, and to use those if so via a function like boundary_points_by_id, and only otherwise switch to using coordinates. (And even then, the coordinates version could easily be C++-ed for extra gains - let me know, and I'll be happy to PR some code.)

Hi @mpadge! I agree with the suggestion. IMO that's not a priority (see also the next comment) but I still think that it's a good idea and I would also like to read (and maybe understand) the C++ implementation.

One more thought: All of the code using vertex coordinates relies on applying equality operations to floating point numbers (mostly via unique and duplicated calls), which is of course a strict no-no in any language. The good news is that you could simply go ahead and implement as is, then I could convert to the C++ version and extend to use of coordinates rather than (OSM) vertex IDs by adding a single line to construct vertex IDs by rounding floating point coordinates to specified precision, then submitting to same routine regardless. The routine would then - appropriately and necessarily - have an additional tolerance parameter to define numeric equality.

Thank you very much for pointing it out! Unfortunately, my experience with these problems is limited to FAQ 7.31. Looking at the code behind subdivide_lines_v1 I think that the main weak point is here: is_duplicate_desc = duplicated(coords), since it compares the coordinates of the points, right? So the C++ construct function would replace that line of code, right? I think the other use of duplicated, i.e. is_startpoint = !duplicated(df[["sfg_id"]]), is not a problem since the IDs are just integers from 1 to n. Looking forward to the implementation! I would prefer using @mpadge proposal instead of sf::st_equals just to learn a bit more about the C++ implementation.

btw: I wouldn't bother rewriting the whole function in C++ since it runs in less than 60 seconds considering a network of 500,000 edges.

I think that's all. Feel free to add any question.

Robinlovelace pushed a commit that referenced this issue Dec 8, 2020
* First ideas for #416

* fix problem in the interaction between route() and data.table created after setting  .datatable.aware = TRUE

* Second round

* Improve rnet_breakup

* fix problem with CRS in rnet_breakup
@agila5
Copy link
Collaborator Author

agila5 commented Dec 8, 2020

Thank you very much to all the people that joined the discussion. Looking forward to the next steps.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants