-
Notifications
You must be signed in to change notification settings - Fork 66
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor rnet_breakup_vertices #451
Conversation
… after setting .datatable.aware = TRUE
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good first pass, look forward to seeing benchmarks, I've suggested a few changes. Thanks @agila5 !
R/rnet-clean.R
Outdated
rnet_clean | ||
# step 3 - Find duplicated internal points | ||
internal_points <- data.table::data.table( | ||
sfheaders::sf_to_df(line2vertices(rnet)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think line2vertices
is needed here - it's really slow. Replace with:
sfheaders::sf_to_df(rnet)
I suggest.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that's true since line2vertices()
is used to extract the "internal" points, i.e. those points that are part of the linestring(s) but do not lie in the boundaries. It's like the complementary of rnet_boundary_points()
. I could rewrite also line2vertices using the sfheaders approach but, IMO, at the moment it's not worth it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Aha yes, I see you refactor that function:
https://github.com/ropensci/stplanr/blob/master/R/od-funs.R#L456-L490
However... @agila5 you can get the internal and boundary points with a single operation as shown here, which I think is faster, so suggest calling sfheaders::sf_to_df()
/ sf::st_coordinates()
(directly or indirectly) only once:
stplanr/data-raw/ad-hoc-tests/test-rnet-breakup-timing1.R
Lines 33 to 38 in 8f055db
coordinates <- sfheaders::sf_to_df(rnet) | |
first_pair <- !duplicated(coordinates[["sfg_id"]]) | |
last_pair <- !duplicated(coordinates[["sfg_id"]], fromLast = TRUE) | |
idxs <- first_pair | last_pair | |
pairs <- unique(coordinates[idxs, c("x", "y")]) | |
i <- coordinates[!idxs, c("x", "y")] |
The i
result here is the same as line2vertices()
I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
However... @agila5 you can get the internal and boundary points with a single operation as shown here, which I think is faster, so suggest calling sfheaders::sf_to_df()/ sf::st_coordinates() (directly or indirectly) only once
Yes, you are right, I will change that part!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be fixed
Some simple benchmarks. Current approach: # install stplanr
remotes::install_github("ropensci/stplanr")
# get data
iow <- osmextract::oe_get("Isle of Wight", quiet = TRUE)
nrow(iow)
#> [1] 45422
# current approach
system.time(({
iow_breakup <- stplanr::rnet_breakup_vertices(iow)
}))
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> user system elapsed
#> 210.87 4.30 247.69
nrow(iow_breakup)
#> [1] 80944 Created on 2020-12-02 by the reprex package (v0.3.0) New approach: # install stplanr
remotes::install_github("ropensci/stplanr", "6ae27d0b81370fe9b5838fe879cc983db929d55c")
# get data
iow <- osmextract::oe_get("Isle of Wight", quiet = TRUE)
nrow(iow)
#> [1] 45422
# current approach
system.time(({
iow_breakup <- stplanr::rnet_breakup_vertices(iow)
}))
#> Splitting the input object at the internal points that are also boundaries.
#> Splitting rnet object at the duplicated internal points.
#> user system elapsed
#> 7.11 0.66 8.73
nrow(iow_breakup)
#> [1] 80957 Created on 2020-12-02 by the reprex package (v0.3.0) The outputs are not exactly identical for the reasons documented also in #416 (comment) (i.e. linestrings that intersect themselves aren't split with the current approach, so it's not surprising that the new approach returns more lines). I graphically compared the examples reported in the help page, and they look identical. TODOs
I plan to finish the PR on Saturday. |
247.69/8.73
#> [1] 28.37228 Created on 2020-12-03 by the reprex package (v0.3.0) You just made my pkg 30 times faster 🤯 🎉 🎉 🎉 |
Great work @agila5. All your suggestions look good, look forward to Sunday 👍 Only comment for now:
How about |
Unfortunately (well, not really), I think this now outdated. |
You still got it 30 times faster, which may have inspired the other approach : ) |
Looking good @agila5, ready to test/merge I guess 🎉 |
IMO, yes it's ready to be merged. Clearly feel free to test it locally if you want. I think the only failing test is unrelated to the PR. |
Hi @agila5 I've tested it with the following reproducible code: # Aim: test rnet_breakup_vertices
remotes::install_github("ropensci/stplanr")
library(stplanr)
rnet = osm_net_example
system.time(({
rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
nrow(rnet_processed1)
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])
remotes::install_github("agila5/stplanr", "refactor_rnet_breakup")
library(stplanr)
rnet = osm_net_example
system.time(({
rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
nrow(rnet_processed1)
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"]) I think there is a small problem. |
Result of first test with old function: remotes::install_github("ropensci/stplanr")
#> Skipping install of 'stplanr' from a github remote, the SHA1 (784f64dd) has not changed since last install.
#> Use `force = TRUE` to force installation
library(stplanr)
rnet = osm_net_example
system.time(({
rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#> user system elapsed
#> 0.189 0.004 0.196
nrow(rnet_processed1)
#> [1] 129
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"]) Created on 2020-12-08 by the reprex package (v0.3.0) |
Result with new function: remotes::install_github("agila5/stplanr", "refactor_rnet_breakup")
#> Skipping install of 'stplanr' from a github remote, the SHA1 (463ff935) has not changed since last install.
#> Use `force = TRUE` to force installation
library(stplanr)
rnet = osm_net_example
system.time(({
rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
#> user system elapsed
#> 0.199 0.012 0.212
nrow(rnet_processed1)
#> [1] 129
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"]) Created on 2020-12-08 by the reprex package (v0.3.0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Issue: it seems the CRS hasn't been copied.
Fixed! It was just a typo at the end. # updates
remotes::install_github("ropensci/stplanr", "286a667f5e04ad102b070cd0d91ae836bf8e4905")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'stplanr' from a github remote, the SHA1 (286a667f) has not changed since last install.
#> Use `force = TRUE` to force installation
# packages
library(stplanr)
# data
rnet = osm_net_example
system.time(({
rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
#> user system elapsed
#> 0.20 0.05 0.33
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"]) Created on 2020-12-08 by the reprex package (v0.3.0) |
Final test:
|
I'll bury this here just for a comparative reference, starting with modified C++ code now that I see how you're approaching the re-assembly to linestring objects: #include <Rcpp.h>
// [[Rcpp::export]]
Rcpp::DataFrame getids (Rcpp::List g) {
const int n = g.size ();
int npoints = 0;
std::unordered_set <std::string> endset, midset, middupset;
Rcpp::List mids (n);
for (Rcpp::NumericMatrix i: g) {
Rcpp::List dimnames = i.attr ("dimnames");
Rcpp::CharacterVector rownames = dimnames (0);
npoints += rownames.size ();
endset.emplace (rownames (0));
endset.emplace (rownames (rownames.size () - 1));
rownames.erase (0);
rownames.erase (rownames.size () - 1);
for (auto j: rownames)
{
std::string js = Rcpp::as <std::string> (j);
if (midset.find (js) != midset.end ())
middupset.emplace (j);
midset.emplace (js);
}
}
std::unordered_set <std::string> mid_is_end;
for (auto i: midset) {
if (endset.find (i) != endset.end ())
mid_is_end.emplace (i);
}
// work out how many points in total:
int index = 0;
for (Rcpp::NumericMatrix gi: g) {
Rcpp::List dimnames = gi.attr ("dimnames");
Rcpp::CharacterVector rownames = dimnames (0);
for (int j = 0; j < rownames.size (); j++)
{
std::string js = Rcpp::as <std::string> (rownames (j));
index++;
if (mid_is_end.find (js) != mid_is_end.end ())
index++;
}
}
std::vector <int> geom_num_vec (index);
std::vector <double> x (index), y (index);
size_t geom_num = 1;
index = 0; // index into output vectors
for (Rcpp::NumericMatrix gi: g) {
Rcpp::List dimnames = gi.attr ("dimnames");
Rcpp::CharacterVector rownames = dimnames (0);
for (int j = 0; j < gi.nrow (); j++)
{
std::string js = Rcpp::as <std::string> (rownames (j));
geom_num_vec [index] = geom_num;
x [index] = gi (j, 0);
y [index] = gi (j, 1);
index++;
if (mid_is_end.find (js) != mid_is_end.end ())
{
geom_num_vec [index] = geom_num++;
x [index] = gi (j, 0);
y [index] = gi (j, 1);
index++;
}
}
geom_num++;
}
Rcpp::DataFrame res =
Rcpp::DataFrame::create (
Rcpp::Named ("x") = x,
Rcpp::Named ("y") = y,
Rcpp::Named ("geom") = geom_num_vec,
Rcpp::_["stringsAsFactors"] = false);
return res;
} Saving that file as "breakup.cpp" then gives the following: library (stplanr)
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.2
library(Rcpp)
sourceCpp ("breakup.cpp")
breakup_cpp <- function (rnet) {
dat <- getids (rnet$geometry)
new_linestring <- sfheaders::sfc_linestring(
dat,
linestring_id = "geom"
)
new_linestring_sfc <- sf::st_sfc(
new_linestring,
crs = sf::st_crs(rnet),
precision = sf::st_precision(rnet)
)
return (new_linestring_sfc)
}
fname <- "ms-streetnet.Rds"
if (!file.exists (fname)) {
library (dodgr)
net <- dodgr_streetnet ("Münster Germany")
saveRDS (net, file = fname)
} else {
rnet <- readRDS (fname)
}
bench::mark(check = FALSE,
stplr = {rnet_breakup_vertices (rnet)},
rcid = {breakup_cpp(rnet)},
time_unit = "ms")
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#> expression min median `itr/sec` mem_alloc `gc/sec`
#> <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl>
#> 1 stplr 2847. 2847. 0.351 NA 5.97
#> 2 rcid 1214. 1214. 0.824 NA 2.47
nrow (rnet_breakup_vertices (rnet)); length (breakup_cpp (rnet))
#> [1] 84438
#> [1] 92022 Created on 2020-12-08 by the reprex package (v0.3.0) The final numbers are the crucial point: Using OSM IDs to ascertain whether vertices are identical yields more geoms than using coordinates, because the latter invariably rounds some values to be numerically identical even though the vertices are actually different. Those rounding error conflate non-identical vertices and ultimately give you fewer unique geometries than what there rightfully are. Happy to PR that code if you like |
Many thanks for this faster and seemingly more accurate implementation.
That would be very much appreciated. Would be good to identify exactly what is different, including identifying specific linestrings that are split in the C++ implementation but not in the R one. |
Yeah, that's obviously important. I don't have time to look further at the mo, but feel free to explore a bit. How about I open a PR and we can start examining there? Or would a new issue be better for future reference? |
Both would be awesome: the issue being that the current implementation misses breaks and the PR fixing it I guess. Very grateful for your input on this! |
Hi @mpadge and, again, thank you very much. I plan to compare the two implementations in a few days (friday or weekend, probably). |
Cool, thanks Andrea - i'll wait to hear back from you first then, and we can proceed after that. |
Hi! This is just the first draft and I will add more docs/examples/tests tomorrow (or in the free time ASAP). ATM I have just one question. I had to move data.table and sfheaders from Suggests to Imports, any objection?
Also please check the second commit, related to
route()
.