Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor rnet_breakup_vertices #451

Merged
merged 6 commits into from
Dec 8, 2020

Conversation

agila5
Copy link
Collaborator

@agila5 agila5 commented Nov 29, 2020

Hi! This is just the first draft and I will add more docs/examples/tests tomorrow (or in the free time ASAP). ATM I have just one question. I had to move data.table and sfheaders from Suggests to Imports, any objection?

Also please check the second commit, related to route().

Copy link
Member

@Robinlovelace Robinlovelace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good first pass, look forward to seeing benchmarks, I've suggested a few changes. Thanks @agila5 !

R/rnet-clean.R Outdated
rnet_clean
# step 3 - Find duplicated internal points
internal_points <- data.table::data.table(
sfheaders::sf_to_df(line2vertices(rnet))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think line2vertices is needed here - it's really slow. Replace with:

     sfheaders::sf_to_df(rnet)

I suggest.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that's true since line2vertices() is used to extract the "internal" points, i.e. those points that are part of the linestring(s) but do not lie in the boundaries. It's like the complementary of rnet_boundary_points(). I could rewrite also line2vertices using the sfheaders approach but, IMO, at the moment it's not worth it.

Copy link
Member

@Robinlovelace Robinlovelace Nov 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Aha yes, I see you refactor that function:

https://github.com/ropensci/stplanr/blob/master/R/od-funs.R#L456-L490

However... @agila5 you can get the internal and boundary points with a single operation as shown here, which I think is faster, so suggest calling sfheaders::sf_to_df()/ sf::st_coordinates() (directly or indirectly) only once:

coordinates <- sfheaders::sf_to_df(rnet)
first_pair <- !duplicated(coordinates[["sfg_id"]])
last_pair <- !duplicated(coordinates[["sfg_id"]], fromLast = TRUE)
idxs <- first_pair | last_pair
pairs <- unique(coordinates[idxs, c("x", "y")])
i <- coordinates[!idxs, c("x", "y")]

The i result here is the same as line2vertices() I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

However... @agila5 you can get the internal and boundary points with a single operation as shown here, which I think is faster, so suggest calling sfheaders::sf_to_df()/ sf::st_coordinates() (directly or indirectly) only once

Yes, you are right, I will change that part!

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be fixed

R/rnet-clean.R Outdated Show resolved Hide resolved
R/rnet-clean.R Outdated Show resolved Hide resolved
@agila5
Copy link
Collaborator Author

agila5 commented Dec 2, 2020

Some simple benchmarks. Current approach:

# install stplanr
remotes::install_github("ropensci/stplanr")

# get data
iow <- osmextract::oe_get("Isle of Wight", quiet = TRUE)
nrow(iow)
#> [1] 45422

# current approach
system.time(({
  iow_breakup <- stplanr::rnet_breakup_vertices(iow)
}))
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#>    user  system elapsed 
#>  210.87    4.30  247.69
nrow(iow_breakup)
#> [1] 80944

Created on 2020-12-02 by the reprex package (v0.3.0)

New approach:

# install stplanr
remotes::install_github("ropensci/stplanr", "6ae27d0b81370fe9b5838fe879cc983db929d55c")

# get data
iow <- osmextract::oe_get("Isle of Wight", quiet = TRUE)
nrow(iow)
#> [1] 45422

# current approach
system.time(({
  iow_breakup <- stplanr::rnet_breakup_vertices(iow)
}))
#> Splitting the input object at the internal points that are also boundaries.
#> Splitting rnet object at the duplicated internal points.
#>    user  system elapsed 
#>    7.11    0.66    8.73
nrow(iow_breakup)
#> [1] 80957

Created on 2020-12-02 by the reprex package (v0.3.0)

The outputs are not exactly identical for the reasons documented also in #416 (comment) (i.e. linestrings that intersect themselves aren't split with the current approach, so it's not surprising that the new approach returns more lines). I graphically compared the examples reported in the help page, and they look identical.

TODOs

  • Explore Z/M dimensions and document changes
  • Test with sfc objects
  • Add some tests at the beginning of my_st_split (since it doesn't make any sense to force a data.table input) and write docs for my_st_split (and maybe a new name 😅 )

I plan to finish the PR on Saturday.

@Robinlovelace
Copy link
Member

247.69/8.73
#> [1] 28.37228

Created on 2020-12-03 by the reprex package (v0.3.0)

You just made my pkg 30 times faster 🤯 🎉 🎉 🎉

@Robinlovelace
Copy link
Member

Great work @agila5. All your suggestions look good, look forward to Sunday 👍 Only comment for now:

Add some tests at the beginning of my_st_split (since it doesn't make any sense to force a data.table input) and write docs for my_st_split (and maybe a new name sweat_smile )

How about fst_split()?

@agila5
Copy link
Collaborator Author

agila5 commented Dec 3, 2020

Unfortunately (well, not really), I think this now outdated.

@Robinlovelace
Copy link
Member

You still got it 30 times faster, which may have inspired the other approach : )

@agila5
Copy link
Collaborator Author

agila5 commented Dec 8, 2020

Fix #416. Next step: wait for @mpadge's comments and Rcpp approach (no time pressure!) and add some tests.

@agila5 agila5 marked this pull request as ready for review December 8, 2020 10:39
@Robinlovelace
Copy link
Member

Looking good @agila5, ready to test/merge I guess 🎉

@agila5
Copy link
Collaborator Author

agila5 commented Dec 8, 2020

Looking good @agila5, ready to test/merge I guess 🎉

IMO, yes it's ready to be merged. Clearly feel free to test it locally if you want. I think the only failing test is unrelated to the PR.

@Robinlovelace
Copy link
Member

Hi @agila5 I've tested it with the following reproducible code:

# Aim: test rnet_breakup_vertices

remotes::install_github("ropensci/stplanr")
library(stplanr)

rnet = osm_net_example
system.time(({
  rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))

nrow(rnet_processed1)
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])

remotes::install_github("agila5/stplanr", "refactor_rnet_breakup")
library(stplanr)

rnet = osm_net_example
system.time(({
  rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))

nrow(rnet_processed1)
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])

I think there is a small problem.

@Robinlovelace
Copy link
Member

Result of first test with old function:

remotes::install_github("ropensci/stplanr")
#> Skipping install of 'stplanr' from a github remote, the SHA1 (784f64dd) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(stplanr)

rnet = osm_net_example
system.time(({
  rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 7.0.0
#> Splitting rnet object at the shared boundary points.
#> Splitting rnet object at the shared internal points.
#>    user  system elapsed 
#>   0.189   0.004   0.196

nrow(rnet_processed1)
#> [1] 129
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])

Created on 2020-12-08 by the reprex package (v0.3.0)

@Robinlovelace
Copy link
Member

Result with new function:

remotes::install_github("agila5/stplanr", "refactor_rnet_breakup")
#> Skipping install of 'stplanr' from a github remote, the SHA1 (463ff935) has not changed since last install.
#>   Use `force = TRUE` to force installation
library(stplanr)

rnet = osm_net_example
system.time(({
rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
#>    user  system elapsed 
#>   0.199   0.012   0.212

nrow(rnet_processed1)
#> [1] 129
rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])

Created on 2020-12-08 by the reprex package (v0.3.0)

Copy link
Member

@Robinlovelace Robinlovelace left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: it seems the CRS hasn't been copied.

R/rnet-clean.R Outdated Show resolved Hide resolved
R/rnet-clean.R Outdated Show resolved Hide resolved
@agila5
Copy link
Collaborator Author

agila5 commented Dec 8, 2020

Fixed! It was just a typo at the end.

# updates
remotes::install_github("ropensci/stplanr", "286a667f5e04ad102b070cd0d91ae836bf8e4905")
#> Using github PAT from envvar GITHUB_PAT
#> Skipping install of 'stplanr' from a github remote, the SHA1 (286a667f) has not changed since last install.
#>   Use `force = TRUE` to force installation

# packages
library(stplanr)

# data
rnet = osm_net_example
system.time(({
  rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
}))
#>    user  system elapsed 
#>    0.20    0.05    0.33

rnet_processed1$n = seq(nrow(rnet_processed1))
mapview::mapview(rnet_processed1["n"])

Created on 2020-12-08 by the reprex package (v0.3.0)

@Robinlovelace
Copy link
Member

Final test:

> nrow(rnet)
[1] 45464
> system.time(({
+   rnet_processed1 <- stplanr::rnet_breakup_vertices(rnet)
+ }))
   user  system elapsed 
  1.641   0.008   1.481 
> nrow(rnet_processed1)
[1] 81040

@Robinlovelace Robinlovelace merged commit f543fa4 into ropensci:master Dec 8, 2020
@mpadge
Copy link
Member

mpadge commented Dec 8, 2020

I'll bury this here just for a comparative reference, starting with modified C++ code now that I see how you're approaching the re-assembly to linestring objects:

#include <Rcpp.h>

// [[Rcpp::export]]
Rcpp::DataFrame getids (Rcpp::List g) {
    const int n = g.size ();

    int npoints = 0;

    std::unordered_set <std::string> endset, midset, middupset;
    Rcpp::List mids (n);
    for (Rcpp::NumericMatrix i: g) {
        Rcpp::List dimnames = i.attr ("dimnames");
        Rcpp::CharacterVector rownames = dimnames (0);

        npoints += rownames.size ();

        endset.emplace (rownames (0));
        endset.emplace (rownames (rownames.size () - 1));

        rownames.erase (0);
        rownames.erase (rownames.size () - 1);
        for (auto j: rownames)
        {
            std::string js = Rcpp::as <std::string> (j);
            if (midset.find (js) != midset.end ())
                middupset.emplace (j);
            midset.emplace (js);
        }
    }

    std::unordered_set <std::string> mid_is_end;
    for (auto i: midset) {
        if (endset.find (i) != endset.end ())
            mid_is_end.emplace (i);
    }

    // work out how many points in total:
    int index = 0;
    for (Rcpp::NumericMatrix gi: g) {
        Rcpp::List dimnames = gi.attr ("dimnames");
        Rcpp::CharacterVector rownames = dimnames (0);
        for (int j = 0; j < rownames.size (); j++)
        {
            std::string js = Rcpp::as <std::string> (rownames (j));
            index++;
            if (mid_is_end.find (js) != mid_is_end.end ())
                index++;
        }
    }

    std::vector <int> geom_num_vec (index);
    std::vector <double> x (index), y (index);

    size_t geom_num = 1;
    index = 0; // index into output vectors
    for (Rcpp::NumericMatrix gi: g) {
        Rcpp::List dimnames = gi.attr ("dimnames");
        Rcpp::CharacterVector rownames = dimnames (0);
        for (int j = 0; j < gi.nrow (); j++)
        {
            std::string js = Rcpp::as <std::string> (rownames (j));
            geom_num_vec [index] = geom_num;
            x [index] = gi (j, 0);
            y [index] = gi (j, 1);
            index++;
            if (mid_is_end.find (js) != mid_is_end.end ())
            {
                geom_num_vec [index] = geom_num++;
                x [index] = gi (j, 0);
                y [index] = gi (j, 1);
                index++;
            }
        }
        geom_num++;
    }

    Rcpp::DataFrame res =
        Rcpp::DataFrame::create (
                Rcpp::Named ("x") = x,
                Rcpp::Named ("y") = y,
                Rcpp::Named ("geom") = geom_num_vec,
                Rcpp::_["stringsAsFactors"] = false);

    return res;
    }

Saving that file as "breakup.cpp" then gives the following:

library (stplanr)
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.2
library(Rcpp)
sourceCpp ("breakup.cpp")

breakup_cpp <- function (rnet) {
    dat <- getids (rnet$geometry)

    new_linestring <- sfheaders::sfc_linestring(
        dat,
        linestring_id = "geom"
    )
    new_linestring_sfc <- sf::st_sfc(
        new_linestring,
        crs = sf::st_crs(rnet),
        precision = sf::st_precision(rnet)
    )
    return (new_linestring_sfc)
}

fname <- "ms-streetnet.Rds"
if (!file.exists (fname)) {
    library (dodgr)
    net <- dodgr_streetnet ("Münster Germany")
    saveRDS (net, file = fname)
} else {
    rnet <- readRDS (fname)
}

bench::mark(check = FALSE,
            stplr = {rnet_breakup_vertices (rnet)},
            rcid = {breakup_cpp(rnet)},
            time_unit = "ms")
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 x 6
#>   expression   min median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <dbl>  <dbl>     <dbl> <bch:byt>    <dbl>
#> 1 stplr      2847.  2847.     0.351        NA     5.97
#> 2 rcid       1214.  1214.     0.824        NA     2.47

nrow (rnet_breakup_vertices (rnet)); length (breakup_cpp (rnet))
#> [1] 84438
#> [1] 92022

Created on 2020-12-08 by the reprex package (v0.3.0)

The final numbers are the crucial point: Using OSM IDs to ascertain whether vertices are identical yields more geoms than using coordinates, because the latter invariably rounds some values to be numerically identical even though the vertices are actually different. Those rounding error conflate non-identical vertices and ultimately give you fewer unique geometries than what there rightfully are. Happy to PR that code if you like

@Robinlovelace
Copy link
Member

Many thanks for this faster and seemingly more accurate implementation.

Happy to PR that code if you like

That would be very much appreciated. Would be good to identify exactly what is different, including identifying specific linestrings that are split in the C++ implementation but not in the R one.

@mpadge
Copy link
Member

mpadge commented Dec 8, 2020

identifying specific linestrings that are split in the C++ implementation but not in the R one.

Yeah, that's obviously important. I don't have time to look further at the mo, but feel free to explore a bit. How about I open a PR and we can start examining there? Or would a new issue be better for future reference?

@Robinlovelace
Copy link
Member

Yeah, that's obviously important. I don't have time to look further at the mo, but feel free to explore a bit. How about I open a PR and we can start examining there? Or would a new issue be better for future reference?

Both would be awesome: the issue being that the current implementation misses breaks and the PR fixing it I guess. Very grateful for your input on this!

@agila5
Copy link
Collaborator Author

agila5 commented Dec 8, 2020

Hi @mpadge and, again, thank you very much. I plan to compare the two implementations in a few days (friday or weekend, probably).

@mpadge
Copy link
Member

mpadge commented Dec 8, 2020

Cool, thanks Andrea - i'll wait to hear back from you first then, and we can proceed after that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants