Skip to content

Commit

Permalink
Use geojson's derived deserialization for simpler code and guaranteed…
Browse files Browse the repository at this point in the history
… line-strings, thanks to michaelkirk

Unfortunately this increases the parsing time from 11s to 26s
  • Loading branch information
dabreegster committed Apr 1, 2023
1 parent 512a923 commit 59a7007
Show file tree
Hide file tree
Showing 3 changed files with 56 additions and 70 deletions.
33 changes: 14 additions & 19 deletions rust/cli/src/main.rs
Original file line number Diff line number Diff line change
Expand Up @@ -5,7 +5,7 @@ use geo::GeodesicLength;
use geojson::{Feature, FeatureCollection, GeoJson};
use rayon::prelude::*;

use overline::{feature_to_line_string, overline, Output};
use overline::{overline, Input, Output};

fn main() -> Result<()> {
let args: Vec<String> = std::env::args().collect();
Expand All @@ -15,12 +15,9 @@ fn main() -> Result<()> {

println!("Reading and deserializing {}", args[1]);
let mut now = Instant::now();
let geojson: GeoJson = std::fs::read_to_string(&args[1])?.parse()?;
let input: Vec<Feature> = if let GeoJson::FeatureCollection(collection) = geojson {
collection.features
} else {
bail!("Input isn't a FeatureCollection");
};
let input: Vec<Input> = geojson::de::deserialize_feature_collection_str_to_vec(
&std::fs::read_to_string(&args[1])?,
)?;
println!("... took {:?}", now.elapsed());

println!("Running overline on {} line-strings", input.len());
Expand Down Expand Up @@ -50,19 +47,17 @@ fn main() -> Result<()> {
],
)?;
} else if args[1] == "tests/atip_input.geojson" {
fn foot(f: &Feature) -> f64 {
f.property("foot").unwrap().as_f64().unwrap()
fn foot(f: &Input) -> f64 {
f.properties.get("foot").unwrap().as_f64().unwrap()
}

println!("Input:");
for (idx, line) in input.iter().enumerate() {
if let Some(geom) = feature_to_line_string(line) {
println!(
"- {idx} has foot={}, length={}",
foot(line),
geom.geodesic_length()
);
}
println!(
"- {idx} has foot={}, length={}",
foot(line),
line.geometry.geodesic_length()
);
}
println!("Output:");
for line in &output {
Expand All @@ -88,7 +83,7 @@ fn main() -> Result<()> {
}

fn aggregate_and_write(
input: Vec<Feature>,
input: Vec<Input>,
output: Vec<Output>,
properties: Vec<(String, Aggregation)>,
) -> Result<()> {

This comment has been minimized.

Copy link
@michaelkirk

michaelkirk Apr 2, 2023

FYI there is serialize support for custom structs too if you wanted to try to apply similar changes to this code.

Expand Down Expand Up @@ -124,7 +119,7 @@ enum Aggregation {
}

fn aggregate_properties(
input: &Vec<Feature>,
input: &Vec<Input>,
grouped_indices: &Vec<Output>,
properties: Vec<(String, Aggregation)>,
) -> Vec<Feature> {
Expand All @@ -149,7 +144,7 @@ fn aggregate_properties(
let mut values = grouped
.indices
.iter()
.flat_map(|i| input[*i].property(&key));
.flat_map(|i| input[*i].properties.get(key));
match aggregation {
Aggregation::KeepAny => {
if let Some(value) = values.next() {
Expand Down
77 changes: 37 additions & 40 deletions rust/overline/src/lib.rs
Original file line number Diff line number Diff line change
@@ -1,13 +1,20 @@
use std::collections::HashMap;

use geojson::Feature;
use ordered_float::NotNan;
use rayon::prelude::*;
use serde::{Deserialize, Serialize};

// TODO Never aggregate across OSM ID threshold. Plumb through an optional property to restrict
// aggregation.

#[derive(Deserialize)]
pub struct Input {
#[serde(deserialize_with = "geojson::de::deserialize_geometry")]
pub geometry: geo::LineString<f64>,
#[serde(flatten)]
pub properties: geojson::JsonObject,
}

#[derive(Debug, PartialEq, Serialize, Deserialize)]
pub struct Output {
#[serde(
Expand All @@ -22,14 +29,12 @@ pub struct Output {
/// Ignores anything aside from LineStrings. Returns LineStrings chopped up to remove overlap, with
/// exactly one property -- `indices`, an array of numbers indexing the input that share that
/// geometry.
pub fn overline(input: &Vec<Feature>) -> Vec<Output> {
pub fn overline(input: &Vec<Input>) -> Vec<Output> {
// For every individual (directed) line segment, record the index of inputs matching there
let mut line_segments: HashMap<(HashedPoint, HashedPoint), Vec<usize>> = HashMap::new();
for (idx, input) in input.iter().enumerate() {
if let Some(geom) = feature_to_line_string(input) {
for line in geom.lines().map(HashedPoint::new_line) {
line_segments.entry(line).or_insert_with(Vec::new).push(idx);
}
for line in input.geometry.lines().map(HashedPoint::new_line) {
line_segments.entry(line).or_insert_with(Vec::new).push(idx);
}
}
// TODO We also need to split some line segments if they're not matching up at existing points.
Expand All @@ -46,37 +51,35 @@ pub fn overline(input: &Vec<Feature>) -> Vec<Output> {
let mut indices_so_far = Vec::new();
let mut keep_this_output = false;

if let Some(geom) = feature_to_line_string(input_feature) {
for line in geom.lines() {
// The segment is guaranteed to exist
let indices = &line_segments[&HashedPoint::new_line(line)];

if &indices_so_far == indices {
assert_eq!(*pts_so_far.last().unwrap(), line.start);
pts_so_far.push(line.end);
continue;
} else if !pts_so_far.is_empty() {
// The overlap ends here
let add = Output {
geometry: std::mem::take(&mut pts_so_far).into(),
indices: std::mem::take(&mut indices_so_far),
};
if keep_this_output {
intermediate_output.push(add);
}
// Reset below
}
for line in input_feature.geometry.lines() {
// The segment is guaranteed to exist
let indices = &line_segments[&HashedPoint::new_line(line)];

assert!(pts_so_far.is_empty());
pts_so_far.push(line.start);
if &indices_so_far == indices {
assert_eq!(*pts_so_far.last().unwrap(), line.start);
pts_so_far.push(line.end);
indices_so_far = indices.clone();
// Say we're processing input 2, and we have a segment with indices [2, 5]. We want to
// add it to output. But later we'll work on input 5 and see the same segment with
// indices [2, 5]. We don't want to add it again, so we'll skip it using the logic
// below, since we process input in order.
keep_this_output = indices_so_far.iter().all(|i| *i >= idx);
continue;
} else if !pts_so_far.is_empty() {
// The overlap ends here
let add = Output {
geometry: std::mem::take(&mut pts_so_far).into(),
indices: std::mem::take(&mut indices_so_far),
};
if keep_this_output {
intermediate_output.push(add);
}
// Reset below
}

assert!(pts_so_far.is_empty());
pts_so_far.push(line.start);
pts_so_far.push(line.end);
indices_so_far = indices.clone();
// Say we're processing input 2, and we have a segment with indices [2, 5]. We want
// to add it to output. But later we'll work on input 5 and see the same segment
// with indices [2, 5]. We don't want to add it again, so we'll skip it using the
// logic below, since we process input in order.
keep_this_output = indices_so_far.iter().all(|i| *i >= idx);
}
// This input ended; add to output if needed
if !pts_so_far.is_empty() && keep_this_output {
Expand All @@ -91,12 +94,6 @@ pub fn overline(input: &Vec<Feature>) -> Vec<Output> {
.collect()
}

pub fn feature_to_line_string(f: &Feature) -> Option<geo::LineString<f64>> {
f.geometry
.as_ref()
.and_then(|x| TryInto::<geo::LineString<f64>>::try_into(x).ok())
}

// Assume there are no precision issues with the input. If we wanted to quantize, we'd do it here.
#[derive(PartialEq, Eq, Hash, Debug)]
struct HashedPoint {
Expand Down
16 changes: 5 additions & 11 deletions rust/tests/src/lib.rs
Original file line number Diff line number Diff line change
@@ -1,20 +1,14 @@
#[cfg(test)]
mod tests {
use geojson::GeoJson;
use overline::{overline, Output};
use overline::{overline, Input, Output};

include!(concat!(env!("OUT_DIR"), "/tests.rs"));

fn test(input_path: &str, output_path: &str) {
let geojson: GeoJson = std::fs::read_to_string(input_path)
.unwrap()
.parse()
.unwrap();
let input = if let GeoJson::FeatureCollection(collection) = geojson {
collection.features
} else {
panic!("Input isn't a FeatureCollection");
};
let input: Vec<Input> = geojson::de::deserialize_feature_collection_str_to_vec(
&std::fs::read_to_string(input_path).unwrap(),
)
.unwrap();
let actual_output = overline(&input);
let expected_output: Vec<Output> = geojson::de::deserialize_feature_collection_str_to_vec(
&std::fs::read_to_string(output_path).unwrap(),
Expand Down

6 comments on commit 59a7007

@Robinlovelace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, 👍 to simpler code that's easier to maintain and contribute to. Is this related to the fact that input could contain Linestring, MultiLinestring, or other geometry types? Have hit issues around different geom types as inputs in the R/GEOS implemenation.

@Robinlovelace
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And many thanks @michaelkirk for contributing to active travel stuff from afar!

@michaelkirk
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately this increases the parsing time from 11s to 26s

@dabreegster - I'm attempting some perf work over in georust/geojson#222

Do you have a benchmark or an example command invocation I could run to see this difference? All the geojsons committed in the repository seem pretty small.

@dabreegster
Copy link
Contributor Author

@dabreegster dabreegster commented on 59a7007 Apr 6, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@michaelkirk, https://github.com/actenglabs/overline/releases/download/v0/cycle_routes_london.zip
cargo run --release cycle_routes_london.geojson should do the trick. Reading + deserializing takes 16s on https://github.com/actenglabs/overline/tree/rust for me. On https://github.com/actenglabs/overline/tree/rust_variation, up to 34s

Thank you for making base dependencies even more awesome!

@michaelkirk
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: Improvements are somewhere between modest and non-existent using georust/geojson#222

@michaelkirk
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm making a little progress over at (draft) PR georust/geojson#223

With branch rust:

Reading and deserializing ../cycle_routes_london.geojson
... took 7.46851125s
Running overline on 125317 line-strings
... took 3.667612875s
Writing with indices to output.geojson
... took 333.463458ms

With branch rust_variation, which introduces a nicer API and the option of streaming to minimize memory usage.

Reading and deserializing ../cycle_routes_london.geojson
... took 18.82973175s
Running overline on 125317 line-strings
... took 3.42870325s

With georust/geojson#223, which avoids (heap) allocating a Vec per coordinate:

Reading and deserializing ../cycle_routes_london.geojson
... took 6.830074834s
Running overline on 125317 line-strings
... took 3.173955875s

The branch is in a pretty rough and incomplete state, but it seems doable.

Please sign in to comment.