extract and check-refs use too much RAM with numerically high node IDs #234

Open
RyanDeRose-TomTom opened this issue Nov 8, 2021 · 1 comment

@RyanDeRose-TomTom

What version of osmium-tool are you using?

osmium version 1.13.2 (v1.13.2-4-gf0657f8)
libosmium version 2.17.1
Supported PBF compression types: none zlib lz4

What operating system version are you using?

Ubuntu 18.04.6 LTS

Tell us something about your system

8 CPU cores
32 GB RAM

What did you do exactly?

I have a custom-made PBF (containing nodes and ways, but no relations) covering the area of Luxembourg, and I attempted to extract the region corresponding to one of the two level-8 Mercator tiles that contain Luxembourg:
5.625,49.83798245308484,7.03125,50.73645513701065

With my file (15 MB), extract fills up my system's RAM (32 GB) on the first pass and is killed. I tried check-refs to validate the file, and the same thing happened.

Eventually I tried renumbering my file; afterwards, extract works quickly, correctly, and with very little RAM. I am using large node IDs (which I can't really avoid for reasons), up to ~2e16, so I set out to reproduce the issue on an official extract:

wget https://download.geofabrik.de/europe/luxembourg-latest.osm.pbf
osmium renumber luxembourg-latest.osm.pbf -o renum-1e16.osm.pbf -s 10000000000000000
osmium renumber luxembourg-latest.osm.pbf -o renum-2e16.osm.pbf -s 20000000000000000
and so on; then, for each renumbered file, an extract like:
osmium extract -b 5.625,49.83798245308484,7.03125,50.73645513701065 renum-3e16.osm.pbf -o renum-extract.osm.pbf -O -v

These are all still reasonable numbers, since they are far below the upper limit of 2^63 - 1 ≈ 9.22e18.
I found that 1e16, 2e16, and 3e16 used a maximum of 9, 18, and 28 GB of RAM respectively (RAM grows linearly with the starting node ID despite a constant node count), with anything larger being killed.
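Peak RAM growing linearly with the starting node ID (while the node count stays constant) is the signature of an ID-tracking structure whose top-level table is sized by the largest possible ID rather than by the number of IDs actually stored. As a rough illustration only (a simplified model of that behavior, not libosmium's actual data structure, and with a made-up per-slot constant), compare a dense table indexed by ID with a sparse set keyed by ID:

```python
import sys

def dense_table_bytes(max_id, bytes_per_entry=8):
    """Memory for a dense lookup table with one slot per possible ID.

    Its size depends only on the largest ID present, not on how many
    IDs are actually stored.
    """
    return max_id * bytes_per_entry

def sparse_set_bytes(ids):
    """Memory for a hash set holding only the IDs that actually occur.

    Its size depends on the number of IDs, not on their magnitude.
    """
    return sys.getsizeof(set(ids))

# ~1 million nodes renumbered to start at 3e16, as in the report above:
node_count = 1_000_000
start_id = 30_000_000_000_000_000
ids = range(start_id, start_id + node_count)

print(dense_table_bytes(start_id + node_count))  # scales with the max ID
print(sparse_set_bytes(ids))                     # scales with the node count
```

Doubling the starting ID doubles the dense table while leaving the sparse set unchanged, which matches the shape of the 9 / 18 / 28 GB progression observed above (the real per-ID constant is clearly much smaller than 8 bytes, but the scaling is the same).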

I can work around this by renumbering, but that prevents the work from being done in parallel on different machines, which don't have access to the same renumbering index.

@joto
Member

joto commented Nov 8, 2021

This is a known limitation of the current implementation. This isn't a problem when used with OSM data, because it doesn't have those large IDs, so it is unlikely to get fixed. If this is needed for your non-OSM use case and you are willing to put some money into it, I do contract development. Please contact me directly.
