Instance Slices (+29%) #591
Open
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Push instances into a regular ol' Vec, and keep track of the inserted range in TypeInfo. This means that all instances of a specific class are in a contiguous slice, and are also in the correct iteration order expected by the decode_prop_chunk loops. No more
instances_by_ref.get_mut(referent).unwrap(). A HashMap tracks which i32 ref id corresponds to which instance using indices.There's probably some more optimizations hiding in
DeserializerState::finish.The algorithm is currently using Vec::swap_remove to remove the elements from the instances list.(swap_remove removed in 695e88e) Theinstances_to_constructmachinery can probably be combined with the instances_by_ref and PRNT chunk decode in some clever way to do even less work.I wrote an unsafe version that avoids swap_remove here: 3df205b but there doesn't seem to be a significant impact.I wrote a safe version using a similar technique. Theoretically it should be faster than the swap_remove method, but I haven't put it to the test.Performance
I observe a 29% improvement in throughput in the Miner's Haven deserialize benchmark.
Lore
The instances could be owned by type_info directly, I was able to run type_ids in parallel with rayon (branch) using this method. I got a 1% speedup!!! Wow!!! Probably because it spent the entire time waiting for 36000 Parts to decode while every other class only has <2000 instances. The InstanceKey technique would still work, but removing each instance in finish() would be triple indirection (key, type_id, index) instead of double indirection (key, index).
I was using the indexmap crate for the Vec + HashMap abomination several times, but it's better to implement separately because of the separable mutable + immutable aliasing over the Vec + HashMap parts.
This has three previous incarnations: