Jobs/Burst based substepping #26

Open
wants to merge 59 commits into base: master
Conversation

@gotmachine (Contributor) commented Sep 2, 2021

Opening this to be able to comment on the changes. Don't merge yet.

gotmachine and others added 21 commits August 19, 2021 18:39
Still to do: adaptive substepping interval depending on the workload
Functional calculation of body and vessel positions using Unity Jobs and Burst.
Dev, not actually being applied yet: runs alongside the original sim / compute thread.
Structural differences, so created a namespace and introduced classes/structs informed by the threaded originals.
Demo functionality: includes full validation of position data for every step for all bodies, and the final step for all vessels, compared against stock methods, logging any misses.  This validation is slow.  So is the example stock orbit calculator for all orbits & steps, included to represent single-threaded performance.
Known issues: when vessel changes SOI, position computation will be incorrect (because the Orbit in the new SOI does not accurately depict positions in the previous SOI at high timewarp).  Errors only for the single FixedUpdate frame of the calculation.
New paradigm: Does not precompute the future, instead computes everything from previous FixedUpdate to current FixedUpdate.
Placement of job completion calls is provisional.  More to do, but I need to get this committed before I break things irrecoverably!  Working correct positions of bodies & vessels (including landed) is a good point for that.
Does not yet implement the thermal sim components (SimStep's logic) or get tied into the final SimVessel / SimBody / SimStar.
GatherFrames is silly slow and will be replaced with a Jobs implementation.
Don't unroll large reference structures. (Avoid the memory allocation to do so, try a pattern where you carry references to the template.)
Use NativeSlice
Import data from job
Don't need 3 separate loops through bodies and vessels
VesselAccumulator is still a little slow
…and FluxAnalysis

Don't return to main thread to reconstruct the arrays, do it in a job and pass the handle between.
Instead of passing a List of frames to FluxAnalysis, pass the component arrays and get back occlusion and irradiance arrays.
Remove unused Load/Save
Remove singleton pattern in SubstepSim, allow for multiple places to own their own copy and simulate independently as needed.
... instead of NativeArray.  More flexible, otherwise the same thing.
Running on a single substep frame and spawning many jobs when there are many substeps was unnecessarily slow and not much less complicated
Remove RelPosition and irradiance references now in SubstepVessel
Don't worry about identifying vessel with largest time since eval.  Enqueue ALL vessels and refill when exhausted.
Move VesselData / Simulation valid checks into separate method, checked every time the queues are refilled
Unify loaded and unloaded processing calls into single loop through a Queue
Also partially refactor flux job ordering, to highlight more parallelism in the 4 flux/irradiance computations: direct solar irradiance, direct body irradiance(body core thermal flux), albedo irradiance, and re-emitted solar irradiance
@gotmachine gotmachine linked an issue Sep 2, 2021 that may be closed by this pull request
@gotmachine (Contributor, Author) left a comment

A few notes on readability / maintainability:

  • Please retain the full comments about where the flux formulas come from.
  • Please make the variable names more explicit.
  • Please don't use var when there is no mention of the type on the same line. This is complicated enough to read/understand/maintain.

My overall feeling is that this will be a nightmare to maintain. Unless there is a good reason not to, I would really prefer the evaluation stuff to happen in a monolithic job. If that isn't possible, you really need to document what everything does, what calls what, in what order, etc.

Do we really need so many sub-jobs for every tiny little function?
I fail to see how all those parallel foreach loops will yield any performance benefit. We already have more than enough parallelism potential to fill even a 64-thread processor due to the substep count * vessel count premise.

src/Kerbalism/SteppedSim/FluxAnalysisFactory.cs (outdated)
src/Kerbalism/SteppedSim/Jobs/FluxAnalysisJobs.cs (outdated)
occludedLocal |= math.lengthsq(bv) <= radiusSq;
else
occludedLocal |= math.lengthsq(math.cross(ab, av)) <= radiusSq * abLen2;
}
@gotmachine (Contributor, Author) commented:

That occlusion check seems quite wasteful to me, in comparison to the one we currently have:

/// <summary> return true if the ray 'dir' starting at 'start' and of length 'dist' doesn't hit 'body'</summary>
/// <param name="start">ray origin</param>
/// <param name="dir">ray direction</param>
/// <param name="dist">ray length</param>
/// <param name="body">obstacle to check against</param>
public static bool RayAvoidBody(Vector3d start, Vector3d dir, double dist, CelestialBody body)
{
// ray from origin to body center
Vector3d diff = body.position - start;
// projection of origin->body center ray over the raytracing direction
double k = Vector3d.Dot(diff, dir);
// the ray doesn't hit the body if the minimal distance between the ray and the body center is greater than the body radius
// simplified from 'start + dir * k - body.position'
return k < 0.0 || k > dist || (dir * k - diff).magnitude > body.Radius;
}

Collaborator commented:

I think that your occlusion check is incorrect, or presupposes a lot of things about what it is evaluating.
Your implementation requires a ray of unit length (direction) already established.
The ray must not be within the sphere when it starts. (fix with k < -body.Radius instead of 0?)

The other code acts on an arbitrary line segment, point and distance from point.

I think these are probably similar methods, or can become the same after the assumptions in your method are backed out. I'm more than happy if someone finds a more optimal general solution for (distance from point to line segment).

@gotmachine (Contributor, Author) commented:

Yes, dir must be normalized. It works perfectly well if the start point is inside the sphere radius. This code (and the various optimized variations in Sim) have worked perfectly fine for years.

Collaborator commented:

Eh?

Body -- Vessel-Facing-> ----- Body+Radius -> Distant Target

dir is towards the right.
diff = Body.position - start is towards the left.
k = math.dot (diff, dir) is negative.
The vessel is inside the occluder, yet this will answer that it is not occluded.

@gotmachine (Contributor, Author) commented Sep 5, 2021:

You're assuming this should return "occluded" if the point is inside the occluder, which isn't the case. That's the point of that method, it works reliably (for our purpose) at any altitude, including negative ones. I suggest you test it.
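For what it's worth, the disputed geometry is easy to check numerically. Below is a minimal Python transcription of RayAvoidBody's math (an illustrative sketch only, not the shipped C#):

```python
import math

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1] + u[2]*v[2]

def ray_avoids_body(start, direction, dist, body_pos, body_radius):
    """True if the ray misses the body; 'direction' must be a unit vector."""
    diff = tuple(b - s for b, s in zip(body_pos, start))   # ray origin -> body center
    k = dot(diff, direction)                               # projection onto the ray direction
    # closest approach of the ray to the body center, from 'dir * k - diff'
    closest = tuple(d * k - f for d, f in zip(direction, diff))
    return k < 0.0 or k > dist or math.sqrt(dot(closest, closest)) > body_radius

# Disputed case: start point inside a radius-10 occluder, looking away from its center.
print(ray_avoids_body((5.0, 0.0, 0.0), (1.0, 0.0, 0.0), 100.0, (0.0, 0.0, 0.0), 10.0))  # True
```

With the start point inside the sphere and the ray pointing away from the body center, k is negative and the function reports "not occluded", which is the behavior debated above.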

@@ -490,6 +392,49 @@ void FixedUpdate()
fuWatch.Stop();
}

public void ManageWorkQueues(Queue<Vessel> mandatory, Queue<Vessel> discretionary)
@gotmachine (Contributor, Author) commented:

Granted, the previous implementation was a bit messy, and I intended to clean/simplify it at some point (mainly so we don't have those duplicated sections for loaded/unloaded vessels).

But I'm not sure I like this Queue thing. First, I'm not sure why this is a queue and not a simple list, and not sure why there are two of them. Also, don't use references to Vessel, use VesselData directly: you are doing tons of useless TryGetVesselData() calls. Also, why are you iterating separately over VesselsLoaded and VesselsUnloaded?

IMO, the correct implementation for this should be :

(DB.UpdateVesselDataDictionary();)

  • foreach VesselData in DB.VesselDatas
    • remove dead vessels
  • foreach Vessel in FlightGlobals.Vessels
    • Create VesselData for new vessels
    • Check Simulated state
    • If vessel is loaded, add it to the "to process this FU" list
    • Else, check the last simulation UT, keep the oldest VD
  • Add the oldest VD to the "to process this FU" list
  • ClearExpiredFrames()
  • foreach VesselData in "to process this FU"
    • update the VD
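The flow proposed above could be sketched roughly as follows (a Python sketch under my reading of the list; every type and name here is hypothetical, not the actual Kerbalism API):

```python
# Rough sketch of the proposed per-FixedUpdate flow; all names are hypothetical.
class VesselData:
    def __init__(self, vessel):
        self.vessel = vessel
        self.last_sim_ut = 0.0

def fixed_update(db, flight_vessels, process):
    # Remove dead vessels' VesselData from the DB.
    live = {v.id for v in flight_vessels}
    for vid in [vid for vid in db if vid not in live]:
        del db[vid]
    to_process, oldest = [], None
    for v in flight_vessels:
        vd = db.setdefault(v.id, VesselData(v))   # create VesselData for new vessels
        if v.loaded:
            to_process.append(vd)                 # loaded: process this FixedUpdate
        elif oldest is None or vd.last_sim_ut < oldest.last_sim_ut:
            oldest = vd                           # unloaded: track the stalest one
    if oldest is not None:
        to_process.append(oldest)                 # add the oldest VD to the list
    for vd in to_process:
        process(vd)                               # update the VD
```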

Collaborator commented:

What's not simple about a Queue? Because you have a queue: a collection of things to process that you have no need to revisit or index into arbitrarily, and for which adding and removing are extremely fast.

The point of doing this was somewhat that the oldest vessel is irrelevant, and if you ever want to do more than 1, what is the next-oldest vessel? It doesn't matter. It's not one you've processed since you filled the queue. The background queue doesn't need to refresh every FixedUpdate, you don't need to find the oldest vessel. The oldest vessel is always in the queue, because the only way out of the queue is to get processed. In fact, everything in the queue is basically of similar-enough age, and if the global list of vessels itself doesn't constantly re-order, the natural ordering of creating the queue will also be oldest-first after the first complete run... so you get "oldest vessel" free anyway. In fact, you get a sorted queue for free after the first cycle.

I think we both accuse each other a lot of premature optimization. Dictionary lookup is O(1). An extra TryGetVesselData is pretty insignificant. I do keep looking for where I can iterate over the VesselDatas instead of Vessels, because that's usually what I really want anyway. Here was a case where I thought not to. Because:

FlightGlobals' short list of loaded vessels contains the only ones considered this cycle. I don't even need to inspect all of the unloaded ones. Why bother to iterate over a large collection when I can iterate over a tiny one and do fast Dictionary lookups to double-check validity? And I'm actually doing relatively few of them. Unloaded vessels only need to be reviewed every complete block -- they can be checked when the queue refills.

When you complete a full pass on all vessels, you get a reasonably-sized chunk of expired frames to return.
When you complete a single vessel, you get a small number of frames to return. (Exactly 1 FU's generation, I believe.) And it is the same cost to find them all each time.

Why 2 queues, beyond that they can have different update rates? Because now, with a separate queue of background vessels to process as you get to them, minimum 1/cycle, you could dynamically decide whether to pull from that queue or not. Consider: start a stopwatch, process a vessel, check elapsed time, if within budget, grab the next vessel and process. Which vessel? The next one in the queue.

I think the correct implementation is, well, what I wrote.
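As described, the two-queue scheme amounts to something like the following sketch (Python, illustrative only; the names and the budget value are assumptions, not the committed code):

```python
from collections import deque
import time

# Illustrative sketch of the two-queue scheme; all names here are hypothetical.
def run_cycle(loaded, background, refill_background, process, budget_s=0.002):
    """Process every loaded vessel, then drain background vessels within a time budget."""
    start = time.perf_counter()
    while loaded:
        process(loaded.popleft())         # loaded vessels are always processed
    if not background:
        # Long-lived queue: refilled only when exhausted, so after the first full
        # cycle the front of the queue is naturally the stalest vessel.
        background.extend(refill_background())
    if background:
        process(background.popleft())     # guaranteed minimum of one per cycle
    while background and time.perf_counter() - start < budget_s:
        process(background.popleft())     # keep going while within budget
```

Because vessels only leave the background queue by being processed, the stalest vessel is always at the front; no per-FixedUpdate "find the oldest" scan is needed.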

@gotmachine (Contributor, Author) commented:

Uh, I didn't understand that the "discretionary" queue was long-lived.
Well, we need to iterate on all vessels in any case (i.e., ValidateVesselData() must be called every FixedUpdate).
Said otherwise, it is mandatory that DB.VesselData and FlightGlobals.Vessels are synchronized before we do anything. So we have to iterate on both collections every FixedUpdate in any case. So there is little point in caching anything.

Collaborator commented:

I don't understand.
ValidateVesselData creates a VesselData object in the DB if one doesn't exist.

I can understand catching any loaded vessels that didn't get a DB entry here, if we don't trust the other places in the code to detect new loaded vessels and set them up. I didn't include a fallback, but it's easy enough to add (since I already check that a valid VesselData exists).

For background vessels? What exactly are we hunting? If it went (NumVessels) FixedUpdates cycles before being detected, does that matter? What's going to process it if it doesn't have a VesselData? There are things that intentionally won't, since there is: if (VesselData.VesselNeedVesselData(v.protoVessel)) in there.

If it is necessary, perhaps move it up out of the substep sim and the sim's decision-making on what to definitely process and what to maybe process this cycle?

@gotmachine (Contributor, Author) commented Sep 5, 2021:

There are tons of entry points for getting an unloaded vessel's VesselData (CommNet, GameEvents...).
Anyway, the main issue is that long-lived queue. There are tons of ways the discretionary queue can become invalid: a vessel changing loaded state, a vessel changing simulated state, destroyed vessels... It's an unnecessary point of failure, and I don't see any performance gain.

Collaborator commented:

I've moved ValidateVesselData to a point where it will always run.

I have also swapped the queues to more directly reflect loaded and unloaded, instead of mandatory/discretionary, as part of 47533e8. That allows flexibility in how we choose the next vessel to process while we wait for jobs to finish. As an idea, there is a configurable time budget for processing unloaded vessels before switching to loaded ones. The purpose is to provide useful main-thread processing while we wait for potentially-expensive jobs to complete. (40 vessels at 100k timewarp is typically 3 ms without access to the main thread, so "expensive" is relative.)

src/Kerbalism/SteppedSim/FluxAnalysisFactory.cs (outdated)
hemisphericReradiatedIrradianceFactor = bodyHemisphericReradiatedIrradianceFactor,
irradiance = albedoIrradiance,
}.Schedule(fullSize, 256, JobHandle.CombineDependencies(bodyHemisphericReradiatedIrradianceFactorJob, vesselBodyOcclusionJob));

@gotmachine (Contributor, Author) commented:

Question: is there a benefit in dividing every evaluation into so many tiny sub-jobs?
Intuitively, it seems to me that this leads to tons of redundant allocations and context switching, and puts a lot of pressure on the job scheduler.
It also makes the code very messy and spaghettified overall...

Collaborator commented:

In some places, yes. Here, I'm pretty confident. But you need to see what each job is doing.
In this example, for the albedo computation, yes, it makes sense to use the total body flux at the body and pre-calculate the flux/luminosity of the body based on that. The job itself is simple:

var body = bodies[index];
factor[index] = Unity.Burst.CompilerServices.Hint.Unlikely(occluded[index]) ?
    0 : sunFluxAtBody[index] * body.radius * body.radius * 0.5 * body.albedo;

This calculation is going to be re-used for every vessel, for every substep, to determine the albedo flux on that vessel during that substep. So it actually was a lot more intuitive, to me, to convert this weird thing into a flux originating at the body. And then re-use as much as possible the flux-to-irradiance computation. That's why BodyDirectIrradiances can now be its own job covering 3 things, and AlbedoIrradianceAtVesselJob isn't in there only because of the sun-body-vessel position-dependent calc.

How many allocations and context switches and jobs do you believe were scheduled here?
(In most tests, it's ending up < 4/frame, most of which are 2 fixedUpdate cycles)

I can see that following the code structure is an issue. It's one I've been struggling with generally. Part of it is that creating the jobs takes up a lot of screen space, so it's harder to see the flow at a glance. Part of it is the variable allocations at the jobs that use them -- this isn't really necessary, though it can be a micro-optimization (and would indeed be a premature one here!) if you launch jobs while you allocate and schedule for others. It's that way now more because organizationally it was easier... though still ugly. Part of it is that naming is hard, and as I reorganized my own work, some of the naming may not have kept up well. (BodyDirectIrradiances ... are not the direct irradiance anymore, oops.)

I agree that somewhere in the file, there needs to be a summary of what the code flow is. If each of those job creations were turned into a 1-line method call, I think it would be a lot more clear. (Instead, they're a multi-line method, basically...)

Collaborator commented:

I took another (... and then another) shot at restructuring the code flow and organization.

{
var star = bodies[firstBodyIndex + starIndexes[i]];
var d2 = math.distancesq(star.position, body.position);
tempFlux += star.solarLuminosity / (4 * math.PI_DBL * d2);
@gotmachine (Contributor, Author) commented:

I'm not sure you can use CelestialBody.solarLuminosity. I don't remember the details, but we had to implement the Sim.stars list for a good reason.
Also, isn't d2 = math.distancesq() wrong? The formula is irradianceAtAU / (4 * PI * distance²).

Collaborator commented:

math.distancesq is the square of the distance...
That isn't CelestialBody.solarLuminosity. It is:

SubStepSim.GenerateBodyVesselOrbitData L287

double defaultLuminosity = PhysicsGlobals.SolarLuminosity > 0 ? PhysicsGlobals.SolarLuminosity : PhysicsGlobals.SolarLuminosityAtHome * 4 * math.PI_DBL * Sim.AU * Sim.AU;

and L315:

solarLuminosity = body.isStar ? defaultLuminosity : 0,	// FIXME

Missing the Kopernicus support, which is why //FIXME
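The fallback just inverts flux = luminosity / (4π·d²): plugging the derived luminosity back into the flux formula recovers the home irradiance at 1 AU. A quick Python check (the irradiance constant is an illustrative stand-in for PhysicsGlobals.SolarLuminosityAtHome):

```python
import math

SOLAR_IRRADIANCE_AT_HOME = 1360.0   # W/m^2, illustrative stand-in value
AU = 149_597_870_700.0              # meters

# Fallback luminosity, mirroring the line quoted above: L = I_home * 4 * pi * AU^2
luminosity = SOLAR_IRRADIANCE_AT_HOME * 4 * math.pi * AU * AU

def flux_at(distance_sq):
    # flux = L / (4 * pi * d^2), with d^2 coming straight from math.distancesq
    return luminosity / (4 * math.pi * distance_sq)

print(flux_at(AU * AU))          # ~1360 at 1 AU: the formula round-trips
print(flux_at((2 * AU) ** 2))    # ~340 at 2 AU: inverse-square falloff
```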

else if (math.dot(bv, ab) > 0)
occluded |= math.lengthsq(bv) <= radiusSq;
else
occluded |= math.lengthsq(math.cross(ab, av)) <= radiusSq * abLen2;
@gotmachine (Contributor, Author) commented:

Again, this seems very wasteful compared to the analytical sphere-ray intersection check I'm used to.

emissiveIrradiance = bodyEmissiveIrradianceSum,
coreIrradiance = bodyCoreIrradianceSum,
}.Schedule(numSteps * numVessels, numVessels, JobHandle.CombineDependencies(sumTemp, sum3));

@gotmachine (Contributor, Author) commented:

Not exactly sure where this is going, but you have no way to predict which steps will be effectively used. Therefore you can't sum their results.

Collaborator commented:

All the steps will be used. The job isn't computing future steps that may or may not exist or be used. It is computing up to the current Planetarium.UT.

Also, this isn't the sum across all of the timesteps, though this might be a good place to take a large step across.
This is the sum of all of the individual body contributions to the vessel. It's also what instigated the conversation on Discord about whether/why you might still need it per-body at the next level of processing.

Treat everything as substepped for VesselData.EnvironmentUpdate.  Even if there is just a single substep.
WIP: Re-examine any changes to EnvironmentUpdate.  Probably need to reconstitute a per-body (not per-star) list of irradiances for future thermal sim. This should be done at the Jobs level and code here may still sum it across all substeps. The current rollup may be too broad, if specific directionality becomes important to later calcs.
WIP: Removed a lot of the fields recalculated from irradiance temporarily.
WIP/TODO: Since we're back in managed / main thread land, consider jobifying all this summation.  Need a good, fast way to deliver all the frames back into a job.  Could be a significant architectural idea/work.
Remove reliance on SimStep and SimVessel in VesselData
Parameterize Sim for Evaluate/EnvironmentUpdate.  (Idea: externals can make a VesselData and run a sim with different characteristics/subsets for bodies/vessels than the "gameworld" sim)
On the way to closing out the previous threaded sim.
Rename variables, add comments, reorganize some structure.
TODO: Restructure a lot of it again for properly handling multiple stars / albedo sources
Make FluxAnalysisFactory non-static
Track index of stars
DRVeyl and others added 30 commits September 11, 2021 14:27
Don't wait for the jobs to complete right after launching.  Return from SubStepSim.OnFixedUpdate to let Kerbalism FixedUpdate progress and then collect results when required.
Allow unloaded vessels to process with data up to the previous FixedUpdate.  Loaded vessels require current data, so block if necessary.
Define a budget for background vessel processing before switching to loaded vessels.  WIP: 2ms
GetNextVesselToEval controls choice of next vessel or when to stop processing
Swapped work queue definitions back to loaded and unloaded.
Custom struct to reference indices
Undo some of the unrolling for reference data
Provide consistent FrameStats struct
Slightly micro-optimize RotationsComputeJob
Vessel index didn't play a role in the source albedo data
Disabled the desynchronized update cycles in VesselData, this was causing issues with the new jobs based substepping.
Note : Planner is likely borked, I didn't reimplement its environment evaluation (since it will be refactored anyway)
Demonstrate constructing a job and running it from the main thread.  Set up a job for the large summation across timesteps and bodies.  Not ideal, since we have to index and serialize each irradiance structure for each timestep, but demonstrative.  Potential for future improvements.
Switch FrameManager from Dictionary to LinkedList to improve performance (iterating many sequential dictionary key lookups was noticeably inefficient).
FIXMEs: Unbreak the variable-timestep awareness when summing, and the tracking of sunlightFactor and maybe visibility.  Move them into the supporting job calculation or replace.
Substeps can have different intervals when accumulating across more than 1 FixedUpdate cycle, typically when warp rate changes between evaluations.
Determine weighting factors and apply during the accumulator job.
Compute cumulative visibility across multiple substeps.
Track visibility as a float in the irradiance structure, so that it can be used to represent more than a single instant in time
Fix indexing error into relevance array which is not same dimension as bodies array
…tamps

Removed accumulated list of timestamps.  Instead, the sim notifies each VesselData of how much new data is available each cycle
VesselData iterates through FrameManager's data, skipping past evaluated frames and applying until available frames are exhausted or a stop time is reached.
TODO: Make stop time parameter work
Optimization: Iterate on aggregate frames generated 1/FixedUpdate instead of all detail frames generated 1/max interval.
Move from VesselDataJobs to FluxAnalysisJobs
Allow parameterized number of vessels in the frame
Take advantage of VesselBodyIrradiance struct using a float for visibility combined with time-weighted averaging jobs to compute single data points for an entire FixedUpdate that themselves can be combined in a time-weighted fashion across multiple cycles.  Allows timewarp-independent overall summary of detailed timesteps.
Don't short-circuit the loop on occluders in the Job when occlusion is detected.  This is counter-intuitive, but empirically faster with Burst, perhaps because of removing the dependency on a result discovered at loop's end.
Avoid branching in occlusion test.
Since the actual distance doesn't matter, only if it is less than a target, we can skip the test for which end of the line segment is closest, and simply compute all 3 possibilities.  If any of them are sufficient, the full result is.  The branching is more expensive than the extra math operations.
Empirically, on a 4C/4T 3.2 GHz system with 20 vessels, 17 bodies, and 67 block computations per FixedUpdate (100k warp, 30-sec intervals), this change reduced typical occlusion compute time from 1.7 ms to 1.2 ms.
Fix NRE around aggregator when no work was needed
Enable validator with a field instead of commenting in/out
Math better next time; the cross product test was not valid if the occluder was beyond the segment start or end.
Adjusted albedo calculations with latest developments - subject to change again.
A lot of complexity is wrapped into Orbit.GetRelativePositionAtUT, making it difficult to inspect where to optimize.
Break out code into sequential calculations:
compute ObT, solveEccentricAnomalies, getTrueAnomaly.  Include some temp data compute jobs.
Refactor relevance into subarrays of relevant body indices.  Store length of each subarray per source vessel.
Rather than build and then sort separately, avoid the extra (and larger) allocations and combine the steps.
Fix allocation leak with new indexer
Swap solver to a do-while
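The occlusion commits above boil down to a point-to-segment distance test against the occluder radius. A Python sketch of that geometry (an illustration under my assumptions, not the Burst job; the cross-product form follows the snippets quoted earlier, including the out-of-segment fix noted in the commit messages):

```python
def dot(u, v):
    return u[0]*v[0] + u[1]*v[1] + u[2]*v[2]

def sub(u, v):
    return (u[0]-v[0], u[1]-v[1], u[2]-v[2])

def cross(u, v):
    return (u[1]*v[2] - u[2]*v[1], u[2]*v[0] - u[0]*v[2], u[0]*v[1] - u[1]*v[0])

def segment_occluded(a, b, center, radius):
    """True if the sphere (center, radius) comes within 'radius' of segment AB."""
    ab, av, bv = sub(b, a), sub(center, a), sub(center, b)
    r2 = radius * radius
    ab_len2 = dot(ab, ab)
    t = dot(av, ab)                      # projection of AV onto AB, scaled by |AB|
    c = cross(ab, av)
    hit_a = dot(av, av) <= r2            # within radius of the segment start
    hit_b = dot(bv, bv) <= r2            # within radius of the segment end
    perp = dot(c, c) <= r2 * ab_len2     # perpendicular-distance (cross product) test
    # The cross-product test alone is invalid when the foot of the perpendicular
    # falls outside the segment -- the bug noted in the commit message above.
    inside = 0.0 <= t <= ab_len2
    return hit_a or hit_b or (inside and perp)

a, b = (0.0, 0.0, 0.0), (10.0, 0.0, 0.0)
print(segment_occluded(a, b, (5.0, 3.0, 0.0), 4.0))   # True: passes within 3 of the segment
print(segment_occluded(a, b, (20.0, 0.0, 0.0), 4.0))  # False: 10 away from the near end
```

Evaluating all three candidate tests and combining the results matches the branch-avoidance idea described above; in Burst, the unconditional evaluation was measured faster than branching.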
Successfully merging this pull request may close these issues: Jobs/Burst based substepping

2 participants