add model parallel for inference #55

Draft: wants to merge 5 commits into base: develop

Conversation

@cathalobrien (Contributor) commented Nov 26, 2024

Lets you run inference over multiple GPUs.

All credit goes to @mishooax. This is his implementation; I just added it to anemoi-inference.

I compared the output for n320 1024c running on 1 GPU vs 4 GPUs and it seems to match.

With this I was able to run 9 km inference over 4 nodes, with 4 x 40 GB A100s per node.

I would like feedback on how the input tensor is read and how the output tensor is written: currently all ranks read the input and only rank 0 writes the output. Also, at the moment there is a lot of duplicated logging when you run.
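
For context, here is a rough sketch of that read/write pattern; the function name and the write_output callable are placeholders for illustration, not the actual anemoi-inference code:

import torch

def run_parallel_step(model, input_tensor_torch, model_comm_group, global_rank, write_output):
    # every rank holds the full input tensor and takes part in the sharded forward pass
    with torch.no_grad():
        y_pred = model.forward(input_tensor_torch.unsqueeze(2), model_comm_group)

    # only rank 0 writes the result; the other ranks discard their copy
    if global_rank == 0:
        write_output(y_pred)
    return y_pred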

Unfortunately you have to use Slurm to launch an inference job on multiple GPUs (as opposed to anemoi-training, which supports launching interactive jobs with multiple GPUs, e.g. anemoi-training train hardware.num_gpus_per_node=<num_gpus>). I tried launching with torchrun but it didn't work; happy to look into this more, though.

If you are running over multiple nodes, you need to add these lines to your Slurm batch script:

# take the first node in the allocation as the master and resolve its IP address
MASTER_ADDR=$(scontrol show hostname $SLURM_NODELIST | head -n 1)
export MASTER_ADDR=$(nslookup $MASTER_ADDR | grep -oP '(?<=Address: ).*')

If you're running on a single node, localhost is used as the address.

@mishooax (Member) commented Dec 5, 2024

Thanks @cathalobrien, this looks good; just a minor comment from my side.
Sadly I don't have the time to test this; if someone else wants to, please go ahead.

@@ -443,55 +474,57 @@ def get_most_recent_datetime(input_fields):

             # Predict next state of atmosphere
             with torch.autocast(device_type=device, dtype=autocast):
-                y_pred = model.predict_step(input_tensor_torch)
+                # y_pred = model.predict_step(input_tensor_torch, model_comm_group)
+                y_pred = model.forward(input_tensor_torch.unsqueeze(2), model_comm_group)

Member: Why is an unsqueeze op needed here?

Contributor (Author): It expects an ensemble dimension I believe?

Member: Ah, yes, somehow I missed that you are now calling forward instead of predict_step.
Is predict_step now obsolete? (If so, should it be removed?)
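
To make the shape change concrete, a small standalone sketch; the axis layout (batch, multistep, grid, variables) and the sizes below are assumptions for illustration only:

import torch

# assumed input layout: (batch, multistep, grid, variables) -- sizes are made up
x = torch.randn(1, 2, 40320, 98)

# forward expects an explicit ensemble axis, so a singleton ensemble
# dimension is inserted at position 2
x_ens = x.unsqueeze(2)
print(x_ens.shape)  # torch.Size([1, 2, 1, 40320, 98])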

    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{addr}:{port}",
        timeout=datetime.timedelta(minutes=1),

Member: We should probably set a longer timeout, O(5-10 mins)?

Contributor (Author): Yeah, seems sensible.
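
A minimal sketch of that change, assuming the rank and world size come from the Slurm environment as elsewhere in this PR:

import datetime
import os

import torch.distributed as dist

addr = os.getenv("MASTER_ADDR", "localhost")
port = os.getenv("MASTER_PORT", "10000")

# a 10 minute timeout gives slow-starting ranks on a busy allocation time to join
dist.init_process_group(
    backend="nccl",
    init_method=f"tcp://{addr}:{port}",
    timeout=datetime.timedelta(minutes=10),
    world_size=int(os.getenv("SLURM_NTASKS", "1")),
    rank=int(os.getenv("SLURM_PROCID", "0")),
)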

    if global_rank == 0:
        LOGGER.info("World size: %d", world_size)
    addr = os.getenv("MASTER_ADDR", "localhost")  # localhost should be sufficient to run on a single node
    port = os.getenv("MASTER_PORT", 10000 + random.randint(0, 10000))  # random port between 10,000 and 20,000

Member: Does the fallback to 10000 + random.randint(0, 10000) work without consistent seeding across ranks? Maybe we should use SLURM_JOBID instead?
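
For reference, a sketch of the SLURM_JOBID-based fallback: every rank derives the same port deterministically, so no shared seeding is needed, and the modulus keeps it in the same 10,000-20,000 range:

import os

# derive the fallback port from the job id so all ranks agree on it without any seeding
slurm_jobid = int(os.getenv("SLURM_JOBID", "0"))
port = int(os.getenv("MASTER_PORT", str(10000 + slurm_jobid % 10000)))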
