Understanding queues and nodes #22
-
As discussed in #18, I would like to use https://github.com/shikokuchuo/mirai#example-1-connecting-to-remote-servers--remote-server-queues to distribute tasks across jobs on a cluster (and eventually, across AWS Batch jobs on the same local network). Each cluster or AWS Batch job will run one R process that calls I have learned a lot from the README, the documentation of What I think I understand so farAside from autoscaling and crash detection, here is how I see my use case playing out. The following example has 2 workers and only a few tasks, but I envision scaling up to hundreds of each. First, I open a TCP socket on the client: # from xx.xx.xx.175
daemons("tcp://xx.xx.xx.175:5555") and then on a different computer on the network I run a server process # from xx.xx.xx.138
server("tcp://xx.xx.xx.175:5555") and on a third computer I run a third server process # from xx.xx.xx.176
server("tcp://xx.xx.xx.175:5555") Then the client can send jobs for the two servers to run. Below, I experiment with a task load of 6 jobs for 2 servers. # from xx.xx.xx.175
tasks <- list(
mirai({Sys.sleep(5); paste0("task1_server", gsub("^.*\\.", "", getip::getip()))}),
mirai({Sys.sleep(3); paste0("task2_server", gsub("^.*\\.", "", getip::getip()))}),
mirai({Sys.sleep(4); paste0("task3_server", gsub("^.*\\.", "", getip::getip()))}),
mirai({Sys.sleep(2); paste0("task4_server", gsub("^.*\\.", "", getip::getip()))}),
mirai({Sys.sleep(8); paste0("task5_server", gsub("^.*\\.", "", getip::getip()))}),
mirai({Sys.sleep(7); paste0("task6_server", gsub("^.*\\.", "", getip::getip()))})
)
for (i in seq_len(60)) {
Sys.sleep(1)
seconds <- paste0(i, "s")
print(paste(c(seconds, purrr::map_chr(tasks, ~.x$data)), collapse = " "))
} The output I see from the loop on the client is: [1] "1s NA NA NA NA NA NA"
[1] "2s NA NA NA NA NA NA"
[1] "3s NA task2_server138 NA NA NA NA"
[1] "4s NA task2_server138 NA NA NA NA"
[1] "5s task1_server176 task2_server138 NA task4_server138 NA NA"
[1] "6s task1_server176 task2_server138 NA task4_server138 NA NA"
[1] "7s task1_server176 task2_server138 NA task4_server138 NA NA"
[1] "8s task1_server176 task2_server138 NA task4_server138 NA NA"
[1] "9s task1_server176 task2_server138 task3_server176 task4_server138 NA NA"
[1] "10s task1_server176 task2_server138 task3_server176 task4_server138 NA NA"
[1] "11s task1_server176 task2_server138 task3_server176 task4_server138 NA NA"
[1] "12s task1_server176 task2_server138 task3_server176 task4_server138 NA task6_server138"
[1] "13s task1_server176 task2_server138 task3_server176 task4_server138 NA task6_server138"
[1] "14s task1_server176 task2_server138 task3_server176 task4_server138 NA task6_server138"
[1] "15s task1_server176 task2_server138 task3_server176 task4_server138 NA task6_server138"
[1] "16s task1_server176 task2_server138 task3_server176 task4_server138 NA task6_server138"
[1] "17s task1_server176 task2_server138 task3_server176 task4_server138 task5_server176 task6_server138" From this output, we can draw a picture of which tasks are running at which times. In the plot below, each row is a task. Time advances from left to right, and each task is shaded according to the server running it at a given time point. I really like the way Nodes and queues?In https://github.com/shikokuchuo/mirai#example-1-connecting-to-remote-servers--remote-server-queues, you explain that
I am sorry if I am missing something obvious from the code usage examples, documentation, or #18, but would you be willing to explain it to me at a more basic level what these features are and the specific scenarios that motivate them? |
Beta Was this translation helpful? Give feedback.
Replies: 3 comments 25 replies
-
Let me answer (3) first – if you don’t need an active queue, don’t use it. The underlying NNG logic and implementation is very robust and I have recommended throughout the documentation that if this is suitable then it should be used. The problem we have is that when we send the tasks we have no way of knowing the task length a priori. In your example, the tasks are still roughly the same length so the solution is more or less acceptable. Let me give you an extreme counter-example: odd number tasks length 1, even number tasks length 10. As NNG round-robins*, the odd number tasks are all sent to server 1, and the even ones to server 2. Server 1 will be idle after 3 seconds, However the total time taken will be 30 seconds. This is the reason for an active queue. *NNG is not so dumb it also responds to back-pressure from the socket, but only in the case where messages are sent faster than socket buffers can be cleared. Buffers are tuneable at the NNG level but there are also system level TCP socket buffers - in short this is not something we can control reliably. |
Beta Was this translation helpful? Give feedback.
-
(1) Specifying |
Beta Was this translation helpful? Give feedback.
-
(2) What the active queue does is act as a relay or switch - it sits in the middle and forwards tasks between the client and an end server. The logic it has is to only forward tasks to servers that are free (idle). This is the key difference! As we cannot know a priori the task length, we should not allocate them all at once - they need to be queued. Now we just need to poll for servers becoming free and send tasks to the server if there are remaining tasks in the queue. The polling is why we run this in a background process. So if you now call:
(I find it easier to specify all interfaces on the client, also partial matching works so 'n' is as good as 'nodes') and
[]* Note the increment of the port number You will see now that waiting tasks will be allocated to a server as soon as it becomes available. |
Beta Was this translation helpful? Give feedback.
Let me answer (3) first – if you don’t need an active queue, don’t use it. The underlying NNG logic and implementation is very robust and I have recommended throughout the documentation that if this is suitable then it should be used.
The problem we have is that when we send the tasks we have no way of knowing the task length a priori. In your example, the tasks are still roughly the same length so the solution is more or less acceptable.
Let me give you an extreme counter-example: odd number tasks length 1, even number tasks length 10. As NNG round-robins*, the odd number tasks are all sent to server 1, and the even ones to server 2. Server 1 will be idle after 3 seconds, However the tot…