Python ships with a multiprocessing
module that allows your code to run functions in parallel by offloading calls to available processors.
In this guide, we will explore what a Pool in multiprocessing is and how to use one.
A Python snippet to play with
Let’s take the following code.
import random, time

def calculate_something(i):
    time.sleep(5)
    print(random.randint(10, 100)*i)

for i in range(5):
    calculate_something(i)
This loop will take about 5 * 5 seconds to complete (around 25 seconds).
We loop five times and call a function that calculates something for us. We use time.sleep
to pretend the function is doing more work than it is. This gives us a good reason to look into doing things in parallel.
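If you want to verify the timing yourself, one way (a small sketch of my own, not part of the original snippet) is to wrap the loop with time.perf_counter:

import random, time

def calculate_something(i):
    time.sleep(5)
    print(random.randint(10, 100)*i)

start = time.perf_counter()          # start the clock
for i in range(5):
    calculate_something(i)
print(f"took {time.perf_counter() - start:.1f}s")  # roughly 25 seconds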
Introducing Multiprocessing
Multiprocessing is pretty simple. Do all of the above, but instead of running every operation on a single process, hand each one off to a separate process that can run it simultaneously.
import random, time, multiprocessing

def calculate_something(i):
    time.sleep(5)
    print(random.randint(10, 100)*i)

processes = []
for i in range(5):
    p = multiprocessing.Process(target=calculate_something, args=(i,))
    processes.append(p)
    p.start()

for p in processes:
    p.join()
Now they will all run in parallel, and the whole thing will complete in around 5 seconds.
But what if you had 1000 items in your loop, and only 4 processors on your machine?
This is where Pools shine.
Introducing Pools
Multiprocessing was easy, but Pools are even easier!
Let’s convert the above code to use pools:
import random, time, multiprocessing

def calculate_something(i):
    time.sleep(5)
    print(random.randint(10, 100)*i)

pool = multiprocessing.Pool(multiprocessing.cpu_count()-1)

for i in range(1000):
    pool.apply_async(calculate_something, args=(i,))

pool.close()
pool.join()
So what’s actually happening here?
We create a pool with multiprocessing.Pool() and tell it to use one fewer CPU than we have. The reason for this is to avoid locking up the machine for other tasks.
So let's say we have 8 CPUs in total: the pool will allocate 7 worker processes and run at most 7 tasks at a time. Whichever worker finishes first picks up the next task from the queue, and so it continues until all 1000 tasks have been completed.
Note: if you only have 2 processors, you might want to remove the -1
from multiprocessing.cpu_count()-1
. Otherwise, it will only do things on a single CPU!
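If you want to keep the -1 on bigger machines but still get parallelism on a 2-core box, one option (my own sketch, not from the original) is to clamp the worker count:

import multiprocessing

if __name__ == "__main__":
    # Leave one CPU free on larger machines, but never drop below 2 workers,
    # so a 2-core machine still runs tasks in parallel.
    workers = max(2, multiprocessing.cpu_count() - 1)
    pool = multiprocessing.Pool(workers)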