Multiprocessing Pools in Python


Python ships with a multiprocessing module that lets your code run functions in parallel by handing calls off to separate processes, which the operating system can schedule across the available processors.

In this guide, we will look at what a multiprocessing Pool is and when it is useful.

A Python snippet to play with

Let’s take the following code.

import random, time

def calculate_something(i):
    time.sleep(5)
    print(random.randint(10, 100)*i)

for i in range(5):
    calculate_something(i)

This loop will take about 5 × 5 = 25 seconds to complete.

We loop 5 times and call a function that calculates something for us. We use time.sleep to pretend that the function is doing more work than it really is. This gives us a good reason to look into doing things in parallel.
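To check the timing claim yourself, here is a minimal sketch that measures the loop with time.perf_counter. The DELAY constant and the run_sequentially helper are assumptions added for illustration (the delay is shortened so the demo finishes quickly), and the function returns its value instead of printing it.

```python
import random
import time

DELAY = 0.2  # assumption: shortened from 5 seconds so the demo finishes quickly


def calculate_something(i, delay=DELAY):
    time.sleep(delay)  # stand-in for real work
    return random.randint(10, 100) * i


def run_sequentially(count, delay=DELAY):
    """Call calculate_something count times in a row and return the elapsed seconds."""
    start = time.perf_counter()
    for i in range(count):
        calculate_something(i, delay)
    return time.perf_counter() - start


if __name__ == "__main__":
    elapsed = run_sequentially(5)
    print(f"5 calls took {elapsed:.2f} s (expected about {5 * DELAY:.2f} s)")
```

The elapsed time grows linearly with the number of calls, which is exactly what running them in parallel is meant to avoid.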

Introducing Multiprocessing

The idea behind multiprocessing is pretty simple. Do the same work as above, but instead of running every call in a single process, hand each call to its own process so that they can all run at the same time.

import random, time, multiprocessing

def calculate_something(i):
    time.sleep(5)
    print(random.randint(10, 100)*i)

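# The __main__ guard keeps child processes from re-running this block on
# platforms that start them with a fresh interpreter (Windows, macOS).
if __name__ == "__main__":
    processes = []

    # Start one process per call
    for i in range(5):
        p = multiprocessing.Process(target=calculate_something, args=(i,))
        processes.append(p)
        p.start()

    # Wait for every process to finish
    for p in processes:
        p.join()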

Now the calls all run in parallel, and the whole thing completes in around 5 seconds.

But what if you had 1000 items in your loop, and only 4 processors on your machine? Starting 1000 processes at once would just leave them competing for those 4 processors.

This is where Pools shine.

Introducing Pools

Multiprocessing was easy, but Pools are even easier!

Let’s convert the above code to use pools:

import random, time, multiprocessing

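def calculate_something(i):
    time.sleep(5)  # stand-in for real work
    print(random.randint(10, 100) * i)

if __name__ == "__main__":
    # Leave one CPU free for the rest of the system
    pool = multiprocessing.Pool(multiprocessing.cpu_count() - 1)

    for i in range(1000):
        # args must be a tuple, hence the trailing comma
        pool.apply_async(calculate_something, args=(i,))

    pool.close()  # no more tasks will be submitted
    pool.join()   # wait for all queued tasks to finish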

So what’s actually happening here?

We create a pool with multiprocessing.Pool() and tell it to start one fewer worker process than we have CPUs. This leaves a CPU free so that the machine does not lock up for other tasks.
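As a small sketch of that sizing rule, the helper below computes the worker count and never drops below one. The name pool_size and its reserved parameter are hypothetical, not part of the multiprocessing module. The floor of one matters because Pool raises a ValueError when asked for zero processes.

```python
import multiprocessing


def pool_size(cpu_total, reserved=1):
    """Number of workers to start: leave `reserved` CPUs free, but never go below one."""
    return max(1, cpu_total - reserved)


if __name__ == "__main__":
    workers = pool_size(multiprocessing.cpu_count())
    print(f"{multiprocessing.cpu_count()} CPUs detected, starting {workers} workers")

    # The built-in abs is used here only as a trivial task for the pool
    with multiprocessing.Pool(workers) as pool:
        print(pool.map(abs, [-1, -2, -3]))
```

With 8 CPUs this gives 7 workers, and with 1 or 2 CPUs it gives a single worker.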

So let’s say we have 8 CPUs in total. The pool will start 7 worker processes and run at most 7 tasks at a time. The first worker to finish takes the next task from the queue, and so it continues until all 1000 tasks have been completed.
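To get a feel for what this means for the total running time, here is a rough back-of-the-envelope sketch. The estimated_seconds function is a hypothetical helper, and it assumes every task takes exactly the same time and ignores process start-up and scheduling overhead, so treat the numbers as a lower bound.

```python
import math


def estimated_seconds(task_count, worker_count, seconds_per_task):
    """Idealised wall-clock time when every task takes the same time.

    Tasks run in waves of worker_count, so the number of waves is the
    task count divided by the worker count, rounded up.
    """
    waves = math.ceil(task_count / worker_count)
    return waves * seconds_per_task


if __name__ == "__main__":
    print(estimated_seconds(1000, 7, 5))  # pool of 7 workers
    print(estimated_seconds(1000, 1, 5))  # one task at a time
```

For the 1000 five-second tasks above this works out to 715 seconds with 7 workers, against 5000 seconds when the tasks run one after another.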

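Note that if you only have 2 processors, you might want to remove the -1 from multiprocessing.cpu_count()-1. Otherwise, the pool will have a single worker process and run only one task at a time! On a single-processor machine the result would be 0, which Pool rejects with a ValueError.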
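One more thing worth knowing about the apply_async loop: each call returns an AsyncResult object, and an exception raised inside a task only surfaces when you call get() on that object. If you ignore the return value, a failing task fails silently. The sketch below keeps the results and collects values and errors separately. The collect helper, the shortened delay, and the deliberate failure for item 3 are all assumptions made for illustration.

```python
import multiprocessing
import time


def calculate_something(i):
    time.sleep(0.1)  # assumption: shortened stand-in for the 5-second task
    if i == 3:
        # Hypothetical failure, added only to show how errors surface
        raise ValueError(f"cannot handle item {i}")
    return i * 10


def collect(async_results):
    """Split AsyncResult objects into return values and error messages."""
    values, errors = [], []
    for result in async_results:
        try:
            # get() blocks until the task is done and re-raises its exception
            values.append(result.get())
        except Exception as exc:
            errors.append(str(exc))
    return values, errors


if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        pending = [pool.apply_async(calculate_something, args=(i,)) for i in range(5)]
        values, errors = collect(pending)

    print(values)
    print(errors)
```

Run as a script, this should print the four successful values followed by the single error message for item 3.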