Set up an ad-hoc Dask cluster

This guide documents how to set up an ad-hoc cluster of machines that can run Dask processes.

Requirements

For this guide, we will assume you have access to the following:

  • At least two machines connected to a network that allows them to talk to each other over ssh

  • Ability to run gds_env containers on each machine (for this example, we will use gds_dev, but the same should work with gds_py and gds)

Intuition

Dask has several ways of setting up a cluster, including over ssh or with orchestration tools such as Kubernetes. For more thorough coverage, please visit the official docs. In this guide, we will connect a few machines that can talk to each other over ssh. This is handy, for example, if you have a few computers running on the same local network.

The structure of a Dask cluster relies on a single “entry point”, the “scheduler”, which recruits resources from a series of additional machines, the “workers” (potentially including the scheduler’s own machine).
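As an aside, if every machine already accepts passwordless ssh, Dask can automate the steps below with its SSHCluster helper. A minimal sketch, assuming placeholder hostnames and the asyncssh package installed where this code runs:

# Sketch only: the first host in the list runs the scheduler, the rest
# run workers. Hostnames below are placeholders.
from dask.distributed import Client, SSHCluster

cluster = SSHCluster(
    ["scheduler-host", "worker-1", "worker-2"],
    scheduler_options={"port": 8786, "dashboard_address": ":8787"},
)
client = Client(cluster)

The rest of this guide does the same thing by hand, which makes it clearer what each piece is doing.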

The sequence of actions will be as follows:

  1. Start the scheduler

  2. Start workers on different machines, attaching them to the scheduler

  3. Use the cluster on a Python session with a Dask Client

Scheduler

We will run the scheduler inside a Jupyter Lab session that will then use the cluster, although this need not be the case. For illustration, we will refer to the IP address at which this machine can be reached as <scheduler-ip>.

  1. Fire up a terminal and type:

docker run --rm -ti -p 8888:8888 -p 8787:8787 -p 8786:8786 darribas/gds_dev:latest start.sh

This will spin up a gds_dev container with a JupyterLab instance, opening ports 8888 (JupyterLab), 8787 (Dask dashboard) and 8786 (scheduler).

  2. Inside JupyterLab, open a terminal window and launch the scheduler with:

dask-scheduler
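The scheduler will log the TCP address workers should connect to. If you want to confirm it is accepting connections before recruiting workers, a quick check from a notebook in the same container might look like this (it assumes the default port, 8786):

# Connectivity check (assumes the scheduler runs in this same container
# and listens on the default port, 8786).
import socket

sock = socket.create_connection(("127.0.0.1", 8786), timeout=5)
sock.close()
print("Scheduler is accepting connections")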

Workers

Now go to each of the machines in the network you want to recruit as workers. Open a terminal and run:

docker run --network="host" --rm -ti darribas/gds_dev:latest start.sh dask-worker <scheduler-ip>:8786

This will launch a container that starts a worker process and attaches it to the scheduler at <scheduler-ip>:8786.

It’s important to include the --network="host" parameter: the scheduler needs to open connections back to the worker, and host networking exposes the worker’s ports directly on the machine’s own network interface (it also makes publishing ports with -p unnecessary).

If you want to start a worker on the same machine as the scheduler, still point it at the scheduler’s network address: run dask-worker <scheduler-ip>:8786, not dask-worker localhost:8786.
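Once a few workers are up, you can confirm they registered with the scheduler from any Python session that can reach it. A small check, with the scheduler address as a placeholder:

# List the workers the scheduler currently knows about. Replace the
# placeholder address with your own scheduler's.
from dask.distributed import Client

client = Client("tcp://<scheduler-ip>:8786")
workers = client.scheduler_info()["workers"]
print(f"{len(workers)} worker(s) connected:")
for address in workers:
    print(" -", address)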

Use the cluster

Once the cluster is available, we can use it from a Python session.

  1. Open a notebook or Python console in the same JupyterLab instance as the scheduler

  2. Import Dask’s Client

from dask.distributed import Client
  3. Set up a client for the session

client = Client("tcp://172.17.0.2:8786")

Replace tcp://172.17.0.2:8786 with the address where the scheduler is running (you can see this in the scheduler’s log feed).

  4. You can check the status of the cluster through the dashboard at <scheduler-ip>:8787 (the client can also report this address, as shown below)
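The client object itself knows where the dashboard lives, which is handy if you do not remember the scheduler’s IP. This assumes the client created in the previous step:

# Report where the cluster lives (uses the `client` created above).
print(client.dashboard_link)               # dashboard URL
print(client.scheduler_info()["address"])  # the scheduler's address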


Now that the cluster is set up and linked to the session, any Dask operation you run will have its computation distributed across the cluster.
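As a quick end-to-end test, the sketch below builds a random Dask array and computes its mean; while it runs, you should see tasks stream through the dashboard. It assumes the client created above is active in the session:

# The array is split into chunks, and the scheduler farms the chunk
# computations out to the workers.
import dask.array as da

x = da.random.random((20_000, 20_000), chunks=(2_000, 2_000))
result = x.mean().compute()
print(result)  # should be close to 0.5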