Set up ad-hoc Dask cluster¶
This guide documents how to set up an ad-hoc cluster of machines that can run Dask processes.
Requirements¶
For this guide, we will assume you have access to the following:
- At least two machines connected to a network that allows them to talk to each other over ssh
- Ability to run gds_env containers on each machine (for this example, we will use gds_dev, but the same should work with gds_py and gds)
Intuition¶
Dask has several ways of setting up a cluster, including over ssh or with orchestration tools such as Kubernetes. For more thorough coverage, please visit the official docs. In this guide, we will connect a few machines that can talk to each other over ssh. This is handy, for example, if you have a few computers running on the same local network.
A Dask cluster is structured around a single “entry point”, the “scheduler”, which recruits resources from a series of additional machines, the “workers” (potentially including the scheduler’s own machine).
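As a point of reference, Dask can create this same topology automatically on a single machine with a LocalCluster; the rest of this guide builds the distributed equivalent by hand. A minimal sketch, assuming dask.distributed is installed (as it is in the gds_env containers):

```python
from dask.distributed import Client, LocalCluster

# A scheduler plus two worker processes, all on this machine
cluster = LocalCluster(n_workers=2, threads_per_worker=2)

# A client attached to that scheduler (the same pattern we will use below)
client = Client(cluster)
print(client)
```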
The sequence of actions will be as follows:

1. Start the scheduler
2. Start workers on different machines, attaching them to the scheduler
3. Use the cluster in a Python session with a Dask Client
Scheduler¶
We will run the scheduler inside a JupyterLab session that will then use the cluster, although this need not be the case. For illustration, we will refer to the IP at which this machine can be reached as <scheduler-ip>.
Fire up a terminal and type:
docker run --rm -ti -p 8888:8888 -p 8787:8787 -p 8786:8786 darribas/gds_dev:latest start.sh
This will spin up a gds_dev container with a JupyterLab instance, opening ports 8888 (JupyterLab), 8787 (Dask dashboard), and 8786 (scheduler).
Inside JupyterLab, open a terminal window and launch the scheduler with:
dask-scheduler
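The defaults already match the ports published above. If you need to be explicit, or to move the scheduler to different ports, dask-scheduler takes flags for both; a sketch (check dask-scheduler --help for the options available in your version):

```bash
dask-scheduler --port 8786 --dashboard-address :8787
```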
Workers¶
Now go to each of the machines in the network you want to recruit as workers. Open a terminal and run:
docker run --network="host" --rm -ti darribas/gds_dev:latest start.sh dask-worker <scheduler-ip>:8786
This will launch a container that starts a worker process and attaches it to the scheduler (through the <scheduler-ip>:8786 address).
It’s important to include the --network="host" parameter so that the worker can be reached from other machines on the network; without it, the worker would advertise a container-internal address that the scheduler cannot connect to.
If you want to start a worker on the same machine as the scheduler, still use dask-worker <scheduler-ip>:8786 (not localhost).
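By default, a worker will use every core and as much memory as it can see on the machine. If you want to cap what a machine contributes to the cluster, dask-worker accepts resource flags; for example (a sketch, check dask-worker --help in your version for the exact options):

```bash
dask-worker <scheduler-ip>:8786 --nthreads 4 --memory-limit 8GB
```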
Use the cluster¶
Once the cluster is available, we can use it in a Python session:
1. Open a notebook or Python console in the same JupyterLab instance as the scheduler
2. Import Dask’s Client:
   from dask.distributed import Client
3. Set up a client for the session:
   client = Client("tcp://172.17.0.2:8786")
   Replace tcp://172.17.0.2:8786 with the address at which the scheduler is running (you can see it in the scheduler’s log feed).
You can check the status of the cluster on the Dask dashboard at <scheduler-ip>:8787.
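You can also inspect the cluster from the client itself; for example:

```python
# Summary of the cluster: scheduler address, workers, cores, memory
print(client)

# Per-worker details, keyed by each worker's address
client.scheduler_info()["workers"]
```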
Now that the cluster is set up and linked to the session, any Dask operation you run will have its computation distributed across the workers.
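As an illustration, a chunked array computation gets split into tasks that the scheduler farms out to the workers (you can watch the work spread across them live on the dashboard):

```python
import dask.array as da

# A 10,000 x 10,000 random array, split into 1,000 x 1,000 chunks
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# .compute() ships the task graph to the scheduler, which
# distributes the chunk-level work across the workers
result = x.mean().compute()
print(result)
```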