Dask

Dask is a Python library for parallel and distributed computing. Dask is …

  • Easy to use and set up (it’s just a Python library)

  • Powerful at providing scale and unlocking complex algorithms

  • Fun 🎉

How to Use Dask

Dask provides several APIs. Choose the one that works best for you:

Dask Futures parallelize arbitrary for-loop style Python code, providing:

  • Flexible tooling allowing you to construct custom pipelines and workflows

  • Powerful scaling techniques, processing several thousand tasks per second

  • Responsive feedback allowing for intuitive execution, and helpful dashboards

Dask futures form the foundation for other Dask work.

Learn more at Futures Documentation or see an example at Futures Example

from dask.distributed import LocalCluster
client = LocalCluster().get_client()

# Submit work to happen in parallel
# (filenames, load, and process stand in for your own data and functions)
results = []
for filename in filenames:
    data = client.submit(load, filename)    # load each file on the cluster
    result = client.submit(process, data)   # then process it, also remotely
    results.append(result)

# Gather results back to local computer
results = client.gather(results)

Dask Dataframes parallelize the popular pandas library, providing:

  • Larger-than-memory execution for single machines, allowing you to process data that is larger than your available RAM

  • Parallel execution for faster processing

  • Distributed computation for terabyte-sized datasets

Dask Dataframes are similar in this regard to Apache Spark, but use the familiar pandas API and memory model. One Dask dataframe is simply a collection of pandas dataframes on different computers.

Learn more at DataFrame Documentation or see an example at DataFrame Example

import dask.dataframe as dd

# Read large datasets in parallel
df = dd.read_parquet("s3://mybucket/data.*.parquet")
df = df[df.value < 0]
result = df.groupby(df.name).amount.mean()

result = result.compute()  # Compute to get pandas result
result.plot()
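
Each partition of a Dask dataframe is an ordinary pandas object, which you can inspect directly (a minimal sketch, reusing the df from above):

df.npartitions                     # how many pandas dataframes back this object
part = df.partitions[0].compute()  # materialize the first partition
type(part)                         # pandas.core.frame.DataFrame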

Dask Arrays parallelize the popular NumPy library, providing:

  • Larger-than-memory execution for single machines, allowing you to process data that is larger than your available RAM

  • Parallel execution for faster processing

  • Distributed computation for terabyte-sized datasets

Dask Arrays allow scientists and researchers to perform intuitive and sophisticated operations on large datasets while using the familiar NumPy API and memory model. One Dask array is simply a collection of NumPy arrays on different computers.

Learn more at Array Documentation or see an example at Array Example

import dask.array as da

x = da.random.random((10000, 10000))
y = (x + x.T) - x.mean(axis=1)

z = y.var(axis=0).compute()
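
The chunked structure is visible on the array itself (a minimal sketch, reusing the x from above):

x.chunksize               # shape of each underlying NumPy block
x.numblocks               # grid of blocks making up the Dask array
x.blocks[0, 0].compute()  # materialize one block as a NumPy array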

Xarray wraps Dask array and is a popular downstream project, providing labeled axes and tracking many Dask arrays together, resulting in more intuitive analyses. Xarray accounts for the majority of Dask array use today, especially within the geospatial and imaging communities.

Learn more at Xarray Documentation or see an example at Xarray Example

import xarray as xr

ds = xr.open_mfdataset("data/*.nc")  # open many netCDF files as one dataset
ds.groupby("time.month").mean("time").compute()

Dask Bags are simple parallel Python lists, commonly used to process text or raw Python objects. They are …

  • Simple, offering easy map and reduce functionality

  • Low-memory, processing data in a streaming way that minimizes memory use

  • Good for preprocessing, especially for text or JSON data prior to ingestion into dataframes

Dask bags are similar in this regard to Spark RDDs or vanilla Python data structures and iterators. One Dask bag is simply a collection of Python iterators processing in parallel on different computers.

Learn more at Bag Documentation or see an example at Bag Example

import json

import dask.bag as db

# Read large datasets in parallel
lines = db.read_text("s3://mybucket/data.*.json")
records = (lines
    .map(json.loads)
    .filter(lambda d: d["value"] > 0)
)
df = records.to_dataframe()  # convert to a Dask dataframe

How to Install Dask

Installing Dask is easy with pip or conda.

Learn more at Install Documentation

pip install "dask[complete]"
conda install dask

How to Deploy Dask

You can then use Dask on a single machine or deploy it on distributed hardware.

Learn more at Deploy Documentation

Dask can set itself up easily in your Python session: just create a LocalCluster object, which handles everything for you.

from dask.distributed import LocalCluster
cluster = LocalCluster()
client = cluster.get_client()

# Normal Dask work ...

Alternatively, you can skip this part, and Dask will operate within a thread pool contained entirely within your local process.
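
For example (a minimal sketch), computing on a Dask collection without creating any cluster falls back to this default local scheduler:

import dask.array as da

# No cluster or client needed: this runs in a thread pool
# inside the current process
x = da.random.random((1000, 1000))
print(x.mean().compute())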

The dask-kubernetes project provides a Dask Kubernetes Operator.

from dask_kubernetes.operator import KubeCluster
cluster = KubeCluster(
    name="my-dask-cluster",
    image="ghcr.io/dask/dask:latest",
)
cluster.scale(10)

Learn more at Dask Kubernetes Documentation

The dask-jobqueue project interfaces with popular job submission systems like SLURM, PBS, SGE, LSF, Torque, Condor, and others.

from dask_jobqueue import SLURMCluster

cluster = SLURMCluster()
cluster.scale(jobs=10)
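
As with the other cluster managers, you then connect a client to run work on those jobs (a minimal sketch):

from dask.distributed import Client

client = Client(cluster)  # connect to the SLURMCluster created above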

You can also deploy Dask with MPI:

# myscript.py
from dask_mpi import initialize
initialize()

from dask.distributed import Client
client = Client()  # Connect this local process to remote workers

Run the script with MPI:

$ mpirun -np 4 python myscript.py

Learn more at Dask Jobqueue Documentation and the Dask MPI Documentation.

The dask-cloudprovider project interfaces with popular cloud platforms like AWS, GCP, Azure, and DigitalOcean.

from dask_cloudprovider.aws import FargateCluster
cluster = FargateCluster(
    # Cluster manager specific config kwargs
)

Learn more at Dask CloudProvider Documentation

Several companies offer commercial Dask products. These are not open source, but tend to be easier, safer, cheaper, and more fully featured. All options here include solid free offerings for individuals.

  • Coiled provides a standalone Dask deployment product that works in AWS and GCP.

    Coiled notably employs many of the active Dask maintainers today.

    Learn more at Coiled

  • Saturn Cloud provides Dask as part of their hosted platform including Jupyter and other products.

    Learn more at Saturn Cloud

  • Nebari from Quansight provides Dask as part of a Kubernetes-based, GitOps-managed platform, along with Jupyter and other products, suitable for on-prem deployments.

    Learn more at Nebari

Learn with Examples

Dask use is widespread across industries and scales. Dask is used anywhere Python is used and people experience pain from large-scale data or intense computing.

You can learn more about Dask applications at the following sources:

Additionally, we encourage you to look through the reference documentation on this website related to the API that most closely matches your application.

Dask was designed to be easy to use and powerful. We hope that it’s able to help you have fun with your work.