```python
import geopandas
import cuspatial
import pandas
from tools import sjoin_gpu
from tqdm import tqdm
from math import ceil

uprn_p = '/rapids/notebooks/data/tmp/epc_uprn.pq'
ss_p = '/rapids/notebooks/data/tmp/sss.pq'
pc_p = '/rapids/notebooks/data/tmp/postcode_pts.pq'
```
Spatial Join
We run this on the latest official RAPIDS container (on an NVIDIA GPU-accelerated machine), which we can launch with:
```sh
docker run --gpus all --rm -it \
    -p 8889:8888 -p 8788:8787 -p 8786:8786 \
    -v /media/dani/DataStore/data/:/rapids/notebooks/data \
    -v ${PWD}:/rapids/notebooks/work \
    rapidsai/rapidsai:cuda11.4-runtime-ubuntu20.04-py3.9
```
With this setup, we can access the same `work` and `data` folders as in the previous notebook.
Check GPU-based spatial join validity
Before we run the spatial join on the whole dataset, and since `cuspatial` is a relatively new library compared to `geopandas`, we perform a check on a small sample to confirm the results from the spatial join are the same. We will read into RAM the first 1,600 EPC properties (`uprn`) and join them to the spatial signature polygons (`ss`):
```python
%%time
uprn = geopandas.read_parquet(uprn_p).head(1600)
ss = geopandas.read_parquet(ss_p)
```
CPU times: user 26.2 s, sys: 7.71 s, total: 34 s
Wall time: 29.5 s
Then we move them to the GPU:
```python
%%time
uprn_gpu = cuspatial.from_geopandas(uprn)
ss_gpu = cuspatial.from_geopandas(ss)
```
CPU times: user 7.78 s, sys: 675 ms, total: 8.45 s
Wall time: 8.39 s
And perform the GPU-backed spatial join:
```python
%time tst_gpu = sjoin_gpu(uprn_gpu, ss_gpu)
```
/opt/conda/envs/rapids/lib/python3.9/site-packages/cuspatial/core/spatial/indexing.py:193: UserWarning: scale 5 is less than required minimum scale 9345.561538461538. Clamping to minimum scale
warnings.warn(
/opt/conda/envs/rapids/lib/python3.9/site-packages/cuspatial/core/spatial/join.py:171: UserWarning: scale 5 is less than required minimum scale 9345.561538461538. Clamping to minimum scale
warnings.warn(
CPU times: user 649 ms, sys: 40.1 ms, total: 689 ms
Wall time: 686 ms
And the same with `geopandas`:
```python
%%time
tst = geopandas.sjoin(uprn, ss, how='left')
```
CPU times: user 1.78 s, sys: 757 µs, total: 1.78 s
Wall time: 1.78 s
We can see computation time is much shorter on the GPU (and the gap grows notably as the number of points increases, reaching at least a 20x speed-up). To compare the two results, we join them into a single table:
```python
check = tst.join(
    tst_gpu.to_pandas().set_index('LMK_KEY'),
    on='LMK_KEY',
    rsuffix='_gpu'
)
```
And check that the unique identifiers of each EPC property (`id` and `id_gpu`) are the same:
```python
(check['id'] != check['id_gpu']).sum()
```
1
The only instance in this sample that differs does not actually differ: it is a point that is not joined to any polygon and hence has `NaN` values:
```python
check[check.eval('id != id_gpu')]
```
| | LMK_KEY | CONSTRUCTION_AGE_BAND | UPRN | geometry | index_right | id | code | type | point_index | UPRN_gpu | CONSTRUCTION_AGE_BAND_gpu | id_gpu | type_gpu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 559 | 887304392732013022216585817278109 | England and Wales: 2007 onwards | 10090070569 | POINT (452546.000 533673.000) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
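This is expected behaviour rather than a discrepancy: in pandas, `NaN` never compares equal to anything, including itself, so a row that is unmatched in *both* joins still registers as a "difference" under `!=`. A minimal sketch with toy data (not the real EPC table):

```python
import pandas as pd
import numpy as np

# Toy version of the `check` table: the third row is unmatched in
# both joins, so `id` and `id_gpu` are both NaN
check = pd.DataFrame(
    {"id": ["a", "b", np.nan], "id_gpu": ["a", "b", np.nan]}
)

# NaN != NaN evaluates to True, so the unmatched row is counted
# as a "difference" even though the two joins agree
n_diff = (check["id"] != check["id_gpu"]).sum()  # -> 1

# Excluding rows where both sides are NaN shows no real disagreement
really_differ = (check["id"] != check["id_gpu"]) & ~(
    check["id"].isna() & check["id_gpu"].isna()
)
really_differ.sum()  # -> 0
```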
With this, we confirm we can use the GPU-backed spatial join, and proceed to deployment to the entire dataset.
Join UPRNs to Spatial Signatures on a GPU
We read into RAM the two tables, this time without subsetting:
```python
%%time
uprn = geopandas.read_parquet(uprn_p)
ss = geopandas.read_parquet(ss_p)
```
CPU times: user 24.7 s, sys: 7.48 s, total: 32.1 s
Wall time: 28 s
Then we move them to the GPU:
```python
%%time
uprn_gpu = cuspatial.from_geopandas(uprn)
ss_gpu = cuspatial.from_geopandas(ss)
```
CPU times: user 3min 46s, sys: 6.65 s, total: 3min 52s
Wall time: 3min 49s
And we are ready to perform the GPU-backed spatial join. Because the GPU on which this is being run only has 8GB of memory, we need to chunk the computation. We will do this by joining `chunk_size` points at a time and storing the results back in RAM. Once finished, we save the resulting table to disk. We can set this up with a simple `for` loop:
```python
%%time
out = []
chunk_size = 500000
for i in tqdm(range(ceil(len(uprn_gpu) / chunk_size))):
    # Slice `chunk_size` points at a time, join on the GPU,
    # and move the result back to RAM
    chunk = uprn_gpu.iloc[i * chunk_size : (i + 1) * chunk_size, :]
    sjoined = sjoin_gpu(chunk, ss_gpu, scale=10000)
    out.append(sjoined.to_pandas())
out = pandas.concat(out)
out.to_parquet('/rapids/notebooks/data/tmp/epc_uprn_ss.pq')
```
37%|███▋ | 17/46 [06:55<13:08, 27.19s/it]
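The chunking arithmetic can be sanity-checked on the CPU: `ceil(n / chunk_size)` half-open slices of width `chunk_size` cover every row exactly once, with no gaps or overlaps. A quick sketch with a hypothetical row count (not the real table size):

```python
from math import ceil

n, chunk_size = 2_345_678, 500_000  # hypothetical row count
slices = [
    (i * chunk_size, min((i + 1) * chunk_size, n))
    for i in range(ceil(n / chunk_size))
]

# Contiguous, non-overlapping coverage from row 0 to row n
assert slices[0][0] == 0
assert all(
    prev_end == next_start
    for (_, prev_end), (next_start, _) in zip(slices, slices[1:])
)
assert slices[-1][1] == n
```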
```python
! du -h /rapids/notebooks/data/tmp/epc_uprn*
```
Join Postcode centroids to Spatial Signatures to Land Registry
We replicate the approach above to join the centroid of each postcode to the spatial signature where they are located. For this, we first read into RAM both tables, postcode centroids and signature polygons:
```python
%%time
pc = geopandas.read_parquet(pc_p)
ss = geopandas.read_parquet(ss_p)
```
CPU times: user 630 ms, sys: 211 ms, total: 841 ms
Wall time: 742 ms
Then we move them to the GPU:
```python
%%time
pc_gpu = cuspatial.from_geopandas(pc)
ss_gpu = cuspatial.from_geopandas(ss)
```
CPU times: user 11.1 s, sys: 516 ms, total: 11.6 s
Wall time: 11.6 s
And we are ready to perform the GPU-backed spatial join. In this case, the dataset fits into the GPU all at once, so the code is greatly simplified:
```python
%%time
pc_ss = sjoin_gpu(
    pc_gpu, ss_gpu, pts_cols=['PCD', 'lr_upc'], poly_cols=['id', 'type']
)
```
/opt/conda/envs/rapids/lib/python3.9/site-packages/cuspatial/core/spatial/indexing.py:193: UserWarning: scale 5 is less than required minimum scale 9345.561538461538. Clamping to minimum scale
warnings.warn(
/opt/conda/envs/rapids/lib/python3.9/site-packages/cuspatial/core/spatial/join.py:171: UserWarning: scale 5 is less than required minimum scale 9345.561538461538. Clamping to minimum scale
warnings.warn(
CPU times: user 1min 6s, sys: 104 ms, total: 1min 6s
Wall time: 1min 6s
Now we can bring back the table we prepared earlier, attach signature types to each Land Registry sale, and write back to disk:
```python
(
    pandas.read_parquet(
        '/rapids/notebooks/data/tmp/sales_by_month_pc.pq'
    )
    .join(
        pc_ss.to_pandas()[['lr_upc', 'id', 'type']].set_index('lr_upc'),
        on='postcode'
    )
    .dropna()
).to_parquet('/rapids/notebooks/data/tmp/sales_by_month_pc_ss.pq')
```
Method documentation
Since the method used to perform the spatial join (`sjoin_gpu`) was written for this project, it might be helpful to print its documentation here:
You can download the file with the function here.
```python
sjoin_gpu?
```
```
Signature:
sjoin_gpu(
    pts_gdf,
    poly_gdf,
    scale=5,
    max_depth=7,
    max_size=125,
    pts_cols=['LMK_KEY', 'UPRN', 'CONSTRUCTION_AGE_BAND'],
    poly_cols=['id', 'type'],
)
Docstring:
Spatial Join on a GPU
...
Adapted from:
> https://docs.rapids.ai/api/cuspatial/stable/user_guide/users.html#cuspatial.quadtree_point_in_polygon

Arguments
---------
pts_gdf : geopandas.GeoDataFrame/cuspatial.GeoDataFrame
          Table with points
poly_gdf : geopandas.GeoDataFrame/cuspatial.GeoDataFrame
          Table with polygons
scale : int
          [From `cuspatial` docs. Default=5] A scaling function that increases
          the size of the point space from an origin defined by `{x_min, y_min}`.
          This can increase the likelihood of generating well-separated quads.
max_depth : int
          [From `cuspatial` docs. Default=7] In order for a quadtree to index
          points effectively, it must have a depth that is log-scaled with the
          size of the number of points. Each level of the quad tree contains 4
          quads. The number of available quads $q$ for indexing is then equal
          to $q = 4^d$ where $d$ is the max_depth parameter. With an input size
          of 10m points and `max_depth` = 7, points will be most efficiently
          packed into the leaves of the quad tree.
max_size : int
          [From `cuspatial` docs. Default=125] Maximum number of points allowed
          in a node before it's split into 4 leaf nodes.
pts_cols : list
          [Optional. Default=['UPRN', 'CONSTRUCTION_AGE_BAND']] Column names in
          `pts_gdf` to be joined in the output
poly_cols : list
          [Optional. Default=['id', 'type']] Column names in `poly_gdf` to be
          joined in the output

Returns
-------
sjoined : cudf.DataFrame
          Table with `pts_cols` and `poly_cols` spatially joined
File:      /rapids/notebooks/work/tools.py
Type:      function
```