```python
import geopandas
import cuspatial
import pandas
from tools import sjoin_gpu
from tqdm import tqdm
from math import ceil

uprn_p = '/rapids/notebooks/data/tmp/epc_uprn.pq'
ss_p = '/rapids/notebooks/data/tmp/sss.pq'
pc_p = '/rapids/notebooks/data/tmp/postcode_pts.pq'
```
Spatial Join
We run this on the latest official RAPIDS container (on an NVIDIA GPU-accelerated machine), which we can launch with:
```sh
docker run --gpus all --rm -it \
    -p 8889:8888 -p 8788:8787 -p 8786:8786 \
    -v /media/dani/DataStore/data/:/rapids/notebooks/data \
    -v ${PWD}:/rapids/notebooks/work \
    rapidsai/rapidsai:cuda11.4-runtime-ubuntu20.04-py3.9
```
With this setup, we can access the same `work` and `data` folders as in the previous notebook.
Check GPU-based spatial join validity
Before we run the spatial join on the whole dataset, and since `cuspatial` is a relatively new library compared to `geopandas`, we perform a check on a small sample to confirm the results from the spatial join are the same. We will read into RAM the first 1,600 EPC properties (`uprn`) and join them to the spatial signature polygons (`ss`):
```python
%%time
uprn = geopandas.read_parquet(uprn_p).head(1600)
ss = geopandas.read_parquet(ss_p)
```
CPU times: user 26.2 s, sys: 7.71 s, total: 34 s
Wall time: 29.5 s
Then we move them to the GPU:
```python
%%time
uprn_gpu = cuspatial.from_geopandas(uprn)
ss_gpu = cuspatial.from_geopandas(ss)
```
CPU times: user 7.78 s, sys: 675 ms, total: 8.45 s
Wall time: 8.39 s
And perform the GPU-backed spatial join:
```python
%time tst_gpu = sjoin_gpu(uprn_gpu, ss_gpu)
```
/opt/conda/envs/rapids/lib/python3.9/site-packages/cuspatial/core/spatial/indexing.py:193: UserWarning: scale 5 is less than required minimum scale 9345.561538461538. Clamping to minimum scale
warnings.warn(
/opt/conda/envs/rapids/lib/python3.9/site-packages/cuspatial/core/spatial/join.py:171: UserWarning: scale 5 is less than required minimum scale 9345.561538461538. Clamping to minimum scale
warnings.warn(
CPU times: user 649 ms, sys: 40.1 ms, total: 689 ms
Wall time: 686 ms
And the same with `geopandas`:
```python
%%time
tst = geopandas.sjoin(uprn, ss, how='left')
```
CPU times: user 1.78 s, sys: 757 µs, total: 1.78 s
Wall time: 1.78 s
We can see computation time is much shorter on the GPU (and the gap grows notably as the number of points increases, reaching at least a 20x speed-up). To compare the two results, we join them into a single table:
```python
check = tst.join(
    tst_gpu.to_pandas().set_index('LMK_KEY'),
    on='LMK_KEY',
    rsuffix='_gpu'
)
```
And check that the unique identifiers of each EPC property (`id` and `id_gpu`) are the same:
```python
(check['id'] != check['id_gpu']).sum()
```
1
The only instance in this sample that differs does not actually differ: it is a point that is not joined to any polygon and hence has `NaN` values:
```python
check[check.eval('id != id_gpu')]
```
| | LMK_KEY | CONSTRUCTION_AGE_BAND | UPRN | geometry | index_right | id | code | type | point_index | UPRN_gpu | CONSTRUCTION_AGE_BAND_gpu | id_gpu | type_gpu |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 559 | 887304392732013022216585817278109 | England and Wales: 2007 onwards | 10090070569 | POINT (452546.000 533673.000) | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
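This is expected behaviour rather than a discrepancy: in pandas, `NaN` never compares equal to anything, including itself, so a row that is unmatched in *both* joins still registers as a "difference" under `!=`. A minimal sketch with toy data (not the real EPC table):

```python
import pandas as pd
import numpy as np

# Toy version of the `check` table: the third row is unmatched in
# both joins, so `id` and `id_gpu` are both NaN
check = pd.DataFrame(
    {"id": ["a", "b", np.nan], "id_gpu": ["a", "b", np.nan]}
)

# NaN != NaN evaluates to True, so the unmatched row is counted
# as a "difference" even though the two joins agree
n_diff = (check["id"] != check["id_gpu"]).sum()  # -> 1

# Excluding rows where both sides are NaN shows no real disagreement
really_differ = (check["id"] != check["id_gpu"]) & ~(
    check["id"].isna() & check["id_gpu"].isna()
)
really_differ.sum()  # -> 0
```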
With this, we confirm we can use the GPU-backed spatial join, and proceed to deployment to the entire dataset.
Join UPRNs to Spatial Signatures on a GPU
We read into RAM the two tables, this time without subsetting:
```python
%%time
uprn = geopandas.read_parquet(uprn_p)
ss = geopandas.read_parquet(ss_p)
```
CPU times: user 24.7 s, sys: 7.48 s, total: 32.1 s
Wall time: 28 s
Then we move them to the GPU:
```python
%%time
uprn_gpu = cuspatial.from_geopandas(uprn)
ss_gpu = cuspatial.from_geopandas(ss)
```
CPU times: user 3min 46s, sys: 6.65 s, total: 3min 52s
Wall time: 3min 49s
And we are ready to perform the GPU-backed spatial join. Because the GPU on which this is being run only has 8GB of memory, we need to chunk the computation. We will do this by joining `chunk_size` points at a time and storing the results back in RAM. Once finished, we save the resulting table to disk. We can set this up with a simple `for` loop:
```python
%%time
out = []
chunk_size = 500000
for i in tqdm(range(ceil(len(uprn_gpu) / chunk_size))):
    # Slice `chunk_size` points at a time, join on the GPU,
    # and move the result back to RAM
    chunk = uprn_gpu.iloc[i * chunk_size : (i + 1) * chunk_size, :]
    sjoined = sjoin_gpu(chunk, ss_gpu, scale=10000)
    out.append(sjoined.to_pandas())
out = pandas.concat(out)
out.to_parquet('/rapids/notebooks/data/tmp/epc_uprn_ss.pq')
```
37%|███▋ | 17/46 [06:55<13:08, 27.19s/it]
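The chunking arithmetic can be sanity-checked on the CPU: `ceil(n / chunk_size)` half-open slices of width `chunk_size` cover every row exactly once, with no gaps or overlaps. A quick sketch with a hypothetical row count (not the real table size):

```python
from math import ceil

n, chunk_size = 2_345_678, 500_000  # hypothetical row count
slices = [
    (i * chunk_size, min((i + 1) * chunk_size, n))
    for i in range(ceil(n / chunk_size))
]

# Contiguous, non-overlapping coverage from row 0 to row n
assert slices[0][0] == 0
assert all(
    prev_end == next_start
    for (_, prev_end), (next_start, _) in zip(slices, slices[1:])
)
assert slices[-1][1] == n
```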
```python
! du -h /rapids/notebooks/data/tmp/epc_uprn*
```
Join Postcode centroids to Spatial Signatures to Land Registry
We replicate the approach above to join the centroid of each postcode to the spatial signature where they are located. For this, we first read into RAM both tables, postcode centroids and signature polygons:
```python
%%time
pc = geopandas.read_parquet(pc_p)
ss = geopandas.read_parquet(ss_p)
```
CPU times: user 630 ms, sys: 211 ms, total: 841 ms
Wall time: 742 ms
Then we move them to the GPU:
```python
%%time
pc_gpu = cuspatial.from_geopandas(pc)
ss_gpu = cuspatial.from_geopandas(ss)
```
CPU times: user 11.1 s, sys: 516 ms, total: 11.6 s
Wall time: 11.6 s
And we are ready to perform the GPU-backed spatial join. In this case, the dataset fits into the GPU all at once, so the code is greatly simplified:
```python
%%time
pc_ss = sjoin_gpu(
    pc_gpu, ss_gpu, pts_cols=['PCD', 'lr_upc'], poly_cols=['id', 'type']
)
```
/opt/conda/envs/rapids/lib/python3.9/site-packages/cuspatial/core/spatial/indexing.py:193: UserWarning: scale 5 is less than required minimum scale 9345.561538461538. Clamping to minimum scale
warnings.warn(
/opt/conda/envs/rapids/lib/python3.9/site-packages/cuspatial/core/spatial/join.py:171: UserWarning: scale 5 is less than required minimum scale 9345.561538461538. Clamping to minimum scale
warnings.warn(
CPU times: user 1min 6s, sys: 104 ms, total: 1min 6s
Wall time: 1min 6s
Now we can bring back the table we prepared earlier, attach signature types to each Land Registry sale, and write back to disk:
```python
(
    pandas.read_parquet(
        '/rapids/notebooks/data/tmp/sales_by_month_pc.pq'
    )
    .join(
        pc_ss.to_pandas()[['lr_upc', 'id', 'type']].set_index('lr_upc'),
        on='postcode'
    )
    .dropna()
).to_parquet('/rapids/notebooks/data/tmp/sales_by_month_pc_ss.pq')
```
Method documentation
Since the method used to perform the spatial join (`sjoin_gpu`) was written for this project, it might be helpful to print its documentation here:
You can download the file with the function here.
```python
sjoin_gpu?
```
```
Signature:
sjoin_gpu(
    pts_gdf,
    poly_gdf,
    scale=5,
    max_depth=7,
    max_size=125,
    pts_cols=['LMK_KEY', 'UPRN', 'CONSTRUCTION_AGE_BAND'],
    poly_cols=['id', 'type'],
)
Docstring:
Spatial Join on a GPU
...
Adapted from:
> https://docs.rapids.ai/api/cuspatial/stable/user_guide/users.html#cuspatial.quadtree_point_in_polygon

Arguments
---------
pts_gdf : geopandas.GeoDataFrame/cuspatial.GeoDataFrame
          Table with points
poly_gdf : geopandas.GeoDataFrame/cuspatial.GeoDataFrame
          Table with polygons
scale : int
          [From `cuspatial` docs. Default=5] A scaling function that increases
          the size of the point space from an origin defined by `{x_min, y_min}`.
          This can increase the likelihood of generating well-separated quads.
max_depth : int
          [From `cuspatial` docs. Default=7] In order for a quadtree to index
          points effectively, it must have a depth that is log-scaled with the
          size of the number of points. Each level of the quad tree contains 4
          quads. The number of available quads $q$ for indexing is then equal
          to $q = 4^d$ where $d$ is the max_depth parameter. With an input size
          of 10m points and `max_depth` = 7, points will be most efficiently
          packed into the leaves of the quad tree.
max_size : int
          [From `cuspatial` docs. Default=125] Maximum number of points allowed
          in a node before it's split into 4 leaf nodes.
pts_cols : list
          [Optional. Default=['UPRN', 'CONSTRUCTION_AGE_BAND']] Column names in
          `pts_gdf` to be joined in the output
poly_cols : list
          [Optional. Default=['id', 'type']] Column names in `poly_gdf` to be
          joined in the output

Returns
-------
sjoined : cudf.DataFrame
          Table with `pts_cols` and `poly_cols` spatially joined
File:      /rapids/notebooks/work/tools.py
Type:      function
```