Data acquisition

This document collates the datasets used in this capsule: the Energy Performance Certificates (EPC), the UPRN locations, the Spatial Signature polygons, the Land Registry Price Paid transactions, and the postcode centroids. We first link building age (from the EPC certificates) to UPRN locations through a table join, and then bring in the Spatial Signatures. The two are subsequently joined on the GPU in a separate notebook. Each section details the origin of the data.
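
The spatial join itself happens on the GPU in a separate notebook; conceptually, it is equivalent to the following CPU sketch (a sketch only, assuming a recent geopandas where sjoin takes a predicate argument; the file paths are the ones written further below):

import geopandas

# Certificates with point geometries (written to disk further below)
epc_uprn = geopandas.read_parquet('/home/jovyan/data/tmp/epc_uprn.pq')
# Simplified signature polygons (written to disk further below)
signatures = geopandas.read_parquet('/home/jovyan/data/tmp/sss.pq')
# Attach to each certificate the signature polygon it falls within
joined = geopandas.sjoin(epc_uprn, signatures, how='left', predicate='within')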

import pandas
import geopandas
import dask_geopandas
from pyogrio import read_dataframe
import warnings # To disable some known warnings below

uprn_p = '/home/jovyan/data/uk_os_openuprn/osopenuprn_202210.gpkg'
epc_p = '/home/jovyan/data/uk_epc_certificates/'
ss_p = '/home/jovyan/data/tmp/spatial_signatures_GB.gpkg'
pp_p = '/home/jovyan/data/tmp/pp-complete.csv'
pc_p = '/home/jovyan/data/tmp/postcodes.csv'

Some of the computations will be run in parallel through Dask, so we set up a client for a local cluster with 16 workers (as many as there are threads on the machine where this is run):

import dask.dataframe as ddf
from dask.distributed import LocalCluster, Client

with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    client = Client(LocalCluster(n_workers=16))
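
While the cluster is up, Dask exposes a diagnostics dashboard that can be handy for monitoring the heavier computations below; its URL is available on the client object:

# URL of the diagnostics dashboard for this local cluster
client.dashboard_link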

EPC certificates

These need to be downloaded manually from the official website (https://epc.opendatacommunities.org/). Once unzipped, the data is a collection of .csv files that can be processed efficiently with Dask. Here we specify the computation lazily:

dtypes = {
    'CONSTRUCTION_AGE_BAND': 'str',
    'UPRN': 'str',
    'LMK_KEY': 'str'
}
certs_all = ddf.read_csv(
    f'{epc_p}*/certificates.csv', 
    dtype=dtypes,
    usecols=dtypes
)
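
At this point nothing has been read from disk yet: certs_all is a lazy task graph. As an optional sanity check before triggering the full read, Dask can report the number of partitions and peek at the first few rows by reading only the first partition:

# Number of partitions Dask will process in parallel
certs_all.npartitions
# Reads only enough of the first partition to display five rows
certs_all.head()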

And execute it on the Dask cluster (local in this case) to load the certificates in RAM (NOTE: this will take a significant amount of RAM on your machine). Note that we drop rows with N/A values in any of the three columns, as we need observations for which all three are valid.

%%time
with warnings.catch_warnings():
    warnings.filterwarnings("ignore")
    certs = certs_all.dropna().compute()
CPU times: user 12.4 s, sys: 3.1 s, total: 15.5 s
Wall time: 38.6 s

UPRN coords

UPRNs (Unique Property Reference Numbers) are unique identifiers for properties in Britain. We source their coordinates from the Ordnance Survey’s Open UPRN product (https://www.ordnancesurvey.co.uk/business-government/products/open-uprn), which also needs to be downloaded manually. We use the GPKG format, which already contains a point geometry for each UPRN.

To consume them, we load them up in RAM (NOTE: this will take a significant amount of memory on your machine):

The approach using pyogrio seems to beat a multi-core implementation with dask-geopandas, possibly because the latter relies on geopandas.read_file, even though it spreads the computation across cores. In case of interest, here’s the dask-geopandas version (not run here), followed by the pyogrio one we use:

uprn = dask_geopandas.read_file(
    uprn_p, npartitions=16
).compute()

%%time
uprn = read_dataframe(uprn_p, columns=['UPRN', 'geometry'])
uprn['UPRN'] = uprn['UPRN'].astype(str) 
CPU times: user 56.1 s, sys: 8.79 s, total: 1min 4s
Wall time: 1min 10s

Merge UPRN-EPC

With both tables ready in memory, we merge them so that we attach point geometries to all the EPC certificates through their UPRNs.

%%time
db = geopandas.GeoDataFrame(
    certs.merge(
        uprn, left_on='UPRN', right_on='UPRN', how='left'
    ), crs=uprn.crs
)
CPU times: user 40.9 s, sys: 3.58 s, total: 44.4 s
Wall time: 43.4 s
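
Since this is a left join, certificates whose UPRN does not appear in the Open UPRN table are kept but end up with an empty geometry. An optional check of how many are affected:

# Certificates without a matching UPRN point
db['geometry'].isna().sum()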

After the merge, we write the table to disk so it can be loaded later on for the spatial join:

db.to_parquet('/home/jovyan/data/tmp/epc_uprn.pq')
/tmp/ipykernel_3312797/3783868997.py:1: UserWarning: this is an initial implementation of Parquet/Feather file support and associated metadata.  This is tracking version 0.1.0 of the metadata specification at https://github.com/geopandas/geo-arrow-spec

This metadata specification does not yet make stability promises.  We do not yet recommend using this in a production setting unless you are able to rewrite your Parquet/Feather files.

To further ignore this warning, you can do: 
import warnings; warnings.filterwarnings('ignore', message='.*initial implementation of Parquet.*')
  db.to_parquet('/home/jovyan/data/tmp/epc_uprn.pq')

Spatial Signatures

For the Spatial Signature boundaries, we rely on the official open data product, which can be downloaded programmatically from its Figshare location:

! rm -f $ss_p # Remove if existing
! wget -O $ss_p https://figshare.com/ndownloader/files/30904861
--2022-12-21 17:30:16--  https://figshare.com/ndownloader/files/30904861
Resolving figshare.com (figshare.com)... 54.194.88.49, 52.17.229.77, 2a05:d018:1f4:d003:376b:de5c:3a42:a610, ...
Connecting to figshare.com (figshare.com)|54.194.88.49|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/30904861/spatial_signatures_GB.gpkg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20221221/eu-west-1/s3/aws4_request&X-Amz-Date=20221221T173017Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=6c7b771aaa9d3262e8c5d21388e81b74dd21b6d622d36a17bac818dc7fe6a71e [following]
--2022-12-21 17:30:17--  https://s3-eu-west-1.amazonaws.com/pfigshare-u-files/30904861/spatial_signatures_GB.gpkg?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=AKIAIYCQYOYV5JSSROOA/20221221/eu-west-1/s3/aws4_request&X-Amz-Date=20221221T173017Z&X-Amz-Expires=10&X-Amz-SignedHeaders=host&X-Amz-Signature=6c7b771aaa9d3262e8c5d21388e81b74dd21b6d622d36a17bac818dc7fe6a71e
Resolving s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)... 52.218.85.75, 52.218.100.203, 52.92.1.232, ...
Connecting to s3-eu-west-1.amazonaws.com (s3-eu-west-1.amazonaws.com)|52.218.85.75|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 909824000 (868M) [application/octet-stream]
Saving to: ‘/home/jovyan/data/tmp/spatial_signatures_GB.gpkg’

/home/jovyan/data/t 100%[===================>] 867.68M  70.4MB/s    in 12s     

2022-12-21 17:30:29 (72.6 MB/s) - ‘/home/jovyan/data/tmp/spatial_signatures_GB.gpkg’ saved [909824000/909824000]

%%time
ss = read_dataframe(ss_p)
CPU times: user 1.46 s, sys: 794 ms, total: 2.26 s
Wall time: 2.24 s

This layer is very detailed, which makes computations much slower to run, so we first simplify the geometries (the tolerance of 10 is in metres, the units of the layer’s projection):

%%time
sss = ss.simplify(10)
CPU times: user 1min 17s, sys: 1.04 s, total: 1min 19s
Wall time: 1min 10s
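
To get a sense of how much lighter the simplified version is, we can compare vertex counts before and after (an optional check that assumes Shapely 2.0, where get_num_coordinates is vectorised):

import shapely

# Total number of vertices in the original and the simplified layers
shapely.get_num_coordinates(ss.geometry.values).sum()
shapely.get_num_coordinates(sss.values).sum()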

Now we can write to disk a Parquet table with the simplified geometries for consumption later on the GPU:

ss.assign(geometry=sss).to_parquet('/home/jovyan/data/tmp/sss.pq')

Land Registry Price Paid

Transaction data come from the Land Registry Price Paid open data product, which can be downloaded directly:

! wget \
    http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv \
    -O $pp_p
--2022-12-23 11:48:14--  http://prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv
Resolving prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com (prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com)... 52.218.118.28, 52.218.120.108, 52.218.120.212, ...
Connecting to prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com (prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com)|52.218.118.28|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: http://prod1.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv [following]
--2022-12-23 11:48:14--  http://prod1.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com/pp-complete.csv
Resolving prod1.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com (prod1.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com)... 52.218.120.108, 52.218.120.212, 52.92.19.172, ...
Reusing existing connection to prod.publicdata.landregistry.gov.uk.s3-website-eu-west-1.amazonaws.com:80.
HTTP request sent, awaiting response... 200 OK
Length: 4849144007 (4.5G) [text/csv]
Saving to: ‘/home/jovyan/data/tmp/pp-complete.csv’

/home/jovyan/data/t 100%[===================>]   4.52G  30.3MB/s    in 1m 47s  

2022-12-23 11:50:02 (43.1 MB/s) - ‘/home/jovyan/data/tmp/pp-complete.csv’ saved [4849144007/4849144007]

Following the official documentation, the column names are:

col_names = [
    'tid',
    'price',
    'date_of_transfer',
    'postcode',
    'property_type',
    'new_build',
    'duration',
    'PAON',
    'SAON',
    'street',
    'locality',
    'town_city',
    'district',
    'county',
    'ppd_cat_type',
    'rec_status'
]

We only read a subset of those:

%%time
pp = ddf.read_csv(
    pp_p, 
    names=col_names,
    usecols=['tid', 'date_of_transfer', 'postcode', 'new_build'],
    parse_dates=['date_of_transfer']
).compute()
CPU times: user 10.8 s, sys: 3.32 s, total: 14.1 s
Wall time: 21.9 s

For the analysis, we will need counts of new-build sales by month and postcode. We can calculate these already and save space:

%%time
sales = (
    pp
    .assign(moy=pp['date_of_transfer'].dt.to_period('M'))
    .query('new_build == "Y"')
    .groupby(['moy', 'postcode'])
    .size()
    .reset_index()
    .rename(columns={0: 'new_sales'})
)
CPU times: user 6.42 s, sys: 851 ms, total: 7.27 s
Wall time: 7 s

Or, with polars (note that most of the time is spent building the polars.DataFrame object, rather than on the computation itself):

%%time
import polars as pl

sales_pl = (
    pl.DataFrame(pp)
    .lazy()
    .with_column(
        pl.col('date_of_transfer').dt.strftime("%Y-%m").alias('moy')
    )
    .filter(pl.col('new_build') == 'Y')
    .groupby(['moy', 'postcode'])
    .agg(pl.count())
    .rename({'count': 'new_sales'})
    .collect()
    .to_pandas()
)
CPU times: user 5.61 s, sys: 711 ms, total: 6.32 s
Wall time: 3.21 s
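
Note the .lazy()/.collect() pair: it lets polars see the whole query before executing it, so the plan can be optimised as a unit (e.g., pushing the filter and column selection down) rather than evaluated eagerly step by step.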

To confirm they’re the same:

(
    sales_pl
    .set_index(['moy', 'postcode'])
    .join(sales.set_index(['moy', 'postcode']), lsuffix='_pl')
    .eval('new_sales_pl - new_sales')
    .sum()
)
0.0

We write the table as we will need it later on in the analysis, once it’s joined to the spatial signatures:

sales.to_parquet('/home/jovyan/data/tmp/sales_by_month_pc.pq')

Postcode centroids

Postcode locations (centroids) come from the ONS Postcode Directory (ONSPD):

! wget \
    https://geoportal.statistics.gov.uk/datasets/2e65b9933cd9483b8724760f27968a48_0.csv \
    -O $pc_p
--2022-12-23 14:41:43--  https://geoportal.statistics.gov.uk/datasets/2e65b9933cd9483b8724760f27968a48_0.csv
Resolving geoportal.statistics.gov.uk (geoportal.statistics.gov.uk)... 44.207.123.71, 3.219.120.199, 34.193.115.202
Connecting to geoportal.statistics.gov.uk (geoportal.statistics.gov.uk)|44.207.123.71|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/csv]
Saving to: ‘/home/jovyan/data/tmp/postcodes.csv’

/home/jovyan/data/t     [   <=>              ]   1.10G  15.1MB/s    in 50s     

2022-12-23 14:42:33 (22.6 MB/s) - ‘/home/jovyan/data/tmp/postcodes.csv’ saved [1186245382]

We read, in parallel, only the columns we need, and drop rows with any missing value, as we require postcodes with all three features present (i.e., the ID and both location coordinates). Note that assume_missing=True tells Dask to treat columns it would otherwise infer as integers as floats, so partitions with missing values do not break the read:

%%time
pcs = ddf.read_csv(
    pc_p,
    usecols=['PCD', 'OSEAST1M', 'OSNRTH1M'],
    assume_missing=True
).dropna().compute()
CPU times: user 535 ms, sys: 130 ms, total: 665 ms
Wall time: 1.83 s

We generate the point geometries:

xys = geopandas.points_from_xy(
    pcs['OSEAST1M'], pcs['OSNRTH1M']
)

We can now build the geo-table with the point geometries of all available postcodes:

pc_pts = (
    geopandas.GeoDataFrame(
        pcs[['PCD']], geometry=xys
    ).set_crs(epsg=27700)
)
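
As a quick optional check, the layer should now carry the CRS we set above:

# Should report EPSG:27700 (British National Grid)
pc_pts.crs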

The ONSPD appears to contain postcodes expressed both with and without a space in the middle:

pc_pts[pc_pts['PCD'].str.contains(' ')].head(1)
       PCD                        geometry
0  AB1 0AA   POINT (385386.000 801193.000)

pc_pts[~pc_pts['PCD'].str.contains(' ')].head(1)
          PCD                        geometry
2655  AB101AA   POINT (394251.000 806376.000)

Meanwhile, the postcodes in the Land Registry are all expressed with a space (with the exception of a single UNKNOWN instance):

lr_upcs = pandas.Series(sales['postcode'].unique())
lr_upcs[~lr_upcs.str.contains(' ')]
201210    UNKNOWN
dtype: object

To connect the two tables, we join them only after removing spaces in both sets of postcodes (which finds a geometry for the vast majority of postcodes):

j = geopandas.GeoDataFrame(
    pandas.DataFrame(
        {'lr_upc': lr_upcs, 'jlr_upc': lr_upcs.str.replace(' ', '')}
    )
    .join(
        pc_pts.assign(jPCD=pc_pts['PCD'].str.replace(' ', '')).set_index('jPCD'), 
        on='jlr_upc',
        how='left'
    )
    .drop(columns=['jlr_upc'])
).set_crs(pc_pts.crs)

j.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
RangeIndex: 284126 entries, 0 to 284125
Data columns (total 3 columns):
 #   Column    Non-Null Count   Dtype   
---  ------    --------------   -----   
 0   lr_upc    284126 non-null  object  
 1   PCD       283932 non-null  object  
 2   geometry  283932 non-null  geometry
dtypes: geometry(1), object(2)
memory usage: 6.5+ MB

We write this to disk (194 postcodes, fewer than 0.1%, remain without a geometry) to be able to join it to spatial signature types on a GPU:

j.to_parquet('/home/jovyan/data/tmp/postcode_pts.pq')