This document collates the three main datasets used in this capsule: the Energy Performance Certificates (EPC), the UPRN locations, and the Spatial Signature polygons. We first link building age, from the EPC records, with UPRN locations through a table join, and then bring in the Spatial Signatures. The two are subsequently joined on the GPU in a separate notebook. Each section details the origin of the data.
import pandas
import geopandas
import dask_geopandas
from pyogrio import read_dataframe
import warnings  # To disable some known ones below

uprn_p = '/home/jovyan/data/uk_os_openuprn/osopenuprn_202210.gpkg'
epc_p = '/home/jovyan/data/uk_epc_certificates/'
ss_p = '/home/jovyan/data/tmp/spatial_signatures_GB.gpkg'
pp_p = '/home/jovyan/data/tmp/pp-complete.csv'
pc_p = '/home/jovyan/data/tmp/postcodes.csv'
Some of the computations will be run in parallel through Dask, so we set up a client for a local cluster with 16 workers (as many as threads in the machine where this is run):
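A minimal sketch of that setup, assuming `dask.distributed` is installed alongside Dask (the worker count mirrors the 16 threads mentioned above):

```python
from dask.distributed import Client, LocalCluster

# Local cluster with 16 workers, one per hardware thread on this machine
cluster = LocalCluster(n_workers=16)
client = Client(cluster)
client  # in a notebook, displays the dashboard link and worker summary
```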
These need to be downloaded manually from the official website (https://epc.opendatacommunities.org/). Once unzipped, it is a collection of .csv files that can be processed efficiently with Dask. Here we specify the computation lazily:
And execute it on the Dask cluster (local in this case) to load the result into RAM (NOTE: this will take a significant amount of RAM on your machine). Note that we drop rows with N/A values in any of the three columns, as we need observations for which all three are valid.
CPU times: user 12.4 s, sys: 3.1 s, total: 15.5 s
Wall time: 38.6 s
UPRN coords
UPRNs (Unique Property Reference Numbers) are unique identifiers for properties in Great Britain. We source their locations from the Ordnance Survey’s Open UPRN product (https://www.ordnancesurvey.co.uk/business-government/products/open-uprn), which also needs to be downloaded manually. We use the GPKG format, which already contains a point geometry for each UPRN.
To consume them, we load them up in RAM (NOTE - this will take a significant amount of memory on your machine):
The approach using pyogrio seems to beat a multi-core implementation with dask-geopandas, possibly because the latter relies on geopandas.read_file, even though it spreads the computation across cores. In case of interest, here’s the code:
/tmp/ipykernel_3312797/3783868997.py:1: UserWarning: this is an initial implementation of Parquet/Feather file support and associated metadata. This is tracking version 0.1.0 of the metadata specification at https://github.com/geopandas/geo-arrow-spec
This metadata specification does not yet make stability promises. We do not yet recommend using this in a production setting unless you are able to rewrite your Parquet/Feather files.
To further ignore this warning, you can do:
import warnings; warnings.filterwarnings('ignore', message='.*initial implementation of Parquet.*')
db.to_parquet('/home/jovyan/data/tmp/epc_uprn.pq')
Spatial Signatures
For the Spatial Signature boundaries, we rely on the official open data product, which can be downloaded programmatically from its Figshare location. You can download it directly with:
ss.assign(geometry=sss).to_parquet('/home/jovyan/data/tmp/sss.pq')
We read in parallel only the columns we need and drop rows with any missing value, as we need postcodes for which all three features are present (i.e., ID and location coordinates):
To connect the two tables, we join them only after removing spaces in both sets of postcodes (which finds a geometry for the vast majority of postcodes):
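A toy sketch of that join; the two frames and their column names stand in for the real tables and are illustrative only:

```python
import pandas

# Toy stand-ins for the transactions table and the postcode locations
sales = pandas.DataFrame({"price": [100000], "postcode": ["AB1 2CD"]})
locs = pandas.DataFrame({"postcode": ["AB12CD"], "easting": [100.0], "northing": [200.0]})

# Normalise by removing internal spaces on both sides, then join
sales["pc"] = sales["postcode"].str.replace(" ", "", regex=False)
locs["pc"] = locs["postcode"].str.replace(" ", "", regex=False)
j = sales.merge(locs[["pc", "easting", "northing"]], on="pc", how="left")
```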
j.to_parquet('/home/jovyan/data/tmp/postcode_pts.pq')