Interpolation of signatures to OA and LSOA¶
Signatures are composed of bespoke geometries but can be transferred to other. In this notebook, we are interpolating signatures to Output Areas and Lower Super Output Area.
The notebook requires tobler > 0.8.2
.
import geopandas
import tobler
signatures = geopandas.read_file("spatial_signatures_GB.gpkg")
/opt/conda/lib/python3.8/site-packages/geopandas/geodataframe.py:577: RuntimeWarning: Sequential read of iterator was interrupted. Resetting iterator. This can negatively impact the performance.
for feature in features_lst:
Output Areas¶
We fetch OA for England and Wales, and those for Scotlaned and combine them to a single GeoDataFrame.
from download import download
import pandas
import dask_geopandas
download("https://opendata.arcgis.com/api/v3/datasets/09b58d063d4e421a9cad16ba5419a6bd_0/downloads/data?format=geojson&spatialRefId=4326", "oa_ew.geojson")
Downloading data from https://opendata.arcgis.com/api/v3/datasets/09b58d063d4e421a9cad16ba5419a6bd_0/downloads/data?format=geojson&spatialRefId=4326 (1 byte)
file_sizes: 2.62GB [01:15, 34.7MB/s]
Successfully downloaded file to oa_ew.geojson
'oa_ew.geojson'
download("https://www.nrscotland.gov.uk/files/geography/output-area-2011-mhw.zip", "oa_scot", kind="zip")
Creating data folder...
Downloading data from https://www.nrscotland.gov.uk/files/geography/output-area-2011-mhw.zip (31.2 MB)
file_sizes: 100%|██████████████████████████| 32.7M/32.7M [00:00<00:00, 99.8MB/s]
Extracting zip file...
Successfully downloaded / unzipped to oa_scot
'oa_scot'
oa_ew = geopandas.read_file("oa_ew.geojson")
oa_scot = geopandas.read_file('oa_scot')
oa_ew = oa_ew[["OA11CD", "geometry"]].to_crs(27700)
oa_scot = oa_scot[["code", "geometry"]].rename(columns={"code": "OA11CD"})
oa = pandas.concat([oa_ew, oa_scot]).reset_index(drop=True)
OA geometries are very dense. For our purpose, we simplify them a bit to speedup the interpolation.
%%time
oa_ddf = dask_geopandas.from_geopandas(oa, npartitions=16)
oa_ddf["geometry"] = oa_ddf.simplify(1).buffer(0)
oa = oa_ddf.compute()
CPU times: user 2min 18s, sys: 493 ms, total: 2min 18s
Wall time: 12.1 s
Using tobler, we interpolate the data and then clean the DataFrame.
%%time
estimates = tobler.area_weighted.area_interpolate(signatures, oa, spatial_index="target", categorical_variables=["type"], n_jobs=-1)
CPU times: user 1h 11min 21s, sys: 13.3 s, total: 1h 11min 35s
Wall time: 2h 6min 1s
estimates["OA11CD"] = oa.OA11CD.values
We also want to get the primary type covering the largest portion of each OA.
primary = estimates.drop(columns=["geometry", "OA11CD"]).idxmax(axis=1)
estimates["primary_type"] = primary.str.replace("type_", "")
estimates.columns = [col.replace("type_", "") for col in estimates.columns]
order = ['OA11CD', 'primary_type', 'Countryside agriculture', 'Accessible suburbia', 'Open sprawl',
'Wild countryside', 'Warehouse land', 'Gridded residential quarters',
'Urban buffer', 'Disconnected suburbia',
'Dense residential neighbourhoods',
'Connected residential neighbourhoods', 'Dense urban neighbourhoods',
'Local urbanity', 'Distilled urbanity', 'Regional urbanity', 'outlier',
'Metropolitan urbanity', 'Hyper distilled urbanity',
]
oa_estimated = estimates[order]
oa_estimated
OA11CD | primary_type | Countryside agriculture | Accessible suburbia | Open sprawl | Wild countryside | Warehouse land | Gridded residential quarters | Urban buffer | Disconnected suburbia | Dense residential neighbourhoods | Connected residential neighbourhoods | Dense urban neighbourhoods | Local urbanity | Distilled urbanity | Regional urbanity | outlier | Metropolitan urbanity | Hyper distilled urbanity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E00000001 | Distilled urbanity | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | E00000003 | Distilled urbanity | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | E00000005 | Distilled urbanity | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | E00000007 | Distilled urbanity | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | E00000010 | Distilled urbanity | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
227754 | S00094726 | Wild countryside | 0.000000 | 0.0 | 0.000000 | 0.965738 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
227755 | S00102583 | Urban buffer | 0.000000 | 0.0 | 0.076019 | 0.000000 | 0.0 | 0.0 | 0.923981 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
227756 | S00119179 | Wild countryside | 0.000000 | 0.0 | 0.000000 | 0.995102 | 0.0 | 0.0 | 0.000594 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
227757 | S00119262 | Wild countryside | 0.000000 | 0.0 | 0.000000 | 0.982881 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
227758 | S00126169 | Countryside agriculture | 0.999624 | 0.0 | 0.000000 | 0.000377 | 0.0 | 0.0 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
227759 rows × 19 columns
oa_estimated.to_csv("output_area_estimates.csv")
LSOA¶
We re-use OA geometries and combine them to LSOA using the lookup table.
download("https://www.arcgis.com/sharing/rest/content/items/39b8fd75e06f4588b9242bcad1f1a898/data", "lookup", kind="zip")
Creating data folder...
Downloading data from https://www.arcgis.com/sharing/rest/content/items/39b8fd75e06f4588b9242bcad1f1a898/data (22.4 MB)
file_sizes: 100%|██████████████████████████| 23.5M/23.5M [00:01<00:00, 18.8MB/s]
Extracting zip file...
Successfully downloaded / unzipped to lookup
'lookup'
lookup = pandas.read_csv("lookup/Postcode_to_Output_Area_to_Lower_Layer_Super_Output_Area_to_Middle_Layer_Super_Output_Area_to_Local_Authority_District_November_2018_Lookup_in_the_UK.csv")
/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3165: DtypeWarning: Columns (13) have mixed types.Specify dtype option on import or set low_memory=False.
has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
lookup = lookup[["oa11cd", "lsoa11cd"]]
oa_lsoa = oa.merge(lookup[~lookup.oa11cd.duplicated()], left_on="OA11CD", right_on="oa11cd", how="left")
oa_lsoa
OA11CD | geometry | oa11cd | lsoa11cd | |
---|---|---|---|---|
0 | E00000001 | POLYGON ((532305.295 181814.350, 532192.035 18... | E00000001 | E01000001 |
1 | E00000003 | POLYGON ((532215.180 181846.434, 532215.756 18... | E00000003 | E01000001 |
2 | E00000005 | POLYGON ((532181.931 181763.261, 532176.493 18... | E00000005 | E01000001 |
3 | E00000007 | POLYGON ((532203.092 181668.419, 532229.800 18... | E00000007 | E01000001 |
4 | E00000010 | POLYGON ((532129.762 182133.441, 532099.885 18... | E00000010 | E01000003 |
... | ... | ... | ... | ... |
227754 | S00094726 | MULTIPOLYGON (((140992.500 757140.000, 140998.... | S00094726 | S01007285 |
227755 | S00102583 | MULTIPOLYGON (((249205.272 658381.836, 249199.... | S00102583 | S01008322 |
227756 | S00119179 | MULTIPOLYGON (((180736.000 822983.000, 181184.... | S00119179 | S01010669 |
227757 | S00119262 | POLYGON ((331504.549 973428.233, 332026.118 97... | S00119262 | S01010790 |
227758 | S00126169 | POLYGON ((314833.000 734046.000, 314950.484 73... | S00126169 | S01011959 |
227759 rows × 4 columns
Creation of LSOA geometries is done using dask_geopandas dissolve method, that was not released at the time.
# https://github.com/geopandas/dask-geopandas/pull/69
def dissolve(ddf, by=None, aggfunc="first", split_out=1, **kwargs):
"""Dissolve geometries within `groupby` into a single geometry.
Parameters
----------
by : string, default None
Column whose values define groups to be dissolved. If None,
whole GeoDataFrame is considered a single group.
aggfunc : function, string or dict, default "first"
Aggregation function for manipulation of data associated
with each group. Passed to dask `groupby.agg` method.
Note that ``aggfunc`` needs to be applicable to all column (i.e. ``"mean:``
cannot be used with string dtype). Select only required columns before
``dissolve`` or pass a dictionary mapping ``aggfunc`` to each column
separately.
split_out : int, default 1
Number of partitions of the output
**kwargs
keyword arguments passed to `groupby`
Examples
--------
>>> ddf.dissolve("foo", split_out=12)
>>> ddf[["foo", "bar", "geometry"]].dissolve("foo", aggfunc="mean")
>>> ddf.dissolve("foo", aggfunc={"bar": "mean", "baz": "first"})
"""
import dask.dataframe as dd
if by is None:
by = lambda x: 0
drop = [ddf.geometry.name]
else:
drop = [by, ddf.geometry.name]
def union(block):
merged_geom = block.unary_union
return merged_geom
merge_geometries = dd.Aggregation(
"merge_geometries", lambda s: s.agg(union), lambda s0: s0.agg(union)
)
if isinstance(aggfunc, dict):
data_agg = aggfunc
else:
data_agg = {col: aggfunc for col in ddf.columns.drop(drop)}
data_agg[ddf.geometry.name] = merge_geometries
aggregated = ddf.groupby(by=by, **kwargs).agg(
data_agg,
split_out=split_out,
)
return aggregated.set_crs(ddf.crs)
oa_lsoa_ddf = dask_geopandas.from_geopandas(oa_lsoa, npartitions=16)
lsoa_ddf = dissolve(oa_lsoa_ddf, "lsoa11cd", split_out=16)
lsoa = lsoa_ddf.compute()
Now we can interpolate the signature classification to LSOA and clean the DataFrame as above.
%%time
estimates = tobler.area_weighted.area_interpolate(signatures, lsoa, spatial_index="target", categorical_variables=["type"], n_jobs=-1)
CPU times: user 11min 11s, sys: 9.26 s, total: 11min 20s
Wall time: 50min 52s
estimates["LSOA11CD"] = lsoa.index.values
primary = estimates.drop(columns=["geometry", "LSOA11CD"]).idxmax(axis=1)
estimates["primary_type"] = primary.str.replace("type_", "")
estimates.columns = [col.replace("type_", "") for col in estimates.columns]
order = ['LSOA11CD', 'primary_type', 'Countryside agriculture', 'Accessible suburbia', 'Open sprawl',
'Wild countryside', 'Warehouse land', 'Gridded residential quarters',
'Urban buffer', 'Disconnected suburbia',
'Dense residential neighbourhoods',
'Connected residential neighbourhoods', 'Dense urban neighbourhoods',
'Local urbanity', 'Distilled urbanity', 'Regional urbanity', 'outlier',
'Metropolitan urbanity', 'Hyper distilled urbanity',
]
lsoa_estimated = estimates[order]
lsoa_estimated
LSOA11CD | primary_type | Countryside agriculture | Accessible suburbia | Open sprawl | Wild countryside | Warehouse land | Gridded residential quarters | Urban buffer | Disconnected suburbia | Dense residential neighbourhoods | Connected residential neighbourhoods | Dense urban neighbourhoods | Local urbanity | Distilled urbanity | Regional urbanity | outlier | Metropolitan urbanity | Hyper distilled urbanity | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | E01000007 | Dense urban neighbourhoods | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.822703 | 0.177297 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | E01000015 | Dense residential neighbourhoods | 0.0 | 0.000000 | 0.001117 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.022815 | 0.707635 | 0.136794 | 0.130251 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | E01000030 | Dense residential neighbourhoods | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.427742 | 0.000000 | 0.000000 | 0.000000 | 0.572258 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | E01000085 | Dense urban neighbourhoods | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.171808 | 0.126567 | 0.701626 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | E01000118 | Connected residential neighbourhoods | 0.0 | 0.339714 | 0.036230 | 0.0 | 0.000000 | 0.132012 | 0.000000 | 0.000000 | 0.004280 | 0.487764 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
41724 | W01001910 | Gridded residential quarters | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.981498 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.018502 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
41725 | W01001914 | Urban buffer | 0.0 | 0.000000 | 0.000020 | 0.0 | 0.000000 | 0.000000 | 0.999980 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
41726 | W01001923 | Urban buffer | 0.0 | 0.000000 | 0.383934 | 0.0 | 0.000000 | 0.005691 | 0.533735 | 0.000000 | 0.003031 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
41727 | W01001949 | Accessible suburbia | 0.0 | 0.922427 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.077573 | 0.000000 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
41728 | W01001952 | Dense urban neighbourhoods | 0.0 | 0.000000 | 0.000000 | 0.0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.885102 | 0.114898 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
41729 rows × 19 columns
lsoa_estimated.to_csv("lsoa_estimates.csv")
estimates.to_file("lsoa.gpkg", driver="GPKG")