Interpolation of signatures to OA and LSOA

Signatures are composed of bespoke geometries but can be transferred to other. In this notebook, we are interpolating signatures to Output Areas and Lower Super Output Area.

The notebook requires tobler > 0.8.2.

import geopandas
import tobler
signatures = geopandas.read_file("spatial_signatures_GB.gpkg")
/opt/conda/lib/python3.8/site-packages/geopandas/geodataframe.py:577: RuntimeWarning: Sequential read of iterator was interrupted. Resetting iterator. This can negatively impact the performance.
  for feature in features_lst:

Output Areas

We fetch OA for England and Wales, and those for Scotlaned and combine them to a single GeoDataFrame.

from download import download
import pandas
import dask_geopandas
download("https://opendata.arcgis.com/api/v3/datasets/09b58d063d4e421a9cad16ba5419a6bd_0/downloads/data?format=geojson&spatialRefId=4326", "oa_ew.geojson")
Downloading data from https://opendata.arcgis.com/api/v3/datasets/09b58d063d4e421a9cad16ba5419a6bd_0/downloads/data?format=geojson&spatialRefId=4326 (1 byte)

file_sizes: 2.62GB [01:15, 34.7MB/s]                                            
Successfully downloaded file to oa_ew.geojson
'oa_ew.geojson'
download("https://www.nrscotland.gov.uk/files/geography/output-area-2011-mhw.zip", "oa_scot", kind="zip")
Creating data folder...
Downloading data from https://www.nrscotland.gov.uk/files/geography/output-area-2011-mhw.zip (31.2 MB)

file_sizes: 100%|██████████████████████████| 32.7M/32.7M [00:00<00:00, 99.8MB/s]
Extracting zip file...
Successfully downloaded / unzipped to oa_scot
'oa_scot'
oa_ew = geopandas.read_file("oa_ew.geojson")
oa_scot = geopandas.read_file('oa_scot')
oa_ew = oa_ew[["OA11CD", "geometry"]].to_crs(27700)
oa_scot = oa_scot[["code", "geometry"]].rename(columns={"code": "OA11CD"})
oa = pandas.concat([oa_ew, oa_scot]).reset_index(drop=True)

OA geometries are very dense. For our purpose, we simplify them a bit to speedup the interpolation.

%%time
oa_ddf = dask_geopandas.from_geopandas(oa, npartitions=16)
oa_ddf["geometry"] = oa_ddf.simplify(1).buffer(0)
oa = oa_ddf.compute()
CPU times: user 2min 18s, sys: 493 ms, total: 2min 18s
Wall time: 12.1 s

Using tobler, we interpolate the data and then clean the DataFrame.

%%time
estimates = tobler.area_weighted.area_interpolate(signatures, oa, spatial_index="target", categorical_variables=["type"], n_jobs=-1)
CPU times: user 1h 11min 21s, sys: 13.3 s, total: 1h 11min 35s
Wall time: 2h 6min 1s
estimates["OA11CD"] = oa.OA11CD.values

We also want to get the primary type covering the largest portion of each OA.

primary = estimates.drop(columns=["geometry", "OA11CD"]).idxmax(axis=1)
estimates["primary_type"] = primary.str.replace("type_", "")
estimates.columns = [col.replace("type_", "") for col in estimates.columns]
order = ['OA11CD', 'primary_type', 'Countryside agriculture', 'Accessible suburbia', 'Open sprawl',
       'Wild countryside', 'Warehouse land', 'Gridded residential quarters',
       'Urban buffer', 'Disconnected suburbia',
       'Dense residential neighbourhoods',
       'Connected residential neighbourhoods', 'Dense urban neighbourhoods',
       'Local urbanity', 'Distilled urbanity', 'Regional urbanity', 'outlier',
       'Metropolitan urbanity', 'Hyper distilled urbanity', 
       ]
oa_estimated = estimates[order]
oa_estimated
OA11CD primary_type Countryside agriculture Accessible suburbia Open sprawl Wild countryside Warehouse land Gridded residential quarters Urban buffer Disconnected suburbia Dense residential neighbourhoods Connected residential neighbourhoods Dense urban neighbourhoods Local urbanity Distilled urbanity Regional urbanity outlier Metropolitan urbanity Hyper distilled urbanity
0 E00000001 Distilled urbanity 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
1 E00000003 Distilled urbanity 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
2 E00000005 Distilled urbanity 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
3 E00000007 Distilled urbanity 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
4 E00000010 Distilled urbanity 0.000000 0.0 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
227754 S00094726 Wild countryside 0.000000 0.0 0.000000 0.965738 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
227755 S00102583 Urban buffer 0.000000 0.0 0.076019 0.000000 0.0 0.0 0.923981 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
227756 S00119179 Wild countryside 0.000000 0.0 0.000000 0.995102 0.0 0.0 0.000594 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
227757 S00119262 Wild countryside 0.000000 0.0 0.000000 0.982881 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
227758 S00126169 Countryside agriculture 0.999624 0.0 0.000000 0.000377 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

227759 rows × 19 columns

oa_estimated.to_csv("output_area_estimates.csv")

LSOA

We re-use OA geometries and combine them to LSOA using the lookup table.

download("https://www.arcgis.com/sharing/rest/content/items/39b8fd75e06f4588b9242bcad1f1a898/data", "lookup", kind="zip")
Creating data folder...
Downloading data from https://www.arcgis.com/sharing/rest/content/items/39b8fd75e06f4588b9242bcad1f1a898/data (22.4 MB)

file_sizes: 100%|██████████████████████████| 23.5M/23.5M [00:01<00:00, 18.8MB/s]
Extracting zip file...
Successfully downloaded / unzipped to lookup
'lookup'
lookup = pandas.read_csv("lookup/Postcode_to_Output_Area_to_Lower_Layer_Super_Output_Area_to_Middle_Layer_Super_Output_Area_to_Local_Authority_District_November_2018_Lookup_in_the_UK.csv")
/opt/conda/lib/python3.8/site-packages/IPython/core/interactiveshell.py:3165: DtypeWarning: Columns (13) have mixed types.Specify dtype option on import or set low_memory=False.
  has_raised = await self.run_ast_nodes(code_ast.body, cell_name,
lookup = lookup[["oa11cd", "lsoa11cd"]]
oa_lsoa = oa.merge(lookup[~lookup.oa11cd.duplicated()], left_on="OA11CD", right_on="oa11cd", how="left")
oa_lsoa
OA11CD geometry oa11cd lsoa11cd
0 E00000001 POLYGON ((532305.295 181814.350, 532192.035 18... E00000001 E01000001
1 E00000003 POLYGON ((532215.180 181846.434, 532215.756 18... E00000003 E01000001
2 E00000005 POLYGON ((532181.931 181763.261, 532176.493 18... E00000005 E01000001
3 E00000007 POLYGON ((532203.092 181668.419, 532229.800 18... E00000007 E01000001
4 E00000010 POLYGON ((532129.762 182133.441, 532099.885 18... E00000010 E01000003
... ... ... ... ...
227754 S00094726 MULTIPOLYGON (((140992.500 757140.000, 140998.... S00094726 S01007285
227755 S00102583 MULTIPOLYGON (((249205.272 658381.836, 249199.... S00102583 S01008322
227756 S00119179 MULTIPOLYGON (((180736.000 822983.000, 181184.... S00119179 S01010669
227757 S00119262 POLYGON ((331504.549 973428.233, 332026.118 97... S00119262 S01010790
227758 S00126169 POLYGON ((314833.000 734046.000, 314950.484 73... S00126169 S01011959

227759 rows × 4 columns

Creation of LSOA geometries is done using dask_geopandas dissolve method, that was not released at the time.

# https://github.com/geopandas/dask-geopandas/pull/69
def dissolve(ddf, by=None, aggfunc="first", split_out=1, **kwargs):
    """Dissolve geometries within `groupby` into a single geometry.
    Parameters
    ----------
    by : string, default None
        Column whose values define groups to be dissolved. If None,
        whole GeoDataFrame is considered a single group.
    aggfunc : function,  string or dict, default "first"
        Aggregation function for manipulation of data associated
        with each group. Passed to dask `groupby.agg` method.
        Note that ``aggfunc`` needs to be applicable to all column (i.e. ``"mean:``
        cannot be used with string dtype). Select only required columns before
        ``dissolve`` or pass a dictionary mapping ``aggfunc`` to each column
        separately.
    split_out : int, default 1
        Number of partitions of the output
    **kwargs
        keyword arguments passed to `groupby`
    Examples
    --------
    >>> ddf.dissolve("foo", split_out=12)
    >>> ddf[["foo", "bar", "geometry"]].dissolve("foo", aggfunc="mean")
    >>> ddf.dissolve("foo", aggfunc={"bar": "mean", "baz": "first"})
    """
    import dask.dataframe as dd
    
    if by is None:
        by = lambda x: 0
        drop = [ddf.geometry.name]
    else:
        drop = [by, ddf.geometry.name]

    def union(block):
        merged_geom = block.unary_union
        return merged_geom

    merge_geometries = dd.Aggregation(
        "merge_geometries", lambda s: s.agg(union), lambda s0: s0.agg(union)
    )
    if isinstance(aggfunc, dict):
        data_agg = aggfunc
    else:
        data_agg = {col: aggfunc for col in ddf.columns.drop(drop)}
    data_agg[ddf.geometry.name] = merge_geometries
    aggregated = ddf.groupby(by=by, **kwargs).agg(
        data_agg,
        split_out=split_out,
    )
    return aggregated.set_crs(ddf.crs)
oa_lsoa_ddf = dask_geopandas.from_geopandas(oa_lsoa, npartitions=16)
lsoa_ddf = dissolve(oa_lsoa_ddf, "lsoa11cd", split_out=16)
lsoa = lsoa_ddf.compute()

Now we can interpolate the signature classification to LSOA and clean the DataFrame as above.

%%time
estimates = tobler.area_weighted.area_interpolate(signatures, lsoa, spatial_index="target", categorical_variables=["type"], n_jobs=-1)
CPU times: user 11min 11s, sys: 9.26 s, total: 11min 20s
Wall time: 50min 52s
estimates["LSOA11CD"] = lsoa.index.values
primary = estimates.drop(columns=["geometry", "LSOA11CD"]).idxmax(axis=1)
estimates["primary_type"] = primary.str.replace("type_", "")
estimates.columns = [col.replace("type_", "") for col in estimates.columns]
order = ['LSOA11CD', 'primary_type', 'Countryside agriculture', 'Accessible suburbia', 'Open sprawl',
       'Wild countryside', 'Warehouse land', 'Gridded residential quarters',
       'Urban buffer', 'Disconnected suburbia',
       'Dense residential neighbourhoods',
       'Connected residential neighbourhoods', 'Dense urban neighbourhoods',
       'Local urbanity', 'Distilled urbanity', 'Regional urbanity', 'outlier',
       'Metropolitan urbanity', 'Hyper distilled urbanity', 
       ]
lsoa_estimated = estimates[order]
lsoa_estimated
LSOA11CD primary_type Countryside agriculture Accessible suburbia Open sprawl Wild countryside Warehouse land Gridded residential quarters Urban buffer Disconnected suburbia Dense residential neighbourhoods Connected residential neighbourhoods Dense urban neighbourhoods Local urbanity Distilled urbanity Regional urbanity outlier Metropolitan urbanity Hyper distilled urbanity
0 E01000007 Dense urban neighbourhoods 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.822703 0.177297 0.0 0.0 0.0 0.0 0.0
1 E01000015 Dense residential neighbourhoods 0.0 0.000000 0.001117 0.0 0.000000 0.000000 0.000000 0.022815 0.707635 0.136794 0.130251 0.000000 0.0 0.0 0.0 0.0 0.0
2 E01000030 Dense residential neighbourhoods 0.0 0.000000 0.000000 0.0 0.427742 0.000000 0.000000 0.000000 0.572258 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
3 E01000085 Dense urban neighbourhoods 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.171808 0.126567 0.701626 0.000000 0.0 0.0 0.0 0.0 0.0
4 E01000118 Connected residential neighbourhoods 0.0 0.339714 0.036230 0.0 0.000000 0.132012 0.000000 0.000000 0.004280 0.487764 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
41724 W01001910 Gridded residential quarters 0.0 0.000000 0.000000 0.0 0.000000 0.981498 0.000000 0.000000 0.000000 0.000000 0.018502 0.000000 0.0 0.0 0.0 0.0 0.0
41725 W01001914 Urban buffer 0.0 0.000000 0.000020 0.0 0.000000 0.000000 0.999980 0.000000 0.000000 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
41726 W01001923 Urban buffer 0.0 0.000000 0.383934 0.0 0.000000 0.005691 0.533735 0.000000 0.003031 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
41727 W01001949 Accessible suburbia 0.0 0.922427 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.077573 0.000000 0.000000 0.000000 0.0 0.0 0.0 0.0 0.0
41728 W01001952 Dense urban neighbourhoods 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.885102 0.114898 0.0 0.0 0.0 0.0 0.0

41729 rows × 19 columns

lsoa_estimated.to_csv("lsoa_estimates.csv")
estimates.to_file("lsoa.gpkg", driver="GPKG")