Generate Spatial Signatures across GB

This notebook generates spatial signatures as a clustering of form and function characters.

Number of clusters

The clustergram used to determine the number of clusters will likely be complex, so it is better to explore it with the new interactive version (requires clustergram>=0.5.0).

# pip install git+https://github.com/martinfleis/clustergram.git
import dask.dataframe
import numpy as np

from clustergram import Clustergram

We first load all standardized data and create a single pandas DataFrame.

standardized_form = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/form/standardized/").set_index('hindex')
stand_fn = dask.dataframe.read_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/function/standardized/")
# `dask.dataframe.concat` is the public entry point for the same function
data = dask.dataframe.concat([standardized_form, stand_fn], axis=1).replace([np.inf, -np.inf], np.nan).fillna(0)
%time data = data.compute()
CPU times: user 2min 36s, sys: 1min 24s, total: 4min
Wall time: 2min 43s
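
The cleaning step chained onto the concatenation can be illustrated on a toy pandas DataFrame. This is a minimal sketch with made-up values and hypothetical column names, not the project data. Assuming the standardization centred each character on zero, filling gaps with 0 amounts to imputing the column mean.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the standardized characters (values and names are made up).
toy = pd.DataFrame(
    {"form_a": [0.5, np.inf, -1.2], "function_b": [np.nan, 0.3, -np.inf]},
    index=pd.Index(["h1", "h2", "h3"], name="hindex"),
)

# Turn infinities into NaN first, then fill every missing value with 0.
clean = toy.replace([np.inf, -np.inf], np.nan).fillna(0)

# No non-finite values should remain after cleaning.
assert np.isfinite(clean.to_numpy()).all()
```

The same `replace` and `fillna` calls are available on dask DataFrames, which is how the notebook applies them before calling `compute()`.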

Due to a small mistake, the parquet files contain three columns that were used only temporarily during the computation. We remove them now.

data = data.drop(columns=["keep_q1", "keep_q2", "keep_q3"])

We run the clustergram for every option between 1 and 24 clusters using the Mini-Batch K-Means algorithm.

cgram = Clustergram(range(1, 25), method='minibatchkmeans', batch_size=1_000_000, n_init=100, random_state=42)
cgram.fit(data)
K=1 fitted in 780.0785899162292 seconds.
K=2 fitted in 864.1424376964569 seconds.
K=3 fitted in 954.2592947483063 seconds.
K=4 fitted in 1258.9098596572876 seconds.
K=5 fitted in 1360.8928196430206 seconds.
K=6 fitted in 1446.0337007045746 seconds.
K=7 fitted in 1550.0224254131317 seconds.
K=8 fitted in 1662.5290818214417 seconds.
K=9 fitted in 1759.5144119262695 seconds.
K=10 fitted in 1860.4208154678345 seconds.
K=11 fitted in 1957.2675037384033 seconds.
K=12 fitted in 2036.3741669654846 seconds.
K=13 fitted in 2098.1098449230194 seconds.
K=14 fitted in 2188.7303895950317 seconds.
K=15 fitted in 2251.541695833206 seconds.
K=16 fitted in 2390.264476776123 seconds.
K=17 fitted in 2506.9812376499176 seconds.
K=18 fitted in 2602.7613401412964 seconds.
K=19 fitted in 2642.102708339691 seconds.
K=20 fitted in 2746.4516792297363 seconds.
K=21 fitted in 2901.3386924266815 seconds.
K=22 fitted in 3010.796851873398 seconds.
K=23 fitted in 3055.145115852356 seconds.
K=24 fitted in 3155.7938318252563 seconds.
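
Conceptually, the run above fits one Mini-Batch K-Means model per candidate K and keeps the labels of each. The sketch below reproduces that idea with scikit-learn on synthetic data. It is an illustration under assumptions, not Clustergram's actual implementation, and the data, `batch_size` and `n_init` values are scaled down so it runs in seconds.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(42)
# Synthetic stand-in for the standardized data: 500 observations, 4 characters.
toy = pd.DataFrame(rng.normal(size=(500, 4)), columns=list("abcd"))

labels = {}
for k in range(1, 5):  # the notebook tests range(1, 25)
    km = MiniBatchKMeans(n_clusters=k, batch_size=100, n_init=3, random_state=42)
    labels[k] = km.fit_predict(toy.to_numpy())

# One column of cluster labels per tested k, indexed like the input data.
toy_labels = pd.DataFrame(labels, index=toy.index)
```

Each column of `toy_labels` plays the role of one column of `cgram.labels`: the cluster assignment of every observation for a given K.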

We can save the resulting labels to parquet.

labels = cgram.labels.copy()
labels.columns = labels.columns.astype("str")  # parquet requires str column names
labels.to_parquet("../../urbangrammar_samba/spatial_signatures/clustering_data/clustergram_labels.pq")
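
Because the column names are cast to strings before saving, anyone reading the file back gets `"1"`, `"2"`, ... rather than integers. A minimal in-memory sketch of the round trip on the column names (toy labels, no file I/O) could look like this:

```python
import pandas as pd

# Toy labels with integer column names, one column per number of clusters.
toy_labels = pd.DataFrame({1: [0, 0, 0], 2: [0, 1, 1]})

# What gets written: the same values under string column names.
as_saved = toy_labels.copy()
as_saved.columns = as_saved.columns.astype("str")

# After reading back, cast the names to integers to select by K again.
restored = as_saved.copy()
restored.columns = restored.columns.astype(int)
```

With integer names restored, `restored[2]` selects the labels for two clusters, just as on the original frame.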

Now we can plot the clustergram.

import urbangrammar_graphics as ugg
import seaborn as sns

sns.set(style='whitegrid')

%%time

ax = cgram.plot(
    figsize=(20, 20),
    line_style=dict(color=ugg.COLORS[1]),
    cluster_style={"color": ugg.COLORS[2]},
)
ax.yaxis.grid(False)
sns.despine(offset=10)
ax.set_ylim(-30, 50)
CPU times: user 11min 28s, sys: 4min 51s, total: 16min 20s
Wall time: 3min 30s
(-30.0, 50.0)
[Figure: static clustergram for 1 to 24 clusters (../_images/spatial_signatures_gb_14_2.png)]

A better option is an interactive clustergram, which shows the same data in a friendlier manner. We first initialise bokeh.

from bokeh.io import output_notebook
from bokeh.plotting import show

output_notebook()

Now we can plot the clustergram using bokeh. First, the same view as above, using PCA weighting.

fig = cgram.bokeh(
    figsize=(800, 600),
    line_style=dict(color=ugg.HEX[1]),
    cluster_style={"color": ugg.HEX[2]},
)
show(fig)

Second, we can plot the clustergram using mean values. The two perspectives combined give us a better picture of the clustering behaviour.

fig2 = cgram.bokeh(
    figsize=(800, 600),
    line_style=dict(color=ugg.HEX[1]),
    cluster_style={"color": ugg.HEX[2]},
    pca_weighted=False
)
show(fig2)
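
To make the difference between the two views more concrete, the sketch below computes a vertical position for each cluster in both ways on synthetic data. It assumes that the PCA-weighted position is the cluster centre projected onto the first principal component, and that the unweighted position is the plain mean of the cluster centre across all characters. Clustergram's internal computation may differ in detail, and the data and labels here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # synthetic standardized data
labels = rng.integers(0, 3, size=200)  # hypothetical labels for K=3

# Cluster centres as per-cluster means of every character.
centers = np.vstack([X[labels == c].mean(axis=0) for c in range(3)])

# First principal component from the SVD of the centred data.
_, _, vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
pc1 = vt[0]

y_pca = centers @ pc1          # PCA-weighted position of each cluster
y_mean = centers.mean(axis=1)  # plain mean across characters
```

The PCA-weighted value emphasises the characters that carry most of the variance, while the plain mean treats all characters equally, which is why the two plots can separate clusters differently.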