Chip making¶
This document shows how we flexibly generate a grid of pixel patches (or "chips") and select those that fall fully within a single signature class. The result needs to fit in memory.
import os, time
import pandas
import geopandas
import tools
import xarray, rioxarray
# from geopandas_view import view
from shapely.geometry import box
import dask.dataframe as ddf
from joblib import Parallel, delayed
from dask.distributed import LocalCluster, Client
import dask
import dask.distributed
import dask_geopandas
print(dask_geopandas.__version__)
tmp_dir = '/home/jovyan'
out_f_xys = f'{tmp_dir}/chip_xys_liv'
grid_dir = f'{tmp_dir}/grid'
joined_dir = f'{tmp_dir}/joined'
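The cell that spins up the cluster is not rendered here; a minimal sketch, assuming a local cluster sized to match the summary below (four workers, four threads each):

# Hypothetical cluster setup (the original cell is not shown); sized to
# match the worker/thread counts reported in the summary below
cluster = LocalCluster(n_workers=4, threads_per_worker=4)
client = Client(cluster)
client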
Port 8787 is already in use (another cluster running?), so the dashboard is served on port 38187 instead.

Client: LocalCluster, 4 workers, 16 threads, 125.54 GiB total memory
Dashboard: http://127.0.0.1:38187/status
Read data in¶
Signatures (simplified)
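The reading cell is not rendered here; a minimal sketch, assuming the simplified signatures are stored in a GeoPackage (the file name is hypothetical):

# Hypothetical read of the simplified signatures (file name is an assumption);
# `sigs` is the name the checklist below refers to
sigs = geopandas.read_file(f'{tmp_dir}/signatures_simplified.gpkg')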
Mosaic
Liverpool for prototyping:
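Again, the reading cell is not rendered; a sketch assuming a GeoTIFF subset for Liverpool, loaded lazily with rioxarray (file name and chunk sizes are assumptions):

# Hypothetical read of the Liverpool subset of the mosaic; chunking keeps
# the array dask-backed rather than loading it all up front
liv = rioxarray.open_rasterio(
    f'{tmp_dir}/mosaic_liverpool.tif', chunks={'x': 1024, 'y': 1024}
)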
def minmax(
    a,
    bottom=0,
    top=255,
    min_cut=0,
    max_cut=99
):
    # Rescale a band into [bottom, top], clipping values outside the
    # [min_cut, max_cut] percentile window to tame outliers
    from numpy import percentile
    vals = a.to_series().values
    min_bin = percentile(vals, min_cut)
    max_bin = percentile(vals, max_cut)
    # Clip to the percentile window
    a = xarray.where(a > max_bin, max_bin, a)
    a = xarray.where(a < min_bin, min_bin, a)
    # Min-max scale into the target range
    a_std = (a - a.min()) / (a.max() - a.min())
    a_scaled = a_std * (top - bottom) + bottom
    return a_scaled.astype(int)
liv.groupby(
"band"
).map(
minmax
).plot.imshow(figsize=(16, 16));

Figure: RGB rendering of the min-max rescaled Liverpool mosaic.

Build the grid¶
The goal here is to have a flexible and performant method to generate a uniform grid of polygons (GeoDataFrame) that matches the layout of pixels in the mosaic.
Small datasets¶
For small datasets, we can pack it all into a single shot:
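The single-shot cell is not rendered here; a minimal sketch, assuming an extent small enough that one polygon per pixel fits comfortably in memory:

# Single-shot sketch (names are assumptions): one square polygon per pixel,
# built directly from the mosaic's coordinate arrays
xs, ys = liv['x'].values, liv['y'].values
res = abs(float(xs[1] - xs[0]))  # pixel resolution
grid = geopandas.GeoDataFrame(
    {'geometry': [
        box(x - res / 2, y - res / 2, x + res / 2, y + res / 2)
        for y in ys for x in xs
    ]},
    crs='EPSG:27700',
)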
Larger-than-RAM datasets¶
Datasets that do not fit in memory can be run through dask-geopandas, but it is worth splitting the job across different steps:
[X] Turn coords into grid XYs as a dask.DataFrame with Point objects -> chip_pts
[X] Turn chip_pts into pixel squares -> grid
[ ] Join sigs to grid to transfer labels and discard mixed-signature chips -> sig_chips (a sketch follows below)
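The last item remains open; a hedged sketch of what the join could look like on an in-memory grid, assuming sigs carries the class label and a geopandas version recent enough for the predicate keyword (>= 0.10):

# Hypothetical sketch of the remaining step: a spatial join that keeps only
# chips falling entirely within a single signature polygon, transferring its
# label and discarding chips that straddle class boundaries
sig_chips = geopandas.sjoin(grid, sigs, how='inner', predicate='within')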
To select the number of chunks to write (npartitions), we can do a back-of-the-envelope calculation:
The mosaic is 121,865 by 182,437 pixels (22,232,685,005 in total) with four values per pixel
1,500 partitions allocate ~59m ints per chunk (approx. 200MB in memory)
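The same arithmetic, spelled out (assuming 4-byte integers for the in-memory estimate, which lands near the ~200MB quoted above):

# Back-of-the-envelope check of the figures above
n_pixels = 121_865 * 182_437          # 22,232,685,005 pixels
n_values = n_pixels * 4               # four values per pixel
per_chunk = n_values / 1_500          # ~59.3m ints per partition
print(f'{per_chunk / 1e6:.1f}m ints, ~{per_chunk * 4 / 1e6:.0f}MB at int32')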
Deprecated¶
First we get the coordinates for the centroid of each chip written to disk:
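The cell that writes these chunks is not rendered; a sketch on the Liverpool subset, under the assumption that coordinates go out column-block by column-block with joblib (names and chunking are hypothetical):

# Hypothetical sketch of the writing step: one parquet file of centroid
# X/Y pairs per block of mosaic columns, matching what process_chunk
# below expects to read back
def write_xy_chunk(xs_block, ys, i):
    xys = pandas.DataFrame(
        # inner loop over y so consecutive rows step through Y, which is
        # how process_chunk infers the chip side length
        [(x, y) for x in xs_block for y in ys], columns=['X', 'Y']
    )
    xys.to_parquet(f'{out_f_xys}/chunk_{i}.pq')

os.makedirs(out_f_xys, exist_ok=True)
xs, ys = liv['x'].values, liv['y'].values
blocks = [xs[i:i + 100] for i in range(0, len(xs), 100)]
_ = Parallel(n_jobs=16)(
    delayed(write_xy_chunk)(b, ys, i) for i, b in enumerate(blocks)
)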
These are now on disk, so we can work with them within dask / dask-geopandas.
The next step involves turning point coordinates into points. For now, we just express the computation:
%%time
def process_chunk(xy_pXgrid_p):
    t0 = time.time()
    xy_p, grid_p = xy_pXgrid_p
    xys = pandas.read_parquet(xy_p)
    # Infer the chip side length from the spacing between consecutive Y coords
    chip_len = abs(
        (xys.head() - xys.head().shift())['Y'].iloc[1]
    )
    xy_pts = geopandas.points_from_xy(
        xys['X'], xys['Y']
    )
    # A square-capped buffer (cap_style=3) turns each centroid into a
    # square chip of side chip_len
    grid = xy_pts.buffer(chip_len/2, cap_style=3)
    geopandas.GeoDataFrame(
        {'geometry': grid}, crs='EPSG:27700'
    ).to_parquet(grid_p)
    t = time.time() - t0
    msg = f'Execution of {grid_p.split("/")[-1]} completed in {t} seconds'
    return msg
! rm -rf $grid_dir
! mkdir $grid_dir
items = [
(
f'{out_f_xys}/chunk_{i}.pq', f'{grid_dir}/chunk_{i}.pq'
) for i in range(len(os.listdir(out_f_xys)))
]
'''
# Disabled runs, kept for reference: map chunks in sequence on the Dask
# cluster, or fan the work out with joblib
_ = tools.dask_map_seq(process_chunk, items[:16], client)
out = Parallel(n_jobs=16)(
    delayed(process_chunk)(i) for i in items
)
'''
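For reference, the non-deprecated route sketched in the checklist above would express the same steps lazily; a sketch using dask-geopandas (not the original cells):

# A sketch of the lazy dask-geopandas route: read every chunk, promote
# X/Y columns to Point geometries, then square-buffer into chips
xys = ddf.read_parquet(f'{out_f_xys}/chunk_*.pq')
# Infer the chip side length from the first rows, as in process_chunk
head = xys.head()
chip_len = abs((head - head.shift())['Y'].iloc[1])
chip_pts = dask_geopandas.points_from_xy(xys, 'X', 'Y')
grid = chip_pts.map_partitions(
    lambda pts: pts.buffer(chip_len / 2, cap_style=3)
)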