Urban Grammar AI
research project
tobler
Last year, we needed areal interpolation to transfer data from various administrative and statistical spatial units like Output areas to enclosed tessellation cells. To do that, we used the Tobler Python package from PySAL family. And since we needed to scale it to a national level, we spent some time refactoring the tools to make it much more performant. All that is discussed in this blog post.
After a year, we needed areal interpolation again. This time, we took a slightly different approach. We did not want to partition the data a priori. We also knew that the first part of the function would work well as we had already reimplemented it before. However, we hit another bottleneck. This time in the second part of the code, we did not touch before. What has happened?
It is a bit technical. We need to store a relation between source and target geometries. Say that we have 100 000 source geometries and 1 000 000 target ones. We essentially want an array of 100 000 rows and 1 000 000. That does not fit in memory. But since a lot of the cells of such an array would say 0, we can omit them and use a sparse array (or sparse matrix) format, which behaves like a normal dense array but is way more memory efficient. However, there are multiple ways of storing the data in a sparse array. Tobler was using DOK - Dictionary of Keys. That is really fast if you need to access individual records quickly. But that is not the application we have in tobler. We need quick row indexing (apart from a few other things). Fortunately, the fix was simple. We replaced DOK with CSR - Compressed Sparse Row matrix, and the results we miraculous. While the small test benchmark was about 45x faster, our actual interpolation was faster by several orders of magnitude. The original code did not finish after hours. The new once was done in under a second.
Small things, like storage formats, matter. If you want to see the effect of different sparse array formats, see this notebook. And the change in tobler is here and will be part of the next release.