Chapter 2: Visualizing and modelling spatial data¶

Tomas Beuzen, 2021

Chapter Outline¶

Chapter Learning Objectives
Imports
1. Spatial visualization
2. Spatial modelling

Chapter Learning Objectives¶

Make informed choices about how to plot your spatial data, e.g., scattered, polygons, 3D, etc..
Plot spatial data using libraries such as geopandas, plotly, and keplergl.
Interpolate unobserved spatial data using deterministic methods such as nearest-neighbour interpolation.
Interpolate data from one set of polygons to a different set of polygons using areal interpolation.

Imports¶

import warnings
import keplergl
import numpy as np
import osmnx as ox
import pandas as pd
import geopandas as gpd
import plotly.express as px
from skgstat import Variogram
import matplotlib.pyplot as plt
from shapely.geometry import Point
from pykrige.ok import OrdinaryKriging
from scipy.interpolate import NearestNDInterpolator
from tobler.area_weighted import area_interpolate
# Custom functions
from scripts.utils import pixel2poly
# Plotting defaults
plt.style.use('ggplot')
px.defaults.height = 400; px.defaults.width = 620
plt.rcParams.update({'font.size': 16, 'axes.labelweight': 'bold', 'figure.figsize': (6, 6), 'axes.grid': False})

1. Spatial visualization¶

1.1. Geopandas¶

We saw last chapter how to easily plot geospatial data using the geopandas method .plot(). This workflow is useful for making quick plots, exploring your data, and easily layering geometries. Let’s import some data of UBC buildings using osmnx (our Python API for accessing OpenStreetMap data) and make a quick plot:

ubc = (ox.geometries_from_place("University of British Columbia, Canada", tags={'building':True})
         .loc[:, ["geometry"]]                 # just keep the geometry column for now
         .query("geometry.type == 'Polygon'")  # only what polygons (buidling footprints)
         .assign(Label="Building Footprints")  # assign a label for later use
         .reset_index(drop=True)               # reset to 0 integer indexing
      )
ubc.head()

	geometry	Label
0	POLYGON ((-123.25526 49.26695, -123.25506 49.2...	Building Footprints
1	POLYGON ((-123.25328 49.26803, -123.25335 49.2...	Building Footprints
2	POLYGON ((-123.25531 49.26859, -123.25493 49.2...	Building Footprints
3	POLYGON ((-123.25403 49.26846, -123.25408 49.2...	Building Footprints
4	POLYGON ((-123.25455 49.26906, -123.25398 49.2...	Building Footprints

Recall that we can make a plot using the .plot() method on a GeoDataFrame:

ax = ubc.plot(figsize=(8, 8), column="Label", legend=True,
              edgecolor="0.2", markersize=200, cmap="rainbow")
plt.title("UBC");

../_images/chapter2_spatial-viz-and-modelling_14_0.png

Say I know the “point” location of my office but I want to locate the building footprint (a “polygon”). That’s easily done with geopandas!

First, I’ll use shapely (the Python geometry library geopandas is built on) to make my office point, but you could also use the geopandas function gpd.points_from_xy() like we did last chapter:

point_office = Point(-123.2522145, 49.2629555)
point_office

../_images/chapter2_spatial-viz-and-modelling_16_0.svg

Now, I can use the .contains() method to find out which building footprint my office resides in:

ubc[ubc.contains(point_office)]

	geometry	Label
48	POLYGON ((-123.25217 49.26345, -123.25196 49.2...	Building Footprints

Looks like it’s index 48! I’m going to change the label of that one to “Tom’s Office”:

ubc.loc[48, "Label"] = "Tom's Office"

Now let’s make a plot!

ax = ubc.plot(figsize=(8, 8), column="Label", legend=True,
              edgecolor="0.2", markersize=200, cmap="rainbow")
plt.title("UBC");

../_images/chapter2_spatial-viz-and-modelling_22_0.png

We can add more detail to this map by including a background map. For this, we need to install the contextily package. Note that most web providers use the Web Mercator projection, “EPSG:3857” (interesting article on that here) so I’ll convert to that before plotting:

import contextily as ctx

ax = (ubc.to_crs("EPSG:3857")
         .plot(figsize=(10, 8), column="Label", legend=True,
               edgecolor="0.2", markersize=200, cmap="rainbow")
     )
ctx.add_basemap(ax, source=ctx.providers.OpenStreetMap.Mapnik)  # I'm using OSM as the source. See all provides with ctx.providers
plt.axis("off")
plt.title("UBC");

../_images/chapter2_spatial-viz-and-modelling_24_0.png

1.2. Plotly¶

The above map is nice, but it sure would be helpful in some cases to have interactive functionality for our map don’t you think? Like the ability to zoom and pan? Well there are several packages out there that can help us with that, but plotly is one of the best for this kind of mapping. plotly supports plotting of maps backed by MapBox (a mapping and location data cloud platform).

Here’s an example using the plotly.express function px.choropleth_mapbox():

fig = px.choropleth_mapbox(ubc, geojson=ubc.geometry, locations=ubc.index, color="Label",
                           center={"lat": 49.261, "lon": -123.246}, zoom=12.5,
                           mapbox_style="open-street-map")
fig.update_layout(margin=dict(l=0, r=0, t=30, b=10))

You can pan and zoom above as you desire! How about we add a terrain map instead:

fig = px.choropleth_mapbox(ubc, geojson=ubc.geometry, locations=ubc.index, color="Label",
                           center={"lat": 49.261, "lon": -123.246}, zoom=12.5,
                           mapbox_style="stamen-terrain")
fig.update_layout(margin=dict(l=0, r=0, t=30, b=10))

We are just colouring our geometries based on a label at the moment, but we could of course do some cooler things. Let’s colour by building area:

# Calculate area
ubc["Area"] = ubc.to_crs(epsg=3347).area  # note I'm projecting to EPSG:3347 (the projected system in meters that Statistics Canada useshttps://epsg.io/3347)

# Make plot
fig = px.choropleth_mapbox(ubc, geojson=ubc.geometry, locations=ubc.index, color="Area",
                           center={"lat": 49.261, "lon": -123.246}, zoom=12.5,
                           mapbox_style="carto-positron")
fig.update_layout(margin=dict(l=0, r=0, t=30, b=10))

Check out the plotly documentation for more - there are many plotting options and examples to learn from! Other popular map plotting options include altair (doesn’t support interactivity yet), folium, and bokeh.

1.3. Kepler.gl ¶

The above mapping was pretty cool, but are you ready for more power?

Time to introduce kepler.gl! keplergl is a web-based tool for visualing spatial data. Luckily, it has a nice Python API and Jupyter extension for us to use (see the install instructions). The basic way keplergl works is:

We create an instance of a map with keplergl.KeplerGl()
We add as much data to the map as we like with the .add_data() method
We customize and configure the map in any way we like using the GUI (graphical user interface)

ubc_map = keplergl.KeplerGl(height=500)
ubc_map.add_data(data=ubc.copy(), name="Building heights")
ubc_map

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter

%%html
<iframe src="../_images/ubc-2d.html" width="80%" height="500"></iframe>

I’ll do you one better than that! Let’s add a 3D element to our plot! I’m going to load in some data of UBC building heights I downloaded from the City of Vancouver Open Data Portal:

ubc_bldg_heights = gpd.read_file("data-spatial/ubc-building-footprints-2009")
ubc_bldg_heights.head()

	id	orient8	bldgid	topelev_m	med_slope	baseelev_m	hgt_agl	rooftype	area_m2	avght_m	minht_m	maxht_m	base_m	len	wid	geometry
0	158502.0	26.7542	112433.0	97.55	4.0	80.01	17.54	Flat	2085.07	13.70	0.00	21.76	81.40	110.61	41.11	POLYGON ((-123.25550 49.26127, -123.25544 49.2...
1	158503.0	26.7542	112433.0	100.58	13.0	81.50	19.08	Flat	2085.07	13.70	0.00	21.76	81.40	110.61	41.11	POLYGON ((-123.25561 49.26111, -123.25578 49.2...
2	158506.0	27.5790	112852.0	122.32	5.0	93.04	29.28	Flat	4116.37	18.02	0.03	30.55	93.73	67.22	66.37	POLYGON ((-123.24859 49.26136, -123.24850 49.2...
3	158508.0	28.2727	120767.0	109.85	1.0	90.51	19.33	Flat	5667.17	17.95	0.17	31.16	89.19	155.53	59.50	POLYGON ((-123.25340 49.26482, -123.25351 49.2...
4	158509.0	28.5496	122507.0	112.69	27.0	86.63	26.07	Pitched	1349.35	18.26	0.00	27.34	87.79	64.92	26.47	POLYGON ((-123.25520 49.26655, -123.25524 49.2...

I’m going to combine this with our ubc data and do a bit of clean-up and wrangling. You’ll do this workflow in your lab too so I won’t spend too much time here:

ubc_bldg_heights = (gpd.sjoin(ubc, ubc_bldg_heights[["hgt_agl", "geometry"]], how="inner")
                       .drop(columns="index_right")
                       .rename(columns={"hgt_agl": "Height"})
                       .reset_index()
                       .dissolve(by="index", aggfunc="mean")  # dissolve is like "groupby" in pandas. We use it because it retains geometry information
                   )

Now I’ll make a new map and configure it to be in 3D using the GUI!

ubc_height_map = keplergl.KeplerGl(height=500)
ubc_height_map.add_data(data=ubc_bldg_heights.copy(), name="Building heights")
ubc_height_map

User Guide: https://docs.kepler.gl/docs/keplergl-jupyter

%%html
<iframe src="../_images/ubc-3d.html" width="80%" height="500"></iframe>

You can save your configuration and customization for re-use later (see the docs here)

2. Spatial modelling¶

There’s typically two main ways we might want to “model” spatial data:

Spatial interpolation: use a set of observations in space to estimate the value of a spatial field
Areal interpolation: project data from one set of polygons to another set of polygons

Both are based on the fundamental premise: “everything is related to everything else, but near things are more related than distant things.” (Tobler’s first law of geography). To demonstrate some of these methods, we’ll look at the annual average air pollution (PM 2.5) recorded at stations across BC during 2020 (which I downloaded from DataBC):

pm25 = pd.read_csv("data/bc-pm25.csv")
pm25.head()

	Station Name	Lat	Lon	EMS ID	PM_25
0	Abbotsford A Columbia Street	49.021500	-122.326600	E289309	4.7
1	Abbotsford Central	49.042800	-122.309700	E238212	4.6
2	Agassiz Municipal Hall	49.238032	-121.762334	E293810	6.7
3	Burnaby South	49.215300	-122.985600	E207418	3.8
4	Burns Lake Fire Centre	54.230700	-125.764300	E225267	5.4

fig = px.scatter_mapbox(pm25, lat="Lat", lon="Lon", color="PM_25", size="PM_25",
                        color_continuous_scale="RdYlGn_r",
                        center={"lat": 52.261, "lon": -123.246}, zoom=3.5,
                        mapbox_style="carto-positron", hover_name="Station Name")
fig.update_layout(margin=dict(l=0, r=0, t=30, b=10))
fig.show()

The goal is to interpolate these point measurements to estimate the air pollution for all of BC.

2.1. Deterministic spatial interpolation¶

Create surfaces directly from measured points using a mathematical function. Common techniques (all available in the scipy module interpolate) are:

Inverse distance weighted interpolation
Nearest neighbour interpolation
Polynomial interpolation
Radial basis function (RBF) interpolation

Let’s try nearest neighbour interpolation now using scipy.interpolate.NearestNDInterpolator. As angular coordinates (lat/lon) are not good for measuring distances, I’m going to first convert my data to the linear, meter-based Lambert projection recommend by Statistics Canada and extract the x and y locations as columns in my GeoDataFrame (“Easting” and “Northing” respectively):

gpm25 = (gpd.GeoDataFrame(pm25, crs="EPSG:4326", geometry=gpd.points_from_xy(pm25["Lon"], pm25["Lat"]))
            .to_crs("EPSG:3347")
        )
gpm25["Easting"], gpm25["Northing"] = gpm25.geometry.x, gpm25.geometry.y
gpm25.head()

	Station Name	Lat	Lon	EMS ID	PM_25	geometry	Easting	Northing
0	Abbotsford A Columbia Street	49.021500	-122.326600	E289309	4.7	POINT (4056511.036 1954658.126)	4.056511e+06	1.954658e+06
1	Abbotsford Central	49.042800	-122.309700	E238212	4.6	POINT (4058698.814 1956191.109)	4.058699e+06	1.956191e+06
2	Agassiz Municipal Hall	49.238032	-121.762334	E293810	6.7	POINT (4104120.445 1957264.772)	4.104120e+06	1.957265e+06
3	Burnaby South	49.215300	-122.985600	E207418	3.8	POINT (4023977.036 1996103.398)	4.023977e+06	1.996103e+06
4	Burns Lake Fire Centre	54.230700	-125.764300	E225267	5.4	POINT (4128403.079 2571148.393)	4.128403e+06	2.571148e+06

Now let’s create a grid of values (a raster) to interpolate over. I’m just going to make a square grid of fixed resolution that spans the bounds of my observed data points (we’ll plot this shortly so you can see what it looks like):

resolution = 25_000  # cell size in meters
gridx = np.arange(gpm25.bounds.minx.min(), gpm25.bounds.maxx.max(), resolution)
gridy = np.arange(gpm25.bounds.miny.min(), gpm25.bounds.maxy.max(), resolution)

So now let’s interpolate using a nearest neighbour method (the code below is straight from the scipy docs):

model = NearestNDInterpolator(x = list(zip(gpm25["Easting"], gpm25["Northing"])),
                              y = gpm25["PM_25"])
z = model(*np.meshgrid(gridx, gridy))
plt.imshow(z);

../_images/chapter2_spatial-viz-and-modelling_57_0.png

Okay so it looks like we successfully interpolated our made-up grid, but let’s re-project it back to our original map. To do this I need to convert each cell in my raster to a small polygon using a simple function I wrote called pixel2poly() which we imported at the beginning of the notebook:

polygons, values = pixel2poly(gridx, gridy, z, resolution)

Now we can convert that to a GeoDataFrame and plot using plotly:

pm25_model = (gpd.GeoDataFrame({"PM_25_modelled": values}, geometry=polygons, crs="EPSG:3347")
                 .to_crs("EPSG:4326")
             )

fig = px.choropleth_mapbox(pm25_model, geojson=pm25_model.geometry, locations=pm25_model.index,
                           color="PM_25_modelled", color_continuous_scale="RdYlGn_r", opacity=0.5,
                           center={"lat": 52.261, "lon": -123.246}, zoom=3.5,
                           mapbox_style="carto-positron")
fig.update_layout(margin=dict(l=0, r=0, t=30, b=10))
fig.update_traces(marker_line_width=0)

Very neat! Also, notice how the grid distorts a bit once we project it back onto an angular (degrees-based) projection from our linear (meter-based) projection.

2.2. Probabilistic (geostatistical) spatial interpolation¶

Geostatistical interpolation (called “Kriging”) differs to deterministic interpolation in that we interpolate using statistical models that include estimates of spatial autocorrelation. There’s a lot to read about kriging, I just want you to be aware of the concept in case you need to do something like this in the future. Usually I’d do this kind of interpolation in GIS software like ArcGIS or QGIS, but it’s possible to do in Python too.

The basic idea is that if we have a set of observations $Z(s)$ at locations $s$, we estimate the value of an unobserved location ($s_0$) as a weighted sum: $$\hat{Z}(s_0)=\sum_{i=0}^{N}\lambda_iZ(s_i)$$

Where $N$ is the size of $s$ (number of observed samples) and $\lambda$ is an array of weights. The key is deciding which weights $\lambda$ to use. Kriging uses the spatial autocorrelation in the data to determine the weights. Spatial autocorrelation is calculated by looking at the squared difference (the variance) between points at similar distances apart, let’s see what that means:

warnings.filterwarnings("ignore")  # Silence some warnings
vario = Variogram(coordinates=gpm25[["Easting", "Northing"]],
                  values=gpm25["PM_25"],
                  n_lags=20)
vario.distance_difference_plot();

../_images/chapter2_spatial-viz-and-modelling_65_0.png

The idea is to then fit a model to this data that describes how variance (spatial autocorrelation) changes with distance (“lag”) between locations. We look at the average variance in bins of the above distances/pairs and fit a line through them. This model is called a “variogram” and it’s analogous to the autocorrelation function for time series. It defines the variance (autocorrelation structure) as a function of distance:

vario.plot(hist=False);

../_images/chapter2_spatial-viz-and-modelling_67_0.png

The above plot basically shows that:

at small distances (points are close together), variance is reduced because points are correlated
but at a distance around 400,000 m the variance flattens out indicating points are two far away to have any impactful spatial autocorrelation. This location is called the “range” while the variance at the “range” is called the “sill” (like a ceiling). We can extract the exact range:

vario.describe()["effective_range"]

363018.90908542706

By the way, we call it the semi-variance because there is a factor of 0.5 in the equation to account for the fact that variance is calculated twice for each pair of points (read more here).

Remember, our variogram defines the spatial autocorrelation of the data (i.e., how the locations in our region affect one another). Once we have a variogram model, we can use it to estimate the weights in our kriging model. I won’t go into detail on how this is done, but there is a neat walkthrough in the scikit-gstat docs here.

Anyway, I’ll briefly use the pykrige library to do some kriging so you can get an idea of what it looks like:

krig = OrdinaryKriging(x=gpm25["Easting"], y=gpm25["Northing"], z=gpm25["PM_25"], variogram_model="spherical")
z, ss = krig.execute("grid", gridx, gridy)
plt.imshow(z);

../_images/chapter2_spatial-viz-and-modelling_71_0.png

Now let’s convert our raster back to polygons so we can map it. I’m also going to load in a polygon of BC using osmnx to clip my data so it fits nicely on my map this time:

polygons, values = pixel2poly(gridx, gridy, z, resolution)
pm25_model = (gpd.GeoDataFrame({"PM_25_modelled": values}, geometry=polygons, crs="EPSG:3347")
                 .to_crs("EPSG:4326")
                 )
bc = ox.geocode_to_gdf("British Columbia, Canada")
pm25_model = gpd.clip(pm25_model, bc)

fig = px.choropleth_mapbox(pm25_model, geojson=pm25_model.geometry, locations=pm25_model.index,
                           color="PM_25_modelled", color_continuous_scale="RdYlGn_r",
                           center={"lat": 52.261, "lon": -123.246}, zoom=3.5,
                           mapbox_style="carto-positron")
fig.update_layout(margin=dict(l=0, r=0, t=30, b=10))
fig.update_traces(marker_line_width=0)

I used an “ordinary kriging” interpolation above which is the simplest implementation of kriging. The are many other forms of kriging too that can account for underlying trends in the data (“universal kriging”), or even use a regression or classification model to make use of additional explanatory variables. pykrige supports most variations. In particular for the latter, pykrige can accept sklearn models which is useful!

2.3. Areal interpolation¶

Areal interpolation is concerned with mapping data from one polygonal representation to another. Imagine I want to map the air pollution polygons I just made to FSA polygons (recall FSA is “forward sortation area”, which are groups of postcodes). The most intuitive way to do this is to distribute values based on area proportions, hence “areal interpolation”.

I’ll use the tobler library for this. First, load in the FSA polygons:

van_fsa = gpd.read_file("data-spatial/van-fsa")
ax = van_fsa.plot(edgecolor="0.2")
plt.title("Vancouver FSA");

../_images/chapter2_spatial-viz-and-modelling_78_0.png

Now I’m just going to made a higher resolution interpolation using kriging so we can see some of the details on an FSA scale:

resolution = 10_000  # cell size in meters
gridx = np.arange(gpm25.bounds.minx.min(), gpm25.bounds.maxx.max(), resolution)
gridy = np.arange(gpm25.bounds.miny.min(), gpm25.bounds.maxy.max(), resolution)
krig = OrdinaryKriging(x=gpm25["Easting"], y=gpm25["Northing"], z=gpm25["PM_25"], variogram_model="spherical")
z, ss = krig.execute("grid", gridx, gridy)
polygons, values = pixel2poly(gridx, gridy, z, resolution)
pm25_model = (gpd.GeoDataFrame({"PM_25_modelled": values}, geometry=polygons, crs="EPSG:3347")
                 .to_crs("EPSG:4326")
                 )

Now we can easily do the areal interpolation using the function area_interpolate():

areal_interp = area_interpolate(pm25_model.to_crs("EPSG:3347"),
                                van_fsa.to_crs("EPSG:3347"),
                                intensive_variables=["PM_25_modelled"]).to_crs("EPSG:4326")
areal_interp.plot(column="PM_25_modelled", figsize=(8, 8),
                  edgecolor="0.2", cmap="RdBu", legend=True)
plt.title("FSA Air Pollution");

../_images/chapter2_spatial-viz-and-modelling_82_0.png

There are other methods you can use for areal interpolation too, that include additional variables or use more advanced interpolation algorithms. The tobbler documentation describes some of these.

Python for Geospatial Analysis