In-class Exercise 4B: Calibrating Spatial Interaction Models with R

Overview

This in-class exercise is a continuation of Hands-on Exercise 3, In-class Exercise 3 and In-class Exercise 4: Preparing Spatial Interaction Modelling Variables. We will continue our journey of calibrating Spatial Interaction Models by using propulsiveness and attractiveness variables prepared in earlier in-class exercise.

Getting Started

For the purpose of this exercise, five R packages will be used. They are:

  • sf for importing, integrating, processing and transforming geospatial data.

  • tidyverse for importing, integrating, wrangling and visualising data.

  • tmap for plotting cartographic quality thematic maps.

  • performance for computing model comparison matrices such as rmse.

  • ggpubr for creating publication quality statistical graphics.

pacman::p_load(tmap, sf, performance, knitr,
               ggpubr, tidyverse)

The Data

This exercise is a continuation of Hands-on Exercise 3 and In-class Exercise 4: Preparing Spatial Interaction Modelling Variables. The following data will be used:

  • flow_data_tidy.rds, weekday morning peak passenger flows at planning subzone level.

  • mpsz.rds, URA Master Plan 2019 Planning Subzone boundary in simple feature tibble data frame format.

flow_data <- read_rds("data/rds/flow_data_tidy.rds")
glimpse(flow_data)
Rows: 14,734
Columns: 13
$ ORIGIN_SZ       <chr> "AMSZ01", "AMSZ01", "AMSZ01", "AMSZ01", "AMSZ01", "AMS…
$ DESTIN_SZ       <chr> "AMSZ01", "AMSZ02", "AMSZ03", "AMSZ04", "AMSZ05", "AMS…
$ MORNING_PEAK    <dbl> 1998, 8289, 8971, 2252, 6136, 2148, 1620, 1925, 1773, …
$ dist            <dbl> 50.0000, 810.4491, 1360.9294, 840.4432, 1076.7916, 805…
$ ORIGIN_AGE7_12  <dbl> 310, 310, 310, 310, 310, 310, 310, 310, 310, 310, 310,…
$ ORIGIN_AGE13_24 <dbl> 710, 710, 710, 710, 710, 710, 710, 710, 710, 710, 710,…
$ ORIGIN_AGE25_64 <dbl> 2780, 2780, 2780, 2780, 2780, 2780, 2780, 2780, 2780, …
$ DESTIN_AGE7_12  <dbl> 310.00, 1140.00, 1010.00, 980.00, 810.00, 1050.00, 420…
$ DESTIN_AGE13_24 <dbl> 710.00, 2770.00, 2650.00, 2000.00, 1920.00, 2390.00, 1…
$ DESTIN_AGE25_64 <dbl> 2780.00, 15700.00, 14240.00, 11320.00, 9650.00, 12460.…
$ SCHOOL_COUNT    <dbl> 0.99, 2.00, 2.00, 1.00, 3.00, 2.00, 0.99, 0.99, 3.00, …
$ RETAIL_COUNT    <dbl> 1.00, 0.99, 6.00, 0.99, 0.99, 0.99, 1.00, 117.00, 0.99…
$ geometry        <LINESTRING [m]> LINESTRING (29501.77 39419...., LINESTRING …

Notice that this sf tibble data.frame includes two additional fields namely: SCHOOL_COUNT and BUSINESS_COUNT (but above output shows RETAIL_COUNT?). Both of them will be used as attractiveness variables when calibrating origin constrained SIM.

The code chunk below is used to display the first five columns and rows of flow_data.

kable(head(flow_data[, 1:5], n = 5))
ORIGIN_SZ DESTIN_SZ MORNING_PEAK dist ORIGIN_AGE7_12 geometry
AMSZ01 AMSZ01 1998 50.0000 310 LINESTRING (29501.77 39419….
AMSZ01 AMSZ02 8289 810.4491 310 LINESTRING (29501.77 39419….
AMSZ01 AMSZ03 8971 1360.9294 310 LINESTRING (29501.77 39419….
AMSZ01 AMSZ04 2252 840.4432 310 LINESTRING (29501.77 39419….
AMSZ01 AMSZ05 6136 1076.7916 310 LINESTRING (29501.77 39419….

Notice that this data frame include intra-zonal flow.

Preparing inter-zonal flow data

In general, we will calibrate separate Spatial Interaction Models for inter- and intra-zonal flows. In this hands-on exercise, we will focus our attention on inter-zonal flow. Hence, we need to exclude the intra-zonal flow from flow_data.

First, two new columns called FlowNoIntra and offset will be created by using the code chunk below.

flow_data$FlowNoIntra <- ifelse(
  flow_data$ORIGIN_SZ == flow_data$DESTIN_SZ, 
  0, flow_data$TRIPS)
flow_data$offset <- ifelse(
  flow_data$ORIGIN_SZ == flow_data$DESTIN_SZ, 
  0.000001, 1)

According to the syntax used to derive values in FlowNoIntra field, all intra-zonal flow will be given a value of 0 or else the original flow values will be inserted.

Next, inter-zonal flow will be selected from flow_data and save into a new output data.frame called inter_zonal_flow by using the code chunk below.

inter_zonal_flow <- flow_data %>%
  filter(FlowNoIntra > 0)

We are ready to calibrate the Spatial Interaction Models now.

Calibrating Spatial Interaction Models

In this section, we will focus on calibrating an origin constrained SIM and a doubly constrained by using flow_data prepared. This complements what we have learned in Hands-on Exercise 3.

Origin- (Production-) constrained Model

Code chunk below shows the calibration of the model by using glm() of R and flow_data.

What to learn from the code chunk below:

  • For origin-constrained model, only explanatory variables representing the attractiveness at the destinations will be used.

  • All the explanatory variables including distance will be log transformed.

  • ORIGIN_SZ is used to model 𝜇𝑖 . It must be in categorical data type.

  • It is important to note that -1 is added in the equation after the distance variable. The -1 serves the purpose of removing the intercept that by default, glm will insert into the model.

orcSIM_Poisson <- glm(formula = TRIPS ~ 
                ORIGIN_SZ +
                log(SCHOOL_COUNT) +
                log(BUSINESS_COUNT) +
                log(DIST) - 1,
              family = poisson(link = "log"),
              data = inter_zonal_flow,
              na.action = na.exclude)
summary(orcSIM_Poisson)

From the report above:

  • the ⍺1 and ⍺2 of SCHOOL_COUNT and BUSINESS_COUNT are 0.4755516 and 0.1796905 respectively.

  • 𝛽, the distance decay parameter is -1.6929522

  • there are a series of parameters which are the vector of 𝜇𝑖 values associated with the origin constraints.

Goodness of fit

In statistical modelling, the next question we would like to answer is how well the proportion of variance in the dependent variable (i.e. TRIPS) that can be explained by the explanatory variables.

In order to provide answer to this question, R-squared statistics will be used. However, R-squared is not an output of glm(). Hence we will write a function called CalcRSquared by using the code chunk below.

CalcRSquared <- function(observed, estimated){
  r <- cor(observed, estimated)
  R2 <- r^2
  R2
}

Now, we can examine how the constraints hold for destinations this time.

CalcRSquared(orcSIM_Poisson$data$TRIPS, orcSIM_Poisson$fitted.values)

With reference to the R-Squared above, we can conclude that the model accounts for about 44% of the variation of flows in the systems. 

Doubly constrained model

In this section, we will fit a doubly constrained SIM using the code chunk below:

dbcSIM_Poisson <- glm(formula = TRIPS ~ 
                ORIGIN_SZ + 
                DESTIN_SZ +
                log(DIST),
              family = poisson(link = "log"),
              data = inter_zonal_flow,
              na.action = na.exclude)
summary(dbcSIM_Poisson)

Note about the above code chunk: It is important to note that there is a slight change of the code chunk. I have removed the -1 which means that an intercept will appear in the model again. This is not because I want an intercept as it makes the origin and destination coefficients harder to interpret, rather the -1 cheat for removing the intercept only works with one factor level but in double-constrained model we have two factor levels, namely: origins and destinations.

Now, let us examine how well the proportion of variance in the dependent variable (i.e. TRIPS) that can be explained by the explanatory variables. (using R-Squared value)

CalcRSquared(dbcSIM_Poisson$data$TRIPS,
             dbcSIM_Poisson$fitted.values)

Notice that there is a relatively greater improvement in the R-Squared value.

Model comparison

Statistical measures

Another useful model performance measure for continuous dependent variable is Root Mean Squared Error. In this sub-section, we will learn how to use compare_performance() of performance package.

First of all, let us create a list called model_list by using the code chunk below.

model_list <- list(
  Origin_Constrained = orcSIM_Poisson,
  Doubly_Constrained = dbcSIM_Poisson)

Next, we will compute the RMSE of all the models in model_list file by using the code chunk below.

compare_performance(model_list,
                    metrics = "RMSE")

The print above reveals that doubly constrained SIM is the best model among the two SIMs because it has the smallest RMSE value of 1906.694.

Visualising fitted values

In this section, you will learn how to visualise the observed values and the fitted values.

Firstly we will extract the fitted values from Origin-constrained Model by using the code chunk below.

df <- as.data.frame(orcSIM_Poisson$fitted.values) %>%
  round(digits = 0)

Next, we will append the fitted values into inter_zonal_flow data frame by using the code chunk below.

inter_zonal_flow <- inter_zonal_flow %>%
  cbind(df) %>%
  rename(orcTRIPS = "orcSIM_Poisson.fitted.values")

Notice that rename() is used to rename the field name and the $ in the original field name has been replaced with an .. This is because R replaced $ with . during the cbind().

Next, we also extract the fitted values from the Doubly Constrained Model:

df <- as.data.frame(dbcSIM_Poisson$fitted.values) %>%
  round(digits = 0)
inter_zonal_flow <- inter_zonal_flow %>%
  cbind(df) %>%
  rename(dbcTRIPS = "dbcSIM_Poisson.fitted.values")

Next, two scatterplots will be created by using geom_point() and other appropriate functions of ggplot2 package.

orc_p <- ggplot(data = inter_zonal_flow,
                aes(x = orcTRIPS,
                    y = TRIPS)) +
  geom_point() +
  geom_smooth(method = lm) +
  coord_cartesian(xlim=c(0,150000),
                  ylim=c(0,150000))

dbc_p <- ggplot(data = inter_zonal_flow,
                aes(x = dbcTRIPS,
                    y = TRIPS)) +
  geom_point() +
  geom_smooth(method = lm) +
  coord_cartesian(xlim=c(0,150000),
                  ylim=c(0,150000))

Now, we will put all the graphs into a single visual for better comparison by using the code chunk below.

ggarrange(orc_p, dbc_p,
          ncol = 2,
          nrow = 1)