Analyzing Epidemiological Time Series With The R Package {distantia}

Summary

In this post I show how the R package distantia can help compare and analyze epidemiological time series. The example data comprises weekly Covid-19 prevalence curves from 36 California counties between 2020-03-16 and 2023-12-18.

The post is structured as follows:

  • Data Preparation: Transform the raw data into a time series list, the format required by distantia.

  • Exploration: Overview of data features through visualization and descriptive statistics.

  • Computing Dissimilarity Scores: Calculate time series dissimilarity using lock-step and dynamic time warping.

  • Analyzing Dissimilarity Scores: Explore methods like hierarchical clustering, mapping, and dissimilarity modeling to analyze dissimilarity scores.

DISCLAIMER: I am not an epidemiologist, so I will be focusing on the technical aspects of the applications of {distantia} to this kind of data, without any attempt to draw conclusions from the results.

R packages

This tutorial requires the following packages:

  • distantia: time series dissimilarity analysis.
  • zoo: management of regular and irregular time series.
  • dplyr: data frame processing.
  • tidyr: to pivot a few wide data frames.
  • ggplot2: to make a few charts.
  • mapview: interactive spatial visualization.
  • reactable: interactive tables.
  • factoextra: plotting results of hierarchical clustering.
  • visreg: response curves of statistical models.
library(distantia)
library(zoo)
library(dplyr)
library(tidyr)
library(ggplot2)
library(mapview)
library(reactable)
library(factoextra)
library(visreg)

Example Data

We will be working with two example data frames shipped with distantia: the simple features data frame covid_counties, and the data frame covid_prevalence.

covid_counties

This simple features data frame comprises county polygons and several socioeconomic variables. It is connected to covid_prevalence by the column name.

The socioeconomic variables available in this dataset are:

  • area_hectares: county area in hectares.
  • population: county population.
  • poverty_percentage: population percentage below the poverty line.
  • median_income: median county income in dollars.
  • domestic_product: yearly domestic product in billions (not a typo) of dollars.
  • daily_miles_traveled: daily miles traveled by the average inhabitant.
  • employed_percentage: percentage of the county population in employment.
| name | area_hectares | population | poverty_percentage | median_income | domestic_product | daily_miles_traveled | employed_percentage |
|---|---|---|---|---|---|---|---|
| Alameda | 193407 | 1636194 | 9.4 | 108971 | 79737 | 21.5 | 48.5 |
| Butte | 434090 | 205592 | 16.6 | 62982 | 39373 | 21.3 | 41.9 |
| Contra_Costa | 192999 | 1147653 | 8.8 | 110595 | 64073 | 19.1 | 46.1 |
| El_Dorado | 462063 | 189006 | 8.8 | 87689 | 39035 | 21.8 | 47.7 |
| Fresno | 1557468 | 1011499 | 19.4 | 63140 | 41345 | 21.7 | 42.1 |
| Humboldt | 927041 | 134047 | 19.4 | 53924 | 38570 | 26.3 | 41.6 |
| Imperial | 1168323 | 179476 | 17.3 | 50970 | 46871 | 34.6 | 32.1 |
| Kern | 2111888 | 907476 | 18.5 | 57926 | 52374 | 27.8 | 40.3 |
| Kings | 359176 | 151018 | 17.7 | 60810 | 40222 | 29.2 | 35.2 |
| Los_Angeles | 1058550 | 9761210 | 14.1 | 77356 | 72928 | 19.5 | 48.3 |

First 10 of 36 rows shown.

Note: These variables were included in the dataset because they were easy to capture from online sources, not because of their importance for epidemiological analyses.

covid_prevalence

This long-format data frame contains weekly Covid-19 prevalence for 36 California counties between 2020-03-16 and 2023-12-18. It has the columns name, time, and prevalence, holding the county name, the date of the first day of each week, and the maximum Covid-19 prevalence observed during that week, expressed as the proportion of positive tests.

| name | time | prevalence |
|---|---|---|
| Alameda | 2020-03-16 | 0.01 |
| Alameda | 2020-03-23 | 0.01 |
| Alameda | 2020-03-30 | 0.01 |
| Alameda | 2020-04-06 | 0.01 |
| Alameda | 2020-04-13 | 0.01 |
| Alameda | 2020-04-20 | 0.02 |
| Alameda | 2020-04-27 | 0.01 |
| Alameda | 2020-05-04 | 0.02 |
| Alameda | 2020-05-11 | 0.02 |
| Alameda | 2020-05-18 | 0.01 |

First 10 of 7092 rows shown.

Data Preparation

Before starting with the analysis, we need to transform the data frame covid_prevalence into the format required by distantia.

A time series list, or tsl for short, is a list of zoo time series representing unique realizations of the same phenomenon observed at different sites, times, or individuals. All data manipulation and analysis functions in distantia are designed to work on all time series stored in a tsl at once.

The code below applies tsl_initialize() to covid_prevalence and returns the object tsl.

tsl <- distantia::tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
)

Each element in tsl is named after the county the data belongs to.

names(tsl)
##  [1] "Alameda"         "Butte"           "Contra_Costa"    "El_Dorado"      
##  [5] "Fresno"          "Humboldt"        "Imperial"        "Kern"           
##  [9] "Kings"           "Los_Angeles"     "Madera"          "Marin"          
## [13] "Merced"          "Monterey"        "Napa"            "Orange"         
## [17] "Placer"          "Riverside"       "Sacramento"      "San_Bernardino" 
## [21] "San_Diego"       "San_Francisco"   "San_Joaquin"     "San_Luis_Obispo"
## [25] "San_Mateo"       "Santa_Barbara"   "Santa_Clara"     "Santa_Cruz"     
## [29] "Shasta"          "Solano"          "Sonoma"          "Stanislaus"     
## [33] "Sutter"          "Tulare"          "Ventura"         "Yolo"

The zoo objects within tsl also have the attribute name to help track the data in case it is extracted from the time series list.

attributes(tsl[[1]])$name
## [1] "Alameda"
attributes(tsl[[2]])$name
## [1] "Butte"

Each individual zoo object comprises a time index, extracted with zoo::index(), and a data matrix accessed via zoo::coredata().

zoo::index(tsl[["Alameda"]]) |> 
  head()
## [1] "2020-03-16" "2020-03-23" "2020-03-30" "2020-04-06" "2020-04-13"
## [6] "2020-04-20"
zoo::coredata(tsl[["Alameda"]]) |> 
  head()
##   prevalence
## 1       0.01
## 2       0.01
## 3       0.01
## 4       0.01
## 5       0.01
## 6       0.02

Exploration

The package distantia offers several tools to help you visualize your time series and explore their basic properties.

Visualization

The function tsl_plot() provides a static multipanel visualization of a large number of time series at once.

distantia::tsl_plot(
  tsl = tsl,
  columns = 3,
  guide = FALSE,
  text_cex = 1.2
)

If we combine tsl_subset() with tsl_plot(), we can focus on any particular subgroup of counties and zoom in on specific time periods.

distantia::tsl_plot(
  tsl = tsl_subset(
    tsl = tsl,
    names = c("Los_Angeles", "Kings"),
    time = c("2021-09-01", "2022-01-31")
  ),
  guide = FALSE
)

Descriptive Stats

Having a good understanding of the time in a time series is important, because details such as length, resolution, time units, or regularity usually define the design of an analysis. Here, the function tsl_time() will give us all relevant details about the time in a time series list.

df_time <- distantia::tsl_time(
  tsl = tsl
)
| name | class | units | length | resolution | begin | end |
|---|---|---|---|---|---|---|
| Alameda | Date | days | 1372 | 7 | 2020-03-16 | 2023-12-18 |
| Butte | Date | days | 1372 | 7 | 2020-03-16 | 2023-12-18 |
| Contra_Costa | Date | days | 1372 | 7 | 2020-03-16 | 2023-12-18 |
| El_Dorado | Date | days | 1372 | 7 | 2020-03-16 | 2023-12-18 |
| Fresno | Date | days | 1372 | 7 | 2020-03-16 | 2023-12-18 |
| Humboldt | Date | days | 1372 | 7 | 2020-03-16 | 2023-12-18 |
| Imperial | Date | days | 1372 | 7 | 2020-03-16 | 2023-12-18 |
| Kern | Date | days | 1372 | 7 | 2020-03-16 | 2023-12-18 |
| Kings | Date | days | 1372 | 7 | 2020-03-16 | 2023-12-18 |
| Los_Angeles | Date | days | 1372 | 7 | 2020-03-16 | 2023-12-18 |

First 10 of 36 rows shown.

In our case all time series are regular and have the same time resolution and range, which makes the table above rightfully boring.

Once acquainted with time, having a summary of the values in each time series also helps make sense of the data, and that’s where tsl_stats() comes into play.

df_stats <- distantia::tsl_stats(
  tsl = tsl,
  lags = 1:6 #weeks
)

The first part of the output corresponds to common statistical descriptors, such as centrality and dispersion metrics, and higher moments.

| name | min | q1 | median | q3 | max | sd | range | iq_range | skewness | kurtosis |
|---|---|---|---|---|---|---|---|---|---|---|
| Alameda | 0.01 | 0.01 | 0.01 | 0.03 | 0.36 | 0.05 | 0.35 | 0.02 | 4.89 | 27.38 |
| Butte | 0.01 | 0.01 | 0.02 | 0.03 | 0.18 | 0.03 | 0.17 | 0.02 | 2.87 | 10.03 |
| Contra_Costa | 0.01 | 0.01 | 0.01 | 0.03 | 0.31 | 0.04 | 0.3 | 0.02 | 4.54 | 23.8 |
| El_Dorado | 0.01 | 0.01 | 0.01 | 0.03 | 0.28 | 0.03 | 0.27 | 0.02 | 3.83 | 18.87 |
| Fresno | 0.01 | 0.01 | 0.02 | 0.05 | 0.37 | 0.05 | 0.36 | 0.04 | 3.46 | 15.29 |
| Humboldt | 0.01 | 0.01 | 0.02 | 0.03 | 0.19 | 0.03 | 0.18 | 0.02 | 3.43 | 15.08 |
| Imperial | 0.01 | 0.01 | 0.03 | 0.06 | 0.44 | 0.07 | 0.43 | 0.05 | 3.41 | 13.76 |
| Kern | 0.01 | 0.01 | 0.02 | 0.04 | 0.28 | 0.05 | 0.27 | 0.03 | 2.77 | 9.35 |
| Kings | 0.01 | 0.01 | 0.04 | 0.08 | 0.73 | 0.08 | 0.72 | 0.07 | 3.88 | 22.27 |
| Los_Angeles | 0.01 | 0.01 | 0.02 | 0.04 | 0.59 | 0.08 | 0.58 | 0.03 | 4.3 | 21.5 |

First 10 of 36 rows shown.

The second part, produced by the lags argument, shows the temporal autocorrelation of each time series. Given that the resolution of all time series is 7 days per data point, each lag represents one week. For example, a value of 0.9 at lag 1 indicates a Pearson correlation of 0.9 between prevalence values in consecutive weeks.

| name | lag1 | lag2 | lag3 | lag4 | lag5 | lag6 |
|---|---|---|---|---|---|---|
| Alameda | 0.91 | 0.72 | 0.5 | 0.28 | 0.12 | 0.02 |
| Butte | 0.9 | 0.74 | 0.53 | 0.32 | 0.15 | 0.01 |
| Contra_Costa | 0.91 | 0.72 | 0.51 | 0.28 | 0.11 | 0.01 |
| El_Dorado | 0.88 | 0.7 | 0.48 | 0.27 | 0.13 | 0.02 |
| Fresno | 0.92 | 0.75 | 0.54 | 0.33 | 0.16 | 0.03 |
| Humboldt | 0.9 | 0.73 | 0.55 | 0.36 | 0.22 | 0.11 |
| Imperial | 0.9 | 0.74 | 0.57 | 0.38 | 0.21 | 0.09 |
| Kern | 0.92 | 0.74 | 0.53 | 0.32 | 0.14 | 0.01 |
| Kings | 0.8 | 0.65 | 0.43 | 0.28 | 0.17 | 0.1 |
| Los_Angeles | 0.9 | 0.71 | 0.47 | 0.26 | 0.1 | 0 |

First 10 of 36 rows shown.

The autocorrelation values for each lag across all counties are summarized in the following boxplot.
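The boxplot is not generated by a distantia function; the sketch below is one way to produce it with tidyr and ggplot2 (this is where tidyr earns its place in the package list). It assumes the autocorrelation columns of df_stats are named lag1 to lag6, as displayed in the table above; adjust the selection if your version of the package names them differently.

#pivot the lag columns to long format and
#draw one box per lag across all counties
df_stats |> 
  tidyr::pivot_longer(
    cols = dplyr::starts_with("lag"),
    names_to = "lag",
    values_to = "autocorrelation"
  ) |> 
  ggplot2::ggplot() + 
  ggplot2::aes(
    x = lag,
    y = autocorrelation
  ) + 
  ggplot2::geom_boxplot() + 
  ggplot2::theme_bw()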

According to the boxplot, our prevalence time series show very strong short-term memory (1-2 weeks) that decays rapidly at longer lags (5-6 weeks), suggesting that the influence of past prevalence on future values dissipates quickly.


The stats produced by tsl_stats() can be joined to covid_counties using their common column name.

sf_stats <- dplyr::inner_join(
  x = covid_counties[, "name"],
  y = df_stats,
  by = "name"
)

After the join, we can map any of these stats with mapview(). For example, the map below shows the maximum prevalence per county across the full period.

mapview::mapview(
  sf_stats, 
  zcol = "max", #column with the maximum prevalence per county
  layer.name = "Max prevalence",
  label = "name",
  col.regions = distantia::color_continuous()
  )

If you wish to compute stats for a given period or group of counties, then you can combine tsl_subset() and tsl_stats(), as we did before.

df_stats_2021 <- distantia::tsl_stats(
  tsl = tsl_subset(
    tsl = tsl,
    names = c("Napa", "San_Francisco"),
    time = c("2021-01-01", "2021-12-31")
  ),
  lags = 1:6
)

Computing Dissimilarity Scores

The package distantia was designed to streamline the comparison of time series using two methods: lock-step (LS) and dynamic time warping (DTW). In this section I briefly explain both methods, demonstrate how they work, and show how to apply them to compare all our prevalence time series.

Dynamic Time Warping and Lock-Step Comparison

LS directly compares values at the same time points in equal-length time series. It is efficient and interpretable but cannot adjust for differences in outbreak timing, potentially missing relevant time series similarities.

DTW aligns time series by warping the time axis to maximize shape similarity. It accounts for differences in outbreak timing, making it useful for uneven-length or misaligned time series. However, it is computationally expensive, harder to interpret, and sensitive to scaling.

Before moving forward, let me show you a small example focused on San Francisco, Napa, and Solano to illustrate how LS and DTW work.
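The object tsl_smol used below is created off-screen in the original post. A minimal sketch of how to build it with tsl_subset(), assuming it simply contains the three counties named above:

#three-county subset for the LS vs. DTW demonstration
tsl_smol <- distantia::tsl_subset(
  tsl = tsl,
  names = c("San_Francisco", "Napa", "Solano")
)

#plot the subset
distantia::tsl_plot(
  tsl = tsl_smol,
  guide = FALSE
)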

At first glance, Napa and Solano seem quite well synchronized, while San Francisco shows a timing difference of several months. The table below shows the LS comparison between these time series. The psi column indicates the dissimilarity between pairs of time series. The smaller the value, the more similar the time series are.

tsl_smol |> 
  distantia::distantia_ls() |> 
  dplyr::mutate(
    psi = round(psi, 3)
  ) |> 
  reactable::reactable(
    resizable = TRUE,
    striped = TRUE,
    compact = TRUE,
    wrap = FALSE,
    fullWidth = FALSE
  )
| x | y | psi |
|---|---|---|
| Napa | Solano | 0.832 |
| Napa | San_Francisco | 3.868 |
| Solano | San_Francisco | 4.586 |

As we could expect, Napa and Solano show the lowest psi score, indicating that they are the most similar pair in this batch. LS seems to be capturing their synchronicity quite well! On the other hand, San Francisco shows a high dissimilarity with the other two time series, as expected from its asynchronous pattern. So far so good!

Now, let’s take a look at the DTW result:

tsl_smol |> 
  distantia::distantia_dtw() |> 
  dplyr::mutate(
    psi = round(psi, 3)
  ) |> 
  reactable::reactable(
    resizable = TRUE,
    striped = TRUE,
    compact = TRUE,
    wrap = FALSE,
    fullWidth = FALSE
  )
| x | y | psi |
|---|---|---|
| Solano | San_Francisco | 0.354 |
| Napa | San_Francisco | 0.43 |
| Napa | Solano | 0.555 |

Surprisingly, when adjusting for time shifts, San Francisco appears more similar to Solano than to Napa!

This happens because DTW ignores absolute dates, and allows any sample in a time series to map to multiple samples from the other. That’s where the time warping happens! The left side of the plot below shows San Francisco’s (dashed red line) first case aligning with a group of Napa samples with prevalence 0, effectively canceling their timing differences.


These results suggest that, in spite of their timing differences, San Francisco, Napa, and Solano have similar dynamics. This is a conclusion LS alone would miss!

For this reason, DTW is preferable when outbreak timing varies significantly and shape similarity matters more than exact time alignment. On the other hand, LS is the better choice when the question at hand focuses on time series synchronization rather than shape similarity.

Computing DTW and LS with distantia

The most straightforward way of computing DTW and LS in distantia is via the functions distantia_ls() and distantia_dtw(). Both are simplified versions of the function distantia(), which has a whole set of bells and whistles that are beyond the scope of this post.

The function distantia_ls() computes lock-step dissimilarity using Euclidean distances by default. It returns a data frame with the columns x and y, containing the time series names, and the column psi, with the dissimilarity score.

df_psi_ls <- distantia::distantia_ls(
  tsl = tsl
  )

str(df_psi_ls)
## 'data.frame':	630 obs. of  3 variables:
##  $ x  : chr  "Napa" "Alameda" "Alameda" "Sacramento" ...
##  $ y  : chr  "Solano" "San_Mateo" "Contra_Costa" "Sonoma" ...
##  $ psi: num  0.873 1.066 1.162 1.258 1.292 ...
##  - attr(*, "type")= chr "distantia_df"

The function distantia_dtw() returns an output with the same structure. You will notice that it runs a bit slower than distantia_ls().

df_psi_dtw <- distantia::distantia_dtw(
  tsl = tsl
  )

str(df_psi_dtw)
## 'data.frame':	630 obs. of  3 variables:
##  $ x  : chr  "Sacramento" "Contra_Costa" "Contra_Costa" "Alameda" ...
##  $ y  : chr  "Sonoma" "Sacramento" "Sonoma" "Sonoma" ...
##  $ psi: num  0.352 0.364 0.371 0.375 0.375 ...
##  - attr(*, "type")= chr "distantia_df"

The table below combines both results to allow a direct comparison of the two dissimilarity scores.
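This combined table is not produced by any distantia function; the sketch below shows one way to build it with dplyr, assuming both result data frames label each pair with the same x and y values (they do here, since both functions iterate over the same tsl):

#join DTW and LS scores by time series pair
df_psi_both <- dplyr::inner_join(
  dplyr::rename(df_psi_dtw, psi_dtw = psi),
  dplyr::rename(df_psi_ls, psi_ls = psi),
  by = c("x", "y")
) |> 
  dplyr::arrange(psi_dtw)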

| x | y | psi_dtw | psi_ls |
|---|---|---|---|
| Sacramento | Sonoma | 0.352 | 1.258 |
| Contra_Costa | Sacramento | 0.364 | 1.868 |
| Contra_Costa | Sonoma | 0.371 | 1.395 |
| Alameda | Sonoma | 0.375 | 1.843 |
| San_Francisco | San_Mateo | 0.375 | 4.133 |
| Fresno | Tulare | 0.397 | 2.994 |
| San_Bernardino | Ventura | 0.405 | 2.229 |
| Santa_Clara | Yolo | 0.411 | 3.288 |
| Alameda | Sacramento | 0.414 | 2.475 |
| Riverside | San_Bernardino | 0.415 | 2.73 |

First 10 of 630 rows shown.

Below is a scatterplot of psi_dtw against psi_ls, in case you are wondering how the two metrics are related in our experiment.
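The code of the original scatterplot is not shown; here goes a sketch built on df_psi_both from the snippet above, where coloring the points by the LS-DTW difference is my own choice to mimic the "warmer colors" mentioned below:

#scatterplot of DTW versus LS dissimilarity,
#colored by the mismatch between both scores
ggplot(df_psi_both) + 
  aes(
    x = psi_dtw,
    y = psi_ls,
    color = psi_ls - psi_dtw
  ) + 
  geom_point() + 
  scale_color_viridis_c(option = "plasma") + 
  labs(
    x = "psi (dynamic time warping)",
    y = "psi (lock-step)",
    color = "LS - DTW"
  ) + 
  theme_bw()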

The chart shows a good correlation between the DTW and LS psi scores, so in principle choosing one or the other might not matter much, as both may lead to similar conclusions. However, notice the cases with a warmer color in the upper-left corner: these are pairs of time series with a large dissimilarity mismatch between DTW and LS. Remember San Francisco versus Napa? If such cases are important to your research question, then DTW is the method of choice. Otherwise, LS is a reasonable alternative.

In the end, the research question must guide the methodological choice, but it is up to us to understand the consequences of that choice.

Analyzing Dissimilarity Scores

In this section we will have a look at the different kinds of analyses implemented in distantia that may help better understand how time series are related. For the sake of simplicity, I will focus on the DTW results alone. If you wish to use the LS results instead, just replace df_psi_dtw with df_psi_ls in the code below.

df_psi <- df_psi_dtw

Summary Stats

The functions distantia_stats() and distantia_boxplot() provide alternative summarized representations of dissimilarity by aggregating psi scores across all time series.

df_psi_stats <- distantia::distantia_stats(
  df = df_psi
)
| name | mean | min | q1 | median | q3 | max | sd | range |
|---|---|---|---|---|---|---|---|---|
| Alameda | 1.075 | 0.375 | 0.713 | 1.199 | 1.381 | 2.034 | 0.419 | 1.66 |
| Butte | 1.237 | 0.495 | 1.012 | 1.194 | 1.426 | 2.1 | 0.324 | 1.605 |
| Contra_Costa | 1.069 | 0.364 | 0.754 | 1.159 | 1.346 | 2.079 | 0.42 | 1.715 |
| El_Dorado | 0.999 | 0.495 | 0.833 | 0.931 | 1.157 | 1.793 | 0.271 | 1.298 |
| Fresno | 1.002 | 0.397 | 0.832 | 0.991 | 1.228 | 1.583 | 0.314 | 1.186 |
| Humboldt | 1.378 | 0.826 | 1.15 | 1.319 | 1.551 | 2.384 | 0.35 | 1.558 |
| Imperial | 1.275 | 0.805 | 1.048 | 1.278 | 1.482 | 1.864 | 0.276 | 1.058 |
| Kern | 1.095 | 0.545 | 0.855 | 1.061 | 1.284 | 1.69 | 0.314 | 1.145 |
| Kings | 1.322 | 0.898 | 1.128 | 1.327 | 1.474 | 1.864 | 0.24 | 0.966 |
| Los_Angeles | 1.366 | 0.662 | 1.076 | 1.364 | 1.554 | 2.173 | 0.374 | 1.511 |
| Madera | 1.081 | 0.516 | 0.965 | 1.105 | 1.227 | 1.509 | 0.268 | 0.994 |
| Marin | 1.236 | 0.762 | 0.918 | 1.119 | 1.458 | 2.337 | 0.376 | 1.576 |
| Merced | 1.147 | 0.471 | 1.018 | 1.218 | 1.384 | 1.597 | 0.307 | 1.126 |
| Monterey | 1.297 | 0.632 | 1.046 | 1.301 | 1.588 | 1.949 | 0.357 | 1.317 |
| Napa | 0.985 | 0.538 | 0.769 | 0.945 | 1.21 | 1.49 | 0.259 | 0.952 |
| Orange | 1.038 | 0.446 | 0.777 | 1.011 | 1.232 | 1.659 | 0.286 | 1.213 |
| Placer | 1.253 | 0.559 | 0.989 | 1.167 | 1.511 | 2.39 | 0.388 | 1.831 |
| Riverside | 1.164 | 0.415 | 0.844 | 1.141 | 1.413 | 1.935 | 0.405 | 1.52 |
| Sacramento | 1.001 | 0.352 | 0.715 | 1.022 | 1.26 | 1.888 | 0.372 | 1.536 |
| San_Bernardino | 1.496 | 0.405 | 1.094 | 1.508 | 1.866 | 2.39 | 0.533 | 1.985 |
| San_Diego | 1.059 | 0.568 | 0.866 | 1.022 | 1.24 | 1.762 | 0.295 | 1.194 |
| San_Francisco | 1.155 | 0.375 | 0.733 | 1.137 | 1.53 | 2.066 | 0.439 | 1.691 |
| San_Joaquin | 1.08 | 0.471 | 0.923 | 1.095 | 1.258 | 1.524 | 0.279 | 1.053 |
| San_Luis_Obispo | 1.019 | 0.571 | 0.76 | 1.037 | 1.23 | 1.507 | 0.27 | 0.936 |
| San_Mateo | 1.108 | 0.375 | 0.744 | 1.209 | 1.419 | 2.095 | 0.419 | 1.72 |
| Santa_Barbara | 0.866 | 0.57 | 0.709 | 0.784 | 0.992 | 1.386 | 0.21 | 0.817 |
| Santa_Clara | 1.002 | 0.411 | 0.677 | 1.088 | 1.285 | 1.724 | 0.351 | 1.313 |
| Santa_Cruz | 0.913 | 0.535 | 0.724 | 0.883 | 1.054 | 1.491 | 0.238 | 0.956 |
| Shasta | 1.371 | 0.716 | 1.234 | 1.389 | 1.495 | 2.173 | 0.3 | 1.457 |
| Solano | 1.086 | 0.573 | 0.861 | 1.086 | 1.257 | 1.766 | 0.271 | 1.192 |
| Sonoma | 0.986 | 0.352 | 0.665 | 1.036 | 1.185 | 1.843 | 0.37 | 1.492 |
| Stanislaus | 1.124 | 0.478 | 1.015 | 1.126 | 1.299 | 1.733 | 0.301 | 1.255 |
| Sutter | 1.11 | 0.671 | 1.016 | 1.108 | 1.216 | 1.78 | 0.222 | 1.109 |
| Tulare | 0.981 | 0.397 | 0.753 | 0.955 | 1.171 | 1.58 | 0.28 | 1.183 |
| Ventura | 0.997 | 0.405 | 0.736 | 1.05 | 1.194 | 1.632 | 0.329 | 1.227 |
| Yolo | 0.932 | 0.411 | 0.709 | 0.891 | 1.161 | 1.605 | 0.291 | 1.195 |

The boxplot generated by distantia_boxplot() serves as a quick visual summary of the table above.

distantia::distantia_boxplot(
  df = df_psi,
  text_cex = 1.4
)

According to the boxplot, San Bernardino is the county with the highest overall dissimilarity to all others, while Santa Barbara, at the bottom of the boxplot, seems to be the most similar to all others.

Hierarchical Clustering

The functions distantia_cluster_hclust() and distantia_cluster_kmeans() are designed to apply hierarchical and K-means clustering to the data frames produced by distantia_dtw() and distantia_ls(). In this section we will be focusing on hierarchical clustering.

By default, the function distantia_cluster_hclust() selects the clustering method (see argument method in help(hclust)) and the number of groups in the clustering solution via silhouette-width maximization (see distantia::utils_cluster_silhouette() and distantia::utils_cluster_hclust_optimizer() for further details).

psi_cluster <- distantia::distantia_cluster_hclust(
  df = df_psi
)

names(psi_cluster)
## [1] "cluster_object"   "clusters"         "silhouette_width" "df"              
## [5] "d"                "optimization"

The output is a list with several components, but these are the key ones:

  • df: data frame with time series names, cluster assignment, and silhouette width.
  • cluster_object: the output of stats::hclust().
  • clusters: the number of distinct clusters selected by the algorithm.

The data frame df has the column silhouette_width, in the range [-1, 1], which indicates how well each case fits within its assigned cluster:

  • <0: misclassified.
  • ≈0: near decision boundary.
  • ≈1: well-clustered.
| name | cluster | silhouette_width |
|---|---|---|
| Sacramento | 1 | 0.338 |
| Contra_Costa | 1 | 0.384 |
| Alameda | 1 | 0.373 |
| San_Francisco | 1 | 0.429 |
| Fresno | 2 | 0.333 |
| San_Bernardino | 2 | 0.44 |
| Santa_Clara | 1 | 0.312 |
| Riverside | 2 | 0.425 |
| Orange | 2 | 0.146 |
| Merced | 2 | 0.268 |
| Butte | 1 | 0.191 |
| San_Mateo | 1 | 0.371 |
| Santa_Cruz | 2 | -0.049 |
| Napa | 1 | 0.259 |
| Kern | 2 | 0.305 |
| Madera | 2 | 0.23 |
| Placer | 1 | 0.349 |
| San_Diego | 2 | 0.264 |
| Santa_Barbara | 2 | 0.16 |
| Sonoma | 1 | 0.305 |
| San_Luis_Obispo | 2 | 0.2 |
| Monterey | 2 | 0.161 |
| Stanislaus | 2 | 0.143 |
| Los_Angeles | 2 | 0.269 |
| San_Joaquin | 2 | 0.199 |
| Solano | 1 | 0.249 |
| El_Dorado | 1 | 0.181 |
| Tulare | 2 | 0.299 |
| Marin | 1 | 0.382 |
| Imperial | 2 | 0.271 |
| Humboldt | 1 | 0.299 |
| Ventura | 2 | 0.31 |
| Sutter | 1 | 0.153 |
| Kings | 2 | 0.238 |
| Shasta | 1 | 0.143 |
| Yolo | 1 | 0.292 |

If you arrange the table above by silhouette_width (see the snippet below), you will notice that Santa Cruz has a silhouette width of -0.049, which highlights it as the county with the most dubious cluster membership.
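A quick way to do that arrangement:

#counties with the weakest cluster membership first
psi_cluster$df |> 
  dplyr::arrange(silhouette_width) |> 
  head()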

The cluster object can be plotted as a dendrogram with factoextra::fviz_dend(), using the number of clusters stored in psi_cluster$clusters as a guide to colorize the branches.

#number of clusters
k <- psi_cluster$clusters

factoextra::fviz_dend(
  x = psi_cluster$cluster_object, 
  k = k, 
  k_colors = distantia::color_discrete(n = k),
  cex = 1,
  label_cols = "gray20",
  rect = TRUE,  
  horiz = TRUE,
  main = "",
  ylab = "Dissimilarity"
  )

Mapping these groups may be helpful for those of us not fully acquainted with the distribution of counties in California. To get there, we first need to join psi_cluster$df to the spatial data frame covid_counties.

sf_cluster <- dplyr::inner_join(
  x = covid_counties[, "name"],
  y = psi_cluster$df,
  by = "name"
)

Now we can use mapview::mapview() again to plot cluster membership, but with a twist: we are going to encode the silhouette width of each case as polygon transparency, with higher transparency indicating a lower silhouette width. This will help identify the cases that might not fully belong to their assigned groups.

Notice that to do so, the code below applies distantia::f_rescale_local() to rescale the silhouette width between 0.1 and 1, the range the argument alpha.regions expects.

mapview::mapview(
  sf_cluster, 
  zcol = "cluster",
  layer.name = "Cluster",
  label = "name",
  col.regions = distantia::color_discrete(n = k),
  alpha.regions = distantia::f_rescale_local(
    x = sf_cluster$silhouette_width, 
    new_min = 0.1
    )
  )

The map shows something the dendrogram could not: the two groups of counties are geographically coherent! Additionally, it shows that Santa Cruz, the county with the most dubious cluster membership, is surrounded by counties from the other cluster.

Dissimilarity Modelling

Dissimilarity scores reveal the degree of interrelation between time series but do not explain the reasons behind their dissimilarity.

To fill this gap, in this section I show how dissimilarity scores and geographic and socioeconomic variables can be combined to fit models that may help unveil relevant drivers of dissimilarity between epidemiological time series.

Building a Model Frame

The function distantia_model_frame() builds data frames for model fitting. In the code below we use it to combine the dissimilarity scores stored in df_psi with the socioeconomic variables stored in covid_counties.

df_model <- distantia::distantia_model_frame(
  response_df = df_psi,
  predictors_df = covid_counties,
  scale = FALSE
)

Let’s take a look at the output.

| x | y | psi | area_hectares | population | poverty_percentage | median_income | domestic_product | daily_miles_traveled | employed_percentage | geographic_distance |
|---|---|---|---|---|---|---|---|---|---|---|
| Sacramento | Sonoma | 0.35 | 155749.08 | 1094279 | 3.9 | 13388 | 4286.51 | 3.61 | 5.75 | 44700.85 |
| Contra_Costa | Sacramento | 0.36 | 63954.38 | 424800 | 4.2 | 30984 | 9094.15 | 0.31 | 1.53 | 0 |
| Contra_Costa | Sonoma | 0.37 | 219703.46 | 669479 | 0.3 | 17596 | 4807.65 | 3.93 | 4.23 | 13285.05 |
| Alameda | Sonoma | 0.37 | 219295.43 | 1158020 | 0.3 | 15972 | 20471.4 | 1.48 | 1.76 | 26691.93 |
| San_Francisco | San_Mateo | 0.38 | 104199.56 | 94059 | 4.6 | 11375 | 48221.99 | 13.27 | 7.16 | 0 |
| Fresno | Tulare | 0.4 | 302255.18 | 536435 | 0.7 | 5742 | 4801.01 | 1 | 2.38 | 0 |
| San_Bernardino | Ventura | 0.4 | 4730055.32 | 1356403 | 4.3 | 21601 | 18790.39 | 7.23 | 4.14 | 81949.76 |
| Santa_Clara | Yolo | 0.41 | 70795.8 | 1665199 | 7.9 | 64914 | 138677.81 | 7.45 | 5.94 | 92406.27 |
| Alameda | Sacramento | 0.41 | 63546.34 | 63741 | 3.6 | 29360 | 24757.91 | 2.13 | 4 | 27453.67 |
| Riverside | San_Bernardino | 0.42 | 3316013.77 | 257178 | 1.6 | 4472 | 5747.5 | 4.46 | 0.87 | 0 |

First 10 of 630 rows shown.

The model frame df_model has the following components:

  • Time series identifiers: in this case, county names in the columns x and y. These columns are not used during model fitting.
  • Response variable: column psi from df_psi, representing time series dissimilarity.
  • Predictors: the columns with socioeconomic data in covid_counties are transformed into absolute differences between each pair of counties. For example, the value of the column area_hectares for the pair Sacramento-Sonoma is 155,749 hectares, the absolute difference between the area of Sacramento (256,954 hectares) and the area of Sonoma (412,703 hectares); the snippet after this list verifies this encoding.
  • Geographic distance: the column geographic_distance, which is not in covid_counties, represents the distance between county borders. It is computed when the input of the argument predictors_df is a spatial data frame.
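As a sanity check, the sketch below recomputes the area_hectares value of the Sacramento-Sonoma pair directly from covid_counties:

#absolute difference in area between Sacramento and Sonoma;
#this should return ~155749, matching the model frame above
abs(
  covid_counties$area_hectares[covid_counties$name == "Sacramento"] -
    covid_counties$area_hectares[covid_counties$name == "Sonoma"]
)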

The function distantia_model_frame() has another neat feature: it can combine several predictors into a new one!

In the example below, the predictors poverty_percentage, median_income, employed_percentage, daily_miles_traveled and domestic_product are scaled, and their multivariate distances between pairs of counties are computed and stored in the column economy. This new variable represents overall differences in economy between counties.

Notice that the code below applies log transformations to the county areas and populations, and a square-root transformation to the geographic distances between counties. Finally, it also scales the predictors manually, because in distantia versions <= 2.0.2 the argument scale of distantia_model_frame() does not work as it should (this issue is fixed in version 2.0.3).

df_model <- distantia::distantia_model_frame(
  response_df = df_psi,
  predictors_df = covid_counties |> 
    dplyr::mutate(
      area = log(area_hectares),
      population = log(population)
    ),
  composite_predictors = list(
    economy = c(
      "poverty_percentage",
      "median_income",
      "domestic_product",
      "employed_percentage",
      "daily_miles_traveled"
    )
  ),
  #scale = TRUE is broken in version <= 2.0.2
  scale = FALSE 
  ) |> 
  #selecting and renaming columns
  dplyr::transmute(
    x,
    y,
    psi, 
    economy, 
    population,
    area,
    distance = sqrt(geographic_distance)
  ) |> 
  #scale
  dplyr::mutate(
    dplyr::across(
      dplyr::all_of(c("economy", "distance", "population", "area")), 
      ~ as.numeric(scale(.x))
    )
  )
| x | y | psi | economy | population | area | distance |
|---|---|---|---|---|---|---|
| Sacramento | Sonoma | 0.35 | -0.7 | -0.07 | -0.73 | -0.89 |
| Contra_Costa | Sacramento | 0.36 | -0.62 | -1.03 | -0.92 | -1.76 |
| Contra_Costa | Sonoma | 0.37 | -0.88 | -0.41 | -0.43 | -1.28 |
| Alameda | Sonoma | 0.37 | -1.11 | -0.02 | -0.44 | -1.09 |
| San_Francisco | San_Mateo | 0.38 | 0.38 | -1.24 | 1.08 | -1.76 |
| Fresno | Tulare | 0.4 | -1.35 | -0.54 | -1 | -1.76 |
| San_Bernardino | Ventura | 0.4 | -0.34 | -0.31 | 1.24 | -0.58 |
| Santa_Clara | Yolo | 0.41 | 1.3 | 0.98 | -0.97 | -0.51 |
| Alameda | Sacramento | 0.41 | -0.57 | -1.33 | -0.92 | -1.08 |
| Riverside | San_Bernardino | 0.42 | -1.04 | -1.25 | -0.17 | -1.76 |

First 10 of 630 rows shown.

Now, with the model frame at hand, we are ready to better understand how the dissimilarity between prevalence time series changes as a function of differences between counties in distance, area, economy, and population.

Fitting a Linear Model

Since our model frame shows no multicollinearity between predictors (you can check it as shown below), it can be used right away to fit a multiple regression model.
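A quick way to run that check on the numeric predictor columns alone:

#pairwise Pearson correlations between the model predictors
df_model[, c("economy", "population", "area", "distance")] |> 
  cor() |> 
  round(2)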

m <- stats::lm(
  formula = psi ~ distance + area + economy + population,
  data = df_model
)

The model has an R-squared of 0.29, which is not great, not terrible.

| term | estimate | std.error | p.value |
|---|---|---|---|
| distance | 0.1579 | 0.013 | 0 |
| economy | 0.0521 | 0.0136 | 0.0001 |
| area | 0.0434 | 0.0139 | 0.002 |
| population | 0.0256 | 0.0126 | 0.0425 |

The coefficients table and the effects plot support the idea that all predictors have a positive relationship with the dissimilarity between prevalence time series. The predictor distance has the strongest effect, almost three times that of economy and area, and more than six times that of population.
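The effects plot itself is not reproduced here; a minimal sketch of how to generate the partial response curves of the model with visreg (the 2x2 plotting layout is my choice):

#partial response curves for all predictors in m
par(mfrow = c(2, 2))
visreg::visreg(fit = m)
par(mfrow = c(1, 1))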

The scatterplot below plots the observed values against the model predictions. This kind of plot is known as a calibration plot, and it helps assess prediction bias in a model.

df_model$psi_predicted <- stats::predict(object = m)

ggplot(df_model) + 
  aes(
    x = psi,
    y = psi_predicted
  ) + 
  geom_point(alpha = 0.5, col = "gray50") + 
  coord_fixed(
    xlim = range(c(df_model$psi, df_model$psi_predicted)),
    ylim = range(c(df_model$psi, df_model$psi_predicted))
  ) + 
  geom_smooth(
    method = "lm", 
    col = "red4",
    formula = y ~ x
    ) + 
  geom_abline(
    intercept = 0, 
    slope = 1,
    col = "black",
    lty = 2
  ) +
  labs(
    x = "Observed psi",
    y = "Predicted psi",
    title = "Calibration plot - Linear Model"
  ) +
  theme_bw()

The calibration plot shows that our model underpredicts high dissimilarity values and overpredicts low ones. This kind of bias indicates that relevant predictors, interactions, or transformations are missing from the model. At this point we could look for more meaningful variables, disaggregate the variable economy, or decide that a linear model does not fit our purpose and try a different modelling method.

In any case, we are going to stop the modelling exercise here, because it could go on for a while, and this post is freaking long already. Whatever comes next is up to you now!

Closing Thoughts

In this rather long post I’ve gone through the main functionalities in distantia that are helpful to better understand how univariate epidemiological time series are related and to explore drivers of dissimilarity. However, there are other utilities in distantia that may be useful in this field, such as network plots, which are described in the post Mapping Time Series Dissimilarity, and time-delay analysis, as implemented in the function distantia_time_delay(). I hope I’ll have time to combine these ideas in a new post soon.

And that’s all folks, I hope you got something to take home from this tutorial!

BMB
