Archetypal analysis of COVID-19 in Montana, USA, March 13, 2020 to April 26, 2022

Given the potential consequences of infectious diseases, it is important to understand how broad scale incidence variability influences the probability of localized outbreaks. Often, these infectious disease data can involve complex spatial patterns intermixed with temporal trends. Archetypal Analysis is a method to mine complex spatiotemporal epidemiological data, and can be used to discover the dynamics of spatial patterns. The application of Archetypal Analysis to epistemological data is relatively new, and here we present one of the first applications using COVID-19 data from March 13, 2020 to April 26, 2022, in the counties of Montana, USA. We present three views of the data set with Archetypal Analysis. First, we evaluate the entire 56 county data set. Second, we compute mutual information of the 56 counties’ time series to remove counties whose dynamics are mainly independent from most of the other counties. We choose the top 17 counties ranked in terms of increasing total mutual information. Finally, to compare how population size might influence results, we conducted an analysis with 10 of the largest counties. Using the Archetypal Analysis results, we analyze the disease outbreaks across Montana, comparing and contrasting the three different cases and showing how certain counties can be found in distinct sets of archetypes. Using the reconstruction time series, we show how each outbreak had a unique trajectory across the state in terms of the archetypes.

The COVID-19 pandemic launched an intensive effort to understand the processes and AA typically finds the centroids of clusters, and represents data as convex combinations 26 of the clusters, so it can indicate if a point lies between several clusters. Therefore, AA 27 combines the strengths of other commonly used techniques for data decomposition; 28 providing the interpretability that PCA lacks and more flexibility than many clustering 29 algorithms (i.e. it is a soft clustering algorithm). 30 AA applications have spanned many fields, and include the analysis of weather and 31 climate patterns [6,11,22,24], machine learning [16], market analysis [13], and 32 biomedical and industrial engineering [9,26]. Mokhtari et al. [4] use AA in one of the 33 first applications to epidemiological data, reconstructing the spatio-temporal patterns in 34 an influenza time series, and showing how prominent outbreaks developed across space 35 and for each influenza season (2010-2019). Here, we follow their approach and apply AA 36 to COVID-19 county-level data in Montana. Our goal is to further evaluate the use of 37 AA and attempt to reconstruct the spatial patterns for each COVID-19 outbreak from 38 March 2020 to April 2022. The full 56 dimension data set is decomposed by AA, along 39 with 2 reduced dimension sets, one reduced using a mutual information metric 40 (17-dimensional), and another considering only counties with large population centers 41 (10-dimensional). We apply AA to decompose them into a limited number of spatial 42 patterns of disease counts in each county for the specific outbreaks. The patterns found 43 in each outbreak can be compared, and a reconstruction coefficient time series allows 44 examination of the spread of COVID-19 from one pattern to another. 45 We would like to note here a common misconception in the later papers applying 46 Archetypal Analysis. The Archetype algorithm finds points that are convex 47 combinations of the data points that minimize the error in a reconstruction of the data 48 in terms of convex combinations of the archetypes. Recent publications use the term 49 "extremes" [6] to mean archetypes, which is indeed the case if there are significant 50 outliers in the data set. Archetypal decomposition is sensitive to outliers, so "extremes" 51 may be found first. However, this depends strongly on how the data are distributed, 52 especially in high dimensional spaces. The error is the residual sum of squares, so if a 53 point occurs often in a data set, the error in reproducing it will be multiplied by the 54 number of occurrences of that point, and the algorithm will use it as an archetype, 55 necessarily. It may or may not be an "extreme" in the data set. Calling Archetypes 56 "extremes" is a muddy issue, and we discuss this in detail for our data set in what follows. 57

2/18
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 9, 2023 Human Services from March 13, 2020-April 26, 2022. This covers 110 full weeks or 62 n = 776 daily time observations. The first case was observed on March 13, 2020, 63 however, we removed the first 100 days because Montana was following 64 non-pharmaceutical interventions during this time and did not experience the 65 pandemic's initial wave observed in the rest of the country. Therefore, we analyzed the 66 period from June 20, 2020 -April 26, 2022, resulting in n = 676 observations. During 67 this time period, Montana experienced three peaks in cases followed by declines (i.e., 68 outbreaks or waves). A running weekly average of COVID-19 cases per day were 69 smoothed using the "smoothdata" function within MATLAB (default setting of mean = 70 3) to reduce noise in the time series. Fig. 1 shows the data reported for each county. To 71 account for differences in population size, we weighted the COVID-19 cases for each 72 county by the population size of the county in 2020 (American Community survey data 73 from the Census Bureau) and reported cases per 1000 people in that county. Consider an m × n matrix X, where n is the number of observations of COVID-19 cases 76 across m Montana counties. AA decomposes the spatio-temporal variability of X in a 77 similar way to PCA but with the following underlying constraints. Given a specified 78 value for k, AA identifies m-dimensional vectors z 1 , · · · , z k that best describe k 79 characteristic patterns, or archetypes, in the original data set, such that data can be 80 represented as convex combinations (i.e., linear combinations with non-negative 81 coefficients that sums to unity) of these archetypal patterns: The n-dimensional vector β j contains the convex weights for the jth archetype 83 across all observations. The n × k matrix of all such weights is given by B = β 1 , · · · , β k . 84 Each archetype is either a convex combinations of the original observations or an actual 85 observation [7], so they are more readily interpreted compared to PCA eigenvectors. All 86 observations can then be approximated by a convex combination of the archetypes: Here, the convex weights, sometimes referred to as mixture coefficients, α ji with 88 j = 1, · · · , k range from 0 to 1, are used to reconstruct the ith observation across the k 89 archetypes. The k × n matrix of all such weights is given by A = {α 1 , · · · , α n }. The α j 90 are like the (nonlinear) projection of the original data X onto the jth archetype z j , 91 similar to PC scores in PCA. Thus the α j s are time series that determine how much of 92 each archetype is used in reconstructing each data point.

93
The m × k matrix Z of k archetypes is defined by the matrix factorization problem:

3/18
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 9, 2023. ; https://doi.org/10.1101/2023.03.06.23286886 doi: medRxiv preprint where Z = XB. RSS = ∥X − XBA∥ is the residual sum of square errors, where ∥.∥ is 94 the spectral norm. AA seeks to find k m-dimensional archetypes such that the RSS is 95 minimized. This approach is described in detail in [7], but can be summarized as 96 follows: AA uses a convex least-squares method (CLSM) to estimate the coefficient α ji , 97 subject to the constraints for given some initial values of β ij . It then finds the best β ij 98 using CLSM, using the new α ji . This process repeats until the RSS fails to improve, or 99 potentially until the maximum number of iterations is reached. AA will find local 100 minimums, not necessarily the global minimum of RSS, hence using several starting β ij 101 values to insure a global solution is recommended. Furthermore, there is no universal 102 method for determining the optimal value of k. One commonly used approach is the 103 "elbow" criteria, where a good value of k is selected by when the RSS fails to improve, 104 which can be determined by finding an elbow in the relationship between RSS and k in 105 a scree plot. Since its introduction, other algorithms have been developed to find an 106 archetypal decomposition of data. To compute the archetypes we used Matlab packages 107 by Morten Mørup and Lars Kai Hansen [16] for computing the Principal Convex 108 Hull [2]. Archetypes themselves are presented as color mapped counties within a state 109 map, and are created with GeoPandas (GeoPandas.org).

110
It is noted in [7] that the archetypal points z, viewed as vectors, are not orthogonal 111 and have no natural nesting structure, i.e., as more archetypes are found, the archetypes 112 in the smaller set can change. This is in contrast to PCA, where the set of the leading 113 N principal components are a subset of the set of the leading M principal components 114 for M > N . In PCA all the eigenvectors are found in a single decomposition, and  with very small population counties), we could chose to remove them from the data set, 123 thereby reducing the dimension. The exact meaning of "more or less independently" is 124 explored below.
and the joint entropy of two random variables X and Y quantifies the uncertainty of 130 their joint distribution.

4/18
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 9, 2023. ; https://doi.org/10.1101/2023.03.06.23286886 doi: medRxiv preprint The mutual information is symmetric in the variables X and Y , I(X; Y ) = I(Y ; X), 133 and is zero if the random variables are independent or if the relation between them is 134 deterministic (nothing to be learned in either case). Note also that if X is statistically 135 correlated to Y , H(X|Y ) will be less than H(X), and I will be greater than 0. If X is 136 independent of Y , H(X|Y ) = H(X) and I = 0. If X is uniquely determined by Y , 137 H(X|Y ) = 0 and I(X; Y ) = H(X).

138
In general, association measures like correlation coefficient or mutual information are 139 used to estimate the relationships between two random variables. Correlation coefficient 140 measures such as Pearson or Spearman entail the assumption of linear dependence.

141
Therefore, if two random variables are associated by a nonlinear relationship these 142 methods fail to detect this link, or the strength will be wrongly estimated. Mutual Montana (14 of 56 counties have population less than 5000, and 4 have populations less 154 than 1000) it can lead to issues with our decomposition, as we shall demonstrate now. 155 We first apply archetypes to the entire truncated in time data set, which is 676 (m) 156 observations in 56 (n) dimensions. To determine how many archetypes to compute we 157 consider the scree plot, or RSS versus number of archetypes, plotted in 2. To choose the 158 set of the archetype set, the "elbow criterion" would indicate truncating to one 159 archetype, but this is the "no-disease" archetype that appears in all the decompositions, 160 and serves to turn the counts off and on. Instead, we choose k = 10 archetypes. We are 161 left with a 56 by 10 (n x k) set of β values, and a 10 by 676 (k x m) set of α values.

162
These 10 archetypes are shown in 3 as color-mapped counties in Montana. The color is 163 determined by the β value for that county, for that archetype. Later we refer to high 164 count/large outbreaks as those above 1.5, mid-size outbreaks as between 0.75-1.5, and 165 low level outbreaks as β values below 0.75. Each archetypal map can be thought of as a 166 representation of the spatial features of the outbreak at a given time, or period of time. 167 The information in each archetype are summarized in Table 3.

5/18
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 9, 2023. ;

10
Widespread large outbreaks throughout state, both in large and small population counties.
The archetype algorithm is strongly affected by outliers, as they will contribute 182 largely to the RSS if not included included in the set. To remove these in a systematic 183 way, we use techniques from information theory in the following section.

184
Reducing Spatial Dimension and Removing Outliers with To determine which counties should be included in the archetypal analysis, we 192 computed the mutual information between all counties, and ranked counties according 193 to their total mutual information (see also [4]).

194
Following the formulas in section 2, we first created histograms of the time series 195 data for single counties, and joint histograms for each county with all the others. We 196 note that the choice of bin size in these histograms will change the value of the entropy, 197 but by choosing a fixed bin size for all the histograms, it is possible to compare the 198 6/18 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 9, 2023.  Silverbow (34915) and Yellowstone (161300). We note that the counties can be grouped 211 into rough geographic subregions, where they are connected by state and/or Interstate 212 highways, each with one major city or town. These are noted in Table 2. is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

223
Using an elbow criterion on the scree plot to determine a cut-off, we choose 9 224 archetypes for the decomposition. We also examined an 8 archetype decomposition, and 225 found that these 8 are included in the 9 archetype set. The additional archetype is 226 archetype 2; a general widespread outbreak archetype, which should be included.

227
Accordingly, we are presenting this 9 archetype set in our analysis here. See figure 8 for 228 color maps of the archetypes. Each county is colored according to the β value it has for 229 that archetype. Note that high β values are considered those greater than 1.5, mid-size 230 are between 0.75 and 1.5, low are below 0.75. These values multiply the alpha values in 231 the time series to give the data reconstruction. The 9 archetype set gives a good 232 separation of the different outbreaks into distinct groups of archetypes, described below. 233 As the sum of the alpha time series is equal to 1 for each data point, it is fair to 234 ignore one archetype; its alpha time series can be calculated from the remaining. We 235 choose to ignore the zero archetype when analyzing in detail the anatomy and sequence 236 of an outbreak, as it serves to turn the epidemic "off". We further classify archetypes 237 1-9 in Table 3. Note that high β values are considered those greater than 1.5, mid-size 238 are between 0.75 and 1.5, low are below 0.75. These values multiply the alpha values in 239 the time series to give the data reconstruction. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 9, 2023. ; Mid-size outbreaks in the contiguous region made up of Flathead, Lake, Missoula, and Lewis and Clark counties, with a large outbreak in adjacent Cascade county, and a mid-size outbreak in Hill county. The remaining counties have low mid-size outbreaks.
The archetypes separate out clusters of counties which have simultaneous outbreaks 241 (spatial), and the alpha's give the sequence in which the outbreaks occur (temporal).

242
The alpha time series indicate when each archetype is active in the time series. In Fig 9 243 we show how the 9 archetypes are present almost uniquely during the different 244 outbreaks. The first outbreak is captured by archetype 7, 5 and 2, in that order. The

245
Spring/Summer 2020 low outbreak is captured by archetypes 6 and 3, the Delta 246 outbreak by 4 and 2, and finally the Omicron outbreak by 6, 8, 9 and 3. There are 247 small contributions from other archetypes in each, e.g. archetype 2 is a low widespread 248 outbreak, and is present at the end of the first outbreak and the Delta outbreak. Gallatin, which are linked by I15 from Deer Lodge to Butte to Bozeman.

260
After that, in Spring/Summer archetype 6 shows declines in counts in all counties 261 9/18 . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 9, 2023. ; except Missoula and a large outbreak in Gallatin, then archetype 3 represents a further 262 decline in counts in all counties except Cascade and Flathead, where a small increase 263 occurs. The summer outbreak is mainly a sharp increase in cases in Gallatin county, 264 represented again by archetype 6.

265
The Delta outbreak begins with largest counts in Lincoln (far northwest corner of 266 the state) and Custer (far east side of the state), and mid-size elsewhere (archetype 4). 267 Larger counts are also found in Hill, Yellowstone, Cascade, Lake and Beaverhead. The 268 outbreak appears to spread in from the east and northwest. Then the transition to    Fig 12 shows the residual sum of squares (RSS) for each truncation from 1 archetype 308 to 20. For this analysis we choose a truncation to 6 archetypes, which captures roughly 309 98% of the variance. Thus, we have reduced the dimension of the data set from 10 to 6, 310 or really 5, because the alpha time series must sum to one at each time point. 311

10/18
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 9, 2023. ; https://doi.org/10.1101/2023.03.06.23286886 doi: medRxiv preprint In Figure 13 we show color maps of each archetype. Again, the archetypes are 312 ordered according to the overall size of their contribution to the reconstruction. The 313 alpha time series (Figure 14) show when each archetype is active in the time series. We 314 classify archetypes 1-6 in Table 4. The zero or "no disease" archetype.

315
2 Mid-size overall, with larger counts in Lincoln.

3
Low level outbreak overall, with mid-size outbreaks in Cascade and Flathead. . CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 9, 2023. We have seen how archetypal analysis can be used to good effect in studying a 346 spatio-temporal data set of COVID-19 counts in Montana. Decomposing the entire data 347 set was problematic, however, because of the large size of the normalized counts in small 348 population counties. These small population counties (down to less than 500) have very 349 stochastic signals, and normalized they dominate in the creation of archetypes, which is 350 sensitive to outliers [7]. Hence, we sought ways to remove the stochastic signals of the 351 small population counties. A straight-forward truncation to large population counties 352 with significant city centers formed one reduced set, but we also wanted a way to 353 include the small counties that had significant interaction with the others. To do this, 354 we used a Mutual Information (MI) measure between time series of different counties. A 355 high (relative) MI indicates that a county's time series can be better predicted by 356 considering the other, and vice versa. We computed the MI between each county and all 357 the others, adding all together for each county to create a total MI, which can be 358 visualized as a spectrum. From this we chose those with the highest total MI, and 359 included all the large population counties, for analysis.

360
For each presentation of the data, the first archetype is necessarily the zero 361 archetype, which is used to turn off the outbreak in the decomposition of the data. In 362 the 17 county high total MI set we found that certain archetypes were tied uniquely to a 363 given outbreak, while one archetype was used to represent low-level counts in between 364 outbreaks (archetype 2). The initial outbreak was captured by archetypes 5 and 7, are represented in archetype 8. After that counts decline in many of these counties, 380 lingering more in some counties than others, with a notable late outbreak in Cascade 381 county, hence the later appearance of archetype 9. Finally, it ends with further decline 382 in all counties, with Cascade and Flathead behind the rest, seen in archetype 3.

383
The 17 county data set shows the influence of small population counties on the 384 eastern and northern border of the state on the initiation and spread of the virus.

385
However, because of of the stochastic signal within small counties and their contribution 386 to outliers in the AA results, we compare this to the Archetypal decomposition of the 10 387 large population counties data set. It shows the same progression of the epidemic as in 388 the 17 county set, restricted to the high population counties. Archetype 4 for the 10 389 county data set is similar to archetype 5 in the high total MI set without the low 390 population counties, and both are the main component of the first outbreak. In the 391 Delta outbreak Archetype 2 in the 10 county set is similar to archetype 4. In the 392 Omicron outbreak, Archetype 5 is similar to 8 in the high total MI set, archetype 6 is 393 similar to 9, and archetype 3 to 3. This nesting structure is to be expected, and 394 confirms the validity of the results.

395
In the 17 county set we see that Archetypal Analysis finds archetypes of counties 396

13/18
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity.

(which was not certified by peer review)
The copyright holder for this preprint this version posted March 9, 2023. involved in large outbreaks first, as not including them contributes a sizable amount to 397 the RSS. Therefore, an outbreak may follow a pattern of a low-level outbreak archetype 398 before any hot spots occur, to an archetype with large counts in some selection of 399 counties, perhaps followed by another archetype with another configuration, ending 400 with another low level archetype. The clusters of large outbreak counties is revealed 401 automatically in the analysis, for instance in the initial outbreak a hot-spot in Hill 402 county and larger counts in Powell county spread to all eastern counties and the western 403 side of the state.

404
How an Archetypal decomposition can be used predicting the spatial spread of 405 disease over time could be explored further. For instance, the information from the 406 Archetypal decomposition could be compared with future outbreaks. Do the outbreaks 407 follow similar spatial patterns? If an outbreak begins in a county that features largely in 408 one archetype, would this imply spread to other counties that have high counts in that 409 archetype? Flathead, Lewis and Clark and Cascade counties have similar outbreak levels 410 in several of the archetypes, which is not surprising, as they are linked by major state 411 roads and have larger population cities, but the analysis confirms that these connections 412 are important. In the east, further analysis could be done focusing on all of the counties 413 in this region to determine how COVID-19 spreads west if initiated in the east. The 414 alpha time series would be used in this analysis, as it represents the transitions from one 415 spatial pattern to another, and holds all time dependent information of the data. 416 We close by commenting on the care that must be used in the creation and analysis 417 of the archetypes. They have the advantage of automatically showing the counties that 418 experience simultaneous outbreaks, and the state of the other counties during these 419 time periods. Understanding the archetypes is intuitive, unlike graphical representations 420 of PCA vectors. However, if outliers are present in the data (like the inflated counts 421 from small population counties), they can bias the selection of the archetypes. This can 422 be mitigated by filtering out these data points, as we did using an information measure 423 between time series. In doing so we retained the small population counties that were

15/18
. CC-BY 4.0 International license It is made available under a is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. (which was not certified by peer review) The copyright holder for this preprint this version posted March 9, 2023.