Harnessing interpretable and unsupervised machine learning to address big data from modern X-ray diffraction

Significance Recent instrumental advances now enable large volumes of X-ray diffraction to be collected with high efficiency at synchrotron sources. This article shows that machine learning can produce an unbiased and comprehensive analysis of such data that uniquely combines both long-range and short-range structural correlations as a function of temperature. In Cd2Re2O7, machine learning characterizes both the critical behavior of the primary order parameter and the Goldstone mode fluctuations that drive symmetry breaking at a lower temperature. The approach results from a synergy between computer scientists and physicists, producing a machine learning strategy that is interpretable within the established framework of physics and adaptable to other “big data” problems in materials science and engineering.

In order to separate these high intensity features from rest of the data, we take advantage of their sparsity relative to the 71 background. Specifically, we minimize the Kullback-Leibler divergence DKL, where for probability distributions p(x), q(x): [1] 73 between the distribution of log I q i (T ) with a high intensity cutoff and a gaussian. Information theoretically, the Kullback-

74
Leibler divergence quantifies the information loss associated with approximating the distribution p(x) by q(x). In this context, 75 the minimizing DKL optimally chooses a high-intensity cutoff so that the distribution of the remaining log I q i (T ) looks closest 76 to a normal distribution. This is illustrated by applying our procedure to the Sr3Rh4Sn13 data in Fig. S2. Optimization is 77 performed via gradient descent. Note that optimizing with this sliding cutoff is necessary and a Gaussian cannot be directly 78 fitted because the distribution log I q i (T ) is heavy tailed. Directly fitting with a Gaussian yields a higher cutoff susceptible to 79 missing important low-intensity features. 80 The thresholding procedure is the only step in the X-TEC pipeline where we analyse the temperature averaged intensity.

81
The rest of the analysis is solely focused on the temperature dependence of these thresholded intensities. Gaussian fit with a sliding high-intensity cutoff. The algorithm selects the non-Gaussian high intensity features above the cutoff, and these are the intensities that contain the physically relevant signals.
C. Re-scale data. A re-scaling step is necessary to bolster the ability to cluster distinct functional forms of the intensity-83 temperature trajectories rather than the magnitude of intensities. While there are different ways to re-scale the intensities, we 84 narrow down the choice to two schemes. The user decides the optimal re-scaling procedure depending on the nature of the data 85 and the physics of interest. To focus on trajectories that show high variance in temperature compared to the background (such 86 as the CDW order parameters in Sr3Rh4Sn13), one re-scales the intensities {I q i (Tj); j = 1, . . . , d T } with their mean at each qi 87 given by, where µ q i = d −1 T j I q i (Tj) is the mean value of the trajectory at qI . On the other hand, if the user decides to focus on the 90 low variance trajectories (such as the Goldstone mode fluctuations in Cd2Re2O7), a z-scoring is the more efficient choice for 91 rescaling, given by,

94
D. Select number of clusters K. Following the preprocessing, the user starts with a guess for the number of clusters K. This 95 guess may be physically motivated, such as a prior knowledge on the number of order parameters in the system. The optimal 96 number of clusters for the data is later deduced in step G.

97
E. X-TEC-detailed (X-TEC-d) and X-TEC-smoothed (X-TEC-s). As discussed in the main text, the user decides between the 98 two modalities of X-TEC depending on whether to focus on order parameter like features from the peak intensities or their 99 fluctuations in the surrounding diffuse scattering.  Our label smoothing requires us to construct a weighted graph connecting similar momenta in order for diffusion to occur.

115
This may be done by computing the following kernel: where the Qi are the reciprocal basis vectors and is the relevant length scale for the local correlations. The structure of this 118 kernel is shown in Fig. S3 where K(k, 0) is plotted as an intensity for a 2D slice.

119
This kernel is really just a weighted adjacency matrix. By incorporating a cutoff in the weights, we may exploit the sparsity 120 of our system for fast matrix-vector multiplication. When handling large datasets, this cutoff is essential since the full kernel is 121 too large to be stored in any reasonable amount of RAM. Define A to be the matrix associated with this kernel after having 122 normalized the rows i.e. it is row stochastic so that j Aij = 1. Now define P to be the matrix consisting of cluster probabilities 123 calculated by the E-step. Specifically, let the first index correspond to the different momenta and the second to the cluster 124 probabilities so that P is also row stochastic. Then the product AP is also row stochastic since jk AijP jk = When the size of the data is large, one can resort to a cheaper version of label smoothing which is peak averaging. With 130 peak averaging, the intensity of connected pixels that pass the thresholding are replaced by their pixel averaged intensity.

131
This is effective in revealing the order parameters when the intensity of the peaks are much larger than the intensity of the   The corresponding cluster assignments in the (h, k, 0) plane, with pixels color coded according to their cluster assignments.
In Fig S4, we show the X-TEC -d results of the Sr3Rh4Sn13 data using K = 3 clusters. The cluster assignments are 139 labeled through different colors. The characteristic temperature trajectory of each cluster is given by the cluster mean [lines in show one standard deviation]. Clusters whose means are well separated from others beyond their standard deviation are robust 142 trajectories, as is the order parameter like (blue) cluster in Fig. S4.

143
The cluster assignments in qi space are shown in Fig. S4(b) where the pixels are assigned the colors according to their 144 cluster assignments. We find that without label smoothing, Fig S4( where qi(zi) is some distribution over a random variable zi (in our case this will be the cluster assignment) s.t. bound implies that improving˜ (q, θ) necessarily improves (θ) but since theta is unknown, we will have to make a guess, θt, 162 and improve it iteratively. This iterative prescription is known as expectation maximization (EM). It consists of an E-step, 163 where q t i ← p(zi|xi; θt) and an M-step θ t+1 ← argmax θ˜ (q t , θ). 164 We now derive the EM algorithm for the GMM. The E-step follows directly from the model likelihood and Bayes' theorem: For the M-step, we must find {π, µ, Σ} that optimizes our lower log-likelihood bound: where λ is a Lagrange multiplier constraining the mixing weights to sum to unity.

167
Solving for the mixing weights: Solving for the mean: Solving for the covariance is a little trickier. First note the following matrix identities for symmetric invertible matrix A: Now, when solving for the covariance we promote the covariance cluster index to an upper index so that the lower indices 171 refer to the matrix elements: Note that all quantities derived about have the same form as one would expect from standard regression but with each data 173 point xi having a cluster weight w k i .

A. Specific Heat Measurements.
In the main text, the heat capacity (Cp) of Cd2Re2O7 was displayed in Fig. 3(b). The data 195 shown in that figure was processed by a standard method in relaxation calorimetry ("pseudostatic method") in which the 196 heat capacity is assumed to be constant throughout the heating and cooling segments of an applied heat pulse during which

197
∆T ≪T. However, in the presence of a 1st order transition, the shape and magnitude of a peak in Cp at the phase transition 198 temperature can be modified, while the hysteresis can be lost, when using the pseudostatic method. For this reason, we have 199 also used the "scanning method" for which Cp is numerically determined at every point in the warming and cooling segments, 200 which yields a more accurate peakshape and hysteresis for a 1st order transition at the cost of noise and absolute accuracy.

201
A more detailed description of pseudostatic and scanning analysis can be found in Ref. 7. Fig. S7 shows the temperature 202 dependence of Cp in the vicinity of the ∼ 113 K phase transition when analyzed using the scanning method. A small but 203 resolvable thermal hysteresis was observed between the peaks in Cp from the heating and cooling curves, which is suggestive of 204 a latent heat and hence a first-order character. We do note, however, that the peak height and width of the peak in Cp did not 205 differ substantially between these two methods, as would also be anticipated for a first-order transition, and for this reason the 206 analysis of Cp alone is not definitive in identifying the order of the transition.

207
Cd2Re2O7. We first performed scans using an x-ray energy of 87 keV, which contained scattering spanning nearly 15,000 209 Brillouin zones, A first pass of X-TEC -s * for two clusters (K = 2) readily finds a cluster whose intensity rises sharply below 210 Ts1 = 200 K [the purple cluster in Fig. S8(a)]. The crisp clustering results with tight variance around the means reflect 211 the amplification of the meaningful trend upon using data from a large number of BZ's. By examining the X-TEC cluster 212 assignments, we find the purple cluster to exclusively consist of peaks with Q = (H, K, L), with all indices even, exactly one of 213 which is not divisible by four, using the cubic indices of Phase I [see Fig. S8(b)]. Peaks that are equivalent in the cubic phase 214 have different temperature dependence in Phase II, implying that the sample is untwinned, something that is confirmed by 215 our high-resolution data. This means that the presence of (00L) peaks with L = 4n + 2 below Ts1 in phase II unambiguously 216 rules out all the tetragonal space groups compatible with the pyrochlore structure, apart from I4m2 and I4. According to an 217 earlier group theoretical analysis(8), of these two, only the former is compatible with a single second-order phase transition, 218 so our data is strong confirmation of previous conclusions that, at Ts1, I4m2 phase is selected out of two-dimensional Eu 219 representation(9, 10).     5. We cluster the peak-averaged trajectories using K = 2 clusters. We found two clusters to be the minimum number 231 necessary to separate all distinct behaviors and that there was no advantage to using more than two. 232 6. The dashed lines in Fig. 3(c) and symbols in Fig 4(a) show the cluster averaged intensity trajectory of the two 233 clusters. The cluster averaged trajectory is shown for the full temperature range: [30 K, 300 K], although the 234 clustering assignments were obtained from trajectories ≤ 150 K.

235
• X-TEC -d (peaks opened) on cubic forbidden peaks of high resolution data : Fig 3(c-d), Fig: 4(c) 236 1. We select a 50 × 50 × 50 window around each known peak center and threshold as described in SM 1-B. 3. We rescale the data by z-scoring it.
4. We cluster the data using K = 3 clusters.

241
5. The resulting cluster averaged intensity trajectory for the full temperature range: [30 K, 300 K] is shown as solid 242 lines in Fig. 3(c). 243 6. Cluster averaging the absolute intensity trajectories washes out the characteristic temperature dependence of the 244 diffuse halos. This can be remedied by cluster averaging over z-scored intensities. This is reported in Fig. 4(c).

245
• X-TEC -s (peak averaged) on low resolution data: Fig S8   246 1. In order to reduce noise, we first construct an average BZ mask by thresholding every BZ as described in SM 1-B 247 and then averaging the thresholded BZs together.  3. We multiply each BZ by the average BZ mask to remove noise and emphasize the peaks.  7. Finally we cluster using K = 2 clusters. We subtract the minimum value of the cluster means when plotting to 257 emphasize the order-parameter like behavior of the purple cluster in Fig. S8(a). Here X-TEC analyses the data for 258 the full temperature range [30 K, 300 K].

259
• Processing times for X-TEC The X-TEC analysis of the higher resolution XRD data for Cd2Re2O7 is the most time 260 consuming of all the cases studied in this paper, and takes ∼ 10 min in total to run.  Fig. S9. Four-cluster X-TEC-s (peak averaged) results on the high resolution measurements of Cd2Re2O7 retaining all Bragg peaks. Two of these sub-cluster trajectories (yellow and green symbols) identify with the cubic forbidden trajectories shown in Fig. 3(c) and Fig. 4(a) of main text. The other two sub-cluster trajectories (magenta and brown lines) arise from peaks that are not forbidden in the high-temperature cubic phase. The temperatures of the two structural phase transitions are shown as dotted lines. allowed peaks. By including all Bragg peaks, four clusters are sufficient for X-TEC -s to identify all the distinct trajectories.
268 Fig. S9 shows the cluster means (z-scored intensity trajectories) for all four clusters identified by X-TEC -s (peak averaged) 269 analysis. Two of these sub-clusters (yellow and green symbols) can be identified with the behavior of the cubic excluded peaks.

270
It should be noted that these clusters represent the average temperature dependence of all the peaks assigned to their respective

276
We do not currently have an explanation for this remarkable behavior.

277
The structural phase transition at Ts1 is from the cubic pyrochlore structure, with space group F d3m, to a distorted that have been assigned to the yellow and green clusters of X-TEC -s shown in Fig. 4(a). Equations 10 and 11 show that the 299 006 (yellow) peak is only sensitive to δz Cd and δz Re , whereas the 060 (green) peak is only sensitive to in-plane distortions.

300
The fit to the 006 peak yields relative z-axis distortions that are equal and opposite, i.e, δz Re = −δz Cd , illustrated in Fig.   301 4(b). The out-of-phase distortions are the reason for the flattening of the peak intensity of the 006 peak between 180 K and 302 120 K, confirming the conclusions based on the fits to the cluster means in Fig. 4(a). On the other hand, the 060 peak follows 303 the scaling law from 200 K to 120 K, showing either that δx Re has the same sign as δx Cd or that one of the distortions is 304 much larger than the other. This is an example where the temperature dependence of the peak intensities below a structural 305 phase transition yields information on the relative internal distortions, which have proved to be too subtle for conventional 306 crystallographic refinement until now. Cd, Nb and O masses), so all mode energies will be multiplied by the same constant in order to agree with Raman data (12) 326 for the Higgs energy at T =0 (85 cm −1 ). 327 We now have all we need to calculate the order parameter, the phase boundary, and mode energies. What about the 328 mode intensities? The basic idea can be seen from the work of Fleury (13) and Shapiro (14). The energy integrated intensity 329 (appropriate for the diffuse scattering collected from high energy x-rays) is given by (14) 330

E. Mode energies and intensities from
where n(ω) is the Bose factor, ωq is the mode energy for a given q, and Γq is the lifetime broadening. To evaluate, we choose parameters as in Fig. 3b of (8), with b1=0.3. We then do the following normalizations. a1 is some 348 constant times T − Ts1. This constant is adjusted so that Ts1 is 200 K and Ts2 is 113 K. Then the mode energies are normalized 349 as stated above (so that the Higgs mode energy is equal to 85 cm −1 at T =0 as observed by Raman (12)). Finally, the intensities 350 are normalized by Ts1. In Fig. S11, the resulting mode energies and intensities are shown. Note the small jump in the Higgs 351 energy and the dip in the Goldstone energy at Ts2. Also that the Goldstone intensity completely dominates outside the critical 352 region associated with Ts1. As an aside, the Raman data cut-off at about 6 cm −1 as noted above. The prediction is that 353 the Goldstone mode energy should rise above this value at low T . We suggest then that the Raman mode seen at 30 cm −1 354 below Ts2 could be the Goldstone mode. This in turn implies that the central peak in the intensity from Raman has more 355 contributions to it than the Goldstone one, and this would presumably be due to elastic scattering from impurities and static 356 short-range structural disorder. Landau mode intensities as a function of T . Outside of the critical region near Ts1 (200 K), the intensity is dominated by the Goldstone intensity. Note the resemblance of the calculated intensity to the XRD diffuse scattering intensity presented in this paper (Fig. 4(c)).
Finally, some caveats. First, the behavior well below Ts2 cannot be taken too seriously since Landau theory is not valid 358 at low T where Q(T ) flattens as a function of T (as observed for the Higgs mode by Raman). Nor for the intensities where 359 the T /ω approximation for n(ω) is not valid. Second, the theory is for a pure Eu model. In reality, the secondary mode A2u 360 (corresponding to distortions along the < 111 > trigonal axis orthogonal to Eu distortions) will play some role, and its coupling 361 to Eu is also an anisotropy term in the Landau energy (it does not exist for I4122) (8). Finally, the critical exponent near Ts1 362 is the mean field one. In reality, experiment finds β=1/4, not 1/2. Despite these caveats, Fig. S11 is remarkably similar to 363 the Raman data, and the XRD data reported in this paper. This brings into question the interpretation of the pump-probe 364 measurements in Ref. (15) which claims that a structural soft mode does not exist for Cd2Re2O7.