Regression with race-modifiers: towards equity and interpretability

The pervasive effects of structural racism and racial discrimination are well-established and offer strong evidence that the effects of many important variables on health and life outcomes vary by race. Alarmingly, standard practices for statistical regression analysis introduce racial biases into the estimation and presentation of these race-modified effects. We advocate abundance-based constraints (ABCs) to eliminate these racial biases. ABCs offer a remarkable invariance property: estimates and inference for main effects are nearly unchanged by the inclusion of race-modifiers. Thus, quantitative researchers can estimate race-specific effects “for free,” that is, without sacrificing parameter interpretability, equitability, or statistical efficiency. The benefits extend to prominent statistical learning techniques, especially regularization and selection. We leverage these tools to estimate the joint effects of environmental, social, and other factors on 4th end-of-grade reading scores for students in North Carolina (n = 27,638) and identify race-modified effects for racial (residential) isolation, PM2.5 exposure, and mother’s age at birth.


Significance Statement
The far-reaching impacts of race suggest that the effects of many key variables on health and life outcomes are modified by race. Alarmingly, we show that standard practices for statistical regression analysis introduce racial biases into the estimation and presentation of these race-specific effects. We introduce an alternative approach that eliminates these racial biases, improves interpretability, and provides unique and appealing statistical properties for estimation and selection of race-specific effects. Using these tools, we identify and quantify race-modifiers for environmental, social, and other key factors on child educational outcomes in North Carolina. Our statistical approach offers high-impact potential to quantify, understand, and ultimately reduce the numerous and significant racial disparities in health and life outcomes.

D R A F T

Introduction
Health and life outcomes are inextricably linked to race (1,2). Racial disparities exist in birth outcomes, mortality, disease onset and progression, socioeconomic status, and police-involved deaths, along with many other health and life outcomes (2-4). These disparities persist even after adjusting for socioeconomic status and occur through multiple pathways (1). Structural racism contributes to significant differences in the quality of education, housing, employment opportunities, accumulation of wealth, access to medical care, and treatment in the criminal justice system (1,2,5-7). Perceived racial discrimination impacts both mental and physical health through heightened stress responses, health behaviors, and traumatic experiences (8,9). Thus, rigorous studies of health and life outcomes must carefully consider race as a primary factor.
That race permeates so many aspects of an individual's life course is a strong indicator that the effects of important factors ($X$) on health and life outcomes ($Y$) may be race-specific (10). Regression analysis, the primary statistical tool to quantify how these covariates $X$ determine, predict, or associate with an outcome $Y$, must therefore consider race-modifiers for $X$. Indeed, there is abundant and growing evidence for race-specific modifiers, including the effects of red-lining, PM2.5 exposure, and cigarette use on mortality risks (11-13); maternal age, poverty, education, and hypertension on infant birthweight, infant mortality, and maternal stroke risk (14-16); education level on multiple health outcomes (17); mood/anxiety disorder on chronic physical health conditions (18); perceived racism on mental health (9); age on allostatic load scores, known as "racial weathering" (19); and the timing of hypertension, insulin resistance, or diabetes onset (20,21), among many others (22-26). The identification and quantification of race-modified effects are essential to understand and eliminate harmful race disparities in health and life outcomes (27).
However, there are significant racial biases that occur in commonplace estimation, inference, and presentation of results for regression analysis with race as a covariate. We showcase these effects in the following example.
Motivating example. Suppose $Y$ is a continuous outcome variable, $X$ is a continuous covariate of interest, and $R$ is a categorical variable (race). The most widely-used model to estimate both the main and race-specific effects of $X$ on $Y$ is the linear regression model for the expectation $\mu(x, r) := E(Y \mid X = x, R = r)$, which we refer to as the race-modified model:

$$\mu(x, r) = \alpha_0 + x\alpha_1 + \beta_r + x\beta^*_r \tag{1}$$

or equivalently, y ∼ 1 + x + race + x:race in Wilkinson notation. By including $X$ and race (the main effects) and their interaction (the race-modifier), Eq. (1) produces race-specific intercepts and slopes:

$$\mu(0, r) = \alpha_0 + \beta_r \tag{2}$$

$$\mu'_x(r) := \mu(x + 1, r) - \mu(x, r) = \alpha_1 + \beta^*_r \tag{3}$$

for each race $r$. By fitting Eq. (1) to observed data $\{x_i, r_i, y_i\}_{i=1}^n$, the analyst can quantify the effect of $X$ on $Y$ and determine whether this effect varies by race, which is essential information for constructing equitable policies and practices.
However, the race-modified model, which uses "one-hot" or "dummy variable" coding of the categorical variable $R$, is overparametrized: neither $\{\alpha_0, \beta_r\}_r$ nor $\{\alpha_1, \beta^*_r\}_r$ is identifiable without further constraints. Any constant could be added to $\alpha_0$ and subtracted from each $\{\beta_r\}$, and similarly for $\alpha_1$ and $\{\beta^*_r\}$, which alters each parameter but leaves Eq. (1) unchanged. Thus, additional constraints are needed for estimation and inference with the race-modified model and its generalizations.
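The lack of identifiability can be checked numerically. The following is a minimal sketch with made-up parameter values and group labels (none of them come from the paper): shifting the main terms by a constant and subtracting the same constant from every race-specific term leaves the mean function of Eq. (1) unchanged.

```python
# Hypothetical parameter values for three race groups (illustration only).
alpha0, alpha1 = 1.0, 0.5
beta = {"A": 0.0, "B": -0.3, "C": 0.2}        # race-specific intercept terms
beta_star = {"A": 0.0, "B": 0.1, "C": -0.4}   # race-specific slope terms


def mu(x, r, a0, a1, b, b_star):
    """Race-modified mean function of Eq. (1)."""
    return a0 + x * a1 + b[r] + x * b_star[r]


# Add constants to (alpha0, alpha1) and subtract them from every beta term.
c0, c1 = 2.0, -0.7
beta_shift = {r: v - c0 for r, v in beta.items()}
beta_star_shift = {r: v - c1 for r, v in beta_star.items()}

# Largest change in the fitted mean over a few x values and all groups.
max_gap = max(
    abs(
        mu(x, r, alpha0, alpha1, beta, beta_star)
        - mu(x, r, alpha0 + c0, alpha1 + c1, beta_shift, beta_star_shift)
    )
    for x in (-1.0, 0.0, 2.5)
    for r in beta
)
print(max_gap)  # zero up to floating-point rounding
```

Both parameter sets describe the same regression surface, which is why an identifying constraint has to be chosen before the individual coefficients can be interpreted.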
Undoubtedly, the most common identification strategy is reference group encoding (RGE): a reference group is selected, typically non-Hispanic (NH) White, and removed ($\beta_{\mathrm{NHW}} = \beta^*_{\mathrm{NHW}} = 0$ for $r = \mathrm{NHW}$). This is the default strategy for all major statistical software implementations of (generalized) linear regression, including R, SAS, Python, MATLAB, and Stata. However, we argue that RGE output is racially biased (28), difficult to interpret, and obscures important main and race-specific effects. We categorize these significant limitations into presentation bias and statistical bias.
Presentation bias. Table 1 (left) displays standard output for a regression of 4th end-of-grade reading scores on racial (residential) isolation (RI), race, and their interaction. RI measures the geographic separation of NH Black individuals and communities from other race groups, and thus is an important measure of structural racism (5,6,29-31). The dataset, detailed subsequently, includes n = 27,638 students in North Carolina (58% NH White, 36% NH Black, 6% Hispanic). Using the race-modified model, the goal is to estimate the effect of RI on reading scores and determine if, and how, the RI effect varies by race.
Critically, under RGE, the RI effect (red) actually refers to the RI effect only for NH White individuals, $\alpha_1 = \alpha_1 + \beta^*_{\mathrm{NHW}} = \mu'_x(\mathrm{NHW})$. Similarly, the "Intercept" refers to the NH White intercept, $\alpha_0 = \alpha_0 + \beta_{\mathrm{NHW}} = \mu(0, \mathrm{NHW})$. First, this output is inequitable: it elevates a single race group (NH White) above others. Further, all other race-specific effects are presented relative to the NH White group. For instance, RI:NH Black refers to the difference between the RI effects for NH Black students and NH White students: $\beta^*_{\mathrm{NHB}} = \mu'_x(\mathrm{NHB}) - \mu'_x(\mathrm{NHW})$. Implicitly, and problematically, this framing presents NH White effects as "normal" and other race groups as "deviations from normal", which is known to bias interpretations of results (32). Second, this output is unclear: it is nowhere indicated that the intercept and RI effects are specific to NH White students.
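The RGE interpretation can be verified on synthetic data. The sketch below assumes NumPy, three hypothetical groups with group 0 as the reference, and made-up slopes; it is not the NC analysis. It fits the race-modified model under RGE and compares the coefficients with slopes from separate per-group regressions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per = 200
groups = np.repeat([0, 1, 2], n_per)          # group 0 plays the reference role
x = rng.normal(size=groups.size)
true_slopes = np.array([0.0, -0.5, 0.3])      # hypothetical race-specific slopes
y = 1.0 + true_slopes[groups] * x + rng.normal(scale=0.5, size=groups.size)

# RGE design: intercept, x, then dummies and x-interactions for groups 1 and 2 only.
d1 = (groups == 1).astype(float)
d2 = (groups == 2).astype(float)
X_rge = np.column_stack([np.ones_like(x), x, d1, d2, x * d1, x * d2])
coef, *_ = np.linalg.lstsq(X_rge, y, rcond=None)


def group_slope(g):
    """OLS slope from a regression that uses only the observations in group g."""
    return np.polyfit(x[groups == g], y[groups == g], 1)[0]


x_coef = coef[1]
print(x_coef, group_slope(0))                    # "x" row: the reference-group slope
print(coef[4], group_slope(1) - group_slope(0))  # "x:group1" row: a slope difference
```

In this sketch the row labeled "x" is the reference-group slope rather than a global effect, and each interaction row is a difference from that reference slope.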
We emphasize that the presentation format in Table 1 (left) is predominant in scientific journals.
Among recent publications in social science journals, it was found that 92% of such tables used NH White as the reference group, while less than half explicitly stated the reference group (28).
A cursory inspection of this output might result in a mistaken interpretation of the RI effect as a global effect, rather than a NH White-specific effect. Finally, and relatedly, this output is misleading: the RI effect is reported to be small and insignificant. This occurs because the RI effect for NH White individuals is estimated to be small and insignificant. However, this output, and RGE more broadly, does not provide a global, race-invariant x-effect (i.e., the coefficient on x). For comparison, consider the main-only model that omits the race-modifier,

$$\mu^M(x, r) = \alpha^M_0 + x\alpha^M_1 + \beta^M_r \tag{4}$$

where the superscript M emphasizes that these parameters are distinct from those in the race-modified model. Here, the x-effect is global: $\mu^M(x + 1, r) - \mu^M(x, r) = \alpha^M_1$ does not depend on the race group $r$. Fitting the main-only model, we identify a highly significant and detrimental global RI effect (Estimate = −0.042, SE = 0.007, p < 0.001; SI Table A1), which is not identified in the default RGE output (Table 1, left). Of course, the main-only model does not estimate race-specific effects and thus is not a satisfactory solution.

Statistical bias. The racial inequity in RGE is not limited to the presentation of results (Table 1), but also permeates statistical estimation and inference. Modern statistical learning commonly features penalized regression, variable selection, and Bayesian inference (33). Broadly, these regularization strategies seek to stabilize (i.e., reduce the variance of) estimators, typically by "shrinking" coefficients toward zero. This approach is particularly useful in the presence of a moderate to large number of covariates that may be correlated. However, under RGE, shrinking or setting coefficients to zero introduces racial bias to the estimation. Critically, shrinkage or selection of the race-specific terms, $\beta^*_r \to 0$, does not innocuously shrink toward a global slope; rather, it implies that the coefficient on x for race $r$ is pulled toward that of the NH White group, $\mu'_x(r) \to \mu'_x(\mathrm{NHW})$.
First, such an estimator is racially biased. In particular, this bias attenuates the estimated differences between the x-effects for each race and NH White individuals. Identification and quantification of such race-specific effects are precisely the goal of race-modified models and are essential for equitable policy-making. Furthermore, RGE cannot distinguish between shrinkage toward a global, race-invariant x-effect and shrinkage toward the NH White x-effect: both require $\beta^*_r \to 0$ for all $r > 1$. Naturally, a fundamental goal of penalized estimation and selection in this context is to remove unnecessary race-specific parameters. However, with RGE, the cost is racial bias in the shrinkage and selection. Thus, default RGE cannot fully and equitably leverage the state-of-the-art in statistical learning.
Overview. The primary goal of this paper is to introduce, study, and validate alternative statistical methods that eliminate these racial biases and deliver more interpretable results. By carefully reframing the regression model, we uniquely attain the best of both worlds from the race-modified model Eq. (1) and the main-only model Eq. (4): estimation and inference for race-specific x-effects as well as global x-effects (Table 1, right). This "global" effect interpretation, unavailable to RGE, is the key to removing the racial (presentation and statistical) biases, and is validated theoretically and empirically. We apply these tools to identify and quantify the race-specific effects of multiple environmental, social, and other key factors on 4th end-of-grade reading scores for students in North Carolina.

Alternative approaches.
Although RGE is used in the overwhelming majority of regression analyses, there are several alternative identification strategies for Eq. (1) and its generalizations. The most common alternative is data disaggregation, which subsets the data by race groups and fits separate regression models (12,25,31,34). This approach produces race-specific intercepts and slopes as in Eq. (2) and Eq. (3), and thus implicitly acknowledges the importance of including race-modifiers.
However, these separate models do not produce global (race-invariant) x-effect estimates or inference, cannot incorporate information-sharing or regularization across race groups (often leading to variance inflation and reduced power), and require separate model diagnostics. Alternatively, sum-to-zero (STZ) constraints require that $\sum_r \beta_r = 0$ and $\sum_r \beta^*_r = 0$. While STZ constraints address the inequities in RGE, the resulting model parameters are difficult to interpret, and the estimators do not yield the useful properties offered by the proposed approach (SI Figure A9). Finally, overparametrized estimation considers Eq. (1) and its generalizations without any identifying constraints and relies on penalized (lasso, ridge, etc.) regression or Bayesian inference with informative priors to produce unique estimators. However, the model parameters remain nonidentified, which leaves the global x-effect $\alpha_1$ and the race-specific deviations $\beta^*_r$ (as well as $\alpha_0$ and $\beta_r$) without a clear interpretation. These alternative strategies are subsequently evaluated using simulated data (Figure 3; SI Figures A1-A4).
We emphasize that the presentation bias and statistical bias of RGE are not resolved merely by changing the reference group (e.g., to NH Black or Hispanic).Furthermore, these challenges apply for any categorical covariates, including gender identity, national origin, religion, and other protected groups.Thus, there is urgent demand for more equitable and interpretable regression analysis with categorical covariates.Here, we focus on race, but the proposed methods apply more broadly.

Methods
We introduce statistical estimation, inference, and algorithms that resolve these racial biases in linear regression. The parameters in Eq. (1) can be identified by constraints of the form $\sum_r c_r\beta_r = 0$ and $\sum_r c_r\beta^*_r = 0$, which enables unique estimation and valid inference. RGE sets $c_1 = 1$ and $c_r = 0$ for $r > 1$, while STZ constraints use $c_r = 1$ for all $r$. To build intuition for the proposed approach, notice that the generic constraints imply $\alpha_1 = \alpha_1 + \sum_r c_r\beta^*_r$. Fixing $\sum_r c_r = 1$ without loss of generality, $\alpha_1 = \sum_r c_r(\alpha_1 + \beta^*_r) = \sum_r c_r\mu'_x(r)$ is the average of the race-specific slopes according to the "race probabilities" $\{c_r\}$ (assuming $c_r \ge 0$ for all $r$). Clearly, the interpretability and equitability of the parameters in Eq. (1), as well as the statistical properties of the estimators (Theorem 1), depend critically on the choice of $\{c_r\}$.
We propose abundance-based constraints (ABCs) that elicit these "race probabilities" directly from the data:

$$\sum_r \hat\pi_r\beta_r = 0, \qquad \sum_r \hat\pi_r\beta^*_r = 0, \qquad \hat\pi_r = \text{proportion in (race) group } r \tag{5}$$

or equivalently, $E_{\hat\pi}(\beta_R) = 0$ and $E_{\hat\pi}(\beta^*_R) = 0$, where the expectation is taken over the categorical random variable $R$ with $P(R = r) = \hat\pi_r$. If known, the population (race) proportions may be used for $\{\hat\pi_r\}$; otherwise, we use the sample proportions based on the data $\{r_i\}_{i=1}^n$. ABCs uniquely identify the model parameters and will be enforced during estimation.
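To make the role of the weights concrete, the sketch below splits a set of race-specific slopes into a main effect and deviations under RGE, STZ, and ABC weights. The function name `identify_slopes` and the slope values are hypothetical; only the proportions (58%, 36%, 6%) are taken from the NC sample described earlier.

```python
import numpy as np


def identify_slopes(group_slopes, weights):
    """Split race-specific slopes into a main effect and deviations.

    Enforces sum_r c_r * beta_star_r = 0 with the weights c_r normalized to
    sum to one, so alpha1 is the c-weighted average of the group slopes.
    """
    c = np.asarray(weights, dtype=float)
    c = c / c.sum()
    slopes = np.asarray(group_slopes, dtype=float)
    alpha1 = float(c @ slopes)
    beta_star = slopes - alpha1
    return alpha1, beta_star


# Hypothetical slopes for (NH White, NH Black, Hispanic); not estimates from the paper.
slopes = [-0.01, -0.05, -0.03]
pi_hat = [0.58, 0.36, 0.06]

a_rge, b_rge = identify_slopes(slopes, [1, 0, 0])   # RGE: first group is the reference
a_stz, b_stz = identify_slopes(slopes, [1, 1, 1])   # STZ: equal weights
a_abc, b_abc = identify_slopes(slopes, pi_hat)      # ABCs: abundance weights
print(a_rge, a_stz, a_abc)
```

All three parametrizations reproduce the same race-specific slopes; they differ only in what the main effect and the deviations mean.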
ABCs immediately deliver equity and interpretability for the race-modified model parameters. First, the linear model satisfies $E_{\hat\pi}\{\mu(x, R)\} = \alpha_0 + \alpha_1 x$, which averages Eq. (1) over race to produce a global linear regression. Thus, while the race-modified model produces race-specific intercepts and slopes, the parameters $\alpha_0$ and $\alpha_1$ retain a global, race-averaged correspondence. Most important, the main x-effect satisfies

$$\alpha_1 = E_{\hat\pi}\{\mu'_x(R)\} \tag{6}$$

which is the race-averaged slope for the x-variable. Unlike with RGE, for which $\alpha_1 = \mu'_x(\mathrm{NHW})$, ABCs do not anchor $\alpha_1$ to the NH White group and instead provide a global interpretation for this key parameter. The benefits of this interpretation cascade down to the other parameters:

$$\beta^*_r = \mu'_x(r) - E_{\hat\pi}\{\mu'_x(R)\} \tag{7}$$

which is the difference between the race-specific slope and the race-averaged slope.

The intercept also retains a convenient, more equitable interpretation. Suppose that the continuous covariate is centered, $\bar x = 0$. Then the intercept parameter satisfies

$$\alpha_0 = E_{\hat\pi}\big[E_{\hat p_x}\{\mu(X, R)\}\big]$$

where the expectation is taken (separately) over $X \sim \hat p_x$ for $\hat p_x$ the empirical distribution of $\{x_i\}_{i=1}^n$ and $R \sim \hat\pi$. In this sense, $\alpha_0$ is a marginal expectation of $Y$ (SI Theorem S1) and is not anchored to the NH White group as in RGE. The race-specific intercepts proceed similarly:

$$\beta_r = E_{\hat p_x}\{\mu(X, r)\} - E_{\hat\pi}\big[E_{\hat p_x}\{\mu(X, R)\}\big]$$

Equivalently, the intercept is $\alpha_0 = E_{\hat\pi}\{\mu(\bar x, R)\}$ and the race-specific intercepts are $\beta_r = \mu(\bar x, r) - E_{\hat\pi}\{\mu(\bar x, R)\}$. Again, unlike RGE, the parameter $\beta_r$ is no longer relative to the NH White group. In conjunction, these results highlight the interpretability of model parameters under ABCs, including both global and race-specific effects, while critically avoiding the elevation of any single race group (Table 1). These advantages are accentuated by several important, and unique, invariance properties for statistical estimation described subsequently (Theorems 1-2).
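The race-averaging identity follows in one line from Eq. (1) and the ABCs. As a short worked derivation, using only that the proportions sum to one and that the two weighted sums in Eq. (5) vanish:

```latex
\begin{align*}
E_{\hat\pi}\{\mu(x, R)\}
  &= \sum_r \hat\pi_r \left(\alpha_0 + x\alpha_1 + \beta_r + x\beta^*_r\right) \\
  &= (\alpha_0 + x\alpha_1)\sum_r \hat\pi_r
     + \sum_r \hat\pi_r\beta_r
     + x\sum_r \hat\pi_r\beta^*_r \\
  &= \alpha_0 + x\alpha_1 .
\end{align*}
```

Applying the same steps to the race-specific slope $\alpha_1 + \beta^*_r$ shows that its race-average is $\alpha_1$, so the deviation $\beta^*_r$ is the race-specific slope minus the race-averaged slope.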
In contrast, the inequities from RGE are compounded in the multivariable setting. RGE sets $\beta_1 = 0$ and $\beta^*_{1,j} = 0$ for each covariate $j$. Thus, each $x_j$-effect is presented for or relative to the NH White group, $\alpha_j = \mu'_{x_j}(\mathrm{NHW})$ and $\beta^*_{r,j} = \mu'_{x_j}(r) - \mu'_{x_j}(\mathrm{NHW})$. Not only are these parametrizations inequitable, but they also fail to provide a global interpretation for each $x_j$-effect.
ABCs, and the ensuing interpretations, are similarly generalizable for multiple categorical covariates and multiple continuous-categorical interactions (SI Appendix).

Estimation invariance with ABCs. An additional advantage of ABCs is that the global interpretation of the coefficients on x also extends to estimation. Specifically, we compare ordinary least squares (OLS) estimates for the main-only model and the race-modified model. These models have distinct parameters with distinct advantages and disadvantages. The primary benefit of the main-only model is that it produces a global x-effect, $\alpha^M_1 = \mu^M(x + 1, r) - \mu^M(x, r)$; however, it does not admit race-specific effects. By comparison, the race-modified model produces race-specific effects; however, under RGE, there is no global x-effect: $\alpha_1 = \mu(x + 1, \mathrm{NHW}) - \mu(x, \mathrm{NHW})$ is the x-effect only for the NH White group.
Crucially, OLS estimation of the race-modified model under ABCs offers the best of both worlds. In addition to estimating race-specific effects, this approach delivers global x-effect estimates that are nearly identical between the main-only and race-modified models, under mild conditions. Specifically, let $\hat\sigma^2_{x[r]}$ denote the (scaled) sample variance of $\{x_i\}$ within race group $r$. When the variation in x is roughly the same within each race group, $\hat\sigma^2_{x[r]} \approx \hat\sigma^2_{x[1]}$ for all $r$, then the OLS estimates of the global x-effect are nearly invariant to whether or not the race-modifier (x:race) is included.

Theorem 1. Under ABCs, the OLS estimates for the race-modified model Eq. (1) and the main-only model Eq. (4) satisfy $\hat\alpha_1 \approx \hat\alpha^M_1$ whenever the (scaled) sample variances of $\{x_i\}_{i=1}^n$ are approximately equal across race groups, $\hat\sigma^2_{x[r]} \approx \hat\sigma^2_{x[1]}$ for all $r$.
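A small numerical check of this invariance is sketched below. It assumes NumPy, hypothetical group sizes and means, and takes the "scaled" variance to be the within-group variance with divisor $n_r$ (an assumption made here for illustration). The covariate is standardized within each group so that the equal-variance condition holds exactly, while the group means differ and the outcome is deliberately nonlinear.

```python
import numpy as np

rng = np.random.default_rng(1)
sizes = [300, 150, 50]                      # hypothetical group sizes
groups = np.repeat(np.arange(3), sizes)
n = groups.size

# Covariate with different group means but identical within-group variance.
x = np.empty(n)
for g, mean in enumerate([0.0, 2.0, -1.0]):
    z = rng.normal(size=sizes[g])
    x[groups == g] = mean + (z - z.mean()) / z.std()   # variance exactly 1 in each group

# Arbitrary outcome: the comparison does not rely on a correctly specified model.
y = np.sin(x) + 0.5 * groups + rng.normal(size=n)

D = (groups[:, None] == np.arange(3)).astype(float)    # one dummy column per group

# Main-only model, y ~ race + x (the full set of dummies absorbs the intercept).
coef_main, *_ = np.linalg.lstsq(np.column_stack([D, x]), y, rcond=None)
alpha1_main = coef_main[-1]

# Race-modified model: a separate intercept and slope for each group.
coef_mod, *_ = np.linalg.lstsq(np.column_stack([D, D * x[:, None]]), y, rcond=None)
slopes = coef_mod[3:]

pi_hat = np.array(sizes) / n
alpha1_abc = pi_hat @ slopes                # ABC main effect: race-averaged slope
print(alpha1_main, alpha1_abc)
```

With exactly equal within-group variances the two estimates agree up to rounding; with unequal variances the pooled slope weights groups by their sums of squares rather than by their proportions, and the agreement becomes approximate.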
This approximation is empirically robust to moderate violations of the equal-variance condition (SI Figure A9). Theorem 1 does not require independence or uncorrelatedness between $X$ and $R$: the covariate may have dramatically different means and distributions across race groups, as long as the scales are about the same. Thus, the result is distinct from classical estimation invariance results with OLS (35). Further, Theorem 1 makes no assumptions about the true relationship between $Y$ and the covariates. These results persist for multivariable regression:

Theorem 2. Consider the multivariable race-modified model Eq. (8) and the multivariable main-only model (the multivariable analogue of Eq. (4), which omits all race-modifiers). Under ABCs, the OLS estimates satisfy $\hat{\boldsymbol\alpha} \approx \hat{\boldsymbol\alpha}^M$ whenever the (scaled) sample covariances of the covariates are approximately equal across race groups, where the (scaled) sample covariance between $x_j$ and $x_h$ in group $r$ is defined in terms of $s_r(j, h) = \sum_{i: r_i = r} x_{ij}x_{ih}$ and $\bar x_r(j)$.

Theorem 2 shows that all $p$ estimated coefficients on $x_1, \ldots, x_p$ are approximately invariant to the inclusion of race-modifiers for all $p$ variables whenever the (scaled) sample covariance among the $p$ covariates is approximately the same for each race group. Importantly, this result allows for arbitrary dependencies among the $p$ variables $X$, includes a variety of dependencies between each $x_j$ and $R$, and does not impose any assumptions about the true relationship between $Y$ and $X$. Theorems 1 and 2 are presented using the sample (race) proportions for $\{\hat\pi_r\}$, but can be modified for population proportions.
We emphasize that Theorems 1 and 2 (and the equal-variance conditions) are not required for estimation and inference with ABCs. Rather, these results show that under ABCs, we may include race-modifiers "for free": despite the increase in complexity from main-only models to race-modified models, the estimated x-effects are nearly unchanged. Clearly, these results do not hold for RGE (Table 1, Figure 1) or STZ (SI Figure A9). Yet for ABCs, Theorems 1 and 2 remain accurate approximations (Figure 1), even under moderate violations of the conditions (SI Table A2).
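Estimation. Statistical estimation with ABCs requires solving a linearly-constrained least squares problem for models of the form Eq. (8) given data $\{x_i, r_i, y_i\}_{i=1}^n$. Define $\theta$ to be the model parameters $\{\alpha_0, \boldsymbol\alpha, \beta_r, \beta^*_r\}_r$ and $\tilde x_i$ to include the intercept, covariates, race variable indicators (i.e., "dummy variables"), and covariate-race interactions such that Eq. (8) may be written $\mu(x_i, r_i) = \tilde x_i^\top\theta$. Let $C$ encode ABCs such that $C\theta = 0$ enforces $\sum_r \hat\pi_r\beta_r = 0$ and $\sum_r \hat\pi_r\beta^*_{r,j} = 0$ for $j = 1, \ldots, p$, so $C$ has $m = p + 1$ rows corresponding to the number of constraints. The ABC OLS estimator is

$$\hat\theta = \arg\min_{\theta} \sum_{i=1}^n (y_i - \tilde x_i^\top\theta)^2 \quad \text{subject to } C\theta = 0 \tag{11}$$

To compute $\hat\theta$, and subsequently provide inference and penalized estimation, we reparametrize the problem into an unconstrained space with $m$ fewer parameters. Let $C^\top = QR$ be the QR-decomposition of the transposed constraint matrix with columnwise partitioning of the orthogonal matrix $Q = (Q_{1:m} : Q_{-(1:m)})$ into its first $m$ columns and its remaining columns. Because the columns of $Q_{-(1:m)}$ are orthogonal to the rows of $C$, any $\theta = Q_{-(1:m)}\zeta$ satisfies $C\theta = 0$, and Eq. (11) is equivalently solved using unconstrained OLS:

$$\hat\zeta = \arg\min_{\zeta} \sum_{i=1}^n (y_i - \tilde x_i^\top Q_{-(1:m)}\zeta)^2, \qquad \hat\theta = Q_{-(1:m)}\hat\zeta \tag{12}$$

The QR-decomposition has minimal cost due to the efficiency of Householder reflections and the low dimensionality of $C$ (36).

The ABC penalized least squares estimator is

$$\hat\theta(\lambda) = \arg\min_{\theta} \sum_{i=1}^n (y_i - \tilde x_i^\top\theta)^2 + \lambda P(\theta) \quad \text{subject to } C\theta = 0$$

where $\lambda \ge 0$ controls the tradeoff between goodness-of-fit and complexity (measured via $P$). Following Eq. (12), we instead solve the (unconstrained) penalized least squares problem

$$\hat\zeta(\lambda) = \arg\min_{\zeta} \sum_{i=1}^n (y_i - \tilde x_i^\top Q_{-(1:m)}\zeta)^2 + \lambda P(Q_{-(1:m)}\zeta)$$

and set $\hat\theta(\lambda) = Q_{-(1:m)}\hat\zeta(\lambda)$.

Inference. The reparametrization strategy in Eq. (12) allows direct application of classical inference theory to the ABC OLS estimator: $\hat\theta$ is a known, linear function of the (unconstrained) OLS estimator $\hat\zeta$. Thus, it is straightforward to derive the (Gaussian) sampling distribution of the ABC OLS estimator, which can be used to compute standard errors, hypothesis tests, and confidence intervals, and to establish unbiasedness and efficiency of the estimator (SI Appendix).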
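The constrained estimator of Eqs. (11)-(12) can be sketched in a few lines of NumPy for a single covariate. This is an illustration under stated assumptions, not the authors' implementation: the function name `abc_ols`, the parameter ordering, and the simulated data are all hypothetical.

```python
import numpy as np


def abc_ols(x, groups, y):
    """ABC-constrained OLS for y ~ 1 + x + race + x:race with one covariate.

    Parameter order: [alpha0, alpha1, beta_1..beta_K, beta*_1..beta*_K].
    """
    levels, idx = np.unique(groups, return_inverse=True)
    K = levels.size
    D = (idx[:, None] == np.arange(K)).astype(float)       # dummy variables
    X = np.column_stack([np.ones_like(x), x, D, D * x[:, None]])
    pi_hat = D.mean(axis=0)                                 # sample proportions

    # Constraint matrix C with m = 2 rows:
    # sum_r pi_r * beta_r = 0 and sum_r pi_r * beta*_r = 0.
    C = np.zeros((2, 2 + 2 * K))
    C[0, 2:2 + K] = pi_hat
    C[1, 2 + K:] = pi_hat

    # Full QR of C^T: the columns of Q after the first m span the null space of C.
    Q, _ = np.linalg.qr(C.T, mode="complete")
    Q_free = Q[:, C.shape[0]:]

    # Unconstrained OLS in the reduced space, then map back to theta.
    zeta, *_ = np.linalg.lstsq(X @ Q_free, y, rcond=None)
    theta = Q_free @ zeta
    return theta, C, pi_hat


rng = np.random.default_rng(2)
groups = rng.choice(["A", "B", "C"], size=500, p=[0.6, 0.3, 0.1])
x = rng.normal(size=500)
slope = {"A": 0.2, "B": -0.4, "C": 0.1}                     # hypothetical slopes
y = 1.0 + np.array([slope[g] for g in groups]) * x + rng.normal(scale=0.3, size=500)

theta, C, pi_hat = abc_ols(x, groups, y)
K = pi_hat.size
race_slopes = theta[1] + theta[2 + K:]                      # alpha1 + beta*_r
print(np.abs(C @ theta).max())                              # constraints hold up to rounding
```

Because the retained columns of Q are orthonormal, a ridge penalty on $\theta$ equals the same penalty on $\zeta$ in this sketch, so a standard ridge solver could be applied in the reduced space; a lasso penalty on $\theta$ is not separable in $\zeta$ and would need a solver that handles that structure.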
Sparsity. In the presence of many covariates, sparsity, whether enforced by variable selection or by penalized (lasso) estimation, becomes increasingly important to eliminate unnecessary parameters, reduce estimation variability, and simplify interpretations. However, the meaning of a zero coefficient depends on the identification strategy. Consider sparsity of the race-modifier, $\beta^*_r = 0$. Under RGE, $\beta^*_r = 0$ implies that the race-specific slope is equal to the NH White group slope, $\mu'_x(r) = \mu'_x(\mathrm{NHW})$, which elevates the NH White group. Under ABCs, this same sparsity $\beta^*_r = 0$ implies that the race-specific slope is equal to the race-averaged slope, $\mu'_x(r) = E_{\hat\pi}\{\mu'_x(R)\}$, thus eliminating the bias and asymmetry from RGE.
An especially concerning case arises when the global x-effect is zero ($\alpha_1 = 0$) but the race-modifier is nonzero ($\beta^*_r \neq 0$). Under ABCs, this occurs when the race-averaged x-effect is zero, $E_{\hat\pi}\{\mu'_x(R)\} = 0$, but there exists a nonzero race-specific x-effect, $\mu'_x(r) = \beta^*_r + \alpha_1 = \beta^*_r \neq 0$. However, if the main-only model without race-modifiers Eq. (4) were estimated in place of the race-modified model Eq. (1), then the estimated x-effect would be (nearly) zero, $\hat\alpha^M_1 \approx 0$ (Theorem 2), when in fact the x-effect is both significant and race-specific. Alarmingly, it is possible that quantitative analyses based on regression models that exclude race-modifiers obscure both important and race-specific effects of certain covariates (Figure 1).
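This masking effect is easy to reproduce. The sketch below assumes NumPy and a deliberately artificial, noise-free design: two equally sized hypothetical groups share the same x values but have opposite slopes, so the race-averaged effect is zero while both race-specific effects are not.

```python
import numpy as np

# Two equally sized groups with the same x values and opposite slopes (hypothetical).
x_half = np.linspace(-1.0, 1.0, 50)
x = np.concatenate([x_half, x_half])
groups = np.repeat([0, 1], x_half.size)
true_slopes = np.array([0.8, -0.8])
y = 2.0 + true_slopes[groups] * x            # noise-free to make the point exactly

D = (groups[:, None] == np.arange(2)).astype(float)

# Main-only fit (y ~ race + x): a single slope shared by both groups.
coef_main, *_ = np.linalg.lstsq(np.column_stack([D, x]), y, rcond=None)
alpha1_main = coef_main[-1]

# Race-modified fit: one slope per group, then the ABC race-averaged main effect.
coef_mod, *_ = np.linalg.lstsq(np.column_stack([D, D * x[:, None]]), y, rcond=None)
race_slopes = coef_mod[2:]
alpha1_abc = D.mean(axis=0) @ race_slopes

print(alpha1_main, alpha1_abc, race_slopes)
```

The main-only slope and the ABC main effect are both zero up to rounding, yet the race-modified fit recovers the two nonzero slopes, which is the pattern a main-only analysis would miss.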

Results
NC Education Data Analysis. The proposed methods are applied to study the effects of multiple environmental, social, and other key factors on educational outcomes and to assess whether, and how, these effects vary by race. The dataset features a cohort of n = 27,638 students in North Carolina (NC) formed by linking three administrative datasets, summarized below and described in detail elsewhere (31,39,40). First, NC Detailed Birth Records include maternal and infant characteristics for all documented live births in NC. We construct maternal covariates (mother's race, age (mAge), education level, marital status, and smoking status) and child covariates (sex and birthweight percentile for gestational age, BWTpct). From residential addresses, we compute racial isolation (RI) at birth, which measures the geographic separation of NH Black individuals and communities from other race groups (29,30). Second, NC Blood Lead Surveillance includes blood lead level (BLL) measurements for each child. Lead is an adverse environmental exposure with well-known effects on cognitive development and educational outcomes (41,42). Third, [...] the exposure (PM2.5) over the year prior to the test, which is another adverse environmental exposure that has been linked to educational outcomes (43). Data characteristics are in SI Table A3.

We estimate a multivariable linear regression for 4th end-of-grade reading scores that includes these environmental, social, and other factors, as well as race-modifiers (Table 2). Each continuous covariate (BLL, PM2.5, RI, mAge, and BWTpct) is centered and scaled and each categorical variable (mother's race, child's sex, mother's education level, mother's marital status, mother's smoking status, and economically disadvantaged) is identified using ABCs, specifically using the sample proportions for each group (SI Table A3). Race-modifiers are included for BLL, PM2.5, RI, mAge, and BWTpct. Standard model diagnostics confirm linearity, homoskedasticity, and Gaussian error assumptions.
ABCs generate output for each group in every categorical variable, which eliminates the presentation bias that would otherwise accompany each categorical effect (under RGE). There are highly significant (p < 0.01) negative effects for BLL and RI, where the adverse RI effect doubles for NH Black students ($\hat\mu'_{\mathrm{RI}}(\mathrm{NHB}) = \hat\alpha_{\mathrm{RI}} + \hat\beta^*_{\mathrm{RI:NHB}} = -0.020 + (-0.020) = -0.040$). This critical result for RI expands upon the previous model fit (Table 1): here, the model adjusts for many additional factors, yet the effect persists. Significantly lower test scores also occur for students who are NH Black, Male, or economically disadvantaged, and whose mothers are less educated, unmarried, or smokers at time of birth. Significant positive effects are observed for the opposite categories, which is a byproduct of ABCs (e.g., the Male and Female proportions are identical, so the estimated effects must be equal and opposite), as well as for mAge and BWTpct. Finally, and interestingly, PM2.5 is not identified as a significant main effect (p = 0.403), yet the race-specific effects are significant.
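The "equal and opposite" remark is a direct consequence of the constraint. As a short worked equation, assuming a two-level variable (here child's sex) with exactly equal sample proportions, and using the subscripts M and F as illustrative labels:

```latex
\begin{align*}
\hat\pi_{\mathrm{M}}\hat\beta_{\mathrm{M}} + \hat\pi_{\mathrm{F}}\hat\beta_{\mathrm{F}} &= 0,
\qquad \hat\pi_{\mathrm{M}} = \hat\pi_{\mathrm{F}} = \tfrac{1}{2}
\quad\Longrightarrow\quad
\hat\beta_{\mathrm{F}} = -\hat\beta_{\mathrm{M}} .
\end{align*}
```

More generally, for any two-level variable the constraint gives $\hat\beta_{\mathrm{F}} = -(\hat\pi_{\mathrm{M}}/\hat\pi_{\mathrm{F}})\,\hat\beta_{\mathrm{M}}$, so the two estimates always have opposite signs and differ in magnitude only through the ratio of the proportions.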
Alarmingly, a fitted model without the race-modifiers conveys an insignificant PM2.5 effect (Figure 1), which oversimplifies and misleads.
To showcase the utility and empirical validity of the ABCs invariance property (Theorem 2), Figure 1 presents the estimates and 95% confidence intervals for the main effects that are modified [...]

Figure 1. [...] (Table 2). ABCs exhibit invariance (Theorem 2): despite the additional race-modifier parameters, the point and interval estimates for the main effects (blue) are nearly indistinguishable from those in the main effects-only model (black), thus effectively allowing the inclusion of race-modifiers "for free". In contrast, the RGE terms (red) correspond to the $x_j$-effects for the NH White group and deviate substantially for PM2.5, RI, and mAge, including shifts in location and much wider intervals.

the parameterization, we compare ABCs and RGE. The estimated λ-paths for RI are in Figure 2; results for the remaining race-modified effects (BLL, PM2.5, mAge, and BWTpct) are in SI Appendix (Figures A1-A4). Notably, RGE fixes $\beta^*_{1,j} = 0$ for all λ, which results in 1) racially-biased shrinkage of the race-specific effects toward the NH White-specific effect and 2) attenuation of the RI effect $\hat\alpha_j$ (Figure 2, top right). ABCs resolve these issues. First, the model parameters are separately and equitably pulled toward zero (Figure 2, top left). Second, the RI effect $\hat\alpha_j$ is not attenuated, and preserves its magnitude until log λ ≈ 5 (Figure 2, top left). Finally, the race-specific RI effects merge at a global, and negative, RI effect estimate, which is selected by the one-standard-error rule (33) for choosing λ (Figure 2, bottom left).
These themes persist for the remaining race-modified effects (SI Figures A1-A4). We supplement the ABC and RGE lasso paths by including the lasso paths for overparametrized estimation (Over), which also uses dummy variables to encode all categorical variables and race-modifiers, but does not include any identifiability constraints. These parameters cannot be estimated by OLS, but can be estimated by lasso regression with λ > 0. Perhaps unsurprisingly, Over lasso estimation typically results in one of the coefficients $\{\alpha_j, \beta^*_{r,j}\}_r$ being set to zero immediately (i.e., for small λ) for each covariate $j$. This induces an identification similar to RGE, but without selecting the reference group in advance. Thus, it suffers from the same racial biases in estimation and selection that plague RGE.

Figure 2. [...] (solid) and one-standard-error rule (dot-dashed). The outcome is 4th end-of-grade reading score and the covariates include all variables in Table 2. Small λ approximately corresponds to OLS, while increasing λ yields sparsity. Under RGE, the estimates are pulled toward the reference (NH White) estimate, inducing statistical bias by race, and the RI effect is attenuated. By comparison, ABCs offer more equitable shrinkage toward a global RI effect, which is nonzero and detrimental for 4th end-of-grade reading scores.
shown; BWTpct, SI Figure A4); when the selection corresponds to the smallest $|\hat\beta^*_{r,j}|$ among race groups $r$ from ABCs, then the Over and ABC paths are similar (BLL, SI Figure A1; BWTpct, SI Figure A4). However, when this selection sets the main effect to zero, $\hat\alpha_j = 0$ (PM2.5, SI Figure A2), or overshrinks multiple coefficients toward zero (mAge, SI Figure A3), then the Over paths differ substantially from both the RGE and ABC paths and demonstrate erratic behavior (SI Figure A3).

Simulation results.
We evaluate the performance of ABCs for estimation, prediction, and inference using simulated data. Each simulated dataset is constructed from a Gaussian linear regression model Eq. (10) with p = 10 continuous covariates and one categorical (race) covariate with 4 levels. The p = 10 continuous covariates include six independent covariates, $X_j \sim N(0, 1)$ for j = 1, 2, 3, 6, 7, 8, and four covariates that depend on the categorical variable, $[X_j \mid R = r] \sim N(r, 1)$, i.e., mean one for group one, mean two for group two, etc., for j = 4, 5, 9, 10. The categorical variable is generated based on population proportions $\pi$, which we describe below. In addition to the intercept $\alpha_0 = 1$, the true coefficients for the continuous covariates are [...] and use a signal-to-noise ratio of one.

Figure 3. [...] nonoverlapping notches indicate significant differences between medians. These estimands are not invariant to the categorical encoding (or constraints), but ABCs and RGE are both satisfied in the data-generating process. ABCs (gold) outperform the other encodings (gray) within each estimation method.
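The covariate design described above can be sketched as follows. The function name `simulate_design`, the sample size, and the proportions in `pi` are hypothetical, and the outcome is not generated here because the true coefficient values are not available in this text.

```python
import numpy as np


def simulate_design(n, pi, rng):
    """Draw the race variable and the p = 10 covariates as described in the text."""
    pi = np.asarray(pi, dtype=float)
    race = rng.choice(np.arange(1, pi.size + 1), size=n, p=pi)   # levels 1..4
    X = rng.normal(size=(n, 10))                                  # N(0, 1) draws
    dependent = np.array([4, 5, 9, 10]) - 1                       # 0-based columns
    X[:, dependent] += race[:, None]                              # shift the mean to r
    return X, race


rng = np.random.default_rng(3)
pi = [0.55, 0.20, 0.15, 0.10]      # hypothetical proportions, not from the paper
X, race = simulate_design(5000, pi, rng)

# Group means of a race-dependent covariate (x_4) and an independent one (x_1).
for r in range(1, 5):
    print(r, X[race == r, 3].mean().round(2), X[race == r, 0].mean().round(2))
```

The race-dependent covariates have group means that track the group label while keeping unit variance within each group, which is the setting in which the equal-variance condition of Theorems 1 and 2 is satisfied in expectation.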
The omission of race-modifiers from the data-generating process satisfies both RGE and ABCs.
Finally, we require at least p + 1 observations for each categorical level, which is necessary for OLS estimation of the interaction effects.We simulate 500 such datasets for each (n, π) design.
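The data-generating design described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `simulate_dataset`, the coefficient vector `alpha`, the example proportions, the redraw loop used to guarantee at least $p + 1$ observations per level, and the definition of signal-to-noise ratio as signal variance over noise variance are all hypothetical choices that the text does not specify.

```python
import numpy as np

def simulate_dataset(n, pi, rng, p=10, snr=1.0):
    """Draw one dataset: race r in {1..K}, covariates X (n x p), outcome y."""
    pi = np.asarray(pi, dtype=float)
    K = len(pi)
    # ASSUMPTION: redraw race labels until every level has >= p + 1 observations.
    while True:
        r = rng.choice(np.arange(1, K + 1), size=n, p=pi)
        if np.bincount(r, minlength=K + 1)[1:].min() >= p + 1:
            break
    X = rng.standard_normal((n, p))
    dependent = [3, 4, 8, 9]            # zero-based indices of j = 4, 5, 9, 10
    X[:, dependent] += r[:, None]       # [X_j | R = r] ~ N(r, 1)
    alpha0 = 1.0
    # HYPOTHETICAL coefficients: the true values are not given in this section.
    alpha = np.array([1, 1, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)
    signal = alpha0 + X @ alpha         # no race main effects or race-modifiers
    # ASSUMPTION: SNR = var(signal) / var(noise).
    sigma = np.sqrt(signal.var() / snr)
    y = signal + sigma * rng.standard_normal(n)
    return X, r, y

rng = np.random.default_rng(0)
X, r, y = simulate_dataset(250, [0.4, 0.3, 0.2, 0.1], rng)
```

Repeating the call 500 times per $(n, \pi)$ design would mirror the replication scheme in the text.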
All competing methods follow Eq. (8), which includes all continuous covariates, race, and all race-modifiers. Thus, all competing models are overparametrized relative to the ground truth, with 55 columns of the unconstrained design matrix $X$ and 44 identifiable model parameters to estimate. The estimation approaches are OLS, ridge regression, and lasso regression. The tuning parameter for ridge and lasso regression is selected using the one-standard-error rule (33). The parametrizations determine the identification constraints on $\beta_r$ and $\beta^*_{r,j}$: we consider ABCs, RGE, and Over. This latter option is not identified for OLS, and thus is presented only for ridge and lasso regression.
Estimation of the model parameters θ is evaluated using root mean squared error (Figure 3).
Within each estimation method (OLS, ridge, lasso), ABCs offer substantial improvements over both RGE and Over. Similar gains occur for prediction (SI Figure A5), estimation of the race-specific slopes $\alpha_j + \beta^*_{r,j}$ (SI Figure A6), and confidence intervals (SI Figure A8). Specifically, ABCs produce narrower confidence intervals, and thus more powerful inference, while maintaining nominal coverage.

Discussion
The path to more equitable decision-making and policy requires a precise and comprehensive understanding of the links between race and health and life outcomes. Alarmingly, the primary statistical tool for this task (regression analysis with race as a covariate and a modifier) in its current form propagates racial bias in both the presentation of results and the estimation of model parameters.
We introduced an alternative approach, abundance-based constraints (ABCs), with several unique benefits.First, ABCs eliminate these racial biases in both presentation and statistical estimation of linear regression models.Second, ABCs produce more interpretable parameters for race-modified models.Third, estimation with ABCs features an appealing invariance property: the estimated main effects are approximately unchanged by the inclusion of race-modifiers.Thus, analysts can include and estimate race-specific effects "for free", i.e., without sacrificing the interpretability of the global (race-invariant) main effects.Finally, ABCs are especially convenient for penalized estimation and variable selection (e.g., lasso regression), with meaningful and equitable notions of parameter sparsity and efficient computational algorithms.
Using this new approach, we estimated the effects of multiple environmental, social, and other key factors on 4th end-of-grade reading scores for a large cohort of students (n = 27,638) in North Carolina. In aggregate, this analysis 1) identified significant race-specific effects for racial (residential) isolation, PM2.5, and mother's age at birth; 2) showcased the racial biases and potentially misleading results obtained under previous approaches; and 3) provided more equitable and interpretable estimates, uncertainty quantification, and selection, both for main effects and race-modified effects.
We compared ABCs against several alternative strategies for linear regression with categorical variables, with evaluations on real and simulated data for OLS and penalized (lasso and ridge) regression. For OLS estimation, a related approach to encode identifiability constraints is contrasts. In this approach, the linear model is fit under any minimally sufficient identification (RGE, STZ, ABCs, etc.) and the categorical variable coefficients are post-processed using linear contrast matrices. Examples include dummy coding (akin to RGE), effects coding (akin to STZ), weighted effects coding (WEC; akin to ABCs), and Helmert coding (for ordered categories). However, contrasts are typically reserved for simple ANOVA or main-only models and are difficult to combine with penalized estimation, variable selection, and Bayesian regression. Further, these previous approaches do not consider or resolve the inequities of reporting or estimating race-specific effects. In particular, WEC has been advocated only in cases when "a categorical variable has categories of different sizes, and if these differences are considered relevant" (44) or "certain types of unbalanced data that are missing not at random" (45), with regression output that suffers from the same presentation bias that afflicts RGE (46). We argue more forcefully that ABCs provide a necessary resolution to certain racial biases that occur under default (RGE) approaches. Similar challenges arise for other categorical covariates, including gender identity, national origin, religion, and other protected groups, and thus the potential impact of ABCs extends well beyond race.
We acknowledge that the interpretation of any "race" effect requires great care (47). Race encompasses a vast array of social and cultural factors and life experiences, with effects that vary across time and geography (27,48). In some settings, race data are unreliable or partially missing (49,50). These overarching challenges are not addressed in this paper.

ACKNOWLEDGMENTS.
We thank Amy Willis for helpful feedback and discussions that greatly
Eq. (A.1) requires $L(1 + p)$ constraints for identification; reference group encoding (RGE) sets $\beta_{\ell,1} = 0$ for all $\ell$ and $\beta^*_{\ell,1,j} = 0$ for all $\ell, j$. We extend ABCs based on the joint distribution of the categorical variables $R$. Specifically, let $\pi = \pi_{r_1,\ldots,r_L} = P(R_1 = r_1, \ldots, R_L = r_L)$. If known, the population proportions may be used for $\pi$; otherwise, we use the sample proportions based on the observed data $\{r_i\}_{i=1}^n$, i.e., $\hat\pi_{r_1,\ldots,r_L} = n^{-1}\sum_{i=1}^n I\{r_{i,1} = r_1, \ldots, r_{i,L} = r_L\}$. Concisely, the generalized ABCs are

$$E_{\pi}(\beta_{R}) = 0_L, \qquad E_{\pi}(\beta^*_{R,j}) = 0_L \quad \text{for } j = 1, \ldots, p, \tag{A.2}$$

where $\beta_R = (\beta_{1,R_1}, \ldots, \beta_{L,R_L})^\top$, $\beta^*_{R,j} = (\beta^*_{1,R_1,j}, \ldots, \beta^*_{L,R_L,j})^\top$, and $0_L$ is a vector of zeros. The joint constraints in Eq. (A.2) may be equivalently represented via separate marginal expectations for the $L$ sets of categorical covariate parameters: for instance, $E_{\pi}(\beta_{\ell,R_\ell}) = E_{\pi_\ell}(\beta_{\ell,R_\ell}) = \sum_{r_\ell} \pi_{\ell,r_\ell}\beta_{\ell,r_\ell} = 0$.

ABCs in Eq. (A.2) provide interpretable parameter identifications with equitable presentation and estimation. These interpretations are unchanged if some or all interaction terms are omitted from Eq. (A.1), which may occur if multiple categorical variables (e.g., sex, education level) are included as covariates, but only race is included as a modifier. ABCs imply that $E_{\pi}\{\mu(x, R)\} = \alpha_0 + x^\top\alpha$, so that averaging the regression Eq. (A.1) over all categorical variables (jointly) yields a multivariate regression with only continuous variables. Individually, each $x_j$-effect satisfies

$$\alpha_j = E_{\pi}\{\mu'_{x_j}(R)\}, \tag{A.3}$$

where $\mu'_{x_j}(r) = \mu(x_j + 1, x_{-j}, r) - \mu(x_j, x_{-j}, r)$ is the slope in the $j$th direction. To further simplify the interpretation, the expectation under $\pi$ in Eq. (A.3) need only be taken with respect to the categorical variables that are interacted with $x_j$ (e.g., race). By comparison, the RGE parametrization yields $\alpha_j = \mu'_{x_j}(r_1 = 1, \ldots, r_L = 1)$, which is the group-specific slope for $x_j$ with each group set to its reference category (e.g., NH White, Male, etc.). Clearly, this representation compounds inequity across each categorical variable and fails to deliver a global interpretation of the $x_j$-effect.

Interpretation of group-specific slopes and the parameters $\beta^*_{\ell,r_\ell,j}$ proceeds by considering partial expectations under $\pi_{-\ell}$, which is analogous to the joint distribution $\pi$ but omits the $\ell$th categorical variable. Here, as with Eq. (A.3), this expectation need only consider the categorical variables that are interacted with $x_j$; if the $\ell$th categorical variable is the only interaction term, then no expectation is needed at all. Then the $x_j$-effect when the $\ell$th categorical variable has level $r_\ell$, averaged over the remaining categorical variables, is

$$E_{\pi_{-\ell}}\{\mu'_{x_j}(R_{-\ell}, r_\ell)\} = \alpha_j + \beta^*_{\ell,r_\ell,j}. \tag{A.4}$$

The interpretation is simpler than the notation: Eq. (A.4) directly extends the usual notion of race-specific slopes to average over any other categorical variables that modify $x_j$.
Estimation invariance of ABCs: proofs and additional results. Estimation with ABCs delivers multiple invariance properties. First, we supplement the results from the main paper with a result for ordinary least squares (OLS) estimation of the intercept.

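Proof (Theorem S1). Under OLS, $\bar y$ equals the sample mean of the fitted values $\{\hat y_i\}_{i=1}^n$; this is true for ABCs, RGE, STZ, etc. Then we simplify:

$$\bar y = n^{-1}\sum_{i=1}^n \hat y_i = \hat\alpha_0 + \bar x^\top \hat\alpha + \sum_{\ell=1}^L \sum_{r_\ell} \hat\pi_{\ell,r_\ell}\hat\beta_{\ell,r_\ell} = \hat\alpha_0,$$

since the continuous covariates are centered and each set of categorical variable coefficients satisfies ABCs.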
Under ABCs, the estimated intercept is invariant across all main-only models, regardless of the true association between Y and the covariates.This result is unique to ABCs and provides motivation to use ABCs even without a continuous-categorical interaction, such as for traditional ANOVA models.A similar result was noted by (51) in a simplified setting.
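A small numerical check can make this intercept result concrete. The sketch below fits a main-only model with one centered continuous covariate and one categorical covariate under ABCs (using sample proportions) and compares the fitted intercept to the sample mean of the outcome. The simulated data, the variable names, and the null-space route to the constrained fit are illustrative assumptions rather than the paper's code.

```python
import numpy as np

rng = np.random.default_rng(1)
n, K = 300, 3
r = rng.integers(0, K, size=n)                  # group labels 0..K-1
x = rng.standard_normal(n) + r                  # covariate correlated with group
x = x - x.mean()                                # center the continuous covariate
y = 2.0 + 0.5 * x + np.array([0.3, -0.2, 1.0])[r] + rng.standard_normal(n)

Z = np.eye(K)[r]                                # n x K dummy matrix
pi_hat = Z.mean(axis=0)                         # sample proportions
X = np.column_stack([np.ones(n), x, Z])         # overparametrized design
C = np.concatenate([[0.0, 0.0], pi_hat])[None, :]   # ABC: sum_r pi_r * beta_r = 0

Q, _ = np.linalg.qr(C.T, mode="complete")       # C^T = QR
Q_free = Q[:, C.shape[0]:]                      # basis of {theta : C theta = 0}
zeta, *_ = np.linalg.lstsq(X @ Q_free, y, rcond=None)
theta = Q_free @ zeta                           # (alpha0, alpha1, beta_1..beta_K)

print(theta[0], y.mean())                       # the two values should agree
```

The agreement holds here even though the covariate is correlated with group membership, which is the sense in which the result does not depend on the true associations.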
Next, we proceed to prove Theorems 1 and 2. To do so, we leverage a classical property of OLS (35,52).
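Theorem S2 (Frisch-Waugh-Lovell Theorem). For a partition of the $n \times p$ covariate matrix $X = (X_0 : X_1)$ into $p_0$ and $p_1$ columns, the partition of the ordinary least squares estimator $\hat\beta = (\hat\beta_0^\top, \hat\beta_1^\top)^\top$ satisfies $\hat\beta_0 = \{X_0^\top (I_n - H_1) X_0\}^{-1} X_0^\top (I_n - H_1) y$, where $H_1 = X_1 (X_1^\top X_1)^{-1} X_1^\top$ is the corresponding hat matrix for $X_1$, and $y = (y_1, \ldots, y_n)^\top$ is the vector of outcomes.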
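As a quick sanity check of the Frisch-Waugh-Lovell identity, the following sketch compares the first block of a full OLS fit with the estimate obtained after partialling out the second block. The random data and variable names are hypothetical and serve only to illustrate the identity.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p0, p1 = 200, 2, 3
X0 = rng.standard_normal((n, p0))
X1 = np.column_stack([np.ones(n), rng.standard_normal((n, p1 - 1))])
y = rng.standard_normal(n)

# Full OLS fit on (X0 : X1); the first p0 entries are beta_0-hat.
beta_full, *_ = np.linalg.lstsq(np.column_stack([X0, X1]), y, rcond=None)

# FWL: project X1 out of X0, then regress y on what remains.
H1 = X1 @ np.linalg.solve(X1.T @ X1, X1.T)      # hat matrix for X1
M1 = np.eye(n) - H1
beta0_fwl = np.linalg.solve(X0.T @ M1 @ X0, X0.T @ M1 @ y)

print(np.allclose(beta_full[:p0], beta0_fwl))
```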
Although Theorems 1 and 2 are stated in terms of approximate equivalence under approximate equal-variance conditions, here we prove exact equivalence under exact equal-variance conditions.

Proof (Theorem 1). Let $y = (y_1, \ldots, y_n)^\top$, $x = (x_1, \ldots, x_n)^\top$, and $Z$ be the matrix of categorical dummy variables with entries $[Z]_{ir} = 1$ if $r_i = r$ and zero otherwise. Using Theorem S2, the estimated coefficients under the main-only model satisfy $\hat\alpha_1^M = (x^\top \hat e_{x \sim r})^{-1} \hat e_{x \sim r}^\top y$, where $\hat e_{x \sim r}$ is the vector of residuals from regressing the continuous variable $\{x_i\}_{i=1}^n$ on the categorical variable $\{r_i\}_{i=1}^n$ (i.e., $Z$). Similarly, the estimated coefficients under the race-modified model satisfy $\hat\alpha_1 = (x^\top \hat e_{x \sim r + Z_{XQ}})^{-1} \hat e_{x \sim r + Z_{XQ}}^\top y$, where $\hat e_{x \sim r + Z_{XQ}}$ are the residuals from regressing the continuous variable $\{x_i\}_{i=1}^n$ on the categorical variable $\{r_i\}_{i=1}^n$ (i.e., $Z$) and the reparametrized interaction term that enforces ABCs, $Z_{XQ} = Z_X Q_{-(1:m)}$, where $Z_X = D_X Z$ and $D_X = \mathrm{diag}(x)$. Thus, it suffices to show that $\hat e_{x \sim r} = \hat e_{x \sim r + Z_{XQ}}$, which occurs when the additional (interaction) coefficients from the latter model, say $\hat b_{Z_{XQ}}$ (corresponding to $Z_{XQ}$), are identically zero. Again using Theorem S2, these estimated coefficients are $\hat b_{Z_{XQ}} = \{Z_{XQ}^\top (I_n - H_Z) Z_{XQ}\}^{-1} Z_{XQ}^\top (I_n - H_Z) x$, where $H_Z$ is the hat matrix for $Z$, and $x^\top (I_n - H_Z) Z_X = n \hat\sigma^2_{x[1]} \hat\pi^\top$ under the assumption that $\hat\sigma^2_{x[r]} = \hat\sigma^2_{x[1]}$ is common for all $r$. Finally, the definition of $Q_{-(1:m)}$ via ABCs implies that $\hat\pi^\top Q_{-(1:m)} = 0$, which proves the result.
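The exact invariance in Theorem 1 can be checked numerically. The sketch below rescales a covariate so that its within-group sample variance is identical across groups, then compares the estimated main effect from the main-only model with the estimates from the race-modified model under ABCs and under RGE. The helper `abc_ols`, the group sizes, and the race-specific slopes are hypothetical, and the constrained fit uses a generic null-space reparametrization.

```python
import numpy as np

def abc_ols(X, C, y):
    """Least squares for theta subject to C @ theta = 0, via C^T = QR."""
    Q, _ = np.linalg.qr(C.T, mode="complete")
    Q_free = Q[:, C.shape[0]:]                   # basis of the constraint null space
    zeta, *_ = np.linalg.lstsq(X @ Q_free, y, rcond=None)
    return Q_free @ zeta

rng = np.random.default_rng(3)
K, sizes = 3, np.array([150, 90, 60])
r = np.repeat(np.arange(K), sizes)
n = r.size
x = rng.standard_normal(n)
for g in range(K):                               # force equal within-group variance
    xg = x[r == g]
    x[r == g] = (xg - xg.mean()) / xg.std() + g
slopes = np.array([0.5, -0.5, -1.0])             # HYPOTHETICAL race-specific slopes
y = 1.0 + slopes[r] * x + rng.standard_normal(n)

Z = np.eye(K)[r]
pi_hat = sizes / n
zeros = np.zeros(K)

# Main-only model: y ~ 1 + x + race
X_main = np.column_stack([np.ones(n), x, Z])
C_main = np.concatenate([[0, 0], pi_hat])[None, :]
alpha1_main = abc_ols(X_main, C_main, y)[1]

# Race-modified model: y ~ 1 + x + race + x:race, under ABCs
X_mod = np.column_stack([np.ones(n), x, Z, x[:, None] * Z])
C_abc = np.vstack([np.concatenate([[0, 0], pi_hat, zeros]),
                   np.concatenate([[0, 0], zeros, pi_hat])])
alpha1_abc = abc_ols(X_mod, C_abc, y)[1]

# Same model under RGE: beta_1 = 0 and beta*_1 = 0 (group 0 as reference)
e1 = np.eye(K)[0]
C_rge = np.vstack([np.concatenate([[0, 0], e1, zeros]),
                   np.concatenate([[0, 0], zeros, e1])])
alpha1_rge = abc_ols(X_mod, C_rge, y)[1]

print(alpha1_main, alpha1_abc, alpha1_rge)
```

In this setup the first two printed values should agree to numerical precision, while the RGE value tracks the slope of the reference group only.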
Proof (Theorem 2). Let $X$ denote the $n \times p$ matrix of continuous covariates, $Z$ the matrix of categorical dummy variables with entries $[Z]_{ir} = 1$ if $r_i = r$ and zero otherwise, and … and $D_{X_j} = \mathrm{diag}(x_j)$. By Theorem S2, it suffices to show that $E_M = E$, where $E_M = (I_n - H_Z)X$ are the residuals from regressing each column of $X$ on $Z$ and $E$ are similarly the residuals from regressing each column of $X$ on $Z$ and $Z_{XQ}$. Thus, it is sufficient to show that the coefficients associated with $Z_{XQ}$ in the latter regression are identically zero. Again using Theorem S2, we see that this occurs whenever $X^\top (I_n - H_Z) Z_{XQ} = 0$, where …, each of which must equal the zero vector with dimension equal to the number of categories. Noting that $x_h^\top D_{x_j} Z = (\ldots, s_r(j, h), \ldots)$ with $s_r(j, h) = \sum_{r_i = r} x_{ij} x_{ih}$ and $x_h^\top H_Z D_{x_j} Z = (\ldots, n_r \bar x_r(j) \bar x_r(h), \ldots)$ with $\bar x_r(j) = n_r^{-1} \sum_{r_i = r} x_{ij}$, we apply the same arguments as in Theorem 1.

where $P(\theta)$ is a complexity penalty on the regression coefficients and $\lambda \ge 0$ controls the tradeoff between goodness-of-fit and complexity (measured via $P$). Here, we consider complexity penalties of the form

$$P(\theta) = \sum_j \omega_j |\theta_j|^\gamma,$$

where $\omega_j > 0$ are known weights and typically $\gamma = 1$ for (adaptive) lasso regression or $\gamma = 2$ for (adaptive) ridge regression.
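Adopting the same reparametrization strategy as in the unpenalized case, let $C^\top = QR$ be the QR-decomposition of the transposed constraint matrix with columnwise partitioning of the orthogonal matrix $Q = (Q_{1:m} : Q_{-(1:m)})$. Substituting $\theta = Q_{-(1:m)}\zeta$ enforces $C\theta = 0$ and leaves the problem of minimizing $\|y - Z\zeta\|^2 + \lambda P(Q_{-(1:m)}\zeta)$ over $\zeta$ with $Z = XQ_{-(1:m)}$, which requires the solution to an unconstrained penalized least squares problem. For instance, under ridge regression ($\gamma = 2$), the solution is

$$\hat\zeta = (Z^\top Z + \lambda Q_{-(1:m)}^\top \Omega Q_{-(1:m)})^{-1} Z^\top y, \qquad \hat\theta = Q_{-(1:m)}\hat\zeta,$$

with $\Omega = \mathrm{diag}(\omega_j)$. The lasso version ($\gamma = 1$) may be solved efficiently using the genlasso package in R.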
For practical use, we set $\omega_j$ to be the sample standard deviation of the $j$th column of $X$ (with $\omega_1 = 1$ for the intercept). This strategy applies a standardized penalty to each covariate, which is especially important for ABCs. In particular, the magnitudes of the race-specific coefficients vary according to the abundance of race group $r$: by design, small values of $\pi_r$ inflate the corresponding coefficients $\beta_r$ and $\beta^*_{r,j}$. The standardized penalty adjusts for this effect to avoid overpenalization of race-specific coefficients for groups with low abundance.
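The reparametrized ridge fit can be sketched directly from this description. The function `abc_ridge` below is a hypothetical helper, not the authors' implementation: it assumes the objective $\|y - X\theta\|^2 + \lambda \sum_j \omega_j \theta_j^2$ subject to $C\theta = 0$, and the simulated data, group proportions, and value of $\lambda$ are illustrative only. The weights follow the rule in the text (column standard deviations, with weight one for the intercept).

```python
import numpy as np

def abc_ridge(X, C, y, lam, omega):
    """Minimize ||y - X theta||^2 + lam * sum_j omega_j theta_j^2 s.t. C theta = 0."""
    m = C.shape[0]
    Q, _ = np.linalg.qr(C.T, mode="complete")    # C^T = QR with Q of size d x d
    Q_free = Q[:, m:]                            # Q_{-(1:m)}: null-space basis of C
    Z = X @ Q_free                               # adjusted covariate matrix
    penalty = Q_free.T @ (omega[:, None] * Q_free)
    zeta = np.linalg.solve(Z.T @ Z + lam * penalty, Z.T @ y)
    return Q_free @ zeta                         # map back to theta

rng = np.random.default_rng(4)
n, K = 400, 3
r = rng.choice(K, size=n, p=[0.6, 0.3, 0.1])     # HYPOTHETICAL group proportions
x = rng.standard_normal(n)
Zr = np.eye(K)[r]
pi_hat = Zr.mean(axis=0)
X = np.column_stack([np.ones(n), x, Zr, x[:, None] * Zr])
y = 1.0 - 0.3 * x + rng.standard_normal(n)

zeros = np.zeros(K)
C = np.vstack([np.concatenate([[0, 0], pi_hat, zeros]),
               np.concatenate([[0, 0], zeros, pi_hat])])
omega = X.std(axis=0)                            # column standard deviations
omega[0] = 1.0                                   # weight for the intercept

theta_ols = abc_ridge(X, C, y, lam=0.0, omega=omega)
theta_ridge = abc_ridge(X, C, y, lam=50.0, omega=omega)
print(C @ theta_ridge)                           # both constraints hold (near zero)
```

Working in the null-space coordinates means the constraint never has to be enforced separately, so any unconstrained ridge solver applies once the penalty is mapped through the same basis.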
Inference with ABCs. Under minimal assumptions, the (unconstrained) OLS estimator satisfies

$$\sqrt{n}(\hat\zeta - \zeta) \to_d N\{0, I(\zeta)^{-1}\},$$

and thus the ABC OLS estimator satisfies

$$\sqrt{n}(\hat\theta - \theta) \to_d N\{0, Q_{-(1:m)} I(\zeta)^{-1} Q_{-(1:m)}^\top\},$$

where $I$ is the Fisher information and $\zeta, \theta$ are the true parameter values. When the regression model is paired with independent and identically distributed Gaussian errors $\epsilon_i := y_i - \mu(x_i, r_i)$ with variance $\sigma^2$, the unconstrained OLS estimator satisfies $\hat\zeta \sim N\{\zeta, \sigma^2 (Z^\top Z)^{-1}\}$ and thus $\hat\theta \sim N\{\theta, \sigma^2 Q_{-(1:m)} (Z^\top Z)^{-1} Q_{-(1:m)}^\top\}$.

The invariance result for estimators with and without race-modifiers (Theorem 2) requires $\hat\sigma_{x[\mathrm{NHW}]}(j) = \hat\sigma_{x[\mathrm{NHB}]}(j) = \hat\sigma_{x[\mathrm{Hisp}]}(j)$ for each covariate $j$ (and similarly for the cross-covariances). Although this condition is clearly violated, the estimates and SEs demonstrate near-perfect invariance (Figure 1), which suggests strong empirical robustness for the desirable invariance property of ABCs.
$R_{1:m,1:m} : 0)$, since $C^\top$ has rank $m$. It is straightforward to verify that $\theta = Q_{-(1:m)}\zeta$ satisfies $C\theta = 0$ for any $\zeta$. Then, using the adjusted covariate matrix $Z = XQ_{-(1:m)}$ with $X = (x_1, \ldots, x_n)^\top$, the solution to Eq. (

Standardized Testing Data contains 4th end-of-grade standardized reading scores, economic disadvantage status (determined by participation in the National Lunch Program), and residential address at time-of-test. The reading scores, standardized by the year of test (2010, 2011, or 2012), serve as the outcome variable $Y$. The residential information is used to estimate the average PM2.5

Fig. 3. Evaluating point estimates for the regression coefficients for n = 250 (left) and n = 10,000 (right) across 500 simulated datasets;

improved this work. Research was sponsored by the National Institute of Environmental Health Sciences (R01ES028819) and the National Science Foundation (SES-2214726). The content is solely the responsibility of the author(s) and does not necessarily represent the official views of the NIH or the U.S. government. The findings and conclusions in this publication are those of the author(s) and do not necessarily represent the views of the North Carolina Department of Health and Human Services, Division of Public Health.

Table 1. Linear regression output under default reference group encoding (RGE; left) and abundance-based constraints (ABCs; right): race-modified effects of racial isolation (RI) on 4th end-of-grade reading scores for students in North Carolina (y ∼ 1 + RI + race + RI:race).
The default regression output (RGE, left) induces presentation bias: the estimated Intercept and RI effect (red) refer to the NH White group, which obfuscates the highly significant and detrimental effects of RI on reading scores. For reference, the main-only model y ∼ 1 + RI + race estimates a significantly negative global RI effect (Estimate = −0.042, SE = 0.007, p < 0.001; SI Table A1). The regression output under ABCs (right) eliminates presentation bias, confirms the estimated global RI effect from the main-only model (blue), and clearly highlights the critical result that the adverse RI effect is more than doubled for NH Black students ($\hat\mu'_{RI}(\mathrm{NHB}) = \hat\alpha_{RI} + \hat\beta^*_{RI:NHB} = -0.032 + (-0.038) = -0.070$).

Table 2. Linear regression output (under ABCs) for the race-modified effects of environmental, social, and other factors on 4th end-of-grade reading scores for students in North Carolina.
Data restricted to individuals with 37-42 weeks gestation, mother's age 15-44 years old at birth, BLL ≤ 80 µg/dL (and capped at 10 µg/dL), birth order ≤ 4, no current limited English proficiency, and residence in NC at the time of birth and time of 4th end-of-grade test. "Economically disadvantaged" is determined by participation in the National Lunch Program.

by race. Specifically, we compare the model output for regressions with race-modifiers (i.e., all variables in Table 2) and without race-modifiers (i.e., variables only in the left column of Table 2), including both ABC and RGE output for the race-modified models. Remarkably, the estimates and uncertainty quantification for the simpler, main-only model are nearly indistinguishable from those for the expanded, race-modified model under ABCs. Effectively, ABCs allow estimation and inference for numerous race-specific effects (Table 2, right column) "for free": the inferential summaries for the main effects are unchanged by the expansion of the model to include race-modifiers. This result empirically confirms Theorem 2, despite moderate violations of the equal-variance condition (SI

Finally, we assess penalized estimation and variable selection with ABCs using lasso regression, including all the covariates from Table 2. We report estimates across tuning parameter values $\lambda$ for the model coefficients $\hat\alpha_j$, $\{\hat\beta^*_{r,j}\}_r$ and the race-specific slopes $\{\hat\mu'_{x_j}(r) = \hat\alpha_j + \hat\beta^*_{r,j}\}_r$; $\lambda \to 0$ yields OLS estimates, while $\lambda \to \infty$ yields sparse estimates. Since the penalized estimates depend critically on

NC Education Data: estimated main effects and 95% confidence intervals
Fig. 1. Estimates and 95% confidence intervals for the main effects in the multivariable regression without race-modifiers (black) and the multivariable regression with race-modifiers under ABCs (blue) and RGE (red). Results are presented for blood lead level (BLL), PM2.5 exposure (PM2.5), racial isolation (RI), mother's age (mAge), and birthweight percentile for gestational age (BWTpct), each of which is interacted with race in the expanded model (blue, red); additional covariates include sex, mother's education level, mother's marital status, mother's smoking status, and economically disadvantaged (Table

contains the sample means of $\{x_i\}_{i=1}^n$ by each group $r$, and therefore $x^\top H_Z Z_X = x^\top Z\,\mathrm{diag}(\{\bar x_r\}_r) = (\ldots, n_r \bar x_r^2, \ldots)$ with $n_r = n\pi_r$. Combining these results, we have $x^\top$

Table A.1. Linear regression output for racial isolation (RI) and mother's race on 4th end-of-grade reading scores for students in North Carolina (y ∼ 1 + RI + race).
In contrast with Table 1, this model uses only main effects (RI and mother's race) without any race-modifiers, thus providing a global estimate of the RI effect. ABCs are used for mother's race; however, the RI estimates, SEs, and p-values are the same under RGE. The RI output (blue) is nearly identical to the race-modified model output under ABCs (Table 1, right), which confirms the estimation invariance of ABCs (Theorem 1). By comparison, the race-modified model output under RGE (Table 1, left) deviates substantially, and instead refers to the RI effect only for NH White students.

Table A.3. Characteristics of the North Carolina data (n = 27,638). Sample proportions by group for each categorical variable. These sample proportions are used for ABCs with each categorical variable. "Economically disadvantaged" is determined by participation in the National Lunch Program.