Running head: THE CONFIDENCE INTERVAL THAT WASN’T

The Confidence Interval that Wasn’t: Bootstrapped “Confidence Intervals” in $\ell_1$-Regularized Partial Correlation Networks

Donald R. Williams
University of California, Davis

Author Note
DRW was supported by a National Science Foundation Graduate Research Fellowship under Grant No. 1650042. The R code to reproduce this work is provided online.

Abstract
I shed much-needed light upon the default measure of parameter uncertainty in network psychometrics; that is, “confidence intervals” (CI) computed from bootstrapping $\ell_1$-regularized partial correlations. Due to the nature of the $\ell_1$-penalty, however, bootstrapping does not provide an accurate sampling distribution. Although this has long been known in the statistical literature, I set out to determine whether the intervals could at least be considered approximate. In multiple regression, I first describe the fundamental tension between model selection and estimation consistency inherent to the $\ell_1$-penalty: in the pursuit of sparsity, the sampling distribution of the non-zero coefficients is necessarily compromised, which translates into coverage far below nominal levels. With the foundation laid, I proceed to investigate coverage for non-zero relations in partial correlation networks. At best, average coverage was around 0.65 for 90% CIs. With increasing sample sizes, average coverage decreased to 0.30, perhaps approaching 0 if larger sample sizes were explored. Further, coverage was heavily influenced by the mere position of an edge in the network, ranging from essentially 0 to 0.90, with an average of around 0.50. Meanwhile, for the same simulation conditions, simply bootstrapping the sample covariance matrix provided coverage at the nominal level.
In light of the results, I then demonstrate how to judiciously use the bootstrap in both regularized and non-regularized networks: the former can provide a useful summary of data-mining, whereas the latter allows for making inference on network parameters. To ensure network researchers have the option of computing valid CIs, I implemented a non-regularized bootstrap for various types of partial correlations in the R package GGMnonreg.

Keywords: partial correlation network, confidence intervals, frequentist inference, bootstrap, $\ell_1$-regularization

...there is a substantial price to be paid for sparsity...
Leeb and Pötscher (p. 203, 2008)

In the social-behavioral sciences, network theory has emerged as an increasingly popular framework for understanding psychological constructs (Borsboom, 2017; Jones, Heeren, & McNally, 2017). The underlying rationale is that a group of observed variables, say, self-reported symptoms, are a dynamic system whose elements mutually influence and interact with one another (Borsboom & Cramer, 2013). The observed variables are “nodes” and the featured connections between nodes are “edges.” This work focuses on partial correlation networks, wherein the edges represent conditionally dependent nodes, that is, pairwise relations that have controlled for the other nodes in the network (Epskamp, Waldorp, Mottus, & Borsboom, 2018). This powerful approach has resulted in an explosion of research; for example, network analysis has been used to shed new light upon a variety of constructs including personality (Costantini et al., 2015), narcissism (Di Pierro, Costantini, Benzi, Madeddu, & Preti, 2019), and hypersexuality (Werner, Štulhofer, Waldorp, & Jurin, 2018).
Recently, the foundation of network psychometrics was improved when the default methodology was revisited (Williams & Rast, 2019; Williams, Rhemtulla, Wysocki, & Rast, 2019). In the network literature, $\ell_1$-regularization (a.k.a. “least absolute shrinkage and selection operator” or “lasso”) emerged as the default approach for detecting conditionally dependent relations. One motivation for adopting lasso was the thought that it reduces spurious relations. It was recently demonstrated, however, to have an inflated false positive rate that depends on many factors, including the sample size, edge size, sparsity, and the number of nodes (see Figure 6 in Williams et al., 2019). Further motivation was the thought that $\ell_1$-regularization is needed to mitigate overfitting. This was shown to be overstated in Williams and Rodriguez (2020). In both cases, non-regularized methods were more than adequate for the goals of reducing false positives and quelling concerns of overfitting.

In this work, I seek to further improve network analysis by shedding much-needed light upon the default measure of parameter uncertainty, that is, “confidence intervals” (CI) that are computed from bootstrapping $\ell_1$-regularized partial correlations (Epskamp, Borsboom, & Fried, 2018). However, due to the nature of the $\ell_1$-penalty, bootstrapping does not provide an accurate sampling distribution. This is summarized in section 3.1, “Why standard bootstrapping and subsampling do not work,” of Bühlmann, Kalisch, and Meier (2014):

The (limiting) distribution of such a sparse estimator is non-Gaussian with point mass at zero, and this is the reason why standard bootstrap or subsampling techniques do not provide valid confidence regions or p-values (pp. 7-8).

This particular issue is most pressing for true zero and small relations (see section 3 in Knight & Fu, 2000).
For the former, especially as the sample size increases, the bootstrap distribution will converge to a “spike” at zero, resulting in the CI covering zero too often. Indeed, there is proof that the standard errors (and thus intervals) are inconsistent for null associations (theorem 6.1 on p. 397, Kyung, Gill, Ghosh, & Casella, 2010). On the other hand, when gradually moving away from zero, the distribution changes form. It is now comprised of a “mixture of a singular normal distribution and of an absolutely continuous part” (p. 375, Kyung et al., 2010). In both cases, the distributions are far from normal, which presents challenges for obtaining an accurate sampling distribution. This is illustrated in Figure 1 (panel A).

Pragmatically, it may be tempting to think covering zero too often is not problematic, given this translates into fewer type I errors. However, the issue surrounding null associations hints at a deeper problem with the estimator. A limitation of lasso is that the penalty increases linearly with the size of the relation (p. 523, Fan, Feng, & Wu, 2009), a peculiarity that does not diminish with more data. Accordingly, “it produces substantial biases in the estimates for large regression coefficients” (p. 18, Goeman, Meijer, & Chaturvedi, 2018). Hence, even for those edges that are clearly non-zero, the bootstrapping strategy may produce a severely compromised sampling distribution, thereby calling into question its usefulness for assessing parameter uncertainty.

Defining Confidence Intervals

At this point, it is important to consider the definition of a CI (a.k.a. uncertainty interval, Gelman & Greenland, 2019). The basic idea is to construct an interval for a parameter of interest, including a lower and upper limit, such that, on average, it will cover the true value $100(1 - \alpha)\%$ of the time (Neyman, 1937).
Importantly, this is inherently a frequentist concept that refers to hypothetical replications (or future random samples) from the assumed population model. Notice that this definition does not privilege a particular value; rather, using the CI for significance testing merely amounts to inspecting whether zero is covered. By definition, however, all values within the interval are not rejected at the chosen $\alpha$ level (see p. 7 in Kruschke & Liddell, 2015). Hence, when computing a CI with a particular procedure (including the estimation method), the implicit claim by the researcher is that “frequency of correct results will tend to $\alpha$” (p. 349, Neyman, 1937). Furthermore, in models with many effects, it is possible to infer the proportion of relations that will be covered. A network with 20 nodes has 190 partial correlations ($20 \cdot 19 / 2$). With 90% CIs, the expectation is that 171 ($190 \cdot 0.90$) will cover the true value. Note again this is a long run average, but it indicates nonetheless that most intervals should contain the true value for a given sample, tending to $100(1 - \alpha)\%$ of the relations.

Why it Matters. As an illustrative example, Figure 1 (panel B) includes 95% CIs for three partial correlations. The non-regularized CI for the relation between nodes A and B excluded zero (95% CI = [0.22, 0.38]), which is therefore “statistically significant” (zero was not covered). Further, values less than 0.22 and greater than 0.38 can also be rejected (they were not covered). Herein lies an issue with the $\ell_1$ interval. Notice that it is almost completely in the rejection region of the valid CI. This means that the vast majority of values contained in the $\ell_1$-based interval should be rejected. A point of emphasis is that lasso provides an estimate of the population value, yet almost the entire sampling distribution could be ruled out by a valid measure of uncertainty.
Indeed, as noted in Waldorp, Marsman, and Maris (2019), “Once the parameters are obtained it turns out that inference on network parameters is in general difficult with $\ell_1$-regularization” (p. 53).

Misconceptions About Confidence Intervals

From surveying the network literature, various misconceptions have emerged in an attempt to interpret the “CIs” computed from bootstrapping $\ell_1$-regularized estimates. In my view, these are a by-product of the $\ell_1$-penalty wreaking havoc on the sampling distribution (e.g., Figure 1). In network psychometrics, researchers are advised against using regularized “CIs” for significance testing. The rationale is that $\ell_1$-penalized estimates are biased towards zero, and thus an edge may differ from zero, even when zero is included in the interval. Although this statement could be correct, it is important to note that the interpretation of a valid CI is not a function of which value the researcher is interested in rejecting. To make sense of this, consider inspecting the CI to determine whether, say, 0.1 is covered, which is a significance test for a non-nil null hypothesis. This again relates to coverage, in that significance testing with a CI is merely inspecting whether a value of interest is covered, with no special consideration given to zero. If moving the goalpost compromises the CI, this hints at an underlying issue with the employed estimator and alternatives should be explored.

Further, there seems to be some confusion surrounding both bootstrapping and frequentist inference more generally. In Fried et al. (2019), it was stated that “these [regularized] sampling distributions are not CIs centered on the true (unbiased) parameter value” (in the supplementary material). In this context, bias is also a frequentist concept that is defined on average.
Accordingly, for any given sample, the bootstrap sampling distribution (and the corresponding CI) of the sample estimate will not be centered on the true (and unknown) value. I refer to an excellent introduction to bootstrapping:

Each bootstrap distribution is centred around the sample estimate, not the population value...Moreover, bootstrap CIs, like any other CIs, vary across experiments. Therefore, if we perform a single experiment, the CI we obtain does or does not contain [cover] the population value we’re trying to estimate (p. 12, Rousselet, Pernet, & Wilcox, 2019).

This applies to both regularized and non-regularized estimators: regardless of which is used, or whether they are centered at the true value for a given sample, CIs are expected (within reason) to cover the true value $100(1 - \alpha)\%$ of the time. The definition does not change when using lasso.

Revisiting the Regularization Literature

How could it be that network psychometrics routinely employs a measure of uncertainty that leaves something to be desired? In my view, this is partially due to somewhat conflicting information in the statistical literature. For example, Hastie, Tibshirani, and Wainwright (2015), a definitive source for regularization, used the bootstrap in the section titled “Statistical Inference.” Yet, when the bootstrap was employed, a CI was never computed and the full range of estimates was visualized in a box plot (Figure 6.4 therein). [1] Further, the bootstrap was also suggested in Tibshirani (1996, p. 272) and Tibshirani (2011, p. 281). Perhaps, while strictly invalid, the bootstrap strategy can provide an approximate CI. This possibility is investigated with simulation.

[1] Table 2.2 includes bootstrap standard errors.

There are few examples that use the bootstrap to compute CIs. The results are not very promising, in that “for nonzero true parameter values, the coverage might [emphasis added] be very poor” (p. 541, Dezeure, Bühlmann, Meier, & Meinshausen, 2015). In Van De Geer, Bühlmann, Ritov, and Dezeure (2014), the de-sparsified lasso was compared to the residual bootstrap of Chatterjee and Lahiri (2011). For the latter, coverage of non-zero regression coefficients was often far below nominal levels (see the Tables on pp. 22-33). These approaches are specifically looking at high-dimensional data (e.g., $p > n$), where the maximum likelihood estimate does not exist and therefore regularization is necessary. In psychology, however, the more typical network includes around 20 variables and hundreds of observations (see Table 2 in Wysocki & Rhemtulla, 2019). In these situations (low-dimensional data), CIs are easily computed with non-regularized estimation (Drton & Perlman, 2004; Williams & Rast, 2019; Williams et al., 2019). This was noted in Javanmard and Montanari (2014):

In classical [low-dimensional] statistics, generic and well accepted procedures are available for characterizing the uncertainty associated to a certain parameter estimate in terms of confidence intervals...(p. 2870).

Overview

In what follows, I delve into computing “CIs” via bootstrapping $\ell_1$-regularized partial correlations, with the intent of fully understanding their coverage properties. To my knowledge, no such work has been done in the psychological literature. I begin with multiple regression and progress to partial correlation networks. These sections include focused numerical experiments, each of which is informed by the statistical literature. The goal is to determine whether bootstrapping regularized partial correlations is salvageable: given their ubiquity in network analysis, it would be ideal if they were not too far off the mark. By way of example, the next section provides recommendations for using the bootstrap in regularized and non-regularized networks.
The Gaussian Graphical Model

For multivariate normal data, the Gaussian graphical model (GGM) captures conditional relationships that are typically visualized to infer the underlying dependence structure (i.e., the partial correlation “network”; Højsgaard, Edwards, & Lauritzen, 2012; Lauritzen, 1996). There is an undirected graph that is denoted $G = (V, E)$, which includes a vertex set $V = \{1, \ldots, p\}$ and an edge set $E \subset V \times V$. The former refers to “nodes” and the set represents, say, items in a questionnaire, whereas the latter set contains the estimated network structure. Let $y = (y_1, \ldots, y_p)^\top$ be a random vector indexed by the graph’s vertices that is assumed to follow a multivariate normal distribution, $y \sim \mathcal{N}_p(0, \Sigma)$, where $\Sigma$ is a $p \times p$ positive definite covariance matrix. I use $Y$ to denote the $n \times p$ data matrix, where each row corresponds to the observations from a given individual. Further, without loss of information, the data are considered centered with mean vector 0. The undirected graph is obtained by determining which off-diagonal elements of the precision matrix, $\Theta = \Sigma^{-1}$, are non-zero. That is, $(i, j) \in E$ when nodes $i$ and $j$ are determined to be conditionally dependent, and the corresponding element is set to zero otherwise. Note that the edges (or “connections”) in a GGM are partial correlations $\rho_{ij \cdot z}$ that are computed directly from $\Theta$ with

$$\rho_{ij \cdot z} = \frac{-\theta_{ij}}{\sqrt{\theta_{ii}\theta_{jj}}} \qquad (1)$$

Hence, estimating partial correlation networks can be accomplished by testing whether each relation in Equation (1) is “significantly” different from zero. This is described in Drton and Perlman (2004) and Williams and Rast (2019), both of which relied on an analytic solution, whereas a more general alternative is to use the non-parametric bootstrap (Williams et al., 2019).

A Brief Note on Generality

In this work, I assume that the data are continuous and normally distributed, that is, multivariate Gaussian.
Accordingly, I rely heavily upon the Pearson partial correlation coefficient to keep the exposition manageable. This does not limit the generality of this work, in that all ideas can seamlessly be applied to polychoric (Pearson, 1900), Spearman’s rank (Kim, 2015), the so-called Gaussian rank estimator (i.e., based on Van Der Waerden scores, see references in Boudt, Cornelissen, Croux, & Boudt, 2012), and Kendall’s tau based partial correlations (Johnson, 1979), each of which is commonly used in the Gaussian graphical modeling literature (Hoff, 2007; Liu, Han, Yuan, Lafferty, & Wasserman, 2012; Mohammadi & Wit, 2015). This far-reaching applicability is due to requiring only an estimate of the covariance matrix when bootstrapping the partial correlations.

Multiple Regression

I begin studying coverage in multiple regression. The relatively simple case of regression can provide a foundation to begin understanding why $\ell_1$-regularization (lasso) is problematic, a motivating example of sorts. This is further justified by the direct correspondence between the elements of $\Theta$ and multiple regression (Kwan, 2014; Stephens, 1998). Suppose that the $i$th column $Y_i$ is predicted by the remaining $(p - 1)$ nodes $Y_{-i}$. For nodes $i$ and $j$, the resulting coefficients and error variances are defined as

$$\beta_{ij} = \frac{-\theta_{ij}}{\theta_{ii}} \quad \text{and} \quad \sigma^2_i = \frac{1}{\theta_{ii}}, \qquad (2)$$

where $i$ and $j$ denote the corresponding row and column of $\Theta$, $\beta_{ij}$ is the regression weight for the $j$th node ($i \neq j$), and $\sigma^2_i$ is the residual variance. This allows for recovering all elements of $\Theta$ (and $\Sigma$) with $p$ multiple regression models. In relation to Equation (1), the regression coefficients also have a direct mapping to the partial correlation, that is

$$\beta_{ij} = \rho_{ij \cdot z} \sqrt{\theta_{jj} / \theta_{ii}} \qquad (3)$$

This relationship is often utilized in the GGM literature.
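The mappings in Equations (1) to (3), and the non-regularized bootstrap that needs only a covariance estimate, can be sketched numerically. This is a minimal illustration in Python with NumPy rather than the R code that accompanies this work; the 3-node precision matrix, the function names, and the sample size are hypothetical choices made only for the example.

```python
import numpy as np

def partial_correlations(cov):
    """Equation (1): rho_ij = -theta_ij / sqrt(theta_ii * theta_jj)."""
    theta = np.linalg.inv(cov)              # precision matrix
    d = np.sqrt(np.diag(theta))
    pcor = -theta / np.outer(d, d)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def bootstrap_pcor_ci(Y, n_boot=500, level=0.90, seed=0):
    """Percentile intervals from bootstrapping the sample covariance matrix."""
    rng = np.random.default_rng(seed)
    n, p = Y.shape
    draws = np.empty((n_boot, p, p))
    for b in range(n_boot):
        rows = rng.integers(0, n, size=n)   # resample individuals with replacement
        draws[b] = partial_correlations(np.cov(Y[rows], rowvar=False))
    alpha = 1.0 - level
    lower, upper = np.quantile(draws, [alpha / 2, 1 - alpha / 2], axis=0)
    return lower, upper

# Hypothetical positive definite precision matrix (illustration only).
theta = np.array([[2.0, -0.8, 0.0],
                  [-0.8, 1.0, -0.3],
                  [0.0, -0.3, 1.5]])
sigma = np.linalg.inv(theta)
rho = partial_correlations(sigma)

# Equation (2): regress node 0 on nodes 1 and 2 using population moments.
i, others = 0, [1, 2]
beta_i = np.linalg.solve(sigma[np.ix_(others, others)], sigma[others, i])
resid_var_i = sigma[i, i] - sigma[i, others] @ beta_i

# Equation (3): the same coefficients recovered from the partial correlations.
beta_from_rho = rho[i, others] * np.sqrt(np.diag(theta)[others] / theta[i, i])

# Non-regularized bootstrap intervals for simulated data.
rng = np.random.default_rng(1)
Y = rng.multivariate_normal(np.zeros(3), sigma, size=300)
lower, upper = bootstrap_pcor_ci(Y, n_boot=200)
```

Because the bootstrap only ever touches `np.cov`, swapping in a rank-based or polychoric covariance estimate would leave the rest of the sketch unchanged.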
For example, there are a variety of approaches that use multiple regression to estimate the elements of $\Theta$ (Liu & Wang, 2017; Yuan, 2010) or that focus on the partial correlation matrix (Krämer, Schäfer, & Boulesteix, 2009). This is known as “neighborhood selection” (Meinshausen & Bühlmann, 2006).

In the familiar context of multiple regression, $\ell_1$-regularization is similar to the ordinary least squares (OLS) solution, but with an added penalty to the residual sum of squares (RSS), that is,

$$\underbrace{\sum_{i=1}^{n} \Big( y_i - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2}_{\text{RSS}} + \underbrace{\lambda \sum_{j=1}^{p} |\beta_j|}_{\ell_1\text{-penalty}} \qquad (4)$$

In this equation, $\lambda$ is the “tuning parameter” that determines the extent to which the penalty affects the estimates. When $\lambda = 0$, no penalty is imposed and the resulting estimates are equal to the OLS estimates. When a very high value of $\lambda$ is chosen, all the estimates will be pushed to zero. Thus, some criterion is typically used to choose the value of $\lambda$. The default choice in network psychometrics is the extended Bayesian information criterion (EBIC, Chen & Chen, 2008). This is given by

$$\text{EBIC} = \underbrace{n \cdot \log\left(\frac{\text{RSS}}{n}\right) + k \cdot \log(n)}_{\text{BIC}} + 2 \cdot k \cdot \gamma \cdot \log(p) \qquad (5)$$

where $k$ is the number of selected parameters, $n$ the sample size, $p$ the number of predictors, and $\gamma$ ($0 \leq \gamma \leq 1$) an additional hyperparameter (p. 3 in Chen & Chen, 2012). Note that, when $\gamma = 0$, Equation (5) reduces to the BIC. In network analysis, the focus is typically on a conservative model that includes few false positives. Accordingly, the default is $\gamma = 0.5$ with the goal of providing a relatively sparse model compared to BIC, as a result of selecting a larger value for $\lambda$ in Equation (4). There is a potential problem, however, in that a fundamental tension exists between selecting the true model and parameter estimation.
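As a small illustration of Equation (5), the following Python sketch scores a few candidate models and shows how a larger $\gamma$ can tip the choice toward a sparser one. The function name and the candidate (RSS, $k$) pairs are hypothetical; in practice the candidates would come from fitting lasso along a grid of $\lambda$ values.

```python
import math

def ebic_regression(rss, n, k, p, gamma=0.5):
    """Equation (5): BIC plus the extra 2 * k * gamma * log(p) term."""
    bic = n * math.log(rss / n) + k * math.log(n)
    return bic + 2.0 * k * gamma * math.log(p)

# Hypothetical candidates: (RSS, number of selected predictors k).
candidates = [(520.0, 12), (530.0, 8), (560.0, 5)]
n, p = 500, 20

selected_k = []
for gamma in (0.0, 0.5, 1.0):
    scores = [ebic_regression(rss, n, k, p, gamma) for rss, k in candidates]
    best = min(range(len(candidates)), key=scores.__getitem__)
    selected_k.append(candidates[best][1])   # size of the EBIC-minimising model
```

Setting `gamma=0.0` recovers the BIC exactly, and the number of selected predictors cannot increase as `gamma` grows, since the added term penalizes each selected parameter equally.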
The theoretical results in Fan and Li (2001) demonstrated that the $\ell_1$-penalty can be consistent for model selection and consistently estimate the parameters, but it cannot satisfy both properties simultaneously (see Theorem 2 and Remark 1 in Fan & Li, 2001). For the former, this requires that $\sqrt{n}\lambda \to \infty$, whereas, for the latter, root-$n$ consistency requires that $\lambda = O(1/\sqrt{n})$ (p. 1353, Fan & Li, 2001). It should be noted that alternative penalties have been developed that can achieve both at the same time. I refer interested readers to Williams (2020a, see references therein).

This suggests that when erring on the side of caution there is a price to be paid, namely estimation accuracy, and this cannot typically be overcome with more data. It follows that adding to BIC in the pursuit of sparsity, as in EBIC with $\gamma > 0$, will further compromise the non-zero parameter estimates due to selecting a larger value for $\lambda$. This logic extends to the Akaike information criterion, which will typically select a smaller value for $\lambda$ than BIC. As a result, while there will be fewer relations pushed to zero, less harm is done to the parameter estimates themselves. This should not be taken to mean that a more liberal information criterion should be used instead of EBIC (or BIC).

Framing it this way highlights a general limitation of the $\ell_1$-penalty that is particularly salient for computing CIs of non-zero relations. This is due to the sampling distribution of a consistent estimator concentrating around the true value with increasing data. Because the $\ell_1$-based bootstrap sampling distribution will not necessarily concentrate around the population value, it is possible that coverage actually deteriorates as $n$ increases. This insight is not entirely new:

...bootstrapping the Lasso does not lead to a consistent estimate of the underlying sampling distribution which in turn could be used for constructing confidence statements (p. 350, Bühlmann, 2017).
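A worked special case may help make the source of the bias concrete. Assume, purely for illustration, that the predictors are orthonormal ($X^\top X = I_p$), which is not one of the settings studied here, and write $b_j = \sum_i x_{ij} y_i$ for the OLS estimate (notation introduced only for this sketch). The criterion in Equation (4) then separates across coefficients:

```latex
\begin{aligned}
\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|
  &= \sum_{i=1}^{n} y_i^2 + \sum_{j=1}^{p}\big(\beta_j^2 - 2 b_j\beta_j + \lambda|\beta_j|\big), \\
\hat{\beta}_j^{\text{lasso}}
  &= \operatorname{sign}(b_j)\,\max\big(|b_j| - \lambda/2,\; 0\big).
\end{aligned}
```

Under this assumption, every selected coefficient sits exactly $\lambda/2$ closer to zero than its OLS counterpart, however large $|b_j|$ is. This is one way to see why a larger $\lambda$ displaces the estimates of clearly non-zero effects, and not only those near zero.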
What remains to be determined, however, is just how inaccurate coverage is in low-dimensional settings that are common to the network literature.

Numerical Experiment 1

In this experiment, I intentionally focus on an unrealistic situation that can be understood as the best case scenario, that is, large coefficients that will be detected with more data, orthogonal covariates, and a favorable signal-to-noise ratio (SNR). This is meant to satisfy two important assumptions of lasso for consistent model selection: [2] (1) the beta-min condition, which requires that “the non-zero regression coefficients are sufficiently large (since otherwise, we cannot detect the variables in $S_0$ [the active set or non-zero relations] with high probability)” (p. 1214, Bühlmann, 2012). In reference to Figure 1, this ensures that any issues are not driven exclusively by effects that have point mass at zero; and (2) the irrepresentable condition (IRC), such “that the total amount of an irrelevant covariate represented by the covariates in the true model is not to reach 1” (p. 2545, Zhao & Yu, 2006). [3] In other words, the correlation between relevant and irrelevant predictors is not too large, which is automatically satisfied with orthogonal covariates. Together, this experimental design allows for isolating the effect of $\ell_1$-regularization on the sampling distribution.

[2] There are additional assumptions of $\ell_1$-regularization. Figure 1 in Van De Geer, Bühlmann, and others (2009) describes how they are (often) directly related to one another.
[3] The IRC was checked following Equation 2 in Zhao and Yu (2006), whereas satisfying the beta-min condition was inferred from the effects being detected.

The simulation procedure was as follows:

1. Set $\beta_{1:10} = (0.1, 0.2, \ldots, 1)$ and $\beta_{11:20} = (0, 0, \ldots, 0)$, such that the first 10 coefficients were non-zero and the last 10 were truly zero. These non-zero values could all be detected with increasing data.
2. Generate $p = 20$ variables $X \sim \mathcal{N}(0, I_p)$, where $I_p$ is a $p \times p$ identity matrix, thereby ensuring that the IRC is satisfied.
3. Set $\sigma = \sqrt{\beta' I_p \beta / 1}$, where 1 is the SNR. Note that $R^2 = \text{SNR} / (\text{SNR} + 1)$, such that variance explained was 0.50.
4. Generate observations for $n = \{250, 500, 1{,}000\}$ from the model

$$y = X\beta' + \varepsilon, \quad \varepsilon \sim \mathcal{N}(0, \sigma) \qquad (6)$$

5. Obtain the sampling distribution with a non-parametric bootstrap.
6. Determine whether the true values were covered by 90% CIs. The CIs were computed as the 5th and 95th quantile of the bootstrap distribution. [4]

[4] Epskamp, Borsboom, and Fried (2018) suggested to use type 6 with the quantile function in R, whereas I used the default of type 7. The results do not change according to the method used for computing the quantiles.

For $\ell_1$-regularization, $\lambda$ was selected with EBIC, including $\gamma = \{0, 0.5, 1.0\}$. The idea here is to show the effect of increasing the penalty (recall that larger $\gamma$ values provide more regularization), with the expectation that coverage should get worse with larger values. To obtain the sampling distribution for each coefficient, a model was selected for each bootstrap sample, $b = 1, \ldots, 500$, resulting in the estimated coefficients for a given bootstrap sample (i.e., $\hat{\beta}_b$). This procedure is described in Hastie et al. (2015, see Section 6.2). I also employed OLS regression with a non-parametric bootstrap. Data-driven model selection was not performed for each bootstrap sample. Thus, the bootstrap sampling distributions were obtained from the full, non-regularized, model. Coverage for 90% CIs was computed from 500 simulation trials. All aspects of this work were implemented in R (version 4.0.2, R Core Team, 2017). The regularized regression models were fitted with the R package glmnet (Friedman, Hastie, & Tibshirani, 2010) and the figures were made with ggplot2 (Wickham, 2016).

Results.
Figure 2 (panel A) includes intervals computed from 100 replications ($n = 1{,}000$, $\beta = 1$, $\gamma = 0.5$). The idea here is to further clarify the definition of a CI and frequentist inference more generally. Notice that the sampling distributions for OLS (denoted “Non-reg”) are not centered around the true value for a given sample. Indeed, the estimates are larger or smaller than the true value, but hardly ever centered directly upon it. Even with 100 replications, coverage for the non-regularized estimator was close to the nominal level of 0.90, as indicated by 85% of the CIs covering the true value. The $\ell_1$-based intervals, on the other hand, had extremely poor coverage: only 65% of the “CIs” covered the true value. By definition, this translates into rejecting the true value about a third of the time.

Figure 2 (panel B) includes coverage of the non-zero coefficients for all 500 simulation trials ($\gamma = 0.5$ is the default in psychology). [5] The box plot depicts the interquartile range and median coverage. Recall that, based on the tension between sparsity and parameter estimation, coverage should get incrementally worse with more penalization. This can be seen clearly, in that, with larger $\gamma$ values, coverage was very low for lasso. Said another way, coverage deteriorates when moving away from non-regularized estimation. The sampling distribution is not only inaccurate when there is a point mass at zero, but even when the effects are easily detected. This is particularly striking because all of the standard assumptions were satisfied (for lasso in particular), the sample size was large ($n = 1{,}000$), the coefficient was always selected, such that there was no mass at zero distorting the sampling distribution (e.g., Figure 1), and setting $\gamma = 0.5$ is the default in network psychometrics (coverage was never above 0.70).

There is, of course, the question of acceptable coverage. In the robust statistical literature, it is common to follow the guidelines of Bradley (1978).
There it was suggested that “The most liberal criterion that I am able to take seriously is $0.5 \cdot \alpha \leq \rho \leq 1.5 \cdot \alpha$” (p. 146, Bradley, 1978), where $\rho$ is the actual error rate. With $\alpha = 0.10$ (90% CIs), this translates into coverage not being lower or higher than 0.85 and 0.95, respectively, which was never the case for lasso.

[5] Coverage for the true zero coefficients was nearly 100%.

Numerical Experiment 2

This experiment aims to more directly understand the relation between $\lambda$ and coverage. This was previously inferred from increasing $\gamma$ in Equation (5). There has been considerable work investigating optimal regularization without selecting the tuning parameter. In Belloni, Chernozhukov, and Wang (2011), for example, it was shown that $\lambda_{TC} = \sqrt{\log(p)/n}$ is a theoretically consistent regularization parameter for the square root lasso. This value has also been used in GGMs (see Rong, Ren, & Chen, 2017; Wang et al., 2016). I adopt this approach and scale $\lambda_{TC}$ such that there are 20 values ranging between 0 (OLS) and $2 \cdot \lambda_{TC}$. The simulation procedure is the same as above, but with only one sample size ($n = 1{,}000$).

Results. Figure 2 (panel C) includes coverage when fixing $\lambda$. The theoretically consistent value is $\lambda = 0.054$, which is positioned directly in the middle of the $x$-axis. These results reveal the not so gradual effect of moving away from $\lambda = 0$ (OLS) to an increasingly sparse model. For example, with a larger penalty, coverage decreased for the non-zero relations, whereas, for the null associations, coverage approached 1.0. When using $\lambda_{TC}$ as the tuning parameter, coverage was around 0.80 for the non-zero coefficients (still far below 0.90). Recently, $\lambda_{TC}$ was shown to have an inflated false positive rate that did not diminish with more data (Williams, 2020a). Accordingly, a harsher penalty would be required to reduce the false selections, which, in turn, is just where coverage is particularly bad.
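The bootstrap-and-coverage procedure behind these experiments can be sketched in a few lines. This is a scaled-down Python illustration, not the R and glmnet code used for the reported results: the coordinate-descent solver, the function names, the smaller design ($p = 6$, $n = 200$), and the reduced numbers of replications and bootstrap samples are all assumptions made to keep the example fast. The penalty is fixed at $2 \cdot \lambda_{TC}$ rather than selected with EBIC, and the objective is assumed to be scaled as $(1/(2n)) \cdot \text{RSS} + \lambda \sum_j |\beta_j|$.

```python
import numpy as np

def lambda_tc(p, n):
    """Theoretically consistent penalty, sqrt(log(p) / n)."""
    return np.sqrt(np.log(p) / n)

def lasso_cd(X, y, lam, n_iter=200, tol=1e-8):
    """Coordinate descent for (1 / (2n)) * RSS + lam * sum(|beta|)."""
    n, p = X.shape
    gram, xty = X.T @ X / n, X.T @ y / n
    beta = np.zeros(p)
    for _ in range(n_iter):
        previous = beta.copy()
        for j in range(p):
            # Covariance of x_j with the residual that excludes coordinate j.
            z = xty[j] - gram[j] @ beta + gram[j, j] * beta[j]
            beta[j] = np.sign(z) * max(abs(z) - lam, 0.0) / gram[j, j]
        if np.max(np.abs(beta - previous)) < tol:
            break
    return beta

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def bootstrap_coverage(fit, beta, n, n_rep=40, n_boot=100, level=0.90, seed=1):
    """Share of percentile intervals that cover the true non-zero coefficients."""
    rng = np.random.default_rng(seed)
    p = beta.size
    sigma = np.sqrt(beta @ beta)            # SNR = 1 with identity covariance
    nonzero = beta != 0
    alpha = 1.0 - level
    hits = []
    for _ in range(n_rep):
        X = rng.standard_normal((n, p))
        y = X @ beta + rng.normal(0.0, sigma, size=n)
        boot = np.empty((n_boot, p))
        for b in range(n_boot):
            rows = rng.integers(0, n, size=n)   # non-parametric bootstrap
            boot[b] = fit(X[rows], y[rows])
        lower, upper = np.quantile(boot, [alpha / 2, 1 - alpha / 2], axis=0)
        hits.append(((lower <= beta) & (beta <= upper))[nonzero])
    return float(np.mean(hits))

beta = np.array([0.5, 0.75, 1.0, 0.0, 0.0, 0.0])   # hypothetical true values
n = 200
lam = 2 * lambda_tc(beta.size, n)                  # upper end of the grid

cov_ols = bootstrap_coverage(ols, beta, n)
cov_lasso = bootstrap_coverage(lambda X, y: lasso_cd(X, y, lam), beta, n)
print(f"OLS: {cov_ols:.2f}  lasso: {cov_lasso:.2f}")
```

Both calls share a seed, so the two estimators see identical data sets and resamples, and any gap in coverage is attributable to the penalty alone. For the design in Numerical Experiment 2, `lambda_tc(20, 1000)` evaluates to roughly 0.0547, in line with the value of 0.054 reported above.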
Again, this is the tension between “pushing” values to zero and parameter estimation with the $\ell_1$-penalty.

Summary

These experiments highlighted a fundamental issue with $\ell_1$-regularization: in pursuit of sparsity, accuracy of the sampling distribution is necessarily sacrificed. Although some degree of inaccuracy is expected (due to biased estimates), the extent to which this compromised coverage cannot be overstated. Further, rather than being an issue only for truly zero and small relations, coverage was also very poor for the large effects. This was especially remarkable because very stringent assumptions were satisfied in both experiments. In what follows, coverage is investigated in a setting representative of the network literature.

Partial Correlation Networks

Extended to multivariate settings, the $\ell_1$-penalized likelihood for the precision matrix is defined as

$$\underbrace{\log \det \Theta - \text{tr}(S\Theta)}_{\text{log-likelihood}} - \underbrace{\lambda \|\Theta\|_1}_{\ell_1\text{-penalty}} \qquad (7)$$

where $S$ is the sample covariance matrix, $\Theta = \Sigma^{-1}$ the precision matrix, and $\lambda$ is the tuning parameter. The graphical lasso (glasso) method applies a penalty on the sum of absolute values for the off-diagonal elements of $\Theta$. Recall that these elements have a direct correspondence to multiple regression (i.e., $\beta_{ij} = -\theta_{ij}/\theta_{ii}$). Indeed, obtaining the glasso estimate of $\Theta$ can be seen as “a $p$ coupled lasso [regression] problems.” (p.).
This correspondence implies that many of the same issues that plague $\ell_1$-regularized regression also apply to estimating psychological networks, given that the conditional dependence structure is encoded in the off-diagonal elements of $\Theta$. In network psychometrics, the default choice for selecting $\lambda$ is again EBIC, that is,

$$\text{EBIC} = -2 \cdot l(\Theta) + k \cdot \log(n) + 4 \cdot \gamma \cdot k \cdot \log(p), \tag{8}$$

where $l(\Theta)$ is the (simplified) Gaussian likelihood function that is given by

$$l(\Theta) = \frac{n}{2}\left[\log \det \Theta - \operatorname{tr}(S\Theta)\right]. \tag{9}$$

Note that $\Theta$ is the glasso estimate. In Equation (8), $k$ is the number of selected edges (non-zero off-diagonal elements of $\Theta$), and $\gamma$ ($0 \leq \gamma \leq 1$) governs the additional penalty beyond BIC (the default in network analysis is $\gamma = 0.5$). The selected network then minimizes EBIC with respect to $\lambda$. This is typically accomplished by assessing a large number (e.g., 100) of $\lambda$'s and selecting the one for which EBIC is smallest. The network is then obtained by computing the partial correlations from $\Theta$ (Equation 1). Hence, just as in regression, the sampling distribution of the partial correlations should be increasingly compromised with larger values of $\gamma$.

Issues Specific to Network Analysis

Over and above the conflict between model selection and parameter estimation, there are additional issues specific to the psychological network literature. At its crux, recall that the IRC states that the important and unimportant predictors cannot be correlated (at least not too much). There is an analogous assumption that similarly applies in GGMs (see Equation 28 in Ravikumar, Wainwright, Raskutti, & Yu, 2011). Two examples provided in Ravikumar et al. (2011, see Sections 3.1.1 and 3.1.2) suggest that the irrepresentable condition can be more difficult to satisfy for networks than multiple regression. However, in network analysis it is common to estimate the conditional dependence structure of items from scales that, by construction, contain highly correlated variables.
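Two quantities from this section can be sketched numerically: the EBIC of Equations (8) and (9), and the irrepresentable condition for GGMs. The sketch below uses Python with NumPy for illustration only. For the latter, it assumes the mutual incoherence form as I understand it from Ravikumar et al. (2011): the maximum over absent edges $e$ of $\lVert \Gamma_{eS}(\Gamma_{SS})^{-1} \rVert_1$, with $\Gamma = \Theta^{-1} \otimes \Theta^{-1}$ and $S$ the edge set plus the diagonal, which should fall below 1. All function names are hypothetical, and whether this matches the exact computation used in the paper is an assumption.

```python
import numpy as np


def gaussian_loglik(theta, s, n):
    """Equation (9): l(Theta) = n/2 * [log det(Theta) - tr(S Theta)]."""
    sign, logdet = np.linalg.slogdet(theta)
    if sign <= 0:
        raise ValueError("theta must be positive definite")
    return 0.5 * n * (logdet - np.trace(s @ theta))


def ebic(theta, s, n, gamma=0.5):
    """Equation (8): -2 l(Theta) + k log(n) + 4 gamma k log(p)."""
    p = theta.shape[0]
    # k counts the selected edges: non-zero entries above the diagonal.
    k = np.count_nonzero(theta[np.triu_indices(p, k=1)])
    penalty = k * np.log(n) + 4.0 * gamma * k * np.log(p)
    return -2.0 * gaussian_loglik(theta, s, n) + penalty


def irrepresentability(theta):
    """Assumed form: max over absent edges e of the l1 norm of
    Gamma_{eS} (Gamma_{SS})^{-1}, with Gamma = Sigma kron Sigma."""
    sigma = np.linalg.inv(theta)
    gamma_mat = np.kron(sigma, sigma)
    # Support: edges in both orderings plus the diagonal entries.
    support = (theta != 0).ravel()
    if support.all():
        return 0.0  # no absent edges, so there is nothing to check
    g_cs = gamma_mat[np.ix_(~support, support)]
    g_ss = gamma_mat[np.ix_(support, support)]
    # g_ss is symmetric, so this equals g_cs @ inv(g_ss).
    rows = np.linalg.solve(g_ss, g_cs.T).T
    return float(np.abs(rows).sum(axis=1).max())
```

In practice, `ebic` would be evaluated at the glasso estimate for each candidate $\lambda$ and the minimizer retained, while a value of `irrepresentability` at or above 1 for the true $\Theta$ would indicate that the condition is violated.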
With such highly correlated variables, the IRC will likely be violated, perhaps egregiously so. As shown in Hastie et al. (2015, Figure 11.6 therein) and Zhao and Yu (2006, Figure 2 therein), the degree to which it is violated has a direct bearing on the performance of $\ell_1$-regularization. Further, it is also the case that edges are often small in effect size (see Table 2 in Wysocki & Rhemtulla, 2019). This suggests that the beta-min condition may not always be satisfied. This is perhaps less of a concern, because it translates into some edges escaping detection and not false selections. However, as shown in Figure 1 (panel A), small relations in particular can have a severely distorted sampling distribution.

Numerical Experiment 1

In this experiment, I follow a common strategy for simulation in the network literature (e.g., Epskamp, 2016; Williams et al., 2019). The true network structure was obtained by first estimating the partial correlation matrix from 20 PTSD symptoms (Armour, Fried, Deserno, Tsai, & Pietrzak, 2017) and then setting absolute values less than 0.05 to zero. Following Ravikumar et al. (2011, Equation 28), the IRC was violated and exceeded the upper bound by a factor of five. This indicates that glasso cannot recover the true model. In psychology, the failure of glasso for these kinds of data was recently highlighted in Williams and Rast (2019), Williams (2020a), and Williams et al. (2019). It should be noted that the IRC is unlikely to hold with many variables, unless the ground truth is extremely sparse (see Table 1 in Zhao & Yu, 2006), which is not typically the case in psychological applications (see Table 2 in Wysocki & Rhemtulla, 2019). Accordingly, the setting for this experiment more closely reflects the network literature. The simulation procedure was as follows.
Multivariate normal data were generated for $n \in \{250, 500, 1{,}000, 2{,}500, 10{,}000, 25{,}000, 50{,}000\}$, given the true network structure obtained from the 20 PTSD symptoms. These large samples allowed for determining whether coverage became worse with more data. A non-parametric bootstrap was employed for glasso with the tuning parameter selected with EBIC ($\gamma = 0$ and $0.5$). To obtain the sampling distribution for each partial correlation, a model was selected for each bootstrap sample, $b = 1, \ldots, 500$, resulting in the estimated relations for a given bootstrap sample. I also bootstrapped a non-regularized model, which amounts to a non-parametric bootstrap for correlations. Importantly, data-driven model selection was again not performed for each bootstrap sample. Average coverage for 90% CIs (non-zero relations) was computed from 500 simulation trials. The regularized models were fitted with the R package GGMncv (Williams, 2021). Note that bootnet is typically used in network analysis for bootstrapping the glasso. However, using GGMncv in combination with boot (Canty & Ripley, 2020) expedited the simulations. The results do not change appreciably when using bootnet (Epskamp, Borsboom, & Fried, 2018).

Results. Because 90% CIs were used for each relation, the proportion of intervals containing the true value for a given network should also be 0.90 (a long run average, of course). This corresponds to coverage averaged across the network. To emphasize this point, Figure 3 (panel A) includes intervals from one simulation trial ($n = 1{,}000$ and $\gamma = 0.5$), where 0.62 and 0.92 denote the proportion of intervals that covered the true value. Although just one random sample, there are red flags for $\ell_1$-regularization. For example, several sampling distributions are truncated at zero, which is indicative of a point mass at zero (e.g., Figure 1, panel A).
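The non-regularized arm of the simulation procedure described above can be sketched as follows, in Python with NumPy for illustration only (the paper's simulations used R). This is not the author's code. The 4-node precision matrix, the function names, and the use of percentile intervals are assumptions made here; the regularized (glasso with EBIC) arm is omitted because it requires a glasso solver.

```python
import numpy as np


def partial_cor_vector(theta):
    """Upper-triangle partial correlations implied by a precision matrix."""
    d = np.sqrt(np.diag(theta))
    rho = -theta / np.outer(d, d)
    return rho[np.triu_indices(theta.shape[0], k=1)]


def estimate(data):
    """Non-regularized estimate: invert the sample covariance matrix."""
    return partial_cor_vector(np.linalg.inv(np.cov(data, rowvar=False)))


def percentile_ci(data, rng, n_boot=500, level=0.90):
    """Non-parametric bootstrap: resample rows with replacement, then
    take percentile limits (the interval type is an assumption)."""
    n = data.shape[0]
    draws = np.array([estimate(data[rng.integers(0, n, size=n)])
                      for _ in range(n_boot)])
    tail = (1.0 - level) / 2.0
    return np.quantile(draws, [tail, 1.0 - tail], axis=0)


def average_coverage(theta, n, n_trials=500, n_boot=500, seed=0):
    """Proportion of CIs covering the true non-zero partial correlations,
    averaged over simulation trials."""
    rng = np.random.default_rng(seed)
    truth = partial_cor_vector(theta)
    nonzero = truth != 0
    sigma = np.linalg.inv(theta)
    hits = []
    for _ in range(n_trials):
        data = rng.multivariate_normal(np.zeros(theta.shape[0]), sigma, size=n)
        lower, upper = percentile_ci(data, rng, n_boot)
        covered = (lower[nonzero] <= truth[nonzero]) & (truth[nonzero] <= upper[nonzero])
        hits.append(covered.mean())
    return float(np.mean(hits))


# Hypothetical chain network: partial correlations of 0.3 between neighbors.
theta_true = np.array([[1.0, -0.3, 0.0, 0.0],
                       [-0.3, 1.0, -0.3, 0.0],
                       [0.0, -0.3, 1.0, -0.3],
                       [0.0, 0.0, -0.3, 1.0]])
```

Calling `average_coverage(theta_true, n=1000)` runs the full 500 trials with 500 bootstrap samples each; smaller values of `n_trials` and `n_boot` give a quicker, noisier check.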
Even for edges separated from zero, say, those with at least a small effect size (> 0.1), there is a discernible difference compared to non-regularized estimation that passes the inter-ocular trauma test: it hits between the eyes. Of course, there is a chance that one simulation trial was not representative of the long run average. Figure 3 (panel B) includes the results for average coverage of the non-zero relations. Here, for $n = 1{,}000$, the non-regularized method was right at 0.90 and average coverage was around 0.60 for glasso EBIC ($\gamma = 0.5$). That is, just over half of the edges were covered for a given network. Unfortunately, this indicates that the $\ell_1$-based “CIs” in panel A were not a fluke. In general, the default in network psychometrics had very poor coverage ($\gamma = 0.5$). In the smaller sample sizes, the intervals often only covered around half of the values: average coverage was around 0.50 when it should be 0.90. Initially, coverage improved with increasing data, at best reaching nearly 0.70 ($n = 2{,}500$). This was due to