Ruilin Li, Christopher Chang, Johanne M Justesen, Yosuke Tanigawa, Junyang Qian, Trevor Hastie, Manuel A Rivas, Robert Tibshirani, Fast Lasso method for large-scale and ultrahigh-dimensional Cox model with applications to UK Biobank, Biostatistics, Volume 23, Issue 2, April 2022, Pages 522–540, https://doi.org/10.1093/biostatistics/kxaa038
Summary
We develop a scalable and highly efficient algorithm to fit a Cox proportional hazard model by maximizing the |$L^1$|-regularized (Lasso) partial likelihood function, based on the Batch Screening Iterative Lasso (BASIL) method developed in Qian and others (2019). Our algorithm is particularly suitable for large-scale and high-dimensional data that do not fit in memory. The output of our algorithm is the full Lasso path: the parameter estimates at all predefined regularization parameters, together with their validation accuracy measured using the concordance index (C-index) or the validation deviance. To demonstrate the effectiveness of our algorithm, we analyze a large genotype–survival-time dataset across 306 disease outcomes from the UK Biobank (Sudlow and others, 2015). We provide a publicly available implementation of the proposed approach for genetics data on top of the PLINK2 package and name it snpnet-Cox.
1. Introduction
Survival analysis involves predicting a time-to-event outcome, such as the survival time of a patient, from a set of features of the subject, as well as identifying the features most relevant to that outcome. The Cox proportional hazard model (Cox, 1972) provides a flexible mathematical framework to describe the relationship between the survival time and the features, allowing a time-dependent baseline hazard. Survival analysis faces computational and statistical challenges when the predictors are ultrahigh-dimensional (the feature dimension is greater than the number of observations) and large scale (the data matrix does not fit in memory). Based on the Batch Screening Iterative Lasso (BASIL), we develop an algorithm to fit a Cox proportional hazard model by maximizing the Lasso partial likelihood function. We apply the method to 306 time-to-event disease outcomes from the UK Biobank combined with genetic data. We obtain sparse predictive models, with the number of selected variables ranging from a single active variable for some outcomes to almost 2000 for others. We note that our algorithm can easily be adapted to other applications with arbitrarily large datasets, provided that the Lasso solution is sufficiently sparse.
1.1. Cox proportional hazard model
Given a numerical predictor |$X \in \mathbb{R}^d$|, the Cox model assumes that there exists a baseline hazard function |$h_0: \mathbb{R}^+ \mapsto \mathbb{R}^+$| and a parameter vector |$\beta \in \mathbb{R}^d$| such that the hazard function of the survival time has the form:
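$$\begin{equation*} h(t \mid X) = h_0(t)\exp(\beta^T X). \end{equation*}$$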
Intuitively, the hazard function at time |$t$| measures the relative risk of death around time |$t$|, given that the patient has survived up to time |$t$|. Under the Cox proportional hazard model, the hazard ratio between two subjects with covariates |$X$| and |$X'$| can be written as:
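$$\begin{equation*} \frac{h(t \mid X)}{h(t \mid X')} = \exp\big(\beta^T(X - X')\big), \end{equation*}$$
which does not depend on |$t$|.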
When |$X$| is an indicator for a treatment, the hazard ratio can be interpreted as the risk of the event occurring in the treatment group compared to the risk in the control group, and the regression coefficient |$\beta$| is the log-hazard ratio.
To describe the distribution of the survival time, we can equivalently use its cumulative distribution function:
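$$\begin{equation*} F(t \mid X) = 1 - \exp\left(-\int_0^t h(s \mid X)\,\mathrm{d}s\right) = 1 - \exp\left(-H_0(t)\exp(\beta^T X)\right), \end{equation*}$$
where |$H_0(t) = \int_0^t h_0(s)\,\mathrm{d}s$| is the cumulative baseline hazard.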
In practice, it is often the case that the survival time is right-censored; that is, the event has not yet happened at the time the data were collected. Therefore, for the |$i$|th individual we observe a tuple |$(X_i, O_i, T_i)$|, where |$X_i \in \mathbb{R}^d$| is the vector of predictors and |$O_i \in\{0, 1\}$| is the event indicator. If |$O_i = 1$|, then |$T_i$| is the actual survival time of the |$i$|th individual. If |$O_i = 0$|, then we only know that the true survival time of the |$i$|th individual is longer than |$T_i$|. Throughout this article, we assume that the censoring is non-informative, meaning that the time of censoring is independent of the (possibly unobserved) event time conditional on |$X_i$|.
One advantage of the Cox model is that, while it is a semi-parametric model (the baseline hazard is non-parametric), we can still estimate the parameter |$\beta$| without estimating the baseline hazard. This can be achieved by choosing the |$\hat{\beta}$| that maximizes the log-partial likelihood function:
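$$\begin{equation*} \hat{\beta} = \arg\max_{\beta \in \mathbb{R}^d}\; \sum_{i:O_i = 1}\left[ \beta^T X_i -\log \left(\sum_{j:T_j \ge T_i} \exp(\beta^T X_j)\right)\right]\!. \end{equation*}$$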
We use the C-index to evaluate a fitted |$\hat{\beta}$|:
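$$\begin{equation*} C(\hat{\beta}) = \frac{\sum_{i: O_i = 1}\sum_{j: T_j > T_i}\left[ 1\big(\hat{\beta}^T X_i > \hat{\beta}^T X_j\big) + \tfrac{1}{2}\, 1\big(\hat{\beta}^T X_i = \hat{\beta}^T X_j\big)\right]}{\sum_{i: O_i = 1} \big|\{j: T_j > T_i\}\big|}. \tag{1.5} \end{equation*}$$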
The C-index is the proportion of comparable pairs for which the model |$\hat{\beta}$| predicts the correct order of the events. Each tie in the prediction is considered half of a correct prediction. For a more complete description of C-index, see Harrell and others (1982) and Li and Tibshirani (2019).
1.2. Computational challenges in large-scale and high-dimensional survival analysis
In today’s applications, it is common to have datasets with millions of observations and variables. For example, the UK Biobank dataset (Sudlow and others, 2015) contains millions of genetic variants for over 500 000 individuals. Loading this data matrix into R takes around |$2.4$| TB of memory, which exceeds the RAM of a typical machine. While memory-mapping techniques allow users to perform computation on data outside of RAM (out-of-core computation) relatively easily (Kane and others, 2013), popular optimization algorithms require repeatedly computing matrix-vector multiplications involving the entire data matrix, resulting in slow overall speed.
The Lasso (Tibshirani, 1996) is an effective tool for high-dimensional variable selection and prediction. R packages such as glmnet (Friedman and others, 2010), penalized (Goeman, 2010), coxpath (Park and Hastie, 2007), and glcoxph (Sohn and others, 2009) solve the Lasso Cox regression problem using various strategies. However, all of these packages require loading the entire data matrix into memory, which is infeasible for Biobank-scale data. To the best of the authors’ knowledge, our method is the first to solve |$L^1$|-regularized Cox regression with larger-than-memory data. On the other hand, most of the optimization strategies used in these packages can also be incorporated in the fitting step of our algorithm. In particular, snpnet-Cox uses the cyclical coordinate descent implemented in glmnet.
Even if these packages did support out-of-core computation, using them directly would be computationally inefficient. To be concrete, in one of our simulation studies on the UK Biobank data, the training data take about 2 TB. With the highly optimized out-of-core matrix-vector multiplication function that PLINK2 provides, we are able to run a single such operation in about 2–3 min. Without variable screening, cyclic coordinate descent (or proximal gradient descent) would require from a few to tens or even hundreds of such matrix-vector multiplications for a single |$\lambda$|. Our algorithm exploits the sparsity structure of the problem to reduce the frequency of this operation to mostly once or twice for several |$\lambda$|s. Most of these expensive, out-of-core matrix-vector multiplications are replaced with fast, in-memory ones that operate on much smaller subsets of the data.
2. Methods
2.1. Preliminaries
We first introduce the following notation:
- Let |$n, d$| be the number of observations and the number of features, respectively. Let |$\boldsymbol{X} \in \mathbb{R}^{n \times d}$| be the matrix of predictors. To simplify notation, we use |$n, d, \boldsymbol{X}$| for all of the training, validation, and test sets; which set |$\boldsymbol{X}$| refers to can be inferred from the context.
- Let |$X_i \in \mathbb{R}^d$| be the |$i$|th row of |$\boldsymbol{X}$|.
- Let |$x_j \in \mathbb{R}^n$| be the |$j$|th column of |$\boldsymbol{X}$|.
- Denote the log-partial likelihood function by |$f(\beta)$|. That is,
$$\begin{equation} f(\beta) = \sum_{i:O_i = 1} \left[\beta^T X_i -\log \left(\sum_{j:T_j \ge T_i} \exp(\beta^T X_j)\right)\right]\!. \tag{2.6}\end{equation}$$
- Denote |$[n] = \{1,2,\ldots, n\}$| and, more generally, |$[m] = \{1,2,\ldots, m\}$| for a positive integer |$m$|.
We focus on survival analysis in the high-dimensional regime, where the number of predictors is greater than the number of observations (|$d > n$|), although the same procedure can easily be applied to low-dimensional cases. We use the Lasso to perform variable selection and estimation at the same time. In particular, we optimize the |$L^1$|-regularized log-partial likelihood:
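$$\begin{equation*} \hat{\beta}(\lambda) = \arg\max_{\beta \in \mathbb{R}^d}\; f(\beta) - \lambda \|\beta\|_1, \tag{2.7} \end{equation*}$$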
where |$\|\beta\|_1 = \sum_{j=1}^d |\beta_j|$|. More generally, we allow each parameter and each observation to have a different weight in the objective function on the right-hand side of (2.7): given a vector of penalty factors |$w^p \in \mathbb{R}^d_+$| and observation weights |$w^o \in \mathbb{R}^n_+$|, the |$j$|th penalty term is scaled by |$w^p_j$| and the |$i$|th observation’s contribution to the log-partial likelihood is weighted by |$w^o_i$|.
This can be particularly useful if we are considering genetic variants that we would like to up-weight during variable selection, e.g., coding variants in a region of perfect linkage disequilibrium. To simplify the notation, we describe our algorithm assuming that the parameters and the observations are unweighted.
2.2. Hyperparameter selection
To find the optimal hyperparameter |$\lambda$|, we start with a sequence of |$L$| candidate regularization parameters |$\lambda_1 > \lambda_2 > \cdots > \lambda_L > 0$| and compute the corresponding parameter estimates as well as the validation metric. The optimal regularization parameter is then selected as the |$\lambda_l$| that maximizes the validation metric, and |$\hat{\beta}$| is set to |$\hat{\beta}(\lambda_l)$|. The sequence of regularization parameters can be chosen by setting |$\lambda_1$| just large enough that the optimal |$\hat{\beta}(\lambda_1)$| is exactly zero, and then taking |$L=100$| values equally spaced on the log scale.
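For concreteness, such a sequence can be generated as in the following R sketch, where `lambda_max` is |$\max_j |\partial_{\beta_j} f(0)|$| (the smallest value at which the all-zero solution satisfies the KKT condition for the unweighted problem) and the ratio |$\lambda_L/\lambda_1$| is an illustrative choice rather than a snpnet-Cox default.

```r
## Illustrative regularization path: L values equally spaced on the log scale,
## decreasing from lambda_max down to ratio * lambda_max (ratio is a user choice).
lambda_sequence <- function(lambda_max, L = 100, ratio = 0.01) {
  exp(seq(log(lambda_max), log(ratio * lambda_max), length.out = L))
}
```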
Applying this procedure naively requires solving |$L$| optimization problems, each reading the entire predictor matrix |$\boldsymbol{X}$|. To effectively reduce the number of computations involving the entire data matrix, we exploit the sparsity of the Lasso solution. The key components of our algorithm that significantly speed up the computation are the following observations adapted from Qian and others (2019).
2.3. Batch screening procedure
The Karush–Kuhn–Tucker (KKT) condition of (2.7) indicates that the optimal |$\hat{\beta}(\lambda)$| must satisfy:
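$$\begin{equation*} \left|\partial_{\beta_j} f\big(\hat{\beta}(\lambda)\big)\right| \le \lambda \ \text{ for all } j \in [d], \qquad \partial_{\beta_j} f\big(\hat{\beta}(\lambda)\big) = \lambda\, \mathrm{sign}\big(\hat{\beta}_j(\lambda)\big) \ \text{ whenever } \hat{\beta}_j(\lambda) \neq 0. \tag{2.9} \end{equation*}$$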
When |$\lambda$| is sufficiently large, |$\hat{\beta}(\lambda)$| is sparse, so our strategy is to solve the optimization problem (2.7) using only a small subset of the features, assuming all the others have coefficient zero. Then, we verify that the solution satisfies the KKT condition (2.9). We iteratively apply this strategy for |$\lambda = \lambda_1,\ldots, \lambda_L$| to obtain the entire Lasso path. To determine which predictors to include in the model, we adopt the screening rules used in BASIL, which are inspired by the strong rules proposed in Tibshirani and others (2012). In the Cox model, the strong rule assumes |$\hat{\beta}_j(\lambda_k) = 0$| (i.e., discards the |$j$|th predictor when fitting) if
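$$\begin{equation*} \left|\partial_{\beta_j} f\big(\hat{\beta}(\lambda_{k-1})\big)\right| < 2\lambda_k - \lambda_{k-1}. \end{equation*}$$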
By convention, we set |$\lambda_0 = \infty$|. Although it is possible for the strong rule to fail, this rarely happens when |$d > n$|.
Before we describe the full algorithm, we first write the gradient of the log-partial likelihood in a simple form. The gradient of the log-partial likelihood function can be written as:
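$$\begin{equation*} \nabla_\beta f(\beta) = \sum_{i:O_i = 1}\left( X_i - \frac{\sum_{k: T_k \ge T_i} X_k\exp(\beta^T X_k)}{\sum_{k: T_k \ge T_i} \exp(\beta^T X_k)}\right). \end{equation*}$$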
Let |$r = r({\rm data},\beta) \in \mathbb{R}^n$| be defined as
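$$\begin{equation*} r_i = O_i - \exp(\beta^T X_i)\sum_{j:\, T_j \le T_i,\, O_j = 1} \frac{1}{\sum_{k: T_k \ge T_j}\exp(\beta^T X_k)}, \qquad i \in [n]. \tag{2.12} \end{equation*}$$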
Then by direct computation one can show that
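$$\begin{equation*} \nabla_\beta f(\beta) = r^T\boldsymbol{X}. \end{equation*}$$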
Here, |$r$| is the martingale residual when the baseline survival function comes from a Breslow estimator (Barlow and Prentice, 1988; Breslow, 1974), and it can be computed using only the variables currently in the model. However, based on its definition, computing |$r$| requires summing over the risk set for each |$i$| (the denominator in the second term), which in total takes |$\mathcal{O}(n^2)$|. Our solution is to first sort the individuals by |$T_i$|, so that |$T_1\le \cdots \le T_n$|. Then |$\sum_{k:T_k \ge T_j} \exp(\beta^T X_k) = \sum_{k:k \ge j} \exp(\beta^T X_k)$| can be obtained for all |$j \in [n]$| in |$\mathcal{O}(n)$| with a single reverse cumulative sum. The only expensive part is then the multiplication |$r^T\boldsymbol{X}$|, which involves the large data matrix |$\boldsymbol{X}$|. Our full algorithm follows the same structure as BASIL (Qian and others, 2019): at each iteration, we solve for the Lasso solutions at multiple consecutive |$\lambda$|s in the path, so that the large dataset is not read too frequently. Suppose |$\hat{\beta}(\lambda_0), \ldots, \hat{\beta}(\lambda_l)$| have been computed in the first |$k-1$| iterations, and the KKT condition for these solutions has been verified. The |$k$|th iteration has the following parts:
(Screening) From the previous iteration, we have |$r^{(k-1)}$|, defined as the right-hand side of (2.12) evaluated at |$\hat{\beta}(\lambda_l)$|, which we use to compute the full gradient |$\nabla_\beta f(\hat{\beta}(\lambda_l))= (r^{(k-1)})^T\boldsymbol{X}$|. We include two types of variables in the fitting step of this iteration:
The set of variables |$j$| with |$\hat{\beta}_j(\lambda) \neq 0$| for some |$\lambda \ge \lambda_l$|. We call this set the ever-active set at iteration |$k-1$|, denoted |$\mathcal{A}^{(k-1)} \subseteq [d]$|.
The top |$M$| variables, among those not ever active, with the largest partial derivative magnitudes |$|\partial_{\beta_j}f(\hat{\beta}(\lambda_l))|$|.
The union of these two types of variables is denoted |$\mathcal{S}^{(k)} \subseteq [d]$|, and we use this set of variables in the fitting step.
(Fitting) In this step we solve the problem (2.7) using the variables |$\mathcal{S}^{(k)}$|, for the next few (default |$5$|) |$\lambda$| values starting at |$\lambda_{l+1}$|. The set of |$\lambda$|s used in this iteration is denoted |$\Lambda^{(k)}$|. The solutions are obtained through coordinate descent, with iterates initialized from the most recent Lasso solution (warm start). The corresponding coefficients of the variables not in |$\mathcal{S}^{(k)}$| are set to |$0$|.
(Checking) In this step, we verify that the solutions obtained in the fitting step under the assumption |$\beta_j = 0$| for |$j \not\in \mathcal{S}^{(k)}$| are indeed valid. To do this, for each |$\lambda$| in this iteration, we obtain the martingale residual |$r$| as in (2.12), compute the gradient |$\nabla_\beta f(\hat{\beta}(\lambda)) = r^T\boldsymbol{X}$|, and verify the KKT condition (2.9). If |$\hat{\beta}(\lambda)$| satisfies the KKT condition, then it is a valid solution. Otherwise, we go back to the screening step and continue with the largest |$\lambda \in \{\lambda_1,\ldots,\lambda_L\}$| for which the KKT condition fails.
(Early Stopping) When the regularization parameter |$\lambda$| becomes small, the model tends to overfit and the solutions we obtain become unstable. We keep a separate validation set to determine the optimal hyperparameter. For |$\lambda \in \Lambda^{(k)}$|, if |$\hat{\beta}(\lambda)$| satisfies the KKT condition, we evaluate its validation C-index. Once the validation C-index starts to decrease, we stop computing solutions for smaller |$\lambda$| values. A naive implementation of the C-index requires comparing |$\mathcal{O}(n^2)$| pairs of individuals in the study, which is not scalable. In the next section, we describe an |$\mathcal{O}(n\log n)$| C-index algorithm.
We emphasize that, in each iteration, only one matrix–matrix multiplication involving the entire dataset |$\boldsymbol{X}$| is needed (except in the first iteration, where |$2$| are needed). Algorithm 1 summarizes this procedure; the early stopping step is omitted for brevity.
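To make the residual and screening computations concrete, the following R sketch (a schematic under simplifying assumptions, not the snpnet-Cox implementation itself) assumes the observations are sorted by |$T_i$| with no tied event times, that `eta` is the in-memory linear predictor |$\boldsymbol{X}\beta$| restricted to the currently selected variables, and that `big_crossprod` is a hypothetical stand-in for the out-of-core product |$r^T\boldsymbol{X}$| computed with PLINK2.

```r
## Schematic residual and screening step. Assumes observations sorted by time,
## no tied event times, `status` the event indicator, `eta` the linear predictor
## (a plain vector), and `big_crossprod(r)` returning the full gradient r^T X.
martingale_residual <- function(status, eta) {
  ## risk-set sums sum_{k >= j} exp(eta_k), obtained in O(n) by a reverse cumsum
  risk_sum <- rev(cumsum(rev(exp(eta))))
  ## Breslow cumulative hazard evaluated at each T_i
  cum_haz  <- cumsum(status / risk_sum)
  status - exp(eta) * cum_haz
}

screen_variables <- function(r, ever_active, M, big_crossprod) {
  grad <- big_crossprod(r)                       # one pass over the full matrix X
  by_magnitude <- order(abs(grad), decreasing = TRUE)
  candidates <- setdiff(by_magnitude, ever_active)
  union(ever_active, head(candidates, M))        # ever-active set plus top M others
}
```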

2.4. Fast C-index computation
Several frequently used C-index algorithms, including the first one we tried, have time complexity |$\mathcal{O}(n^2)$|. As population-scale cohorts such as the UK Biobank, the Million Veterans Program, and FinnGen aggregate time-to-event data for survival analysis, it becomes increasingly important to consider the computational cost of statistics like the C-index when building and evaluating predictive models. Such time-to-event data include age of disease onset and progression from disease diagnosis to another, more severe outcome, such as surgery or death. Here, we present an implementation with |$\mathcal{O}(n\log{}n)$| time complexity (and |$\mathcal{O}(n)$| space complexity) that can provide over 10 000|$\times$| speedup for biobank-scale data relative to several R packages, and over 10|$\times$| speedup compared to the existing |$\mathcal{O}(n\log n)$| time complexity (and |$\mathcal{O}(n + n\log n)$| space complexity) algorithm implemented in the survival package by Therneau and Lumley (2014).
We first assume there are no tied predictions or events. We evaluate the C-index of the estimator |$\hat{\beta}$| on the data |$\{O_i, T_i, X_i\}_{i=1}^n$| through the following steps:
- First, relabel the data in order of increasing |$T_i$|; this takes |$\mathcal{O}(n\log n)$|. After relabeling, we have
$$\begin{equation*} T_1 < T_2 < \cdots < T_n.\end{equation*}$$
Define
$$\begin{equation*} R_i := |\{j \in [n]: T_j > T_i\}| = n - i \quad \text{for } i \in [n].\end{equation*}$$
|$R_i$| is the size of the risk set immediately after |$T_i$|. Clearly, computing all |$R_i$| takes |$\mathcal{O}(n)$|.
- Define
$$\begin{equation*} u_i := |\{j: \hat{\beta}^TX_j \le \hat{\beta}^TX_i\}| \quad \text{for } i \in [n].\end{equation*}$$
That is, |$u_i$| is the number of individuals that |$\hat{\beta}$| predicts to have lower or equal risk of the event than individual |$i$|. We have |$u_i \in [n]$| for all |$i$|, and |$u_i > u_j$| is equivalent to |$\hat{\beta}^TX_i > \hat{\beta}^TX_j$|. The |$u_i$|’s can be computed in linear time after sorting |$\boldsymbol{X}\hat{\beta}$| (here we assume |$\boldsymbol{X}\hat{\beta}$| has already been computed and is given as an input to the C-index function). The total time complexity of this step is |$\mathcal{O}(n\log n)$|.
- Using the above definitions, the C-index (1.5) can be equivalently written as
$$\begin{equation*} \frac{\sum_{i=1}^n O_i \left[\sum_{j=1}^n 1(u_i > u_j,\, i < j)\right]}{\sum_{i=1}^n O_i R_i}. \end{equation*}$$
The denominator clearly can be computed in linear time. In the next steps, we focus on computing the numerator.
- This step is the key ingredient of our algorithm. For each |$i \in [n]$|, define a binary vector (bitarray) |$B_i=(b_1^i,\ldots, b_{n}^i) \in \{0,1\}^n$|, where we set, for each |$j$|,
$$\begin{equation*} b_{u_j}^i = \begin{cases} 1 & \text{if } i < j \\ 0 & \text{otherwise.} \end{cases} \end{equation*}$$
|$B_i$| is well defined since |$[n] = \{u_1,\ldots, u_n\}$|. In addition, it has two nice properties:
(a) We have
$$\begin{equation} \sum_{j=1}^n 1(u_i > u_j,\, i < j) = \sum_{k < u_i} b_k^i, \tag{2.14} \end{equation}$$
where the summation on the right-hand side is computed through an array popcount on the bitarray |$\{b_1^i, \ldots, b_{u_i - 1}^i\}$|.
(b) We can update |$B_{i+1}$| from |$B_i$| simply by setting |$b^i_{u_{i+1}}$| from |$1$| to |$0$|.
In our implementation, we represent these binary vectors as bitarrays. Bitarrays are compact and very efficient to work with. (The exact arithmetic and bitwise operations we used were primarily informed by Knuth (2011) and Muła and others (2018).) However, we need to perform |$\mathcal{O}(n)$| array-popcount operations, so the top-level algorithm is still |$\mathcal{O}(n^2)$| if each popcount takes |$\mathcal{O}(n)$| time. Here, we provide a high-level description of how we get each array-popcount operation down to |$\mathcal{O}(\log n)$|. To simplify our discussion, we assume |$n$| is an integer power of |$2$|.
For each |$i$|, we define a binary tree |$\mathcal{G}^i$| with |$n$| leaves, each having distance |$\log_2 n$| from the root. At the |$k$|th level of |$\mathcal{G}^i$|, there are |$2^k$| nodes, and the |$j$|th node among them stores the sum
$$\begin{equation*} \sum_{l = (j-1)n/2^k+1}^{jn/2^k} b^i_l.\end{equation*}$$
For example, the root of |$\mathcal{G}^i$| stores the sum of |$B_i$|, the left child of the root stores the sum of the first half of |$B_i$|, and its left child stores the sum of the first quarter of |$B_i$|. The |$j$|th leaf of |$\mathcal{G}^i$| is exactly |$b^i_j$|.
With |$\mathcal{G}^i$|, computing (2.14) can be done within the same time complexity as traversing from the root of |$\mathcal{G}^i$| to the |$(u_i - 1)$|th leaf. Updating |$\mathcal{G}^i$| to |$\mathcal{G}^{i+1}$| can be done by setting the |$u_{i+1}$|th leaf from |$1$| to |$0$| and traversing back to the root, updating each stored partial sum along the way. Both operations are |$\mathcal{O}(\log n)$|. We describe them with the pseudocode in Algorithm 2.
In our implementation, each internal node in these trees has |$16$|–|$32$| children instead of |$2$| to better utilize the memory hierarchy. We do not actually build tree data structures, but use them as a conceptual device to describe our algorithm. In the package, these trees are represented by a stack of arrays, and accessing a node’s children, its parent, or a particular leaf takes |$\mathcal{O}(1)$|.
When there are tied predictions, we keep the definitions from steps 1–3. The computation in step |$4(a)$| then misses |$0.5$| times the number of ties at |$T_i$|. If, for some |$i$|, |$b_{u_i}^i$| is already flipped before step |$4(a)$| is done, then we know there is a tie at |$T_i$|, and the distance from |$u_i$| to the next unflipped bit is the number of ties already seen, so we can adjust accordingly. The tie-heavy version of the function maintains an extra table that lists the number of times each |$u_i$| has been seen. By looking up that table, it can immediately find the first unflipped bit instead of performing a potentially |$\mathcal{O}(n)$| scan.

Algorithm 2. |$\mathcal{O}(\log n)$| array popcount and tree update algorithm to compute (2.14).
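As a self-contained illustration of the counting scheme (not the bitarray/popcount implementation in the cindex package), the following R sketch computes the same tie-free C-index in |$\mathcal{O}(n\log n)$| using a Fenwick (binary indexed) tree in place of the binary tree |$\mathcal{G}^i$|; it assumes no tied event times, no tied predictions, and that the linear predictor |$\boldsymbol{X}\hat{\beta}$| is supplied as the vector `eta`.

```r
## Fenwick-tree C-index sketch: sweep individuals from the latest time to the
## earliest, maintaining counts of the prediction ranks already inserted
## (all of which have strictly larger time than the current individual).
cindex_fenwick <- function(time, status, eta) {
  n <- length(time)
  ord <- order(time)                                      # sort by increasing time
  status <- status[ord]
  u <- as.integer(rank(eta[ord], ties.method = "first"))  # prediction ranks
  tree <- integer(n)                                      # Fenwick tree over ranks
  concordant <- 0
  comparable <- 0
  for (i in n:1) {
    if (status[i] == 1) {
      ## count inserted individuals (time > time_i) predicted to be at strictly
      ## lower risk than individual i: prefix sum of the tree up to u[i] - 1
      k <- u[i] - 1L
      while (k > 0L) {
        concordant <- concordant + tree[k]
        k <- k - bitwAnd(k, -k)                           # drop lowest set bit
      }
      comparable <- comparable + (n - i)                  # all later-time individuals
    }
    k <- u[i]
    while (k <= n) {                                      # insert rank u[i]
      tree[k] <- tree[k] + 1L
      k <- k + bitwAnd(k, -k)
    }
  }
  concordant / comparable
}
```

For example, `cindex_fenwick(time, status, as.vector(X %*% beta_hat))` reproduces the unweighted, tie-free C-index described above; the Fenwick tree plays the role of the partial-sum tree, supporting both the prefix count in (2.14) and the point update in |$\mathcal{O}(\log n)$|.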
3. Results
3.1. UK Biobank age of diagnosis data preparation
We prepared an age-of-diagnosis dataset from the UK Biobank derived from Category 1712, the category containing data on the “first occurrence” of any code mapped to a 3-character International Classification of Diseases, 10th revision (ICD-10) code (see Supplementary material available at Biostatistics online).
Briefly, the data-fields have been generated by mapping the following to 3-character ICD-10 codes: Read code information in the Primary Care data (Category 3000); ICD-9 and ICD-10 codes in the Hospital inpatient data (Category 2000); ICD-10 codes in Death Register records (Field 40001, Field 40002); and self-reported medical condition codes (Field 20002) reported at the baseline or a subsequent UK Biobank assessment center visit.
For each code, two data-fields are available: the date the code was first recorded across any of the sources listed above, and the source in which it was first recorded, together with information on whether the code was subsequently recorded in at least one other source.
We used these data to compute an age of diagnosis using the month of birth (Data-Field 52) and year of birth (Data-Field 34).
3.2. Genetic data preparation
Here, we used genotype data from the UK Biobank dataset release version 2 and the hg19 human genome reference for all analyses in the study. To minimize variability due to population structure in our dataset, we restricted our analyses to |$337\,151$| unrelated White British individuals and used sex, genotyping array (UK Biobank participants were genotyped on two different platforms), and 10 principal components derived from the genotype data as covariates (described in detail in Supplementary material available at Biostatistics online).

We focused our analysis on variants with a minor allele frequency (MAF) greater than or equal to |$0.1\%$| for directly genotyped variants in either array, in addition to the human leukocyte antigen alleles (Bycroft and others, 2018) and copy number variants described in Aguirre and others (2019) for a total of 1.08 million variants.

snpnet-Cox paths. Each line in these plots corresponds to a variable in the best model. The horizontal axis represents the |$L^1$| norm of the estimated coefficients and the vertical axis represents the value of the coefficients. The path is computed at various levels of the regularization parameter. The ticks at the top of each plot indicate the number of variables selected. The first 12 variables are the covariates, including age, sex, and PC1-10.
We split our dataset into a |$70\%$| training set (|$n = 236\,004$|), a |$10\%$| validation set (|$n = 33\,716$|), and a |$20\%$| held-out test set (|$n = 67\,430$|), and applied snpnet-Cox with 50 iterations. We focus our analysis on the 306 ICD10 codes with at least |$950$| cases among the |$337\,151$| individuals.
3.3. snpnet-Cox results
We summarize the results across the 306 ICD10 codes, but focus our detailed analysis on four of them:
asthma (ICD10 code: J45),
gout (M10),
disorders of porphyrin and bilirubin metabolism (E80), and
atrial fibrillation and flutter (I48).
The Lasso paths for these phenotypes are illustrated in Figure 1, where the estimated individual parameter values are plotted against the |$L^1$| norm of |$\hat{\beta}$|, for a decreasing sequence of |$\lambda$|. For an individual with genotype |$x$|, we define the Polygenic Hazard Score (PHS) as |$\hat{\beta}^Tx$|, where |$\hat{\beta}$| is the vector of fitted regression coefficients obtained from snpnet-Cox. We assess the predictive power of the PHS on survival time using the individuals in the held-out test set. We applied three procedures to give a high-level overview of the results. First, we assessed whether the PHS was significantly associated with the time-to-event data in the held-out test set (so that we obtained a |$\textit{P}$|-value for each ICD10 code). Second, we computed the hazard ratio (HR) per standard deviation of the PHS and for different thresholded percentiles (top |$1\%$|, |$5\%$|, |$10\%$|, and bottom |$10\%$| compared to the 40–60|$\%$|) of |$\boldsymbol{X}\hat{\beta}$|. Third, we computed the C-index (Harrell and others, 1982).
The C-indices for the |$101$| ICD10 codes with PHS |$\textit{P} < 0.01$| range from 0.511 to 0.884 (see the Global Biobank Engine snpnet-Cox page, https://biobankengine.stanford.edu/snpnetcox), and the HR per standard deviation of PHS ranges from 1.042 to 13.167. The results further highlight the sparsity property of the Lasso Cox model implemented in snpnet-Cox, with some ICD10 codes having a single active variable and others almost 2000 active variables (e.g., non-insulin-dependent diabetes mellitus).
3.3.1. Asthma - J45
Motivated by the varying age of asthma onset, a common disease that affects a substantial fraction of young adults, we hypothesized that a PHS could capture individuals that are not only at higher risk of disease onset but also at a higher risk of developing asthma at a younger age.
Here, we estimate an HR of 1.428 per SD of PHS (C-index of 0.605), and HRs of 2.740, 2.137, and 1.825 for the top 1, 5, and 10|$\%$| of the PHS distribution compared to the 40–60%. Further, we find that |$14.2\%$| of individuals in the top |$1\%$| of the PHS developed asthma by age 20.5, compared to only |$1.1\%$| in the bottom |$10\%$| and |$3.2\%$| in the 40–60 percentile of the PHS (see Figure 2), which underscores the relevance of the PHS in the context of early onset of common diseases that are hypothesized to have a monogenic signature (Kelsen and Baldassano, 2017). The asthma PHS is composed of 1567 active variables, some of which are known from previous Genome-Wide Association Studies (GWAS) of traits related to asthma. As an example, we identify rs2381416 (MAF = 0.26), upstream of GTF3AP1, as associated with asthma with an effect size of |$-$|0.11. This variant has previously been found to be associated with eosinophil count (Gudbjartsson and others, 2009) and severity of childhood asthma (Smith and others, 2017).

Asthma. (A) The Kaplan–Meier curves for percentiles of the PHS for variants selected by snpnet-Cox, in the held-out test set (orange: top |$1\%$|, green: top |$5\%$|, red: top |$10\%$|, blue: 40–60|$\%$|, and brown: bottom 10|$\%$|); ticks represent censored observations. Highlighted are the proportions of asthma events by age 20 across the percentile scores. (B) Plot of snpnet-Cox coefficients for asthma with |$1567$| active variables. Green dots represent protein-altering variants.
3.3.2. Gout - M10
Gout is a common disease, affecting at least |$1\%$| of men in Western countries, with a strong male to female imbalance (Terkeltaub, 2003). It is a form of arthritis caused by excess uric acid in the bloodstream and characterized by severe pain, redness, and tenderness in joints.
In the UK Biobank study, we estimate an HR of 1.679 per SD of PHS (C-index of 0.649), and HRs of 3.70, 2.502, and 2.073 for the top 1, 5, and 10|$\%$| of the PHS distribution compared to the 40–60%. Further, we find that |$4.89\%$| of individuals in the top |$1\%$| of the PHS developed gout by age 50.1, compared to only |$0.30\%$| in the bottom |$10\%$| and |$1.02\%$| in the 40–60 percentile of the PHS (see Figure 3). The gout PHS consists of 1970 active variables, and we identify loci that have been reported in prior GWAS (Dehghan and others, 2008).

Gout. (A) The Kaplan–Meier curves for percentiles of the PHS for variants selected by snpnet-Cox, in the held-out test set (orange: top |$1\%$|, green: top |$5\%$|, red: top |$10\%$|, blue: 40–60|$\%$|, and brown: bottom 10|$\%$|); ticks represent censored observations. Highlighted are the proportions of gout events by age 50 across the percentile scores. (B) Plot of snpnet-Cox coefficients for gout with |$1970$| active variables. Green dots represent protein-altering variants.
3.3.3. Disorders of porphyrin and bilirubin metabolism - E80
Bilirubin, which is the principal component of bile pigments, is the end product of the catabolism of the heme moiety of hemoglobin and other hemoproteins. If bilirubin is produced in excessive amounts or hepatic excretion of bilirubin into bile is defective, the concentration of bilirubin in the blood and tissues increases, which may result in jaundice (Bosma, 2003), a well-recognizable symptom of liver disease.
We estimate an HR of 13.167 per SD of PHS (C-index of 0.884). Here, the snpnet-Cox algorithm finds an extremely sparse solution, with only two active variables (see Figure 4). One of the active variables is the intronic variant rs6742078 (MAF = 0.31) in UGT1A4, which encodes a UDP (uridine diphosphate)-glucuronosyltransferase, an enzyme that transforms small lipophilic molecules such as bilirubin (Tukey and Strassburg, 2000).

Disorders of porphyrin and bilirubin metabolism. (A) The Kaplan–Meier curves for percentiles of the PHS for variants selected by snpnet-Cox, in the held-out test set (orange: top |$1\%$|, green: top |$5\%$|, red: top |$10\%$|, blue: 40–60|$\%$|, and brown: bottom 10|$\%$|); ticks represent censored observations. Highlighted are the proportions of disorders of porphyrin and bilirubin metabolism events by age 60 across the percentile scores. (B) Plot of snpnet-Cox coefficients for disorders of porphyrin and bilirubin metabolism with |$2$| active variables. Green dots represent protein-altering variants.
3.3.4. Atrial fibrillation and flutter - I48
Atrial fibrillation is the most common type of arrhythmia in adults. The prevalence increases from less than |$1\%$| in persons younger than 60 years of age to more than |$8\%$| in those older than 80 years of age (McNamara and others, 2003). Earlier onset of atrial fibrillation is believed to have a strong genetic component, and whether that component is more polygenic or monogenic is currently unknown.
In the UK Biobank study, we estimate an HR of 1.466 per SD of PHS (C-index of 0.618), and HRs of 3.883, 2.319, and 1.861 for the top 1, 5, and 10|$\%$| of the PHS distribution compared to the 40–60%. Further, we find that |$6.57\%$| of individuals in the top |$1\%$| of the PHS developed atrial fibrillation by age 60, compared to only |$0.70\%$| in the bottom |$10\%$| and |$1.41\%$| in the 40–60 percentile of the PHS (see Figure 5), which underscores the relevance of the PHS in the context of early onset of atrial fibrillation.

Atrial fibrillation. (A) The Kaplan–Meier curves for percentiles of the PHS for variants selected by snpnet-Cox, in the held-out test set (orange: top |$1\%$|, green: top |$5\%$|, red: top |$10\%$|, blue: 40–60|$\%$|, and brown: bottom 10|$\%$|); ticks represent censored observations. Highlighted are the proportions of atrial fibrillation events by age 60 across the percentile scores. (B) Plot of snpnet-Cox coefficients for atrial fibrillation with |$1604$| active variables. Green dots represent protein-altering variants.
4. Discussion
In this article, we adapted the batch screening iterative Lasso (BASIL) algorithm (Qian and others, 2019) to compute the Lasso path of the Cox proportional hazard model. We implemented an optimized C-index function, which computes the C-index of a fitted Cox model in |$O(n \log n)$| time with an excellent constant factor. Our method was applied to the UK Biobank dataset to identify genetic variants that are associated with time-to-event phenotypes and to build PHSs. Visualizations of snpnet-Cox results across 306 ICD10 codes are available in the Global Biobank Engine (https://biobankengine.stanford.edu/snpnetcox) (McInnes and others, 2019).
Our current approach does have limitations, which we hope to resolve in future work. First, we assume that individuals have independent survival times (conditional on the features). This may become a limitation as population-scale cohorts, especially in population isolates such as Finland, sample related individuals. Second, we do not provide procedures to estimate confidence intervals for the selected variables, which may be useful in communicating confidence in a single active variable (Taylor and Tibshirani, 2015). Third, as we move towards whole-genome sequencing data, where a large fraction of the variants discovered will be rare, i.e., observed in only a handful of individuals, the validation accuracy used to evaluate a fitted |$\hat{\beta}$| may need to be redefined. Fourth, we do not consider time-varying coefficients or time-varying covariates, which may improve inference in settings where features are measured multiple times. We anticipate addressing these directions in future work.
5. Software
We provide the implementation of our approach in the publicly available package snpnet, available at https://github.com/rivas-lab/snpnet, with the cindex package dependency available at https://github.com/chrchang/plink-ng/tree/master/2.0/cindex. The analysis results are published on figshare at https://figshare.com/articles/snpnet-Cox_results/12368294.
Supplementary material
Supplementary material is available at http://biostatistics.oxfordjournals.org.
Acknowledgments
Conflict of Interest: None declared.
Funding
Stanford University to R.L.; The Two Sigma Graduate Fellowship to J.Q., in part; Funai Overseas Scholarship from Funai Foundation for Information Technology and the Stanford University School of Medicine to Y.T. Stanford University and a National Institute of Health center for Multi and Trans-ethnic Mapping of Mendelian and Complex Diseases grant (5U01 HG009080) to M.A.R.; National Human Genome Research Institute (NHGRI) of the National Institutes of Health (NIH) under awards (R01HG010140). The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health; NIH (5R01 EB001988-16) and NSF (19 DMS1208164) to R.T., in part; the National Science Foundation (DMS-1407548) to T.H., in part; The National Institutes of Health (5R01 EB 001988-21).