COMETS Analytics: An Online Tool for Analyzing and Meta-Analyzing Metabolomics Data in Large Research Consortia

Table 1

Workflow for Performing a Consortium Meta-Analysis Using COMETS Analytics (Web App or Local R^a Package)

Step	Actions in Step
Standardized data and models	Project coordinator^b prepares and sends an Excel sheet with “VarMap” and “Models” tabs to cohort analysts. Cohort analysts add their data to the “Metabolites,” “Subject Metabolites,” and “Subjects” tabs to complete the input file.
Input integrity check and harmonization	Cohort analysts open COMETS Analytics, select their input file, and click “check integrity.” If the data set passes the quality control check, cohort analysts continue to the analysis. If the data set fails the quality control check, cohort analysts read the error messages and make appropriate fixes.
Cohort-specific analyses with model validity check	Cohort analysts add their e-mail address (online app only). Cohort analysts click “run model.” Cohort analysts forward results and documentation from COMETS Analytics to project coordinator.
Standardized results aggregation	Project coordinator oversees harmonization of metabolite names from COMETS Analytics files. Project coordinator oversees meta-analysis. Project coordinator prepares manuscript based on meta-analysis results.

Abbreviations: app, application; COMETS, Consortium of Metabolomics Studies.

^a R Foundation for Statistical Computing, Vienna, Austria.

^b “Project coordinator” refers to an investigator or project lead.

As compared with other online metabolomics data analysis programs (e.g., MetaboAnalyst (36) and XCMSOnline (37)), COMETS Analytics permits adjustment for confounders (i.e., multivariable modeling) and includes novel features that support consortium-based research. For example, COMETS Analytics checks variables to ensure analytical integrity, provides descriptive statistics, documents the models, and outputs results in a meta-analysis–ready format. COMETS Analytics does not require software installation, specially prepared servers, or user expertise beyond that of how to prepare the initial data inputs. An important advantage of COMETS Analytics over sending code for specific statistical models to each center is that its “robust analytics” enable the same variable coding and the same models to be used for all participating cohorts (see “Standardized models” section below). This eliminates hand-customized code, thereby reducing coding errors and improving analytical transparency and reproducibility.

Our design for COMETS Analytics focused on 7 key design requirements. To ensure that collaborating partners are able to use the system, we determined that COMETS Analytics should 1) be highly usable and 2) protect data. To ensure that code can be maintained over time, we determined that COMETS Analytics should 3) follow current best practices for research software. Additionally, to ensure that the data analysis process is as seamless as possible, we determined that COMETS Analytics needs clear standards for 4) data input, 5) models, and 6) results. To maximize the utility of COMETS Analytics to researchers at large, we also determined that it should potentially be 7) applicable to other -omics data. We describe these requirements in Table 2 and discuss how COMETS Analytics meets each in turn below.

Table 2

Requirements Underlying the Design of COMETS Analytics

Key Requirement	Specific Requirement
High usability and accessibility	Requires no specialized software or expertise to use; includes interactive components so that users can explore data; runs tens of thousands of statistical models efficiently; reuses standardized model input for each analytical cycle; includes a step-by-step tutorial (35)
Data protection	Prioritizes and ensures data protection; includes a stand-alone package for institutes whose data policies prevent them from utilizing Web-based data analysis apps
Use of best practices for coding and algorithm development	Developed using the R^a statistical language, which is extensively used in high-throughput data analyses and has a very active user base and community; documents the development process and makes software code, tutorial, and documentation publicly available
Standardized data input	Includes a clear scheme for data formatting and model coding; performs basic checks to ensure data and model integrity; contains metadata needed to harmonize metabolite names
Standardized models	Runs models and modifies them as needed to avoid showstopper errors; documents any model issues and/or modifications
Standardized and well-documented results	Outputs reproducible results in standard format ready for meta-analysis; documents the exact models executed; displays distribution of covariates for each data set
Applicability to other -omics data	Uses a data structure that is generalizable to other -omics data; data inputs can be modified by simply listing identifiers and names for the analytes of interest in the “Metabolites” table and their values in the “Subject Metabolites” table and running analyses

Abbreviations: app, application; COMETS, Consortium of Metabolomics Studies.

^a R Foundation for Statistical Computing, Vienna, Austria.

High usability and accessibility

Consortium analyses include many institutes and centers, with varying levels of expertise in data analysis. To accommodate this broad audience of users, we emphasized high usability and accessibility in our system development. Toward this end, COMETS Analytics uses data inputs that resemble what analysts of prospective cohort studies use in their day-to-day work: data file formats with participants as rows, variables as columns, and plain-language variable names (e.g., BMI for body mass index). In addition, we use Microsoft Excel (Microsoft Corporation, Redmond, Washington) as the standard format for the common data input, since most analysts are familiar with it and it has nearly universal accessibility (an estimated 1.2 billion licenses worldwide (38)). For the Web app, the user interface has interactive components that do not require special expertise to use and can provide real-time results, thus allowing the integrity of data inputs to be tested and confirmed. The system executes models quickly to accommodate the hundreds to thousands of metabolites and many models that comprise metabolomics analyses. Data analyses are hosted on Amazon Web Services (Amazon Web Services, Inc., Seattle, Washington) so that additional servers can run computationally intensive analyses in parallel, as needed. At present, using our sample input file, it takes less than 30 seconds to analyze the association of an exposure (e.g., BMI, age) with levels of 611 metabolites in 1,000 participants, adjusted for various factors.

For local use, the stand-alone R package can be installed using the devtools R package. The package includes extensive documentation for each function and a vignette that explains all analysis steps, from input to output. The Web-based app and the stand-alone R package are synchronized automatically through GitHub, ensuring that both have identical functionality and yield identical results.

Data protection

We designed COMETS Analytics to run as either a stand-alone R package or a Web-based app so that we could accommodate the different data protection concerns of participating cohorts. With the stand-alone R package, all aspects of the analysis are run locally, and analysts transfer only the summary results for central meta-analyses. With the Web app, the following measures are taken to secure data in transit, during analyses, and after analyses (Figure 1): 1) input data upload: when a user uploads an input file to the server, the data are encrypted using SSL (SSL Corporation, Houston, Texas) to protect the contents of data in transit; 2) input data retention: the input data are deleted immediately after completion of the data integrity check and analyses, which usually take only a few seconds; and 3) summary results retention: the summary results are stored in a secured private S3 bucket where data are encrypted at rest (Amazon S3 server-side encryption) using the 256-bit Advanced Encryption Standard (AES-256). All results are deleted automatically after 7 days by a configured S3 bucket life-cycle policy.
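As a rough illustration of the retention and encryption settings described above, the following Python sketch builds the two configuration dictionaries in the shape that the boto3 S3 client accepts for `put_bucket_lifecycle_configuration` and `put_bucket_encryption`. The actual COMETS Analytics deployment configuration is not published in this article, so the rule name and the empty prefix are assumptions; only the 7-day expiry and AES-256 come from the text.

```python
RESULTS_RETENTION_DAYS = 7  # from the text: results are deleted after 7 days


def build_lifecycle_configuration(days=RESULTS_RETENTION_DAYS):
    """Lifecycle rules that expire every object in the bucket after `days` days."""
    return {
        "Rules": [
            {
                "ID": "expire-summary-results",  # hypothetical rule name
                "Filter": {"Prefix": ""},        # assumption: rule covers all objects
                "Status": "Enabled",
                "Expiration": {"Days": days},
            }
        ]
    }


def build_encryption_configuration():
    """Default server-side encryption at rest with S3-managed AES-256 keys."""
    return {
        "Rules": [
            {"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}
        ]
    }
```

These dictionaries would be passed as the `LifecycleConfiguration` and `ServerSideEncryptionConfiguration` arguments, respectively, together with the bucket name.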

Figure 1

Security of the data flow when running COMETS Analytics through the Web application (https://www.comets-analytics.org). AES, Advanced Encryption Standard; AWS, Amazon Web Services; COMETS, Consortium of Metabolomics Studies; EC2, Elastic Compute Cloud; SSL, secure sockets layer; VPC, virtual private cloud.

Use of best practices for data stewardship and research software development

We chose R as the base language for COMETS Analytics because it is widely used, free, and open-source and has an impressive collection of well-documented analytical packages and algorithms. We used GitHub, a free platform built on the open-source Git version control system, to allow multiple developers to code simultaneously and to support reporting and tracking of bugs and fixes throughout package development. We used Travis CI (https://travis-ci.org/) and AppVeyor (https://www.appveyor.com/) to test compatibility of the software across operating systems (e.g., Windows, Unix/Linux, Mac OS). We tested overall functionality at 4 different levels (unit testing, integration testing, system testing, and user acceptance testing) using the R testthat package. The unit tests evaluate analytical stability by comparing results of the 27 models in our sample input file with previously obtained benchmark results. A selection of benchmark results was also confirmed to be identical when running models in SAS (SAS Institute, Inc., Cary, North Carolina) and Stata (StataCorp LLC, College Station, Texas). The infrastructure supports parallel development of analytical modules that invoke common core functions for data and model integrity.

COMETS Analytics is designed to be open and transparent. The software code for COMETS Analytics is publicly posted in our GitHub repository (35). The metabolites, their metadata, and information on how they were harmonized are available through a link on the COMETS Analytics website and through GitHub, and results are output in a standard format that can be reused in new analyses. In addition, specific features of COMETS Analytics follow best practices in software development, including 1) defined keywords, meta-information on software, and registry registration to enhance the findability of the software; 2) detailed documentation, versioning, and licensing, and the ability to download different versions of the software through GitHub or access the latest version of the software through a National Cancer Institute–supported server; 3) build and deployment testing through continuous integration and deployment (Travis CI and AppVeyor); and 4) implementation of unit testing to ensure the reproducibility of results using test data sets.

Standardized data input

Smooth execution of data analyses and meta-analysis requires standardized inputs, models, and results. The required input for COMETS Analytics is an Excel file that contains 5 sheets (“Metabolites,” “Subject Metabolites,” “Subjects,” “VarMap,” and “Models”). The use of separate tables for metabolite metadata, subject-specific metabolite data, and subject-specific covariate data is, in our experience, the most efficient and flexible way to manage metabolomics data and should be considered a best practice. A sample input file is available at COMETS-Analytics.org and in the R package (“inst/extdata” folder). We present an overview of each sheet in Figure 2 and describe them in detail below.

Figure 2

High-level overview of the process for developing standardized data inputs, models, and results in COMETS Analytics. COMETS, Consortium of Metabolomics Studies.

The “Metabolites” sheet captures metabolite metadata. Each row is a metabolite, and the columns are the metadata. Only 2 columns are required: one for the metabolite identifier (typically an R- or SAS-compliant name) and one for the biochemical name. Other metabolite identifiers are optional but desirable, as they enhance metabolite harmonization. These identifiers, provided by the laboratories, may include Human Metabolome Database (HMDB), PubChem, Metabolon ID, or InChIKey. Because metabolomics lacks universal identifiers comparable to the reference SNP (rs) accession numbers used in genomics (39), this sheet provides a mechanism for mapping metabolites across multiple laboratory platforms that utilize different nomenclatures. We stress that these data are vital for comparing findings across studies, that it is imperative for laboratories generating metabolomics data to provide them routinely, and that researchers should preserve these identifiers.

The “Subject Metabolites” sheet contains metabolite levels, as input by the participating institute. Each row is a study participant, and columns are the metabolite levels. Since data acquisition varies by platform, we rely on the expertise of each center to achieve optimal data preprocessing (normalization, imputation, and transformation), though COMETS is developing reference samples that may facilitate preprocessing in future studies. At present, no missing data are permitted for metabolite levels. In our test application, each group imputed missing data using their own standard procedures and reported procedures to the lead investigators.

The “Subjects” sheet includes covariate data. Each row is a study participant, and the columns are the covariates (age, sex, BMI, smoking status, etc.). All cells in this sheet must contain a value, but this could include a code for missing values as specified in the “VarMap” sheet.

The “VarMap” sheet is the data dictionary that cohort study investigators use to code their variables. Each row is a variable, and the columns contain details about how that variable should be coded. The principal investigator leading the meta-analysis establishes the desired coding (e.g., for sex, 0 = male and 1 = female) and then disseminates the sheet to participating institutes and centers so they can code their data (further details are provided in the “VarMap” section (currently section 2.1.4) of the tutorial (35)). The sheet is to be used without modification by participating centers.

The “Models” sheet specifies the models for the analysis. Each row is a model, and the columns provide detailed information on that model, including the exposures, adjustments, any stratifications, and model type (further details are provided in the “Models” section (currently section 2.1.5) of the tutorial (35)). Like the “VarMap” sheet, this sheet is prepared by the principal investigator leading the meta-analysis and is to be used without modification by participating centers. These templates for coding variables and writing models ensure that instructions for variable coding and statistical models are complete and unambiguous. Moreover, they are both human- and machine-readable, which should minimize errors in translation, and simple enough that principal investigators without formal training in programming can complete them. Since the templates are easy to visually scan (as opposed to code or written text), they may also make it easier to catch gaps in the coding.

The last step in preparing the standardized data input is to check the integrity of the data and model(s) (see “Correlate” tab, “Integrity Check” button)—that is, whether variables appropriately match between sheets. For example, COMETS Analytics will automatically return an error if metabolites in the “Subject Metabolites” sheet lack a match in “Metabolites” or participant identifiers in “Subject Metabolites” lack a match in “Subjects,” or if cells have missing data. The error message itself describes the needed fix.
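The cross-sheet matching described above can be sketched as follows. This is an illustrative Python sketch, not the COMETS Analytics implementation (which is written in R): each sheet is represented as a list of row dictionaries, the participant identifier column name `id` is a hypothetical choice, and the error wording is invented for the example. The `metabid` column name is the one used for the “Metabolites” sheet.

```python
ID_COLUMN = "id"  # hypothetical name for the participant identifier column


def is_missing(value):
    """Treat None and blank strings as missing cells."""
    return value is None or (isinstance(value, str) and value.strip() == "")


def check_integrity(metabolites, subject_metabolites, subjects):
    """Return a list of error messages; an empty list means the input passes."""
    errors = []
    known_metabolites = {row["metabid"] for row in metabolites}
    known_subjects = {row[ID_COLUMN] for row in subjects}

    # Every metabolite column in "Subject Metabolites" needs a row in "Metabolites".
    measured = set()
    for row in subject_metabolites:
        measured.update(col for col in row if col != ID_COLUMN)
    for metab in sorted(measured - known_metabolites):
        errors.append(f"Metabolite '{metab}' in 'Subject Metabolites' has no match in 'Metabolites'.")

    # Every participant in "Subject Metabolites" needs a row in "Subjects".
    for row in subject_metabolites:
        if row[ID_COLUMN] not in known_subjects:
            errors.append(f"Participant '{row[ID_COLUMN]}' in 'Subject Metabolites' has no match in 'Subjects'.")

    # No cell may be empty in either participant-level sheet.
    for sheet, rows in (("Subject Metabolites", subject_metabolites), ("Subjects", subjects)):
        for row in rows:
            for col, value in row.items():
                if is_missing(value):
                    errors.append(f"Missing value for '{col}' (participant '{row[ID_COLUMN]}') in '{sheet}'.")
    return errors
```

Returning every problem at once, rather than stopping at the first, mirrors the idea that the error messages themselves should tell the analyst what to fix.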

Standardized models

In its current version, COMETS Analytics supports unadjusted and partial correlation analyses based on the ppcor R package (40), with generalized models that allow for logistic regression and proportional hazards regression in development. The layout of the correlation analysis is shown in Figure 3, with separate panels for input (A) and output (B). On the input side, users can choose from 3 different methods of analysis. The “custom” model allows users to select exposures, outcomes, covariates, and stratifications directly through the user interface. The “prespecified” model allows users to select from a prepopulated list of models from the input file’s “Models” sheet. The “all models” option prompts users to enter an e-mail address, and then runs all analyses in the “Models” sheet and e-mails users a link to the results.
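To make the notion of an adjusted (partial) rank correlation concrete, the following Python sketch computes a Spearman correlation between an exposure and a metabolite, adjusted for a single covariate, using the standard first-order partial correlation formula applied to ranks. It is a simplified illustration only: COMETS Analytics relies on the ppcor R package and adjusts for multiple covariates, and the function names here are hypothetical.

```python
import math


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def partial_spearman(x, y, z):
    """Spearman correlation of x and y, adjusted for one covariate z."""
    r_xy, r_xz, r_yz = spearman(x, y), spearman(x, z), spearman(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

When the covariate is uncorrelated with both the exposure and the metabolite, the adjusted and unadjusted coefficients coincide, which is a convenient sanity check.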

Figure 3

Running analyses with the COMETS Analytics Web application (https://www.comets-analytics.org) using the “custom” models option. Data analysis in COMETS Analytics starts in the data analysis panel (A). The user specifies his/her cohort from a list, selects the file, and presses the “check integrity” button. If the data set passes the integrity check, the user can select one of the 3 modes of analysis. When the user selects “custom model,” he/she can then select exposures (e.g., age) and outcomes (e.g., “all metabolites”) from the data input file, as well as model covariates to adjust for or stratify by. When the user clicks “run model,” results will appear in the results panel (B). This panel prints results for each outcome × exposure permutation, and users can sort results and download them as a comma-separated values (CSV) file. If the user selects “all models” in the data analysis panel, the analyses will instead run in the background, and the user will be sent an e-mail with a link to results once analyses are complete. COMETS, Consortium of Metabolomics Studies.

A key asset of COMETS Analytics is that it deploys a system of “robust analytics” for diagnosing and handling showstopper errors on the fly, particularly the issue of singular matrices. This common issue arises when 1 or more covariates are a perfect linear combination of the others, so that the model cannot be estimated. Typically, researchers address this by recoding data inputs, such as by combining categorical variables. However, this process is time-consuming, error-prone, and often poorly documented. The “robust analytics” system, in contrast, diagnoses the singularity of each model in advance, identifies the covariate most implicated, drops it (a mathematically valid solution, since the most implicated covariates are implicitly incorporated when models are singular), and repeats the process until models converge. This eliminates hand-coded data inputs, thereby reducing coding errors and improving analytical transparency and reproducibility.

COMETS Analytics performs 4 specific checks to ensure that models are executable. The first check eliminates analyses with too few participants (the threshold is 25 participants in the present iteration of COMETS Analytics). The second check determines whether any covariates have 0 or near-0 variance (nearZeroVar function in the caret R package (41)) and, if so, removes them from the models. The third check identifies and removes covariates that are highly collinear in multivariable models (based on eigenvalues; trim.matrix function in the subselect R package (42)). The fourth check detects linear dependencies among the remaining covariates and removes covariates that have linear dependency (findLinearCombos function in the caret R package). Each check is done for the whole data set and for each stratum in stratified analyses, and any modifications are noted on-screen or in the results e-mail.
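The logic of these checks can be sketched in a few lines. The Python code below is a simplified illustration under stated assumptions, not the package's R implementation: it treats the participant threshold as “at least 25,” replaces the near-0 variance test with an exact single-value test, omits the eigenvalue-based collinearity trimming, and detects exact linear dependencies (including dependence on the intercept) by Gram-Schmidt orthogonalization, dropping the later column of a dependent set. All function names are hypothetical.

```python
import math

MIN_PARTICIPANTS = 25  # threshold named in the text; "at least 25" is an assumption


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def drop_constant(covariates):
    """Remove covariates that take a single value (a stand-in for near-0 variance)."""
    kept, dropped = {}, []
    for name, column in covariates.items():
        if len(set(column)) > 1:
            kept[name] = column
        else:
            dropped.append(name)
    return kept, dropped


def drop_linear_dependencies(covariates, tol=1e-8):
    """Drop each covariate that is a linear combination of the intercept and earlier ones."""
    if not covariates:
        return {}, []
    n = len(next(iter(covariates.values())))
    basis = [[1.0] * n]  # start with the intercept column
    kept, dropped = {}, []
    for name, column in covariates.items():
        residual = [float(v) for v in column]
        for b in basis:  # remove the part explained by columns already kept
            scale = dot(residual, b) / dot(b, b)
            residual = [r - scale * v for r, v in zip(residual, b)]
        if math.sqrt(dot(residual, residual)) <= tol * max(1.0, math.sqrt(dot(column, column))):
            dropped.append(name)
        else:
            basis.append(residual)
            kept[name] = column
    return kept, dropped


def prepare_covariates(covariates):
    """Return (covariate names to keep, notes documenting each modification)."""
    n = len(next(iter(covariates.values()), []))
    if n < MIN_PARTICIPANTS:
        return None, [f"analysis skipped: only {n} participants"]
    kept, constant = drop_constant(covariates)
    notes = [f"dropped {name}: no variance" for name in constant]
    kept, dependent = drop_linear_dependencies(kept)
    notes += [f"dropped {name}: linear combination of other covariates" for name in dependent]
    return list(kept), notes
```

For a cohort of women only, for example, a sex covariate would be dropped by the first filter and the note would record why, which is the kind of documentation the results files are meant to carry.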

Standardized and well-documented results

In the “custom” and “prespecified” model modes, results are output in the right-hand panel of the screen, and they include Spearman correlation coefficients, P values, and labels that identify the model (outcome, exposure, and adjustment variables). Users can sort results by clicking on column headers, visualize them in the “heat map” tab (further details are presented in the tutorial), or download them by clicking “download.”

Using the Web-based “all models” mode, users are sent an e-mail with a link to several files, including 1) results, 2) descriptive statistics, 3) metabolite metadata, and 4) input metadata. “Results” files are like those produced in the modes above, with standardized names that make it possible to automate meta-analyses across cohorts. The “descriptive statistics” file provides mean values and percentiles (continuous variables) and frequencies (categorical variables) for participant covariates and metabolite levels, and the “metabolite metadata” file includes metabolite details. Finally, the “input metadata” file includes the originally submitted “VarMap” and “Models” sheets, which document the exact coding and models used in the analyses. Running all models locally through the COMETS Analytics R package produces the same results in zipped format, which can likewise be sent to the coordinating center for meta-analyses.
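To show why standardized names matter for automation, the following Python sketch groups result rows from several cohorts by model, exposure, and outcome and pools the correlations. The pooling method is an assumption made for illustration (a fixed-effect average of Fisher z-transformed correlations weighted by n − 3); the article does not prescribe a meta-analysis method, and the field names `model`, `exposure`, `outcome`, `corr`, and `n` are hypothetical stand-ins for the standardized output columns.

```python
import math
from collections import defaultdict


def pool_correlations(rows):
    """Pool per-cohort correlations that share the same model, exposure, and outcome.

    rows: iterable of dicts with keys "model", "exposure", "outcome", "corr", "n".
    Returns a dict keyed by (model, exposure, outcome).
    """
    groups = defaultdict(list)
    for row in rows:
        groups[(row["model"], row["exposure"], row["outcome"])].append(row)

    pooled = {}
    for key, members in groups.items():
        weights = [m["n"] - 3 for m in members]            # inverse-variance weights on the z scale
        z_values = [math.atanh(m["corr"]) for m in members]  # Fisher z transform
        z_bar = sum(w * z for w, z in zip(weights, z_values)) / sum(weights)
        pooled[key] = {
            "corr": math.tanh(z_bar),                       # back-transform to a correlation
            "cohorts": len(members),
            "n": sum(m["n"] for m in members),
        }
    return pooled
```

Because every cohort reports the same identifiers for each model and variable, the grouping key can be built directly from the files without manual matching.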

Applicability to other -omics data

Although COMETS Analytics was designed to analyze metabolomics data, the flexible data structure is applicable to other -omics data. For example, transcriptomic or proteomic data can be input as a gene or protein metadata table in the “Metabolites” sheet, a table of subject gene or normalized protein abundances in the “Subject Metabolites” sheet, and a table of subject covariates in the “Subjects” sheet. To analyze these other -omics data with COMETS Analytics, users would store gene or protein identifiers (e.g., IGF1BP_1) and names (e.g., insulin-like growth factor-binding protein 1) in the “metabid” and “metabolite_name” columns of the “Metabolites” table and gene or protein abundances in the “Subject Metabolites” table. No other data modifications are needed, and data analyses would be fully functional, even though the input data are proteomics or transcriptomics rather than metabolomics.
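The mapping described above amounts to relabeling columns. As a minimal Python sketch (the function name, the `id` column, and the input shapes are hypothetical; only “metabid” and “metabolite_name” come from the text), protein data could be arranged into the two tables like this:

```python
def to_comets_tables(analytes, abundances):
    """Arrange non-metabolite analytes into the "Metabolites" and "Subject Metabolites" layout.

    analytes:   list of (identifier, name) pairs, e.g. genes or proteins
    abundances: dict mapping participant id -> {identifier: abundance}
    """
    metabolites = [
        {"metabid": identifier, "metabolite_name": name}
        for identifier, name in analytes
    ]
    identifiers = [identifier for identifier, _ in analytes]
    subject_metabolites = []
    for subject_id, values in abundances.items():
        row = {"id": subject_id}  # "id" is a hypothetical identifier column name
        for identifier in identifiers:
            row[identifier] = values[identifier]  # raises KeyError if a value is absent
        subject_metabolites.append(row)
    return metabolites, subject_metabolites
```

Each participant row carries one column per analyte identifier, so the downstream integrity checks and models see the same structure they would for metabolites.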

TEST APPLICATION

We used COMETS Analytics to conduct 2 different consortium-based analyses, one focused on age–metabolite associations and the other focused on BMI–metabolite associations (manuscripts in preparation). Age and BMI are among the most important risk factors and potential confounders in epidemiologic analyses. Understanding how each factor relates to metabolism will be crucial for interpreting future metabolite–disease associations.

These analyses together included 68 data sets from 45 cohorts with 134,742 research participants, with metabolomics data generated on an aggregate 14 different metabolomics platforms. Over the course of these 2 analyses, we evaluated 27 different models (12 for the age analysis and 15 for the BMI analysis, as shown in the sample input file) adjusted for confounders and stratified by factors of interest (sex, prior heart disease, etc.). These rich, large-scale analyses provided a thorough test of the app’s robustness when applied to real-world data and under conditions of high processing loads.

We evaluated COMETS Analytics’ performance according to 2 main criteria: 1) whether analysts could prepare data inputs on their own and 2) whether input files were correctly processed. With respect to the first criterion, we found that analysts had only minor issues preparing data inputs. The most common problems were that analysts did not recode missing covariate data according to instructions or they did not impute missing metabolite data. We made trained staff available by phone to help analysts with these issues; these calls typically took 10–15 minutes. With assistance, all analysts were able to prepare data inputs correctly. Nearly all analysts used the Web app; the one analyst who used the R stand-alone application reported no difficulties. The stand-alone app will probably receive increased real-world use in the future, given the 2018 implementation of the European Union General Data Protection Regulation.

With respect to the second criterion, we found that all protocol-specified models executed on the first pass. In our experience, this does not often occur in consortium-based analyses. More typically, protocol-specified models will fail due to irrelevant covariate adjustments (e.g., adjusting for sex in a cohort of women only), and analysts will drop covariates in a trial-and-error fashion until models execute. COMETS Analytics, in contrast, algorithmically diagnoses, documents, and fixes model issues, without requiring code or data inputs to be modified. We observed no errors in the application of these algorithms over approximately 1,600 models (60 data sets × 27 models). Our “robust analytics” system thus presents a promising way to streamline analyses and alleviate analyst burden.

Our documentation also underscores why automated approaches are needed. In aggregate, we found that more than 3,000 model fixes were needed to ensure that all models executed—an average of 80 fixes per cohort and 3 fixes per model per cohort. Approximately 90% of fixes involved removing covariates with negligible variance or only 1 value—for example, removing sex from the models for studies of women. To our knowledge, no other published articles have examined the frequency of model failures in large research consortia.
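
The most common fix, dropping covariates that have only 1 value or negligible variance, can be illustrated with a short sketch. It is not the algorithm implemented in COMETS Analytics, which this section does not spell out: the prune_covariates function, the variance threshold, and the example cohort are assumptions made for illustration.

```python
from statistics import pvariance


def prune_covariates(rows, covariates, min_variance=1e-8):
    """Split covariates into those kept and those dropped, with a reason.

    min_variance is an assumed cutoff, not a documented COMETS setting.
    """
    kept, fixes = [], []
    for name in covariates:
        values = [row[name] for row in rows if row.get(name) is not None]
        numeric = bool(values) and all(isinstance(v, (int, float)) for v in values)
        if len(set(values)) < 2:
            fixes.append((name, "fewer than 2 distinct values"))
        elif numeric and pvariance(values) < min_variance:
            fixes.append((name, "negligible variance"))
        else:
            kept.append(name)
    return kept, fixes


# Hypothetical cohort of women only: "sex" cannot be adjusted for.
cohort = [
    {"age": 52, "sex": "F", "bmi": 24.3},
    {"age": 61, "sex": "F", "bmi": 29.8},
    {"age": 47, "sex": "F", "bmi": 22.0},
]

kept, fixes = prune_covariates(cohort, ["age", "sex", "bmi"])
print("kept:", kept)
for name, reason in fixes:
    print(f"dropped {name}: {reason}")
```

Returning the list of fixes alongside the surviving covariates mirrors the idea of documenting each change rather than silently altering the model.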

One key product of our test application was the development of a metabolite dictionary that links metabolite names across participating cohorts (available at https://www.comets-analytics.org). Metabolites in this dictionary were matched across studies using unique identifiers, such as HMDB or InChIKey identifiers, that cohorts provided with their metadata, or by metabolite name when other metadata were lacking. We did not collect information on the level of confidence of metabolite identities as defined by the Metabolomics Standards Initiative (43), since metabolomics laboratories historically have not provided these data to epidemiologic researchers. That said, many of the platforms are targeted and confirm metabolite identities against known standards (e.g., Biocrates (Biocrates Life Sciences AG, Innsbruck, Austria)). The most frequently used laboratory, Metabolon, Inc. (Morrisville, North Carolina), has documented that its metabolites are usually identified at the tier 1 level of confidence (44). Since some represented laboratories have not confirmed metabolite identities, we treat matches as provisional rather than definite. Researchers conducting consortium-based analyses using our metabolite-matching scheme should carefully evaluate the heterogeneity of associations by metabolomics platform in case of mismatch.
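
The matching order described above (unique identifier first, metabolite name as a fallback) might look roughly like the following sketch. The field names hmdb_id and inchikey, the placeholder identifier values, and both functions are hypothetical and do not reflect the actual harmonization code. A simple scheme like this one will also miss a match when one cohort supplies an identifier and another supplies only a name.

```python
def match_key(record):
    """Pick the strongest identifier available for one metabolite record."""
    for field in ("hmdb_id", "inchikey"):
        value = (record.get(field) or "").strip()
        if value:
            return (field, value.upper())
    # Fall back to the metabolite name when no identifier was provided.
    return ("name", record["metabolite_name"].strip().lower())


def build_dictionary(records_by_cohort):
    """Group records from all cohorts under a shared match key."""
    dictionary = {}
    for cohort, records in records_by_cohort.items():
        for record in records:
            entry = dictionary.setdefault(
                match_key(record), {"cohorts": set(), "status": "provisional"}
            )
            entry["cohorts"].add(cohort)
    return dictionary


# Placeholder identifiers, not real HMDB or InChIKey values.
records_by_cohort = {
    "cohort_a": [
        {"metabolite_name": "Metabolite X", "hmdb_id": "HMDB-EXAMPLE-1"},
        {"metabolite_name": "Metabolite Y", "inchikey": "EXAMPLE-KEY-2"},
    ],
    "cohort_b": [
        {"metabolite_name": "metabolite x (alt. name)", "hmdb_id": "hmdb-example-1"},
        {"metabolite_name": " metabolite z "},
    ],
}

for key, entry in build_dictionary(records_by_cohort).items():
    print(key, sorted(entry["cohorts"]), entry["status"])
```

Every entry is marked provisional, in keeping with the treatment of matches as provisional rather than definite.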

Another key function of this metabolite dictionary is documenting the availability of metabolites across cohorts—crucial information for designing consortium studies. Investigators interested in evaluating metabolite biomarkers of coffee, citrus fruit, and fish intake, for example, can query the dictionary to determine the aggregate number of COMETS participants with these biomarkers and the number contributed by each cohort. This information should help guide investigators as to which cohorts are best suited for their project.
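
A query of this kind could be sketched as follows, assuming the dictionary has been exported to a simple mapping from each metabolite to the number of participants with data in each cohort. The structure, the names, and the counts are all hypothetical.

```python
# Hypothetical dictionary extract: metabolite -> {cohort: participants with data}.
availability = {
    "biomarker_1": {"cohort_a": 1200, "cohort_b": 850, "cohort_c": 4300},
    "biomarker_2": {"cohort_a": 1200, "cohort_c": 4300},
    "biomarker_3": {"cohort_b": 850},
}


def summarize_availability(availability, metabolite_names):
    """Return total and per-cohort participant counts for each metabolite."""
    summary = {}
    for name in metabolite_names:
        per_cohort = availability.get(name, {})
        summary[name] = {
            "total": sum(per_cohort.values()),
            "per_cohort": dict(sorted(per_cohort.items())),
        }
    return summary


def cohorts_with_all(availability, metabolite_names):
    """List cohorts that measured every requested metabolite."""
    cohort_sets = [set(availability.get(name, {})) for name in metabolite_names]
    return sorted(set.intersection(*cohort_sets)) if cohort_sets else []


wanted = ["biomarker_1", "biomarker_2"]
print(summarize_availability(availability, wanted))
print(cohorts_with_all(availability, wanted))
```

The second helper addresses the design question raised above: which cohorts are suited to a project that needs several biomarkers at once.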

At present, the dictionary includes 4,647 metabolites measured in at least 1 COMETS cohort. Because not all metabolites are measured in each study, the estimated sample size in COMETS can vary substantially from metabolite to metabolite. To describe this variation in sample sizes, we divided metabolites into 4 groups based on their frequency (Table 3). For the metabolites in the categories of “most frequent” (the 58 metabolites measured in 40 or more data sets), “frequent” (696 metabolites measured in 15–39 data sets), “less frequent” (1,180 metabolites measured in 6–14 data sets), and “least frequent” (2,713 metabolites measured in 5 or fewer data sets), the median numbers of participants were 58,884, 17,658, 9,023, and 973, respectively. These results highlight that metabolite coverage in COMETS is both deep (for some metabolites, data are available on more than 58,000 participants) and broad (for several thousand metabolites, there are thousands of participants with data).

Table 3

Numbers of Data Sets and Study Participants Used for the Current Metabolomics Analyses

Metabolite Group^a	No. of Metabolites in Category	Median No. of Data Sets (IQR)	Median No. of Participants (IQR)
Most frequent	58	47 (42–51)	58,884 (48,069–96,115)
Frequent	696	22 (18–28)	17,658 (14,571–25,739)
Less frequent	1,180	10 (8–12)	9,023 (5,049–16,501)
Least frequent	2,713	2 (1–2)	973 (332–2,103)

Abbreviation: IQR, interquartile range.

^a Results are presented separately for 4 groups of metabolites defined by frequency of measurement. The “most frequent” group includes metabolites measured in 40 or more data sets; the “frequent” group includes metabolites measured in 15–39 data sets; the “less frequent” group includes metabolites measured in 6–14 data sets; and the “least frequent” group includes metabolites measured in 5 or fewer data sets.
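
The grouping rule in the footnote above can be written as a small function. The thresholds are those stated for Table 3; the example rows and the summary loop are hypothetical and only show how medians per group could be computed.

```python
from statistics import median


def frequency_group(n_data_sets):
    """Assign a Table 3 frequency group from the number of data sets."""
    if n_data_sets < 1:
        raise ValueError("a dictionary metabolite is measured in at least 1 data set")
    if n_data_sets >= 40:
        return "most frequent"
    if n_data_sets >= 15:
        return "frequent"
    if n_data_sets >= 6:
        return "less frequent"
    return "least frequent"


# Hypothetical rows: (metabolite, number of data sets, participants with data).
rows = [
    ("m1", 47, 58000), ("m2", 41, 61000),
    ("m3", 22, 17000), ("m4", 10, 9000),
    ("m5", 2, 900), ("m6", 1, 400),
]

participants_by_group = {}
for _, n_data_sets, n_participants in rows:
    group = frequency_group(n_data_sets)
    participants_by_group.setdefault(group, []).append(n_participants)

for group, counts in participants_by_group.items():
    print(group, len(counts), median(counts))

# The group sizes reported in Table 3 add up to the dictionary total.
assert 58 + 696 + 1180 + 2713 == 4647
```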

CONCLUSIONS

COMETS Analytics provides a new framework with which to analyze data and aggregate results for large research consortia. Key assets of COMETS Analytics include protection of data; its “robust analytics,” which make it possible to apply the same models to all cohorts; and use of real-time checks to help ensure high-quality results. As compared with other Web-based software for statistical analysis (36, 37, 45), COMETS Analytics requires no specialized software, servers, or data agreements to use and can therefore be readily deployed in cohorts as they join research consortia. Data inputs and models are easy to implement, and no specialized expertise is needed beyond basic knowledge of Excel. We also note that the software provides the ability to code covariables and define models, such that the use of common data models could be supported for consortium meta-analyses by modifying the “VarMap” and “Models” tabs of the input data file. Because there is no sharing of individual-level data, cohorts that do not allow data-sharing may still participate. In our test application, analysts easily completed their analyses, demonstrating that the software is usable and performs well at scale. This test application also allowed us to build a metabolite dictionary that links metabolite names across 60 different data sets.

COMETS Analytics has some limitations. It takes time to set up the input files, and metabolite identifiers for each study must be linked to those that already exist in our metabolite dictionary, a process that can be slow and has potential for error. In addition, COMETS Analytics requires local analysts to conduct their own data preprocessing. Because firsthand experience with the particularities of each cohort’s data is required to appropriately prepare data inputs, this may be the best recourse at present. However, COMETS will continue to evaluate preprocessing methods, in case methods can ultimately be standardized. After the initial preprocessing is performed, participation for future projects will take less time. A further limitation is that reliance on summary-level data may preclude the conduct of some types of pathway-based statistical analyses or other multimetabolite analyses. At present, COMETS Analytics does not mandate use of a published common data model, as each cohort study maintains and is responsible for its own data. However, by using and reusing standardized data templates, we anticipate that inputs will become increasingly standardized, helping us move toward a common data model. COMETS Analytics is currently capable of supporting the use of such a model.

To date, several hundred national and international users have used COMETS Analytics, and we expect numbers to grow given the increasing use of metabolomics data in epidemiology. With a robust infrastructure in place, we anticipate continued development of COMETS Analytics modules, including pathway analyses. Since the software is agnostic to data type, we may also consider adapting it for use in other -omics fields. We continue to solicit comments from the user community, and we welcome feedback from readers.

ACKNOWLEDGMENTS

Author affiliations: Biostatistics Center and Department of Biostatistics and Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, United States (Marinella Temprosa); Metabolic Epidemiology Branch, Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, Maryland, United States (Steven C. Moore, Kaitlyn M. Mazzilli, Erikka Loftfield, Kathleen McClain); Epidemiology and Genomics Research Program, Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, Maryland, United States (Krista A. Zanetti); Information Management Services, Inc., Rockville, Maryland, United States (Nathan Appel, David Ruggieri); Infrastructure and Information Technology Operations Branch, Center for Biomedical Informatics and Information Technology, National Cancer Institute, Bethesda, Maryland, United States (Kai-ling Chen, Brian Park); Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts, United States (Rachel S. Kelly, Jessica A. Lasky-Su, Oana A. Zeleznik); Division of Human Nutrition and Health, Wageningen University, Wageningen, the Netherlands (Laura Trijsburg); Biomedical Informatics Department, College of Medicine, Ohio State University, Columbus, Ohio, United States (Ewy A. Mathé); and Division of Preclinical Innovation, National Center for Advancing Translational Sciences, Bethesda, Maryland, United States (Ewy A. Mathé).

M.T. and S.C.M. contributed equally to this study.

This work was supported by the Division of Cancer Control and Population Sciences, National Cancer Institute, and by the Intramural Research Program of the National Institutes of Health (Division of Cancer Epidemiology and Genetics, National Cancer Institute, and National Center for Advancing Translational Sciences).

The main website for COMETS Analytics (https://www.comets-analytics.org) provides links for accessing the software code repository for the R package and provides direct downloads to a sample input file and the harmonization database.

We thank Dr. Mary C. Playdon (University of Utah, Salt Lake City, Utah) for suggestions on an early draft of this article.

Members of the Consortium of Metabolomics Studies who generously shared metabolite names from their cohort studies for development of a metabolite dictionary: Drs. Demetrius Albanes (National Cancer Institute, Bethesda, Maryland), Yoav Ben-Shlomo (University of Bristol, Bristol, United Kingdom), Eric Boerwinkle (University of Texas Health Science Center at Houston, Houston, Texas), Bo L. Chawes (University of Copenhagen, Copenhagen, Denmark), Caroline Dale (University College London, London, United Kingdom), A. Heather Eliasson (Harvard T.H. Chan School of Public Health, Boston, Massachusetts), Christian Geiger (Helmholtz Zentrum München, Munich, Germany), Neil J. Goulding (University of Bristol), Andrea Gsur (Medical University of Vienna, Vienna, Austria), Marc J. Gunter (International Agency for Research on Cancer, Lyon, France), Sei Harada (Keio University, Tokyo, Japan), David M. Herrington (Wake Forest School of Medicine, Winston-Salem, North Carolina), Joel N. Hirschhorn (Broad Institute of MIT and Harvard, Boston, Massachusetts), Mattias Johannson (International Agency for Research on Cancer), Rachel S. Kelly (Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts), Mika Kivimaki (University College London), Jessica Lasky-Su (Brigham and Women’s Hospital and Harvard Medical School), Charles E. Matthews (National Cancer Institute), Christina Menni (King’s College London, London, United Kingdom), Steven C. Moore (National Cancer Institute), Eric Orwoll (Oregon Health and Science University, Portland, Oregon), Alexandre C. Pereira (University of São Paulo, São Paulo, Brazil), Lucilla Poston (King’s College London), Qibin Qi (Albert Einstein College of Medicine, New York, New York), Vasan S. Ramachandran (Boston University School of Medicine, Boston, Massachusetts), Kathryn M. Rexrode (Brigham and Women’s Hospital and Harvard Medical School), Rui Wang-Sattler (Helmholtz Zentrum München), Wei Jie Seow (National University of Singapore, Singapore), Svati H. Shah (Duke University, Durham, North Carolina), Eric J. Shiroma (National Institute on Aging, Bethesda, Maryland), Xiao-Ou Shu (Vanderbilt University Medical Center, Nashville, Tennessee), Rachel Stolzenberg-Solomon (National Cancer Institute), Victoria L. Stevens (Rollins School of Public Health, Emory University, Atlanta, Georgia), Toru Takebayashi (Keio University), Marinella Temprosa (George Washington University, Washington, DC), Emmi Tikkanen (Nightingale Health, Helsinki, Finland), Therese Tillin (University College London), Ioanna Tzoulaki (Imperial College London, London, United Kingdom), Cornelia M. Ulrich (University of Utah, Salt Lake City, Utah), Andrew Wong (University College London), and Bing Yu (University of Texas Health Science Center at Houston).

Conflict of interest: none declared.

REFERENCES

1. Sampson JN, Boca SM, Shu XO, et al. Metabolomics in epidemiology: sources of variability in metabolite measurements and implications. Cancer Epidemiol Biomarkers Prev. 2013;22(4):631–640.

2. Ioannidis JPA. Why most published research findings are false. PLoS Med. 2005;2(8):e124.

3. Kraft P, Zeggini E, Ioannidis JP. Replication in genome-wide association studies. Stat Sci. 2009;24(4):561–573.

4. Floegel A, Stefan N, Yu Z, et al. Identification of serum metabolites associated with risk of type 2 diabetes using a targeted metabolomic approach. Diabetes. 2013;62(2):639–648.

5. Wang TJ, Larson MG, Vasan RS, et al. Metabolite profiles and the risk of developing diabetes. Nat Med. 2011;17(4):448–453.

6. Menni C, Fauman E, Erte I, et al. Biomarkers for type 2 diabetes and impaired fasting glucose using a nontargeted metabolomics approach. Diabetes. 2013;62(12):4270–4276.

7. Yu D, Moore SC, Matthews CE, et al. Plasma metabolomic profiles in association with type 2 diabetes risk and prevalence in Chinese adults. Metabolomics. 2016;12:3.

8. Tang WH, Wang Z, Levison BS, et al. Intestinal microbial metabolism of phosphatidylcholine and cardiovascular risk. N Engl J Med. 2013;368(17):1575–1584.

9. Shah SH, Bain JR, Muehlbauer MJ, et al. Association of a peripheral blood metabolic profile with coronary artery disease and risk of subsequent cardiovascular events. Circ Cardiovasc Genet. 2010;3(2):207–214.

10. Kraus WE, Muoio DM, Stevens R, et al. Metabolomic quantitative trait loci (mQTL) mapping implicates the ubiquitin proteasome system in cardiovascular disease pathogenesis. PLoS Genet. 2015;11(11):e1005553.

11. His M, Viallon V, Dossus L, et al. Prospective analysis of circulating metabolites and breast cancer in EPIC. BMC Med. 2019;17(1):178.

12. Moore SC, Playdon MC, Sampson JN, et al. A metabolomics analysis of body mass index and postmenopausal breast cancer risk. J Natl Cancer Inst. 2018;110(6):588–597.

13. Mayers JR, Wu C, Clish CB, et al. Elevation of circulating branched-chain amino acids is an early event in human pancreatic adenocarcinoma development. Nat Med. 2014;20(10):1193–1198.

14. Schmidt JA, Fensom GK, Rinaldi S, et al. Patterns in metabolite profile are associated with risk of more aggressive prostate cancer: a prospective study of 3,057 matched case-control sets from EPIC. Int J Cancer. 2020;146(3):720–730.

15. Schmidt JA, Rinaldi S, Scalbert A, et al. Plasma concentrations and intakes of amino acids in male meat-eaters, fish-eaters, vegetarians and vegans: a cross-sectional analysis in the EPIC-Oxford cohort. Eur J Clin Nutr. 2016;70(3):306–312.

16. Mondul AM, Sampson JN, Moore SC, et al. Metabolomic profile of response to supplementation with beta-carotene in the Alpha-Tocopherol, Beta-Carotene Cancer Prevention Study. Am J Clin Nutr. 2013;98(2):488–493.

17. Menni C, Graham D, Kastenmuller G, et al. Metabolomic identification of a novel pathway of blood pressure regulation involving hexadecanedioate. Hypertension. 2015;66(2):422–429.

18. Newgard CB, An J, Bain JR, et al. A branched-chain amino acid-related metabolic signature that differentiates obese and lean humans and contributes to insulin resistance. Cell Metab. 2009;9(4):311–326.

19. Cheng S, Rhee EP, Larson MG, et al. Metabolite profiling identifies pathways associated with metabolic risk in humans. Circulation. 2012;125(18):2222–2231.

20. Moore SC, Matthews CE, Sampson JN, et al. Human metabolic correlates of body mass index. Metabolomics. 2014;10(2):259–269.

21. Wurtz P, Wang Q, Soininen P, et al. Metabolomic profiling of statin use and genetic inhibition of HMG-CoA reductase. J Am Coll Cardiol. 2016;67(10):1200–1210.

22. ATBC Cancer Prevention Study Group. The Alpha-Tocopherol, Beta-Carotene Lung Cancer Prevention Study: design, methods, participant characteristics, and compliance. Ann Epidemiol. 1994;4(1):1–10.

23. Childhood Asthma Management Program Research Group. The Childhood Asthma Management Program (CAMP): design, rationale, and methods. Control Clin Trials. 1999;20(1):91–120.

24. Diabetes Prevention Program Research Group. Long-term effects of lifestyle intervention or metformin on diabetes development and microvascular complications over 15-year follow-up: the Diabetes Prevention Program Outcomes Study. Lancet Diabetes Endocrinol. 2015;3(11):866–875.

25. Gaziano JM, Sesso HD, Christen WG, et al. Multivitamins in the prevention of cancer in men: the Physicians’ Health Study II randomized controlled trial. JAMA. 2012;308(18):1871–1880.

26. Prorok PC, Andriole GL, Bresalier RS, et al. Design of the Prostate, Lung, Colorectal and Ovarian (PLCO) Cancer Screening Trial. Control Clin Trials. 2000;21(6 suppl):273S–309S.

27. Litonjua AA, Lange NE, Carey VJ, et al. The Vitamin D Antenatal Asthma Reduction Trial (VDAART): rationale, design, and methods of a randomized, controlled trial of vitamin D supplementation in pregnancy for the primary prevention of asthma and allergies in children. Contemp Clin Trials. 2014;38(1):37–50.

28. Cheng TY, Makar KW, Neuhouser ML, et al. Folate-mediated one-carbon metabolism genes and interactions with nutritional factors on colorectal cancer risk: Women’s Health Initiative Observational Study. Cancer. 2015;121(20):3684–3691.

29. NCI-NHGRI Working Group on Replication in Association Studies, Chanock SJ, Manolio T, et al. Replicating genotype-phenotype associations. Nature. 2007;447(7145):655–660.

30. Ioannidis JPA, Castaldi P, Evangelou E. A compendium of genome-wide associations for cancer: critical synopsis and reappraisal. J Natl Cancer Inst. 2010;102(12):846–858.

31. Lawlor DA, Tilling K, Davey Smith G. Triangulation in aetiological epidemiology. Int J Epidemiol. 2016;45(6):1866–1886.

32. Yu B, Zanetti KA, Temprosa M, et al. The Consortium of Metabolomics Studies (COMETS): metabolomics in 47 prospective cohort studies. Am J Epidemiol. 2019;188(6):991–1012.

33. Gaye A, Marcon Y, Isaeva J, et al. DataSHIELD: taking the analysis to the data, not the data to the analysis. Int J Epidemiol. 2014;43(6):1929–1944.

34. Doiron D, Burton P, Marcon Y, et al. Data harmonization and federated analysis of population-based studies: the BioSHaRE Project. Emerg Themes Epidemiol. 2013;10(1):12.

35. Temprosa M. CBIIT/R-cometsAnalytics. https://github.com/CBIIT/R-cometsAnalytics/. Published April 12, 2021. Accessed April 12, 2021.

36. Chong J, Yamamoto M, Xia J. MetaboAnalystR 2.0: from raw spectra to biological insights. Metabolites. 2019;9(3):57.

37. Tautenhahn R, Patti GJ, Rinehart D, et al. XCMS Online: a web-based platform to process untargeted metabolomic data. Anal Chem. 2012;84(11):5035–5039.

38. Callaham J. There are now 1.2 billion Office users and 60 million Office 365 commercial customers. https://www.windowscentral.com/there-are-now-12-billion-office-users-60-million-office-365-commercial-customers. Published March 31, 2016. Accessed April 12, 2021.

39. Sherry ST, Ward MH, Kholodov M, et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 2001;29(1):308–311.

40. Kim S. ppcor: an R package for a fast calculation to semi-partial correlation coefficients. Commun Stat Appl Methods. 2015;22(6):665–674.

41. Kuhn M. Building predictive models in R using the caret package. J Stat Softw. 2008;28(5):26. https://CRAN.R-project.org/package=caret.

42. Kuhn M. caret: Classification and Regression Training. (R package, version 6.0-88). Published May 15, 2021. Accessed July 6, 2021.

43. Sumner LW, Amberg A, Barrett D, et al. Proposed minimum reporting standards for chemical analysis: Chemical Analysis Working Group (CAWG) Metabolomics Standards Initiative (MSI). Metabolomics. 2007;3(3):211–221.

44. Evans AM, Bridgewater B, Liu Q, et al. High resolution mass spectrometry improves data quantity and quality as compared to unit mass resolution mass spectrometry in high-throughput profiling metabolomics. Metabolomics. 2014;4(2):132.

45. Fortier I, Raina P, Van den Heuvel ER, et al. Maelstrom Research guidelines for rigorous retrospective data harmonization. Int J Epidemiol. 2017;46(1):103–105.