With statisticians being a little slow to get involved on the front lines of bioinformatics, many basic and advanced statistical principles have been underutilized in the collection and modeling of bioinformatics data. Statisticians, with their deep understanding of variability and uncertainty quantification, play a key role in these efforts.

We also confirmed such confounding of run order and case/control status in another follow-up study by this group (Baggerly et al., 2004a). Baggerly and Coombes (2010) reported these irregularities in an Annals of Applied Statistics paper, showed that their attempts to follow the reported procedures yielded predictive results no better than random chance, and strongly urged that the clinical trials that had commenced to test these signatures be suspended until the irregularities were rectified.

More complicated relationships and feedback loops are continually being discovered, and these build off of this fundamental information flow. A simplified model illustrating these interrelationships is given by Mallick et al. In 2009, Illumina introduced a 27k array (Bibikova et al., 2009) that measured methylation at 27,578 CpG sites from 14,495 genes, roughly two CpG sites per gene.

In these regression models, the regression coefficients are themselves functions defined on the same space as the responses, so after model fitting, differential expression can be assessed by determining for which functional locations the coefficients differ significantly from zero. Prior distributions placed on the basis-space regression coefficients induce the type of L1 or L2 penalization behavior that leads to appropriately smoothed/regularized functional coefficients.

Examples include aggregating information across multiple probes or genomic locations to produce gene expression summaries, performing peak or spot detection in proteomics to obtain counts or relative protein abundance measurements, or segmenting regions of the genome believed to have common copy number values. Alignment is only done after peak or spot detection, so it does not make use of the spatial information in the raw spectra or gels that might lead to improved registration.

SNP arrays are one of the most common types of high-resolution chromosomal microarrays (Mei et al., 2000). Platform-specific specialized software packages are used to align the SNPs to chromosomal locations, generating genome-wide DNA profiles of copy number alterations and allelic frequencies that can then be interrogated to answer various scientific and clinical questions. They also provide genotypic information for the SNPs, which, when considered across multiple SNPs, can be used to study haplotypes. For normal probes, log2(2/2) = 0; for single-copy losses, log2(1/2) = -1; for single-copy gains, log2(3/2) ≈ 0.58; and so on.
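These idealized values are easy to verify numerically; a minimal sketch (function and variable names are ours, for illustration only):

```python
import math

def expected_log2_ratio(tumor_copies, reference_copies=2):
    """Idealized log2 copy-number ratio for a pure, homogeneous sample."""
    return math.log2(tumor_copies / reference_copies)

for label, copies in [("normal", 2), ("single-copy loss", 1), ("single-copy gain", 3)]:
    print(f"{label}: log2({copies}/2) = {expected_log2_ratio(copies):+.2f}")
# normal: +0.00, single-copy loss: -1.00, single-copy gain: +0.58
```

In practice, tumor heterogeneity and contamination by normal cells attenuate the observed log-ratios toward zero, as discussed later in this section.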
Keith Baggerly and Kevin Coombes set out to understand these studies so they could replicate them in other settings. In most cases, such efforts are not feasible, and thus there could be many other pivotal studies with spurious or otherwise erroneous results that are allowed to stand, countless resources spent trying to replicate or build on those results, and in some cases patient treatment decisions based upon them.

Here, we will briefly overview several important proteomic technologies that involve estimating absolute or relative abundance levels, including low- to moderate-throughput assays that can be used to study small numbers of pre-specified proteins, and high-throughput methods that can survey a larger slice of the proteome. 2D gel electrophoresis was developed in the 1970s (O'Farrell, 1975) and has served as the primary workhorse for high-throughput expression proteomics. Initial methods performed protein quantification one sample at a time. Improved preprocessing methods have since been developed (e.g. Morris et al., 2008a) that utilize fundamental statistical principles to more efficiently borrow strength within and between spectra, and thus produce substantially improved results.

Transcriptomic and proteomic assays typically generate large-scale multivariate data whose number of variables (genes/proteins) is typically of much higher order than the sample size: the "large p, small n" situation, a common thread in nearly all of the technologies described above. We review some of the recent developments in this area, mostly in the context of cancer, since it is one of the most well-characterized disease systems at different molecular levels.

In recent years, advances in statistical modeling have led to a growing set of tools for building flexible statistical models. One such strategy has the advantage of modeling all of the data while being straightforward to implement, with the analyst able to apply any desired statistical model in parallel to the different elements of the data object. For example, such models can capture correlations across genes present in common biological pathways, and relationships among measurements from different technological platforms that each contain different biological information based on their molecular resolution level (e.g. methylation). Recently, estimation and computational approaches have been developed that generalize graphical model estimation to multi-platform data, inferring more integrated networks that give a holistic view of the dependencies (Ha et al., 2015; Ni et al., 2014, 2016). Stingo et al. (2010) proposed a DAG-based model to infer microRNA regulatory networks, and Ni et al. (2015) developed an efficient Bayesian method for discovering non-linear edge structures in DAG models, which allows the functional form of the relationships between nodes to be non-parametrically determined by the data.

Olshen et al. (2004) proposed a statistically principled method called circular binary segmentation (CBS) that provides a natural way to segment a chromosome into contiguous regions and bypasses parametric modeling of the data.
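To convey the flavor of segmentation, here is a deliberately simplified sketch: plain (non-circular) binary segmentation with a t-statistic split criterion on simulated log-ratios. It is our own illustration under stated assumptions, not the Olshen et al. algorithm; the threshold and minimum segment sizes are arbitrary choices.

```python
import numpy as np

def best_split(y):
    """Return (index, |t|) for the split maximizing the two-sample t-statistic."""
    best_i, best_t = None, 0.0
    for i in range(5, len(y) - 4):            # require at least 5 points per side
        left, right = y[:i], y[i:]
        se = np.sqrt(left.var(ddof=1) / len(left) + right.var(ddof=1) / len(right))
        t = abs(left.mean() - right.mean()) / se if se > 0 else 0.0
        if t > best_t:
            best_i, best_t = i, t
    return best_i, best_t

def segment(y, start=0, threshold=5.0, out=None):
    """Recursively split until no candidate change point exceeds the threshold."""
    out = [] if out is None else out
    i, t = best_split(y)
    if i is None or t < threshold:
        out.append((start, start + len(y)))   # declare one homogeneous segment
    else:
        segment(y[:i], start, threshold, out)
        segment(y[i:], start + i, threshold, out)
    return out

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(0.0, 0.2, 100),   # normal region, log2(2/2) = 0
                    rng.normal(-1.0, 0.2, 40),   # single-copy loss
                    rng.normal(0.0, 0.2, 100)])  # back to normal
print(segment(y))   # roughly [(0, 100), (100, 140), (140, 240)]
```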
One of the key attributes that sets statisticians apart from other quantitative scientists is their understanding of variability and uncertainty quantification. While drawing some general conclusions, we organize the core of this paper around four key areas. In this article, we do not attempt to exhaustively summarize the work that has been done, but instead attempt to illustrate contributions and highlight the motivating statistical principles. While the particulars are technology-specific, there are several general considerations that apply to nearly all technologies.

Despite the excitement generated by these apparent successes, no other groups were able to get this strategy to work, in spite of its sole reliance on publicly available microarray data and cell lines that anyone could have obtained. A simulation study (Baggerly et al., 2005b) revealed that the spectra in the second data set had pervasive differences between cancer and normal, which should not be the case, since we expect biological proteomic differences to be characterized by a limited number of specific peaks, not the entire spectrum.

Many types of mutations can be characterized, including point mutations, insertions, deletions, and translocations. As mentioned above, one of the hallmarks of genetic variation in cancer is the genomic instability of cancerous cells, manifested as copy number changes across the genome that can be measured using high-resolution, high-throughput assays such as aCGH and SNP arrays (see Section 2.2 for details). This heterogeneity implies that we actually measure a composite copy number estimate across a mixture of cell types, which tends to attenuate the log-ratios toward zero. SNP array analyses of germline samples have been extensively used in genome-wide association studies (GWAS) to find genetic markers associated with various diseases of interest.

While each technology has its own characteristics and caveats (see Section 4.1), the basic read-outs contain expression level estimates for thousands of genes on a per-sample basis. The first is that the set size affects testing power, due to the imbalance between null and alternative hypotheses (Newton et al., 2007). Jaffe et al. (2012) use post hoc loess smoothing to account for correlation in the data and gain further efficiency, and Lee and Morris (2015) apply Bayesian functional mixed models to detect DMRs.

The resulting gel image is characterized by hundreds to thousands of spots, each corresponding to proteins present in the sample, which are analyzed to assess protein differences across samples. As with standard dilution assays, the expression patterns typically follow a sigmoidal curve, i.e. a quantifiable linear range flanked by a background plateau at low concentrations and a saturation plateau at high concentrations.
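To make the sigmoidal shape concrete, here is a small sketch using a generic three-parameter logistic curve (our own parameterization for illustration, not the model of any particular RPPA package): the signal saturates at high concentration and flattens to background at low concentration, with a quantifiable linear range in between.

```python
import numpy as np

def logistic_response(log2_conc, a, b, c):
    """Three-parameter logistic: background a, dynamic range b, location c."""
    return a + b / (1.0 + np.exp(-(log2_conc - c)))

log2_conc = np.arange(0, -6, -1)   # a serial 2-fold dilution series: 1, 1/2, ..., 1/32
signal = logistic_response(log2_conc, a=100.0, b=5000.0, c=-2.0)
print(np.round(signal))            # approximately [4504, 3755, 2600, 1445, 696, 337]
```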
After the publication of Baggerly et al. (2004b), the FDA sent a letter to the company to hold off on marketing OvaCheck, and six months later notified them of the need to conduct further validation studies, which were never able to reproduce the initial spectacular results.

[Figure: Heatmap of ovarian cancer data: mass spectra from 216 samples in Petricoin et al. (2002).]

Chromosomal CGH resolution is limited to 10-20 Mb; hence any aberration smaller than that will not be detected. Most methods proceed by modeling a single sample/array at a time, and thus fail to borrow strength across multiple samples to infer shared regions of copy number aberrations (Olshen et al., 2004; Tibshirani and Wang, 2008). Baladandayuthapani et al. proposed an approach based on a multilevel functional mixed effects model that flexibly models within-subject variability and also allows population-level inference to obtain segments of shared copy number changes. Sequencing depth is often described in terms of fold-coverage, with 30x depth indicating that we expect roughly 30 reads covering each genomic location.

A variant of 2DGE that can potentially lead to more accurate relative abundance measurements is 2D difference gel electrophoresis (DIGE; Karp and Lilley, 2005), which involves labeling two samples with two different dyes, loading them onto the same gel, and then scanning the gel twice using different lasers that differentially pick up the two dyes. Once the raw intensities from the RPPA slides have been adjusted for background and other spatial trends, the next preprocessing step is to quantify/estimate the concentration of each protein in each sample, based on the underlying assumption that the intensity of a given spot on the array is proportional to the amount of protein.

In spite of missed opportunities, there have been substantial efforts and success stories where (sometimes advanced) statistical tools have been developed for these data, leading to improved results and deep scientific contributions. Flexible modeling can bridge the gap between the extremes of reductionistic feature extraction approaches, which can miss information contained in the data, and elementwise modeling approaches, which model all of the data but sacrifice efficiency and inferential accuracy by ignoring relationships in the data. As described in Morris (2015), one of the hallmarks of functional regression is the use of basis function representations (e.g. splines, wavelets, or principal components).

[Figure: Illustration of types of multi-platform genomics data and their interrelationships.]

These models provide representations of the conditional independence structure of the multivariate distribution to develop and infer gene/protein networks.

Given genes ranked by some univariate measure of association (e.g. differential expression, fold change, etc.), GSEA computes an enrichment score to reflect the degree to which a predefined pathway (using the databases above) is over-represented at the extremes of the ranked gene list; these scores can then be used to obtain ranked lists of pathways.
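The core running-sum idea is simple to sketch. Below is a stripped-down, unweighted Kolmogorov-Smirnov-style version of the enrichment score (the actual GSEA software uses a weighted variant and permutation-based significance assessment, which we omit):

```python
import numpy as np

def enrichment_score(ranked_genes, gene_set):
    """Max deviation of a running sum that steps up at set members, down otherwise."""
    in_set = np.array([g in gene_set for g in ranked_genes], dtype=float)
    n, k = len(ranked_genes), in_set.sum()
    step_hit, step_miss = 1.0 / k, 1.0 / (n - k)
    running = np.cumsum(np.where(in_set == 1, step_hit, -step_miss))
    return running[np.argmax(np.abs(running))]

ranked = ["g%d" % i for i in range(100)]    # genes ranked by, e.g., fold change
pathway = {"g1", "g3", "g5", "g8", "g13"}   # toy gene set concentrated near the top
print(enrichment_score(ranked, pathway))    # positive: set enriched at the top of the list
```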
In this article, we attempt to summarize some of the key contributions of statisticians to bioinformatics, focusing on four areas: (1) experimental design and reproducibility, (2) preprocessing and feature extraction, (3) unified modeling, and (4) structure learning and integration.

Also, the biological research community has made extensive efforts to build up knowledge resources that are freely available online, including recent large-scale federal efforts toward unified databases, especially in cancer (e.g. the Genomic Data Commons; Grossman et al., 2016). This has led to a proliferation of statistical, bioinformatic, and data-mining efforts to collectively analyze and model the large volumes of data.

This basic process can be regulated and altered through epigenetic processes, such as DNA methylation that helps regulate transcription, post-translational modification of histone proteins within the chromatin structures encasing the DNA, or microRNAs (miRNAs) that degrade targeted mRNAs.

Based on this heatmap, it became evident that the benign cysts were very different from both cancers and normals, and that the benign cyst mass spectra from the first data set looked much like the WCX2 array data from the second study. Apparently, the panel was never shown the full list of irregularities raised by Baggerly and Coombes, but only a hand-picked subset that was more easily addressable. Also, since the publication of these papers, the commercial software packages for preprocessing proteomics data have incorporated many of these principles and as a result improved their performance.

Several processing steps need to be applied to the raw data generated by the genomics platforms described in Section 2 before they are ready for downstream statistical analysis. A number of different feature extraction approaches are available in the current literature and in commercial software packages.

To overcome these limitations, reverse-phase protein arrays (RPPA) have been developed to provide quantitative, high-throughput, time- and cost-efficient analysis of small to moderate numbers of proteins (dozens to hundreds) using small amounts of biological material (Tibes et al., 2006). In RPPA analyses, proteins are isolated from biological specimens such as cell lines, tumors, or serum using standard laboratory-based methods. The protein concentrations are then determined for the samples, and serial 2-fold dilutions prepared from each sample are arrayed on a glass slide. First, since each slide is probed with a single antibody targeting that protein, the protein expression of the different samples should share common chemical and hybridization profiles.
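As a toy illustration of how the proportionality assumption and the dilution design combine to yield relative quantification, consider an idealized model in which every spot sits in the linear range, so each 2-fold dilution step drops the log2 intensity by exactly one (this is our own simplification, not the joint curve-fitting approach used in practice):

```python
import numpy as np

rng = np.random.default_rng(0)
steps = np.arange(6)                        # dilution steps 0..5, each halving concentration

def series(log2_conc):
    """Observed log2 intensities: log2(concentration) minus the step, plus noise."""
    return log2_conc - steps + rng.normal(0, 0.1, steps.size)

sample_a, sample_b = series(10.0), series(8.5)
# In the linear range, the relative concentration is the average vertical offset.
print(2 ** np.mean(sample_a - sample_b))    # ~2.8, i.e. about 2^1.5-fold more protein
```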
Biology and medicine have moved to a place where big data are becoming ubiquitous in research and even clinical practice. New technologies are continually being developed and introduced at a rapid rate, and these data will bring many new challenges. Tukey (1977) highlights the importance of good exploratory data analysis in statistics, and this is no less true in the world of big data.

These statistical model-based microarray preprocessing packages have become the status quo for preprocessing and are still widely used today. Methods that model the raw data in their entirety have the potential to capture insights missed by feature extraction approaches. While not commonly used in practice for high-throughput genomics data, we believe flexible modeling strategies like these are promising, and should be further pursued and explored by statistical researchers for complex biomedical data.

DNA copy number data are characterized by high noise levels that add random measurement errors to the observations. Diploid organisms such as humans have two copies of each autosome (i.e. each non-sex chromosome). The resulting data consist of a set of dilution series on each RPPA slide, designed so that for each sample at least one spot falls within the linear range of expression.

This has the potential to discover causal mechanisms between genes that are not typically afforded by other, more naive methods. One 2007 study attempted to incorporate Gene Ontology pathway information to predict survival time.

These high-profile re-analyses by statisticians have contributed to a greater level of awareness of these crucial issues, and statisticians have been taking leadership roles in helping funding agencies and journals construct policies that contribute to greater transparency and reproducibility in research (Peng, 2009; Stodden et al., 2013; Collins and Tabak, 2014; Fuentes, 2016; Hofner et al., 2016). Following are a few examples. In either case, it is important to ensure that all gene selection and modeling decisions are made using the training data alone.

Researchers broadly deemed the false discovery rate (FDR) a more appropriate statistical criterion for discovery involving high-dimensional genomics and proteomics data.
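For concreteness, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, the most widely used FDR-controlling method (our own implementation for illustration; in practice one would use a vetted library routine):

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Return a boolean mask of discoveries controlling the FDR at level q."""
    p = np.asarray(pvals)
    order = np.argsort(p)
    thresh = q * np.arange(1, p.size + 1) / p.size   # q * k / m for k = 1..m
    passed = p[order] <= thresh
    k = np.max(np.nonzero(passed)[0]) + 1 if passed.any() else 0
    reject = np.zeros(p.size, dtype=bool)
    reject[order[:k]] = True                          # reject the k smallest p-values
    return reject

# 950 null p-values plus 50 strongly non-null ones
rng = np.random.default_rng(0)
p = np.concatenate([rng.uniform(size=950), rng.uniform(0, 1e-4, size=50)])
print(benjamini_hochberg(p).sum())   # ~50 discoveries, few of them false positives
```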
The initial Human Genome Project involved the complete sequencing of a human genome, which took 13 years (1990-2003) and cost roughly $3 billion.

Statistical expertise in the experimental design and low-level processing stages is equally if not more important than end-stage modeling, since errors and inefficiencies in these steps propagate into subsequent analyses, and can preclude the possibility of making valid discoveries and scientific conclusions even with the best-constructed end-stage modeling strategies. Baggerly and Coombes had numerous interactions with the senior authors, Potti and Nevins, but they were unable or unwilling to address the most serious of these issues and eventually stopped responding.

If done well, feature extraction can be an efficient strategy to reduce dimensionality, simplify the data, and focus inference on quantities that are most readily biologically interpretable. Failure to use all of the information in the data increases the risk of missed discoveries. Third, they allow specific inferential questions to be answered through explicit parameterizations. It leads to missing data when a given peak does not have a corresponding detected peak in all spectra, and leads to many types of errors, including peak detection errors, peak matching errors, and peak boundary estimation errors, all of which worsen considerably as the number of spectra increases.

This can be used in paired designs to find proteins with differential abundance between two conditions; in more general designs, a common reference material can be used on the second channel as an internal reference factor. Note that unlike aCGH arrays, SNP arrays have the advantage of detecting both copy number alterations and loss-of-heterozygosity (LOH) events given the allelic fractions, typically referred to as B-allele frequencies (Beroukhim et al., 2006). In general, a gene class is defined as a collection of genes deemed to be biologically associated, given a biological reference based on scientific literature, transcription factor databases, expert opinion, or empirical and theoretical evidence. Lanckriet et al. (2004) propose a two-stage approach, first computing a kernel representation for the data in each platform and subsequently combining kernels across platforms in a classification model.

They propose a joint estimation model based on a three-parameter logistic curve, pooling all the information on an array to estimate global parameters.

This model was fitted using alternating least squares applied after performing a loess-based normalization to a reference array, using an iterative outlier filtering algorithm to remove outlying probes, arrays, and observations, with missing-data principles allowing the estimation of the gene expression values. Irizarry et al. (2003a) (4,161 citations, Google Scholar) improved upon this approach by modeling the log-transformed gene expressions, which adjusted for the heteroscedasticity in the data and allowed the multiplicative probe affinities to be fit in a linear model framework.
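In our notation (a sketch of the standard log-scale probe-level model of this type; the subscripts are ours), for a given gene the model takes the form

$$
\log_2\!\big(\mathrm{PM}^{*}_{ij}\big) = \mu_i + \alpha_j + \varepsilon_{ij}, \qquad \sum_j \alpha_j = 0,
$$

where $\mathrm{PM}^{*}_{ij}$ is the background-corrected, normalized perfect-match intensity for array $i$ and probe $j$, $\mu_i$ is the log2-scale expression of the gene on array $i$, $\alpha_j$ is the probe-specific affinity (multiplicative on the raw scale, additive on the log scale), and $\varepsilon_{ij}$ is measurement noise. Fitting is done robustly (e.g. by median polish), so outlying probes do not distort the expression summaries.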
By applying these principles, we can continue to develop efficient methods that can strongly impact the field of bioinformatics moving forward.

The general term "bioinformatics" refers to a multidisciplinary field involving computational biologists, computer scientists, mathematical modelers, systems biologists, and statisticians exploring different facets of the data, ranging from storing, retrieving, and organizing biological data to its subsequent analysis. These challenges are further accentuated by the fact that in many settings the number of variables far exceeds the sample size (the "large p, small n" problem).

Because the spots on the gel contain actual physical proteins, the proteomic identity of a spot can be determined by cutting it out of the gel and further analyzing it using protein identification techniques like tandem mass spectrometry (see below). Peak detection is done only on individual spectra, ignoring information about whether a corresponding peak appears in other replicate spectra. These results generated great interest at M.D. Anderson Cancer Center, as many cancer researchers wanted to try this approach for early detection of other cancers and came to the Biostatistics department for help planning these studies.

More recently, scientists have developed techniques that integrate aspects of both traditional and molecular cytogenetic techniques, called chromosomal microarrays (Vissers et al., 2010). Broadly, there are two types of chromosomal microarrays: array-based comparative genomic hybridization (aCGH) arrays and single nucleotide polymorphism (SNP) arrays. In an idealized scenario where all of the cells in a disease sample have the same genomic alterations and are uncontaminated by normal cells, the log-ratios would assume specific discrete values (e.g. 0 for normal regions, -1 for single-copy losses, and approximately 0.58 for single-copy gains). Colella et al. (2007) propose an objective Bayes hidden Markov model that effectively borrows strength between neighboring SNPs by explicitly incorporating distance information as well as the genotype information via B-allele frequencies. Another line of work models the log-ratios as functions of genomic location, allowing efficient borrowing of strength both within and across arrays to model such data.

In the software shipped with their arrays, Affymetrix quantified gene expression by taking a simple average of the differences PM - MM across all probes for a given gene on an array, a method they called AvDiff.

Since it is the gene selection, and not the parameter estimation, that typically introduces the most variability into the modeling, this practice (selecting genes using all of the data before cross-validation) can lead to strongly biased predictive accuracy assessments.
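A small simulation makes the bias tangible. The data below are pure noise, so the honest classification accuracy is 50%; selecting the "best" gene on the full data before cross-validating manufactures apparent signal (the names, thresholds, and single-gene classifier are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 500
X = rng.normal(size=(n, p))        # pure-noise "expression" matrix
y = rng.integers(0, 2, n)          # random class labels: no real signal exists

def top_gene_classifier(X_tr, y_tr, X_te):
    """Pick the gene most correlated with the labels, classify by its class midpoint."""
    corr = np.abs(np.corrcoef(X_tr.T, y_tr)[-1, :-1])
    g = int(np.argmax(corr))
    m0, m1 = X_tr[y_tr == 0, g].mean(), X_tr[y_tr == 1, g].mean()
    cut, sign = (m0 + m1) / 2, 1.0 if m1 > m0 else -1.0
    return (sign * (X_te[:, g] - cut) > 0).astype(int)

# WRONG: gene selected once using ALL the data; only the rest is cross-validated.
g_all = int(np.argmax(np.abs(np.corrcoef(X.T, y)[-1, :-1])))
acc_wrong, acc_right = [], []
for i in range(n):                 # leave-one-out cross-validation
    tr = np.arange(n) != i
    acc_wrong.append(top_gene_classifier(X[tr][:, [g_all]], y[tr], X[[i]][:, [g_all]])[0] == y[i])
    # RIGHT: selection happens inside the loop, using the training fold only.
    acc_right.append(top_gene_classifier(X[tr], y[tr], X[[i]])[0] == y[i])
print(np.mean(acc_wrong), np.mean(acc_right))   # optimistic (well above 0.5) vs. ~0.5
```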
The resulting data consist of log fluorescence ratios as a function of genomic location and provide a cytogenetic representation of the relative DNA copy number variation. Over a series of elution times, the sets of separated proteins are then fed into an MS analyzer to produce a spectrum.

This can be applied to genome-wide data, including methylation and copy number data, by modeling the data as functions of the chromosomal locus; or to proteomics data, by modeling mass spectra as spiky functions of the m/z values, or 2DGE images and LC-MS profiles as image data, which can be viewed as functional data on a two-dimensional domain.

Integromics espouses the philosophy that a disease is driven by numerous molecular/genetic alterations and the interactions between them, with each type of alteration likely to provide a unique but complementary view of disease progression.

This documentation should include any preprocessing steps, gene selection, model training, and model validation procedures. However, the great biological sensitivity that makes these assays so desirable for research can also make them highly sensitive to variability in experimental conditions or sample handling.

One common approach to measuring methylation is sodium bisulfite conversion, in which sodium bisulfite is added to DNA fragments, converting unmethylated cytosine into uracil and allowing the estimation of a beta value that measures the percent methylation at a given CpG site.
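Concretely, the beta value is computed from the methylated (M) and unmethylated (U) channel intensities at each CpG site; a minimal sketch (the offset of 100 follows a common Illumina convention, though the exact constant is a tunable stabilizing choice):

```python
def beta_value(meth, unmeth, offset=100.0):
    """Percent methylation estimate at one CpG site from two channel intensities.

    The offset stabilizes the ratio when both intensities are near background.
    """
    m, u = max(meth, 0.0), max(unmeth, 0.0)
    return m / (m + u + offset)

print(beta_value(5000, 200))   # ~0.94: heavily methylated site
print(beta_value(150, 4800))   # ~0.03: mostly unmethylated site
```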