Van Eyk, Dunn - Proteomic and Genomic Analysis of Cardiovascular Disease - 2003, page 7
Before expression data from different cDNA or high-density oligonucleotide microarrays can be compared to each other, the data need to be normalized. Normalization attempts to isolate the biological information by removing the impact of non-biological influences on the data and by correcting for systematic bias in expression data. Systematic bias can be caused by differences in labeling efficiencies, scanner malfunction, differences in the initial quantity of mRNA, different concentrations of DNA on the arrays (reporter bias), printing and tip problems and other microarray batch bias, uneven hybridization, as well as experimenter-related issues. Every normalization procedure is likely to remove or even distort some of the biological information.
Therefore, it is a good idea to address the problems leading to systematic bias in order to keep normalization to a minimum. Misaligned lasers can easily be fixed, and reciprocal labeling with swapped color dyes will allow correction for differences in labeling efficiencies in cDNA microarray experiments. Sensible cDNA microarray design can help to distinguish reporter bias caused by different DNA concentrations from "biological" effects caused by the systematic arrangement of reporters on the array.
Uneven hybridization can be caused by insufficient amounts of labeled probe that fail to saturate the target spots. However, the experimenters themselves can be one of the largest sources of systematic variability. Considering the many steps necessary to perform a microarray experiment, it doesn't come as a surprise that experiments done by the same experimenter have been shown to cluster more tightly and have less variability than experiments done by several experimenters.
Taken together, sensible design of arrays and experiments, systematic error checking, the use of reference samples, replicates, consistent methods, and good quality control can significantly enhance data quality and minimize the need for data normalization.

There are several techniques that are widely used to normalize gene-expression data (reviewed in [17]), such as total intensity normalization, linear regression techniques [18], ratio statistics [19], and LOWESS (LOcally WEighted Scatterplot Smoothing) or LOESS (LOcally wEighted regreSSion) correction [20]. Every normalization strategy relies on a set of assumptions.
It is important to understand your data in order to know whether these assumptions are appropriate for your data set. In general, all of the strategies assume that the average gene does not change, whether looking at the entire data set or at a user-defined subset of genes. Total intensity normalization relies on the assumption that equal amounts of labeled RNA sample have been hybridized to the arrays. Furthermore, when used with cDNA microarrays, this technique assumes that equal numbers of genes are upregulated and downregulated, so that the overall intensity of all the elements on an array is the same for all RNA samples (control and experimental). Under these assumptions, a normalization factor can be calculated to re-scale the overall intensity value of the arrays.
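Under these assumptions, the rescaling step amounts to a single scaling factor. A minimal sketch with invented intensity values (numpy assumed):

```python
import numpy as np

# Hypothetical raw spot intensities for the same reporters on two arrays.
control = np.array([1200.0, 340.0, 560.0, 78.0, 910.0])
experimental = np.array([2400.0, 700.0, 1100.0, 160.0, 1800.0])

# Total intensity normalization: assuming equal amounts of labeled RNA were
# hybridized, the summed intensity should be the same on both arrays, so the
# experimental array is rescaled by the ratio of the totals.
factor = control.sum() / experimental.sum()
experimental_scaled = experimental * factor
```

After rescaling, the two arrays have the same overall intensity, so ratios between them reflect relative rather than absolute differences.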
Affymetrix MAS uses a scaling factor to bring all the arrays in an experiment to a preset arbitrary target intensity value.

Linear regression techniques rely on the assumption that a significant fraction of genes are expressed at the same level when RNA populations from closely related samples are compared. Based on this assumption, plotting the intensity values of all genes from one sample against the intensity values of all genes from the other sample should result in genes that are expressed at equal levels clustering along a straight line. Regression techniques are then used to adjust the slope of this line to one.
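This slope adjustment can be sketched as follows, with simulated log-intensities (all values invented) and an ordinary least-squares fit standing in for the regression step:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical log2 intensities for 500 genes measured in two closely related
# samples; sample_b suffers a systematic gain and offset (slope != 1).
sample_a = rng.uniform(6.0, 14.0, 500)
sample_b = 1.2 * sample_a + 0.5 + rng.normal(0.0, 0.2, 500)

# Fit sample_b = slope * sample_a + intercept, then invert the fit so that
# equally expressed genes cluster along the identity line (slope one).
slope, intercept = np.polyfit(sample_a, sample_b, 1)
sample_b_corrected = (sample_b - intercept) / slope
```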
However, it has been shown for both cDNA and high-density oligonucleotide experiments that the relationship between signal intensities is nonlinear [21]. In these cases, a robust local regression technique such as LOWESS correction is more suitable [22]. Some techniques rely on a sufficient number of non-differentially expressed genes, such as "housekeeping" genes or exogenous control genes that have been spiked into the RNA before labeling. However, if the number of pre-determined "housekeeping" genes is small, or if their intensities do not cover a range of different intensity levels, this approach is a poor choice for fitting normalization curves.
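A minimal sketch of such a local regression correction on invented MA data (M = log-ratio, A = mean log-intensity; numpy only). The tricube-weighted local linear fit below omits the iterative robustness reweighting of full LOWESS:

```python
import numpy as np

def local_linear_fit(x, y, frac=0.3):
    """Tricube-weighted local linear fit at each x (a simplified LOWESS)."""
    n = len(x)
    k = max(2, int(frac * n))                 # neighbors used per local fit
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]               # k nearest neighbors in x
        w = (1.0 - (d[idx] / d[idx].max()) ** 3) ** 3   # tricube weights
        coeffs = np.polyfit(x[idx], y[idx], 1, w=w)
        fitted[i] = np.polyval(coeffs, x[i])
    return fitted

rng = np.random.default_rng(4)
# Hypothetical MA data from one two-color array: the log-ratios M carry an
# intensity-dependent dye bias on top of random noise.
A = rng.uniform(4.0, 14.0, 300)
M = 0.4 * np.sin(A / 2.0) + rng.normal(0.0, 0.15, 300)

# LOWESS-style normalization subtracts the fitted trend so that log-ratios
# are centered on zero at every intensity level.
M_norm = M - local_linear_fit(A, M)
```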
Also, many of the so-called housekeeping genes do exhibit a natural variability in their expression level. Spiked-in controls can span a broad range of ratio and intensity levels and may be useful to detect systematic bias, but they cannot account for differences in the initial amount of RNA.

Comparison analysis. After normalization, the data for each gene are typically reported as an "expression ratio" (or its logarithm): the normalized value of the expression level for the experimental sample divided by the normalized value for the control sample if cDNA arrays have been used. For oligonucleotide arrays, the Average Difference (Affymetrix GeneChip Analysis Suite) or Signal (Affymetrix MAS 5.0) values are used as measures of absolute gene expression levels. For experiments involving a pair of conditions, the next step is to identify genes that are differentially expressed. Various techniques have been proposed for the selection of differentially expressed genes. Earlier studies have used arbitrary cutoff values, such as a twofold increase or decrease post-normalization, without providing the theoretical background for choosing this level as significant.

Fig. 1.3 Different views on the reproducibility of microarray measurements. In an Affymetrix microarray experiment involving three types of mice (CSX –/–, +/–, and +/+), with three independent replicates for each type, Panel A shows the standard deviation of each gene in each type of mouse (y-axis) plotted against the mean expression level (x-axis) for that gene in that group (there are 13,179 gene measurements and 3 types of mice, or 39,537 points). In this view, there appears to be a higher standard deviation (and thus higher irreproducibility) for genes measured at higher expression levels. Panel B shows the same points, with the y-axis now showing the standard deviation divided by the mean expression level. For most genes, the normalized standard deviation now appears to be the same across the higher mean gene expression levels, except for a few genes at a very low mean expression level with large standard deviation. Panel C shows the same graph, with the y-axis showing the logarithm of the standard deviation divided by the mean expression level. Here, most genes' standard deviation is around one-tenth of their mean expression, except for several genes with low expression levels. Panel D shows all genes from two of the replicate experiments, with each axis representing the expression measurement on each chip. Ideally, this plot should represent a line with slope 1. Instead, one sees a typical "fishtail" diagram, with lower expression levels seemingly having less reproducibility.
The inherent problem with this simple technique lies in the fact that the experimental and biological variability is far greater for genes that are expressed at low levels than for genes that are expressed at high levels, and that this variability differs from experiment to experiment (Fig. 1.3). Therefore, selecting significant genes based on an arbitrary fold change across the entire range of experimental data tends toward preferentially selecting genes that are expressed at low levels.

New data analysis techniques are continuously being developed, driven in part by the obvious need to move beyond setting arbitrary fold-change cut-off values, and because none of the existing techniques has found widespread acceptance in the community so far.
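The intensity-dependent variability problem can be made concrete with a small simulation (all numbers invented): no gene truly changes, yet a naive twofold cutoff flags mostly low-expressed genes.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes = 10_000

# True expression is identical in control and experiment (no real changes);
# measurement noise has the same absolute size at every expression level.
true_level = rng.uniform(50.0, 5000.0, n_genes)
noise_sd = 100.0
control = true_level + rng.normal(0.0, noise_sd, n_genes)
experiment = true_level + rng.normal(0.0, noise_sd, n_genes)

# Naive twofold cutoff on the expression ratio (clipping guards against
# non-positive denominators produced by noise at very low levels).
ratio = experiment / np.clip(control, 1.0, None)
called = (ratio > 2.0) | (ratio < 0.5)

low = true_level < 500.0
false_pos_low = called[low].mean()     # fraction of low-expressed genes called
false_pos_high = called[~low].mean()   # fraction of high-expressed genes called
```

With a fixed absolute noise level, almost all of the spurious "twofold changes" come from the low-expression genes, illustrating why a single fold-change cutoff across the whole intensity range is unreliable.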
Several approaches apply widely used parametric statistical tests such as Student's t-test [23] and ANOVA (ANalysis Of VAriance) [24], or non-parametric tests such as the Mann-Whitney U test [25] or the Kruskal-Wallis test (www.cardiogenomics.org), to every individual gene. However, due to the cost of microarray experiments, the number of replicates is usually low, which can lead to inaccurate estimates of variance.

The true power of microarray experiments comes not from the analysis of single experiments attempting to identify single gene expression changes or signaling pathways, but from analyzing many experiments that survey a variety of time points, phenotypes, or experimental conditions in order to identify global regulatory networks.
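A per-gene version of such a test can be sketched as follows. The Welch t statistic is computed directly with numpy on invented data; in a real analysis these statistics would be converted to p-values (e.g. with scipy.stats) and corrected for multiple testing:

```python
import numpy as np

def welch_t(x, y):
    """Per-gene Welch t statistics; x and y are (replicates, genes) arrays."""
    mean_x, mean_y = x.mean(axis=0), y.mean(axis=0)
    var_x = x.var(axis=0, ddof=1)
    var_y = y.var(axis=0, ddof=1)
    return (mean_x - mean_y) / np.sqrt(var_x / x.shape[0] + var_y / y.shape[0])

rng = np.random.default_rng(2)
n_genes = 1000

# Three replicate arrays per condition -- a typically small replicate number,
# which is why the per-gene variance estimates are noisy.
control = rng.normal(10.0, 1.0, (3, n_genes))
treated = rng.normal(10.0, 1.0, (3, n_genes))
treated[:, :5] += 4.0                 # the first five genes truly change

t = welch_t(treated, control)
```

With only three replicates per group, even unchanged genes can occasionally show large t statistics, which is exactly the inaccurate-variance problem noted above.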
Successful examples include genome-scale experiments identifying genes in the yeast mitotic cell cycle [1] and tumor classification [2]. In order to identify common patterns of gene expression from multiple hybridizations, more sophisticated clustering tools have to be used.

1.2.3 Clustering Algorithms

Identifying patterns of gene expression and grouping genes into expression classes provides much greater insight into their potential biological relevance than simple lists of up- and downregulated genes. Mathematical algorithms and computational tools to cluster the data are rapidly evolving, but no method has been described that seems universally suited to reveal biologically meaningful patterns in all data sets.
Indeed, it is becoming increasingly clear that every analysis method may reveal a different aspect of the data set. Consequently, asking the right questions and choosing the appropriate algorithms to answer these questions is a crucial element of a meaningful experimental design. Within the scope of this chapter, we can only briefly describe some of the most commonly used techniques. This overview can by no means be comprehensive, because new tools and algorithms are continuously being developed.

Current methodologies in functional genomics that use larger RNA expression data sets for clustering can be roughly divided into two categories: supervised learning (analysis to determine ways to accurately split into or predict groups of samples or diseases), and unsupervised learning (analysis looking for characterization of the components of a data set, without a priori input on cases or genes).
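As an illustration of the unsupervised category, the sketch below groups invented log-ratio time-course profiles into two expression classes with a minimal k-means (numpy only; a real analysis would use an established implementation and assess the choice of cluster number):

```python
import numpy as np

def kmeans(profiles, k, iters=50, seed=0):
    """Minimal k-means: rows of `profiles` are gene expression profiles."""
    rng = np.random.default_rng(seed)
    centers = profiles[rng.choice(len(profiles), size=k, replace=False)]
    for _ in range(iters):
        # Assign each gene to the nearest cluster center.
        dist = np.linalg.norm(profiles[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Move each center to the mean of its assigned profiles.
        for j in range(k):
            if np.any(labels == j):
                centers[j] = profiles[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(3)
# Two invented expression classes over six time points: one set of genes is
# progressively induced, the other progressively repressed.
induced = np.linspace(0.0, 3.0, 6) + rng.normal(0.0, 0.3, (40, 6))
repressed = np.linspace(0.0, -3.0, 6) + rng.normal(0.0, 0.3, (40, 6))
profiles = np.vstack([induced, repressed])

labels = kmeans(profiles, k=2)
```

Because the two classes are well separated, the algorithm recovers them; with real, noisier data the choice of distance measure, cluster number, and algorithm strongly influences the result, which is why no single method suits all data sets.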