Van Eyk, Dunn - Proteomic and Genomic Analysis of Cardiovascular Disease - 2003 (522919), page 8
In supervised learning, algorithms attempt to learn a concept from labeled training examples in order to predict the labels of the test set correctly. Within supervised learning, there are two classes of techniques: (1) single feature or sample determination (finding genes or samples that match a particular pattern, using nearest-neighbor [26] or t-tests), and (2) multiple feature determination (finding combinations of genes that match a particular a priori pattern, using decision trees [27], neural networks [28], or support vector machines [29–31]).

In unsupervised learning, algorithms are written to find patterns within a dataset, instead of trying to determine how best to predict a “correct answer.” Within unsupervised learning, there are three classes of techniques: (1) feature determination (determining genes with interesting properties without specifically looking for a particular pattern determined a priori, for example using principal component analysis [32–36]), (2) cluster determination (determining groups of genes or samples with similar patterns of gene expression, using nearest neighbor clustering [26, 37], self-organizing maps [38, 39], k-means clustering, or one- and two-dimensional dendrograms [40, 41]), and (3) network determination (determining graphs representing gene-gene or gene-phenotype interactions, using Boolean networks [42–44], Bayesian networks [45], and relevance networks [46–48]).

Cluster determination algorithms try to find genes that have similar expression levels and group them together into clusters.
To approach this problem mathematically, an expression vector is defined that represents each gene as a multidimensional point in “expression space”. In this view, each experiment represents a distinct axis, and the expression level of the gene in that experiment gives its geometric coordinate along that axis. With this definition in place, there are several ways to define a “distance” between two genes, ranging from the straightforward Euclidean distance and the correlation coefficient to mutual information.
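As a minimal sketch of two of the distance measures named above, the following treats each gene as a list of expression values, one per experiment. The function names and the sample values are hypothetical, and converting the correlation coefficient into a distance as 1 - r is a common convention assumed here, not something the text prescribes.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pearson_correlation(a, b):
    """Pearson correlation coefficient between two expression vectors.

    Undefined (division by zero) if either vector is constant.
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def correlation_distance(a, b):
    """Assumed convention: 1 - r, so 0 = same shape, 2 = mirror image."""
    return 1.0 - pearson_correlation(a, b)

# Hypothetical expression values; each position is one experiment (one axis).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, larger magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # mirror image of gene_a

print(euclidean_distance(gene_a, gene_b))
print(correlation_distance(gene_a, gene_b))
print(pearson_correlation(gene_a, gene_c))
```

In this toy data, gene_a and gene_b are far apart in Euclidean terms but have the same profile shape, so their correlation distance is essentially zero; gene_a and gene_c are perfectly anti-correlated, which only the correlation coefficient reports directly.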
Mathematical descriptions of various distance metrics are available as supplementary material to [17].

Before we cluster the data, it might be useful to re-scale the data in order to enhance specific aspects of the genes in an experiment. In a process called “mean centering”, each vector is re-scaled to set the average expression of each gene to 0. Consequently, we can easily identify up- or downregulation of each gene with respect to its average expression. If one agrees with the assumption that most genes are not changing between control and experimental samples, then mean centering is particularly useful.
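Mean centering as described above amounts to subtracting each gene's average from every value in its expression vector. A minimal sketch, with a hypothetical function name and made-up values:

```python
def mean_center(vector):
    """Shift one gene's expression vector so that its mean becomes 0."""
    mean = sum(vector) / len(vector)
    return [value - mean for value in vector]

# Hypothetical expression values for one gene across five experiments.
gene = [2.0, 2.5, 3.0, 1.5, 1.0]
centered = mean_center(gene)
print(centered)  # [0.0, 0.5, 1.0, -0.5, -1.0]
```

After centering, positive entries mark experiments in which the gene lies above its own average and negative entries mark experiments in which it lies below.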
Mean centering is also useful in time-series experiments, when we are more interested in following the variations from the average expression rather than in the absolute expression values at each time point.

1 Microarray Expression Profiling in Cardiovascular Disease

Hierarchical clustering algorithms have become the most popular tools for analyzing gene-expression data [40]. First, distances between points are calculated, with Euclidean distance measures being the most commonly used [49]. Related genes are thought to be closer to each other; therefore, the clustering process follows a simple principle:

0. Initial clusters = isolated data points
1. Update clusters by merging the two “nearest” clusters
2. Go to Step 1 until only one cluster is left.

The resulting cluster can be visualized as a single hierarchical tree that resembles a phylogenetic classification with a number of nested subsets. Hierarchical clustering assumes that the nature of the data structure is fundamentally hierarchical, as in evolutionary studies. However, it is questionable whether this is also true for complex microarray data. Another potential problem of methods that use Euclidean distances is the difficulty in finding genes that are negatively related to each other [48].

Examples of non-hierarchical clustering include k-means clustering, self-organizing maps (SOMs), and relevance networks.
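The three-step hierarchical (agglomerative) procedure listed above can be sketched as follows. The text does not say how the distance between two multi-gene clusters is measured, so average linkage (the mean of all pairwise Euclidean distances) is assumed here; the function names and data are hypothetical.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise distance between two clusters (assumed linkage rule)."""
    total = sum(euclidean(points[i], points[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(points):
    """Return the merge history as (members_a, members_b, distance) tuples."""
    # Step 0: every data point starts as its own cluster.
    clusters = [(i,) for i in range(len(points))]
    merges = []
    # Step 2: repeat until only one cluster is left.
    while len(clusters) > 1:
        # Step 1: find the two "nearest" clusters ...
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # ... and merge them into one.
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical expression vectors for four genes in two experiments.
genes = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
merges = agglomerate(genes)
for members_a, members_b, distance in merges:
    print(members_a, members_b, round(distance, 3))
```

The merge history is exactly the information a dendrogram draws: each merge is one branch point, and the merge distance is its height. This brute-force version recomputes all cluster distances on every pass and is meant only to make the steps concrete.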
In k-means clustering, genes are partitioned into a pre-defined number of different clusters, without trying to specify the relationship between individual genes [50]. It therefore requires prior knowledge of the number of clusters that are represented within the data set. Some groups use other tools, such as hierarchical tree algorithms, to first identify the optimal number of clusters before applying the k-means algorithm [51].
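A minimal sketch of k-means partitioning, to show where the pre-defined number of clusters k enters. Seeding the centers with the first k genes is a simplifying assumption made here for reproducibility (practical implementations usually randomize the starting centers); the function names and data are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iterations=100):
    """Partition points into k clusters; returns (labels, centers)."""
    # Assumption: use the first k points as the initial centers.
    centers = [list(p) for p in points[:k]]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each gene joins its nearest center.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical expression vectors for four genes in two experiments.
genes = [[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]]
labels, centers = kmeans(genes, k=2)
print(labels)
print(centers)
```

The result is a flat partition: each gene receives a cluster label, but nothing is said about how genes within or between clusters relate to one another, in contrast to a hierarchical tree.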
SOMs were initially developed to model complex data structures in biological neural networks [52], such as the topological relationship of neurons in the cerebral cortex of the brain. SOMs have been found to be well suited to clustering and visualizing large high-dimensional data sets, and therefore found widespread application in telecommunications, artificial speech, and speech recognition before being used for microarray data analysis [53]. Similar to k-means clustering, SOMs partition the genes into clusters based on their similarity in gene expression, with the additional constraint that the cluster centers are restricted to lie in a predefined topology.
Reference vectors are defined for each partition and are adjusted to best fit the expression vectors of the assigned genes. Relevance networks are based on comprehensive pairwise comparisons that can be used to find correlations between disparate biological measures, such as RNA expression and susceptibility to pharmaceuticals [48].

Algorithms like the ones cited above are referred to as “unsupervised” methods, because they don’t rely on any prior assumptions regarding the functions of the genes to be clustered. On the other hand, “supervised” methods, such as decision trees, nearest neighbor, and support vector machines (SVMs), use existing biological information about gene function as a “training data set” [30].
SVMs first learn to distinguish between different classes of genes within the training data set. Having learned the features of these classes, SVMs can then identify and classify unknown genes with similar features from gene expression data.

1.2 Computational Analysis of Microarray Data

Though there are now many clustering techniques available for the functional genomics researcher, it is still crucial to have a question or hypothesis in mind before selecting a technique. Questions such as “What uncategorized genes have an expression pattern similar to these genes that are well characterized?”, “How different is the pattern of expression of gene X from other genes?”, “What category of function might gene X belong to?”, and “What are all the gene-gene interactions present among these tissue samples?” implicitly guide the choice of the appropriate type of algorithm (supervised or unsupervised) as well as the specific selection.

Although cluster analysis is a powerful tool, no cluster analysis gives absolute answers.
Selecting different normalization strategies and distance metrics, or using different algorithms, can lead to the identification of completely different clusters, and many of them might not be biologically meaningful. Therefore, using our biological understanding of the system under study is essential in order to decide whether an analysis gave us valid data.

1.2.4 Data Sharing

Most microarray experiments interrogate thousands of genes on tens to hundreds of individual samples, each generating large lists of hundreds of differentially expressed genes. These data can be extremely complex.
We have seen in our datasets that different biological insights may be uncovered by different approaches to the analysis of the same data set. In most scientific publications of microarray data, the authors provide their own, sometimes subjective, interpretation of a subset of genes that they believe encompasses the important aspects of the study. However, the identity of genes whose expression levels do not change is often as important as that of those that do.
In addition, many researchers are only interested in one or a few genes; their interests are ill served if these genes do not make it into these “selected” gene lists. To make the most efficient use of microarray data, it is therefore important that the raw unprocessed data are available to others, as each researcher brings in a different perspective and different analytical methods that will help to extract insights beyond those identified by the original set of authors. Also, those data often serve as training data sets for the development of new cluster algorithms and other analytical tools [48].

Only the largest microarray laboratories have established their own databases (Tab. 1.1) [54], whereas microarray data accompanying publications are typically reported on the authors’ websites, if at all. Public microarray gene expression databases are being developed by the National Center for Biotechnology Information (NCBI; the Gene Expression Omnibus, GEO) and by the European Bioinformatics Institute (EBI) (see Tab. 1.1 for links).

However, the formats and annotations for data exchange that would allow the interested scientist to easily download the data, understand the experiment, and evaluate the data quality have not been established yet. In fact, most data come without any annotations at all, making them inaccessible to the general scientific community. At the first Microarray Gene Expression Database (MGED) meeting, which took place in November 1999 in Cambridge, UK, a working group was established under the same name to facilitate the adoption of standards for microarray experiments (Tab. 1.1). This group addresses the most critical issues related to the exchange of microarray data: 1) the formulation of the minimum information about a microarray experiment required to interpret and verify the results (MIAME); 2) the establishment of a data exchange format (MAGE-ML, for MicroArray Gene Expression Markup Language) and object model (MAGE-OM); 3) the development of ontologies for microarray experiment description and biological material annotation; and 4) the development of recommendations regarding experimental controls and data normalization methods.