Genome Project - Primer on Molecular Genetics - 1992 (522926), page 6
Actually, finding the gene is even more difficult, because even close up, the gene still looks like just another piece of hay. However, maps give clues on where to look; the finer the map’s resolution, the fewer pieces of hay to be tested.

Once the neighborhood of a gene of interest has been identified, several strategies can be used to find the gene itself. An ordered library of the gene neighborhood can be constructed if one is not already available.
This library provides DNA fragments that can be screened for additional polymorphisms, improving the genetic map of the region and further restricting the possible gene location. In addition, DNA fragments from the region can be used as probes to search for DNA sequences that are expressed (transcribed to RNA) or conserved among individuals. Most genes will have such sequences. Then individual gene candidates must be examined. For example, a gene responsible for liver disease is likely to be expressed in the liver and less likely in other tissues or organs.
This type of evidence can further limit the search. Finally, a suspected gene may need to be sequenced in both healthy and affected individuals. A consistent pattern of DNA variation when these two samples are compared will show that the gene of interest has very likely been found. The ultimate proof is to correct the suspected DNA alteration in a cell and show that the cell’s behavior reverts to normal.

Fig. 13. Cloning a Disease Gene by Chromosome Walking. After a marker is linked to within 1 cM of a disease gene, chromosome walking can be used to clone the disease gene itself. A probe is first constructed from a genomic fragment identified from a library as being the closest linked marker to the gene. A restriction fragment isolated from the end of the clone near the disease locus is used to reprobe the genomic library for an overlapping clone. This process is repeated several times to walk across the chromosome and reach the flanking marker on the other side of the disease-gene locus. (Source: see Fig. 11.)
[Figure ORNL-DWG 91M-17370. Labels: linked flanking marker (5'), disease gene, linked flanking marker (3'), genomic DNA fragment, probe. Steps shown: (1) a probe from the 5' flanking marker is used to identify an overlapping fragment from a genomic library; (2) probes from the 3' ends of cloned fragments are used to identify successive overlapping cloned fragments; (3) chromosome walking continues until a clone is identified that contains the 3' flanking marker.]

Model Organism Research

Most mapping and sequencing technologies were developed from studies of nonhuman genomes, notably those of the bacterium Escherichia coli, the yeast Saccharomyces cerevisiae, the fruit fly Drosophila melanogaster, the roundworm Caenorhabditis elegans, and the laboratory mouse Mus musculus.
These simpler systems provide excellent models for developing and testing the procedures needed for studying the much more complex human genome.

A large amount of genetic information has already been derived from these organisms, providing valuable data for the analysis of normal gene regulation, genetic diseases, and evolutionary processes. Physical maps have been completed for E. coli, and extensive overlapping clone sets are available for S. cerevisiae and C. elegans. In addition, sequencing projects have been initiated by the NIH genome program for E. coli, S. cerevisiae, and C. elegans.

Mouse genome research will provide much significant comparative information because of the many biological and genetic similarities between mouse and man. Comparisons of human and mouse DNA sequences will reveal areas that have been conserved during evolution and are therefore important.
An extensive database of mouse DNA sequences will allow counterparts of particular human genes to be identified in the mouse and extensively studied. Conversely, information on genes first found to be important in the mouse will lead to associated human studies. The mouse genetic map, based on morphological markers, has already led to many insights into human biology. Mouse models are being developed to explore the effects of mutations causing human diseases, including diabetes, muscular dystrophy, and several cancers.
A genetic map based on DNA markers is presently being constructed, and a physical map is planned to allow direct comparison with the human physical map.

Informatics: Data Collection and Interpretation

Collecting and Storing Data

The reference map and sequence generated by genome research will be used as a primary information source for human biology and medicine far into the future. The vast amount of data produced will first need to be collected, stored, and distributed. If compiled in books, the data would fill an estimated 200 volumes the size of a Manhattan telephone book (at 1000 pages each), and reading it would require 26 years working around the clock (Fig. 14).

Because handling this amount of data will require extensive use of computers, database development will be a major focus of the Human Genome Project. The present challenge is to improve database design, software for database access and manipulation, and data-entry procedures to compensate for the varied computer procedures and systems used in different laboratories.

HUMAN GENETIC DIVERSITY: The Ultimate Human Genetic Database
Any two individuals differ in about 3 x 10^6 bases (0.1%).
The population is now about 5 x 10^9.
A catalog of all sequence differences would require 15 x 10^15 entries.
This catalog may be needed to find the rarest or most complex disease genes.
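The figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below is only a back-of-the-envelope check and is not part of the original primer. It assumes that the genome length can be inferred from the statement that about 3 x 10^6 differing bases correspond to 0.1% of the genome; all variable names are illustrative.

```python
# Back-of-the-envelope check of the data-volume figures quoted in the text.
# Assumption: genome length is inferred from "3 x 10^6 bases (0.1%)".

DIFFERING_BASES = 3 * 10**6   # bases that differ between any two individuals
POPULATION = 5 * 10**9        # world population quoted in the text
VOLUMES = 200                 # telephone-book volumes
PAGES_PER_VOLUME = 1000
READING_YEARS = 26            # reading "around the clock"

# 0.1% is one part in 1000, so the whole genome is 1000 times larger.
genome_bases = DIFFERING_BASES * 1000

# One entry per differing base per person.
catalog_entries = DIFFERING_BASES * POPULATION

# How densely the books would have to be printed.
bases_per_page = genome_bases // (VOLUMES * PAGES_PER_VOLUME)

# Reading speed implied by 26 years without pause (365-day years).
seconds = READING_YEARS * 365 * 24 * 3600
bases_per_second = genome_bases / seconds

print(f"genome length:    {genome_bases:.1e} bases")
print(f"catalog entries:  {catalog_entries:.1e}")
print(f"bases per page:   {bases_per_page}")
print(f"reading speed:    {bases_per_second:.1f} bases per second")
```

Under these assumptions each page would hold 15,000 bases, the catalog size agrees with the 15 x 10^15 entries quoted in the sidebar, and the 26-year reading time corresponds to between three and four bases per second.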
Databases need to be designed that will accurately represent map information (linkage, STSs, physical location, disease loci) and sequences (genomic, cDNAs, proteins) and link them to each other and to bibliographic text databases of the scientific and medical literature.

Interpreting Data

New tools will also be needed for analyzing the data from genome maps and sequences. Recognizing where genes begin and end and identifying their exons, introns, and regulatory sequences may require extensive comparisons with sequences from related species such as the mouse to search for conserved similarities (homologies).
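The comparison of a sequence against those of related species can be illustrated with a deliberately simple sketch. It slides a query along each stored sequence without gaps and reports the best fraction of identical bases. This is a toy stand-in chosen for brevity, not the algorithm used by any database named in this primer. The sequences, the names mouse_gene_A and yeast_gene_B, and the 0.8 identity threshold are all made up for illustration.

```python
from __future__ import annotations


def best_ungapped_match(query: str, subject: str) -> tuple[int, float]:
    """Return (offset, identity) for the best gap-free placement of query in subject.

    Identity is the fraction of query positions that match the subject.
    Returns (-1, 0.0) if the query is longer than the subject or nothing matches.
    """
    best_offset, best_identity = -1, 0.0
    for offset in range(len(subject) - len(query) + 1):
        window = subject[offset:offset + len(query)]
        matches = sum(q == s for q, s in zip(query, window))
        identity = matches / len(query)
        if identity > best_identity:
            best_offset, best_identity = offset, identity
    return best_offset, best_identity


def search_database(query: str, database: dict[str, str],
                    min_identity: float = 0.8) -> list[tuple[str, int, float]]:
    """Return (name, offset, identity) for every entry at or above the threshold."""
    hits = []
    for name, subject in database.items():
        offset, identity = best_ungapped_match(query, subject)
        if identity >= min_identity:
            hits.append((name, offset, identity))
    # Strongest candidate homologs first.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# Hypothetical data: a short "human" query and two invented database entries.
database = {
    "mouse_gene_A": "TTGACCATGGCTAGCTAGGATCCA",
    "yeast_gene_B": "GGGCCCAAATTTGGGCCCAAATTT",
}
query = "ATGGCTAGCAAG"

hits = search_database(query, database)
for name, offset, identity in hits:
    print(f"{name}: offset {offset}, identity {identity:.2f}")
```

Here only the invented mouse entry passes the threshold, matching 11 of the 12 query bases. A practical search would also have to allow insertions and deletions and to judge whether a match is better than chance, which this sketch does not attempt.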
Searching a database for a particular DNA sequence may uncover these homologous sequences in a known gene from a model organism, revealing insights into the function of the corresponding human gene.

Correlating sequence information with genetic linkage data and disease gene research will reveal the molecular basis for human variation. If a newly identified gene is found to code for a flawed protein, the altered protein must be compared with the normal version to identify the specific abnormality that causes disease. Once the error is pinpointed, researchers must try to determine how to correct it in the human body, a task that will require knowledge about how the protein functions and in which cells it is active.

Fig. 14. Magnitude of Genome Data. If the DNA sequence of the human genome were compiled in books, the equivalent of 200 volumes the size of a Manhattan telephone book (at 1000 pages each) would be needed to hold it all. New data-analysis tools will be needed for understanding the information from genome maps and sequences.
[Figure ORNL-DWG 91M-17472. Human genome: 200 telephone books (1000 pages each). Model organism genomes: Drosophila (fruit fly), 10 books; yeast, 1 book; E. coli (bacterium), 300 pages; yeast chromosome 3 (longest continuous sequence now known), 14 pages.]

Correct protein function depends on the three-dimensional (3-D), or folded, structure the proteins assume in biological environments; thus, understanding protein structure will be essential in determining gene function.
DNA sequences will be translated into amino acid sequences, and researchers will try to make inferences about functions either by comparing protein sequences with each other or by comparing their specific 3-D structures (Fig. 15). Because the 3-D structure patterns (motifs) that protein molecules assume are much more evolutionarily conserved than amino acid sequences, this type of homology search could prove more fruitful.
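The first step mentioned above, translating a DNA sequence into an amino acid sequence and comparing two protein sequences, can be sketched as follows. The sketch assumes the standard genetic code and a coding-strand sequence that starts in frame; the two short gene fragments are invented for illustration and do not correspond to any real gene.

```python
from __future__ import annotations

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). "*" marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate an in-frame coding sequence, stopping at the first stop codon."""
    protein = []
    for start in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3].upper()]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def differences(normal: str, altered: str) -> list[tuple[int, str, str]]:
    """List (1-based position, normal residue, altered residue) where proteins differ.

    Only substitutions are handled; proteins of unequal length are compared
    over their shared prefix.
    """
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(normal, altered), start=1)
        if a != b
    ]


# Hypothetical gene fragments differing by a single base (A -> G).
normal_gene = "ATGGCTAGCAAGTAA"
altered_gene = "ATGGCTAGCGAGTAA"

normal_protein = translate(normal_gene)
altered_protein = translate(altered_gene)
print(normal_protein, altered_protein, differences(normal_protein, altered_protein))
```

In this invented example a single base change replaces lysine (K) with glutamate (E) at position 4. Comparing 3-D structures, which the text expects to be more informative, requires structural data and is beyond a sketch of this size.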
Particular motifs may serve similar functions in several different proteins, information that would be valuable in genome analyses.

Currently, however, only a few protein motifs can be recognized at the sequence level. Continued development of analytic capabilities to facilitate grouping protein sequences into motif families will make homology searches more successful.

Fig. 15. Understanding Gene Function. Understanding how genes function will require analyses of the 3-D structures of the proteins for which the genes code.
[Figure ORNL-DWG 91M-17473. Labels: gene, protein structure, function.]

Mapping Databases

The Genome Data Base (GDB), located at Johns Hopkins University (Baltimore, Maryland), provides location, ordering, and distance information for human genetic markers, probes, and contigs linked to known human genetic disease.
GDB is presently working on incorporating physical mapping data. Also at Hopkins is the Online Mendelian Inheritance in Man database, a catalog of inherited human traits and diseases.

The Human and Mouse Probes and Libraries Database (located at the American Type Culture Collection in Rockville, Maryland) and the GBASE mouse database (located at Jackson Laboratory, Bar Harbor, Maine) include data on RFLPs, chromosomal assignments, and probes from the laboratory mouse.

Sequence Databases

Nucleic Acids (DNA and RNA)

Public databases containing the complete nucleotide sequence of the human genome and those of selected model organisms will be one of the most useful products of the Human Genome Project.
Four major public databases now store nucleotide sequences: GenBank and the Genome Sequence DataBase (GSDB) in the United States, the European Molecular Biology Laboratory (EMBL) Nucleotide Sequence Database in the United Kingdom, and the DNA Data Bank of Japan (DDBJ). The databases collaborate to share sequences, which are compiled from direct author submissions and journal scans. The four databases now house a total of almost 200 Mb of sequence.
Although human sequences predominate, more than 8000 species are represented. [Paragraph updated July 1994]

Proteins

The major protein sequence databases are the Protein Identification Resource (National Biomedical Research Foundation), Swissprot, and GenPept (both distributed with GenBank). In addition to sequence information, they contain information on protein motifs and other features of protein structure.

Impact of the Human Genome Project

The atlas of the human genome will revolutionize medical practice and biological research into the 21st century and beyond.