Penn State University Center for Comparative Genomics and Bioinformatics

Comparative Genomics and Bioinformatics at Penn State University:
Using the Mouse and Rat Genome Sequences to Find Function in the Human Genome


The DNA sequence of a genome records all the information needed for an organism to develop from a fertilized egg to an adult. However, we do not yet know all the important DNA sequences for any organism, and identifying them is particularly difficult for the human genome because of its large size. Many research groups are making good progress toward finding all the protein-coding genes, but even once these are all known, they are unlikely to account for more than about 1.5% of the genome. An even more difficult problem is finding the functional sequences in the non-coding DNA, which makes up over 98% of the genome. Currently, the best way to find such non-coding functional sequences is to compare the genome with the genome sequences of related species. Because changes to important sequences tend to be removed by selection, those sequences should change less than neutral sequences in inter-species comparisons.
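
As a rough illustration of that idea, the sketch below scores fixed-size windows of a pairwise alignment by the fraction of identical bases. The function name `window_identity`, the window size, and the toy sequences are assumptions made for illustration only; this is not the software or the scoring scheme used for the whole-genome alignments.

```python
def window_identity(seq_a, seq_b, window=10):
    """Return the fraction of identical bases in each non-overlapping window.

    seq_a and seq_b are the two rows of a pairwise alignment: equal length,
    with '-' marking gaps. Columns with a gap in either row are skipped.
    A window with no gap-free columns yields None.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    scores = []
    for start in range(0, len(seq_a), window):
        # Keep only the columns where both species have a base.
        pairs = [
            (a, b)
            for a, b in zip(seq_a[start:start + window], seq_b[start:start + window])
            if a != "-" and b != "-"
        ]
        if not pairs:
            scores.append(None)
            continue
        matches = sum(1 for a, b in pairs if a.upper() == b.upper())
        scores.append(matches / len(pairs))
    return scores


# Toy alignment (hypothetical data, not real genomic sequence).
human = "ACGTACGTACTTGACCA-GT"
mouse = "ACGTACGTACTAGTCGATGA"
print(window_identity(human, mouse))
```

In this toy case the first window is perfectly conserved and the second is not, which is the kind of contrast a comparative analysis looks for, although on its own a high identity does not show that a segment is functional.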

The research groups headed by Webb Miller (Departments of Biology and of Computer Science and Engineering) and Ross Hardison (Department of Biochemistry and Molecular Biology) at Penn State University have been collaborating since about 1989 to develop software for aligning long genomic sequences and analyzing the results. Together with Francesca Chiaromonte (Department of Statistics), they joined the Mouse Genome Sequencing Consortium in 2001 to meet a challenging but important goal: to align the entire human genome (almost 3 billion nucleotides) with the entire mouse genome (about 2.5 billion nucleotides) at high sensitivity and specificity, and then to analyze those alignments to find likely functional DNA sequences. Working with collaborators at the University of California at Santa Cruz, headed by David Haussler and Jim Kent, they computed the whole-genome alignments and made them public shortly after the mouse genome was assembled. In 2003, they joined the Rat Genome Sequencing Consortium to compute and analyze alignments among the human, mouse, and rat genomes. The 3-way alignments are even more challenging to compute, but again they were completed shortly after the assemblies were released, and the results were made public at both the UCSC Genome Browser (http://genome.ucsc.edu) and the PSU Genome Alignment and Annotation Database (GALA, at http://www.bx.psu.edu).

Analyses of the aligned genome sequences are featured prominently both in the major mouse genome paper in Nature (Waterston et al., 2002, Nature 420:520-562) and in the major rat genome paper scheduled for the April 1, 2004 issue of Nature, as well as in several companion papers in Genome Research in 2003 and 2004. Some of the most important conclusions from examining the whole-genome alignments are summarized here. The rate of evolution varies substantially from region to region across the genome, and this variability is seen in the rates of most of the mechanisms by which DNA changes (nucleotide substitutions, deletions, insertions, and recombination). By incorporating this rate variation into analyses of the aligned DNA, the segments of DNA more likely to be functional (under selection) can be identified with greater precision. Indeed, a minimal estimate of the proportion of the human genome that is under selection is about 6%. Thus most of our genome is not functional (as had long been suspected), but the functional portion is substantially greater than the amount coding for protein. A different approach, based on training sets of known regulatory regions, has been developed to predict which of the regions under selection are likely to be involved in gene regulation. These DNA segments are now being tested for their effects in cell transfection and expression experiments, with a good success rate. These and similar studies should deepen our understanding of genome evolution, improve the prediction of functional DNA segments (both coding and non-coding), and provide information in public databases that will accelerate progress in all aspects of the biosciences.
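
To see why regional rate variation matters when looking for segments under selection, consider the minimal sketch below. It compares a window's observed identity with the identity expected for neutral DNA in the surrounding region, using a simple binomial z-score. The binomial model, the function name `conservation_z`, and all the numbers are assumptions chosen for illustration; the sketch should not be read as the analysis reported in the papers cited above.

```python
import math


def conservation_z(identical, aligned, neutral_identity):
    """z-score of a window's identity against a local neutral expectation.

    identical        -- number of aligned columns with the same base
    aligned          -- number of gap-free aligned columns in the window
    neutral_identity -- expected identity for neutral DNA in the surrounding
                        region (assumed to be estimated separately)
    """
    if aligned <= 0:
        raise ValueError("window must contain at least one aligned column")
    if not 0.0 < neutral_identity < 1.0:
        raise ValueError("neutral_identity must lie strictly between 0 and 1")
    observed = identical / aligned
    # Standard error of the identity under a binomial model of neutral change.
    std_err = math.sqrt(neutral_identity * (1.0 - neutral_identity) / aligned)
    return (observed - neutral_identity) / std_err


# The same window (45 of 50 columns identical) in two hypothetical regions.
fast_region = conservation_z(45, 50, neutral_identity=0.60)
slow_region = conservation_z(45, 50, neutral_identity=0.80)
print(fast_region, slow_region)
```

Under these assumptions the identical window stands out much more in the fast-evolving region than in the slow-evolving one, so a single genome-wide identity threshold would treat the two cases alike and lose precision.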

This work is now being pursued in the Center for Comparative Genomics and Bioinformatics at PSU. This Center and the Bioinformatics Consulting Center (headed by Drs. James Rosenberger, Naomi Altman, and Izabela Makalowska) are the major bioinformatics groups within the Institute for Genomics, Proteomics and Bioinformatics in the Huck Institutes of the Life Sciences.

February, 2004

