Comparative Genomics and Bioinformatics at Penn State University:
Using the Mouse and Rat Genome Sequences to Find Function in the Human Genome
The DNA sequence of a genome records all the information needed for an
organism to develop from a fertilized egg to an adult. However, we do
not know all the important DNA sequences for any organism, and this is
a particularly difficult challenge for the human genome because of its
large size. Many different research groups are making good strides in
finding all the protein-coding genes, but even once these are all
known, it is unlikely that they will account for more than about 1.5%
of the genome. An even more difficult problem is finding the functional
sequences in the non-coding DNA, which makes up over 98% of the genome.
Currently, the best way to find such non-coding functional sequences is
to compare the genome with the genome sequences of related species:
functionally important sequences should change less than neutral
sequences between species.
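
The comparison idea above can be illustrated with a minimal sketch. It
assumes two sequences that have already been aligned (gaps written as
"-") and reports the fraction of identical columns in sliding windows;
windows with unusually high identity are candidates for sequences that
change slowly. The function name, window sizes, and toy sequences are
illustrative assumptions, not the software described in this article.

```python
def window_identity(seq_a, seq_b, window=10, step=5):
    """Return (start, fraction_identical) for each window of an aligned pair."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(seq_a) - window + 1, step):
        columns = zip(seq_a[start:start + window], seq_b[start:start + window])
        # A column counts as a match only if both bases agree and it is not a gap.
        matches = sum(1 for a, b in columns if a == b and a != "-")
        results.append((start, matches / window))
    return results


# Toy aligned sequences, made up for illustration only.
human = "ACGTACGTACGGTTCAGCTA"
mouse = "ACGTACGTACTGATC-GATA"
for start, identity in window_identity(human, mouse):
    print(start, round(identity, 2))
```

Real whole-genome comparisons must first compute the alignment itself,
which is the hard part; this sketch covers only the scoring step.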
The research groups headed by Webb Miller (Departments of Biology and
of Computer Science and Engineering) and Ross Hardison (Department of
Biochemistry and Molecular Biology) at Penn State University have
collaborated since about 1989 to develop software for aligning long
genomic sequences and analyzing the results. Together with Francesca
Chiaromonte (Department of Statistics), they joined the Mouse Genome
Sequencing Consortium in 2001 with a challenging but important goal:
to align the entire human genome (almost 3 billion nucleotides) with
the entire mouse genome (about 2.5 billion nucleotides) at high
sensitivity and specificity, and then to analyze those alignments to
find likely functional DNA sequences. Working with collaborators at the
University of California at Santa Cruz, headed by David Haussler and
Jim Kent, they computed the whole-genome alignments and made them
public shortly after the mouse genome was assembled.
In 2003, they joined the Rat Genome Sequencing Consortium to compute
and analyze alignments among the human, mouse, and rat genomes. The
3-way alignments are even more challenging to compute, but these too
were completed shortly after the assemblies were released, and the
results were made public at both the UCSC Genome Browser
(http://genome.ucsc.edu)
and the PSU Genome Alignment and Annotation Database
(GALA at http://www.bx.psu.edu).
Analyses of the aligned genome sequences are prominently featured both
in the major mouse genome paper in Nature (Waterston et al., 2002,
Nature 420:520-562) and in the major rat paper that is scheduled for
the April 1, 2004 issue of Nature, as well as in several companion
papers in Genome Research in 2003 and 2004. Some of the most important
conclusions from examining the whole genome alignments are summarized
here. The rate of evolution varies substantially from region to region
across the genome, and this variability is seen in the rates for most
of the mechanisms by which DNA changes (nucleotide substitutions,
deletions, insertions, and recombination). By incorporating this rate
variation into analyses of the aligned DNA, the segments of DNA more
likely to be functional (under selection) can be identified with
greater precision. Indeed, a minimal estimate of the proportion of the
human genome that is under selection is about 6%. Thus most of our
genome is not functional (as had long been suspected), but the
functional portion is substantially greater than the amount coding for
protein. A different approach based on training sets of known
regulatory regions has been developed to predict which of the regions
under selection are likely to be involved in gene regulation. These DNA
segments are now being tested for their effects in cell transfection
and expression experiments, with a good success rate. These and similar
studies should deepen our understanding of genome evolution, improve
the prediction of functional DNA segments (both coding and non-coding),
and provide information in public databases that will accelerate
the progress of all aspects of the biosciences.
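
The effect of accounting for regional rate variation can be sketched
with a toy calculation. The example below compares the identity
observed in an aligned window with the identity expected for neutral
DNA in the surrounding region and expresses the difference in units of
binomial standard error. It is only loosely modeled on the kind of
normalized conservation score used in such analyses; the function
name, the normal-approximation formula, and all numbers are
illustrative assumptions rather than the published method.

```python
import math


def conservation_score(matches, columns, neutral_identity):
    """Score how far a window's identity exceeds the local neutral expectation.

    matches          -- number of identical aligned columns in the window
    columns          -- total number of aligned columns in the window
    neutral_identity -- expected identity for neutral DNA in this region (0..1)
    """
    if columns <= 0:
        raise ValueError("columns must be positive")
    if not 0.0 < neutral_identity < 1.0:
        raise ValueError("neutral_identity must be strictly between 0 and 1")
    observed = matches / columns
    # Standard error of the identity under a binomial model of neutral change.
    std_error = math.sqrt(neutral_identity * (1.0 - neutral_identity) / columns)
    return (observed - neutral_identity) / std_error


# Two hypothetical 50-column windows, both 80% identical, in regions that
# evolve at different neutral rates (values invented for illustration).
fast_region = conservation_score(40, 50, neutral_identity=0.60)
slow_region = conservation_score(40, 50, neutral_identity=0.75)
print(round(fast_region, 2), round(slow_region, 2))
```

The same raw identity stands out more in the fast-evolving region than
in the slow-evolving one, which is why a single genome-wide threshold
on percent identity would be less precise than a locally adjusted score.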
This work is now being pursued in the Center for Comparative Genomics
and Bioinformatics at PSU. This Center and the Bioinformatics
Consulting Center (headed by Drs. James Rosenberger, Naomi Altman, and
Izabela Makalowska) are the major bioinformatics groups within the
Institute for Genomics, Proteomics and Bioinformatics in the Huck
Institutes of the Life Sciences.
February, 2004