Genomic Approaches to Genetics
The sequencing of the human genome and many other genomes has made it possible to rapidly generate other large 'genomic' data sets that can catalyze advances in understanding biology. Some of these catalogues are already well underway, including accurate representations of most human cDNAs (the Mammalian Gene Collection project) and the genotyping of millions of common human single-base variants (the HapMap project), while others are just beginning to be developed. In addition to its involvement in more than 20 genome projects, the BCM-Human Genome Sequencing Center has been a key contributor to each of these efforts, and currently generates about 3.5 Gb of raw data each month.
An important new data set that needs to be generated is the collection of alleles with a known functional association with human disease or other phenotypes. So far, about 2,000 of the >5,000 Mendelian disease loci, and a handful of alleles that contribute to common disease, have been identified. For the majority of human genes, therefore, no association between variation and a recognizable phenotype is yet known. While ongoing systematic and focused studies of the genetic basis of individual diseases will continue to fill this gap, there is an opportunity to speed this path to discovery.
The increased efficiency of large-scale data generation, together with specific developments in sequencing platform technology, suggests that a large catalog of low-frequency human genetic variation could be constructed directly through deep re-sequencing of large sample sets. Because sequence-based mutation discovery necessarily includes independent validation of changes using alternative assays, it is technically practical to separate mutation discovery from the tasks of mutation validation and measurement of the distribution of changes in key cohorts. We therefore propose a population-based approach to this large-scale mutation discovery. In this proposal, the key issues are the actual distribution of frequencies of relatively rare changes and the role of allelic heterogeneity in disease.
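To give a rough sense of why sample size matters for discovering rare changes, the sketch below computes the chance that an allele at a given population frequency appears at least once in a re-sequenced sample. It is an illustration only, not the Center's method: it assumes diploid individuals whose chromosomes are independent random draws from the population, and perfect detection with no sequencing error. The function names `detection_probability` and `samples_needed` are hypothetical.

```python
import math


def detection_probability(allele_freq: float, n_individuals: int) -> float:
    """Chance that an allele is present at least once among 2N chromosomes.

    Assumes independent sampling of chromosomes and perfect detection;
    real re-sequencing sensitivity would be lower.
    """
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    # Probability that every one of the 2N chromosomes lacks the allele.
    prob_absent = (1.0 - allele_freq) ** (2 * n_individuals)
    return 1.0 - prob_absent


def samples_needed(allele_freq: float, target_prob: float) -> int:
    """Smallest number of diploid individuals reaching the target probability."""
    if not 0.0 < allele_freq < 1.0:
        raise ValueError("allele_freq must be strictly between 0 and 1")
    if not 0.0 < target_prob < 1.0:
        raise ValueError("target_prob must be strictly between 0 and 1")
    # Solve 1 - (1 - p)^(2N) >= target for N, then round up.
    chromosomes = math.log(1.0 - target_prob) / math.log(1.0 - allele_freq)
    return math.ceil(chromosomes / 2)


if __name__ == "__main__":
    # Illustrative frequencies only; not values taken from the text.
    for freq in (0.01, 0.001):
        n = samples_needed(freq, 0.95)
        print(f"frequency {freq}: {n} individuals for a 95% chance of detection")
```

Under these simplifying assumptions, the required sample size grows roughly in inverse proportion to the allele frequency, which is why the actual frequency distribution of rare changes bears so directly on the design of a population-based approach.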