Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions
This report presents the results of a systematic empirical annotation of mRNA products from 399 annotated protein-coding loci across the 1% of the human genome targeted by the Encyclopedia of DNA Elements (ENCODE) pilot project, using a combination of 5' rapid amplification of cDNA ends (RACE) and high-density tiling arrays. RACE allows detection of low-copy-number transcripts and isoforms and a high-resolution analysis of individual genes, while pooling strategies and array hybridization provide a high-throughput readout. For the vast majority (81.5%) of the tested genes, we identified previously unannotated and often tissue- or cell-line-specific transcribed fragments (RACEfrags), both 5' distal to the annotated 5' terminus and internal to the annotated gene bounds. Half of the distal RACEfrags span large segments of genomic sequence away from the main portion of the coding transcript and often overlap with the upstream annotated gene(s). These novel exons have a lower GC content than annotated exons. Notably, more than 50% of the novel transcripts resulting from inclusion of novel exons have changes in their open reading frames. A significant fraction of distal RACEfrags show expression levels comparable to those of known exons of the same locus, suggesting that they are not part of very minor splice forms. These results may require a revision of our current understanding of the architecture of protein-coding genes. They have significant implications for our views on the locations of regulatory regions in the genome and for the interpretation of sequence polymorphisms mapping to regions hitherto considered "non-coding", and ultimately for the identification of disease-related sequence alterations.