CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences.
Document Type
Article
Publication Date
2-14-2012
Keywords
Algorithms, Chromatin Immunoprecipitation, Codon, DNA, Genomics, Humans, Nucleotide Motifs, Open Reading Frames, RNA, Software, Transcription Factors, Untranslated Regions
JAX Source
BMC Bioinformatics 2012 Feb 14; 13:32.
PMID
22333114
Volume
13
First Page
32
Last Page
32
ISSN
1471-2105
Abstract
BACKGROUND: It has been increasingly appreciated that coding sequences harbor regulatory sequence motifs in addition to encoding protein. These sequence motifs are expected to be overrepresented in nucleotide sequences bound by a common protein or small RNA. However, detecting overrepresented motifs has been difficult because of interference by constraints at the protein level. Sampling-based approaches that address this problem through codon shuffling have been limited by their ability to explore only an infinitesimal fraction of the sequence space and by their reliance on parametric approximations.
RESULTS: We present a novel O(N (log N)^2)-time algorithm, CodingMotif, to identify nucleotide-level motifs of unusual copy number in protein-coding regions. Using a new dynamic programming algorithm, we are able to exhaustively calculate the distribution of the number of occurrences of a motif over all possible coding sequences that encode the same amino acid sequence, given a background model for codon usage and dinucleotide biases. Our method takes advantage of the sparseness of loci where a given motif can occur, greatly speeding up the required convolution calculations. Knowledge of the distribution allows one to compute an exact non-parametric p-value for whether a given motif is over- or under-represented. We demonstrate that our method identifies known functional motifs more accurately than sampling-based and parametric approaches in a variety of coding datasets of various sizes, including ChIP-seq data for the transcription factors NRSF and GABP.
CONCLUSIONS: CodingMotif provides a theoretically and empirically demonstrated advance in the detection of motifs overrepresented in coding sequences. We expect CodingMotif to be useful for identifying motifs in functional genomic datasets such as DNA-protein binding, RNA-protein binding, or microRNA-RNA binding within coding regions. A software implementation is available at http://bioinformatics.bc.edu/chuanglab/codingmotif.tar.
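As an illustration of the idea summarized in RESULTS (an exact distribution of motif counts over every coding sequence that encodes the same amino acid sequence), the following is a minimal Python sketch. It is not the CodingMotif implementation and does not reproduce its complexity bound: it assumes a codon-usage-only background (no dinucleotide biases), uses a plain dynamic program rather than sparse convolutions, and relies on a truncated codon table. The names `SYNONYMOUS_CODONS`, `motif_count_distribution`, and `upper_tail_p_value` are hypothetical and chosen only for this example.

```python
from collections import defaultdict

# Hypothetical, truncated synonymous-codon table: only the amino acids used in the demo.
SYNONYMOUS_CODONS = {
    "M": ["ATG"],
    "K": ["AAA", "AAG"],
    "F": ["TTT", "TTC"],
    "E": ["GAA", "GAG"],
}


def count_occurrences(seq, motif):
    """Count (possibly overlapping) occurrences of motif in seq."""
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))


def motif_count_distribution(protein, motif, codon_weights=None):
    """Exact distribution of motif counts over all synonymous encodings of protein.

    Assumption: codons are drawn independently per position, with probability
    proportional to codon_weights (uniform over synonymous codons if None).
    DP state: (last len(motif)-1 nucleotides, motif count so far) -> probability.
    """
    if not motif:
        raise ValueError("motif must be non-empty")
    k = len(motif)
    states = {("", 0): 1.0}
    for aa in protein:
        codons = SYNONYMOUS_CODONS[aa]
        if codon_weights is None:
            weights = [1.0] * len(codons)
        else:
            weights = [codon_weights[c] for c in codons]
        total = sum(weights)
        next_states = defaultdict(float)
        for (suffix, count), prob in states.items():
            for codon, weight in zip(codons, weights):
                window = suffix + codon
                # The stored suffix is shorter than the motif, so every hit in
                # the window ends inside the newly appended codon and is new.
                new_hits = count_occurrences(window, motif)
                new_suffix = window[-(k - 1):] if k > 1 else ""
                next_states[(new_suffix, count + new_hits)] += prob * weight / total
        states = next_states
    # Marginalize over the suffix to obtain the distribution of counts.
    dist = defaultdict(float)
    for (_, count), prob in states.items():
        dist[count] += prob
    return dict(dist)


def upper_tail_p_value(dist, observed):
    """P(count >= observed) under the exact distribution (overrepresentation)."""
    return sum(p for c, p in dist.items() if c >= observed)


# Demo: one observed encoding of the peptide MKFE and the motif GAA.
observed_seq = "ATGAAATTTGAA"
dist = motif_count_distribution("MKFE", "GAA")
observed = count_occurrences(observed_seq, "GAA")
print(dist, observed, upper_tail_p_value(dist, observed))
```

Because the state keeps only the trailing nucleotides and the running count, motif occurrences that straddle codon boundaries are counted without enumerating every synonymous sequence. Adding dinucleotide biases would require a background model that conditions on the preceding nucleotide, which this sketch leaves out.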
Recommended Citation
Ding Y, Lorenz W, Chuang J.
CodingMotif: exact determination of overrepresented nucleotide motifs in coding sequences. BMC Bioinformatics 2012 Feb 14; 13:32.