<back> iDMMPMM procedure details

Motif comparison table description

We used AhoPro to compare the quality of different motifs on different sequence sets. AhoPro computes the P-value, i.e. the probability of observing one or more motif occurrences scoring above a given threshold in a random sequence of a given length. The random sequence length was set to 3 × (maximum motif length) - 2 to take possible self-overlap effects into account. For the random sequence we used a Bernoulli model with nucleotide probabilities estimated from the Drosophila genome (April 2004, dm2; dmel40 in our notation). By varying the PWM threshold, one can plot the P-value against the number of sequences whose best site scores at or above the threshold corresponding to that P-value. The resulting plot can be treated as a ROC curve (see Kulakovskiy et al., 2009 for details). For each sequence set, the 'overfitted' motif built from that set was excluded from the comparison.
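The procedure above can be illustrated with a minimal Python sketch. It is not AhoPro and does not reproduce its algorithm: the P-value is obtained here by brute-force enumeration of all random sequences, which is feasible only for very short motifs. The function names, the representation of a PWM as a list of per-position score dictionaries, and the single-strand scan are assumptions made for illustration. The nucleotide probabilities are passed in as a parameter and no dm2 estimates are hard-coded.

```python
from itertools import product

ALPHABET = "ACGT"


def random_sequence_length(max_motif_length):
    """Random sequence length as described in the text: 3 * L - 2."""
    return 3 * max_motif_length - 2


def best_score(pwm, seq):
    """Highest PWM score over all windows of seq (one strand only, an assumption)."""
    width = len(pwm)
    return max(
        (sum(col[base] for col, base in zip(pwm, seq[i:i + width]))
         for i in range(len(seq) - width + 1)),
        default=float("-inf"),  # sequence shorter than the motif
    )


def p_value_bruteforce(pwm, threshold, probs, length):
    """Probability that a Bernoulli random sequence of the given length
    contains at least one site scoring >= threshold.

    Enumerates all 4**length sequences, so use only with tiny motifs.
    """
    total = 0.0
    for word in product(ALPHABET, repeat=length):
        if best_score(pwm, word) >= threshold:
            p = 1.0
            for base in word:
                p *= probs[base]
            total += p
    return total


def roc_points(pwm, sequences, thresholds, probs, length):
    """For each threshold, pair the P-value with the number of sequences
    whose best site scores at or above that threshold."""
    points = []
    for t in thresholds:
        hits = sum(1 for s in sequences if best_score(pwm, s) >= t)
        points.append((p_value_bruteforce(pwm, t, probs, length), hits))
    return points
```

Calling `roc_points` over a descending range of thresholds yields the (P-value, sequence count) pairs that make up one curve. For realistic motif lengths the enumeration would have to be replaced by an exact method such as the one AhoPro implements.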


Column names:
The 'tot' column gives the total number of points used in the comparison (at most the sequence count minus one). This number is limited by:
- the fixed threshold of each motif (the calculation stops for a motif once its fixed threshold is reached)
- the maximum allowed P-value (we used an upper bound of 0.1)
The 'imm' column gives the number of cases where the integrated motif ('imm', built using all available sources) beats all other motifs.
The 'exc' column gives the number of cases where the 'except' motif (built by integrating all sequence sets except the one used for the quality test) beats all single-set-based motifs.


Columns are empty where no data were available for the given factor from the selected source. Columns contain -1 where only two data sources are available: in that case the 'except' motif coincides with a single-source motif, so comparing 'imm', 'except' and single-source-based motifs is not meaningful.

Fixed motif thresholds were set to the mean plus 3 s.d. of the PWM score distribution over all possible words. See Kulakovskiy et al., 2009 for details.
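As a rough illustration of such a threshold, the Python sketch below computes the mean plus 3 s.d. of PWM scores without enumerating words. It assumes that all words of the motif length are weighted equally and that the population standard deviation is meant. Under these assumptions the PWM columns contribute independently, so the column means and variances simply add up. The function name and the PWM representation (a list of per-position score dictionaries) are hypothetical.

```python
import math


def fixed_threshold(pwm, n_sd=3.0):
    """Mean + n_sd * s.d. of the PWM score over all equally weighted words.

    Positions are independent under the uniform word distribution,
    so the score mean and variance are sums over columns.
    """
    mean = 0.0
    variance = 0.0
    for column in pwm:
        scores = list(column.values())
        col_mean = sum(scores) / len(scores)
        mean += col_mean
        variance += sum((s - col_mean) ** 2 for s in scores) / len(scores)
    return mean + n_sd * math.sqrt(variance)
```

The column-wise form gives the same result as scoring every possible word and taking the mean and s.d. of those scores, while staying linear in the motif length.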


NOTE: all comparison data and graphs are available on the page of the corresponding factor.


Green lines on the motif logos indicate the DIC thresholds for strong motifs (see the Chipmunk 'Details' page); the highest line marks the DIC for the case where only 2 of the 4 nucleotides are present.


Additional control using independent ChIP-chip dataset

The same procedure was applied to all factors for which ChIP-chip data were available. As an independent control set we used 500 bp regions around the top 300 peaks (skipping the top 100 peaks used for motif construction). Only 'imm' and single-dataset-based motifs were used in this comparison.
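The selection of control regions can be sketched in Python as follows. Several details are assumptions rather than facts from the text: peaks are given as hypothetical (chromosome, summit position, score) tuples, "top" means highest score, the 300 control peaks are taken as ranks 101 to 400, and each 500 bp region is centered on the peak summit.

```python
def control_regions(peaks, skip=100, take=300, width=500):
    """Return (chromosome, start, end) regions around the control peaks.

    peaks: iterable of (chromosome, summit, score) tuples (hypothetical format).
    The `skip` best-scoring peaks are left out, the next `take` peaks are kept.
    """
    ranked = sorted(peaks, key=lambda peak: peak[2], reverse=True)
    half = width // 2
    regions = []
    for chrom, summit, _score in ranked[skip:skip + take]:
        start = max(0, summit - half)  # do not run past the chromosome start
        regions.append((chrom, start, summit + half))
    return regions
```

If the intended reading is instead that the control peaks are ranks 101 to 300, calling the function with `take=200` gives that variant.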


Information on motif construction procedure