Uniquely mappable reads (N_uniq map reads):
The number of sequence reads for this sample that can be
aligned to a single genomic location; this count does not distinguish between reads
that were obtained multiple times (redundant reads) and reads obtained only once
(non-redundant reads). A larger number of reads from a sufficiently complex
library increases the chances of finding all true binding sites; however, the
number of reads required is not known with certainty, and likely depends on
enrichment, antibody quality in ChIP experiments, and the fraction of the genome
containing the feature being measured.
Self-consistent peaks, IDR n (Self Cons IDR):
An estimate of the number of enriched regions in a single sample. A dataset
is divided into 2 pseudo-replicates that are analyzed by peak-calling at relaxed
stringency followed by IDR filtering at the indicated IDR threshold.
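The pseudo-replicate split can be sketched as follows. This is only an illustration: it assumes the reads are partitioned at random into two halves of near-equal size, and the function name split_pseudoreplicates and the seed parameter are hypothetical. Peak calling and IDR filtering are not shown.

```python
import random


def split_pseudoreplicates(reads, seed=0):
    """Randomly partition reads into two pseudo-replicates of near-equal size.

    Assumption: a simple random shuffle followed by a split at the midpoint.
    """
    shuffled = list(reads)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

Each half would then be passed to the peak caller at relaxed stringency before the IDR comparison.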
Replicate-consistent peaks, IDR n (Rep Cons IDR):
The number of enriched regions determined by applying IDR (Irreproducible Discovery Rate)
to this sample and a replicate. Potential enriched regions are identified using a
peak caller at very low stringency, then the IDR method is used to determine which peaks
are signal and which are noise, at the indicated IDR threshold. As this analysis is
performed using pairs of datasets, the output number of peaks is identical for these two
datasets using this method.
Signal Portion of Tags (SPOT):
A measure of enrichment, analogous to the commonly used
fraction of reads in peaks metric. SPOT calculates the fraction of reads that fall in
tag-enriched regions identified using the Hotspot program, computed from a sample of 10 million reads
(Hotspot and SPOT are described on the ENCODE Software Tools page). Note that because methods of
measuring enrichment based on determining the fraction of reads that fall in peaks are
sensitive to the determination of enriched regions, comparison is possible only when using
the identical peak caller and parameters. Larger SPOT values indicate higher signal to
noise; 1.0 is the maximum possible value (all reads are signal) and 0 is the minimum possible
value (all reads are noise). For FAIRE, more than 10 million reads are typically required to
reliably detect peaks.
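As a rough sketch of the fraction that SPOT reports, the function below counts reads that fall inside enriched intervals. It does not run Hotspot or draw the 10 million read sample. The name spot_score is hypothetical, and the sketch assumes a single chromosome with sorted, non-overlapping, half-open (start, end) intervals.

```python
from bisect import bisect_right


def spot_score(read_positions, hotspots):
    """Fraction of reads whose position lies inside an enriched interval.

    Assumptions: read_positions are integer coordinates on one chromosome;
    hotspots is a sorted list of non-overlapping half-open (start, end) pairs.
    """
    if not read_positions:
        raise ValueError("no reads supplied")
    starts = [start for start, _ in hotspots]
    in_hotspot = 0
    for pos in read_positions:
        # Index of the last interval that starts at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and pos < hotspots[i][1]:
            in_hotspot += 1
    return in_hotspot / len(read_positions)
```

With reads at 5, 12, 15, 30 and 99 and intervals (10, 20) and (90, 100), three of the five reads fall in enriched regions, giving 0.6.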
PCR Bottleneck Coefficient (PBC):
A measure of library complexity. This is the ratio
(non-redundant, uniquely mappable reads)/(all uniquely mappable reads), and is further
described on the ENCODE Software Tools page. Provisionally, 0-0.5 is severe bottlenecking, 0.5-0.8
is moderate bottlenecking, 0.8-0.9 is mild bottlenecking, while 0.9-1.0 is no bottlenecking.
Very low values can indicate a technical problem, such as PCR bias, or a biological finding,
such as a very rare genomic feature. Nuclease-based assays (DNase, MNase) detecting features
with base-pair resolution (transcription factor footprints, positioned nucleosomes) are
expected to recover the same read multiple times, resulting in a lower PBC score for these
assays. Note that the most complex library, random DNA, would approach 1.0; thus the very
highest values can indicate technical problems with libraries. It is the practice for some
labs outside of ENCODE to remove redundant reads; after this has been done, the value for this
metric is 1.0, and the metric is no longer meaningful. 82% of TF ChIP, 89% of histone ChIP, 77% of
DNase, 98% of FAIRE, and 97% of control ENCODE datasets have no or mild bottlenecking.
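The ratio and the provisional categories can be illustrated with a short sketch. It assumes each uniquely mapped read has been reduced to a (chromosome, position, strand) tuple and, following the definition of non-redundant reads given above, treats a read as non-redundant when its tuple occurs exactly once. Other conventions for the numerator exist, so this is one reading of the definition. The names pbc and bottleneck_label are hypothetical, and the assignment of boundary values (for example exactly 0.5) to a category is also an assumption.

```python
from collections import Counter


def pbc(alignments):
    """Non-redundant uniquely mapped reads divided by all uniquely mapped reads.

    Assumption: a read is non-redundant when its (chrom, pos, strand) tuple
    is observed exactly once in the input.
    """
    counts = Counter(alignments)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no alignments supplied")
    non_redundant = sum(1 for n in counts.values() if n == 1)
    return non_redundant / total


def bottleneck_label(score):
    """Map a PBC value onto the provisional categories from the text."""
    if score < 0.5:
        return "severe"
    if score < 0.8:
        return "moderate"
    if score < 0.9:
        return "mild"
    return "none"
```

For example, seven reads of which three share one position and four are each seen once give 4/7, which falls in the moderate range.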
Normalized Strand Cross-correlation coefficient (NSC):
A measure of enrichment derived without dependence on prior determination of enriched
regions. Higher values indicate more enrichment; values less than 1.1 are relatively low
NSC scores, and the minimum possible value is 1 (no enrichment). Forward and reverse strand
read coverage signal tracks are computed (number of uniquely mapping read starts at each base
in the genome on the + and - strand separately). The forward and reverse tracks are shifted
towards and away from each other by incremental distances and for each shift, the Pearson
correlation coefficient is computed. In this way, a cross-correlation profile is computed
representing the correlation between forward and reverse strand coverage at different shifts.
The highest cross-correlation value is obtained at a strand shift equal to the predominant
fragment length in the dataset as a result of clustering/enrichment of relatively fixed-size
fragments around the binding sites of the target factor. The NSC is the ratio of the maximal
cross-correlation value (which occurs at strand shift equal to fragment length) to the
background cross-correlation (minimum cross-correlation value over all strand shifts). This
score is sensitive to technical effects; for example, high-quality antibodies such as those against
H3K4me3 and CTCF score well for all cell types and ENCODE production groups, and variation in enrichment
in particular IPs is detected as stochastic variation. This score is also sensitive to biological
effects; narrow marks score higher than broad marks (H3K4me3 vs. H3K36me3, H3K27me3) for all cell
types and ENCODE production groups, and features present in some individual cells but not others
in a population are expected to have lower scores.
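The cross-correlation profile and the NSC ratio described above can be sketched in plain Python. This is a toy illustration, not the ENCODE implementation: the function names are hypothetical, the tracks are short per-base lists of read-start counts, and only shifts of the reverse track in one direction (0 to max_shift) are evaluated.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if one is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def cross_correlation_profile(forward, reverse, max_shift):
    """Correlate forward[i] with reverse[i + shift] for shift = 0..max_shift."""
    profile = {}
    for shift in range(max_shift + 1):
        n = len(forward) - shift
        profile[shift] = pearson(forward[:n], reverse[shift:shift + n])
    return profile


def nsc(profile):
    """Maximal cross-correlation divided by the minimum (background) value."""
    background = min(profile.values())
    if background <= 0:
        # Assumption: the ratio is only interpreted for a positive background.
        raise ValueError("background cross-correlation must be positive")
    return max(profile.values()) / background
```

On real data the shift with the highest correlation is taken as the estimate of the predominant fragment length, as the text describes.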
Relative Strand Cross-correlation coefficient (RSC):
A measure of enrichment derived without dependence on prior determination of enriched
regions. The minimum possible value is 0 (no signal), highly enriched experiments have
values greater than 1, and values much less than 1 may indicate low quality. Forward
and reverse strand read coverage signal tracks are computed (number of uniquely mapping
read starts at each base in the genome on the + and - strand separately). The forward
and reverse tracks are shifted towards and away from each other by incremental distances
and for each shift, the Pearson correlation coefficient is computed. In this way, a
cross-correlation profile is computed representing the correlation values between forward
and reverse strand coverage at different shifts. The highest cross-correlation value is
obtained at a strand shift equal to the predominant fragment length in the dataset as a
result of clustering/enrichment of relatively fixed-size fragments around the binding sites
of the target factor. For short-read datasets (< 100 bp reads) and large genomes with
a significant number of non-uniquely mappable positions (e.g., human and mouse), a
cross-correlation phantom-peak is also observed at a strand-shift equal to the read length.
This read-length peak is an effect of the variable and dispersed mappability of positions
across the genome. For a significantly enriched dataset, the fragment length cross-correlation
peak (representing clustering of fragments around target sites) should be larger than the
mappability-based read-length peak. The RSC is the fragment-length cross-correlation
value minus the background cross-correlation value, divided by the phantom-peak cross-correlation
value minus the background cross-correlation value.
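The RSC calculation can be written out as a small sketch. It assumes a cross-correlation profile is already available as a dictionary mapping strand shift to correlation value, that the fragment-length shift has been estimated separately, and that the background is the minimum value in the profile. The function name rsc and its parameters are hypothetical.

```python
def rsc(profile, fragment_shift, read_length):
    """(fragment-length value - background) / (phantom-peak value - background).

    Assumptions: profile maps strand shift -> cross-correlation value and
    contains entries for both fragment_shift and read_length; the background
    is taken as the minimum value over all shifts in the profile.
    """
    background = min(profile.values())
    fragment_signal = profile[fragment_shift] - background
    phantom_signal = profile[read_length] - background
    if phantom_signal == 0:
        raise ValueError("phantom peak equals background; RSC is undefined")
    return fragment_signal / phantom_signal
```

For instance, a profile with a background of 0.01, a phantom peak of 0.03 at the read length and a fragment-length peak of 0.05 gives (0.05 - 0.01) / (0.03 - 0.01), or approximately 2, which the text would classify as highly enriched.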