
nailpolish consensus

Call a consensus sequence for each group of duplicated reads. The reads must first have been indexed. By default, reads within each duplicate group are clustered to eliminate false duplicates.

Usage

$ nailpolish consensus --help
Generate a consensus-called 'cleaned up' file

Usage: nailpolish consensus [OPTIONS] <INPUT>

Arguments:
  <INPUT>  the input .fastq

Options:
  -o, --output <OUTPUT>              the output .fastq, or empty for stdout
  -t, --threads <THREADS>            the number of threads to use [default: 4]
      --report-original-reads        for each duplicate group of reads, report the original reads along with the consensus
      --report-original-header       if the original read headers are valuable, this will create a orig_header field
                                     in the consensus called result with the entire original read header
      --extra-stats                  add debugging information to the read header [intended for internal development]
                                     warning: since timings are reported, the output will not be identical across runs
      --no-clustering                disable the clustering algorithm; this will prevent nailpolish from detecting
                                     and separating false duplicates
      --len <LEN>                    filter lengths to a value within the given float interval [a,b] [default: 0,15000]
      --qual <QUAL>                  filter average read quality to a value within the given float interval [a,b] [default: 0,inf]
      --max-group-size <N>           skip consensus calling for groups larger than this size [default: 250]
      --large-group-method <METHOD>  how to handle groups larger than --max-group-size
                                     [default: passthrough] [possible values: passthrough, drop, sample, longest]
      --sort-by <TAG>                sort groups by the specified capture group tag (e.g. 'CB' for cell barcode)
  -h, --help                         Print help

Output format

A .fastq file will be produced. Read headers carry metadata as SAM auxiliary tags in the FASTQ comment field (tab-separated, after the read name). See the Output format reference for a complete description of all tags.

A typical output looks like this (tabs shown as newlines for clarity):

@processed_12047_1
  MI:Z:GATAGCTAGCAACAAT_ATTTTACCGACC
  nI:i:12047
  CB:Z:GATAGCTAGCAACAAT
  UB:Z:ATTTTACCGACC
  nT:Z:consensus
  nC:i:1
  nL:i:2
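
Because the tags follow the SAM auxiliary convention (TAG:TYPE:VALUE, tab-separated), a header can be split into a read name and a tag dictionary with a few lines of code. The following is a minimal sketch, not part of nailpolish: the function name parse_header is hypothetical, and it assumes only the two type codes that appear in the example above (Z for strings, i for integers).

```python
def parse_header(header_line):
    """Split a FASTQ header line into (read_name, tags).

    tags maps each two-letter tag to its value; integer tags
    (type code 'i') are converted to int, everything else stays a string.
    """
    line = header_line.rstrip("\n")
    if line.startswith("@"):
        line = line[1:]  # drop the FASTQ header marker
    fields = line.split("\t")
    name, tags = fields[0], {}
    for field in fields[1:]:
        # maxsplit=2 keeps any ':' inside the value intact
        tag, type_code, value = field.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    return name, tags


# The example header from above, written with real tab separators.
header = (
    "@processed_12047_1\tMI:Z:GATAGCTAGCAACAAT_ATTTTACCGACC\tnI:i:12047"
    "\tCB:Z:GATAGCTAGCAACAAT\tUB:Z:ATTTTACCGACC\tnT:Z:consensus\tnC:i:1\tnL:i:2"
)
name, tags = parse_header(header)
print(name, tags["CB"], tags["nI"])  # processed_12047_1 GATAGCTAGCAACAAT 12047
```

The Output format reference remains the authority on what each tag means; the sketch only shows how the fields are laid out.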

Options

  • --threads: set the number of threads that nailpolish should use
  • --report-original-reads: report the original reads as well as the consensus read
  • --report-original-header: report the original headers of the reads used to produce a consensus
  • --no-clustering: disable the false duplicate detection algorithm (see below)
  • --len <LEN>: filter reads by sequence length. Reads outside the interval are excluded from consensus calling. Default: 0,15000 (reads longer than 15,000 bp are excluded, as excessively long reads from sequencing errors can dominate consensus calling time).
  • --qual <QUAL>: filter reads by average base quality. Default: 0,inf (no quality filter).
  • --max-group-size <N>: the size threshold for large-group handling (default: 250). Groups exceeding this size are processed according to --large-group-method.
  • --large-group-method <METHOD>: controls what happens to groups that exceed --max-group-size. Options:
    • passthrough (default): output all reads unmodified, with no consensus calling. This matches the behaviour of earlier versions and is kept for backwards compatibility. Very large groups are typically caused by false duplicates, so skipping consensus calling keeps them from having an outsized impact on runtime.
    • drop: omit the group from output entirely.
    • sample: pseudorandomly subsample the group down to --max-group-size reads and produce a consensus from the sample. The random seed is derived from the group ID, so output is fully reproducible for a given input file.
    • longest: keep only the longest reads (up to --max-group-size) and produce a consensus from those reads.
  • --sort-by <TAG>: sort groups by the named capture group tag before output (e.g. --sort-by CB to sort by cell barcode).
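
The four --large-group-method choices can be illustrated with a short sketch. This is not nailpolish's implementation: the function handle_large_group and its return convention are hypothetical, reads are modelled as plain sequence strings, and the exact way nailpolish derives the seed from the group ID or breaks ties between equally long reads is an assumption.

```python
import random


def handle_large_group(group_id, reads, max_group_size=250, method="passthrough"):
    """Return (reads_to_use, call_consensus) for one duplicate group.

    group_id: string identifying the group (used only to seed sampling)
    reads:    list of sequence strings
    """
    if len(reads) <= max_group_size:
        return reads, True  # not a large group: consensus call as usual
    if method == "passthrough":
        return reads, False  # emit every read unmodified, no consensus
    if method == "drop":
        return [], False  # omit the group from the output
    if method == "sample":
        # Seeding from the group ID makes the subsample reproducible
        # across runs (assumed seeding scheme, for illustration only).
        rng = random.Random(group_id)
        return rng.sample(reads, max_group_size), True
    if method == "longest":
        # Keep the max_group_size longest reads; ties keep input order.
        return sorted(reads, key=len, reverse=True)[:max_group_size], True
    raise ValueError(f"unknown method: {method}")
```

With sample, calling the function twice on the same group returns the same subset, which mirrors the reproducibility described above.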

False duplicate detection

By default, nailpolish clusters reads within each duplicate group to detect and separate false duplicates — reads that share a barcode/UMI by coincidence rather than by biology. Before adding each read to a partial order alignment graph, nailpolish checks whether the read aligns well to the existing graph. If the alignment introduces too many new nodes relative to existing ones (more than 25% of valid nodes), the read is assigned to a new cluster rather than merged into the current one.
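
The threshold rule can be sketched with a deliberately simplified stand-in. In the toy below, a set of k-mers substitutes for the partial order alignment graph, and a read's k-mers that are absent from a cluster play the role of "new nodes". This is an illustration of the 25% rule only, not nailpolish's algorithm: the names kmers and cluster_reads, the choice of k, the use of the read's own k-mer count as the denominator, and the decision to try every existing cluster before opening a new one are all assumptions.

```python
def kmers(seq, k=8):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_reads(reads, k=8, max_new_fraction=0.25):
    """Greedily cluster reads; open a new cluster when a read would add
    more than max_new_fraction of its k-mers as new 'nodes'."""
    clusters = []  # each cluster: {"nodes": set of k-mers, "reads": list}
    for read in reads:
        read_nodes = kmers(read, k)
        for cluster in clusters:
            new_nodes = read_nodes - cluster["nodes"]
            if len(new_nodes) <= max_new_fraction * len(read_nodes):
                # Read fits well enough: merge it into this cluster.
                cluster["nodes"] |= read_nodes
                cluster["reads"].append(read)
                break
        else:
            # No existing cluster accepted the read: start a new one.
            clusters.append({"nodes": set(read_nodes), "reads": [read]})
    return clusters
```

Two near-identical reads end up in one cluster, while an unrelated read that shares the same barcode/UMI starts its own cluster and would therefore receive its own consensus.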

To disable this behaviour — for example, when you are confident that all reads in a group are true duplicates, or when using pre-clustered inputs from a tool like isONclust — pass --no-clustering. This provides a small performance benefit and guarantees a single consensus per group.

# default: false duplicate detection enabled
nailpolish consensus reads.fastq -o output.fastq

# disabled: one consensus per group, no splitting
nailpolish consensus --no-clustering reads.fastq -o output.fastq

Note

False duplicate detection was designed to separate reads that share no biological similarity. For pre-clustered inputs from tools with relaxed clustering criteria (e.g. isONclust), many loosely similar clusters may still pass through as a single consensus, so whether to use --no-clustering depends on your confidence in the upstream clustering.