---
title: nailpolish consensus
---

# nailpolish consensus

Consensus call duplicated reads.
The reads must first have been [indexed](./index.md).
By default, reads within each duplicate group will be clustered to eliminate false duplicates.

## Usage

```shell
$ nailpolish consensus --help
Generate a consensus-called 'cleaned up' file

Usage: nailpolish consensus [OPTIONS] <INPUT>

Arguments:
  <INPUT>  the input .fastq

Options:
  -o, --output <OUTPUT>              the output .fastq, or empty for stdout
  -t, --threads <THREADS>            the number of threads to use [default: 4]
      --report-original-reads        for each duplicate group of reads, report the original reads along with the consensus
      --report-original-header       if the original read headers are valuable, this will create a orig_header field
                                     in the consensus called result with the entire original read header
      --extra-stats                  add debugging information to the read header [intended for internal development]
                                     warning: since timings are reported, the output will not be identical across runs
      --no-clustering                disable the clustering algorithm; this will prevent nailpolish from detecting
                                     and separating false duplicates
      --len <LEN>                    filter lengths to a value within the given float interval [a,b] [default: 0,15000]
      --qual <QUAL>                  filter average read quality to a value within the given float interval [a,b] [default: 0,inf]
      --max-group-size <N>           skip consensus calling for groups larger than this size [default: 250]
      --large-group-method <METHOD>  how to handle groups larger than --max-group-size
                                     [default: passthrough] [possible values: passthrough, drop, sample, longest]
      --sort-by <TAG>                sort groups by the specified capture group tag (e.g. 'CB' for cell barcode)
  -h, --help                         Print help
```

## Output format

A `.fastq` file will be produced. Read headers carry metadata as SAM auxiliary tags in the FASTQ
comment field (tab-separated, after the read name). See the [Output format reference](../reference/output-format.md)
for a complete description of all tags.

A typical output looks like this (tabs shown as newlines for clarity):

```
@processed_12047_1
  MI:Z:GATAGCTAGCAACAAT_ATTTTACCGACC
  nI:i:12047
  CB:Z:GATAGCTAGCAACAAT
  UB:Z:ATTTTACCGACC
  nT:Z:consensus
  nC:i:1
  nL:i:2
```

## Options

- `--threads`: set the number of threads that _nailpolish_ should use
- `--report-original-reads`: report the original reads as well as the consensus read
- `--report-original-header`: report the original headers of the reads used to produce
  a consensus
- `--no-clustering`: disable the false duplicate detection algorithm (see below)
- `--len <LEN>`: filter reads by sequence length. Reads outside the interval are excluded
  from consensus calling. Default: `0,15000` (reads longer than 15,000 bp are excluded,
  as excessively long reads from sequencing errors can dominate consensus calling time).
- `--qual <QUAL>`: filter reads by average base quality. Default: `0,inf` (no quality filter).
- `--max-group-size <N>`: the size threshold for large-group handling (default: 250).
  Groups exceeding this size are processed according to `--large-group-method`.
- `--large-group-method <METHOD>`: controls what happens to groups that exceed `--max-group-size`.
  Options:
    - `passthrough` *(default)*: output all reads unmodified with no consensus calling —
      the existing behaviour, preserved for backwards compatibility. Very large groups are
      typically caused by false duplicates; skipping consensus calling prevents an outsized
      impact on runtime.
    - `drop`: omit the group from output entirely.
    - `sample`: pseudorandomly subsample reads down to `--max-group-size` reads and produce
      a consensus from the sample. The random seed is derived from the group ID, so output
      is fully reproducible for a given input file.
    - `longest`: keep only the longest reads (up to `--max-group-size`) and produce a consensus
      from those reads.
- `--sort-by <TAG>`: sort groups by the named capture group tag before output
  (e.g. `--sort-by CB` to sort by cell barcode).
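
Before changing `--len`, it can help to see how many reads an interval would exclude. The following is a minimal sketch that uses only standard tools, not part of _nailpolish_: `count_outside_len` is a hypothetical helper, and it assumes each FASTQ record spans exactly four lines and that both bounds are plain numbers (not `inf`).

```bash
# count_outside_len FASTQ MIN MAX
# Hypothetical helper: counts reads whose length falls outside [MIN, MAX].
# Assumes each record spans exactly four lines (no wrapped sequences).
count_outside_len() {
  awk -v min="$2" -v max="$3" '
    NR % 4 == 2 {                 # line 2 of each record is the sequence
      n = length($0)
      if (n < min || n > max) outside++
    }
    END { print outside + 0 }' "$1"
}

# Tiny demo file: r1 is 4 bp, r2 is 10 bp.
demo_fastq=$(mktemp)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGTACGTAC\n+\nIIIIIIIIII\n' > "$demo_fastq"
count_outside_len "$demo_fastq" 0 5   # prints 1 (only r2 is outside 0..5)
```

On real data, `count_outside_len reads.fastq 0 15000` would report how many reads fall outside the default `--len` interval.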

## False duplicate detection

By default, _nailpolish_ clusters reads within each duplicate group to detect and separate
_false duplicates_ — reads that share a barcode/UMI by coincidence rather than by biology.
Before adding each read to a partial order alignment graph, nailpolish checks whether the read
aligns well to the existing graph. If the alignment introduces too many new nodes relative to
existing ones (more than 25% of valid nodes), the read is assigned to a new cluster rather than
merged into the current one.
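
The threshold described above can be illustrated with a little arithmetic. This is only a sketch of the stated rule, not the actual implementation: `starts_new_cluster` is a hypothetical helper, the names `new_nodes` and `valid_nodes` are borrowed from the `nA` tag fields in the output format reference, and treating the comparison as strictly greater than 25% is an assumption.

```bash
# starts_new_cluster NEW_NODES VALID_NODES
# Hypothetical helper: succeeds (exit 0) when NEW_NODES is more than 25% of
# VALID_NODES. Integer form of new/valid > 1/4 is 4*new > valid.
starts_new_cluster() {
  [ $(( 4 * $1 )) -gt "$2" ]
}

# 120 new nodes against 1000 valid nodes is 12%: the read would be merged.
if starts_new_cluster 120 1000; then echo "new cluster"; else echo "merge"; fi
# 300 new nodes against 1000 valid nodes is 30%: the read would be split off.
if starts_new_cluster 300 1000; then echo "new cluster"; else echo "merge"; fi
```

The first call prints `merge` and the second prints `new cluster`.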

To disable this behaviour — for example, when you are confident that all reads in a group are
true duplicates, or when using pre-clustered inputs from a tool like isONclust — pass
`--no-clustering`. This provides a small performance benefit and guarantees a single consensus
per group.

```bash
# default: false duplicate detection enabled
nailpolish consensus reads.fastq -o output.fastq

# disabled: one consensus per group, no splitting
nailpolish consensus --no-clustering reads.fastq -o output.fastq
```

!!! note
    The false duplicate detection was designed for reads that share no biological similarity.
    For pre-clustered inputs from tools with relaxed clustering criteria (e.g. isONclust),
    a cluster of only loosely similar reads may still pass through as a single consensus.
    Whether to use `--no-clustering` depends on your confidence in the upstream clustering.

---
title: nailpolish extract
---

# nailpolish extract

Retrieve the original unmodified reads within duplicate groups that match a predicate.

## Usage
```bash
$ nailpolish extract --help
Extract reads belonging to groups that match a predicate

Usage: nailpolish extract [OPTIONS] <INPUT>

Arguments:
  <INPUT>  the input .fastq

Options:
  -o, --output <OUTPUT>          the output .fastq, or empty for stdout
      --id <ID>                  Filter by specific group IDs (comma-separated)
      --key <KEY>                Filter by regex pattern for the key
      --group-size <GROUP_SIZE>  Filter by the size of the duplicate group
      --read-nums <READ_NUMS>    Choose a subset of reads by index within a group (comma-separated)
      --format <FORMAT>          Output format type [default: fastq] [possible values: fastq, fasta, metadata]
  -h, --help                     Print help
```

## Predicates

These predicates are mutually exclusive: only one can be given at a time.

- `--id`: a comma-separated list of group IDs (e.g. `5` or `5,6,7`)
- `--key`: a regular expression matching the BC_UMI key
- `--group-size`: the number of reads in the duplicate group.
  This is equivalent to the total number of reads called in all the clusters of a group.
  For example, the below group (with two clusters) has a group size of 5.

```bash
@GATAGCTAGCAACAAT_ATTTTACCGACC|id=12047|type=consensus|cluster=1|reads_called=2
#└────────────key────────────┘      └─id
@GATAGCTAGCAACAAT_ATTTTACCGACC|id=12047|type=consensus|cluster=2|reads_called=3
```

## Other options

- `--read-nums`: select a subset of reads by their index within each matching group
  (comma-separated, e.g. `1,2`). Can be combined with any predicate.
- `--format`: output format. Options are `fastq` (default), `fasta`, or `metadata`
  (tab-separated metadata only, without sequence data).

---
title: nailpolish index
---

# nailpolish index

This command is used to create an index file from a demultiplexed `.fastq`.
An index is required to run the other _nailpolish_ commands.
The index command supports reads in multiple formats.

## Usage

An index file will be created at `<file>.fastq.nailpolish.idx`.

```shell
$ nailpolish index --help
Create an index file from a demultiplexed .fastq

Usage: nailpolish index [OPTIONS] <INPUT> [PRESET]

Arguments:
  <INPUT>
          the input .fastq file

  [PRESET]
          [default: bc-umi]

          Possible values:
          - bc-umi:           @BARCODE_UMI format as produced by Flexiplex for 10x3 chemistry
          - umi-tools:        `_<UMI>` format as produced by `umi-tools extract`
          - illumina:         bcl2fastq format, which has `:<UMI>` at the end of the read ID
          - sam-tagged-cb-ub: .sam tag format with barcode and UMI (CB:Z and UB:Z tags)
          - sam-tagged-cb:    .sam tag format with barcode only (CB:Z tag)

Options:
      --overwrite
          overwrite an existing index file, if it exists

      --clusters <CLUSTERS>
          use a file containing pre-clustered reads. the file must be semicolon-delimited
          with a header line, where the first column is the read ID and subsequent columns
          are tag names. for example:
            read_id;CB;UB
            READ_HEADER_1;BARCODE1;UMI1

      --barcode-regex <BARCODE_REGEX>
          barcode regex format type, for custom header styles. this will override the preset given.
          for example, for the `bc-umi` preset:
              ^(?<CB>[ATCGNX]{16})_(?<UB>[ATCGNX]{12})

      --skip-unmatched
          skip, instead of error, on reads which are not accounted for:
          - if a cluster file is passed, any reads which are not in any cluster
          - if a barcode regex or preset is used (default), any reads which do not match the regex

  -h, --help
          Print help (see a summary with '-h')
```

## Indexing different file types

### Flexiplex output

If your reads were demultiplexed with [Flexiplex](https://github.com/DavidsonGroup/flexiplex) (Cheng et al. 2024),
use the default `bc-umi` preset:

```bash
nailpolish index reads.fastq
# equivalent to:
nailpolish index reads.fastq bc-umi
```

Read headers are expected to look like `ATCGATCGATCGATCG_ATCGATCGATCG` (16 bp barcode + 12 bp UMI).

### SAM/BAM-tagged reads

If your reads come from an alignment pipeline that annotates reads with SAM tags, use one of the SAM presets:

```bash
# reads tagged with both CB (cell barcode) and UB (UMI)
nailpolish index reads.fastq sam-tagged-cb-ub

# reads tagged with CB only (no UMI)
nailpolish index reads.fastq sam-tagged-cb
```

These presets extract tags from the read comment field, which in FASTQ format carries the SAM auxiliary tags
(e.g., `\tCB:Z:ATCGATCG\tUB:Z:TTTTTTTT`).

### Pre-clustered reads from isONclust

If you have already clustered reads with an external tool such as
[isONclust](https://github.com/ksahlin/isONclust), use the `--clusters` option.
_nailpolish_ can then perform consensus calling on these clusters. It uses the same library as the
`spoa` binary under the hood, but is significantly faster than calling that binary in a loop.

isONclust produces output in the format `cluster_id<tab>read_header`. Convert it to the cluster
file format expected by _nailpolish_ using `awk`:

```bash
awk 'BEGIN{print "read_id;UB"} {print $2";"$1}' isonclust_clusters.tsv > clusters.txt
nailpolish index --clusters clusters.txt reads.fastq
```

Here, the cluster ID is used as the `UB` tag (effectively treated as the grouping key).
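
Since `nailpolish consensus` treats groups larger than `--max-group-size` differently, it may be worth checking cluster sizes in the converted file before indexing. The sketch below uses only standard tools and is not part of _nailpolish_; `cluster_sizes` is a hypothetical helper that assumes the two-column `read_id;UB` layout produced by the `awk` command above.

```bash
# cluster_sizes CLUSTER_FILE
# Hypothetical helper: prints "<cluster><TAB><reads>", largest cluster first.
cluster_sizes() {
  awk -F';' '
    NR == 1 { next }              # skip the "read_id;UB" header row
    { size[$2]++ }                # column 2 holds the cluster ID
    END { for (c in size) print c "\t" size[c] }' "$1" |
    sort -t "$(printf '\t')" -k2,2nr -k1,1
}

# Tiny demo cluster file: cluster 0 has two reads, cluster 1 has one.
demo_clusters=$(mktemp)
printf 'read_id;UB\nread_a;0\nread_b;0\nread_c;1\n' > "$demo_clusters"
cluster_sizes "$demo_clusters"
```

For the demo file this prints cluster `0` with 2 reads, then cluster `1` with 1 read.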

## Reading the index
### Presets

Five presets are bundled with _nailpolish_ for common barcode formats.
These are useful when the header of each read contains information about the barcode.

- `bc-umi`: read headers look like this: `ATCGATCGATCGATCG_ATCGATCGATCG` (16 bp barcode, then 12 bp UMI) in the `BC_UMI` format.
  This is the default barcoding format produced by
  the [Flexiplex demultiplexer](https://github.com/DavidsonGroup/flexiplex) (Cheng et al. 2024).
- `umi-tools`: read headers look like this: `HISEQ:87:00000000T_ATCGATCGATCG` where `ATCGATCGATCG` is the UMI sequence.
  This is the default UMI header format expected from the [umi-tools](https://doi.org/10.1101/gr.209601.116) (Smith et al. 2017)
  collection of UMI management tools.
- `illumina`: read headers look like this: `SIM:1:FCX:1:2106:15337:1063:ATCGATCGATCG 1:N:0:ATCACG` where `ATCGATCGATCG`
  is the UMI sequence.
  This is the default UMI header format produced by tools such as `bcl2fastq`.
- `sam-tagged-cb-ub`: reads carry both a cell barcode (`CB:Z:`) and UMI (`UB:Z:`) as SAM auxiliary tags
  in the read comment field.
- `sam-tagged-cb`: reads carry only a cell barcode (`CB:Z:`) as a SAM auxiliary tag.

### Barcode regex

For reads whose headers contain barcodes and UMIs in a format not covered by a preset, a custom regular expression
can be provided through the `--barcode-regex <BARCODE_REGEX>` parameter. As examples, here are the regular expressions
for three of the presets above:

- `bc-umi`: `--barcode-regex "^(?<CB>[ATCGNX]{16})_(?<UB>[ATCGNX]{12})"`
- `umi-tools`: `--barcode-regex "_(?<UB>[ATCGNX]+)$"`
- `illumina`: `--barcode-regex ":(?<UB>[ATCGNX]+)$"`

Capture group names (e.g. `CB`, `UB`) determine what tag names appear in the output of `nailpolish consensus`.

Regular expressions are parsed by the excellent `regex` library for Rust.
This library is performant and has guarantees on worst-case time complexity;
however, it supports fewer features than backtracking engines such as PCRE
(for example, look-around and backreferences are not available).
For complex queries, it is recommended that you consult the [crate documentation](https://docs.rs/regex/latest/regex/)
and test your regular expression using [regex101](https://regex101.com/), ensuring that you set the 'Flavor' to 'Rust'.

By default, _nailpolish_ requires every read in the input `.fastq` to match the provided
regular expression, and exits with an error otherwise. Pass `--skip-unmatched` to skip reads that do not match instead.
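
Because unmatched reads either cause an error or are skipped, a quick pre-check of the headers can be useful. The sketch below counts header lines that do not fit the `bc-umi` pattern, using only `awk` and `grep`. It is an approximation rather than what _nailpolish_ runs: `count_unmatched_bc_umi` is a hypothetical helper, the named capture groups are dropped because `grep -E` does not support them, and the leading `@` assumes the regex is applied to the header text after the FASTQ `@` marker.

```bash
# count_unmatched_bc_umi FASTQ
# Hypothetical helper: counts header lines that do NOT start with a 16 bp
# barcode, an underscore, and a 12 bp UMI. Assumes four lines per record.
count_unmatched_bc_umi() {
  awk 'NR % 4 == 1' "$1" |
    grep -Evc '^@[ATCGNX]{16}_[ATCGNX]{12}' || true   # grep exits 1 on a zero count
}

# Tiny demo file: the first header matches, the second does not.
demo_headers=$(mktemp)
printf '@ATCGATCGATCGATCG_ATCGATCGATCG\nACGT\n+\nIIII\n@read_without_barcode\nACGT\n+\nIIII\n' > "$demo_headers"
count_unmatched_bc_umi "$demo_headers"   # prints 1
```

A result of `0` suggests that every header fits the pattern.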

### Cluster file

_nailpolish_ can alternatively read grouping information from a separately provided file, if this information is
not in the read headers. This is useful when reads have been pre-clustered by an external tool.

The file must be **semicolon-delimited** (`;`) and must include a **header line** as the first row.
The first column must be named `read_id` and contain the read header. Subsequent columns define the tag names
(e.g. `CB`, `UB`) and their values for each read.

```
read_id;CB;UB
READ_HEADER_1;BARCODE1;UMI1
READ_HEADER_2;BARCODE2;UMI2
```

The column names in the header row (after `read_id`) become the tag names used in the output of `nailpolish consensus`.

By default, _nailpolish_ expects that every read in the input `.fastq` **must** have a corresponding entry in the
cluster file.
If this is not the case, _nailpolish_ will exit with an error. To silently skip any unmatched reads instead,
pass the `--skip-unmatched` flag.

---
title: nailpolish summary
---

# nailpolish summary

Quickly review the quality and duplicate rate of the dataset.
The reads must first have been [indexed](./index.md).

## Usage

```shell
$ nailpolish summary --help
Generate a summary of duplicate statistics from an index file

Usage: nailpolish summary [OPTIONS] <INPUT>

Arguments:
  <INPUT>  Input .fastq file

Options:
  -o, --output <OUTPUT>  Output .html file. By default, will write to <file>.summary.html
  -h, --help             Print help
```

## Output

[See an example summary output file.](../assets/summary.html)

<iframe src="../assets/summary.html" style="width: 100%; height: 60vh; min-height: 500px;"></iframe>

---
title: Home
description: A high-performance Rust tool for error correcting PCR duplicates in sequencing data
---

# Welcome to Nailpolish's documentation!

Nailpolish is a high-performance Rust tool designed to improve the accuracy of sequencing data by error correcting PCR duplicates.

Nailpolish identifies PCR duplicates in barcoded data (reads containing identical barcodes and UMIs, forming "duplicate groups") and applies the partial order alignment consensus algorithm to replace multiple duplicate reads with a single consensus error-corrected read. This process corrects sequencing errors which naturally occur in the reads, improving the overall quality and reliability of sequencing data.

Nailpolish operates in a reference-free manner, first identifying duplicate groups and then clustering within each duplicate group. This process ensures that only true duplicates are included in consensus calling. That is, unrelated reads that share barcodes and UMIs (due to read or demultiplexing errors) are **not** consensus called together, and are instead split into separate clusters.

<img src="./assets/consensus_diagram.png" alt="consensus diagram" width="400"/>

See the [Quick Start](./quickstart.md) guide to begin using Nailpolish with your data.

[^1]: For a single .md file containing the entire documentation that you can point your LLM to, see [`llms.md`](/nailpolish/llms.md)

---
title: Installation
---

# Installation
---
title: Quick Start
---

# Quick Start


## Warning

**This guide was written for Nailpolish v0.1.0 and is now deprecated. The commands listed here will likely not work.**

This quick start guide will walk you through installing Nailpolish and running it on a small demo dataset.
The demo dataset is a small subset of the _scmixology2_ Chromium 10x droplet-based dataset, sequenced using
Nanopore technology, released by [Tian et al. (2021)](https://doi.org/10.1186/s13059-021-02525-6).

Our Flexiplex tool is used to demultiplex the dataset.

## Install

_For more information, see [Install](./install.md)._

For x64 Linux, run:

```shell
curl --proto '=https' --tlsv1.2 -LsSf "https://github.com/DavidsonGroup/nailpolish/releases/download/nightly_develop/nailpolish" -o nailpolish
chmod +x nailpolish
```

### Get test files

Download the `scmixology2` subset reads using:

```shell
wget https://github.com/DavidsonGroup/nailpolish/releases/download/sample-fastq-for-quickstart/scmixology2_sample.fastq
```

## Indexation

_For more information, see [nailpolish index](commands/index.md)._

By default, nailpolish expects the barcode and UMI to be in the `@BC_UMI` format at the start of the header.
Alternative barcode and UMI formats can be provided through either a preset (one of `bc-umi`, `umi-tools`, `illumina`)
or a custom barcode regex.

```shell
# write the index file to `index.tsv`
nailpolish index --index index.tsv scmixology2_sample.fastq
```

## Summary of duplicate count

_For more information, see [nailpolish summary](commands/summary.md)._

A .html file can be generated to summarise some key statistics about the input reads.
The output file is written to `summary.html` by default.

```shell
nailpolish summary index.tsv
```

![nailpolish summary](./assets/summary_image.png)

## Consensus call duplicates

_For more information, see [nailpolish consensus](commands/consensus.md)._

By consensus calling duplicates, only one read is returned per UMI group.
For singleton reads, there is no change
(apart from including UMI group information in the header).

```shell
nailpolish call \
  --index index.tsv \
  --input scmixology2_sample.fastq \
  --output scmixology2_sample_consensus_called.fastq \
  --threads 4 
```

There are alternative parameters which can be passed to configure the output.
See the _[nailpolish consensus](commands/consensus.md)_ documentation for more.

---
title: Output format
---

# Output format

_nailpolish_ produces standard `.fastq` files. Read headers carry metadata as **SAM auxiliary tags**
embedded in the FASTQ comment field (the part of the `@` line after the first tab). This format is
compatible with the [SAM specification](https://samtools.github.io/hts-specs/SAMtags.pdf) and is
passed through to SAM/BAM output by aligners such as minimap2 when they are run with the `-y` flag.

A typical nailpolish read header looks like this (tabs shown as newlines for clarity):

```
@processed_12047_1
  MI:Z:ATCGATCGATCGATCG_TTTTTTTTTTTT
  nI:i:12047
  CB:Z:ATCGATCGATCGATCG
  UB:Z:TTTTTTTTTTTT
  nT:Z:consensus
  nC:i:1
  nL:i:3
```

Fields are separated by tab characters (`\t`). The read name (before the first tab) encodes the read type,
group ID, and cluster ID. Each subsequent field is a SAM tag in `TAG:TYPE:VALUE` form.
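
As an illustration of this layout, the following sketch pulls one tag out of every header with `awk`. It is not part of _nailpolish_: `get_tag` is a hypothetical helper, and it assumes each FASTQ record spans exactly four lines.

```bash
# get_tag TAG < FASTQ
# Hypothetical helper: prints "<read name><TAB><value>" for every header that
# carries TAG. Assumes four lines per record, so headers are lines 1, 5, 9, ...
get_tag() {
  awk -F'\t' -v tag="$1" '
    NR % 4 == 1 {
      for (i = 2; i <= NF; i++) {
        split($i, part, ":")      # TAG:TYPE:VALUE
        if (part[1] == tag)
          # Take everything after "TAG:TYPE:" so values containing ":" survive
          print substr($1, 2) "\t" substr($i, length(part[1]) + length(part[2]) + 3)
      }
    }'
}

# Demo on a single record built from the header shown above.
printf '@processed_12047_1\tMI:Z:ATCGATCGATCGATCG_TTTTTTTTTTTT\tnI:i:12047\tnT:Z:consensus\tnC:i:1\tnL:i:3\nACGT\n+\nIIII\n' |
  get_tag nL
```

The demo prints the read name `processed_12047_1` followed by a tab and `3`, the value of `nL`.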

---

## Read name

The read name prefix depends on the read type:

| Prefix                              | Read type                                          | Example              |
| ----------------------------------- | -------------------------------------------------- | -------------------- |
| `processed_{group}_{cluster}`       | Consensus or simplex                               | `processed_12047_1`  |
| `original_{group}_{cluster}_{read}` | Original read (requires `--report-original-reads`) | `original_12047_1_3` |
| `filtered_{group}_{read}`           | Filtered read (group exceeded `--max-group-size`)  | `filtered_99_1`      |

---

## Tags

| Tag              | SAM type | Description                                                                                                                                                                        |
| ---------------- | -------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `MI`             | `:Z:`    | Molecular barcode. All barcode components joined with `_` (e.g., `CB_UB`). Present on all reads.                                                                                   |
| `nI`             | `:i:`    | Duplicate group index. Unique per distinct barcode key. Present on all reads.                                                                                                      |
| *(capture tags)* | `:Z:`    | One tag per named capture group from the index (e.g., `CB`, `UB`). Tag names come from the regex capture group names or the cluster file header columns. Present on all reads.     |
| `nT`             | `:Z:`    | Read type: `consensus`, `simplex`, `original`, or `filtered`. Present on all reads.                                                                                                |
| `nC`             | `:i:`    | Cluster index within the duplicate group (1-indexed). Present on `consensus` and `original` reads. Absent when `--no-clustering` is passed.                                        |
| `nL`             | `:i:`    | Number of reads in this cluster (or group if `--no-clustering`). Present on `consensus` and `simplex` reads.                                                                       |
| `nR`             | `:i:`    | Index of this read within its duplicate group (1-indexed). Present on `original` reads only.                                                                                       |
| `nH`             | `:Z:`    | JSON array of original read headers used to produce this read. Present on `consensus` and `simplex` reads when `--report-original-header` is passed.                               |
| `nE`             | `:i:`    | Elapsed time to process this duplicate group, in microseconds. Present on `consensus` and `simplex` reads when `--extra-stats` is passed. Output is non-deterministic across runs. |
| `nA`             | `:Z:`    | JSON array of SPOA alignment prediction results (fields: `new_nodes`, `sequence_len`, `valid_nodes`). Present on `original` reads when `--extra-stats` is passed.                  |

---

## Read types

The `nT` tag identifies how a read was produced. There are four possible values:

### `consensus`

The group contained two or more reads. A consensus sequence was generated using partial order alignment.
If clustering is enabled (the default), one consensus read is produced per cluster within the group.

```
@processed_12047_1	MI:Z:ATCGATCGATCGATCG_TTTTTTTTTTTT	nI:i:12047	CB:Z:ATCGATCGATCGATCG	UB:Z:TTTTTTTTTTTT	nT:Z:consensus	nC:i:1	nL:i:3
```
```
@processed_12047_2	MI:Z:ATCGATCGATCGATCG_TTTTTTTTTTTT	nI:i:12047	CB:Z:ATCGATCGATCGATCG	UB:Z:TTTTTTTTTTTT	nT:Z:consensus	nC:i:2	nL:i:2
```

Both reads share the same `MI` and `nI` (same duplicate group) but have different `nC` values (different
clusters, i.e., different consensus sequences). This happens when false duplicate detection splits the group.
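
To see which groups were split in this way, one option is to count consensus reads per `nI` value. The sketch below does this with `awk`. It is not part of _nailpolish_: `split_groups` is a hypothetical helper, it assumes four lines per FASTQ record, and the demo headers omit the `MI`, `CB` and `UB` tags for brevity.

```bash
# split_groups < FASTQ
# Hypothetical helper: prints "<group ID><TAB><consensus reads>" for every
# duplicate group that produced more than one consensus read.
split_groups() {
  awk -F'\t' '
    NR % 4 == 1 && /\tnT:Z:consensus(\t|$)/ {
      for (i = 2; i <= NF; i++)
        if ($i ~ /^nI:i:/) count[substr($i, 6)]++   # value after "nI:i:"
    }
    END { for (g in count) if (count[g] > 1) print g "\t" count[g] }' |
    sort -n
}

# Demo: group 12047 has two consensus reads, group 2829 is a simplex read.
printf '@processed_12047_1\tnI:i:12047\tnT:Z:consensus\tnC:i:1\tnL:i:3\nACGT\n+\nIIII\n@processed_12047_2\tnI:i:12047\tnT:Z:consensus\tnC:i:2\tnL:i:2\nACGA\n+\nIIII\n@processed_2829_1\tnI:i:2829\tnT:Z:simplex\tnL:i:1\nACGT\n+\nIIII\n' |
  split_groups
```

The demo prints a single line: group `12047` with a count of `2`.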

### `simplex`

The group contained exactly one read. No consensus calling was performed; the read is passed through
with updated tags.

```
@processed_2829_1	MI:Z:GCAGTTAAGGATATAC_ACAGTTTCTTTG	nI:i:2829	CB:Z:GCAGTTAAGGATATAC	UB:Z:ACAGTTTCTTTG	nT:Z:simplex	nL:i:1
```

### `original`

The original, unmodified read from a group — emitted alongside the consensus when `--report-original-reads`
is passed. Each original read records which cluster it contributed to (`nC`) and its index in the group (`nR`).

```
@original_12047_1_3	MI:Z:ATCGATCGATCGATCG_TTTTTTTTTTTT	nI:i:12047	CB:Z:ATCGATCGATCGATCG	UB:Z:TTTTTTTTTTTT	nT:Z:original	nC:i:1	nR:i:3
```

Original reads appear in the output **before** the corresponding consensus read for each cluster.

### `filtered`

The group exceeded the `--max-group-size` threshold (default: 250 reads). Consensus calling was skipped
for this group. Each read in the group is emitted individually.

```
@filtered_99_1	MI:Z:ATCGATCGATCGATCG_TTTTTTTTTTTT	nI:i:99	CB:Z:ATCGATCGATCGATCG	UB:Z:TTTTTTTTTTTT	nT:Z:filtered
```

Filtered reads retain all global tags but no cluster or count tags.


## SAM compatibility

Because nailpolish uses the SAM auxiliary tag format, reads can be aligned with minimap2 and the
tags will be preserved in the resulting BAM:

```bash
minimap2 -y -ax splice reference.fa output.fastq | samtools sort -o aligned.bam
```

The `-y` flag instructs minimap2 to copy FASTQ comment fields into the SAM output as auxiliary tags.
All nailpolish tags will be available for downstream filtering in tools like `samtools view`.
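
As one example of such downstream filtering, the sketch below keeps only alignments whose `nL` tag meets a minimum cluster size. It works on SAM text with `awk`, so it can sit between two `samtools view` calls. `filter_by_nL` is a hypothetical helper, not a _nailpolish_ or samtools feature, and it assumes the tags arrive in the optional fields (column 12 onwards) as described above.

```bash
# filter_by_nL MIN < SAM
# Hypothetical helper: keeps SAM header lines, plus alignments whose nL tag
# (number of reads in the cluster) is at least MIN.
filter_by_nL() {
  awk -F'\t' -v min="$1" '
    /^@/ { print; next }          # SAM header lines pass through unchanged
    {
      for (i = 12; i <= NF; i++)  # optional fields start at column 12
        if ($i ~ /^nL:i:/ && substr($i, 6) + 0 >= min) { print; next }
    }'
}

# Demo: one header line, one alignment with nL=3 and one with nL=1.
printf '@HD\tVN:1.6\nr1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\tnT:Z:consensus\tnL:i:3\nr2\t0\tchr1\t200\t60\t4M\t*\t0\t0\tACGT\tIIII\tnT:Z:simplex\tnL:i:1\n' |
  filter_by_nL 2
```

The demo keeps the header line and `r1` only. On real data, a pipeline along the lines of `samtools view -h aligned.bam | filter_by_nL 3 | samtools view -b -o filtered.bam -` would apply the same filter to a BAM file.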

---
title: Overview
---

This section documents all available nailpolish commands.

| Command                                            | Description                                                                                    |
| -------------------------------------------------- | ---------------------------------------------------------------------------------------------- |
| [`nailpolish index`](../commands/index.md)         | Create an index file from a demultiplexed `.fastq`. Required before running any other command. |
| [`nailpolish consensus`](../commands/consensus.md) | Consensus call duplicate read groups into a cleaned-up `.fastq`.                               |
| [`nailpolish extract`](../commands/extract.md)     | Extract original unmodified reads from groups matching a predicate.                            |
| [`nailpolish summary`](../commands/summary.md)     | Generate an interactive HTML summary of duplicate statistics.                                  |
