Welcome to the nailpolish documentation.
Nailpolish is a tool written in Rust for error correction of PCR duplicates, which occur when multiple reads have the same barcode and UMI. By quickly identifying these duplicates and performing a graph-based consensus algorithm, Nailpolish can replace these reads with a single consensus error-corrected read. This process can help to correct some of the sequencing errors naturally occurring in the reads.
See the Quick Start for more information.
Quick Start
This quick start guide will walk you through installing Nailpolish and running it on a small demo dataset. The demo dataset is a small subset of the scmixology2 Chromium 10x droplet-based dataset, sequenced using Nanopore technology, released by Tian et al. (2021).
Our Flexiplex tool is used to demultiplex the dataset.
Install
For more information, see Install.
For x64 Linux, run:
curl --proto '=https' --tlsv1.2 -LsSf "https://github.com/DavidsonGroup/nailpolish/releases/download/nightly_develop/nailpolish" -o nailpolish
chmod +x nailpolish
Get test files
Download the scmixology2 subset reads using:
wget https://github.com/DavidsonGroup/nailpolish/releases/download/sample-fastq-for-quickstart/scmixology2_sample.fastq
Indexation
For more information, see nailpolish index.
By default, nailpolish expects the barcode and UMI to be in the @BC_UMI format at the start of the header.
Alternative barcode and UMI formats can be provided through either a preset (one of bc-umi, umi-tools, illumina) or a custom barcode regex.
# write the index file to `index.tsv`
nailpolish index --index index.tsv scmixology2_sample.fastq
Summary of duplicate count
For more information, see nailpolish summary.
A .html file can be generated to summarise some key statistics about the input reads.
The output file is written to summary.html by default.
nailpolish summary index.tsv
Consensus call duplicates
For more information, see nailpolish call.
By consensus calling duplicates, only one read is returned per UMI group. For singleton reads, there is no change (apart from including UMI group information in the header).
nailpolish call \
--index index.tsv \
--input scmixology2_sample.fastq \
--output scmixology2_sample_consensus_called.fastq \
--threads 4
There are alternative parameters which can be passed to configure the output. See the nailpolish call documentation for more.
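A quick way to see the effect of consensus calling is to compare the number of records before and after. The sketch below assumes plain, uncompressed FASTQ with exactly four lines per record; the count_reads helper and the tiny stand-in file are illustrative only and are not part of nailpolish.

```shell
# Hypothetical helper: count records in an uncompressed, four-line-per-record FASTQ.
count_reads() {
  awk 'END { print NR / 4 }' "$1"
}

# Tiny stand-in file so the sketch runs without the demo dataset.
tmpdir=$(mktemp -d)
printf '@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nIIII\n@r3\nTTTT\n+\nIIII\n' > "$tmpdir/demo.fastq"

n_reads=$(count_reads "$tmpdir/demo.fastq")
echo "reads: $n_reads"
```

Running count_reads on scmixology2_sample.fastq and then on scmixology2_sample_consensus_called.fastq should show fewer records in the second file whenever duplicates were present, since only one read is returned per UMI group.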
nailpolish index
This command is used to create an index file from a demultiplexed .fastq.
An index is required to run the other nailpolish commands.
The index command supports reads in multiple formats.
Usage
$ nailpolish index --help
Create an index file from a demultiplexed .fastq
Usage: nailpolish index [OPTIONS] <FILE> [PRESET]
Arguments:
<FILE> the input .fastq file
[PRESET] [default: bc-umi] [possible values: bc-umi, umi-tools, illumina]
Options:
--index <INDEX> the output index file [default: index.tsv]
--clusters <CLUSTERS> whether to use a file containing pre-clustered reads, with every line in one of two formats: READ_ID;BARCODE or, READ_ID;BARCODE;UMI
--barcode-regex <BARCODE_REGEX> barcode regex format type, for custom header styles. This will override the preset given. For example: ^@([ATCG]{16})_([ATCG]{12}) for the BC-UMI preset
--skip-unmatched skip, instead of error, on reads which are not accounted for:
  - if a cluster file is passed, any reads which are not in any cluster
  - if a barcode regex or preset is used (default), any reads which do not match the regex
-h, --help Print help (see more with '--help')
Presets
Three presets are bundled with nailpolish for common barcode formats. These are useful when the header of each read contains information about the barcode.
bc-umi: read headers look like this: @ATCGATCGATCG_ATCGATCGATCGATCG in the @BC_UMI format. This is the default barcoding format produced by the Flexiplex demultiplexer (Cheng et al., 2024).
umi-tools: read headers look like this: @HISEQ:87:00000000T_ATCGATCGATCG where ATCGATCGATCG is the UMI sequence. This is the default UMI header format expected from the umi-tools collection of UMI management tools.
illumina: read headers look like this: @SIM:1:FCX:1:2106:15337:1063:ATCGATCGATCG 1:N:0:ATCACG where ATCGATCGATCG is the UMI sequence. This is the default UMI header format produced by tools such as bcl2fastq.
Barcode regex
For reads where barcodes and UMIs are contained in the header in a non-standard format, a custom regular expression can be provided through the --barcode-regex <BARCODE_REGEX> parameter. As examples, here are the regular expressions for the presets above:
bc-umi: --barcode-regex "^([ATCG]{16})_([ATCG]{12})"
umi-tools: --barcode-regex "_([ATCG]+)$"
illumina: --barcode-regex ":([ATCG]+)$"
Regular expressions are parsed by the excellent regex library for Rust.
This library is performant and has guarantees on worst-case time complexity; however, the scope of supported regular expression features is more limited (for example, look-around and backreferences are not supported).
For complex queries, it is recommended that you consult the crate documentation and test your regular expression using regex101, ensuring that you set the 'Flavor' to 'Rust'.
By default, nailpolish expects every read in the input .fastq to match the provided regular expression.
If any read does not match, nailpolish fails with an error. To silently skip unmatched reads instead, pass the --skip-unmatched flag.
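Before indexing, it can help to dry-run a pattern against a few headers and see how many would be left unmatched. The sketch below uses sed -E and grep -E (POSIX extended regular expressions) as stand-ins for the Rust regex library, which is an assumption: the dialects agree on simple patterns such as the umi-tools one, but not in general. The sample headers are made up, and whether nailpolish applies the pattern to the whole header line is not confirmed here.

```shell
# Made-up headers in the umi-tools style (UMI after the final underscore).
headers='@HISEQ:87:00000000T_ATCGATCGATCG
@HISEQ:87:00000001T_GGGTTTAAACCC
@HISEQ:87:00000002T'

# The umi-tools preset pattern listed above.
pattern='_([ATCG]+)$'

# Print the first capture group for each matching header;
# -n together with the p flag suppresses headers that do not match.
umis=$(printf '%s\n' "$headers" | sed -nE "s/.*${pattern}/\1/p")

# Count the headers that do not match the pattern at all.
unmatched=$(printf '%s\n' "$headers" | grep -vcE "$pattern")

echo "$umis"
echo "unmatched headers: $unmatched"
```

For a real file, the header lines could be pulled out first with awk 'NR % 4 == 1' reads.fastq, assuming four-line records. A non-zero unmatched count suggests either adjusting the pattern or passing --skip-unmatched.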
Cluster file
nailpolish can alternatively extract UMIs from a separately provided delimiter-separated file, if this information is not in the read headers.
The file must be semicolon-delimited (;). Rows must be in the format READ_ID;BARCODE or READ_ID;BARCODE;UMI.
Note that no header line should be present in the file.
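The format rules above can be checked before indexing. The sketch below is a hypothetical pre-flight check, not something nailpolish provides: it uses awk to flag rows that do not have two or three semicolon-separated fields. Treating an empty field as malformed is an additional assumption, and the file name and contents are made up.

```shell
tmpdir=$(mktemp -d)

# Made-up cluster file: two valid rows, one row with too many fields,
# and one row with an empty barcode.
cat > "$tmpdir/clusters.txt" <<'EOF'
read1;AAACCCGGGTTTAAAC;ATCGATCGATCG
read2;AAACCCGGGTTTAAAC
read3;AAACCCGGGTTTAAAC;ATCGATCGATCG;extra
read4;
EOF

# Collect the line numbers of rows that are neither READ_ID;BARCODE
# nor READ_ID;BARCODE;UMI, or that contain an empty field.
bad_rows=$(awk -F';' '
  NF < 2 || NF > 3 { print NR; next }
  { for (i = 1; i <= NF; i++) if ($i == "") { print NR; next } }
' "$tmpdir/clusters.txt" | paste -sd' ' -)

echo "malformed rows: $bad_rows"
```

An empty result means every row has the expected shape; it does not detect a stray header line, which would still look like a valid row.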
By default, nailpolish expects every read in the input .fastq to have a corresponding entry in the cluster file.
If any read does not, nailpolish fails with an error. To silently skip unmatched reads instead, pass the --skip-unmatched flag.
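To see whether --skip-unmatched will be needed, the read IDs in the .fastq can be compared against the first column of the cluster file. In this sketch the read ID is assumed to be the header up to the first whitespace with the leading @ removed; how nailpolish itself derives the ID is not stated here, so treat that as an assumption. The files are made-up stand-ins, and four-line FASTQ records are assumed.

```shell
tmpdir=$(mktemp -d)

# Made-up inputs: three reads, of which only read1 and read3 have cluster entries.
printf '@read1 extra\nACGT\n+\nIIII\n@read2\nACGA\n+\nIIII\n@read3\nTTTT\n+\nIIII\n' > "$tmpdir/reads.fastq"
printf 'read1;AAACCCGGGTTTAAAC\nread3;AAACCCGGGTTTAAAC;ATCGATCGATCG\n' > "$tmpdir/clusters.txt"

# First file (cluster file): remember every READ_ID.
# Second file (FASTQ): on header lines (lines 1, 5, 9, ...), print IDs
# that were not seen in the cluster file.
missing=$(awk '
  NR == FNR { split($0, f, ";"); seen[f[1]] = 1; next }
  FNR % 4 == 1 { id = substr($1, 2); if (!(id in seen)) print id }
' "$tmpdir/clusters.txt" "$tmpdir/reads.fastq")

echo "reads without a cluster entry: $missing"
```

If the list is non-empty, either add the missing reads to the cluster file or pass --skip-unmatched so that they are skipped.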