nf-taxblast is a Nextflow pipeline that uses a "split-process-combine" approach to efficiently perform homology searches of large multi-sequence fasta files (entire transcriptomes or proteomes) using DIAMOND [1] or NCBI BLAST [2, 3]. It works best on high performance computing (HPC) clusters, where multiple compute nodes can process the "chunks" of query sequences in parallel. The workflow exposes all (or at least most) of the command-line options of either tool to the user, allowing full control of the output (including custom output formats and using the "tax-aware" v5 NCBI databases to report taxonomic classifications of the returned hits).
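The "split-process-combine" idea can be illustrated with a small, self-contained shell sketch. This is not the pipeline's actual implementation (Nextflow handles the splitting and job scheduling); the toy sequences, the chunk file names and the stand-in "process" step (reporting sequence lengths instead of running a search) are all assumptions made for illustration.

```shell
# Sketch of split-process-combine on a toy FASTA file (made-up sequences).
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Toy query file with 5 single-line records
cat > query.fasta <<'EOF'
>seq1
ACGTACGT
>seq2
GGCCTTAA
>seq3
TTTTAAAA
>seq4
ACACACAC
>seq5
GTGTGTGT
EOF

chunk_size=2   # records per chunk (plays the role of --chunkSize)

# 1. Split: open a new chunk file every $chunk_size header lines
awk -v size="$chunk_size" '
  /^>/ {
    if (n % size == 0) {
      if (out) close(out)
      out = sprintf("chunk_%03d.fasta", n / size + 1)
    }
    n++
  }
  { print > out }
' query.fasta

# 2. Process: stand-in for the search step (here: report each sequence length)
for chunk in chunk_*.fasta; do
  awk '/^>/ { id = substr($1, 2); next } { print id "\t" length($0) }' \
    "$chunk" > "${chunk%.fasta}.tsv"
done

# 3. Combine: concatenate the per-chunk results into one table
cat chunk_*.tsv > combined.tsv

n_chunks=$(ls chunk_*.fasta | wc -l | tr -d ' ')
n_rows=$(wc -l < combined.tsv | tr -d ' ')
echo "chunks: $n_chunks, result rows: $n_rows"
```

In the real workflow each chunk would be submitted as a separate job, which is where the speed-up on a cluster comes from.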
Clone this repository to your system with git clone https://github.com/IdoBar/nf-taxblast.git (don't forget where you downloaded it to!)
Note
If you are new to Nextflow [4] and nf-core [5], please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data. You will likely need to adapt one of the institutional profiles provided by nf-core to your specific HPC; read more about it here and contact your HPC admin if you need further assistance (I can probably help if you use any of the Australian HPC systems).
- Nucleotide or protein multi-sequence file in fasta format.
- A local repository of NCBI BLAST databases (v5). Use the --download flag to download the required databases before running the homology search.
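Before launching a run, it can help to count the records in the query file and estimate how many jobs a given chunk size would produce. The sketch below uses a toy file as a stand-in for a real query; the file name and the chunk size are placeholders, not values recommended by the pipeline.

```shell
set -eu

workdir=$(mktemp -d)
query="$workdir/QUERY.fasta"
# Toy stand-in for a real query file (3 made-up records)
printf '>a\nACGT\n>b\nGGCC\n>c\nTTAA\n' > "$query"

chunk_size=2   # candidate value for --chunkSize

# Each FASTA record starts with a '>' header line
n_records=$(grep -c '^>' "$query")

# Ceiling division: number of chunks (jobs) the split would create
n_jobs=$(( (n_records + chunk_size - 1) / chunk_size ))

echo "$n_records records -> $n_jobs chunk(s) of up to $chunk_size records"
```

A very small chunk size creates many short jobs, while a very large one limits parallelism, so the estimate is mainly useful for picking a value that suits your scheduler.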
Once you have the databases, you can run the pipeline using:
nextflow run nf-taxblast.nf \
--app blastn --query QUERY.fasta \
--chunkSize 200 --db "path-of-db/db" \
--out QUERY.blastn.db.outfmt6 \
--outDir <OUTDIR> \
-profile conda,blastn_tax
To see the complete usage information, run nextflow run <path-of-nf-taxblast>/nf-taxblast.nf --help, which prints the following:
Usage:
The typical command for running the pipeline is as follows:
nextflow run nf-taxblast.nf --app blastn --query QUERY.fasta --chunkSize 200 --db "path-of-db/db" -profile conda,blastn_tax
nextflow run nf-taxblast.nf --app "diamond blastp" --query QUERY.faa --chunkSize 5000 --db "path-of-db/db" -profile docker,diamond_tax
Mandatory arguments:
--app <value> BLAST/DIAMOND program to use (diamond blastp/x must be quoted!)
Valid options: [blastn, blastp, tblastn, blastx, 'diamond blastp', 'diamond blastx']
--query <file.fasta> Query fasta file of sequences you wish to BLAST
--db <path-of-db/db> Path of the BLAST or DIAMOND database.
If BLAST database is provided for DIAMOND and taxonomy information is requested
then a suitable database will be created (see Taxonomy options below).
Default: [$BLASTDB/nt or $BLASTDB/nr for protein search]
-profile <profile1,profile2> Configuration profile to use. Can use multiple (comma separated)
Available profiles for container systems: [conda/apptainer/singularity/local/docker]
Available profiles with preset database and output formats: [blastn_tax/diamond_tax]
Available profiles with test datasets and databases: [test/test_tax/test_p/test_p_tax/test_d/test_d_tax]
Optional arguments:
--out <outfile.outfmt6> Output filename of final BLAST output. Default: [QUERY.app.db.outfmt6]
--outDir <path> Output folder for the results. Default: [results]
--outCols <'std'> Output columns (must be quoted!). Default: ['std']
--headers <false> Include headers in the output table. Default: false
--blastOpts <'-evalue 10'> Additional options for BLAST command (must be quoted!).
Default: ['-evalue 1e-10 -max_target_seqs 20']
--dmndOpts <'-e 10e-10'> Additional options for DIAMOND command (must be quoted!).
Default: ['-e 1e-10 -k 20']
--chunkSize <num> Number of fasta records to use in each job when splitting the query fasta file.
This option can also take the size of each subquery (like 200.KB, 5.KB, etc.)
Default: [250]
--queueSize <num> Maximum number of jobs to be queued [50]
--download <false> Download database before running homology search. Default: false
Taxonomy options:
--taxDbDir <path-of-taxdb/db> Location of taxonomy db files (prot.accession2taxid.FULL.gz, nodes.dmp and names.dmp)
to allow DIAMOND return taxonomic information columns.
If the required files cannot be found in the path they will be automatically downloaded
from the NCBI.
Information about the required files and where to download them can be found at
https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#makedb-options
Default: [same path as the database]
--taxListFile <taxid.list> A file with list of taxonomy IDs to limit the search space.
--help This usage statement.
Warning
Please provide pipeline parameters via the CLI or the Nextflow -params-file option. Custom config files, including those provided by the -c Nextflow option, can be used to provide any configuration except for parameters; see the nf-core docs.
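As an illustration of the -params-file route, the sketch below writes a small YAML parameters file and prints the matching command instead of running it. The parameter names are taken from the usage text above; the values, the file name params.yaml and the chosen profiles are placeholders.

```shell
set -eu

workdir=$(mktemp -d)
params_file="$workdir/params.yaml"

# Parameter names come from the usage text; the values are placeholders
cat > "$params_file" <<'EOF'
app: blastn
query: QUERY.fasta
db: path-of-db/db
chunkSize: 200
outDir: results
EOF

# Print the command rather than running it, so it can be reviewed first
echo "nextflow run nf-taxblast.nf -params-file $params_file -profile conda,blastn_tax"
```

Keeping the parameters in a file makes a run easier to repeat and to share than a long command line.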
By default, the pipeline generates a tab-delimited file with the combined top 20 hits (homologs) found for each of the query sequences (with a threshold E-value of 1e-10). This output format is commonly known as outfmt 6 std and includes the standard 12 columns (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore, see BLASTn output format 6). Other columns can be added with the --outCols option, according to the specifications in the BLAST documentation and DIAMOND documentation. Personally, I prefer to always include the stitle column, which is the description of the hit sequence.
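Because the default table is plain tab-delimited text, it is easy to post-process with standard tools. The sketch below keeps the best hit per query, taken here as the row with the highest bitscore (column 12). The rows are made up for illustration, and the file name simply follows the default naming pattern.

```shell
set -eu

workdir=$(mktemp -d)
table="$workdir/QUERY.blastn.db.outfmt6"
tab=$(printf '\t')

# Made-up rows in the 12-column 'std' layout
{
  printf 'q1\ts1\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-90\t350\n'
  printf 'q1\ts2\t90.0\t180\t18\t0\t5\t184\t1\t180\t1e-60\t250\n'
  printf 'q2\ts3\t85.0\t150\t22\t1\t1\t150\t30\t179\t1e-40\t180\n'
  printf 'q2\ts4\t99.0\t150\t1\t0\t1\t150\t1\t150\t1e-70\t280\n'
} > "$table"

# Sort by query ID, then by bitscore (column 12) in descending order,
# and keep only the first row seen for each query
best_hits=$(sort -t "$tab" -k1,1 -k12,12nr "$table" |
  awk -F '\t' '!seen[$1]++ { print $1 "\t" $2 "\t" $12 }')

echo "$best_hits"
```

If extra columns such as stitle are added with --outCols, the column numbers of the standard 12 fields stay the same as long as std comes first.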
Note
Please note that the default results table doesn't include any headers. You can add the headers by using the --headers flag (which will save the output with the .outfmt7 suffix).
One of the unique features of nf-taxblast (compared with other Nextflow implementations of BLAST) is that it allows the user to add columns with taxonomic classifications of the results by using the --outCols option. Some options include staxids (Subject Taxonomy ID(s), separated by a ';'), sscinames (Subject Scientific Name(s), separated by a ';'), scomnames (Subject Common Name(s), separated by a ';'), and sskingdoms (Subject Super Kingdom(s), separated by a ';'). These can be included in the results table by using the (blastn/blastp/diamond)_tax profiles.
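Once taxonomy columns are in the table, a quick summary of which organisms the hits come from takes only a few lines. The sketch below assumes a table produced with something like --outCols 'std staxids sscinames', so that staxids is column 13 and sscinames is column 14; that column order and the rows themselves are assumptions for illustration.

```shell
set -eu

workdir=$(mktemp -d)
table="$workdir/hits_tax.outfmt6"
tab=$(printf '\t')

# Made-up rows: 12 'std' columns followed by staxids (13) and sscinames (14)
{
  printf 'q1\ts1\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-90\t350\t9606\tHomo sapiens\n'
  printf 'q2\ts2\t91.0\t180\t16\t0\t1\t180\t1\t180\t1e-60\t250\t9606\tHomo sapiens\n'
  printf 'q3\ts3\t88.0\t150\t18\t0\t1\t150\t1\t150\t1e-40\t180\t10090\tMus musculus\n'
} > "$table"

# Count hits per scientific name (column 14), most frequent first
tax_counts=$(awk -F '\t' '{ count[$14]++ }
  END { for (name in count) print name "\t" count[name] }' "$table" |
  sort -t "$tab" -k2,2nr)

echo "$tax_counts"
```

Note that a single hit can carry several names separated by ';', which this simple count treats as one combined label.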
DIAMOND can also add taxonomy columns similar to BLAST, with slight differences: skingdoms (Unique Subject Kingdom(s), separated by a ';', not to be confused with sskingdoms) and sphylums (Unique Subject Phylum(s), separated by a ';') are available in addition to the BLAST taxonomy columns mentioned above; however, scomnames is not available.
DIAMOND, however, cannot use the default NCBI databases to include taxonomic features and must create its own version of the database. If taxonomy columns are requested, nf-taxblast will look for a file named db.dmnd in the DB folder. If it can't find it, it will download the NCBI prot.accession2taxid.FULL.gz mapping file, extract the sequences from the requested database in FASTA format and create the required db.dmnd file, as detailed in DIAMOND makedb-options.
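For readers who want to see what building such a database involves, the two steps can be sketched by hand with blastdbcmd (part of BLAST+) and diamond makedb. This is only an approximation of what the pipeline automates, and its exact commands may differ; the directory and database names below are hypothetical, and the commands are printed rather than executed.

```shell
set -eu

# Hypothetical locations; adjust to your system
db_dir="path-of-db"
db_name="nr"
tax_dir="path-of-taxdb"

# Step 1: dump all sequences of the BLAST database as FASTA
extract_cmd="blastdbcmd -db $db_dir/$db_name -entry all -out $db_dir/$db_name.fasta"

# Step 2: build a taxonomy-aware DIAMOND database from the FASTA file
# and the NCBI taxonomy files
makedb_cmd="diamond makedb --in $db_dir/$db_name.fasta --db $db_dir/$db_name --taxonmap $tax_dir/prot.accession2taxid.FULL.gz --taxonnodes $tax_dir/nodes.dmp --taxonnames $tax_dir/names.dmp"

# Print the commands instead of running them, so they can be reviewed first
echo "$extract_cmd"
echo "$makedb_cmd"
```

Both steps can take a long time and a lot of disk space for a database the size of nr, which is why the resulting db.dmnd file is reused when it already exists.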
DIAMOND can also be used just for taxonomic classification by using --outCols 102. This will print only the Query ID (qseqid), NCBI taxid (staxids) and the E-value (evalue) of the best alignment (without the matching sequences' ID or any other columns). A fourth column containing the taxonomic lineage in text form can be added by using the option --dmndOpts '--include-lineage'.
Another useful feature allows the user to limit the database to specific taxonomic lineages. This is useful when the user seeks to find hits from certain taxonomic groups (see more advantages and details in this article). To use this feature, use the --taxListFile taxids.list option to provide a text file with a list of taxonomy IDs (one per line, at any taxonomic level, from phylum to species). You can use the NCBI Taxonomy Tax Identifier or other tools, such as the ETE Toolkit (Python) or taxize (R), to find the taxid for your species/genus/family, etc.
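A taxonomy ID list is just a plain text file, so it can be created and sanity-checked in the shell. The sketch below writes three example IDs (4751 for Fungi, 33090 for Viridiplantae and 9606 for Homo sapiens) and checks that every line is a plain integer; the choice of IDs is arbitrary and only serves as an example.

```shell
set -eu

workdir=$(mktemp -d)
taxid_list="$workdir/taxids.list"

# One NCBI taxonomy ID per line (Fungi, Viridiplantae, Homo sapiens)
cat > "$taxid_list" <<'EOF'
4751
33090
9606
EOF

# Sanity check: count lines that are not a plain positive integer.
# grep exits non-zero when nothing matches, hence the '|| true'.
bad_lines=$(grep -vcE '^[0-9]+$' "$taxid_list" || true)

echo "invalid lines: $bad_lines"
```

The file can then be passed to the pipeline with --taxListFile followed by its path.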
Download the NCBI mRNA RefSeq database and run blastn against it, generating an output table with headers and taxonomy information columns.
nextflow run nf-taxblast.nf \
--app blastn --query QUERY.fasta \
--chunkSize 200 --db "refseq_rna" --download \
--out QUERY.blastn.refseq_rna.outfmt7 --headers \
--outDir example \
-profile conda,blastn_tax
Test the pipeline with blastn:
nextflow run nf-taxblast.nf -profile conda,test
Test the pipeline with blastn and additional taxonomy columns:
nextflow run nf-taxblast.nf -profile conda,test_tax
Test the pipeline with blastp and singularity:
nextflow run nf-taxblast.nf -profile singularity,test_p
Test the pipeline with blastp, apptainer and additional taxonomy columns:
nextflow run nf-taxblast.nf -profile apptainer,test_p_tax
Test the pipeline with diamond and apptainer:
nextflow run nf-taxblast.nf -profile apptainer,test_d
Test the pipeline with diamond and additional taxonomy columns:
nextflow run nf-taxblast.nf -profile conda,test_d_tax
Nextflow provides an easy method to monitor the progress of the pipeline, status of the tasks and resource usage via Seqera Tower. This can be enabled with the -with-tower flag. Please read the instructions on how to create a Tower API key and use it to monitor the runs here.
nf-taxblast was inspired by an earlier work I did (blast_tax), implemented in Nextflow following the example workflow demonstrated in https://github.com/nextflow-io/blast-example, with improvements to expose more options to the user and allow taxonomic classification of the results.
If you would like to contribute to this pipeline, please contact me, raise an Issue or fork this repo, edit it and suggest a PR.
If you use this tool, please use the following citation:
Bar, I. IdoBar/nf-taxblast: v0.5.1. Zenodo. (2025). 10.5281/zenodo.1568359
- Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat Meth 12, 59–60 (2015). 10.1038/nmeth.3176
- Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009). 10.1186/1471-2105-10-421
- Camacho, C. et al. BLAST® Command Line Applications User Manual. (National Center for Biotechnology Information (US), Bethesda, MD, USA, 2013). link
- Langer, B. E. et al. Empowering bioinformatics communities with Nextflow and nf-core. bioRxiv Preprint (2024). 10.1101/2024.05.10.592912
- Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017). 10.1038/nbt.3820
- Yim, W. C. & Cushman, J. C. Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments. PeerJ 5, (2017). 10.7717/peerj.3486
- Hernández-Salmerón, J. E. & Moreno-Hagelsieb, G. Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2. BMC Genomics 21, 741 (2020). 10.1186/s12864-020-07132-6