A Nextflow workflow for scalable distributed taxonomy-enabled NCBI Blast and Diamond homology searches

IdoBar/nf-taxblast


Introduction

nf-taxblast is a Nextflow pipeline that uses a "split-process-combine" approach to efficiently perform homology searches of large multi-sequence FASTA files (for example, an entire transcriptome or proteome) using DIAMOND [1] or NCBI BLAST [2, 3]. It works best on high performance computing (HPC) clusters, where multiple compute nodes can process the "chunks" of query sequences in parallel. The workflow exposes all (or at least most) of the command-line options of either tool to the user, allowing full control of the output, including custom output formats and the use of the "tax-aware" v5 NCBI databases to report taxonomic classifications of the returned hits.
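
The "split-process-combine" idea can be illustrated with a small, self-contained shell sketch. This is not how nf-taxblast is implemented (the pipeline handles all of this inside Nextflow); the toy sequences, file names and the stand-in "process" step below are assumptions made purely for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)

# Toy query file with five records (made-up sequences, for illustration only).
printf '>s1\nACGT\n>s2\nGGCC\n>s3\nTTAA\n>s4\nCCGG\n>s5\nATAT\n' > "$workdir/query.fasta"

chunk_size=2   # records per chunk, analogous to --chunkSize

# Split: start a new chunk file every $chunk_size header lines.
awk -v n="$chunk_size" -v dir="$workdir" '
  /^>/ {
    if (count % n == 0) {
      if (out) close(out)
      out = sprintf("%s/chunk_%03d.fasta", dir, ++idx)
    }
    count++
  }
  { print > out }
' "$workdir/query.fasta"

# Process: stand-in for the BLAST/DIAMOND search. Here each chunk simply
# reports its sequence IDs, one per line.
for chunk in "$workdir"/chunk_*.fasta; do
  grep '^>' "$chunk" | sed 's/^>//' > "${chunk%.fasta}.out"
done

# Combine: concatenate the per-chunk results into a single file.
cat "$workdir"/chunk_*.out > "$workdir/combined.out"

n_chunks=$(ls "$workdir"/chunk_*.fasta | wc -l | tr -d ' ')
n_results=$(wc -l < "$workdir/combined.out" | tr -d ' ')
echo "chunks=$n_chunks results=$n_results"
```

On a cluster, each iteration of the "process" loop would be an independent job, which is where the parallel speed-up comes from.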

Installation

Clone this repository to your system with git clone https://github.com/IdoBar/nf-taxblast.git (don't forget where you downloaded it to!)

Note

If you are new to Nextflow [4] and nf-core [5], please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data. You will likely need to adapt one of the institutional profiles provided by nf-core to your specific HPC. Read more about it here, and contact your HPC admin if you need further assistance (I can probably help if you use any of the Australian HPC systems).

Usage

Required input:

  • Nucleotide or protein multi-sequence file in a fasta format.
  • A local repository of NCBI BLAST databases (v5). Use the --download flag to download the required databases before running the homology search.

Once you have the databases, you can run the pipeline using:

nextflow run nf-taxblast.nf \
   --app blastn --query QUERY.fasta \
   --chunkSize 200 --db "path-of-db/db" \
   --out QUERY.blastn.db.outfmt6 \
   --outDir <OUTDIR> \
   -profile conda,blastn_tax
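
When choosing a value for --chunkSize, it can help to know roughly how many jobs a given value will produce. The following is a minimal sketch, assuming one job per chunk and a record-count chunk size; the generated toy FASTA file stands in for your own query file.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical toy query file with 1050 records; replace with your own FASTA.
query=$(mktemp)
for i in $(seq 1 1050); do printf '>seq%d\nACGT\n' "$i"; done > "$query"

chunk_size=200
n_records=$(grep -c '^>' "$query")

# Ceiling division: number of chunks (and therefore jobs) for this chunk size.
n_jobs=$(( (n_records + chunk_size - 1) / chunk_size ))
echo "records=$n_records chunk_size=$chunk_size jobs=$n_jobs"
```

If the estimated number of jobs is far above the --queueSize limit, jobs will wait in the queue, so a larger chunk size may be a reasonable trade-off.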

To see the complete usage information, run nextflow run <path-of-nf-taxblast>/nf-taxblast.nf --help, which will print the following usage information:

   Usage:
      The typical command for running the pipeline is as follows:
      nextflow run nf-taxblast.nf --app blastn --query QUERY.fasta --chunkSize 200 --db "path-of-db/db" -profile conda,blastn_tax
      nextflow run nf-taxblast.nf --app "diamond blastp" --query QUERY.faa --chunkSize 5000 --db "path-of-db/db" -profile docker,diamond_tax

      Mandatory arguments:
       --app <value>                  BLAST/DIAMOND program to use (diamond blastp/x must be quoted!)
                                      Valid options: [blastn, blastp, tblastn, blastx, 'diamond blastp', 'diamond blastx'] 
       --query <file.fatsa>           Query fasta file of sequences you wish to BLAST
       --db <path-of-db/db>           Path of the BLAST or DIAMOND database. 
                                      If BLAST database is provided for DIAMOND and taxonomy information is requested
                                      then a suitable database will be created (see Taxonomy options below). 
                                      Default: [$BLASTDB/nt or $BLASTDB/nr for protein search]   
       -profile <profile1,profile2>   Configuration profile to use. Can use multiple (comma separated)
                                      Available profiles for container systems: [conda/apptainer/singularity/local/docker]
                                      Available profiles with preset database and output formats: [blastn_tax/diamond_tax]
                                      Available profiles with test datasets and databases: [test/test_tax/test_p/test_p_tax/test_d/test_d_tax]

       Optional arguments:
       --out <outfile.outfmt6>        Output filename of final BLAST output. Default: [QUERY.app.db.outfmt6]
       --outDir <path>                Output folder for the results. Default: [results]
       --outCols <'std'>              Output columns (must be quoted!). Default: ['std']
       --headers <false>              Include headers in the output table. Default: false
       --blastOpts <'-evalue 10'>     Additional options for BLAST command (must be quoted!). 
                                      Default: ['-evalue 1e-10 -max_target_seqs 20']
       --dmndOpts <'-e 10e-10'>       Additional options for BLAST command (must be quoted!). 
                                      Default: ['-e 1e-10 -k 20'] 
       --chunkSize <num>              Number of fasta records to use in each job when splitting the query fasta file. 
                                      This option can also take the size of each subquery (like 200.KB, 5.KB, etc.) 
                                      Default: [250]
       --queueSize <num>              Maximum number of jobs to be queued [50]
       --download <false>             Download database before running homology search. Default: false

       Taxonomy options:
       --taxDbDir <path-of-taxdb/db>  Location of taxonomy db files (prot.accession2taxid.FULL.gz, nodes.dmp and names.dmp) 
                                      to allow DIAMOND return taxonomic information columns. 
                                      If the required files cannot be found in the path they will be automatically downloaded 
                                      from the NCBI.
                                      Information about the required files and where to download them can be found at 
                                      https://github.com/bbuchfink/diamond/wiki/3.-Command-line-options#makedb-options
                                      Default: [same path as the database]
       --taxListFile <taxid.list>     A file with list of taxonomy IDs to limit the search space.
       --help                         This usage statement.
     

Warning

Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see the nf-core docs.
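
As a sketch of the -params-file route, the snippet below writes a small YAML file and shows (as a comment) how it might be passed to Nextflow. It assumes the parameter names in the file match the command-line options listed above without the leading dashes, which is the usual Nextflow convention; the values are placeholders.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Write pipeline parameters to a YAML file (placeholder values).
cat > params.yaml <<'EOF'
app: blastn
query: QUERY.fasta
chunkSize: 200
db: path-of-db/db
outDir: results
EOF

# The run would then look like this (not executed here):
#   nextflow run nf-taxblast.nf -params-file params.yaml -profile conda,blastn_tax
cat params.yaml
```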

Pipeline output

By default, the pipeline generates a tab-delimited file with the combined top 20 hits (homologs) found for each query sequence, using an E-value threshold of 1e-10. This output format is commonly known as outfmt 6 std and includes the standard 12 columns (qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore; see BLASTn output format 6). Other columns can be added with the --outCols option, according to the specifications in the BLAST documentation and DIAMOND documentation. Personally, I prefer to always include the stitle column, which is the description of the hit sequence.
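
Because the default table has a fixed 12-column layout, it is easy to post-process with standard command-line tools. The sketch below keeps only the highest-scoring hit per query; the four input rows are made up for illustration and are not real search results.

```shell
#!/usr/bin/env bash
set -euo pipefail

results=$(mktemp)
# Made-up rows in the 12-column std layout (illustration only, not real hits).
printf 'q1\tsA\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-100\t360\n'  > "$results"
printf 'q1\tsB\t90.0\t180\t18\t0\t5\t184\t1\t180\t1e-60\t230\n'  >> "$results"
printf 'q2\tsC\t85.0\t150\t22\t1\t1\t150\t30\t179\t1e-40\t160\n' >> "$results"
printf 'q2\tsD\t99.0\t150\t1\t0\t1\t150\t1\t150\t1e-75\t270\n'   >> "$results"

# Sort by query ID, then by bitscore (column 12) in descending order,
# and keep the first row seen for each query (printing qseqid and sseqid).
best=$(sort -t "$(printf '\t')" -k1,1 -k12,12nr "$results" \
  | awk -F'\t' '!seen[$1]++ { print $1 "\t" $2 }')
echo "$best"
```

If extra columns are requested with --outCols, the column index used for sorting needs to be adjusted to wherever bitscore ends up.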

Note

Please note that the default results table doesn't include any headers. You can add the headers by using the --headers flag (which will save the output with the .outfmt7 suffix).

Taxonomy options

Adding taxonomy information columns

One of the unique features of nf-taxblast (compared with other Nextflow implementations of BLAST) is that it allows the user to add columns with taxonomic classifications of the results by using the --outCols option. Some options include staxids (Subject Taxonomy ID(s), separated by a ';'), sscinames (Subject Scientific Name(s), separated by a ';'), scomnames (Subject Common Name(s), separated by a ';'), and sskingdoms (Subject Super Kingdom(s), separated by a ';'). These can be included in the results table by using the (blastn/blastp/diamond)_tax profiles.

DIAMOND can add taxonomy columns similar to those of BLAST, with slight differences: skingdoms (Unique Subject Kingdom(s), separated by a ';', not to be confused with sskingdoms) and sphylums (Unique Subject Phylum(s), separated by a ';') are available in addition to the BLAST taxonomy columns mentioned above; however, scomnames is not available.

DIAMOND, however, cannot use the default NCBI databases to include taxonomic features and must create its own version of the database. If taxonomy columns are requested, nf-taxblast will look for a file named db.dmnd in the DB folder. If it can't find it, it will download the NCBI prot.accession2taxid.FULL.gz mapping file, extract the sequences from the requested database in FASTA format and create the required db.dmnd file, as detailed in DIAMOND makedb-options.

DIAMOND can also be used just for taxonomic classification by using --outCols 102. This will print only the Query ID (qseqid), NCBI taxid (staxids) and the E-value (evalue) of the best alignment (without the matching sequences' ID or any other columns). A fourth column containing the taxonomic lineage in text form can be added by using the option --dmndOpts '--include-lineage'.

Limiting the search to specific taxonomic lineages

Another useful feature allows the user to limit the database to specific taxonomic lineages. This is useful when the user seeks to find hits from certain taxonomic groups (see more advantages and details in this article). To use this feature, use the --taxListFile taxids.list option to provide a text file with a list of taxonomy IDs (one per line; they can be at any taxonomic level, from phylum to species). You can use the NCBI Taxonomy Tax Identifier or other tools, such as the ETE Toolkit (Python) or taxize (R), to find the taxid for your species/genus/family, etc.
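
A minimal sketch of preparing such a list is shown below. The two taxonomy IDs are only examples (4751 is Fungi and 9606 is Homo sapiens), and the sanity check is an optional extra assumed here for illustration, not something the pipeline requires.

```shell
#!/usr/bin/env bash
set -euo pipefail

# One NCBI taxonomy ID per line: 4751 = Fungi, 9606 = Homo sapiens.
cat > taxids.list <<'EOF'
4751
9606
EOF

# Count lines that are not a plain positive integer; this should be 0.
# grep exits non-zero when the count is 0, hence the "|| true".
bad=$(grep -Evc '^[0-9]+$' taxids.list || true)
echo "invalid lines: $bad"

# The file would then be passed to the pipeline (not executed here):
#   nextflow run nf-taxblast.nf ... --taxListFile taxids.list
```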

Examples

Download the NCBI mRNA RefSeq database and run blastn against it, generating an output table with headers and taxonomy information columns.

nextflow run nf-taxblast.nf \
   --app blastn --query QUERY.fasta \
   --chunkSize 200 --db "refseq_rna" --download \
   --out QUERY.blastn.refseq_rna.outfmt7 --headers \
   --outDir example \
   -profile conda,blastn_tax

Running tests

Test the pipeline with blastn

nextflow run nf-taxblast.nf -profile conda,test

Test the pipeline with blastn and additional taxonomy columns

nextflow run nf-taxblast.nf -profile conda,test_tax

Test the pipeline with blastp and singularity

nextflow run nf-taxblast.nf -profile singularity,test_p

Test the pipeline with blastp, apptainer and additional taxonomy columns

nextflow run nf-taxblast.nf -profile apptainer,test_p_tax

Test the pipeline with diamond and apptainer

nextflow run nf-taxblast.nf -profile apptainer,test_d

Test the pipeline with diamond and additional taxonomy columns

nextflow run nf-taxblast.nf -profile conda,test_d_tax

Monitor progress and utilised resources

Nextflow provides an easy method to monitor the progress of the pipeline, status of the tasks and resource usage via Seqera Tower. This can be enabled with the -with-tower flag. Please read the instructions on how to create a Tower API key and use it to monitor the runs here.

Credits

nf-taxblast was inspired by an earlier work I did (blast_tax), implemented in Nextflow following the example workflow demonstrated in https://github.com/nextflow-io/blast-example, with improvements to expose more options to the user and allow taxonomic classification of the results.

Contributions and Support

If you would like to contribute to this pipeline, please contact me, raise an Issue or fork this repo, edit it and suggest a PR.

Citations

If you use this tool, please use the following citation:

Bar, I. IdoBar/nf-taxblast: v0.5.1. Zenodo. (2025). 10.5281/zenodo.1568359

Tools used in the workflow:

  1. Buchfink, B., Xie, C. & Huson, D. H. Fast and sensitive protein alignment using DIAMOND. Nat Meth 12, 59–60 (2015). 10.1038/nmeth.3176
  2. Camacho, C. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009). 10.1186/1471-2105-10-421
  3. Camacho, C. et al. BLAST® Command Line Applications User Manual. (National Center for Biotechnology Information (US), Bethesda, MD, USA, 2013). link

Nextflow publications:

  1. Langer, B. E. et al. Empowering bioinformatics communities with Nextflow and nf-core. bioRxiv Preprint (2024). 10.1101/2024.05.10.592912
  2. Di Tommaso, P. et al. Nextflow enables reproducible computational workflows. Nature Biotechnology 35, 316–319 (2017). 10.1038/nbt.3820

Additional reading on similar alternatives and performance comparison:

  1. Yim, W. C. & Cushman, J. C. Divide and Conquer (DC) BLAST: fast and easy BLAST execution within HPC environments. PeerJ 5, (2017). 10.7717/peerj.3486
  2. Hernández-Salmerón, J. E. & Moreno-Hagelsieb, G. Progress in quickly finding orthologs as reciprocal best hits: comparing blast, last, diamond and MMseqs2. BMC Genomics 21, 741 (2020). 10.1186/s12864-020-07132-6
