nf-core/funcscan
(Meta-)genome screening for functional and natural product gene sequences
Version: 1.1.3 (the latest stable release is 2.0.0).
Define where the pipeline should find input data and save output data.
Path to a comma-separated file containing sample names and paths to the corresponding FASTA files.
string
^\S+\.csv$
Before running the pipeline, you will need to create a design file with information about the samples to be scanned by nf-core/funcscan, containing the sample name and path/to/your/contigs.fasta. Use this parameter to specify its location. It has to be a comma-separated file with 2 columns and a header row (sample, fasta). See usage docs.
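As a minimal sketch, the design file described above could be written and checked with Python's standard library. The file name `samplesheet.csv`, the sample names, and the FASTA paths are hypothetical placeholders; only the two-column layout, the `sample,fasta` header, and the `^\S+\.csv$` pattern come from the text above.

```python
import csv
import re
import tempfile
from pathlib import Path

# Pattern applied to the samplesheet path (taken from the parameter description).
INPUT_PATTERN = re.compile(r"^\S+\.csv$")


def is_valid_input_path(path):
    """Return True if the path has no whitespace and ends in .csv."""
    return INPUT_PATTERN.match(str(path)) is not None


def write_samplesheet(path, rows):
    """Write a two-column samplesheet with the header row `sample,fasta`."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["sample", "fasta"])
        writer.writerows(rows)


def read_samplesheet(path):
    """Read the samplesheet back and check that the header is as expected."""
    with open(path, newline="") as handle:
        reader = csv.DictReader(handle)
        if reader.fieldnames != ["sample", "fasta"]:
            raise ValueError(f"unexpected header: {reader.fieldnames}")
        return list(reader)


# Hypothetical sample names and FASTA paths, for illustration only.
example_rows = [
    ("sample_1", "/path/to/your/contigs_1.fasta"),
    ("sample_2", "/path/to/your/contigs_2.fasta"),
]
samplesheet = Path(tempfile.mkdtemp()) / "samplesheet.csv"
write_samplesheet(samplesheet, example_rows)
parsed = read_samplesheet(samplesheet)
```

The path check is kept separate from writing so that it can be applied to whatever location is eventually passed to the pipeline.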
The output directory where the results will be saved. You have to use absolute paths to storage on Cloud infrastructure.
string
Email address for completion summary.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (~/.nextflow/config), you don't need to specify this on the command line for every run.
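To see which addresses the pattern above accepts, it can be tried out directly in Python. This is only an illustrative check; the addresses are made up, and the helper name `is_valid_email` is hypothetical.

```python
import re

# Pattern copied from the parameter description above.
EMAIL_PATTERN = re.compile(
    r"^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$"
)


def is_valid_email(address):
    """Return True if the address matches the pattern used for this parameter."""
    return EMAIL_PATTERN.match(address) is not None
```

Note that the final group allows only 2 to 5 letters, so an address ending in a longer top-level domain would not match this pattern.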
MultiQC report title. Printed as page header, used for filename if not otherwise specified.
string
These parameters influence which workflow (ARG, AMP and/or BGC) to activate.
Activate antimicrobial peptide screening tools.
boolean
Activate antimicrobial resistance gene screening tools.
boolean
Activate biosynthetic gene cluster screening tools.
boolean
These options influence the generation of annotation files required for downstream steps in ARG, AMP, and BGC workflows.
Specify which annotation tool to use for some downstream tools.
string
Specify whether to save gene annotations in the results directory.
boolean
These parameters influence the annotation algorithm of Bacteria used by BAKTA.
Specify a path to BAKTA database.
string
Specify a path to a database that is prepared in a BAKTA format.
Download full or light version of the Bakta database if not supplying own database.
string
If you want the pipeline to download the Bakta database for you, you can choose between the full (33.1 GB) and light (1.3 GB) version. The full version is generally recommended for best annotation results, because it contains all of these:
- UPS: unique protein sequences identified via length and MD5 hash digests (100% coverage & 100% sequence identity)
- IPS: identical protein sequences comprising seeds of UniProt's UniRef100 protein sequence clusters
- PSC: protein sequences clusters comprising seeds of UniProt's UniRef90 protein sequence clusters
- PSCC: protein sequences clusters of clusters comprising annotations of UniProt's UniRef50 protein sequence clusters
If download bandwidth, storage, memory, or run duration requirements become an issue, go for the light version (which only contains PSCCs) by modifying the annotation_bakta_db_downloadtype flag.
More details can be found in the documentation
Modifies tool parameter(s):
- BAKTA_DBDOWNLOAD:
--type
Specify the minimum contig size.
integer
1
Specify the minimum contig size that would be annotated by BAKTA.
If run with '--annotation_bakta_compliant', the minimum contig length must be set to 200. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--min-contig-length
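The effect of a minimum contig length can be pictured with a small sketch. This is not BAKTA's implementation: the function name and the in-memory dictionary of contigs are hypothetical, and the rule that compliant mode implies a minimum of 200 is taken from the description above.

```python
def filter_contigs_by_length(contigs, min_length=1, compliant=False):
    """Return only the contigs that reach the effective minimum length.

    contigs: dict mapping contig name to nucleotide sequence (hypothetical input).
    """
    # Assumption from the text above: compliant mode sets the minimum to 200.
    effective_min = 200 if compliant else min_length
    return {
        name: seq for name, seq in contigs.items() if len(seq) >= effective_min
    }


# Two toy contigs of 400 bp and 40 bp.
contigs = {"contig_1": "ACGT" * 100, "contig_2": "ACGT" * 10}
kept_default = filter_contigs_by_length(contigs)
kept_compliant = filter_contigs_by_length(contigs, compliant=True)
```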
Specify the genetic code translation table.
integer
11
Specify the genetic code translation table used for translation of nucleotides to amino acids.
All possible genetic codes (1-25) used for gene annotation can be found here. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--translation-table
Specify the type of bacteria to be annotated to detect signal peptides.
string
Specify the type of bacteria expected in the input dataset for correct annotation of the signal peptide predictions. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--gram
Specify that all contigs are complete replicons.
boolean
This flag expects contigs that make up complete chromosomes and/or plasmids. By setting it, the user asserts that the contigs are complete replicons. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--complete
Changes the original contig headers.
boolean
This flag specifies that the contig headers should be rewritten. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--keep-contig-headers
Clean the result annotations to standardise them to Genbank/ENA conventions.
boolean
The resulting annotations are cleaned up to standardise them to Genbank/ENA/DDBJ conventions. CDS without any attributed hits and those without gene symbols or product descriptions different from hypothetical will be marked as 'hypothetical'.
When activated the '--min-contig-length' will be set to 200. More info can be found here.
Modifies tool parameter(s):
- BAKTA:
--compliant
Activate tRNA detection & annotation.
boolean
This flag activates tRNAscan-SE 2.0 that predicts tRNA genes. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--skip-trna
Activate tmRNA detection & annotation.
boolean
This flag activates Aragorn that predicts tmRNA genes. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--skip-tmrna
Activate rRNA detection & annotation.
boolean
This flag activates Infernal vs. Rfam rRNA covariance models that predict rRNA genes. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--rrna
Activate ncRNA detection & annotation.
boolean
This flag activates Infernal vs. Rfam ncRNA covariance models that predict ncRNA genes.
BAKTA distinguishes between ncRNA genes and (cis-regulatory) regions to enable the distinction of feature overlap detection.
This includes distinguishing between ncRNA gene types: sRNA, antisense, ribozyme and antitoxin. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--ncrna
Activate ncRNA region detection & annotation.
boolean
This flag activates Infernal vs. Rfam ncRNA covariance models that predict ncRNA cis-regulatory regions.
BAKTA distinguishes between ncRNA genes and (cis-regulatory) regions to enable the distinction of feature overlap detection.
This includes distinguishing between ncRNA (cis-regulatory) region types: riboswitch, thermoregulator, leader and frameshift element. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--skip-ncrna-region
Activate CRISPR array detection & annotation.
boolean
This flag activates PILER-CR that predicts CRISPR arrays. More details can be found in the documentation.
Modifies tool parameter(s):
- BAKTA:
--skip-crispr
Skip CDS detection & annotation.
boolean
This flag skips CDS prediction, which is performed by PYRODIGAL using distinct prediction modes for complete replicons and incomplete contigs.
For more information on how BAKTA predicts CDS please refer to BAKTA documentation.
Modifies tool parameter(s):
- BAKTA:
--skip-cds
Activate pseudogene detection & annotation.
boolean
This flag activates the search for reference protein sequence clusters (PSCs) using hypothetical CDS as seed sequences, then aligns the translated PSCs against up-/downstream-elongated CDS regions. For more info refer to BAKTA documentation.
Modifies tool parameter(s):
- BAKTA:
--skip-pseudo
Skip sORF detection & annotation.
boolean
Skip the prediction of sORFs from amino acid stretches shorter than 30 aa. For more info please refer to BAKTA documentation. All sORFs without gene symbols or product descriptions different from hypothetical will be discarded, while only those identified hits exhibiting proper gene symbols or product descriptions different from hypothetical will still be included in the final annotation.
Modifies tool parameter(s):
- BAKTA:
--skip-sorf
Activate gap detection & annotation.
boolean
Activates any gene annotation found within contig assembly gaps. For more info please refer to BAKTA documentation.
Modifies tool parameter(s):
- BAKTA:
--skip-gap
Activate oriC/oriT detection & annotation.
boolean
Activates the BAKTA search for oriC/oriT genes by comparing results from Blast+ (generated by cov=0.8, id=0.8) and the MOB-suite of oriT & DoriC oriC/oriV sequences. Annotations of ori regions take into account overlapping Blast+ hits and are conducted based on a majority vote heuristic. Region edges may be fuzzy. For more info please refer to the BAKTA documentation.
Modifies tool parameter(s):
- BAKTA:
--skip-ori
Activate generation of circular genome plots.
boolean
Activate this flag to generate genome plots (might be memory-intensive).
Modifies tool parameter(s):
- BAKTA:
--skip-plot
These parameters influence the annotation algorithm used by Prokka.
Use the default genome-length optimised mode (rather than the metagenome mode).
boolean
By default, Prokka's --metagenome mode is used in the pipeline to improve the gene prediction of highly fragmented metagenomes.
By specifying this parameter, Prokka will instead use its default mode, which is optimised for singular 'complete' genome sequences.
For more information, please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--metagenome
Suppress the default clean-up of the gene annotations.
boolean
By default, annotation in Prokka is carried out by alignment to other proteins in its database, or the databases the user provides via the tool's --proteins flag. The resulting annotations are then cleaned up to standardise them to Genbank/ENA conventions.
'Vague names' are set to 'hypothetical proteins', 'possible/probable/predicted' are set to 'putative' and 'EC/CPG and locus tag ids' are removed.
By supplying this flag you stop such clean up leaving the original annotation names.
For more information please check Prokka documentation.
This flag suppresses this default behavior of Prokka (which is to perform the cleaning).
Modifies tool parameter(s):
- Prokka:
--rawproduct
Specify the kingdom that the input represents.
string
Specifies the kingdom that the input sample is derived from and/or that you wish to screen for.
⚠️ Prokka cannot annotate Eukaryotes.
For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--kingdom
Specify the translation table used to annotate the sequences.
integer
11
Specify the translation table used to annotate the sequences. All possible genetic codes (1-25) used for gene annotation can be found here. This flag is required if the flag --kingdom is assigned.
For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--gcode
Minimum contig size required for annotation (bp).
integer
1
Specify the minimum contig lengths to carry out annotations on. The Prokka developers recommend that this should be >= 200 bp, if you plan to submit such annotations to NCBI.
For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--mincontiglen
Minimum e-value cut-off.
number
0.000001
Specify the minimum e-value used for filtering the alignment hits.
For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--evalue
Set the assigned minimum coverage.
integer
80
Specify the minimum coverage percent of the annotated genome. This must be set between 0-100.
For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--coverage
Allow transfer RNA (tRNA) to overlap coding sequences (CDS).
boolean
Allow transfer RNA (tRNA) to overlap coding sequences (CDS). Transfer RNAs are short stretches of nucleotide sequences that link mRNA and the amino acid sequence of proteins. Their presence helps in the annotation of the sequences, because each tRNA can only be attached to one type of amino acid.
For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--cdsrnaolap
Use RNAmmer for rRNA prediction.
boolean
Activates RNAmmer instead of the Prokka default Barrnap for rRNA prediction during the annotation process. RNAmmer classifies ribosomal RNA genes in genome sequences by using two levels of Hidden Markov Models. Barrnap uses the nhmmer tool that includes HMMER 3.1 for HMM searching in RNA:DNA style.
For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--rnammer
Sequencing centre ID.
string
Add the sequencing centre ID used in generating the raw sequences. This flag is typically requested in combination with the --compliant flag when contigs need to be renamed due to non-conforming contig headers. For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--centre
Force contig name to Genbank/ENA/DDBJ naming rules.
boolean
Force the contig headers to conform to the Genbank/ENA/DDBJ contig header standards. This is activated in combination with --centre [X] when contig headers supplied by the user are non-conforming and therefore need to be renamed before Prokka can start annotation. This flag activates --genes --mincontiglen 200. For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--compliant
Assign the locus tag for the contig header.
string
Prokka
Assign a special name to the contig. This is used when a specific group of samples are run in a batch. For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--locustag
Add the gene features for each CDS hit.
boolean
For every CDS annotated, this flag adds the gene that encodes for that CDS region. For more information please check Prokka documentation.
Modifies tool parameter(s):
- Prokka:
--addgenes
These parameters influence the annotation algorithm used by Prodigal.
Specify whether to use Prodigal's single-genome mode for long sequences.
boolean
By default Prodigal runs in 'single genome' mode, which requires sequence lengths to be equal to or longer than 20000 characters.
However, more fragmented reads from MAGs often result in contigs shorter than this. Therefore, nf-core/funcscan runs in meta mode by default; providing this parameter overrides this and runs Prodigal in single genome mode again.
For more information check Prodigal documentation.
Modifies tool parameter(s):
- PRODIGAL:
-p
Does not allow partial genes on contig edges.
boolean
Suppresses partial genes from being on contig edges, resulting in closed ends. Should only be activated for genomes where it is certain that the first and last bases of the sequence(s) do not fall inside a gene. Run together with -p normal (formerly -p single).
For more information check Prodigal documentation.
Modifies tool parameter(s):
- PRODIGAL:
-c
Specifies the translation table used for gene annotation.
integer
11
Specifies which translation table should be used for sequence annotation. All possible genetic code translation tables can be found here. The default is set to 11, which is used for standard Bacteria/Archaea.
For more information check Prodigal documentation.
Modifies tool parameter(s):
- PRODIGAL:
-g
Forces Prodigal to scan for motifs.
boolean
Forces PRODIGAL to a full scan for motifs rather than activating the Shine-Dalgarno RBS finder, the default scanner for PRODIGAL to train for motifs.
For more information check Prodigal documentation.
Modifies tool parameter(s):
- PRODIGAL:
-n
These parameters influence the annotation algorithm used by Pyrodigal.
Specify whether to use Pyrodigal's single-genome mode for long sequences.
boolean
By default Pyrodigal runs in 'single genome' mode, which requires sequence lengths to be equal to or longer than 20000 characters.
However, more fragmented reads from MAGs often result in contigs shorter than this. Therefore, nf-core/funcscan runs in meta mode by default; providing this parameter overrides this and runs Pyrodigal in single genome mode again.
For more information check Pyrodigal documentation.
Modifies tool parameter(s):
- PYRODIGAL:
-p
Does not allow partial genes on contig edges.
boolean
Suppresses partial genes from being on contig edges, resulting in closed ends. Should only be activated for genomes where it is certain that the first and last bases of the sequence(s) do not fall inside a gene. Run together with -p single.
For more information check Pyrodigal documentation.
Modifies tool parameter(s):
- PYRODIGAL:
-c
Specifies the translation table used for gene annotation.
integer
11
Specifies which translation table should be used for sequence annotation. All possible genetic code translation tables can be found here. The default is set to 11, which is used for standard Bacteria/Archaea.
For more information check Pyrodigal documentation.
Modifies tool parameter(s):
- PYRODIGAL:
-g
Forces Pyrodigal to scan for motifs.
boolean
Forces Pyrodigal to a full scan for motifs rather than activating the Shine-Dalgarno RBS finder, the default scanner for Pyrodigal to train for motifs.
For more information check Pyrodigal documentation.
Modifies tool parameter(s):
- PYRODIGAL:
-n
Generic options for database downloading
Specify whether to save pipeline-downloaded databases in your results directory.
boolean
While nf-core/funcscan can download databases for you, these are often very large, and downloading them on every run can significantly slow down the pipeline.
Specifying --save_databases will save the pipeline-downloaded databases in your results directory. This applies to: BAKTA, DeepBGC, DeepARG, AMRFinderPlus, antiSMASH, and DRAMP.
You can then move the resulting directories/files to a central cache directory of your choice for re-use in the future.
If you do not specify this flag, the database files will remain in your work/ directory and will be deleted if cleanup = true is specified in your config, or if you run nextflow clean.
Antimicrobial Peptide detection using a deep learning model.
Skip AMPlify during AMP-screening.
boolean
Antimicrobial Peptide detection using machine learning
Skip AMPir during AMP-screening.
boolean
Specify which machine learning classification model to use.
string
AMPir uses a supervised statistical machine learning approach to predict AMPs. It incorporates two support vector machine classification models, "precursor" and "mature".
The precursor module is better for predicted proteins from a translated transcriptome or translated gene models. The alternative model (mature) is best suited for AMP sequences after post-translational processing, typically from direct proteomic sequencing.
More information can be found in the AMPir documentation.
Modifies tool parameter(s):
- AMPir:
model =
Specify minimum protein length for prediction calculation.
integer
10
Filters result for minimum protein length.
Note that amino acid sequences that are shorter than 10 amino acids and/or contain anything other than the standard 20 amino acids are not evaluated and will contain an NA as their prob_AMP value.
More information can be found in the AMPir documentation.
Modifies tool parameter(s):
- AMPir parameter:
min_length in the calculate_features() function
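The rule stated above (sequences that are too short or contain non-standard residues are not evaluated) can be sketched as follows. This is an illustration of the rule only, not AMPir's R implementation; the helper name `is_evaluable` is hypothetical.

```python
# The 20 standard amino acids, one-letter codes.
STANDARD_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")


def is_evaluable(sequence, min_length=10):
    """Return True if a protein sequence would be evaluated under the stated rule.

    A sequence is skipped (reported as NA) when it is shorter than min_length
    or contains any character outside the standard 20 amino acids.
    """
    long_enough = len(sequence) >= min_length
    only_standard = set(sequence.upper()) <= STANDARD_AMINO_ACIDS
    return long_enough and only_standard
```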
Antimicrobial Peptide detection based on predefined HMM models
Skip HMMsearch during AMP-screening.
boolean
Specify path to the AMP hmm model file(s) to search against. Must have quotes if wildcard used.
string
HMMSearch performs biosequence analysis using profile hidden Markov Models.
The models are specified in .hmm files, whose location is given with this parameter, e.g. --amp_hmmsearch_models '/<path>/<to>/<models>/*.hmm'
You must wrap the path in quotes if you use a wildcard, so that the wildcard is expanded by Nextflow and not by bash!
For more information check HMMER documentation.
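The reason for quoting the wildcard can be illustrated with Python's `shlex` and `glob` modules, which here stand in for shell word splitting and shell expansion. The directory and model file names are hypothetical and created in a temporary location purely for the demonstration.

```python
import glob
import shlex
import tempfile
from pathlib import Path

# Hypothetical model directory with two empty .hmm files.
model_dir = Path(tempfile.mkdtemp())
for name in ("model_a.hmm", "model_b.hmm"):
    (model_dir / name).touch()
pattern = str(model_dir / "*.hmm")

# Quoted: the shell hands the pattern over as one literal argument,
# so the receiving program can expand it itself.
quoted_args = shlex.split("--amp_hmmsearch_models " + shlex.quote(pattern))

# Unquoted: the shell would expand the pattern before the program starts,
# turning one argument into one argument per matching file (emulated with glob).
expanded = sorted(glob.glob(pattern))
```

With the quoted form the parameter receives a single value containing the wildcard; with the unquoted form only the first expanded path would follow the flag and the rest would become stray arguments.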
Saves a multiple alignment of all significant hits to a file.
boolean
Save a multiple alignment of all significant hits (those satisfying inclusion thresholds) to a file
For more information check HMMER documentation.
Modifies tool parameter(s):
- HMMsearch:
-A
Save a simple tabular file summarising the per-target output.
boolean
Save a simple tabular (space-delimited) file summarizing the per-target output, with one data line per homologous target sequence found.
For more information check HMMER documentation.
Modifies tool parameter(s):
- HMMsearch:
--tblout
Save a simple tabular file summarising the per-domain output.
boolean
Save a simple tabular (space-delimited) file summarizing the per-domain output, with one data line per homologous domain detected in a query sequence for each homologous model.
For more information check HMMER documentation.
Modifies tool parameter(s):
- HMMsearch:
--domtblout
Antimicrobial Peptide detection mining from metagenomes
Skip Macrel during AMP-screening.
boolean
AntiMicrobial Peptides parsing and functional classification tool
Path to AMPcombi reference database directory (DRAMP).
string
AMPcombi uses the 'general AMPs' dataset of the DRAMP database (http://dramp.cpu-bioinfor.org/downloads/) for taxonomic classification. If you have a local version of it, you can provide the path to the folder containing the reference database files:
- a fasta file with a .fasta file extension
- the corresponding table with functional and taxonomic classifications with a .tsv file extension.
For more information check AMPcombi documentation.
Specify probability cutoff to filter AMPs
number
0.4
Specify the minimum probability an AMP hit must have to be retained in the final output file. Anything below this threshold will be removed.
For more information check AMPcombi documentation.
Modifies tool parameter(s):
- AMPCOMBI:
--cutoff
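A probability cutoff of this kind amounts to a simple filter. The sketch below assumes hits are held as dictionaries with a hypothetical `probability` key; it is not AMPcombi's own code.

```python
def filter_by_probability(hits, cutoff=0.4):
    """Keep hits whose probability is at least the cutoff; drop the rest."""
    return [hit for hit in hits if hit["probability"] >= cutoff]


# Hypothetical hits, for illustration only.
hits = [
    {"name": "peptide_1", "probability": 0.92},
    {"name": "peptide_2", "probability": 0.40},
    {"name": "peptide_3", "probability": 0.39},
]
retained = filter_by_probability(hits)
```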
Antimicrobial resistance gene detection based on NCBI's curated Reference Gene Database and curated collection of Hidden Markov Models
Skip AMRFinderPlus during the ARG-screening.
boolean
Specify the path to a local version of the AMRFinderPlus database.
string
Specify the path to a local version of the AMRFinderPlus database. If no input is given, the pipeline will download the database for you.
See the nf-core/funcscan usage documentation for more information.
Minimum percent identity to reference sequence.
number
-1
Specify the minimum identity that a hit must have to the reference (amino-acid identity for a reference protein, nucleotide identity for a nucleotide reference) if a BLAST alignment (based on methods: BLAST or PARTIAL) was detected, otherwise NA.
If you specify -1, this means use a curated threshold if it exists and 0.9 otherwise.
Setting this value to something other than -1 will override any curated similarity cutoffs. For BLAST: alignment is > 90% of length and > 90% identity to a protein in the AMRFinderPlus database. For PARTIAL: alignment is > 50% of length, but < 90% of length and > 90% identity to the reference, and does not end at a contig boundary.
For more information check AMRFinderPlus documentation.
Modifies tool parameter(s):
- AMRFinderPlus:
--ident_min
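The meaning of the `-1` default can be written out as a small decision function. This is a sketch of the rule as described above, not AMRFinderPlus code; the function name and the `curated_cutoff` argument are hypothetical.

```python
def effective_identity_threshold(ident_min=-1, curated_cutoff=None):
    """Return the identity threshold that applies to a given reference.

    ident_min == -1: use the curated cutoff if one exists, else 0.9.
    Any other value overrides the curated cutoff.
    """
    if ident_min == -1:
        return curated_cutoff if curated_cutoff is not None else 0.9
    return ident_min
```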
Minimum coverage of the reference protein.
number
0.5
Minimum proportion of reference gene covered for a BLAST-based hit analysis if a BLAST alignment was detected, otherwise NA.
For BLAST-based hit analysis: alignment is > 90% of length and > 90% identity to a protein in the AMRFinderPlus database or for PARTIAL: alignment is > 50% of length, but < 90% of length and > 90% identity to the reference, and does not end at a contig boundary.
For more information check AMRFinderPlus documentation.
Modifies tool parameter(s):
- AMRFinderPlus:
--coverage_min
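The BLAST and PARTIAL criteria quoted above can be expressed as a classification sketch. It covers only these two methods with the thresholds stated in the text, takes coverage and identity as proportions, and uses hypothetical function and argument names; it is not how AMRFinderPlus itself is implemented.

```python
def classify_alignment(coverage, identity, ends_at_contig_boundary=False):
    """Classify a hit as "BLAST", "PARTIAL", or None using the stated criteria.

    coverage and identity are proportions between 0 and 1.
    """
    if identity <= 0.9:
        return None  # both methods require > 90% identity
    if coverage > 0.9:
        return "BLAST"
    if 0.5 < coverage < 0.9 and not ends_at_contig_boundary:
        return "PARTIAL"
    return None
```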
Specify which NCBI genetic code to use for translated BLAST.
integer
11
NCBI genetic code for translated BLAST. Number from 1 to 33 to represent the translation table used for BLASTX.
See translation table for more details on which table to use.
For more information check AMRFinderPlus documentation.
Modifies tool parameter(s):
- AMRFinderPlus:
--translation_table
Add the plus genes to the report.
boolean
Provide results from "Plus" genes in the output files.
The plus genes are mostly an expanded set of genes that are of interest in pathogens. This set includes stress response (biocide, metal, and heat resistance), virulence factors, some antigens, and porins. These "plus" proteins have primarily been added to the database with curated BLAST cutoffs, and are generally identified by BLAST searches. Some of these may not be acquired genes or mutations, but may be intrinsic in some organisms. See AMRFinderPlus database for more details.
Modifies tool parameter(s):
- AMRFinderPlus:
--plus
Add identifier column to AMRFinderPlus output.
boolean
Prepend a column containing an identifier for this run of AMRFinderPlus. For example, this can be used to add a sample name column to the AMRFinderPlus results. If set to true, the --name <identifier> is the sample name.
Modifies tool parameter(s):
- AMRFinderPlus:
--name
Antimicrobial resistance gene detection using a deep learning model
Skip DeepARG during the ARG-screening.
boolean
Specify the path to the DeepARG database.
string
Specify the path to a local version of the DeepARG database (see the pipeline's usage documentation). If no input is given, the module will download the database for you; however, this is not recommended, as the database is large and this will take time.
Specify the numeric version number of a user-supplied DeepARG database.
integer
2
The DeepARG tool itself does not explicitly report the database version it uses. We assume the latest version (as downloaded by the tool's database download module); however, if you supply a different database, you must supply the version with this parameter for use with the downstream hAMRonization tool.
The version number must be given without any leading v etc.
Specify which model to use (short or long sequences).
string
Specify which model to use: short sequences for reads (SS), or long sequences for genes (LS). In the vast majority of cases we recommend using the LS model when using funcscan.
For more information check DeepARG documentation.
Modifies tool parameter(s):
- DeepARG:
--model
Specify minimum probability cutoff under which hits are discarded.
number
0.8
Sets the minimum probability cutoff below which hits are discarded.
For more information check DeepARG documentation.
Modifies tool parameter(s):
- DeepARG:
--min-prob
Specify E-value cutoff under which hits are discarded.
number
1e-10
Sets the E-value cutoff below which hits are discarded.
For more information check DeepARG documentation.
Modifies tool parameter(s):
- DeepARG:
--arg-alignment-evalue
Specify percent identity cutoff for sequence alignment under which hits are discarded.
integer
50
Sets the identity cutoff for sequence alignment.
For more information check DeepARG documentation.
Modifies tool parameter(s):
- DeepARG:
--arg-alignment-identity
Specify alignment read overlap.
number
0.8
Sets the value for the allowed alignment read overlap.
For more information check DeepARG documentation.
Modifies tool parameter(s):
- DeepARG:
--arg-alignment-overlap
Specify minimum number of alignments per entry for DIAMOND step of DeepARG.
integer
1000
Sets the value of minimum number of alignments per entry for DIAMOND.
For more information check DeepARG documentation.
Modifies tool parameter(s):
- DeepARG:
--arg-num-alignments-per-entry
Antimicrobial resistance gene detection using Hidden Markov Models
Skip fARGene during the ARG-screening.
boolean
Specify comma-separated list of which pre-defined HMM models to screen against
string
class_a,class_b_1_2,class_b_3,class_c,class_d_1,class_d_2,qnr,tet_efflux,tet_rpg,tet_enzyme
Specify via a comma-separated list any of the pre-defined HMM models:
- Class A beta-lactamases: class_a
- Subclass B1 and B2 beta-lactamases: class_b_1_2
- Subclass B3 beta-lactamases: class_b_3
- Class C beta-lactamases: class_c
- Class D beta-lactamases: class_d_1, class_d_2
- qnr: qnr
- Tetracycline resistance genes: tet_efflux, tet_rpg, tet_enzyme
For more information check fARGene documentation.
For example: --arg_fargenemodel 'class_a,qnr,tet_enzyme'
Modifies tool parameter(s):
- fARGene:
--hmm-model
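A comma-separated model list such as the example above could be checked before a run with a short helper. The model names are those listed in this section; the function name and the validation itself are a hypothetical convenience, not part of the pipeline or of fARGene.

```python
# Pre-defined model names as listed in the parameter description.
PREDEFINED_MODELS = (
    "class_a", "class_b_1_2", "class_b_3", "class_c",
    "class_d_1", "class_d_2", "qnr",
    "tet_efflux", "tet_rpg", "tet_enzyme",
)


def parse_model_list(value):
    """Split a comma-separated model string and reject unknown model names."""
    models = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [model for model in models if model not in PREDEFINED_MODELS]
    if unknown:
        raise ValueError(f"unknown fARGene model(s): {', '.join(unknown)}")
    return models


selected = parse_model_list("class_a,qnr,tet_enzyme")
```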
Specify to save intermediate temporary files to results directory.
boolean
fARGene generates many additional temporary files which in most cases won't be useful and thus by default are not saved to the pipeline's result directory.
By specifying this parameter, the directories tmpdir/, hmmsearchresults/ and spades_assemblies/ will also be saved in the output directory for closer inspection by the user, if necessary.
The threshold score for a sequence to be classified as a (almost) complete gene.
number
The threshold score for a sequence to be classified as a (almost) complete gene. If not pre-assigned, it is assigned by the hmm_model used based on the trade-off between sensitivity and specificity.
For more details see code documentation.
Modifies tool parameter(s):
- fARGene:
--score
The minimum length of a predicted ORF retrieved from annotating the nucleotide sequences.
integer
90
The minimum length of a predicted ORF retrieved from annotating the nucleotide sequences. By default the pipeline assigns this to 90% of the assigned hmm_model sequence length.
For more information check fARGene documentation.
Modifies tool parameter(s):
- fARGene:
--min-orf-length
Defines which ORF finding algorithm to use.
boolean
By default, pipeline uses prodigal/prokka for the prediction of ORFs from nucleotide sequences. Another option is the NCBI ORFfinder tool that is built into fARGene, the use of which is activated by this flag.
For more information check fARGene documentation.
Modifies tool parameter(s):
- fARGene:
--orf-finder
The translation table/format to use for sequence annotation.
string
pearson
The translation format that transeq should use for amino acid annotation from the nucleotide sequences. More sequence formats can be found in transeq 'input sequence formats'.
For more information check fARGene documentation.
Modifies tool parameter(s):
- fARGene:
--translation-format
Antimicrobial resistance gene detection, based on alignment to the CARD database
Skip RGI during the ARG-screening.
boolean
Save RGI output .json file.
boolean
When activated, this flag saves the .json file in the RGI output directory. The .json file contains the ARG predictions in a format that can be uploaded to the CARD website for visualization. See RGI documentation for more details. By default, the .json file is generated in the working directory but not saved in the results directory to save disk space (the .json file is quite large and not required downstream in the pipeline).
Specify to save intermediate temporary files to the results directory.
boolean
RGI generates many additional temporary files which in most cases won't be useful so by default are not saved.
By specifying this parameter, the files including temp in the name will also be saved in the output directory for closer inspection by the user, if necessary.
Specify the alignment tool to be used.
string
Specifies the alignment tool to be used. By default RGI runs BLAST and this is also set as default in the nf-core/funcscan pipeline. Using this flag the user can activate the alignment by DIAMOND again.
For more information check RGI documentation.
Modifies tool parameter(s):
- RGI:
--alignment_tool
Include all of loose, strict and perfect hits (i.e. >=95% identity) found by RGI.
boolean
true
When activated, it includes 'Loose' hits (a.k.a. Discovery) in addition to strict and perfect hits. All 'Loose' matches of 95% identity or better are automatically listed as 'Strict', regardless of alignment length (RGI v. <6.0.0). This behaviour can be overridden by using the --exclude_nudge flag. The 'Loose' algorithm works outside of the detection model cut-offs to provide detection of new, emergent threats and more distant homologs of AMR genes, but will also catalog homologous sequences and spurious partial matches that may not have a role in AMR.
For more information check RGI documentation.
Modifies tool parameter(s):
- RGI:
--include_loose
Suppresses the default behaviour of RGI with --arg_rgi_includeloose.
boolean
true
This flag suppresses the default behaviour of RGI with --include_loose
, which lists all 'Loose' matches of >= 95% identity as 'Strict', regardless of alignment length. With this strict and perfect labels are added. This default behaviour is discontinued in later versions of RGI.
For more information check RGI documentation.
Modifies tool parameter(s):
- RGI:
--exclude_nudge
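The relabelling rule described for --include_loose and --exclude_nudge can be illustrated with a minimal Python sketch. This is not RGI's own code: the function name label_loose_hit is hypothetical, and only the 95% identity cut-off and the effect of the nudge are taken from the descriptions above.

```python
def label_loose_hit(identity: float, exclude_nudge: bool = False) -> str:
    """Label for a hit that RGI initially classified as 'Loose'.

    Hypothetical helper: it mirrors the rule described in the text,
    it is not taken from RGI itself.
    """
    # Loose hits at >= 95% identity are "nudged" to Strict,
    # regardless of alignment length, unless the nudge is excluded.
    if identity >= 95.0 and not exclude_nudge:
        return "Strict"
    return "Loose"


print(label_loose_hit(97.2))                      # nudged
print(label_loose_hit(97.2, exclude_nudge=True))  # nudge suppressed
print(label_loose_hit(80.0))                      # below the cut-off
```

With the nudge excluded, a high-identity but possibly short partial match keeps its 'Loose' label, which makes it easier to separate from hits that passed the detection model cut-offs.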
Include screening of low quality contigs for partial genes.
boolean
This flag should be used only when the contigs are of poor quality (e.g. short) to predict partial genes.
For more information check RGI documentation.
Modifies tool parameter(s):
- RGI:
--low_quality
Specify a more specific data-type of input (e.g. plasmid, chromosome)
string
This flag is used to specify the data type used as input to RGI. By default this is set as 'NA', which makes no assumptions on input data.
For more information check RGI documentation.
Modifies tool parameter(s):
- RGI:
--data
Antimicrobial resistance gene detection, based on alignment to NCBI, CARD, ARG-ANNOT, Resfinder, MEGARES, EcOH, PlasmidFinder, Ecoli_VF and VFDB.
Skip ABRicate during the ARG-screening.
boolean
Specify which of the provided public databases to use by ABRicate.
string
Specifies which database to use from the list of databases provided with ABRicate.
For more information check ABRicate documentation.
Modifies tool parameter(s):
- ABRicate:
--db
Minimum percent identity of alignment required for a hit to be considered.
integer
80
Specifies the minimum percent identity used to classify an ARG hit using BLAST alignment.
For more information check ABRicate documentation.
Modifies tool parameter(s):
- ABRicate:
--minid
Minimum percent coverage of alignment required for a hit to be considered.
integer
80
Specifies the minimum coverage of the nucleotide sequence to be assigned an ARG hit using BLAST alignment. In the ABRicate matrix, an absent gene is assigned (.
) and if present, it is assigned the estimated coverage (#
).
For more information check ABRicate documentation.
Modifies tool parameter(s):
- ABRicate:
--mincov
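As a rough illustration of how the two ABRicate thresholds act together, the following Python sketch keeps only hits that satisfy both. The Hit class, the passes_thresholds helper and the example values are hypothetical; the sketch assumes inclusive thresholds and is not ABRicate's implementation.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    gene: str
    identity: float  # percent identity of the alignment
    coverage: float  # percent of the reference gene covered


def passes_thresholds(hit: Hit, minid: float = 80, mincov: float = 80) -> bool:
    """A hit is kept only if it meets both minimums (assumed inclusive)."""
    return hit.identity >= minid and hit.coverage >= mincov


# Made-up example hits, for illustration only.
hits = [
    Hit("blaTEM-1", identity=99.8, coverage=100.0),
    Hit("tet(A)", identity=92.1, coverage=61.5),  # fails coverage
    Hit("sul1", identity=78.4, coverage=95.0),    # fails identity
]
kept = [hit.gene for hit in hits if passes_thresholds(hit)]
print(kept)
```

Raising either threshold makes the screen more conservative; lowering the coverage threshold admits partial genes, which may be desirable for fragmented assemblies.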
Biosynthetic gene cluster detection
Skip antiSMASH during the BGC screening
boolean
Path to user-defined local antiSMASH database.
string
It is recommended to pre-download the antiSMASH databases to your machine and pass their path to this parameter, as the download can take a long time, particularly when executing many pipeline runs.
See the pipeline documentation for details on how to download this. If running with docker or singularity, please also check --bgc_antismash_installationdirectory
for important information.
Path to user-defined local antiSMASH directory. Only required when running with docker/singularity.
string
This is required when running with docker and singularity (not required for conda), due to attempted 'modifications' of files during database checks in the installation directory, something that cannot be done in immutable docker/singularity containers.
Therefore, a local installation directory needs to be mounted (including all modified files from the downloading step) to the container as a workaround.
Minimum longest-contig length a sample must have to be screened with antiSMASH.
integer
1000
This specifies the minimum length that the longest contig must have for the entire sample to be screened by antiSMASH.
Any samples that do not reach this length will not be sent to antiSMASH, therefore you will not receive output for these samples in your --outdir
.
⚠️ This is not the same as
--bgc_antismash_contigminlength
, which specifies to only analyse contigs above that threshold but within a sample that has already passed the --bgc_antismash_sampleminlength
sample filter!
Minimum length a contig must have to be screened with antiSMASH.
integer
1000
This specifies the minimum length that a contig must have for the contig to be screened by antiSMASH.
For more information see the antiSMASH documentation.
This will only apply to samples that are screened with antiSMASH (i.e., those samples that have not been removed by --bgc_antismash_sampleminlength
).
You may wish to increase this value compared to that of --bgc_antismash_sampleminlength
, in cases where you wish to screen higher-quality (i.e., longer) contigs, or speed up runs by not screening lower quality/less informative contigs.
Modifies tool parameter(s):
- antiSMASH:
--minlength
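The difference between the sample-level filter (--bgc_antismash_sampleminlength) and the contig-level filter (--bgc_antismash_contigminlength) can be sketched in Python as two successive steps. The function contigs_to_screen and the example data are hypothetical, and the thresholds are assumed to be inclusive; the pipeline's actual implementation may differ.

```python
def contigs_to_screen(samples, sample_min_length=1000, contig_min_length=1000):
    """Return, per sample, the contig names that would reach antiSMASH.

    `samples` maps sample name -> {contig name: sequence}.
    """
    selected = {}
    for sample, contigs in samples.items():
        longest = max((len(seq) for seq in contigs.values()), default=0)
        # Step 1: sample-level filter on the longest contig.
        if longest < sample_min_length:
            continue  # the whole sample is skipped, no output is produced
        # Step 2: contig-level filter within the samples that passed.
        selected[sample] = [
            name for name, seq in contigs.items() if len(seq) >= contig_min_length
        ]
    return selected


# Made-up example: sample B has no contig of at least 1000 bp.
samples = {
    "A": {"contig_1": "A" * 5000, "contig_2": "A" * 800},
    "B": {"contig_1": "A" * 900},
}
print(contigs_to_screen(samples, sample_min_length=1000, contig_min_length=3000))
```

In this example sample B is dropped entirely, while sample A is screened but only its 5000 bp contig is analysed.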
Turn on clusterblast comparison against database of antiSMASH-predicted clusters.
boolean
Compare identified clusters against a database of antiSMASH-predicted clusters using the clusterblast algorithm.
For more information see the antiSMASH documentation.
Modifies tool parameter(s):
- antiSMASH:
--cb-general
Turn on clusterblast comparison against known gene clusters from the MIBiG database.
boolean
This will turn on comparing identified clusters against known gene clusters from the MIBiG database using the clusterblast algorithm.
MIBiG is a curated database of experimentally characterised gene clusters with rich associated metadata.
For more information see the antiSMASH documentation.
Modifies tool parameter(s):
- antiSMASH:
--cb-knownclusters
Turn on clusterblast comparison against known subclusters responsible for synthesising precursors.
boolean
Turn on additional screening for operons involved in the biosynthesis of early secondary metabolite components using the clusterblast algorithm.
For more information see the antiSMASH documentation.
Modifies tool parameter(s):
- antiSMASH:
--cb-subclusters
Turn on ClusterCompare comparison against known gene clusters from the MIBiG database.
boolean
Turn on comparison of detected genes against the MIBiG database using the ClusterCompare algorithm - an alternative to clusterblast.
Note that there will not be a dedicated ClusterCompare output in the antiSMASH results directory, but the results are present in the HTML report.
For more information see the antiSMASH documentation.
Modifies tool parameter(s):
- antiSMASH:
--cc-mibig
Generate phylogenetic trees of secondary metabolite group orthologs.
boolean
Turning this on will activate the generation of additional functional and phylogenetic analyses of genes, via comparison against databases of protein orthologs.
For more information see the antiSMASH documentation.
Modifies tool parameter(s):
- antiSMASH:
--cb-smcog-trees
Defines which level of strictness to use for HMM-based cluster detection
string
Defines which level of strictness to use for HMM-based cluster detection.
These levels correspond to screening for groups of clusters that differ in how well-defined they are. For example, loose
will include screening for 'poorly defined' clusters (e.g. saccharides), relaxed
for partially present clusters (e.g. certain types of NRPS), whereas strict
will screen for well-defined clusters such as Ketosynthases.
You can see the rules for the levels of strictness here.
For more information see the antiSMASH documentation.
Modifies tool parameter(s):
- antiSMASH:
--hmmdetection-strictness
Specify which taxonomic classification of input sequence to use
string
This specifies which set of secondary metabolites to screen for, based on the taxon type the secondary metabolites are from.
This will run different pipelines depending on whether the input sequences are from bacteria or fungi.
For more information see the antiSMASH documentation.
Modifies tool parameter(s):
- antiSMASH:
--taxon
A deep learning genome-mining strategy for biosynthetic gene cluster prediction
Skip deepBGC during the BGC screening.
boolean
Path to local deepBGC database folder.
string
Average protein-wise DeepBGC score threshold for extracting BGC regions from Pfam sequences.
number
0.5
The DeepBGC score threshold for extracting BGC regions from Pfam sequences based on average protein-wise value. This is a prediction score that the domain is a part of a BGC.
For more information see the DeepBGC documentation.
Modifies tool parameter(s)
- DeepBGC:
--score
Run DeepBGC's internal Prodigal step in single
mode to restrict detecting genes to long contigs
boolean
By default DeepBGC's Prodigal runs in 'single genome' mode, which requires sequences to be at least 20000 characters long.
However, more fragmented assemblies, such as MAGs, often contain contigs shorter than this. Therefore, nf-core/funcscan will run with the meta
mode by default, but providing this parameter allows you to override this and run in single genome mode again.
For more information check Prodigal documentation.
Modifies tool parameter(s)
- DeepBGC:
--prodigal-meta-mode
Merge detected BGCs within given number of proteins.
integer
Merge detected BGCs within given number of proteins.
For more information see the DeepBGC documentation.
Modifies tool parameter(s)
- DeepBGC:
--merge-max-protein-gap
Merge detected BGCs within given number of nucleotides.
integer
Merge detected BGCs within given number of nucleotides.
For more information see the DeepBGC documentation.
Modifies tool parameter(s)
- DeepBGC:
--merge-max-nucl-gap
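The idea behind the two merge parameters can be sketched as joining neighbouring BGC calls that are separated by at most a given gap. The Python example below works on nucleotide coordinates; the function merge_within_gap, the coordinates, and the definition of the gap as "start of the next region minus end of the previous one" are assumptions for illustration and are not taken from DeepBGC.

```python
def merge_within_gap(regions, max_gap):
    """Merge (start, end) regions whose separation is at most `max_gap`."""
    merged = []
    for start, end in sorted(regions):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous region: extend it.
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


# Made-up coordinates on a single contig.
regions = [(100, 500), (600, 900), (5000, 6000)]
print(merge_within_gap(regions, max_gap=200))
print(merge_within_gap(regions, max_gap=0))
```

A larger gap value produces fewer, longer BGC regions; a value of 0 leaves non-overlapping calls separate in this sketch.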
Minimum BGC nucleotide length.
integer
1
Minimum length a BGC must have (in bp) to be reported as detected.
For more information see the DeepBGC documentation.
Modifies tool parameter(s)
- DeepBGC:
--min-nucl
Minimum number of proteins in a BGC.
integer
1
Minimum number of proteins a BGC must have to be reported as 'detected'.
For more information see the DeepBGC documentation.
Modifies tool parameter(s)
- DeepBGC:
--min-proteins
Minimum number of protein domains in a BGC.
integer
1
Minimum number of domains a BGC must have to be reported as 'detected'.
For more information see the DeepBGC documentation.
Modifies tool parameter(s)
- DeepBGC:
--min-domains
Minimum number of known biosynthetic (as defined by antiSMASH) protein domains in a BGC.
integer
Minimum number of biosynthetic protein domains a BGC must have to be reported as 'detected'. This is based on antiSMASH definitions.
For more information see the DeepBGC documentation.
Modifies tool parameter(s)
- DeepBGC:
--min-bio-domains
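Taken together, the four minimum-size parameters above act as a filter on candidate BGCs. The following Python sketch assumes that a candidate must satisfy all four minimums to be reported; the CandidateBgc class, the is_reported helper, the example values, and the default of 0 for min_bio_domains (no default is listed above) are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class CandidateBgc:
    nucl_length: int      # length in bp
    num_proteins: int
    num_domains: int
    num_bio_domains: int  # known biosynthetic domains


def is_reported(bgc, min_nucl=1, min_proteins=1, min_domains=1, min_bio_domains=0):
    """True if the candidate meets every minimum (assumed to be combined with AND)."""
    return (
        bgc.nucl_length >= min_nucl
        and bgc.num_proteins >= min_proteins
        and bgc.num_domains >= min_domains
        and bgc.num_bio_domains >= min_bio_domains
    )


# Made-up candidate without any known biosynthetic domain.
candidate = CandidateBgc(nucl_length=4200, num_proteins=5, num_domains=7, num_bio_domains=0)
print(is_reported(candidate))
print(is_reported(candidate, min_bio_domains=1))
```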
DeepBGC classification score threshold for assigning classes to BGCs.
number
0.5
DeepBGC classification score threshold for assigning classes to BGCs.
For more information see the DeepBGC documentation.
Modifies tool parameter(s)
- DeepBGC:
--classifier-score
Biosynthetic gene cluster detection
Skip GECCO during the BGC screening.
boolean
Enable unknown region masking to prevent genes from stretching across unknown nucleotides.
boolean
Enable unknown region masking to prevent genes from stretching across unknown nucleotides during ORF detection based on P(y)rodigal.
For more information see the GECCO documentation.
Modifies tool parameter(s):
- GECCO:
--mask
The minimum number of coding sequences a valid cluster must contain.
integer
3
Specify the number of consecutive genes a hit must have to be considered a part of a possible BGC region during BGC extraction.
For more information see the GECCO documentation.
Modifies tool parameter(s):
- GECCO:
--cds
The p-value cutoff for protein domains to be included.
number
1e-9
The p-value cutoff for protein domains to be included.
For more information see the GECCO documentation.
Modifies tool parameter(s):
- GECCO:
--pfilter
The probability threshold for cluster detection.
number
0.8
Specify the minimum probability a predicted gene must have to be considered a part of a BGC during BGC extraction.
Reducing this value may increase number and length of hits, but will reduce the accuracy of the predictions.
For more information see the GECCO documentation.
Modifies tool parameter(s):
- GECCO:
--threshold
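How the probability threshold and the minimum number of coding sequences interact can be shown with a simplified Python sketch: genes at or above the threshold are grouped into consecutive runs, and only runs with enough genes are kept. This is a deliberate simplification of the description above, not GECCO's extraction code; the function extract_clusters and the probabilities are hypothetical.

```python
def extract_clusters(probabilities, threshold=0.8, min_cds=3):
    """Return (first_gene, last_gene) index pairs for runs of consecutive
    genes with probability >= threshold that contain at least min_cds genes."""
    clusters = []
    run_start = None
    for index, probability in enumerate(probabilities):
        if probability >= threshold:
            if run_start is None:
                run_start = index
        else:
            if run_start is not None and index - run_start >= min_cds:
                clusters.append((run_start, index - 1))
            run_start = None
    # Close a run that extends to the last gene.
    if run_start is not None and len(probabilities) - run_start >= min_cds:
        clusters.append((run_start, len(probabilities) - 1))
    return clusters


# Made-up per-gene probabilities along one contig.
probabilities = [0.1, 0.9, 0.95, 0.85, 0.2, 0.9, 0.9, 0.3, 0.81, 0.99, 0.9]
print(extract_clusters(probabilities))
print(extract_clusters(probabilities, min_cds=2))
```

Lowering either value admits more and shorter candidate regions, which matches the trade-off between sensitivity and accuracy noted above.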
The minimum number of annotated genes that must separate a cluster from the edge.
integer
The minimum number of annotated genes that must separate a possible BGC cluster from the edge. Edge clusters will still be included if they are longer. A lower number will increase the number of false positives on small contigs. Used during BGC extraction.
For more information see the GECCO documentation.
Modifies tool parameter(s):
- GECCO:
--edge-distance
Biosynthetic Gene Cluster detection based on predefined HMM models
Skip HMMsearch during BGC-screening.
boolean
Specify path to the BGC hmm model file(s) to search against. Must have quotes if wildcard used.
string
HMMSearch performs biosequence analysis using profile hidden Markov Models.
The models are provided in .hmm
files, whose path is given with this parameter,
e.g.
--bgc_hmmsearch_models '/<path>/<to>/<models>/*.hmm'
You must wrap the path in quotes if you use a wildcard, so that the wildcard is expanded by Nextflow rather than by bash!
For more information check HMMER documentation.
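To check in advance which model files a wildcard pattern would pick up, the pattern can be expanded with Python's standard glob module. The sketch below uses a temporary directory and hypothetical file names purely for illustration; Nextflow resolves the quoted pattern itself when the pipeline runs.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as models_dir:
    # Create placeholder files (hypothetical names) to stand in for real models.
    for name in ("pks.hmm", "nrps.hmm", "README.txt"):
        with open(os.path.join(models_dir, name), "w"):
            pass

    # The same kind of pattern that would be passed, in quotes, to the pipeline.
    pattern = os.path.join(models_dir, "*.hmm")
    matched = sorted(os.path.basename(path) for path in glob.glob(pattern))

print(matched)
```

Only the files ending in .hmm match; other files in the same directory are ignored.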
Saves a multiple alignment of all significant hits to a file.
boolean
Save a multiple alignment of all significant hits (those satisfying inclusion thresholds) to a file.
For more information check HMMER documentation.
Modifies tool parameter(s):
- HMMsearch:
-A
Save a simple tabular file summarising the per-target output.
boolean
Save a simple tabular (space-delimited) file summarizing the per-target output, with one data line per homologous target sequence found.
For more information check HMMER documentation.
Modifies tool parameter(s)
- HMMsearch:
--tblout
Save a simple tabular file summarising the per-domain output.
boolean
Save a simple tabular (space-delimited) file summarizing the per-domain output, with one data line per homologous domain detected in a query sequence for each homologous model.
For more information check HMMER documentation.
Modifies tool parameter(s)
- HMMsearch:
--domtblout
Influences parameters required for the reporting workflow.
Specifies summary output format
string
Specifies which summary report format to generate with hamronize summarize
: tsv, json or interactive (html)
Modifies tool parameter(s)
- hamronize summarize:
-t
,--summary_type
Reference genome related files and options required for the workflow.
Name of iGenomes reference.
string
If using a reference genome configured in the pipeline using iGenomes, use this parameter to give the ID for the reference. This is then used to build the full paths for all required reference genome files e.g. --genome GRCh38
.
See the nf-core website docs for more details.
Path to FASTA genome file.
string
^\S+\.fn?a(sta)?(\.gz)?$
This parameter is mandatory if --genome
is not specified. If you don't have a BWA index available this will be generated for you automatically. Combine with --save_reference
to save BWA index for future runs.
Directory / URL base for iGenomes references.
string
s3://ngi-igenomes/igenomes
Do not load the iGenomes reference config.
boolean
Do not load igenomes.config
when running the pipeline. You may choose this option if you observe clashes between custom parameters and those supplied in igenomes.config
.
Parameters used to describe centralised config profiles. These should not be edited.
Git commit id for Institutional configs.
string
master
Base directory for Institutional configs.
string
https://raw.githubusercontent.com/nf-core/configs/master
If you're running offline, Nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell Nextflow where to find them with this parameter.
Institutional config name.
string
Institutional config description.
string
Institutional config contact information.
string
Institutional config URL link.
string
Set the top limit for requested resources for any single job.
Maximum number of CPUs that can be requested for any single job.
integer
16
Use to set an upper-limit for the CPU requirement for each process. Should be an integer e.g. --max_cpus 1
Maximum amount of memory that can be requested for any single job.
string
128.GB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
Use to set an upper-limit for the memory requirement for each process. Should be a string in the format integer-unit e.g. --max_memory '8.GB'
Maximum amount of time that can be requested for any single job.
string
240.h
^(\d+\.?\s*(s|m|h|d|day)\s*)+$
Use to set an upper-limit for the time requirement for each process. Should be a string in the format integer-unit e.g. --max_time '2.h'
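The memory and time limits are validated against the regular expressions listed above. A minimal Python sketch for checking candidate values before launching a run could look as follows; the two patterns are copied from this page, while the helper names is_valid_memory and is_valid_time are hypothetical.

```python
import re

# Patterns as listed for the maximum memory and maximum time parameters.
MEMORY_PATTERN = re.compile(r"^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$")
TIME_PATTERN = re.compile(r"^(\d+\.?\s*(s|m|h|d|day)\s*)+$")


def is_valid_memory(value: str) -> bool:
    return MEMORY_PATTERN.match(value) is not None


def is_valid_time(value: str) -> bool:
    return TIME_PATTERN.match(value) is not None


for value in ("128.GB", "8 GB", "8G"):
    print(value, is_valid_memory(value))

for value in ("240.h", "1d 12h", "2 hours"):
    print(value, is_valid_time(value))
```

Values such as '8G' (missing the trailing B) or '2 hours' (unit spelled out) do not match the patterns and would be rejected by the parameter validation.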
Less common options for the pipeline, typically set in a config file.
Display help text.
boolean
Display version and exit.
boolean
Method used to save pipeline results to output directory.
string
The Nextflow publishDir
option specifies which intermediate files should be saved to the output directory. This option tells the pipeline what method should be used to move these files. See Nextflow docs for details.
Email address for completion summary, only when pipeline fails.
string
^([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9_\-\.]+)\.([a-zA-Z]{2,5})$
An email address to send a summary email to when the pipeline is completed - ONLY sent if the pipeline does not exit successfully.
Send plain-text email instead of HTML.
boolean
File size limit when attaching MultiQC reports to summary emails.
string
25.MB
^\d+(\.\d+)?\.?\s*(K|M|G|T)?B$
Do not use coloured log outputs.
boolean
Incoming hook URL for messaging service
string
Incoming hook URL for messaging service. Currently, MS Teams and Slack are supported.
Custom config file to supply to MultiQC.
string
Custom logo file to supply to MultiQC. File name must also be set in the MultiQC config file
string
Custom MultiQC yaml file containing HTML including a methods description.
string
Boolean whether to validate parameters against the schema at runtime
boolean
true
Show all params when using --help
boolean
By default, parameters set as hidden in the schema are not shown on the command line when a user runs with --help
. Specifying this option will tell the pipeline to show all parameters.
Validation of parameters fails when an unrecognised parameter is found.
boolean
By default, when an unrecognised parameter is found, it returns a warning.
Validation of parameters in lenient mode.
boolean
Allows string values that are parseable as numbers or booleans. For further information see JSONSchema docs.