Processing recommendations

Currently, Qiita supports the processing of raw data from:

  1. Target gene barcoded sequencing
  2. Shotgun sequencing
  3. Metatranscriptome sequencing

Note that the selected processing recommendations are mainly guided towards performing meta-analyses, this is combine different studies, even from different wet lab techniques or sequencing technologies. However, these parameters shouldn’t prevent you using the resulting tables as your primary analytical source.

Target gene barcoded sequencing

For this you can start with raw, pre-demultiplexing data or per-sample FASTQ. Either way, you will need to “Split libraries and QC”, which uses the default in QIIME 1.9.1. Once your demultiplexed and QCed artifact is created you need to select which processing to perform. There are two main ideologies/methodologies to process target gene data: sequence clustering and sequence deblur.

Sequencing deblur (preferred)

For this we use deblur. Here 2 BIOM tables are generated by default: deblur final table and deblur reference hit table. The former is the full biom table, which can be used with any target gene and wetlab work; the latter is the trimmed version to those sequences that match Greengenes at 80% similarity, a really basic and naive filtering. Each of those BIOM tables, is accompanied by a FASTA that contains the representative sequences. The OTU IDs are given by the unique sequence.

Note that deblur needs all sequences to be trimmed at the same length, thus the recommended pipeline is to trim everything at 150bp and the deblur.

Sequencing clustering

Here we use close reference picking, for an explanation of the different picking methods see “Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences”. Here we generate a single BIOM table with the OTUs/per-sample. The OTU IDs are given based on the reference database selected.

Currently, we have the reference databases: Greengenes version 3_8-97, Silva 119 and Unite 7. Depending on your selection is if the reference has a phylogenetic tree.

Shotgun sequencing

Qiita currently has one active shotgun metagenomics data analysis pipeline: a per sample bowtie2 alignment step with Woltka classification using either the WoLr1 or Rep200 databases. Below you will find more information about each of these options.

The current workflow is as follows:

  1. A single step per sample adapter removal (via fastp) and host filtering (via minimap2); more information below.
  2. Taxonomy profiling using bowtie2 as an aligner and two different reference databases; see sections below

Note that we recommend only uploading sequences that have already been through QC and human sequence removal. However, we recommend that all sequence files go through adapter and host filtering within the system to ensure they are ready for subsequent meta-analyses. Currently, the fastp command is set to autodetect adaptors so this command is available for all different wetlab processing and we provide the following host references for your convenience:

  • auto-detect adapters and artifacts + phix filtering: This is a deblur artifacts reference, mainly for debugging and testing
  • auto-detect adapters and cheetah + phix filtering
  • auto-detect adapters and cow + phix filtering
  • auto-detect adapters and hamster + phix filtering
  • auto-detect adapters and horse + phix filtering
  • auto-detect adapters and merge_genomes + phix filtering : is the combined genomes of a cheetah, cow, hamster, horse, human, mouse, pig, rabbit, and rat
  • auto-detect adapters and mouse + phix filtering
  • auto-detect adapters and pig + phix filtering
  • auto-detect adapters and rabbit + phix filtering
  • auto-detect adapters and rat + phix filtering
  • auto-detect adapters only filtering [not recommended]

Note that the command produces up to 6 output artifacts based on the aligner and database selected:

  • Alignment Profile: contains the raw alignment file and the no rank classification BIOM table
  • Taxonomic Prediction - phylum: contains the phylum level taxonomic predictions BIOM table
  • Taxonomic Prediction - genus: contains the genus level taxonomic predictions BIOM table
  • Taxonomic Prediction - species: contains the genus level taxonomic predictions BIOM table
  • Per genome Predictions: contains the per genome level taxonomic predictions BIOM table
  • Per gene Predictions: Only WoLr1, contains the per gene level taxonomic predictions BIOM table

Aligners

Note that some of these are legacy option but not available for new processing.

  1. Bowtie2: The classical ultrafast short sequence aligner. Based on-FM indexing of genome sequences to achieve efficient memory and CPU performance. We tuned the parameter setting for Bowtie2 to achieve optimal alignment accuracy for typical shotgun metagenome datasets.
  2. BURST: A novel sequence aligner featuring exhaustive (in contrast to heuristic) alignment against the entire reference genome database, hence achieving the highest accuracy, though with relatively long running time.
    • Version: 0.99.8
    • Alignment file format: b6o (BLAST tabular output, i.e., -outfmt 6)
    • Website: https://github.com/knights-lab/BURST
    • Citation: Gabriel Al-Ghalith and Dan Knights. BURST enables optimal exhaustive DNA alignment for big data. DOI 2017:doi.org/10.5281/zenodo.806850
    • Note: Manuscript under review.
  3. UTree A sequence classifier based on _k_-mer signature matching along a tree structure. Analogous to Kraken but with higher computational efficiency. The fastest option.
    • Version: 2.0 RF
    • Alignment file format: custom mapping file
    • Website: https://github.com/knights-lab/UTree
    • Citation: Gabriel Al-Ghalith and Dan Knights. Faster and lower-memory metagenomic profiling with UTree. DOI: 10.5281/zenodo.998252

Reference databases

Note that some of these are legacy option but not available for new processing.

  1. WoLr1 (“Web of Life” release 1): An even representation of microbial diversity, selected using an prototype selection algorithm based on the MinHash distance matrix among all non-redundant bacterial and archaeal genomes from NCBI (RefSeq and GenBank, complete and draft), plus several genome quality control criteria. A high-quality reference phylogeny is available for this genome pool, enabling subsequent phylogeny-based analyses. Also available are curated taxonomic annotations, based on NCBI and GTDB systems.
    • Domains: Bacteria, Archaea
    • Number of genomes: 10,575
    • Total length (bp): 32,861,886,373
    • Citation: Zhu Q, Mai U, Pfeiffer W, et al. Phylogenomics of 10,575 genomes reveals evolutionary proximity between domains Bacteria and Archaea. Nat Commun. 2019. 10(1):5477. doi: 10.1038/s41467-019-13443-4.
    • Numbers of taxonomic units:
      • Kingdoms: 2
      • Phyla: 146
      • Classes: 89
      • Orders: 196
      • Families: 422
      • Genera: 2,081
      • Species: 9,105
      • Strains: 89
      • Note: Nucleotide sequences per genome were concatenated with a linker of 20 “N”s.
  2. Rep200: NCBI representative and reference microbial genomes, corresponding to RefSeq release 200 (2020-05-14)
    • Genomes: 11,955
    • Nucleotides: 926,894
    • Basepairs: 62,823,581,921 (excluding gaps)
    • Numbers of taxonomic units:
      • Archaea: 419
      • Bacteria: 11080
      • Fungi: 320
      • Protozoa: 88
      • Viral: 48
  3. Rep94: NCBI representative and reference microbial genomes, corresponding to RefSeq release 94.
    • Domains: Bacteria, Archaea
    • Number of genomes: 5,808
    • Total length (bp): 23,165,526,011
    • Note: Nucleotide sequences per genome were concatenated with a linker of 20 “N”s.
    • Numbers of taxonomic units:
      • Kingdoms: 2
      • Phyla: 38
      • Classes: 85
      • Orders: 186
      • Families: 427
      • Genera: 1,931
      • Species: 5,636
      • Strains: 84
  4. Rep82: NCBI representative and reference microbial genomes, corresponding to RefSeq release 82.
    • Not available anymore for new processing
    • Domains: Bacteria, Archaea, Viruses/Viroids
    • Number of genomes: 10,519
    • Total length (bp): 20,387,349,319
    • Note: Plasmids were isolated from bacterial and archaeal host genomes and considered as separate genomes.
    • Numbers of taxonomic units:
      • Kingdoms: 6
      • Phyla: 55
      • Classes: 362
      • Orders: 182
      • Families: 452
      • Genera: 2,264
      • Species: 11,852
      • Strains: 4,263

Metatranscriptome processing

Qiita currently has one active Metatranscriptome data analysis pipeline, as follows:

  1. Ribosomal read filtering via SortMeRNA; details below. This produces a Ribosomal reads and a Non-ribosomal reads artifact/
  2. Taxonomic profiling via Woltka; for more information see details above.

Sample processing guidelines for metatranscriptomic data

Ribosomal read filtering

SortMeRNA is used for removal of ribosomal reads from quality filtered Metatranscriptome data

Latest SortMeRNA version: v2.1

Input: Quality filtered Metatranscriptome reads (FASTA/FASTQ) Ribosomal reads are identified by searching against pre-curated rRNA databases. Currently, rRNA databases covering bacteria, archaea and eukarya were downloaded and indexed from SILVA and Rfam. Currently indexed databases and their clustering ids:

  • silva-bacterial-16s-id 90%
  • silva-bacterial-23s-id 98%
  • silva-archaeal-16s-id 95%
  • silva-archaeal-23s-id 98%
  • silva-eukarya-18s-id 95%
  • silva-eukarya-28s-id 98%
  • rfam-5s-database-id 98%
  • rfam-5.8s-database-id 98%

The above databases and ID cut-offs were chosen to work with a range of samples including more diverse/complex environmental samples.

Building Custom databases

Custom databases can also be built in addition to the above mentioned databases. Custom databases can be built by using the using the ARB package to extract FASTA files for:

  • 16S bacteria, 16S archaea and 18S eukarya using SSURef_NR99_119_SILVA_14_07_14_opt.arb
  • 23S bacteria, 23S archaea and 28S eukarya using LSURef_119_SILVA_15_07_14_opt.arb

The built databases will then have to be indexed before running SortMeRNA. Reference database(s) and their corresponding indexes separated by “,” and multiple databases are separated by “:”

SortMeRNA Usage

SortMeRNA filters the ribosomal from the non-ribosomal reads from the input sample dataset (via BLAST search)and outputs two fasta/q files containing the ribosomal and non-ribosomal reads respectively. Additionally, a summary file showing the proportion of reads matching to each of the screened ribosomal databases can also be made available. Default options have been set to report only the best alignment per read reaching E-value. For non ribo-depleted samples (i.e. total RNA), the ribosomal reads obtained from SortMeRNA can be further used in taxonomic/compositional analysis. In the case of ribo-depleted samples, only the non-ribosomal reads are used in downstream analyses such as assembly, mapping, differential gene abundance analyses etc.