Qiita can be seen as an analytical pipeline broker that can apply any specific pipeline, tool, or script to any of its stored data. All the analytical pipelines are autonomous and independently developed and tested, which makes it easier to support current tools and develop new ones. This principle is supported via virtual environments and artifacts. A separate virtual environment for each pipeline gives the freedom to add any pipeline, with any software dependencies, to Qiita. Artifacts, which are basically any file in the system, from raw sequences to contingency tables or even data visualizations, permit the system to store any kind of data and also to define, within each pipeline, which commands and parameters can be applied to them.
Qiita’s main entity is the study. A study can have many samples, with many preparations, that have been sequenced several times (Figure 1). Additionally, study artifacts have three different states: sandboxed, private, and public. A sandboxed artifact has all operational capabilities in the system but is not publicly available, which allows quick integration with other studies while keeping it private so the user can improve the analysis. Once a user decides that it is time to make their artifact public, they can request that an administrator validate their study information and make it private, and possibly submit it to a permanent repository, where it can also be kept private until the user wants to make it public. At this stage the whole study in Qiita (including all processed data) is private. This process is completely automatic via the Graphical User Interface (GUI). Currently, sequence data is deposited for permanent storage in the European Nucleotide Archive (ENA), part of the European Bioinformatics Institute (EBI). Finally, when the user is ready, usually when the main manuscript of the study is ready for publication, the user can request that the artifact be made public, both in Qiita and in the permanent repository (Figure 2).
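The three states and the order in which an artifact moves through them can be sketched as a small state machine. This is only an illustration of the lifecycle described above, not Qiita's actual code: the names `Visibility`, `ALLOWED`, and `advance` are hypothetical, and the sketch assumes that only the forward moves mentioned in the text (sandboxed to private, private to public) are possible.

```python
from enum import Enum


class Visibility(Enum):
    """The three artifact states named in the text."""
    SANDBOXED = "sandboxed"
    PRIVATE = "private"
    PUBLIC = "public"


# Forward transitions described in the text (hypothetical encoding;
# whether Qiita permits any other move is not stated here).
ALLOWED = {
    Visibility.SANDBOXED: {Visibility.PRIVATE},  # administrator validates the study
    Visibility.PRIVATE: {Visibility.PUBLIC},     # user requests publication
    Visibility.PUBLIC: set(),
}


def advance(current: Visibility, target: Visibility) -> Visibility:
    """Return the new state, or raise if the move is not one described above."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.value} to {target.value}")
    return target


state = Visibility.SANDBOXED
state = advance(state, Visibility.PRIVATE)
state = advance(state, Visibility.PUBLIC)
```

Under this assumption, a direct jump from sandboxed to public raises a `ValueError`, which mirrors the requirement that an administrator validates the study before it can be published.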
Qiita allows for complex study designs
As seen in Figure 1, studies are the main source of data for Qiita. A study can contain a single set of samples or multiple sets, each of which can have different preparations.
The traditional study design includes a single sample information file and a single preparation information file. However, as technology improves, study designs become more complex: a study with a defined set of collected samples can have subsets prepared in different ways to answer different questions. For example, let’s imagine a study looking at how different microbial communities change during mammalian corpse decomposition; the full study design is to collect a set of samples, which you will then process with 16S, 18S and ITS primers. This will result in 1 sample information file and 3 preparation information files; see it in Qiita.
Now, let’s imagine another, more complex example:
- All of the samples were prepped for 16S and sequenced in two separate MiSeq runs
- 50 of the samples were prepped for 18S and ITS, and sequenced in a single MiSeq run
- 50 of the samples were prepped for WGS and sequenced on a single HiSeq run
- 30 of the samples have metabolomic profiles
To represent this project in Qiita, you will need to create a single study with a single sample information file that contains all 100 of the samples. Separately, you will need to create five prep information files that describe the preparations for the corresponding samples. All raw data uploaded will need to correspond to a specific preparation (prep) information file. For instance, the data sets described above would require the following data and prep information:
- All of the samples prepped for 16S and sequenced in two separate MiSeq runs
  - 1 prep information file describing the two MiSeq runs (use a run_prefix column to differentiate between the two MiSeq runs, more on metadata below) where the 100 samples are represented
  - the 4-6 fastq raw data files without demultiplexing (i.e., the forward, reverse (optional), and barcodes for each run)
- 50 of the samples prepped for 18S and ITS, and sequenced in a single MiSeq run
  - 2 prep information files, one describing the 18S and the other describing the ITS preparations
  - the 2-3 fastq raw data files (forward, reverse (optional), and barcodes)
- 50 of the samples prepped for WGS and sequenced on a single HiSeq run
  - 1 prep information file describing how the samples were multiplexed
  - the 2-3 fastq raw data files (forward, reverse (optional), and barcodes)
  - NOTE: We currently do not have a processing pipeline for WGS but should soon.
- 30 of the samples with metabolomic profiles
  - 1 prep information file
  - the raw data file(s) from the metabolomic characterization
  - NOTE: We currently do not have a processing pipeline for metabolomics but should soon.
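The relationship between the single sample information file and the prep information files can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not Qiita's implementation: the sample IDs, the choice of which samples fall in each subset, the split of samples between the two MiSeq runs, the `run1`/`run2` values, and the helper `check_design` are all invented for the example. The only points taken from the text are the subset sizes and the use of a run_prefix column to tell the two 16S runs apart.

```python
# Hypothetical sample IDs standing in for the 100 samples in the
# sample information file.
study_samples = {f"S{i:03d}" for i in range(1, 101)}
ordered = sorted(study_samples)

# One entry per prep information file. Which samples belong to each
# subset is not stated in the text; the slices here are arbitrary.
preps = {
    "16S": set(ordered),               # all 100 samples
    "18S": set(ordered[:50]),          # 50 samples
    "ITS": set(ordered[:50]),          # the same 50 samples (assumed)
    "WGS": set(ordered[:50]),          # 50 samples
    "metabolomics": set(ordered[:30]), # 30 samples
}


def check_design(study_samples, preps):
    """Map each prep name to the sample IDs it lists that are missing
    from the sample information file (empty dict means consistent)."""
    problems = {}
    for name, samples in preps.items():
        missing = sorted(samples - study_samples)
        if missing:
            problems[name] = missing
    return problems


# Rows of the single 16S prep information file: the run_prefix column
# records which of the two MiSeq runs produced each sample's data.
# The 50/50 split and the prefix values are assumptions.
rows_16s = [
    {"sample_name": sample, "run_prefix": "run1" if i < 50 else "run2"}
    for i, sample in enumerate(ordered)
]

print(check_design(study_samples, preps))
```

A prep that names a sample absent from the sample information file would show up in the returned dictionary, which is the kind of mismatch worth catching before uploading raw data.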
Qiita allows the hosting of multiple portals within the same infrastructure. Each portal exposes a subset of studies (often with a similar theme) at a different URL while sharing the same resources. Sharing the same backend resources avoids maintaining multiple sites whose data could get out of sync.
The currently available portals are:
- Main site (all studies from all portals): qiita.microbio.me
- Sloan portal (built environment): sloan_microbe.microbio.me
- Earth Microbiome Project portal (studies generated under the EMP): emp.microbio.me