Anatomy of an Archive
QIIME 2 stores data in a directory structure called an Archive. These archives are zipped to make moving data simple and convenient.
An example of the metadata.yaml file:
uuid: 45c12936-4b60-484d-bbe1-98ff96bad145
type: FeatureTable[Frequency]
format: BIOMV210DirFmt
It is possible for format to be set as null in some cases; it means the /data/ directory (described below) does not have a schema. This occurs when the type is set as Visualization (representing a Visualization (Type)).
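As a rough illustration of how these three fields might be read, here is a minimal sketch. It assumes the file holds only flat key: value lines, as in the example above, and it deliberately avoids a real YAML parser. The function names parse_archive_metadata and has_data_schema are hypothetical and are not part of QIIME 2.

```python
def parse_archive_metadata(text):
    """Parse flat `key: value` lines into a dict (not a general YAML parser)."""
    fields = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, value = line.partition(":")
        value = value.strip()
        # YAML spells a missing value as `null`; map it to None.
        fields[key.strip()] = None if value == "null" else value
    return fields


def has_data_schema(fields):
    # A null format means /data/ has no schema, as for visualizations.
    return fields["format"] is not None


artifact = parse_archive_metadata(
    "uuid: 45c12936-4b60-484d-bbe1-98ff96bad145\n"
    "type: FeatureTable[Frequency]\n"
    "format: BIOMV210DirFmt\n"
)
# A made-up visualization record, reusing the same UUID for brevity.
visualization = parse_archive_metadata(
    "uuid: 45c12936-4b60-484d-bbe1-98ff96bad145\n"
    "type: Visualization\n"
    "format: null\n"
)
```

Here has_data_schema(artifact) is true and has_data_schema(visualization) is false, which mirrors the rule described above.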
The payload of an archive is stored in the aptly named /data/ directory. The structure of this subdirectory depends on the payload. If the archive is a visualization, then the payload is an interactive visualization implemented as a small static website (an index.html file and any other assets).
Additional information about visualizers can be found here: Visualizers.
In addition to storing data, we can store metadata containing information such as what actions were performed, what versions exist, and what references to cite. A more complete description can be found in Decentralized Provenance Tracking.
As it relates to the archive structure, the /provenance/ directory is designed to be self-contained and self-referential. This means that it duplicates some of the information available in the root of the archive, but this simplifies the code responsible for tracking and reading provenance.
To better illustrate this idea, we can look at the following diagram, representing an archive:
Looking closely, we see the previously described /data/ directory and metadata.yaml file, in addition to a VERSION file (described below) and the /provenance/ directory in question.
Following the provenance directory, we see that the provenance structure is repeated within the /provenance/artifacts/ directory. This directory contains the ancestral provenance of all artifacts used up to this point. Because the structure repeats itself, it is possible to create a new provenance directory by simply adding all input artifacts' /provenance/ directories into a new /provenance/artifacts/ directory. The /provenance/artifacts/ directories of the original inputs can then be merged together as well. Because the directories are named by UUID, we know the identity of each ancestor, and an ancestor seen twice can simply be ignored. This reduces the problem of capturing ancestral provenance to one of merging uniquely named file trees.
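The merge described above can be sketched roughly as follows. This is not the framework's own implementation; merge_ancestral_provenance and _copy_once are hypothetical names, the inputs are assumed to be given as a mapping from UUID to that input's provenance directory, and the short names a, b, and shared stand in for real UUIDs.

```python
import shutil
import tempfile
from pathlib import Path


def _copy_once(src, dest):
    # Directories are named by UUID, so an existing name is the same
    # ancestor seen again and can be ignored.
    if dest.exists():
        return
    # Skip any nested artifacts/ directory; ancestors are merged separately.
    shutil.copytree(src, dest, ignore=shutil.ignore_patterns("artifacts"))


def merge_ancestral_provenance(inputs, new_artifacts_dir):
    """Build a new artifacts directory from {uuid: provenance_dir} inputs."""
    new_artifacts_dir.mkdir(parents=True, exist_ok=True)
    for uuid, provenance_dir in inputs.items():
        # The input's own provenance, stored under its UUID.
        _copy_once(provenance_dir, new_artifacts_dir / uuid)
        # The input's ancestors, which are already named by UUID.
        ancestors = provenance_dir / "artifacts"
        if ancestors.is_dir():
            for ancestor in sorted(ancestors.iterdir()):
                _copy_once(ancestor, new_artifacts_dir / ancestor.name)


with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # Two inputs, "a" and "b", share a single ancestor, "shared".
    for name in ("a", "b"):
        ancestor = tmp / name / "provenance" / "artifacts" / "shared"
        ancestor.mkdir(parents=True)
        (ancestor / "metadata.yaml").write_text("uuid: shared\n")
        (tmp / name / "provenance" / "metadata.yaml").write_text(f"uuid: {name}\n")
    target = tmp / "new" / "provenance" / "artifacts"
    inputs = {name: tmp / name / "provenance" for name in ("a", "b")}
    merge_ancestral_provenance(inputs, target)
    merged = sorted(p.name for p in target.iterdir())
    nested_copied = (target / "a" / "artifacts").exists()
```

After the merge, the shared ancestor appears once in the new directory, alongside one entry for each input.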
ZIP files are a ubiquitous and well-understood format. There is a huge variety of software available to read and manipulate ZIP files.
The ZIP format enables random access to files within the archive, making it possible to read data without extracting the entire contents of the ZIP file (in contrast to a linear archive like TAR).
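To make the random-access point concrete, the sketch below uses Python's standard zipfile module to read one small member of an archive without extracting the rest. The archive is built in memory purely for illustration, and the file names and contents under the root directory are made up.

```python
import io
import zipfile

# Build a small in-memory ZIP that loosely mimics an archive layout.
root = "45c12936-4b60-484d-bbe1-98ff96bad145"
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as zf:
    zf.writestr(f"{root}/metadata.yaml", "type: FeatureTable[Frequency]\n")
    zf.writestr(f"{root}/data/large-file.txt", "x" * 100_000)

with zipfile.ZipFile(buffer) as zf:
    # The member listing comes from the ZIP central directory.
    names = zf.namelist()
    # The root directory name can be recovered from any member path.
    found_root = names[0].split("/", 1)[0]
    # Only this member is decompressed; the large file is never read.
    metadata_text = zf.read(f"{found_root}/metadata.yaml").decode("utf-8")
```

A TAR file offers no such index, so finding a single member generally means scanning through the archive from the start.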
qiime2.core.archive.archiver:_ZipArchive is the structure responsible for
managing the contents of a ZIP file (using
Every QIIME 2 archive has the following structure: a root directory named with the standard representation of a UUID (version 4), and a file within that directory named VERSION. Within VERSION, the following text will be present:
QIIME 2
archive: <integer version>
framework: <version string>
This file is NOT YAML (and shouldn’t be). The goal is to avoid it being caught up by a future refactor where some other structured file format is used instead of YAML (we do like YAML however). Additionally, line-endings are currently unspecified, but in practice will be UNIX-style.
<integer version> is the version that the archive was saved with. This may be used to identify the schema of the archive structure, allowing software to dispatch appropriate parsing logic.
As a historical example, archive version 0 had no /provenance/ directory. This means there is no reason to look for it in the archive. Admittedly, it is just as easy to check if the directory exists; however, this pattern can be used for more complex cases.
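A minimal sketch of this kind of version dispatch might look like the following. It assumes the VERSION text is laid out on three lines (a QIIME 2 header line, then an archive line, then a framework line) and that archive version 0 is the only one without a /provenance/ directory. The names parse_version_file and has_provenance are hypothetical, and the framework string in the example is made up.

```python
def parse_version_file(text):
    """Return (archive_version, framework_version) from VERSION text."""
    # splitlines() accepts both UNIX and Windows line endings.
    lines = text.splitlines()
    if len(lines) != 3 or lines[0] != "QIIME 2":
        raise ValueError("not a QIIME 2 VERSION file")
    archive_key, _, archive_value = lines[1].partition(": ")
    framework_key, _, framework_value = lines[2].partition(": ")
    if archive_key != "archive" or framework_key != "framework":
        raise ValueError("unexpected keys in VERSION file")
    return int(archive_value), framework_value


def has_provenance(archive_version):
    # Assumption: only archive version 0 lacks a /provenance/ directory.
    return archive_version >= 1


# Illustrative input; the framework version string is invented.
archive_version, framework_version = parse_version_file(
    "QIIME 2\narchive: 0\nframework: 2.0.0\n"
)
look_for_provenance = has_provenance(archive_version)
```

Keying the decision on the integer version, rather than probing the directory tree, keeps the parsing rules for each archive version in one place.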
These rules are encoded in