Cache

cache.get_cache()

Returns the cache QIIME 2 has been instructed to use in this invocation. By default this is the cache located at $TMPDIR/qiime2/$USER; if the user has set a cache, that cache is returned instead. Various parts of the framework use this to determine which cache to save to and load from.

Returns:

The cache QIIME 2 is using for the current invocation.

Return type:

Cache

Examples

>>> import os
>>> import tempfile
>>> from qiime2.core.cache import Cache, get_cache
>>> test_dir = tempfile.TemporaryDirectory(prefix='qiime2-test-temp-')
>>> cache_path = os.path.join(test_dir.name, 'cache')
>>> cache = Cache(cache_path)
>>> # get_cache() will return the temp cache, not the one we just made.
>>> get_cache() == cache
False
>>> # Inside a "with" block on the cache we just made, get_cache() returns it.
>>> with cache:
...     get_cache() == cache
True
>>> # Now that we have exited our cache, we will get the temp cache again.
>>> get_cache() == cache
False
>>> test_dir.cleanup()

The Cache Class

class qiime2.core.cache.Cache(path=None)

General structure of the cache:

artifact_cache/
├── data/
│   ├── uuid1/
│   ├── uuid2/
│   ├── uuid3/
│   └── uuid4/
├── keys/
│   ├── bar.yaml
│   ├── baz.yaml
│   └── foo.yaml
├── pools/
│   └── puuid1/
│       ├── uuid1 -> ../../data/uuid1/
│       └── uuid2 -> ../../data/uuid2/
├── processes/
│   └── <process-id>-<process-create-time>@<user>/
│       ├── uuid3 -> ../../data/uuid3/
│       └── uuid4 -> ../../data/uuid4/
└── VERSION

Data: The data directory contains all of the artifacts in the cache in unzipped form.

Keys: The keys directory contains yaml files that refer to either a piece of data or a pool. The data/pool referenced by the key will be kept as long as the key exists.

Pools: The pools directory contains all named (keyed) pools in the cache. Each pool contains symlinks to all of the data it contains.

Processes: The processes directory contains process pools named <process-id>-<process-create-time>@<user>, one for each process that has used this cache. Each pool contains symlinks to every element in the data directory that the creating process has used in some way (created, loaded, etc.). These symlinks are ephemeral and have lifetimes <= the lifetime of the process that created them. Use keys for more permanent storage.

VERSION: This file records which version of QIIME 2 created the cache and which version of the cache format it uses. If breaking changes are made in the future, the cache version number will allow for backwards compatibility.
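
To make the layout concrete, here is a small sketch that builds the same directory skeleton in a temporary directory and links one piece of data into a named pool with a relative symlink. It only mirrors the tree shown above; `build_toy_layout` and `link_into_pool` are hypothetical helpers, and the contents written to VERSION are a placeholder rather than the real file format.

```python
import os
import tempfile


def build_toy_layout(root):
    """Create the directory skeleton described above under `root`."""
    for name in ("data", "keys", "pools", "processes"):
        os.makedirs(os.path.join(root, name), exist_ok=True)
    # Placeholder contents: the real VERSION format is not specified here.
    with open(os.path.join(root, "VERSION"), "w") as fh:
        fh.write("toy\n")


def link_into_pool(root, pool, uuid):
    """Symlink data/<uuid> into pools/<pool>/ using a relative target."""
    pool_dir = os.path.join(root, "pools", pool)
    os.makedirs(pool_dir, exist_ok=True)
    os.makedirs(os.path.join(root, "data", uuid), exist_ok=True)
    link = os.path.join(pool_dir, uuid)
    # Relative target, mirroring "uuid1 -> ../../data/uuid1/" in the tree.
    os.symlink(os.path.join("..", "..", "data", uuid), link)
    return link


with tempfile.TemporaryDirectory() as tmp:
    root = os.path.join(tmp, "artifact_cache")
    build_toy_layout(root)
    link = link_into_pool(root, "puuid1", "uuid1")
    entries = sorted(os.listdir(root))
    resolved_ok = os.path.samefile(link, os.path.join(root, "data", "uuid1"))
    print(entries)
    print(resolved_ok)
```

Because the link target is relative, the whole cache directory can be moved without breaking the pool's references.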

__init__(path=None, process_pool_lifespan=45)

Creates a Cache object backed by the directory specified by path. If no path is provided, the path to the temporary cache is used.

Warning

If no path is provided and the path $TMPDIR/qiime2/$USER exists but is not a valid cache, we remove the directory and create a cache there.

Parameters:
  • path (str or PathLike object) – Should point either to a non-existent writable directory to be created as a cache or to an existing writable cache. Defaults to None which creates the cache at $TMPDIR/qiime2/$USER.

  • process_pool_lifespan (int) – The number of days process pools are allowed to exist before they are culled.
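
As an illustration of the default location, the sketch below assembles a path of the form $TMPDIR/qiime2/$USER from the standard library. How QIIME 2 itself resolves these values is not shown in this reference, so treat `default_cache_path` as a hypothetical helper built on the assumption that the temporary directory and the login name are the two inputs.

```python
import getpass
import os
import tempfile


def default_cache_path(tmpdir=None, user=None):
    """Build <tmpdir>/qiime2/<user>, mirroring $TMPDIR/qiime2/$USER."""
    # tempfile.gettempdir() honors the TMPDIR environment variable.
    tmpdir = tmpdir if tmpdir is not None else tempfile.gettempdir()
    user = user if user is not None else getpass.getuser()
    return os.path.join(tmpdir, "qiime2", user)


# Explicit arguments keep the example deterministic.
example = default_cache_path(tmpdir="/tmp", user="alice")
print(example)
```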

__enter__()

Tell QIIME 2 to use this cache in its current invocation (see get_cache).

__exit__(*args)

Tell QIIME 2 to go back to using the default cache.

classmethod is_cache(path)

Tells us if the path we were given is a cache.

Parameters:

path (str or PathLike object) – The path to the cache we are checking.

Returns:

Whether the path we were given is a cache or not.

Return type:

bool

Examples

>>> test_dir = tempfile.TemporaryDirectory(prefix='qiime2-test-temp-')
>>> cache_path = os.path.join(test_dir.name, 'cache')
>>> cache = Cache(cache_path)
>>> Cache.is_cache(cache_path)
True
>>> test_dir.cleanup()

create_pool(key, reuse=False)

Used to create named pools.

Named pools can be used by pipelines to store all intermediate results created by the pipeline and prevent them from being reaped. This allows us to resume a failed pipeline: the data the pipeline saved to the named pool before it crashed is collected and reused, so the steps that created it do not need to run again and the pipeline can be rerun from where it failed.

Once the pipeline completes, all of its final results are saved to the pool as well. The idea is that the user can then reuse the pool's keys to refer to the final data and remove the pool, since the pipeline that created it has completed.
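
The resume behavior described here can be illustrated with a toy model in which a plain dict stands in for the named pool. This is a sketch of the idea only, not how QIIME 2 pipelines are implemented; `run_pipeline`, `fail`, and the step names are hypothetical.

```python
def run_pipeline(steps, pool):
    """Run named steps in order, reusing any result already in `pool`.

    `steps` is a list of (name, function) pairs; each function receives the
    previous step's result. `pool` is a dict standing in for a named pool.
    """
    executed = []
    result = None
    for name, func in steps:
        if name in pool:
            # Intermediate result survived an earlier run: reuse it.
            result = pool[name]
        else:
            result = func(result)
            pool[name] = result
            executed.append(name)
    return result, executed


def fail(_):
    raise RuntimeError("simulated crash")


pool = {}
first_try = [("load", lambda _: [3, 1, 2]), ("sort", sorted), ("total", fail)]
try:
    run_pipeline(first_try, pool)
except RuntimeError:
    pass  # "load" and "sort" are already stored in the pool

second_try = [("load", lambda _: [3, 1, 2]), ("sort", sorted), ("total", sum)]
result, executed = run_pipeline(second_try, pool)
print(result, executed)  # 6 ['total']
```

On the second run only the step that failed is executed; the earlier results come straight from the pool.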

Parameters:
  • key (str) – The key to use to reference the pool.

  • reuse (bool) – Whether to reuse a pool if a pool with the given key already exists.

Returns:

The pool we created.

Return type:

Pool

Examples

>>> test_dir = tempfile.TemporaryDirectory(prefix='qiime2-test-temp-')
>>> cache_path = os.path.join(test_dir.name, 'cache')
>>> cache = Cache(cache_path)
>>> pool = cache.create_pool(key='key')
>>> cache.get_keys() == set(['key'])
True
>>> cache.get_pools() == set(['key'])
True
>>> test_dir.cleanup()

garbage_collection()

Runs garbage collection on the cache in the following steps:

1. Iterate over all keys and log all data and pools referenced by the keys.

2. Iterate over all named pools and delete any that were not referred to by a key while logging all data in pools that were referred to by keys.

3. Iterate over all process pools and log all data they refer to.

4. Iterate over all data and remove any that was not referenced.

This process destroys data and named pools that do not have keys, along with process pools older than the cache's process_pool_lifespan, which defaults to 45 days. It never removes keys.

We lock out other processes and threads from accessing the cache while garbage collecting to ensure the cache remains in a consistent state.
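
The four steps above amount to a mark-and-sweep pass. The sketch below models them over plain dicts and sets so the logic is easy to follow. It is an illustration under simplifying assumptions, not the QIIME 2 implementation: `collect` is a hypothetical function, the key and pool representations are invented for the example, and both the age-based culling of process pools and the locking are left out.

```python
def collect(keys, pools, process_pools, data):
    """Toy mark-and-sweep over plain dicts and sets.

    keys:          key name -> ("data", uuid) or ("pool", pool name)
    pools:         pool name -> set of uuids
    process_pools: process pool name -> set of uuids
    data:          set of uuids currently stored
    """
    referenced_data = set()
    referenced_pools = set()

    # Step 1: log everything the keys refer to.
    for kind, target in keys.values():
        if kind == "data":
            referenced_data.add(target)
        else:
            referenced_pools.add(target)

    # Step 2: drop unkeyed named pools, log data in the keyed ones.
    for name in list(pools):
        if name in referenced_pools:
            referenced_data |= pools[name]
        else:
            del pools[name]

    # Step 3: log data still referenced by process pools.
    for uuids in process_pools.values():
        referenced_data |= uuids

    # Step 4: sweep data that nothing referenced (mutates `data` in place).
    data &= referenced_data
    return data


keys = {"foo": ("data", "uuid1"), "bar": ("pool", "puuid1")}
pools = {"puuid1": {"uuid2"}, "orphan": {"uuid5"}}
process_pools = {"123-456@user": {"uuid3"}}
data = {"uuid1", "uuid2", "uuid3", "uuid4", "uuid5"}

collect(keys, pools, process_pools, data)
print(sorted(data))   # ['uuid1', 'uuid2', 'uuid3']
print(sorted(pools))  # ['puuid1']
```

In this toy run, uuid4 is removed because nothing refers to it, and uuid5 is removed because the only pool holding it had no key.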

save(ref, key)

Saves data into the cache by creating a key referring to the data then copying the data if it is not already in the cache.

Parameters:
  • ref (Result) – The QIIME 2 result we are saving into the cache.

  • key (str) – The key we are saving the result under.

Returns:

A Result backed by the data in the cache.

Return type:

Result

Examples

>>> from qiime2.sdk.result import Artifact
>>> from qiime2.core.testing.type import IntSequence1
>>> test_dir = tempfile.TemporaryDirectory(prefix='qiime2-test-temp-')
>>> cache_path = os.path.join(test_dir.name, 'cache')
>>> cache = Cache(cache_path)
>>> artifact = Artifact.import_data(IntSequence1, [0, 1, 2])
>>> saved_artifact = cache.save(artifact, 'key')
>>> # save returned an artifact that is backed by the data in the cache
>>> str(saved_artifact._archiver.path) == str(cache.data / str(artifact.uuid))
True
>>> cache.get_keys() == set(['key'])
True
>>> test_dir.cleanup()

load(key)

Loads the data pointed to by a key. Defers to Cache.load_collection if the key contains ‘order’, and raises an error for keys that refer to pools without order.

Parameters:

key (str) – The key to the data we are loading.

Returns:

The loaded data pointed to by the key.

Return type:

Result

Raises:

ValueError – If the key does not reference any data, which usually means you tried to load a pool.

Examples

>>> from qiime2.sdk.result import Artifact
>>> from qiime2.core.testing.type import IntSequence1
>>> test_dir = tempfile.TemporaryDirectory(prefix='qiime2-test-temp-')
>>> cache_path = os.path.join(test_dir.name, 'cache')
>>> cache = Cache(cache_path)
>>> artifact = Artifact.import_data(IntSequence1, [0, 1, 2])
>>> saved_artifact = cache.save(artifact, 'key')
>>> loaded_artifact = cache.load('key')
>>> loaded_artifact == saved_artifact == artifact
True
>>> str(loaded_artifact._archiver.path) == str(cache.data / str(artifact.uuid))
True
>>> test_dir.cleanup()

remove(key)

Removes a key from the cache then runs garbage collection to remove anything the removed key was referencing and any other loose data.

Parameters:

key (str) – The key we are removing.

Raises:

KeyError – If the key does not exist in the cache.

Examples

>>> from qiime2.sdk.result import Artifact
>>> from qiime2.core.testing.type import IntSequence1
>>> test_dir = tempfile.TemporaryDirectory(prefix='qiime2-test-temp-')
>>> cache_path = os.path.join(test_dir.name, 'cache')
>>> cache = Cache(cache_path)
>>> artifact = Artifact.import_data(IntSequence1, [0, 1, 2])
>>> saved_artifact = cache.save(artifact, 'key')
>>> cache.get_keys() == set(['key'])
True
>>> cache.remove('key')
>>> cache.get_keys() == set()
True
>>> # Note that the data is still in the cache due to our
>>> # saved_artifact causing the process pool to keep a reference to it
>>> cache.get_data() == set([str(saved_artifact.uuid)])
True
>>> del saved_artifact
>>> # The data is still there even though the reference is gone because
>>> # the cache has not run its own garbage collection yet. For various
>>> # reasons, it is not feasible for us to safely garbage collect the
>>> # cache when a reference in memory is deleted. Note also that
>>> # "artifact" is not backed by the data in the cache, it only lives
>>> # in memory, but it does have the same uuid as "saved_artifact."
>>> cache.get_data() == set([str(artifact.uuid)])
True
>>> cache.garbage_collection()
>>> # Now it is gone
>>> cache.get_data() == set()
True
>>> test_dir.cleanup()

get_data()

Returns a set of all data in the cache.

Returns:

All of the data in the cache, given as the names of the top-level directories, which are the UUIDs of the artifacts.

Return type:

set[str]

get_keys()

Returns a set of all keys in the cache.

Returns:

All of the keys in the cache. Just the names, not what they refer to.

Return type:

set[str]

get_pools()

Returns a set of all pools in the cache.

Returns:

The names of all of the named pools in the cache.

Return type:

set[str]

get_processes()

Returns a set of all process pools in the cache.

Returns:

The names of all of the process pools in the cache.

Return type:

set[str]