Maintainer guide#
Managing data#
This section describes how the data shipped with Eradiate is managed.
Overview#
Eradiate ships data of various sizes and maturity levels. The current data management system aims for a compromise between ease of use, reproducibility, and bandwidth and storage efficiency.
The entry point to Eradiate’s data management system is the eradiate.data module. The open_dataset and load_dataset functions serve data retrieved from data stores. Eradiate currently has three data stores aggregated as a MultiDataStore instance, accessible as the eradiate.data.data_store member. Each one of these aggregated data stores is referenced with an identifier:
small_files
A directory of files versioned in the eradiate-data GitHub repository. This data store contains small files, and implementing reproducibility with it is fairly simple. This is the location where data goes by default, i.e. when it is small enough to fit there. The small_files data store is accessed offline in a development setup, i.e. when the Eradiate repository and its submodules are cloned locally, or remotely when Eradiate is installed in user mode. This data store holds a registry of files which it uses for integrity checks when it is accessed online. Only files in the registry can be served by this data store, regardless of whether it is accessed online or offline: files of the eradiate-data repository which are not registered cannot be accessed through it.
large_files_stable
A directory of files hosted remotely. The files in this data store are expected to be too large to be conveniently stored in a Git repository. The data store holds a registry, used for integrity checks upon download, and it contains stable files. This means that the files should not be modified: if data is to be changed, it should be saved as a new file. This data store is expected to guarantee reproducibility. As with small_files, unregistered files cannot be served by this data store.
large_files_unstable
Another directory of files hosted remotely. As with large_files_stable, the files in this data store are expected to be too large to be conveniently stored in a Git repository. Files there are not registered: any query leading to data being downloaded is considered successful. This store does not guarantee reproducibility! In particular, this is the location where experimental data sets are located. Files stored there should not be used to write tests.
The data_store is queried by passing paths to the desired resources (relative to the root of the registered data stores) to the data_store.fetch() method. The aggregated stores are successively queried, and the outcome of the first successful query is returned.
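The successive-query behaviour can be illustrated with a minimal, self-contained sketch. The `DictStore` and `FallbackStore` classes below are hypothetical stand-ins written for illustration; they are not Eradiate's `MultiDataStore` implementation, and only the "first successful query wins" logic is taken from the description above.

```python
from collections import OrderedDict


class DictStore:
    """Hypothetical store serving resources from an in-memory mapping."""

    def __init__(self, files):
        self.files = dict(files)

    def fetch(self, path):
        # An unknown resource raises, which stands for a failed query
        if path not in self.files:
            raise KeyError(path)
        return self.files[path]


class FallbackStore:
    """Hypothetical aggregate: queries stores in order, returns the first hit."""

    def __init__(self, stores):
        self.stores = OrderedDict(stores)

    def fetch(self, path):
        for store_id, store in self.stores.items():
            try:
                return store_id, store.fetch(path)
            except KeyError:
                continue  # Query failed, try the next store
        raise KeyError(f"no store could serve '{path}'")


stores = FallbackStore(
    [
        ("small_files", DictStore({"spectra/a.nc": b"small"})),
        ("large_files_stable", DictStore({"spectra/b.nc": b"stable"})),
        ("large_files_unstable", DictStore({"spectra/a.nc": b"unstable"})),
    ]
)
```

In this sketch, a resource present in several stores is always served by the one listed first, which is why the order of aggregation matters.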
All online stores implement the following features which help reduce the amount of online storage and traffic:
caching: requested data is automatically downloaded and cached locally;
lazy download: if data is already available locally, it is not downloaded again;
compressed data substitution: upon query, online stores first check if a file with the same name and the .gz extension is available; if so, that file is downloaded, then automatically decompressed locally and served.
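The three features listed above can be sketched together as follows. This is an illustration under simplifying assumptions, not Eradiate's code: a local directory stands in for the remote storage, and the `fetch_cached` function and its parameters are hypothetical names.

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def fetch_cached(path, remote_dir, cache_dir):
    """Return a local copy of `path`, copying from `remote_dir` only if needed."""
    remote_dir, cache_dir = Path(remote_dir), Path(cache_dir)
    target = cache_dir / path

    # Lazy download: reuse the cached file if it already exists
    if target.is_file():
        return target

    target.parent.mkdir(parents=True, exist_ok=True)
    compressed = remote_dir / (path + ".gz")

    if compressed.is_file():
        # Compressed data substitution: prefer "<name>.gz", decompress locally
        with gzip.open(compressed, "rb") as src, open(target, "wb") as dst:
            shutil.copyfileobj(src, dst)
    else:
        shutil.copyfile(remote_dir / path, target)

    return target


with tempfile.TemporaryDirectory() as tmp:
    remote, cache = Path(tmp, "remote"), Path(tmp, "cache")
    remote.mkdir()
    with gzip.open(remote / "data.txt.gz", "wb") as f:
        f.write(b"hello")

    first = fetch_cached("data.txt", remote, cache)
    content = first.read_bytes()

    # Removing the "remote" file shows that the second query hits the cache
    (remote / "data.txt.gz").unlink()
    second = fetch_cached("data.txt", remote, cache)
    same = second == first and second.read_bytes() == content
```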
File registries, as mentioned earlier, are used for integrity checks when downloading. They are also used to check if data has changed: if the online hash value of the requested resource is different from the hash of the file in the local cache directory, the file is downloaded again.
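The hash comparison described above can be sketched with the standard `hashlib` module. The function names are hypothetical, and the registry is reduced to a single expected hash value; how Eradiate actually stores and looks up registry entries is not shown here.

```python
import hashlib
import tempfile
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a (small) file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def needs_download(cached_file, expected_hash):
    """True if the cached copy is missing or does not match the registry hash."""
    cached_file = Path(cached_file)
    return not cached_file.is_file() or file_sha256(cached_file) != expected_hash


with tempfile.TemporaryDirectory() as tmp:
    cached = Path(tmp, "some_data.nc")
    registry_hash = hashlib.sha256(b"version 1").hexdigest()

    missing = needs_download(cached, registry_hash)  # nothing cached yet
    cached.write_bytes(b"version 1")
    fresh = needs_download(cached, registry_hash)  # hashes match
    cached.write_bytes(b"version 2")
    stale = needs_download(cached, registry_hash)  # hashes differ
```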
Note
The large_files_unstable data store has no hash check: this means that refreshing its local cache can only be achieved by deleting its contents.
Modifying the data#
Each store requires a different protocol.
small_files
Install pre-commit and install the git hook scripts:
cd $ERADIATE_SOURCE_DIR/resources/data
pre-commit install
Now add some data and commit your changes:
git checkout -b my_branch
git add some_data.nc
git commit -m "Added some data"
The output should look something like:
Update registry..........................................................Failed
- hook id: update-registry
- files were modified by this hook
Creating registry file from '.'
Using rules in 'registry_rules.yml'
Writing registry file to 'registry.txt'
100% 181/181 [00:00<00:00, 100859.44it/s]
The hook script failed because we changed the data and the changes were not committed. This is the expected behaviour. The hook script updated the registry file with the sha256 sum of the data file we added. Now add the changes to the registry file and commit again:
git add registry.txt
git commit -m "Added some data"
This time, the output should look something like:
Update registry..........................................................Passed
[master 0b9c760] Added some data
 2 files changed, 2 insertions(+)
 create mode 100644 spectra/some_data.nc
The rules used to create the registry file are defined in the "registry_rules.yml" file. Be aware that if you add a data file that is not included by these rules, it will not be registered and therefore it will not be accessible by the data store.
If, for some reason, you cannot use pre-commit, then you must be very careful and update the registry manually using the eradiate data make-registry command-line tool (it should be run in the data submodule).
large_files_stable
The most complicated: avoid updating the files, just add new ones. When doing so, you have to update the registry: compute the sha256 hash of the new file (e.g. with the sha256sum command-line tool) and update the registry file with this new entry. If you happen to have the full contents of the data store on your hard drive, you may also use the eradiate data make-registry command-line tool to update the registry automatically.
large_files_unstable
The simplest: just drop the file in the remote storage, it will be immediately accessible.
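For the manual registry update described for large_files_stable, the sha256 hash can also be computed from Python. The sketch below hashes files in chunks, which suits large files, and prints one entry per file. The "relative path, space, hash" line layout is an assumption made for illustration: check the existing registry file for the actual format, and prefer the eradiate data make-registry tool when it is applicable.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def registry_lines(root):
    """Yield one '<relative path> <sha256>' line per file (assumed layout)."""
    root = Path(root)
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        yield f"{path.relative_to(root).as_posix()} {sha256_of(path)}"


# Demonstration on a temporary directory holding a dummy file
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "spectra").mkdir()
    Path(tmp, "spectra", "some_data.nc").write_bytes(b"dummy payload")
    lines = list(registry_lines(tmp))

for line in lines:
    print(line)
```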
Managing dependencies#
Eradiate is managed using the Pixi project manager. It notably allows us to maintain a Conda-based setup with most dependencies sourced from PyPI, with layered requirements. Be sure to read the relevant entries in the Pixi documentation:
We use the following requirement groups, which manifest as features in the Pixi model:
optional: Only contains the Eradiate kernel package. This is needed to allow developers to install all dependencies except the kernel.
recommended: Optional packages used by specific subcomponents of Eradiate.
docs: Packages needed to compile the documentation.
test: Packages needed to run the test suite.
When adding new requirements, be sure to:
Prioritize PyPI packages, using the --pypi option.
Register the new requirement to the appropriate group.
Check in the lock file after it is updated.
Note
Our Read The Docs build uses stock Python virtual environments and installs dependencies with Pip. A specific requirement file is generated for that purpose, using the docs-lock Pixi task. This task is automatically executed upon committing, so the file is always up-to-date.
Making a release of Eradiate#
Preparation
Make sure main is up-to-date and all tests pass.
If necessary (this applies only if the Pixi environment is not activated already, e.g. by a direnv script), start a Pixi shell:
pixi shell -e dev
In your shell, set the variable RELEASE_VERSION to the target version value:
export RELEASE_VERSION=X.Y.Z
Create a new branch for the release:
git checkout main && git pull upstream main && git checkout -b bump/prepare-v$RELEASE_VERSION
Make sure that dependencies are correct (check in particular the kernel version). Use the release checker utility for this:
python resources/release.py check-mitsuba
Bump the version number using Bump My Version:
pixi run bump
Update the change log.
Commit the changes:
git commit -am "Bump version to $RELEASE_VERSION"
Update the version and release date fields in CITATION.cff:
python resources/release.py update-citation
Update the PyPI README content:
python resources/release.py update-pypi-readme
Push the changes:
git push origin
Pull request
Create a pull request to check changes with peers.
Merge the pull request once everything is correct.
Release publication
Create a draft release on GitHub and update it.
Using release candidates on Test PyPI, make sure that the built Python wheels work as expected. A typical installation command for a Test PyPI release is
python3 -m pip install --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ 'eradiate[kernel]==0.30.0rc4'
Finalize release notes and create the release tag. Make sure that the release commit is referenced only by one tag.
Build and upload Python wheels.
Post-release: Prepare the next development cycle
In your shell, set the variable RELEASE_VERSION to the target version value:
export RELEASE_VERSION=X.Y.Z-dev0
Bump the version number using:
pixi run bump