Artifacting is the process of packaging a project prior to release, and is essential as it mitigates many risks in both producing and consuming software products. Beyond simple archives, there are many types of packaging -- often language or framework dependent -- which have been developed to suit various use cases.
This primer demonstrates how to package files as zip and tar.gz; leverage an .artifactignore file similar to .gitignore; and generate and use a checksum file.
Tip
- See 12factor and O'Reilly:Beyond the Twelve-Factor App for more on maturing your SDLC.
- Have a consistent naming scheme, such as: <project name>_<version>.<extension>.
- Set a PROJECT_NAME variable, and use ${PWD##*/} if the build job executes from the top-level directory of the project. Or use TOPDIR=$(git rev-parse --show-toplevel); PROJECT_NAME=${TOPDIR##*/} when executing in a git repo to automatically set the name of the artifact to the name of the repo.
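The naming tip can be sketched as a small helper. This is a minimal illustration, assuming the project name is the last component of the project directory path; the function name artifact_name and the version 1.4.0 are made up for the example.

```shell
#!/usr/bin/env bash
# Build an artifact file name of the form <project name>_<version>.<extension>.
# Usage: artifact_name <project-dir> <version> <extension>
artifact_name() {
  local dir="${1%/}"          # drop a trailing slash, if any
  local project="${dir##*/}"  # keep only the last path component
  printf '%s_%s.%s\n' "$project" "$2" "$3"
}

# From the top-level directory of the project:
#   artifact_name "$PWD" 1.4.0 zip
# Inside a git repository:
#   artifact_name "$(git rev-parse --show-toplevel)" 1.4.0 tar.gz
```

Passing the directory as an argument keeps the helper independent of where the build job happens to run.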
Just-in-time builds from commit hashes have been a known-bad practice for decades, but they are not the only risk that artifacting can overcome:
- The laws of physics and their bit-flips (single-event upset).
  It cannot be assumed that the same asset can be built twice.
- Volatility in the software supply chain. Dependency hell is real.
  It cannot be assumed that the same asset can be built twice.
- Configuration Deltas. Idempotence and build once postures are essential.
- Cost. Multiple build-test cycles dramatically increase tech-spend.
- Implementing CICD early ensures early identification of related issues. Don't wait until the week you hope to deploy before you implement tooling, as there may be substantial software refactoring required.
- Storing credentials in the repository (hardcoding), which is avoided by ensuring software ingests environment variables at process instantiation (idempotence).
- Operational complexity (cost++) relating to '-rc', or '-beta' style pre-release identifiers. (There are scenarios where these are required.)
- The potential for software to be modified post-release, which is mitigated by including/publishing the file hash for verification.
- Defense against partial or malformed downloads.
- Patch-builds are their own mystical beast to handle.
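The point about ingesting environment variables at process instantiation could look something like the following in a shell entrypoint. This is a sketch only; the variable names APP_DB_URL and APP_LOG_LEVEL and the function name load_config are hypothetical.

```shell
#!/usr/bin/env bash
# Read configuration from the environment instead of hardcoding it.
# APP_DB_URL and APP_LOG_LEVEL are placeholder names for this example.
load_config() {
  # ${VAR:?message} stops the shell with an error when VAR is unset or empty
  : "${APP_DB_URL:?APP_DB_URL must be set in the environment}"
  # Optional setting with a default value
  APP_LOG_LEVEL="${APP_LOG_LEVEL:-info}"
}
```

The same artifact can then be promoted between environments unchanged, with only the environment differing.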
Archive-type artifacts generally come in the form of a zip or tar.gz file. The following examples demonstrate how to produce both on Linux, using modern GNU utilities.
Note
- MacOS utilities not guaranteed to function.
  Run brew install zip unzip gnu-tar
- Use either semver, or calver when versioning.
  Proprietary versioning standards are guaranteed risk.
  Aligning to industry standards is important as many SDLC tools expect semver. If diverging, verify that the toolchain will support your custom approach before investing heavily.
- If both zip and tar.gz artifacts will be produced for a project, special attention must be given to the pattern matching used by each utility, as they are not the same. For example, include 'foo/', 'foo/**' for zip, and '**/foo' and '**/foo/*' for tar to fully exclude the 'foo' directory from the archive. WARN: Zip does not support '**/..' patterns.
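Because the two utilities need different patterns for the same directory, one option is to generate both pattern lists from a single list of directory names. The sketch below follows the pattern shapes given in the note above; the function names and the two output file names are hypothetical and are not part of either tool.

```shell
#!/usr/bin/env bash
# Print zip-style exclusion patterns for each directory name given.
zip_patterns() {
  local dir
  for dir in "$@"; do
    printf '%s/\n%s/**\n' "$dir" "$dir"
  done
}

# Print tar-style exclusion patterns for each directory name given.
tar_patterns() {
  local dir
  for dir in "$@"; do
    printf '**/%s\n**/%s/*\n' "$dir" "$dir"
  done
}

# Example: write one ignore file per archive type (file names are assumptions).
#   zip_patterns node_modules logs > .artifactignore.zip
#   tar_patterns node_modules logs > .artifactignore.tar
```

Keeping a single source list reduces the chance that a directory is excluded from one archive type but shipped in the other.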
The following script packages the current directory contents as a zip archive, ignoring anything specified in the '.artifactignore' file.
Note
Exclusion patterns ARE NOT like those used by .gitignore.
For example, the pattern '**/foo' does not function as it does with tar or git. See here for more.
#!/usr/bin/env bash
product=$1
version=$2
tmp_dir='/tmp'
exclude_file='.artifactignore'
artifact_name="${product}-${version}.zip"
# -r recurses into subdirectories; -x@<file> reads exclusion patterns from the file
zip -r "${tmp_dir}/${artifact_name}" . -x@"${exclude_file}"

./create-artifact-zip.sh <project-name> <version>

The following script packages the current directory contents as a tar.gz archive, ignoring anything specified in the '.artifactignore' file.
Note
Exclusion patterns are like those used by .gitignore, which can be referenced here.
#!/usr/bin/env bash
product=$1
version=$2
tmp_dir='/tmp'
exclude_file='.artifactignore'
artifact_name="${product}-${version}.tgz"
tar -X "${exclude_file}" -zcvf "${tmp_dir}/${artifact_name}" .

./create-artifact-tar.gz.sh <project-name> <version>

The contents of the .artifactignore file will generally be language and project specific. Have a look at the github/gitignore repo to get started.
As an example, the .artifactignore file may contain the following:
.artifactignore
.gitignore
**/README.md
**/logs
**/*.log
**/trace.*
**/*.env
.git/
.git/**
.github/
.github/**
node_modules

A checksum is a computational hash of a file, meaning that, in theory, if the file changes then the checksum will no longer match.
Assuring consumers that the artifact they receive is complete and unmodified is a critical component of the artifacting strategy. Originally, providing an md5sum with a file was important as internet technologies could not always guarantee complete delivery of a file, but it is even more important these days as bad actors may attempt to modify a file in a malicious way.
Tip
md5 is no longer secure. Use sha256.
Checksum files can be generated discretely for each file:
echo dog > pets.txt
sha256sum pets.txt > pets.txt.sha256

Or for multiple files:
echo dog > pets.txt
echo wrench > tools.txt
sha256sum pets.txt >> sha256sums.txt
sha256sum tools.txt >> sha256sums.txt

Files can be validated individually:
sha256sum -c pets.txt.sha256
pets.txt: OK
echo cat > pets.txt
sha256sum -c pets.txt.sha256
pets.txt: FAILED
sha256sum: WARNING: 1 computed checksum did NOT match
Multiple files can be validated at once:
sha256sum -c sha256sums.txt
pets.txt: OK
tools.txt: OK
echo cat > pets.txt
sha256sum -c sha256sums.txt
pets.txt: FAILED
tools.txt: OK
sha256sum: WARNING: 1 computed checksum did NOT match

You may have noticed that there is operationally no difference between generating and validating a single or multiple files. While some prefer a single file with all the checksums, some rely on being able to retrieve the artifact and checksum discretely. Why not do both?
Generate a '.sha256', or 'sha256sums.txt' file (or both) with the artifact(s) and make it available. Clients can download and validate as needed.
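Producing both forms in one pass might look like the sketch below. It assumes GNU sha256sum and that all artifacts sit in a single directory; the function name write_checksums is illustrative.

```shell
#!/usr/bin/env bash
# Write <artifact>.sha256 next to each artifact, plus a combined sha256sums.txt.
# Usage: write_checksums <directory> <artifact>...
write_checksums() {
  local dir="$1" artifact
  shift
  (
    cd "$dir" || exit 1
    : > sha256sums.txt                      # start with an empty combined file
    for artifact in "$@"; do
      sha256sum "$artifact" > "${artifact}.sha256"
      cat "${artifact}.sha256" >> sha256sums.txt
    done
  )
}

# Example:
#   write_checksums /tmp widget-1.4.0.zip widget-1.4.0.tgz
```

Running sha256sum from inside the artifact directory keeps the recorded file names relative, so a consumer can run sha256sum -c from wherever the files were downloaded.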
Relying on GitHub's Release feature ensures that the checksum files cannot be modified after release without there being a record. Additionally, the checksums can be included in the release notes.
Where artifacts are made available via another method, such as a website, distributing a record of the artifacts and their checksums via an out-of-band channel is a good way to provide another layer of assurance to consumers.
Tip
A simple asset healthcheck system can be established by emitting URLs and checksums to a database so that they can be validated intermittently.
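A minimal sketch of such a healthcheck follows. It substitutes a flat manifest file for the database, with one "<sha256> <url>" pair per line; that format, the function names, and the use of curl are all assumptions made for the example.

```shell
#!/usr/bin/env bash
# Fetch an asset to stdout. curl is assumed; swap in another client as needed.
fetch() {
  curl -fsSL "$1"
}

# Re-download every asset listed in the manifest and compare digests.
# Prints one status line per asset; returns non-zero if any asset fails.
check_assets() {
  local manifest="$1" expected url actual failures=0
  while read -r expected url; do
    [ -n "$expected" ] || continue          # skip blank lines
    actual="$(fetch "$url" | sha256sum | cut -d' ' -f1)"
    if [ "$actual" = "$expected" ]; then
      echo "OK      $url"
    else
      echo "FAILED  $url"
      failures=$((failures + 1))
    fi
  done < "$manifest"
  [ "$failures" -eq 0 ]
}

# Example, run from a scheduler of your choice:
#   check_assets /path/to/manifest.txt || echo "asset drift detected"
```

A failed download hashes as empty input, so unreachable assets are reported as FAILED rather than silently skipped.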