How to Archive a Digital Collection
This recipe guides you through preparing a collection of digital files for deposit in an academic institution’s digital archive, including descriptive XML metadata and a readme document.
- A collection of files suitable for archiving
- A text editor such as Sublime Text (Mac, Windows and Linux), Notepad++ (Windows) or TextWrangler/BBEdit (Mac)
- An XML metadata schema suitable for describing document collections, such as the CIRCA archive-oriented XML
- An XML validator such as oXygen, or at minimum a well-formedness checker such as the Chrome web browser
- A digital academic archive such as the University of Alberta’s Education and Research Archive (ERA)
- Your archive’s deposit requirements
These steps are broad by necessity. Please see the Discussion section below for more information on each.
- Identify and gather the files you wish to archive.
- Decide what copyright and license type best applies to your collection for archival purposes.
- Catalogue the files, recording their folder structure and file types.
- If your collection is large, identify sub-sets that can be used to create smaller archival bundles.
- Develop a readme file for each bundle.
- Develop an XML file for each bundle.
- Create a document folder for each bundle, containing its readme file, XML file and all associated data files.
- Compress or encode the document folder according to your archive’s deposit format preferences (ex: .zip).
- Following the instructions provided by your archive, deposit each bundle.
Discussion
Gathering Your Files
Depending on the nature of your file collection, the files may already be in one place or scattered across several. The first step is therefore to bring all files together in one location, arranged into folders and subfolders as appropriate to the collection.
In the case of the JBS Archive, Dr. Smith gathered all available materials on a website and on a USB drive.
Licensing Your Files
Early on in your archive collection’s development, consider what type of distribution license you wish to apply to it. This will be influenced by any copyright that may apply. For example, many items deposited in the University of Alberta’s ERA use one of the numerous versions of the Creative Commons license.
Depending on any copyright issues that may apply, this is also the time to consider who should have access to the final archive. ERA permits depositors to choose a level of access ranging from the depositor alone, to the University of Alberta research community only, to anyone; check with your institution’s archive service for its available options.
Cataloguing Your Files
In preparation for developing the readme and final bundle, create a plain-text file describing the file structure and contents of your collection, and listing all the file types found within.
For each file type, note what it is and determine whether it can be presented as-is (ex: .docx, .pdf, .html). If a file type is a concern (for example, an uncommon or proprietary format), determine whether those files need additional information provided for them, or whether they should be converted to another format.
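The catalogue described above can be drafted by hand, but for a larger collection a short script may save time. The following is a minimal sketch in Python, assuming the collection already sits under a single root folder; the function names and the layout of the output file are illustrative choices, not a required format.

```python
from collections import Counter
from pathlib import Path


def catalogue(root):
    """Return (file listing, file-type counts) for every file under root."""
    root = Path(root)
    lines = []
    extensions = Counter()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Record each file relative to the collection root.
            lines.append(path.relative_to(root).as_posix())
            # Files without an extension are grouped under a placeholder label.
            extensions[path.suffix.lower() or "(no extension)"] += 1
    return lines, extensions


def write_catalogue(root, out_file):
    """Write the listing and a file-type summary to a plain-text file."""
    lines, extensions = catalogue(root)
    with open(out_file, "w", encoding="utf-8") as fh:
        fh.write("FILES\n")
        fh.writelines(line + "\n" for line in lines)
        fh.write("\nFILE TYPES\n")
        for ext, count in sorted(extensions.items()):
            fh.write(f"{ext}: {count}\n")
```

Calling write_catalogue with the collection folder and an output path produces a text file you can then annotate by hand, for instance with notes on which file types need conversion.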
Once you have catalogued the files in your collection, identify whether the collection can be split into sub-sets. This is useful for two reasons: 1) it helps future researchers home in on the parts they are interested in, and 2) it reduces the size of the final bundles, making them easier to describe, upload and download. Bundles may overlap where appropriate.
For example, the JBS Archive was split into nine separate bundles corresponding to five periods of Dr. Smith’s career, plus four bibliographic categories.
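When deciding how to split a collection, it can help to see how much data each candidate sub-set holds. The sketch below assumes, purely for illustration, that candidate sub-sets correspond to the top-level folders of the collection; your own divisions (career periods, bibliographic categories and so on) may not map onto folders this neatly.

```python
from collections import defaultdict
from pathlib import Path


def size_by_top_level_folder(root):
    """Return {top-level folder name: total bytes} for files under root."""
    root = Path(root)
    totals = defaultdict(int)
    for path in root.rglob("*"):
        if path.is_file():
            relative = path.relative_to(root)
            # Files sitting directly in root are grouped under "(root)".
            group = relative.parts[0] if len(relative.parts) > 1 else "(root)"
            totals[group] += path.stat().st_size
    return dict(totals)
```

Comparing the totals against your archive’s upload limits, if it has any, gives a rough indication of whether a sub-set needs to be divided further.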
Develop a Readme File
Each bundle will require its own readme file, customized for its constituent files. It exists to provide information to the researchers who may be working with the bundle in the future, and should therefore provide context and other information that cannot be gleaned from the bundle on its own. Individual readme files may include information such as:
- The name of the archive and the bundle
- Any other related bundles, to situate the bundle within the larger collection
- All file formats included
- Background or other descriptive information related to the bundle
- A list of descriptive keywords for the bundle
- A bibliography for all applicable files in the bundle
- A list of all files included in the bundle, following the convention you established in developing the file catalogue
- Licensing information and any applicable copyright
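If you are preparing several bundles, generating each readme from the same template helps keep them consistent. The following is a minimal sketch in Python; the section titles and dictionary keys are hypothetical and simply mirror the items listed above, so adapt them to whatever your archive expects.

```python
def build_readme(info):
    """Return readme text built from a dict of bundle details.

    The keys used here are illustrative, not a standard.
    """
    def bullet_list(items):
        return "\n".join(f"- {item}" for item in items)

    sections = [
        ("ARCHIVE", info["archive"]),
        ("BUNDLE", info["bundle"]),
        ("RELATED BUNDLES", bullet_list(info["related_bundles"])),
        ("FILE FORMATS", bullet_list(info["formats"])),
        ("BACKGROUND", info["background"]),
        ("KEYWORDS", ", ".join(info["keywords"])),
        ("BIBLIOGRAPHY", bullet_list(info["bibliography"])),
        ("FILES", bullet_list(info["files"])),
        ("LICENSE AND COPYRIGHT", info["license"]),
    ]
    # Each section is a title line followed by its body, separated by blank lines.
    return "\n\n".join(f"{title}\n{body}" for title, body in sections) + "\n"
```

The returned string can be written to a plain-text file in the bundle folder and then edited by hand where a bundle needs more context than the template provides.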
Develop an XML File
Much like the readme file, the XML file provides information about the bundle. Unlike the readme, it is designed to conform to an established metadata convention, and its information can therefore be extracted with an automated process.
Tags should encompass information such as:
- The title of the bundle
- The original creator
- The digital record / record creator
- The subject of the bundle (keywords)
- A description of the contents of the bundle
- The physical description (in this case, a brief description of the arrangement and type of files)
- Where the originals can be found / who they are held by
- Publication information (bibliographic information)
- Date(s) of origination for the document(s) in the bundle
- Document type(s) (ex: articles, book chapters, images, manuals, etc.)
- Electronic formats (ex: file types like .pdf, .tif, .docx)
- Provenance (where the collection is derived from, such as an individual or organization)
- Related materials (ex: your archival collection as a whole)
- Coverage (ex: what the bundle encompasses, such as all works by John B. Smith for the period of 1970-1984)
- Identifiers (ISBNs, SKUs, etc.)
- Rights (access rights, such as public, private, restricted, etc.)
- Access (ex: public use, restricted use; see your chosen license and your local archive for the available options)
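As an illustration of how such a tagset might be populated programmatically, the sketch below uses Python’s standard xml.etree.ElementTree module. The root element name "bundle" and the tag names passed in are placeholders, not the element names of the CIRCA schema or any other real schema; substitute the names your chosen schema defines.

```python
import xml.etree.ElementTree as ET


def build_metadata(fields):
    """Build a <bundle> element from a dict of tag name -> value or list of values.

    Tag names are hypothetical; use those defined by your metadata schema.
    """
    root = ET.Element("bundle")
    for tag, value in fields.items():
        # A list produces one repeated element per value (ex: several keywords).
        values = value if isinstance(value, list) else [value]
        for item in values:
            ET.SubElement(root, tag).text = str(item)
    return root


def write_metadata(fields, out_file):
    """Write the metadata to an indented UTF-8 XML file (Python 3.9+)."""
    tree = ET.ElementTree(build_metadata(fields))
    ET.indent(tree)
    tree.write(out_file, encoding="utf-8", xml_declaration=True)
```

Building the file this way means special characters such as "&" are escaped automatically, which removes one common source of errors in hand-written XML.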
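Once you have decided on and populated your tagset for each bundle, the final step in this stage is to validate your XML documents. For example, open each XML file in the Chrome browser and resolve any errors it reports. Note that a browser checks only that the XML is well-formed; checking the file against your schema requires a schema-aware validator such as oXygen.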
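If you have many XML files, a basic well-formedness check can also be scripted. The sketch below uses Python’s standard xml.etree.ElementTree parser and reports the location of the first error it finds. It does not validate against a schema, so treat it as a first pass rather than a replacement for a full validator; the function name is illustrative.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_path):
    """Return None if the file parses, otherwise a message locating the error."""
    try:
        ET.parse(xml_path)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        line, column = err.position
        return f"{xml_path}: line {line}, column {column}: {err}"
    return None
```

Running this over every bundle’s XML file and fixing anything it reports should leave only schema-level issues for the validator to catch.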
Organize Your Bundles Into Folders
With the XML and readme files finalized, you can now gather all files associated with each bundle into its own directory (folder). If required, adjust the file listing in your readme files to reflect the final directory structure.
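One way to confirm that a readme’s file listing still matches the final folder is to compare the two automatically. The sketch below assumes the listing is available as a list of paths relative to the bundle folder, written with forward slashes; how you extract that list from your readme depends on the layout you chose, and the function name is hypothetical.

```python
from pathlib import Path


def compare_listing(bundle_dir, listed_files):
    """Return (listed but missing on disk, on disk but not listed)."""
    bundle_dir = Path(bundle_dir)
    actual = {
        path.relative_to(bundle_dir).as_posix()
        for path in bundle_dir.rglob("*")
        if path.is_file()
    }
    listed = set(listed_files)
    return sorted(listed - actual), sorted(actual - listed)
```

Two empty lists mean the readme and the folder agree. Remember that the readme and XML files themselves sit in the folder, so either include them in the listing or expect them to appear as unlisted.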
Compress or Encode Your Bundles
Use your computer’s built-in compression program to ‘zip’ each bundle’s folder, or use another format if your archive requires one.
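If you have several bundles to compress, the step can be scripted instead. The following is a minimal sketch using shutil.make_archive from Python’s standard library; it assumes a .zip is acceptable to your archive and that the output folder sits outside the bundle folder, so the archive is not written into the folder it is compressing. The function name is illustrative.

```python
import shutil
from pathlib import Path


def zip_bundle(bundle_dir, out_dir):
    """Zip one bundle folder into out_dir and return the path of the archive."""
    bundle_dir = Path(bundle_dir)
    base_name = Path(out_dir) / bundle_dir.name
    # root_dir and base_dir keep the bundle folder itself as the top-level
    # entry, so unzipping recreates a single folder rather than loose files.
    archive = shutil.make_archive(
        str(base_name),
        "zip",
        root_dir=bundle_dir.parent,
        base_dir=bundle_dir.name,
    )
    return Path(archive)
```

The resulting file is named after the bundle folder with a .zip extension, which keeps the archive names aligned with the bundle names used in your readme files.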
Deposit Your Bundles
Follow the instructions provided by your institution’s archive service to deposit each bundle.