In Part 1 of this series on Personal Digital Archiving, I wrote about general strategies for long-term preservation planning. Part 2 covers characteristics of digital file formats that you should consider when deciding how to preserve your digital materials.
Digitization and Born-Digital
This post focuses on “born-digital” files that you create on your computer. For more information on digitization, the process of creating digital versions of analog physical media such as paper documents, photographs, and tape recordings, check out the helpful guides created by UM’s Digital Conversion Unit (DCU), including Best Practices for digitizing Audio and Video. DCU’s Quick and Dirty Guide for Home Digitization includes information about scanner settings for documents and photos. For more complicated digitization tasks, such as film or video, DCU offers a list of available services from Digitization Vendors.
I won’t go into digitization in more detail, but much of the following information about born-digital media also applies to digitized files. Digitized files may represent physical media, but they’re still “born-digital” in the sense that they’re created on a computer.
The best time to think about preservation is before you create your files. Your initial preservation decisions will make it easier to keep your digital files accessible over time. As you create your files, remember to organize them and capture any metadata that will make them easier to sort and browse. See Part 1 for naming conventions and other file management tips.
Quality vs. Size
The main factor you should consider when creating digital files is the trade-off between quality and size. If you decide that your files should be the highest quality available, storage costs will quickly add up. Preservation requires maintenance over time, so if you commit to a large amount of storage, you should be prepared to expand it as your digital collections grow. Also keep in mind that maintaining adequate back-ups (three copies of your hard drive) requires triple the storage capacity. I’ll present some storage options in Part 3.
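To make the storage arithmetic concrete, here is a minimal sketch. The function name, the 500 GB collection, and the 10% yearly growth are made-up illustrations, not recommendations.

```python
def storage_needed_gb(collection_gb, copies=3, annual_growth=0.0, years=0):
    """Estimate total storage for a collection kept in several copies.

    annual_growth is a fraction: 0.10 means the collection grows 10% a year.
    All figures are hypothetical inputs, not measurements.
    """
    projected = collection_gb * (1 + annual_growth) ** years
    return projected * copies


# A 500 GB collection kept in three copies needs 1,500 GB today.
print(storage_needed_gb(500))

# The same collection after five years of 10% annual growth.
print(round(storage_needed_gb(500, annual_growth=0.10, years=5)))
```

Even a modest growth rate adds up once it is multiplied by three copies, so it helps to budget for that when choosing a quality level.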
Compression is another factor to consider when balancing quality against file size. Uncompressed formats save every bit of data available for the file. Compressed formats use algorithms to make the file smaller, either by encoding the same data more efficiently or by removing the data that is least likely to be missed. For example, a compressed image format may discard certain color information if its algorithm treats that detail as beyond what the human eye can notice. It can be difficult to tell the difference between compressed and uncompressed versions of the same image, which is why many people consider compressed formats good enough for everyday use. However, there are two types of compression, one of which carries a higher risk of data loss:
- Lossless compression reduces the file size but keeps all the data in the file, which is fully restored when the file is decompressed.
- Lossy compression reduces the file size by permanently discarding data when the file is encoded. Simply copying the file does not cause further loss, but if you open, edit, and re-save a file in a lossy format, it is compressed all over again. The file size may not change much, but the quality of the file will decrease as more data is lost with each re-save.
Uncompressed or lossless compression is preferable if you’re worried about the quality of the file, but lossy compression can be fine depending on how you intend to use your files. For example, when sharing images online you probably want a smaller file size that’s easier to transmit.
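The difference between the two types can be sketched in a few lines of Python. The lossless half uses the standard `zlib` module. The lossy half is only a toy model (rounding numbers to a coarser step) that I made up for illustration; real formats such as JPEG or MP3 use far more sophisticated methods.

```python
import zlib

# Lossless: zlib shrinks the data, and decompressing restores every byte.
original = b"personal digital archiving " * 200
packed = zlib.compress(original)
restored = zlib.decompress(packed)
print(len(original), len(packed), restored == original)


# Lossy (toy model, not a real codec): round each value to a coarser step.
# The discarded detail cannot be recovered from the saved values.
def quantize(samples, step):
    return [round(s / step) * step for s in samples]


def total_error(reference, saved):
    """Sum of absolute differences from the original values."""
    return sum(abs(a - b) for a, b in zip(reference, saved))


samples = [12, 27, 33, 48, 51, 66, 74, 89]
first_save = quantize(samples, 10)      # first lossy save
second_save = quantize(first_save, 15)  # re-saved with different settings

print(first_save, total_error(samples, first_save))
print(second_save, total_error(samples, second_save))
```

In this toy example the second save drifts further from the original values than the first, which is the same effect, in miniature, as repeatedly re-saving a lossy file.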
Proprietary vs. Open Source
Digital formats have different levels of transparency that archivists consider when deciding which formats are likely to be the best for long-term preservation. File formats have specifications that document how the format is encoded and how it should be interpreted by software.
Proprietary formats are maintained by organizations or commercial entities that protect the specifications by hiding them from public view. Files in proprietary formats may only be accessible in a specific type of commercial software. Specifications for open source formats, on the other hand, are open to public view. These formats are well-documented, tend to be accessible across different software platforms, and are often maintained by a community of users (although some commercial companies make their formats open source or partially open).
Proprietary formats depend on the organization that maintains them, so if the organization chooses to stop supporting the format or goes out of business, the format may become obsolete because users don’t have access to the format specifications. However, proprietary formats may also have a critical mass of users that exert significant commercial pressure on the company responsible for maintaining the format for long-term use. Nevertheless, open source formats are often considered the most likely to endure over time, as software to read them can be recreated from the open specifications should the original rendering software become obsolete.
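One practical benefit of published specifications is that anyone can write software that recognizes a format. As a rough sketch, the code below guesses a file’s format from its first few bytes (its “signature”), which the public documentation for these formats describes. The function names and the short list of formats are my own choices for illustration, not a complete identification tool.

```python
# Leading signature bytes documented publicly for each format.
SIGNATURES = {
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"%PDF-": "PDF document",
    b"\xff\xd8\xff": "JPEG image",
    b"GIF87a": "GIF image",
    b"GIF89a": "GIF image",
    b"PK\x03\x04": "ZIP container (also used by DOCX and XLSX)",
}


def identify(header: bytes) -> str:
    """Guess a format from the first bytes of a file."""
    for signature, name in SIGNATURES.items():
        if header.startswith(signature):
            return name
    return "unknown"


def identify_file(path: str) -> str:
    """Read the start of a file on disk and guess its format."""
    with open(path, "rb") as f:
        return identify(f.read(16))


print(identify(b"%PDF-1.7\n"))
print(identify(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
print(identify(b"just some text"))
```

A check like this looks at the file’s contents rather than its extension, so it still works if a file has been renamed.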
File Formats by Media Type
Technology is constantly evolving. There’s no guarantee that any of the file formats currently in use will still be accessible in 10 years. Archivists make assessments about the preservation risks of file formats based on various factors, such as the ones mentioned above, and try to select formats that seem most likely to remain sustainable over time. I’m reluctant to make prescriptive recommendations about which formats are “best” for preservation, so please consider the following list to be casual suggestions based on current expectations for long-term file format accessibility. For each type of media, I’ll mention a few formats that are generally considered good enough for everyday use as well as formats that are considered preferable for long-term preservation. File formats are listed by the extension codes that appear after the “.” at the end of the file name.
Text or Data Documents
- DOCX (text document) or XLSX (spreadsheet)
Microsoft Office open formats as of 2007. DOC / XLS are the proprietary versions in use from 1997–2003: better to update by saving as DOCX / XLSX.
- PDF (formatted text/image documents), preferably version 1.7
PDF/A is the open “archival” version with preservation constraints.
- TXT (text document) or CSV (spreadsheet)
No frills or formatting, but most accessible formats for preserving content.
Many email services have options to batch export or archive email in the MBOX format, whether individually or as a ZIP package. Look in your email service’s Options, Account Info, FAQ, or Help sections for instructions.
Most webmail services also let you print an individual message and save it as a PDF, for example by right-clicking the message or using your browser’s print dialog.
You should always be able to copy and paste email as a plain text document.
Note: Use folders/labels and filters to sort emails as you receive them. Consider saving only a selection of important emails rather than your entire inbox.
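If you do export an MBOX file, you can check what it contains with the `mailbox` module in Python’s standard library. This is a minimal sketch: the `list_messages` helper, the file name, and the sample message are hypothetical, and the sample archive is built on the fly only so the example runs without a real export.

```python
import mailbox
import os
import tempfile
from email.message import EmailMessage


def list_messages(mbox_path):
    """Return (date, sender, subject) for each message in an MBOX file."""
    box = mailbox.mbox(mbox_path)
    try:
        return [(m["Date"], m["From"], m["Subject"]) for m in box]
    finally:
        box.close()


# Build a tiny sample archive (hypothetical addresses and content).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.mbox")

    msg = EmailMessage()
    msg["From"] = "alice@example.com"
    msg["To"] = "bob@example.com"
    msg["Subject"] = "Family reunion photos"
    msg["Date"] = "Mon, 01 Jun 2015 09:00:00 -0400"
    msg.set_content("Attaching the scans we talked about.")

    box = mailbox.mbox(path)
    box.add(msg)
    box.flush()
    box.close()

    summary = list_messages(path)
    print(summary)
```

With a real export, you would point `list_messages` at the file your email service gave you instead of the sample.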
compressed (lossy), open source, commonly used by cameras and for web display
compressed (lossless), common for web display, limited colors
uncompressed or compressed (lossless [LZW] or lossy)
uncompressed image data: variety of “RAW” formats used by cameras, good for master but may need to be converted to another format in order to render
Note: The sheer volume of digital photos that results from the convenience of camera-phones can be a major problem for preservation planning. Organization is important. Depending on how you use your photos, you may want to weed out or delete duplicate or poor-quality images, as well as photos that don’t capture the intended subject of the image. Metadata is also important. Be consistent and descriptive when naming or grouping your photos (as always, it’s better to do this when you create them and have the most information about the context of their creation).
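Weeding out exact duplicates is one chore a short script can help with. The sketch below groups files whose contents are byte-for-byte identical by comparing SHA-256 checksums. The function names and sample file names are hypothetical. It only finds exact copies, so a resized or re-saved version of a photo will not match.

```python
import hashlib
import os
import tempfile
from collections import defaultdict


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 checksum of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def find_duplicates(folder):
    """Group files under folder whose contents are identical."""
    by_digest = defaultdict(list)
    for root, _dirs, files in os.walk(folder):
        for name in files:
            path = os.path.join(root, name)
            by_digest[file_digest(path)].append(path)
    return [sorted(paths) for paths in by_digest.values() if len(paths) > 1]


# Demo with three small stand-in files (not real photos).
with tempfile.TemporaryDirectory() as tmp:
    demo_files = [
        ("IMG_0001.jpg", b"beach"),
        ("IMG_0001 copy.jpg", b"beach"),
        ("IMG_0002.jpg", b"sunset"),
    ]
    for name, data in demo_files:
        with open(os.path.join(tmp, name), "wb") as f:
            f.write(data)

    groups = find_duplicates(tmp)
    duplicate_names = [[os.path.basename(p) for p in g] for g in groups]
    print(duplicate_names)
```

The script only reports duplicates. Deciding which copy to keep is still up to you.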
There are many photo management applications and services to help you organize and share your photos. Some things to consider when researching these applications:
- Does the application store your photos on your computer or in a web-based service? Even if you decide to use a photo management service, you should also organize your photos on your own computer and back them up yourself.
- Is photo metadata kept in an external database, or embedded in the photo itself? Some photo sites may strip your metadata from your photo files. For example, if you create titles and tags for your photos within the application, does that information stay with the photo if you copy it to your computer?
- Most cameras embed date and time information in digital photos, and many phones and cameras also embed GPS location data. You may want to consider your privacy and whether you really want to share this information if you upload your photos to social media or a photo-sharing site.
compressed (lossy), common for streaming and mobile devices
[Hear the audio data that gets thrown out in the process of compressing an MP3: Ghost in the MP3.]
compressed (lossy), better quality than MP3
compressed (lossless); not supported in iTunes: Apple has a proprietary version known as ALAC
Note: Important terms related to the quality of audio files:
- Sampling rate: Number of times per second that the sound wave is captured
- 44.1 kHz = CD quality, good enough for human hearing
- Bit depth: Bits per sample; higher depth = wider range of sounds captured per sample
- 16 bit = CD quality; 24 bit = DVD/Blu-ray audio quality
- Bit rate: bits stored per second
- 320 kbps (kilobits per second) = high quality bit rate
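These three terms are related by simple arithmetic: for uncompressed audio, bit rate = sampling rate × bit depth × number of channels. Here is a small sketch of that calculation. The function names are mine, and it assumes two-channel (stereo) audio.

```python
def pcm_bit_rate(sample_rate_hz, bit_depth, channels=2):
    """Bits per second for uncompressed audio (stereo assumed by default)."""
    return sample_rate_hz * bit_depth * channels


def size_mb(bit_rate_bps, seconds):
    """File size in megabytes for a given bit rate and duration."""
    return bit_rate_bps * seconds / 8 / 1_000_000


cd_rate = pcm_bit_rate(44_100, 16)  # CD quality: 44.1 kHz, 16 bit, stereo
print(cd_rate / 1000, "kbps")
print(size_mb(cd_rate, 60), "MB per minute, uncompressed CD quality")
print(size_mb(320_000, 60), "MB per minute at 320 kbps")
```

Comparing the last two lines shows why compressed audio is so common: even a high-quality 320 kbps file is a fraction of the size of the uncompressed original.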
compressed, common for streaming and mobile devices
Note: High quality uncompressed video files result in very large file sizes. Uncompressed standard definition can be around 40+ GB per hour.
- Frame rate: Frames/images displayed per second (fps)
- 24 fps = film; 60 fps = HDTV
- Resolution: Vertical and horizontal pixel dimensions
- 640x480 = standard definition (SD); 1280x720 = basic high def (HD)
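Frame rate and resolution together explain why uncompressed video gets so large. The sketch below multiplies them out. The `bits_per_pixel` values are my assumptions (12 for video stored with reduced color detail, 24 for full 8-bit-per-channel color), and audio is not included, so treat the results as ballpark figures only.

```python
def uncompressed_video_gb_per_hour(width, height, fps, bits_per_pixel):
    """Rough size of one hour of uncompressed video, in gigabytes.

    bits_per_pixel is an assumption: it varies with how color is stored.
    The audio track is ignored.
    """
    bytes_per_second = width * height * bits_per_pixel / 8 * fps
    return bytes_per_second * 3600 / 1_000_000_000


# Standard definition, 24 fps, reduced color detail (12 bits per pixel).
print(round(uncompressed_video_gb_per_hour(640, 480, 24, 12), 1))

# Standard definition, 30 fps, full color (24 bits per pixel).
print(round(uncompressed_video_gb_per_hour(640, 480, 30, 24), 1))

# Basic high definition, 60 fps, full color.
print(round(uncompressed_video_gb_per_hour(1280, 720, 60, 24), 1))
```

Depending on the assumptions, standard definition lands anywhere from roughly 40 GB to 100 GB per hour, and high definition is several times larger again.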
Check back later for Part 3: Storage Options.