Introduction to IR Preservation

The following is approximately what I said at a recent NASIG presentation, Preserving Content from Your Institutional Repository. My co-presenter, Carol Ann Borchert, covered more of the central points of IR preservation, including conducting an environmental scan, LOCKSS, Portico, and preservation plans, so this text omits those major issues. This is very much an introduction to the topic for people who know little or nothing about digital preservation.

Digital preservation is a HUGE topic. Specialized positions are assigned this responsibility and whole conferences are devoted to it. This presentation will be an introduction to digital preservation meant for IR managers and serialists. If you were expecting something more in-depth, you may wish to choose a different session; it is OK to leave, we won’t be offended. If this session leaves you wanting to find out more about digital preservation in general, the Library of Congress blog “The Signal” provides a great non-scary way to keep up with some of the things happening in the field.

An IR is…

A library and an IR generally provide the archiving role in the scholarly ecosystem. The 2006 SPEC Kit on Institutional Repositories (#292) provides a good definition of an IR (p.13):

a permanent, institution-wide repository of diverse, locally produced digital works (e.g., article preprints and postprints, data sets, electronic theses and dissertations, learning objects, and technical reports) that is available for public use and supports metadata harvesting.

Institutional repositories are designed as long-term homes for intellectual output from your college or university. We assume this means the content will be preserved because it is in the IR. However, there is far more to preserving the content than merely adding it to the IR. This is particularly true when preservation is not the primary purpose or priority of a repository.

An IR is not…

Today’s IRs generally are not preservation repositories. For example, when scanning a book for inclusion in a preservation repository, you typically would have a series of image files. Most IRs would post a PDF access copy instead. Most IRs have not tried to meet all the requirements of a fully trustworthy repository, focusing instead on access to content. While this role for IRs is shifting, our presentation will not consider an IR as a fully trusted repository. We will be looking at things an IR manager should be aware of and discuss with colleagues.

10 basic characteristics of preservation repositories

CRL lists the ten basic characteristics of digital preservation repositories, which is where we should all be heading. I won’t read them, and fully meeting them is beyond the scope of this presentation.

The sources at the bottom provide links to several documents with much more information which are very useful when assessing your repository and services. Alex Ball’s “Preservation and Curation in Institutional Repositories” is a great document to start with. You can then use tools like DRAMBORA and the OpenDOAR policies tool and the nestor checklist to create and improve your polices and assess your own repository. A repository can then try to comply with The Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC), which is used by CRL in its auditing and certification of digital repositories. HathiTrust is an example of a certified Trustworthy Repository.

These sources emphasize documentation, transparency, adequacy and measurability; the documentation and transparency sections I think are of particular interest for IR managers. For example, is your IR mission statement or goals easily found? Do you have stated criteria for selection/inclusion? Do you regularly review your policies and procedures? Is your funding secure?

The year is 2100— can you read your files?

If you had a stack of punch cards, could you do anything with them? We all know that storage media have changed dramatically, and we have seen the rise and fall of floppy disks, Zip disks, Jaz drives, and CD-ROMs. Just because files are stored on a server or in the cloud does not keep them safe from becoming outmoded; after all, the files still sit on physical hardware somewhere, not in a magical cloud. The files themselves may be in a format that can no longer be read. Can you access files saved from your old Tandy? How about data or a program used in a lab in the 1990s? Existing software may not be backward compatible with the old format. A Microsoft product file created in the 1990s may be unreadable today. Files may require a computer program which is no longer available. When a company goes out of business and no one takes responsibility for the software, everything running on it may become inaccessible in the future. While many programs can be made accessible on an emulator, you still need to make sure you have access to an emulator to ensure the files can be used. Even if none of these problems exist, the document itself may have had access controls set on it (e.g. password restrictions) which make it inaccessible. This is of course inadvisable for an open, publicly accessible archive, but such restrictions can be added accidentally.

Localized disasters

Having a disaster plan is important for an IR and is one element on the trusted repository checklists. It is always prudent to have a backup somewhere else, preferably far enough away that it wouldn't be subject to the same disaster. These disasters can take many forms, including fire, flood, tornado, hurricane, earthquake, and tsunami. While war is probably unlikely for all of us in the room, it is a serious threat in many places. In the case of Timbuktu, people originally believed the digitized objects might be all that was left of some of these important historical and cultural documents; fortunately it now seems most were smuggled to safety and only a few hundred manuscripts were destroyed. Occasionally a disaster allows some time to prepare, as with our flood five years ago, when we moved servers out of the building to higher ground as part of our evacuation. Sometimes a disaster gives no warning, as with the University of South Florida flood caused by a rainstorm and a broken pipe, which hit their journals, especially their local history titles. Computer viruses are yet another form of local disaster.

Backups vs. preservation

As with your home files, your repository should be getting backed up by your system administrators. This usually means making regular copies of new and changed files in case of a server crash. However, this backup plan should not be confused with preservation. Having the data is different from having the data in a form that is accessible and hasn't been altered. The JISC briefing paper, Digital Preservation: Continued Access to Authentic Digital Assets, which is a great introduction to digital preservation, gives a clear explanation of the difference between backups and preservation:

Disaster recovery strategies and backup systems are not sufficient to ensure survival and access to authentic digital resources over time. A backup is a short-term data recovery solution following loss or corruption and is fundamentally different to an electronic preservation archive.

Backups are one piece of proper preservation, so make sure these standard concerns are properly addressed locally.
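To make the difference concrete, one preservation task that goes beyond a backup is fixity checking: record a checksum for each file when it is deposited, then periodically re-check that the stored copies still match. Below is a minimal sketch in Python; the directory and manifest file names are hypothetical placeholders, not part of any particular repository platform.

```python
import hashlib
import json
from pathlib import Path

REPO_DIR = Path("repository_files")   # hypothetical local copy of IR content
MANIFEST = Path("checksums.json")     # hypothetical manifest of known-good hashes

def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def build_manifest():
    """Record a checksum for every file (run once at deposit/backup time)."""
    manifest = {str(p): sha256_of(p) for p in REPO_DIR.rglob("*") if p.is_file()}
    MANIFEST.write_text(json.dumps(manifest, indent=2))

def verify_manifest():
    """Re-hash every file and report anything missing or altered."""
    manifest = json.loads(MANIFEST.read_text())
    for name, expected in manifest.items():
        p = Path(name)
        if not p.exists():
            print(f"MISSING: {name}")
        elif sha256_of(p) != expected:
            print(f"ALTERED: {name}")

if __name__ == "__main__":
    if not MANIFEST.exists():
        build_manifest()
    verify_manifest()
```

A backup only promises you a copy; a fixity check like this tells you whether that copy is still the same object you deposited.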

Exit Strategy

You should always be able to migrate all your content and metadata out of your current repository to something else, with no loss of data. Ensuring your content will remain accessible when you change repository software is crucial. You must have all your metadata, including administrative metadata, and the items must be clearly identifiable so they can be matched to that metadata. Assuming your repository is built on standards, this shouldn't be a problem; your IR is probably OAI compliant, so you can at least get data that way, but this may not include all fields, such as administrative notes that do not make sense to map to Dublin Core and share with others. Hopefully you can extract all your metadata as XML, a CSV file, or another open format so that you can easily use it in another system while retaining the structure. The exit strategy is also important in case your repository ever needs to move elsewhere due to organizational shifts (if the institution closes or merges, if funding goes away, and so on).
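As a starting point, most repositories can be harvested over OAI-PMH. The sketch below is a minimal Python harvester that pages through ListRecords with resumption tokens and writes the raw Dublin Core records to a file. The base URL is a hypothetical placeholder, and an oai_dc harvest will not include administrative fields that are never exposed through OAI.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

BASE_URL = "https://ir.example.edu/oai"   # hypothetical OAI-PMH endpoint
OAI = "{http://www.openarchives.org/OAI/2.0/}"

def harvest(metadata_prefix="oai_dc"):
    """Yield every <record> element from a full ListRecords harvest."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    while True:
        url = BASE_URL + "?" + urllib.parse.urlencode(params)
        with urllib.request.urlopen(url) as response:
            tree = ET.parse(response)
        for record in tree.iter(OAI + "record"):
            yield record
        token = tree.find(".//" + OAI + "resumptionToken")
        if token is None or not (token.text or "").strip():
            break
        # Later requests carry only the verb and the resumption token.
        params = {"verb": "ListRecords", "resumptionToken": token.text.strip()}

if __name__ == "__main__":
    with open("oai_dump.xml", "w", encoding="utf-8") as out:
        out.write("<records>\n")
        for rec in harvest():
            out.write(ET.tostring(rec, encoding="unicode") + "\n")
        out.write("</records>\n")
```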

Test, test and test some more

You should test the exported metadata and files to make sure you are getting what you expect, with no problems. For example, make sure any metadata corrections or additions are retrieved. Make sure the files and metadata are structured in such a way that you can easily use them. Check that you aren't losing Unicode characters and that long fields are not truncated. These issues shouldn't be a problem in 2013, but it would be wise to confirm. You also want to ascertain that you are getting all your files. If you have content on a streaming server or you link out to data sets, be sure you have considered how these items are being preserved as well. You know the odd items in your IR, so check them.
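Some of these checks can be scripted. The sketch below assumes the metadata was exported to a CSV file; the file name and the list of required columns are hypothetical. It flags exports that fail to decode as UTF-8, required fields that come back empty, and values whose length lands exactly on a common field-size limit, which is often a sign of silent truncation.

```python
import csv

EXPORT_FILE = "metadata_export.csv"                     # hypothetical metadata export
REQUIRED = ["title", "creator", "date", "identifier"]   # hypothetical required columns
SUSPECT_LENGTHS = {255, 256, 512, 1000, 1024, 2000, 4000}  # common field-size limits

# First pass: make sure the raw bytes are valid UTF-8 at all.
with open(EXPORT_FILE, "rb") as raw:
    try:
        raw.read().decode("utf-8")
    except UnicodeDecodeError as err:
        print(f"Encoding problem in export: {err}")

# Second pass: row-level checks on required fields and possible truncation.
with open(EXPORT_FILE, newline="", encoding="utf-8", errors="replace") as f:
    for row_num, row in enumerate(csv.DictReader(f), start=2):
        for field in REQUIRED:
            if not (row.get(field) or "").strip():
                print(f"Row {row_num}: missing required field '{field}'")
        for field, value in row.items():
            if value and len(value) in SUSPECT_LENGTHS:
                print(f"Row {row_num}: '{field}' is exactly {len(value)} chars; check for truncation")
```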

We are able to get quarterly back-ups of our repository data from our vendor, bepress. This ensures we have all our content on local servers as well as out of state with our vendor host. We found an issue with the back-up which resulted in a small modification to our metadata. USF uses a shell script to do a weekly harvest of the metadata through OAI and curls the information into a file, which is copied onto a server backed up in another city. Finally, in a linked data world, preserving the metadata for your unique content in your IR is especially important.

Persistent identifiers

Another aspect of an exit strategy and of preservation is having persistent identifiers. If you do migrate to a new system, you want it to be as seamless for the readers as possible. If you have persistent URLs for content, you can point to the new location and all is well. A DOI may be appropriate for original IR content, but DOIs shouldn't be used for pre- and postprints or publisher versions of articles in the repository, so use a handle or another persistent identifier for those.

Preserving the web

While this presentation is focusing on institutional repositories, I wanted to mention another way to preserve output from your institution. There may be content that your Archivist has identified as important to preserve but which is not appropriate for your repository or which the department or individual does not wish to include in the repository for some reason. In these cases you may want to use web archiving to collect and preserve this content. You can also use web archiving procedures to add a project’s website to your repository.

Archive-It

We have a subscription to Archive-It, a web archiving service from the Internet Archive used to harvest and preserve digital collections. We have collected journals before they moved into our repository, as well as journals and newsletters published outside of our repository. We also collect content from around the University, which is usually administrative in nature but sometimes includes scholarly content.

Internet Archive

You can also include repository content, especially unique content, directly in Internet Archive, as can be done with digitized books. The State of Montana has uploaded documents to Internet Archive, treating it somewhat like a repository.

IRs are a bit different…

An IR is a little different from other library digital collections, for which all the content is typically owned and uploaded by the library. If you are uploading all the files in house, you may already have them all. However, at least some of the content is hopefully self-submitted or even automatically deposited via SWORD or in another fashion. This means the item in your repository may be the only copy you have. Some of your repository content will be born digital and other content will be digitized. Even if the print exists, it may not be retained after it has been scanned.

Access copy vs. preservation copy

The scans may have been done as high-quality preservation scans or may simply be access copies. The digitization may have been done by library staff or elsewhere, so you may have little control over this process. If you have preservation scans, it may not be as important to preserve the access copy. The preservation needs may also be different when the analog item still exists. This mix of born digital, scanned with no print, scanned with print retained, and scanned to preservation standards adds an extra challenge when deciding what exactly needs to be preserved.

IRs have special problems…

Some repositories add a cover page during upload to brand the content, give it authority, and identify it with the repository, and to supply the full citation and rights information that is often lacking on an individual item. However, this inserted cover page has now altered your content, so you need to consider which version you are preserving: the actual original, or the one with the cover page. Furthermore, depending on the process used, a new PDF may be generated from the cover page and the original file (rather than a page being inserted into the existing file), and the new PDF may lose features, such as tagging for accessibility. In this case it is probably quite important to retain the original file as it was before upload.

File formats

As you all know, it is best to use open file formats when possible to ensure files will remain accessible in the future. However, if people self-submit, you may have little control over what actually goes into your IR. Unless you have strict technical controls on what can be uploaded, you will get content that is not in an ideal format. Widely used formats may not be as big a concern because there will probably be tools for their conversion in the future. You will need to make a policy decision regarding how absolute you will be about file formats; will you turn down content in an unusual format? Since IR managers typically want to remove barriers to deposit and preserve content for the long term, we generally allow non-ideal content to be added and deal with it after it has been deposited. Data sets are not necessarily a special file format, but they pose extra challenges because you need to ensure you have the metadata about what was collected, the settings of instruments, and so on in order for the data to be meaningful for others. One problem item for us is a 3-D model made in a proprietary program; we are working to get a more open format of the model.
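If you do accept everything and deal with it afterwards, even a simple report of deposited file extensions against a preferred-formats list can help you spot items that need attention. A minimal sketch, assuming a hypothetical local staging directory and an illustrative preferred list (base the real list on your own policy):

```python
from pathlib import Path

UPLOAD_DIR = Path("incoming_uploads")   # hypothetical staging area for new deposits
# Illustrative list only; base yours on your own policy (e.g. Illinois' recommendations).
PREFERRED = {".pdf", ".txt", ".csv", ".xml", ".tif", ".tiff", ".png", ".jpg", ".wav"}

flagged = []
for path in UPLOAD_DIR.rglob("*"):
    if path.is_file() and path.suffix.lower() not in PREFERRED:
        flagged.append(path)

for path in flagged:
    print(f"Review for preservation planning: {path} ({path.suffix or 'no extension'})")
```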

PDF/A

Much of your content is probably PDFs, but it may not be PDF/A, which is the best version for long-term archiving. PDF/A is an ISO standard “which provides a mechanism for representing electronic documents in a manner that preserves their visual appearance over time, independent of the tools and systems for creating or rendering the files” (http://www.pdfa.org/publication/pdfa-in-a-nutshell-2-0). This format does not allow features which will hinder long-term archiving, such as encryption, embedded audio and video, anything requiring external software for display or playback, and JavaScript. All fonts must be embedded, color information must be given in a standard color profile, and metadata must be embedded in XMP format. While PDF 1.7 is also an ISO standard, it doesn't require or limit these features, so it is inferior for archiving.

The bulk of our content is theses and journal articles, not created by us. We can influence the journal editors, but they are not yet producing PDF/As. If you migrate your content to PDF/A, be sure to record that such a change was made.
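If you want a rough sense of how much existing content even claims PDF/A conformance, the sketch below scans PDFs for the pdfaid:part declaration that PDF/A files carry in their XMP metadata. This is only a heuristic and the directory name is hypothetical; a declaration is not proof of conformance, so a dedicated validator such as veraPDF is the right tool for a real audit.

```python
import re
from pathlib import Path

PDF_DIR = Path("repository_files")   # hypothetical local copy of repository PDFs
# PDF/A files declare their conformance level in XMP, e.g. pdfaid:part="1"
# or <pdfaid:part>1</pdfaid:part>; this regex matches either form.
PDFA_MARKER = re.compile(rb'pdfaid:part\s*(?:=\s*"|>\s*)(\d)')

for pdf in sorted(PDF_DIR.rglob("*.pdf")):
    match = PDFA_MARKER.search(pdf.read_bytes())
    if match:
        print(f"{pdf}: claims PDF/A-{match.group(1).decode()}")
    else:
        print(f"{pdf}: no PDF/A declaration found")
```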

Theses

The ETDs will be a more difficult challenge for us because our Graduate College is completely in charge of them and submissions go to ProQuest. We receive the files from ProQuest weeks after graduation. We will need to convince our Graduate College to require PDF/A submissions. The graduates can also include supplemental files of any format. Our 1,938 ETDs currently include supplemental files in a range of formats:

.avi (21), .avp (1), .doc (8), .mov (2), .mp3 (2), .mp4 (1), .mpg (4), .mxf (1), .NTS (3), .pde (2), .pdf (6), .txt (4), .wmv (3), .xls (18), .zip (2)

One zip file contains an executable and two very large data files; the other includes 501 smaller files. I am unsure whether zipped files present preservation challenges of their own.
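Tallying these formats, and at least listing what is hiding inside the zip files, is easy to script. A sketch assuming the supplemental files have been copied to a local directory (the path is hypothetical):

```python
import zipfile
from collections import Counter
from pathlib import Path

SUPP_DIR = Path("etd_supplemental_files")   # hypothetical local copy of ETD supplemental files

counts = Counter()
for path in SUPP_DIR.rglob("*"):
    if not path.is_file():
        continue
    counts[path.suffix.lower() or "(no extension)"] += 1
    if path.suffix.lower() == ".zip" and zipfile.is_zipfile(path):
        # Peek inside so zipped content is part of the format inventory too.
        with zipfile.ZipFile(path) as zf:
            inner = Counter(Path(name).suffix.lower() or "(no extension)"
                            for name in zf.namelist() if not name.endswith("/"))
        print(f"{path}: {sum(inner.values())} files inside -> {dict(inner)}")

for ext, count in counts.most_common():
    print(f"{ext}\t{count}")
```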

Public preservation policy

You should have a policy regarding what types of file format migration you will do and make it clear what you are committing to preserve. Make sure the depositor understands the scope and restrictions; if the files are in a proprietary format, there is only so much you can do for the items over the long term. It may be best to keep the original and also create an open version that can be read in the future. As always, looking at other institutions' policies can be very instructive if you do not have a public preservation policy yet. I really like Illinois' preservation support policy and file format recommendations.

Preservation metadata

Caplan’s Understanding PREMIS provides an excellent introduction to PREMIS, defining preservation metadata as supporting “activities intended to ensure the long-term usability of a digital resource.” She also notes that people not directly involved with digital preservation do not need to know all the details, but would find it helpful to become familiar with it. This is particularly true if you are involved with evaluating or implementing an IR. You will be much better off reading her document than having me try to explain it all to you.

Digital provenance

From our perspective, I believe the most important point is recording actions taken in repository management that are important to know about for preservation (e.g. altering or deleting digital objects). Even if you don't have a preservation system, you may want to track these actions, such as adding a cover page or optimizing a large file for faster download. If you migrate content to another format (to be more open or in a newer version), you need to make it clear what was done to the file. These changes can put the “authenticity of the resource in doubt”. Caplan states (p.3):

Metadata can help support authenticity by documenting the digital provenance of the resource — its chain of custody and authorized change history.

Some technical information can be extracted from files, such as the name of the program that created the item and the creation date, but if it is not embedded in the file it should be recorded separately. Most PREMIS event types are for recording actions after an item has been ingested into a preservation repository, so you may need to invent your own event types to consistently record actions before this step.
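Even a very lightweight log helps here. The sketch below appends one JSON line per action, with fields loosely modeled on PREMIS event semantics (event type, date/time, agent, detail); the file name, event types, and identifier are my own illustration rather than PREMIS-sanctioned values.

```python
import datetime
import json
from pathlib import Path

EVENT_LOG = Path("provenance_events.jsonl")   # hypothetical append-only event log

def record_event(object_id, event_type, description, agent="ir-staff"):
    """Append one provenance event as a JSON line (loosely modeled on PREMIS events)."""
    event = {
        "object": object_id,
        "eventType": event_type,            # e.g. "cover page added", "format migration"
        "eventDateTime": datetime.datetime.now().isoformat(timespec="seconds"),
        "agent": agent,
        "eventDetail": description,
    }
    with EVENT_LOG.open("a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")

if __name__ == "__main__":
    record_event("etd-2013-0042",   # hypothetical object identifier
                 "format migration",
                 "Converted submitted .doc to PDF/A-1b; original file retained alongside.")
```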

[This is where my co-presenter went into the real substance of IR Preservation. You will have to wait for the Hollywood movie version.]

Sources

Ball, Alex. Preservation and Curation in Institutional Repositories. Digital Curation Centre, UKOLN, 2010. Version 1.3 http://www.dcc.ac.uk/sites/default/files/documents/reports/irpc-report-v1.3.pdf

Caplan, Priscilla. Understanding PREMIS. Library of Congress, ©2009. http://www.loc.gov/standards/premis/understanding-premis.pdf

Center for Research Libraries. “Ten Principles.” http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/core-re

Digital Repository Audit Method Based On Risk Assessment (DRAMBORA). Glasgow, 2009. http://www.dcc.ac.uk/resources/repository-audit-and-assessment/drambora

JISC. Digital Preservation: Continued Access to Authentic Digital Assets (Nov. 2006) http://www.jisc.ac.uk/publications/briefingpapers/2006/pub_digipreservationbp.aspx

Nestor Working Group. Catalogue of Criteria for Trusted Digital Repositories. Frankfurt am Main, Dec. 2006. URN: urn:nbn:de:0008-2006060703

OpenDOAR Policies Tool. http://www.opendoar.org/tools/en/policies.php

Oettler, Alexandra. PDF/A in a Nutshell 2.0: PDF for long-term archiving. Berlin: Association for Digital Document Standards e. V., ©2013. http://www.pdfa.org/wp-content/uploads/2013/04/PDFA_in_a_Nutshell_21.pdf

Pennock, Maureen. Web-Archiving. DPC Technology Watch Report 13-01, March 2013. DOI: http://dx.doi.org/10.7207/twr13-01

Reference Model for an Open Archival Information System (OAIS). Recommended Practice CCSDS 650.0-M-2. Magenta Book, June 2012. http://public.ccsds.org/publications/archive/650x0m2.pdf

Trustworthy Repositories Audit & Certification: Criteria and Checklist (TRAC). Version 1.0. Feb 2007. http://www.crl.edu/archiving-preservation/digital-archives/metrics-assessing-and-certifying/trac

University of Houston Libraries, Institutional Repository Task Force. Institutional Repositories. SPEC Kit 292. July 2006. http://publications.arl.org/Institutional-Repositories-SPEC-Kit-292/3

University of Illinois at Urbana-Champaign. “IDEALS Digital Preservation Support Policy.” ©2013 https://services.ideals.illinois.edu/wiki/bin/view/IDEALS/PreservationSupportPolicy

University of Illinois at Urbana-Champaign. “Preparing Items for Deposit into IDEALS. File Format Recommendations” ©2013 https://services.ideals.illinois.edu/wiki/bin/view/IDEALS/SubmissionPrep#File_Format_Recommendations
