The data contributor repositories that we harvest often contain records that point to both freely available and restricted-access digital resources. Sometimes repositories partition these records into OAI sets (e.g., "freely accessible texts") that can be easily harvested, and sometimes they do not. When they do not, we must make an additional effort to filter for only the records of freely accessible digital resources. This depends entirely on the records themselves: the metadata must contain some indication of restriction policy (e.g., "This material is accessible to the public, freely and without charge.") in order for us to perform filtering. Records frequently do not contain this information, and availability becomes clear only by following the link to the digital resource. Consequently, deciding whether to keep or discard an entire repository's records after discovering some restricted records has been challenging.
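The kind of metadata-based filtering described above can be sketched as follows. This is a minimal illustration, not our production filter: it assumes records arrive as unqualified Dublin Core (oai_dc) and that any restriction statement appears in a dc:rights element. The marker phrases and the function name classify_access are hypothetical, since each repository words its rights statements differently.

```python
import re
import xml.etree.ElementTree as ET

DC_NS = "{http://purl.org/dc/elements/1.1/}"

# ASSUMPTION: hypothetical marker phrases, chosen only for illustration.
FREE_MARKERS = ("accessible to the public", "freely available", "open access")
RESTRICTED_MARKERS = ("subscription", "restricted", "licensed users")


def _has_marker(text, markers):
    """True if any marker occurs in text as a whole word or phrase."""
    return any(re.search(r"\b" + re.escape(m) + r"\b", text) for m in markers)


def classify_access(record_xml):
    """Return 'free', 'restricted', or 'unknown' using dc:rights text only."""
    root = ET.fromstring(record_xml)
    rights = " ".join(
        (el.text or "").lower() for el in root.iter(DC_NS + "rights")
    )
    if not rights.strip():
        # No rights statement at all: the metadata cannot tell us.
        return "unknown"
    if _has_marker(rights, RESTRICTED_MARKERS):
        return "restricted"
    if _has_marker(rights, FREE_MARKERS):
        return "free"
    return "unknown"
```

The "unknown" result is the common case noted above: when the record carries no usable rights statement, only following the link to the resource would reveal its availability.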
Recently, we have encountered several large repositories in which a significant percentage of the records point to materials that are restricted for a certain duration and then become freely available. In a perfect OAI world, these records would reflect the current restrictions and any changes to them, so that newly modified records could be re-harvested and re-filtered. This is almost never the case. Again, the decision to retain these repositories is difficult when faced with such shifting restrictions.
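For illustration, re-harvesting only newly modified records would rely on the OAI-PMH ListRecords request with its optional from argument. The sketch below only builds such a request URL; the base URL in the usage note and the function name list_records_url are placeholders, and the approach helps only if a repository actually updates a record's datestamp when its restriction changes.

```python
from datetime import date
from urllib.parse import urlencode


def list_records_url(base_url, since, set_spec=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords URL for records changed on or after `since`."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": since.isoformat(),  # day granularity: YYYY-MM-DD
    }
    if set_spec is not None:
        # Optional: limit the harvest to one OAI set, if the repository offers sets.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

A call such as list_records_url("http://example.org/oai", date(2024, 1, 15)) would request everything the repository reports as modified since that date; the results would then need to be re-filtered.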
Early on, we decided to host our own restricted-access records, i.e., the University of Michigan's digital collections that are accessible via subscription (individual and institutional). Our rationale was that by not including these records we would:
- fail to provide the full picture of the digital resources we host
- limit access for subscribers who would be able to retrieve these resources
- thwart those who may wish to become subscribers after learning of the existence of these resources
Given these issues, we felt that a clear statement of our collection development policy would be helpful. Our stated policy, including processing issues, is as follows:
- We harvest and retain all records that point to digital resources.
- This includes freely available and restricted-access digital resources.
- For small data repositories in which some records contain incorrect UTF-8 or XML that causes our transformation engine to fail, we will fix these records so that the engine can complete successfully. However, because we must handle these repositories more than once each time we harvest, which causes delays, we harvest them on a monthly basis. Repositories containing many records with incorrect UTF-8 or XML will be dropped. We will communicate these problems to the repository owner as time allows. Repositories with no UTF-8 or XML errors are harvested on a weekly basis.
- When regular harvesting (weekly or monthly) has failed at least three times and we are unable to discover a new OAI baseURL for a repository, we may drop the repository. As above, we will communicate these problems to the repository owner as time allows.
- We will not add repositories with fewer than 5 records that point to digital resources. If the number of records later increases, notification from the data repository owner is the most efficient way to restart our harvesting of the repository.
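The processing rules in the policy above can be read as a small decision procedure. The sketch below is one possible encoding and not a description of our actual tooling. The thresholds of 5 records and 3 failed harvests come from the policy; the cutoff for "many" bad records is not defined in the policy, so MANY_BAD_RECORDS is an assumed value, and the names RepoStatus and harvest_decision are hypothetical.

```python
from dataclasses import dataclass

MIN_DIGITAL_RECORDS = 5   # policy: fewer than 5 records -> not added
MAX_FAILED_HARVESTS = 3   # policy: at least three failed regular harvests
MANY_BAD_RECORDS = 50     # ASSUMPTION: the policy does not quantify "many"


@dataclass
class RepoStatus:
    digital_records: int              # records that point to digital resources
    bad_records: int                  # records with incorrect UTF-8 or XML
    failed_harvests: int = 0          # failed regular harvest attempts
    new_base_url_found: bool = False  # whether a replacement baseURL was found


def harvest_decision(repo):
    """Map a repository's status to the action the policy describes."""
    if repo.digital_records < MIN_DIGITAL_RECORDS:
        return "do not add"
    if repo.failed_harvests >= MAX_FAILED_HARVESTS and not repo.new_base_url_found:
        return "may drop"
    if repo.bad_records >= MANY_BAD_RECORDS:
        return "drop"
    if repo.bad_records > 0:
        return "fix and harvest monthly"
    return "harvest weekly"
```

Whether a record points to a freely available or a restricted-access resource plays no part in the decision, which matches the first policy item: both kinds are harvested and retained.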