Recently I gave a guest presentation at a SLIS class on authority control in institutional repositories. I followed a speaker discussing traditional MARC authority records. The following is approximately what I said.
I am here, in Duncan’s words, to destroy any lingering appeal that MARC might have to you.
First, a couple of key differences regarding names in an IR. Traditionally, LC authority records are not created for people who write only articles, which is typical in STEM fields. The dream is to have content be self-submitted or submitted in an automated way, such as with SWORD (Simple Web-service Offering Repository Deposit). If self-submitted, the process needs to be very fast and straightforward. Since specialists may not be inputting the data, data quality may be quite variable, and the library may expend no resources on standardizing it or cleaning it up.
Many authors can appear on one article. Sometimes the most important is first, sometimes last, sometimes no order at all. Unlike in MARC cataloging where limits are put on listing authors (traditionally the rule of three), all are listed in repositories.
On the left is an example from our repository. Author affiliation is also important on the article. We only include the affiliation for our local authors, and we try to include University of Iowa only for work they wrote while they were here. Below is the same article in the journal, which includes affiliations for each of the authors.
On the right is an example from arXiv. Multiple authors are extremely common. With so many authors involved, the chance of repeated names goes up.
Publications, indexes or citation styles often include just initials. This is particularly common in science, and it makes name disambiguation even more difficult. The slide shows results from Web of Science for JE Kasper. There are four different authors appearing in the first 8 results. (Two of the people have the same name but I believe they are different people.)
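To make the initials problem concrete, here is a minimal Python sketch. The author names are invented for illustration (they are not the people in the Web of Science results), and the `initials_key` helper is hypothetical; it assumes a simple "given names, then one-word surname" order, which real data will often violate.

```python
from collections import defaultdict


def initials_key(full_name: str) -> str:
    """Reduce 'Given Middle Surname' to 'Surname GM', as many indexes display it.

    Assumption: the last whitespace-separated token is the surname.
    Multi-word surnames and 'Surname, Given' order are not handled.
    """
    parts = full_name.replace(".", " ").split()
    surname, given = parts[-1], parts[:-1]
    return f"{surname} {''.join(p[0].upper() for p in given)}"


# Hypothetical author names, invented for this illustration only.
authors = [
    "Jane Ellen Kasper",
    "John E. Kasper",
    "Joseph Edward Kasper",
    "J. E. Kasper",
]

# Group the full names by the initials-only form an index might show.
groups = defaultdict(list)
for name in authors:
    groups[initials_key(name)].append(name)

for key, names in groups.items():
    print(key, "<-", names)
```

All four names fall under the single key "Kasper JE", so once only initials are kept, nothing in the name itself can tell these people apart.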
A citation uses the form of name on the published item. This means the repository may have a mix of name forms for one person. If all these items were in our repository, it could be difficult to connect them all. Most repositories do not have adequate controls on names to link these forms of names together.
Our theses can easily carry a form of name that is not used for any other publication. Our Registrar requires one’s full name at registration and our Graduate College requires the name under which you are registered on the thesis title page. These variations are easy to reconcile for the reader, but can be more difficult with machine matching. The published name may also vary across fields. This same individual is known as Jim in our digital collection and published as James with no middle initial in ceramics.
This is especially true for people who change their name, such as if a woman changes her name when married and has published under both forms. If we had items by Hillary Rodham Clinton, her results would be split in a browse between R and C. It would be obvious to people that this is the same individual but most authors are not this well known.
In the case of Deirdre McCloskey’s change, she wisely kept her initials the same, which helps keep her publications together in some indexes.
Many people in repositories have common names and have fairly minor publications, so disambiguation can be awkward.
In my case, the dominant Wendy Robertsons are a mathematician born in 1927, a novelist born in 1941, and a UK woman in the health industry. Note that she appears as both Wendy and W Robertson on actual articles. Note also that I found full names and dates of birth from the openly accessible authority records for the first two. I consider this to be a privacy issue and am glad I do not have an authority record where this kind of information may appear.
I had long noted these other Wendy Robertsons, so when I began to publish, I used my middle initial. Despite my best intentions, one publication omitted the C, and I was very annoyed at myself. When a variant is this minor, should we standardize the name in the repository or should we go with what is on the piece? Various institutions answer this variously, and we are not entirely consistent locally. For example, I believe I always included a C in the metadata for my conference presentations (left) even though I often omitted it from the slides or schedule. However, this improves the consistency of my indexing in Google Scholar.
Notice that at the top (right) of the Google Scholar results there is a little information about me and how many times I have been cited. Notice also that I can edit this information, sharing as much or as little as I want, and even merging, adding or deleting citations.
Scholars really care about things like this, so these metrics are very important. People want to be able to easily identify all their articles and only their articles. They want to know how often they have been cited or downloaded. These metrics are important for tenure. Because of this interest, there are efforts at name disambiguation for authors that start from outside the library sphere. This is where ORCID comes in. The article you read for today (Haak, L., Fenner, M., Paglione, L., Pentz, E., & Ratner, H. (2012) ORCID: A system to uniquely identify researchers. Learned Publishing 25(4). http://dx.doi.org/10.1087/20120404) gives a very good overview of ORCID, so I won’t talk much about it now. Two of the most important points from the article, in my opinion, are:
ORCID has taken the stance that the use of an identifier should first and foremost reduce the reporting burden for researchers, both in the immediate task of filling out basic information on forms, as well as in longer-term progress reporting.
ORCID also recognizes that, first and foremost, individuals own their record. A central principle of the ORCID initiative is that researchers control the defined privacy settings of their own ORCID record data.
ORCID also connects to Thomson Reuters ResearcherID, yet another ID initiative.
As far as I know, we are not yet actively promoting ORCID with researchers, but I hope we will be soon. If we were a member research organization, we would be able to create IDs for our faculty.
Member research organizations may create ORCID records on behalf of their faculty, staff, or students. Research organizations can include ORCID identifiers in local systems, such as personnel databases, identity management systems, or research information systems, and also supports interoperability with external systems. As a public identifier, ORCIDs can be exposed where proprietary or private identifiers cannot.
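If ORCID identifiers are going to be stored in a local system, it helps to check them before saving. The Python sketch below is an illustration, not part of any repository software: it assumes the 16-character hyphenated form and the check character scheme that ORCID documents (ISO 7064 MOD 11-2). The function names are my own, and the sample identifier is, as far as I know, the one ORCID uses for its fictitious test researcher.

```python
import re

# Assumed display form: four groups of four, last character a digit or X.
ORCID_PATTERN = re.compile(r"[0-9]{4}-[0-9]{4}-[0-9]{4}-[0-9]{3}[0-9X]")


def orcid_check_character(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for the first 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the string is well formed and its check character matches."""
    if not ORCID_PATTERN.fullmatch(orcid):
        return False
    compact = orcid.replace("-", "")
    return orcid_check_character(compact[:15]) == compact[15]


print(is_valid_orcid("0000-0002-1825-0097"))  # sample iD with a matching check character
print(is_valid_orcid("0000-0002-1825-0098"))  # same digits, last character altered
```

A check like this only catches typing errors. It says nothing about whether the identifier belongs to the person being described, which is why researcher control of the record matters.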
Our IR, which uses Digital Commons software from bepress, lacks authority control. We can cluster people, but there is no identifier. An email address is the closest thing we have to an identifier.
As more publishers begin to support ORCID, I hope that within the year our repository will support it as well. Ideally it would be part of the submission process so that author details could be pulled in. Hopefully we will also be able to easily have repository items added to an ORCID record.
While off the topic of authority control in repositories, I wanted to give you a few thoughts on repository metadata in general. Much of our use comes from Google and Google Scholar. Google Scholar uses meta tags in the HTML header. This same structured data can also be used by reference managers to import citations. In addition to mapping to Dublin Core so that content can go to shared library collections, it is very important to look at non-library standards when considering your data. If you want more information about repository metadata in general, you may wish to look at slides and extensive speaker's notes from a presentation I did two years ago.
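As a rough illustration of what those header tags look like, here is a Python sketch that builds them for the ORCID article cited earlier. The tag names (`citation_title`, `citation_author`, and so on) are the ones I understand Google Scholar's inclusion guidelines to describe. The `scholar_meta_tags` function and the shape of the `record` dictionary are assumptions made for this example, not how Digital Commons or any other platform generates its pages.

```python
from html import escape


def scholar_meta_tags(record: dict) -> str:
    """Build <meta> tags for an article record, one citation_author tag per author."""
    pairs = [("citation_title", record["title"])]
    pairs += [("citation_author", author) for author in record["authors"]]
    pairs += [
        ("citation_publication_date", record["year"]),
        ("citation_journal_title", record["journal"]),
        ("citation_volume", record["volume"]),
        ("citation_issue", record["issue"]),
    ]
    # Escape values so quotes or ampersands in a title cannot break the HTML.
    return "\n".join(
        f'<meta name="{name}" content="{escape(str(value), quote=True)}">'
        for name, value in pairs
    )


# Record built from the citation given in this talk; authors as printed there.
record = {
    "title": "ORCID: A system to uniquely identify researchers",
    "authors": ["Haak, L.", "Fenner, M.", "Paglione, L.", "Pentz, E.", "Ratner, H."],
    "year": "2012",
    "journal": "Learned Publishing",
    "volume": "25",
    "issue": "4",
}

print(scholar_meta_tags(record))
```

Each author gets a separate tag in the order listed on the article, so whatever form of name went into the repository metadata is the form that search engines and reference managers will pick up.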
This work is licensed under a Creative Commons Attribution 3.0 Unported License.