CAIRSS Blog

2009/07/21

NicNames and People Australia – some thoughts for CAIRSS

Filed under: Uncategorized — ptsefton @ 1:01 pm

Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog

This post looks at a couple of name-related services that will be of interest to CAIRSS people. There are a lot of Author ID systems. This post is not a survey of all of them; for that, see this list.

I visited the National Library of Australia on Wednesday June 17th to look at their new Single Business Discovery System prototype and the People Australia service. The SBDS is much more interesting than the name implies (I’m assured the name will change); it lets you search a big index of:

I’d like to talk about one part of this, People Australia, and reflect on how this fits with the ARROW mini-project looking at researcher identities, the NicNames Project. I dropped in on NicNames at Swinburne University in Melbourne on July 15th.

This blog post has been reviewed by staff from the NLA and Swinburne. Thanks, all; I tried to address the things you raised, but please use the comments if I missed anything.

Why would a repository manager care about these services? Both of these systems are built around identities of people. They promise to allow us to identify researchers uniquely and link those identities to research outputs and other stuff in the repository. No more searching through all the Lees and Smiths for the one you want or trying to bring together publications for people who have changed their name.

It is well understood that people will have lots of IDs from lots of sources: university staff numbers (which are probably private), tax file numbers, email addresses, OpenIDs and Australian Access Federation IDs (eventually, maybe).

I will look at how I think these services might look to a variety of players: repository managers, depositors and end-users. For more background about some of the issues I recommend that you read this post by Andy Powell first. He reports that the UK scene seems to involve attempts to create centralized name services, whereas here we are looking at distributed systems that talk to each other. Yes, People Australia is a large project at the National Library of Australia, but it is designed to be part of a decentralised mesh of naming services.

Repository Managers: Establishing Local IDs

If you’re a repository manager and you want to get usable, stable, persistent IDs, this is what you will do.

  1. Arrange to have NicNames installed. It’s a web application which is installed separately to the repository. There might be a local instance at your site, or potentially you could share with another institution.

  2. Load all your repository data into the NicNames website from your repository. This involves extracting data from the repository in, say, MARCXML and then feeding that to NicNames.

    The NicNames application will consider the data and try to identify unique authors by looking at the strings used to identify them and by looking at co-authors and subject codes. So if there are two Alan Smiths working in different disciplines it will try to work out that they are different. NicNames can be configured to store as much information about each identity as you like, such as the different name strings that have been used to refer to it, and potentially the IDs of any repository items (this will become important later).

    Once it has done this it will give the repository manager some kind of interface to confirm NicNames’ suspicions and/or correct it when it gets things wrong. Behind the scenes, NicNames assigns unique IDs for individuals. These are not related to People Australia IDs at this stage of the process. So now instead of just a string like Alan Smith in the name field of some metadata we have something like:

     <person><name>Alan Smith</name><id>NN:00000001</id></person>

    We might also have other records that have a different form of the name but with the same ID, which is the whole point of this exercise. (You could also store a canonical string like Smith, Alan to save the software having to look it up in the NicNames system, but then what if you have to change it?)

    <person><name>Smith, Prof. A</name><id>NN:00000001</id></person>

  3. When you are happy with the names they can be imported back into the repository. This will have to be coded for each repository platform separately, and at this stage this has not been done. One of the issues is that VITAL (which is the ARROW platform) does not have any APIs to allow this kind of batch update, so something would have to be written at the level of Fedora, which does the data storage under VITAL.

    Alternatively, it would be possible to use an architecture where you didn’t have to change the repository at all: it would continue to store strings for names, and the NicNames system would hold the data about IDs. A third system, a discovery layer, could then present a browse-and-search view of the repository-plus-name-IDs. That might sound a bit problematic, but it might be pragmatic where it is difficult to change the core repository software (even plugins we develop at USQ take months to make it into our local ePrints). It’s actually a semantic-web approach where different facts can be distributed on the web. I’ll write up this design pattern in another post on my own blog.
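To make the matching idea in step 2 a little more concrete, here is a minimal Python sketch. It is not NicNames code (none has been released), and everything in it is an assumption made for illustration: the `Record` and `Identity` shapes, the crude `name_key` normalisation, the rule that two mentions belong together only when they share a co-author or a subject code, and the made-up subject codes. The `NN:` ID format simply follows the example above.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """One author mention pulled from a repository record (hypothetical shape)."""
    name: str
    coauthors: frozenset
    subjects: frozenset


@dataclass
class Identity:
    """A local identity with everything known about it so far."""
    local_id: str
    names: set = field(default_factory=set)
    coauthors: set = field(default_factory=set)
    subjects: set = field(default_factory=set)


def name_key(name):
    """Reduce 'Alan Smith' and 'Smith, Prof. A' to a crude (surname, initial) key."""
    titles = {"prof", "dr", "mr", "ms", "mrs"}
    if "," in name:
        surname, rest = [part.strip() for part in name.split(",", 1)]
        given = [w for w in rest.replace(".", " ").split() if w.lower() not in titles]
    else:
        words = [w for w in name.replace(".", " ").split() if w.lower() not in titles]
        surname, given = words[-1], words[:-1]
    initial = given[0][0].lower() if given else ""
    return surname.lower(), initial


def assign_ids(records):
    """Give each record a local ID, reusing an identity when name and context agree."""
    identities = []
    assigned = []
    for rec in records:
        key = name_key(rec.name)
        match = None
        for ident in identities:
            same_name = any(name_key(n) == key for n in ident.names)
            # Assumed rule: a shared co-author or subject code counts as evidence.
            shared_context = (ident.coauthors & rec.coauthors) or (ident.subjects & rec.subjects)
            if same_name and shared_context:
                match = ident
                break
        if match is None:
            match = Identity(local_id=f"NN:{len(identities) + 1:08d}")
            identities.append(match)
        match.names.add(rec.name)
        match.coauthors |= rec.coauthors
        match.subjects |= rec.subjects
        assigned.append(match.local_id)
    return assigned, identities


# Illustrative data only; the subject codes are invented.
records = [
    Record("Alan Smith", frozenset({"B. Jones"}), frozenset({"0801"})),
    Record("Smith, Prof. A", frozenset({"B. Jones"}), frozenset({"0806"})),
    Record("Alan Smith", frozenset({"C. Lee"}), frozenset({"1701"})),
]
ids, identities = assign_ids(records)
print(ids)
```

With this data the first two mentions end up under one ID and the third Alan Smith gets a second one. A real system would still need the confirmation interface described above, because a rule this simple will get some cases wrong.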

At Swinburne the main use case involves repository staff running batches of records from EndNote into VITAL, as that’s the way their repository workflow is set up; it’s all done in the library, with no self-deposit. They will:

  • Transform EndNote data to MARCXML.

  • Put the MARCXML into NicNames as above and sort out the names.

  • Export the MARCXML back out of NicNames, with added IDs.

  • Put the MARCXML into VITAL as per normal practice at Swinburne.
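As a sketch of what "MARCXML with added IDs" could look like in the steps above, the Python below adds an identifier to each author field it recognises. The post does not say where NicNames will put the ID in the exported record, so using subfield `0` of the 100 and 700 fields is purely an assumption, as are the sample record, the `add_author_ids` function and the plain dictionary standing in for NicNames.

```python
import xml.etree.ElementTree as ET

MARC_NS = "http://www.loc.gov/MARC21/slim"
ET.register_namespace("", MARC_NS)

# A tiny made-up record, just enough to show the author field.
SAMPLE = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <datafield tag="100" ind1="1" ind2=" ">
    <subfield code="a">Smith, Alan</subfield>
  </datafield>
  <datafield tag="245" ind1="1" ind2="0">
    <subfield code="a">An example title</subfield>
  </datafield>
</record>"""


def add_author_ids(marcxml, lookup):
    """Return MARCXML with an ID subfield added to each recognised author field.

    `lookup` maps a name string to a local ID; it stands in for the name service.
    """
    root = ET.fromstring(marcxml)
    # Copy the matches into a list so the tree is not modified while iterating.
    for datafield in list(root.iter(f"{{{MARC_NS}}}datafield")):
        if datafield.get("tag") not in ("100", "700"):
            continue
        name = datafield.find(f"{{{MARC_NS}}}subfield[@code='a']")
        if name is None or name.text not in lookup:
            continue
        # Assumption: the ID goes in subfield 0 of the same field.
        id_subfield = ET.SubElement(datafield, f"{{{MARC_NS}}}subfield", {"code": "0"})
        id_subfield.text = lookup[name.text]
    return ET.tostring(root, encoding="unicode")


enriched = add_author_ids(SAMPLE, {"Smith, Alan": "NN:00000001"})
print(enriched)
```

Names that are not in the lookup are left untouched, which matches the idea that someone sorts out the names in NicNames before the export.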

I talked to Swinburne staff about having a look at Squire, the ARROW-sponsored replacement for VITAL, which might be able to be integrated into their workflow and adapted to help inject NicNames IDs back into Fedora. So for new records, there will be a unique identifier in the record. How this will be displayed in VITAL remains to be seen.

1 Depositors

Now we turn our attention to the users: whoever is putting in resources using some kind of web ingest system. The NicNames team are starting with VALET, which is the open source repository ingest tool that comes with VITAL, but it should be simple to plug it in to other systems like ePrints. Here’s what a typical depositor will do:

  1. Start depositing a new item as usual.

  2. Start typing in a name field for, say, an author.

    If what you are typing appears to match an existing identity, a form will pop up where you can pick which author you mean. See the screenshot on the NicNames blog.

    If there are no matches then there will have to be some way to create a new identity in NicNames.

    Behind the scenes the ingest application will be talking to NicNames.
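The lookup behind that pop-up could be as simple as the following Python sketch. It assumes the ingest tool can ask the name service for identities whose name variants contain a word starting with what has been typed so far. The `Identity` shape, the `suggest` function and the sample data are all hypothetical; the post does not describe the real interface between the ingest tool and NicNames.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identity:
    """A known person: one local ID, several name strings (hypothetical shape)."""
    local_id: str
    names: tuple


def suggest(typed, identities, limit=5):
    """Return identities with a name variant containing a word that starts with `typed`."""
    needle = typed.strip().lower()
    if not needle:
        return []
    hits = []
    for ident in identities:
        for name in ident.names:
            words = name.lower().replace(",", " ").split()
            if name.lower().startswith(needle) or any(w.startswith(needle) for w in words):
                hits.append(ident)
                break  # one matching variant is enough for this identity
    return hits[:limit]


# Illustrative data only.
KNOWN = [
    Identity("NN:00000001", ("Alan Smith", "Smith, Prof. A")),
    Identity("NN:00000002", ("Alan Smith",)),
    Identity("NN:00000003", ("Lee, K",)),
]

print([i.local_id for i in suggest("smi", KNOWN)])
print(suggest("Zhang", KNOWN))
```

Typing "smi" offers both Alan Smiths so the depositor can choose between them, while an empty result is the cue for the "create a new identity" path.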

That’s it. Your repository now knows a definite identity for each person associated with a resource so you can have as many Alan Smiths as you like, and be able to deal with people who have published under several names. There is still the matter of what the interface will look like. Rebecca Parker tells me:

One of the outcomes of the project will be recommendations from a user-centred design process … we’ll be making suggestions about best practice for displaying name variants in research databases generally, with obvious local impact on how to manage these in institutional repositories.

2 Where does the NLA come into this?

So if a repository manager has used NicNames to establish IDs for people, and depositors have associated new deposits with those IDs, we have a local unique ID for each person in the repository. But that doesn’t help when records are harvested by the NLA for their Discovery Service, because to that system a NicNames ID is meaningless unless it can be associated with the People Australia name system. What we want is a way to tie the NicNames ID to the People Australia ID.

People Australia is designed to work with multiple distributed identity management systems; it keeps an EAC record for each entity (person or organisation), which can have multiple identifiers associated with it. I assume what’s needed in the case of repository content is either to match a NicNames identity with an existing People Australia identity or, if the match can’t be made, to make a new People Australia identifier with an EAC record that contains the NicNames ID.

The process will work like this:

  1. People Australia will harvest name data out of NicNames systems using OAI-PMH and will attempt to match them to People Australia identities, and if that fails make new ones. (I’m still not really clear on how this might happen; this bit is not in either system yet.)

  2. Now, when the People Australia harvester pulls OAI-PMH records out of the repository, they will have NicNames IDs in them in the Dublin Core, and the NLA system will be able to associate those with People Australia IDs.
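Since this part is not in either system yet, the following Python is only a guess at the shape of the "match or make a new one" logic in the two steps above. The `PartyRecord` class is a stand-in for an EAC record, the `PA-` identifier format and the `NN-OTHER:` ID are invented, and matching on identical name strings is far too naive for real use (it would merge two different Alan Smiths); it is there only to show the flow.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class PartyRecord:
    """Stand-in for an EAC record: one entity, many identifiers (hypothetical shape)."""
    national_id: str
    names: set = field(default_factory=set)
    other_ids: set = field(default_factory=set)


class NameService:
    """Toy national name service that links local IDs to its own identifiers."""

    def __init__(self):
        self.records = []
        self._next = count(1)

    def harvest(self, local_id, names):
        """Step 1: match a harvested local identity, or make a new record for it."""
        names = set(names)
        for rec in self.records:
            # Naive matching rule, for illustration only.
            if local_id in rec.other_ids or rec.names & names:
                rec.other_ids.add(local_id)
                rec.names |= names
                return rec
        rec = PartyRecord(f"PA-{next(self._next)}", names, {local_id})
        self.records.append(rec)
        return rec

    def resolve(self, local_id):
        """Step 2: look up the national ID for a local ID found in a harvested record."""
        for rec in self.records:
            if local_id in rec.other_ids:
                return rec.national_id
        return None


service = NameService()
service.harvest("NN:00000001", ["Alan Smith", "Smith, Prof. A"])
service.harvest("NN-OTHER:42", ["Smith, Prof. A"])  # a second, made-up local system
service.harvest("NN:00000003", ["Lee, K"])

print(service.resolve("NN-OTHER:42"))
print(service.resolve("NN:00000003"))
```

The point of the sketch is that one record can carry several local identifiers, so IDs from different institutions' name systems can resolve to the same national identifier.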

Basil Dewhurst at the NLA summed up some of the advantages of public persistent IDs, which is what People Australia will provide (NicNames can’t provide that on its own; it’s a bit of software, and you need software-plus-governance to provide persistence):

The case for People Australia IDs is that they’re _persistent_, _public_, and enable discovery of information about people and the resources they create across collections and domains. Importantly we plan to pull in the VIAF names later in the year and this means that we can link to researchers internationally as well as in Australia. Research doesn’t stop at the coast line!

He notes that machine-to-machine interfaces are another key advantage of these systems, which I take to mean that repositories can talk to each other to build a distributed identity system.

3 Summary: what does this mean for repository managers?

For most repositories I think that it is a case of waiting to see what happens. At the moment, the People Australia service is not creating IDs for stuff that comes in via the ARROW Discovery Service, and the NicNames project has not yet released any code or got it working in any of the partner institutions. When NicNames is released the CAIRSS team will have a look and see what would be involved in getting it into production across the various platforms in use in Australia, and we’ll keep talking to the NLA about how NicNames IDs will flow through to People Australia.


* It’s impressive that when I copied the tabs from the top of the SBDS page they pasted into my document as bullet points. That’s clean design.
