CAIRSS Blog

2010/03/09

Following and communicating with CAIRSS

Filed under: Uncategorized — caulcairss @ 3:46 pm

Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog

2009/12/07

eResearch Australasia 2009, who CAIRSS?

Filed under: Uncategorized — ptsefton @ 4:15 pm

Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog

Kate Watson has reminded me to blog about the eResearch Australasia conference, held in November, from a CAIRSS perspective. What’s going on in eResearch that university repository managers should be aware of?

Here are my top five things to think about, in order of urgency, with 1 being the most immediate and 5 being a longer-term consideration:

  1. Look at what other CAIRSS sites are doing with eResearch and data. There were some great examples of different thinking about how IRs fit into eResearch at the data-management workshop run by QUT and CSIRO, with appearances from some familiar faces from the IR world talking about their institutional planning for data management: Institutional approaches to data management support: exploring different models. We’re interviewing for a new one-year ANDS/CAIRSS liaison position at USQ to help bring these stories to the CAIRSS community, start putting up data-management resources on the CAIRSS site, and help the IR community keep in contact with ANDS.

  2. Consider RIF-CS, the new ANDS-developed metadata format for describing data collections.

    The Registry Interchange Format – Collections and Services (RIF-CS) Schema was developed as a data interchange format for supporting the submission of metadata to a collections service registry.

    http://ands.org.au/resource/rif-cs.html

    This format will be important to those IRs which end up hosting data collections and/or metadata about data collections; a rough sketch of a RIF-CS collection record follows this list. I am encouraging the ANDS team to hold at least one meeting for the developers and metadata specialists in the repository community to tell us the background to this schema and go through the thinking behind the design. (I know there’s a workshop about deploying the new standard, Gumboots for the Data Deluge: defining and describing collections for the Australian Research Data Commons, but I am thinking more about one that might (a) convince us why we need a new standard by explaining the thinking behind its design and (b) take input into future directions for the standard.)

  3. Think about the Australian Access Federation. It’s still rolling out, apparently. I have always been quite sceptical about some of the more complicated use-cases involving role-based authorisation to repository resources, but I think the current AAF story is a bit more believable; I wrote about promising developments in the Australian Access Federation on my blog. Repository managers: if you are not already in the AAF, it would be worthwhile checking with your local IT department. And if you have any IR requirements to lock down content for AAF users, let Tim McCallum, the CAIRSS techie, know and we’ll see what we can do to help.

  4. Looking beyond the kinds of interfaces we’re using now, there was a wonderful presentation from Mitchell Whitelaw on new visualisation techniques for navigating large data sets: Exploring Archival Collections with Interactive Visualisation. This was a revelation to me: seeing a word-cloud linked to a dynamic visualisation. Do yourself a favour and check out the A1 Explorer screencast. In the same session Duncan Dickinson from our team at USQ showed some early work we have done on bringing data capture down to the desktop with The Fascinator: Creating an eResearch Desktop for the Humanities. We’ll definitely be looking at how we can let you use Mitchell’s tools over your data.

  5. Get ready for web-scale annotation services as part of the scholarly communications process. I missed the presentation on Universal Collaborative Annotations with Thin Clients Supporting User Feedback to the Atlas of Living Australia, but I heard about it from a few people. The team here at ADFI was inspired to plug the open source tools released by UQ into our ICE publishing system as part of ICE week and The Fascinator (if you’re technically inclined you can try it out). It’s early days yet, but I think the standards behind these systems will be key to a new world of peer review, thesis examination and public participation in scholarship, not to mention collaboration on document authoring, assignment marking and thesis supervision.
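
Returning to point 2: to give a feel for RIF-CS, here is a rough sketch of a minimal collection record. The element names follow the published schema as I understand it, but the group, key, identifiers and relation shown are all invented for illustration, so check the ANDS documentation before relying on any of this.

    <registryObjects xmlns="http://ands.org.au/standards/rif-cs/registryObjects">
      <!-- all values below are invented for illustration -->
      <registryObject group="Example University">
        <key>example.edu.au/collection/0001</key>
        <originatingSource>http://repository.example.edu.au</originatingSource>
        <collection type="dataset">
          <name type="primary">
            <namePart>Water quality readings, 1990-2005</namePart>
          </name>
          <description type="brief">Sensor readings collected for a hypothetical project.</description>
          <relatedObject>
            <!-- links the collection to the researcher who collected it -->
            <key>example.edu.au/party/smith-a</key>
            <relation type="hasCollector"/>
          </relatedObject>
        </collection>
      </registryObject>
    </registryObjects>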

Copyright Peter Sefton, 2009. Licensed under Creative Commons Attribution-Share Alike 2.5 Australia. <http://creativecommons.org/licenses/by-sa/2.5/au/>


This post was written in OpenOffice.org, using templates and tools provided by the Integrated Content Environment project and published to WordPress using The Fascinator.

2009/07/21

NicNames and People Australia – some thoughts for CAIRSS

Filed under: Uncategorized — ptsefton @ 1:01 pm

Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog

This post looks at a couple of name-related services that will be of interest to CAIRSS people. There are a lot of author ID systems; this post is not a survey of all of them, so see this list.

I visited the National Library of Australia on Wednesday June 17th to look at their new Single Business Discovery System prototype and the People Australia service. The SBDS is much more interesting than the name implies (I’m assured the name will change); it lets you search a big index of:

I’d like to talk about one part of this, People Australia, and reflect on how this fits with the ARROW mini-project looking at researcher identities, the NicNames Project. I dropped in on NicNames at Swinburne University in Melbourne on July 15th.

This blog post has been reviewed by staff from the NLA and Swinburne. Thanks, all; I tried to address the things you raised, but please use the comments if I missed anything.

Why would a repository manager care about these services? Both of these systems are built around identities of people. They promise to allow us to identify researchers uniquely and link those identities to research outputs and other stuff in the repository. No more searching through all the Lees and Smiths for the one you want or trying to bring together publications for people who have changed their name.

It is well understood that people will have lots of IDs from lots of sources: university staff numbers (which are probably private), tax file numbers, email addresses, OpenIDs and Australian Access Federation IDs (eventually, maybe).

I will look at how I think these services might look to a variety of players: repository managers, depositors and end-users. For more background about some of the issues I recommend that you read this post by Andy Powell first. He reports that the UK scene seems to involve attempts to create centralised name services, whereas here we are looking at distributed systems that talk to each other. Yes, People Australia is a large project at the National Library of Australia, but it is designed to be part of a decentralised mesh of naming services.

Repository Managers: Establishing Local IDs

If you’re a repository manager and you want to get usable, stable, persistent IDs, this is what you will do:

  1. Arrange to have NicNames installed. It’s a web application which is installed separately from the repository. There might be a local instance at your site, or potentially you could share with another institution.

  2. Load all your repository data into the NicNames website. This involves extracting data from the repository in, say, MARCXML (a sketch of such a record follows this list) and then feeding that to NicNames.

    The NicNames application will consider the data and try to identify unique authors by looking at the strings used to identify them and by looking at co-authors and subject codes. So if there are two Alan Smiths working in different disciplines, it will try to work out that they are different people. NicNames can be configured to store as much information about each identity as you like, such as the multiple different name strings that have been used to refer to it and, potentially, the ID of any repository items (this will become important later).

    Once it has done this, it will give the repository manager some kind of interface to confirm NicNames’ suspicions and/or correct it when it gets things wrong. Behind the scenes, NicNames assigns unique IDs to individuals. These are not related to People Australia IDs at this stage of the process. So now, instead of just a string like Alan Smith in the name field of some metadata, we have something like:

     <person><name>Alan Smith</name><id>NN:00000001</id></person>

    We might also have other records that use a different form of the name but carry the same ID, which is the whole point of this exercise. (You could also store a canonical string like Smith, Alan to save the software having to look it up in the NicNames system, but then what if you have to change it?)

    <person><name>Smith, Prof. A</name><id>NN:00000001</id></person>

  3. When you are happy with the names, they can be imported back into the repository. This will have to be coded for each repository platform separately, and at this stage it has not been done. One of the issues is that VITAL (which is the ARROW platform) does not have any APIs to allow this kind of batch update; something would have to be written at the level of Fedora, which does the data storage under VITAL.

    Alternatively, it would be possible to use an architecture where you didn’t have to change the repository at all: it would continue to store strings for names, and the NicNames system would hold the data about IDs. A third system, a discovery layer, could then present a browse-and-search view of the repository-plus-name-IDs. That might sound a bit problematic, but it might be pragmatic where it is difficult to change the core repository software (even plugins we develop at USQ take months to make it into our local ePrints). It’s actually a semantic-web approach where different facts can be distributed on the web. I’ll write up this design pattern in another post on my own blog.
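
To make step 2 concrete, here is a minimal sketch of the kind of MARCXML you might extract from a repository and feed to NicNames. The record content is invented; only the structure (a 100 field for the main author, 700 fields for co-authors) is standard MARCXML, and the co-author fields matter because NicNames uses them as disambiguation evidence.

    <record xmlns="http://www.loc.gov/MARC21/slim">
      <!-- hypothetical record for illustration only -->
      <datafield tag="245" ind1="1" ind2="0">
        <subfield code="a">A study of repository name matching</subfield>
      </datafield>
      <datafield tag="100" ind1="1" ind2=" ">
        <!-- primary author, as a bare string: this is what NicNames disambiguates -->
        <subfield code="a">Smith, Alan</subfield>
      </datafield>
      <datafield tag="700" ind1="1" ind2=" ">
        <!-- co-author: evidence NicNames can use to tell Alan Smiths apart -->
        <subfield code="a">Jones, B.</subfield>
      </datafield>
    </record>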

At Swinburne the main use case involves repository staff running batches of records from EndNote into VITAL, as that’s the way their repository workflow is set up: it’s all done in the library, with no self-deposit. They will:

  • Transform EndNote data to MARCXML.

  • Put the MARCXML into NicNames as above and sort out the names.

  • Export the MARCXML back out of NicNames, with added IDs; a sketch of what this might look like follows this list.

  • Put the MARCXML into VITAL as per normal practice at Swinburne.
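
One plausible shape for the exported name field: the standard MARC $0 subfield (authority record control number) could carry the NicNames ID. How NicNames will actually encode its IDs in the exported MARCXML isn’t specified here, so treat the subfield choice as an assumption.

    <datafield tag="100" ind1="1" ind2=" ">
      <subfield code="a">Smith, Alan</subfield>
      <!-- assumption: NicNames ID carried in the $0 control-number subfield -->
      <subfield code="0">NN:00000001</subfield>
    </datafield>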

I talked to Swinburne staff about having a look at Squire, the ARROW-sponsored replacement for VITAL, which might be integrated into their workflow and adapted to help inject NicNames IDs back into Fedora. So for new records there will be a unique identifier in the record. How this will be displayed in VITAL remains to be seen.

Depositors

Now we turn our attention to the users, whoever is putting in resources using some kind of web ingest system. The NicNames team are starting with VALET, the open source repository ingest tool that comes with VITAL, but it should be simple to plug it into other systems like ePrints. Here’s what a typical depositor will do:

  1. Start depositing a new item as usual.

  2. Start typing in a name field for, say, an author.

    If what you are typing appears to match an existing identity, a form will pop up where you can pick which author you mean. See the screenshot on the NicNames blog.

    If there are no matches then there will have to be some way to create a new identity in NicNames.

    Behind the scenes the ingest application will be talking to NicNames; a hypothetical sketch of such an exchange follows this list.
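
To make the idea concrete, here is an entirely hypothetical sketch of the kind of response a name-lookup service could return for the partial string "Smith, A". The NicNames API had not been published when this was written, so every element name and value here is invented.

    <!-- hypothetical query: /lookup?name=Smith,+A -->
    <matches>
      <identity id="NN:00000001">
        <name>Smith, Alan</name>
        <!-- some context, such as a discipline, to help the depositor choose -->
        <affiliation>Chemistry</affiliation>
      </identity>
      <identity id="NN:00000042">
        <name>Smith, Prof. A</name>
        <affiliation>History</affiliation>
      </identity>
    </matches>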

That’s it. Your repository now knows a definite identity for each person associated with a resource, so you can have as many Alan Smiths as you like and be able to deal with people who have published under several names. There is still the matter of what the interface will look like. Rebecca Parker tells me:

One of the outcomes of the project will be recommendations from a user-centred design process … we’ll be making suggestions about best practice for displaying name variants in research databases generally, with obvious local impact on how to manage these in institutional repositories.

Where does the NLA come into this?

So if a repository manager has used NicNames to establish IDs for people, and depositors have associated new deposits with those IDs, we have a local unique ID for each person in the repository. But that doesn’t help when records are harvested by the NLA for their Discovery Service, because to that system a NicNames ID is meaningless unless it can be associated with the People Australia name system. What we want is a way to tie the NicNames ID to the People Australia ID.

People Australia is designed to work with multiple distributed identity-management systems; it keeps an EAC record for each entity (person or organisation), which can have multiple identifiers associated with it. I assume what’s needed in the case of repository content is either to match a NicNames identity with an existing People Australia identity or, if the match can’t be made, to make a new People Australia identifier with an EAC record that contains the NicNames ID.
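
For a rough idea of what that EAC record might hold, here is an abbreviated sketch using element names from the draft EAC-CPF schema. The People Australia identifier shown is invented, and a real record needs more control metadata than shown.

    <eac-cpf xmlns="urn:isbn:1-931666-33-4">
      <control>
        <!-- invented People Australia identifier -->
        <recordId>nla.party-000001</recordId>
        <!-- the local NicNames ID for the same person -->
        <otherRecordId>NN:00000001</otherRecordId>
      </control>
      <cpfDescription>
        <identity>
          <entityType>person</entityType>
          <nameEntry><part>Smith, Alan</part></nameEntry>
          <nameEntry><part>Smith, Prof. A</part></nameEntry>
        </identity>
      </cpfDescription>
    </eac-cpf>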

The process will work like this:

  1. People Australia will harvest name data out of NicNames systems using OAI-PMH and will attempt to match them to People Australia identities, and if that fails, make new ones. (I’m still not really clear on how this might happen; this bit is not in either system yet.)

  2. Now, when the People Australia harvester pulls OAI-PMH records out of the repository, they will have NicNames IDs in them in the Dublin Core, and the NLA system will be able to associate those with People Australia IDs. A sketch of what such a record might look like follows this list.
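
For illustration, here is a minimal sketch of an OAI-PMH Dublin Core record carrying a NicNames ID. The OAI-PMH and oai_dc envelope elements are standard, but exactly how the ID would be embedded in the Dublin Core was not yet settled, so the creator convention shown is an assumption.

    <record>
      <header>
        <!-- hypothetical item identifier -->
        <identifier>oai:repository.example.edu.au:1234</identifier>
        <datestamp>2009-07-21</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A study of repository name matching</dc:title>
          <!-- assumption: NicNames ID appended to the creator string -->
          <dc:creator>Smith, Alan (NN:00000001)</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>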

Basil Dewhurst at the NLA summed up some of the advantages of public persistent IDs, which is what People Australia will provide (NicNames can’t provide that on its own; it’s a bit of software, and you need software plus governance to provide persistence):

The case for People Australia IDs is that they’re _persistent_, _public_, and enable discovery of information about people and the resources they create across collections and domains. Importantly, we plan to pull in the VIAF names later in the year and this means that we can link to researchers internationally as well as in Australia. Research doesn’t stop at the coastline!

He notes that machine-to-machine interfaces are another key advantage of these systems, which I take to mean that repositories can talk to each other to build a distributed identity system.

Summary: what does this mean for repository managers?

For most repositories I think it is a case of waiting to see what happens. At the moment the People Australia service is not creating IDs for material that comes in via the ARROW Discovery Service, and the NicNames project has not yet released any code or got it working in any of the partner institutions. When NicNames is released, the CAIRSS team will have a look and see what would be involved in getting it into production across the various platforms in use in Australia, and we’ll keep talking to the NLA about how NicNames IDs will flow through to People Australia.


* It’s impressive that when I copied the tabs from the top of the SBDS page they pasted into my document as bullet points. That’s clean design.

2009/06/03

Open Repositories 2009 – Peter Sefton's thoughts

Filed under: Uncategorized — ptsefton @ 4:20 pm

Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog

I posted a general summary of my trip to Open Repositories 2009 over on my blog (this is Peter Sefton writing). Kate Watson asked me to make some comments for CAIRSS. As with Tim McCallum’s trip, mine was funded by USQ.

The big theme mentioned in my post, and this is something that I used to go on about in the RUBRIC project, is that a repository is not a bit of software; it’s a way of life. Put more formally, repositories are more about governance of services than they are about software applications. There was a fair bit of this ‘set of services’ approach evident at OR09, and I take this as a positive sign that we are moving beyond the idea that the bit of software you call The Repository is all there is. For a local example, look at the way some sites are going to deal with evidence for the ERA by putting files up on a simple web server. Provided this is accompanied by procedures and governance to ensure the materials persist for an appropriate length of time, I think it’s just part of the repository service offered by the library to the institution.

I didn’t see much at the conference about institutional repositories specifically, so there’s not much to report to the CAIRSS community on that front, and as I’m primarily in a technical role I spent a fair bit of time with the technical crowd.

One thing I think is striking is how well ePrints is doing; it seems that their model of single-institution support is a good way to provide vibrant software: they are producing new releases at least as fast as DSpace and Fedora. I get the sense that the administrative overhead of establishing the DuraSpace organization and managing highly distributed developer teams is making progress hard for those platforms at the moment. When we did the RUBRIC project I think there was a feeling that ePrints was old technology and ‘better’ Fedora-based solutions were going to be the way forward, but at least one Australian ePrints site has stayed with the software and not gone ahead with a planned move to a Fedora-based system. Note my prediction in my blog post that there will be a Fedora back-end option for ePrints by the time Open Repositories 2010 comes around. At this stage I think ePrints is a really good solution for document-based repositories. Me, I would not manage other kinds of collection with it, but at Southampton they do, and I may be eating those words soon.

I pointed this out in my other post, but I will do so here as well: USQ now has ePrints hooked up to our ICE content management system, meaning that we have papers, presentations and posters going in not just as PDF but as web pages. This is going to allow us to do much more with linking documents to data and providing rich interactive experiences. My last few items all have HTML as the primary object, with PDF available if you click through. There are a few glitches to sort out, but we’re getting there.

VTLS, of VITAL fame, had a small presence, pitching a bit of open source software for OAI-PMH. Nice to see them contributing in this way.

Remember, it’s not a software package, it’s a state of mind.
