CAIRSS Blog

2009/06/10

Open Repositories Conference 09 Part 2

Filed under: Open Repositories 09,The Fascinator — tpmccallum @ 3:08 pm

Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog

Performance

During some recreation/mingling time I learned of BASE, a multi-disciplinary search engine for academically relevant web resources. I spoke to a couple of its developers about two important issues relating to repository systems/solutions. The first issue was performance. We discussed how BASE uses a search technology called FAST. I believe Microsoft acquired FAST Search & Transfer in 2008, and the product is now known as FAST ESP from Microsoft. BASE currently holds over 20 million records from 1,265 sources around the globe and is contributing to the Digital Repository Infrastructure Vision for European Research (DRIVER).

The average search I did on generic topics produced about 150,000 hits out of 20,084,184 items, all in fractions of a second. Very impressive performance. I have never tested with item counts this large and certainly have not seen results like this with even a fraction of the content. It appears that BASE holds metadata, full text and precise bibliographic data and uses OAI-PMH for harvesting. I tried for quite a while to get a data stream such as a PDF served via the BASE URL but was redirected every time, so I am assuming that no data streams are stored locally (metadata only). Please correct me if I am wrong about this.
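
For anyone curious about what the harvesting side looks like in practice, here is a minimal sketch of an OAI-PMH ListRecords loop in Python. The endpoint URL is hypothetical and real harvesters (BASE included, presumably) do a great deal more, but the verb and resumptionToken loop is roughly what the protocol boils down to at this level.

    # Minimal sketch of an OAI-PMH metadata harvest, using a hypothetical
    # endpoint URL; real aggregators sit behind far more machinery than this.
    import urllib.request
    import xml.etree.ElementTree as ET

    OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
    DC_NS = "{http://purl.org/dc/elements/1.1/}"

    def harvest(base_url, metadata_prefix="oai_dc"):
        """Yield (identifier, title) pairs from ListRecords responses."""
        url = f"{base_url}?verb=ListRecords&metadataPrefix={metadata_prefix}"
        while url:
            with urllib.request.urlopen(url) as response:
                tree = ET.parse(response)
            for record in tree.iter(OAI_NS + "record"):
                header = record.find(OAI_NS + "header")
                identifier = header.findtext(OAI_NS + "identifier")
                title = record.findtext(f".//{DC_NS}title")
                yield identifier, title
            # Follow the resumption token, if the repository issued one.
            token = (tree.findtext(f".//{OAI_NS}resumptionToken") or "").strip()
            url = (f"{base_url}?verb=ListRecords&resumptionToken={token}"
                   if token else None)

    # Example (hypothetical endpoint):
    # for oai_id, title in harvest("http://repository.example.edu/oai"):
    #     print(oai_id, title)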

The Fascinator

I do not wish to make any performance comparisons with BASE, as The Fascinator has only been tested with a tiny number of records by comparison. The interesting point I would like to raise is that The Fascinator is not only able to harvest and provide metadata, but can also harvest and store data stream content locally. It can be configured in two ways. The first is to have it engage directly with Fedora and harvest metadata as well as data streams using Fedora's APIs. The second is to configure it to harvest using OAI-ORE: if there are references to data streams in the resource maps, they will be downloaded and stored locally along with the metadata it was configured to harvest at the time. The University of Southern Queensland, in conjunction with the CAIRSS project, is getting ready to carry out a nationwide harvest called the Australian University Repository Census (AURC), and this harvest will be carried out using The Fascinator software.
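
To make the OAI-ORE side a little more concrete, here is a rough sketch of pulling down the data streams that an Atom-serialised resource map says it aggregates. The resource map URL and storage layout are made up for illustration, and The Fascinator's actual harvester is configurable and far more involved than this.

    # Rough sketch: download everything an Atom resource map aggregates.
    # The URL and on-disk layout are illustrative assumptions only.
    import os
    import urllib.request
    import xml.etree.ElementTree as ET

    ATOM_NS = "{http://www.w3.org/2005/Atom}"
    ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"

    def fetch_resource_map(rem_url, store_dir="datastreams"):
        """Store locally every data stream the resource map references."""
        os.makedirs(store_dir, exist_ok=True)
        with urllib.request.urlopen(rem_url) as response:
            feed = ET.parse(response)
        for link in feed.iter(ATOM_NS + "link"):
            if link.get("rel") != ORE_AGGREGATES:
                continue
            href = link.get("href")
            local_name = os.path.join(store_dir,
                                      os.path.basename(href) or "resource")
            # Keep the data stream alongside the harvested metadata.
            with urllib.request.urlopen(href) as stream, \
                 open(local_name, "wb") as out:
                out.write(stream.read())
            print("stored", href, "as", local_name)

    # Example (hypothetical resource map URL):
    # fetch_resource_map("http://repository.example.edu/ore/oai:example:1234")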

Normalization

As I mentioned above, there was another important issue brought up in our casual conversation: normalization. It appears that this is a problem for everyone in the repository space and for harvesting projects. Before the conference I was throwing a developer challenge idea around in my head: an application, or more of a web service really, that would harvest a repository's metadata and display it in a web browser, pointing out obvious mistakes first, followed by suggestions for normalization (all the while linking back to the item, so that the user could organize the editing of that item). I talked to Oliver Lucido briefly (I could not discuss it with Peter as he was a judge for the challenge), and we came to the conclusion that this is pretty much what we are doing with AURC using The Fascinator. This being my first conference, I was unsure how much conference content I would miss out on by trying to code something up for 2 out of the 4 days… so that idea kind of died.

Now that I am back I am revisiting that idea and wondering if it is possible to combine some pieces that exist already with some software (plagiarism-detection style) to create a web service capable of pointing out normalization problems on an institution-by-institution basis, giving suggestions for conforming with other institutions and/or repairing internal normalization issues. I think ultimately the best solution would be for each individual institution to be able to see and repair normalization issues in house.
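
To give a flavour of the kind of report I have in mind, here is a toy sketch that checks a few Dublin Core style fields and links each problem back to the item so someone can fix it in house. The field names, rules and example record are purely illustrative assumptions, not any agreed vocabulary.

    # Toy sketch of a normalization report over already-harvested records,
    # assumed here to be simple dicts of Dublin Core style fields.
    import re

    DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")  # ISO 8601 date/year
    KNOWN_TYPES = {"article", "thesis", "book chapter", "conference paper"}

    def check_record(record):
        """Return a list of human-readable problems for one record."""
        problems = []
        date = record.get("date", "")
        if date and not DATE_PATTERN.match(date):
            problems.append(f"date '{date}' is not ISO 8601")
        rtype = record.get("type", "").lower()
        if rtype and rtype not in KNOWN_TYPES:
            problems.append(f"type '{record['type']}' is outside the shared vocabulary")
        if not record.get("title", "").strip():
            problems.append("title is missing or blank")
        return problems

    def report(records):
        """Print problems per item, linking back to the item's URL."""
        for record in records:
            for problem in check_record(record):
                print(f"{record.get('url', 'unknown item')}: {problem}")

    # Example with a made-up record:
    # report([{"url": "http://eprints.example.edu/42/", "title": "A paper",
    #          "date": "10/06/2009", "type": "Journal Article"}])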


2009/06/03

Open Repositories 2009 – Peter Sefton's thoughts

Filed under: Uncategorized — ptsefton @ 4:20 pm

Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog

I posted a general summary of my trip to Open Repositories 2009 over on my blog (this is Peter Sefton writing). Katy Watson asked me to make some comments for CAIRSS. As with Tim McCallum’s trip I was funded by USQ.

The big theme mentioned in my post, and this is something that I used to go on about in the RUBRIC project, is that a repository is not a bit of software; it's a way of life. Put more formally, repositories are more about governance of services than they are about software applications. There was a fair bit of this 'set of services' approach evident at OR09, and I take this as a positive sign that we are moving beyond the idea that the bit of software you call The Repository is all there is. For a local example, look at the way some sites are going to deal with evidence for the ERA by putting files up on a simple web server. Provided this is accompanied by procedures and governance to ensure the materials persist for an appropriate length of time, I think it's just part of the repository service offered by the library to the institution.

I didn’t see much at the conference about institutional repositories specifically, so there is not much to report to the CAIRSS community on that front, and as I’m primarily in a technical role I spent a fair bit of time with the technical crowd.

One thing I think is striking is how well ePrints is doing; it seems that their model of single-institution support is a good way to provide vibrant software, and they are producing new releases at least as fast as DSpace and Fedora. I get the sense that the administrative overhead of establishing the DuraSpace organization and managing highly distributed developer teams is making progress hard for those platforms at the moment. When we did the RUBRIC project I think there was a feeling that ePrints was old technology and ‘better’ Fedora-based solutions were going to be the way forward, but at least one Australian ePrints site has stayed with the software and not gone ahead with a planned move to a Fedora-based system. Note my prediction in my blog post that there will be a Fedora back-end option for ePrints by the time Open Repositories 2010 comes around. At this stage I think ePrints is a really good solution for document-based repositories. Me, I would not be managing other kinds of collection with it, but at Southampton they do and I may be eating those words soon.

I pointed this out in my other post but I will do so here as well. USQ now has ePrints hooked up to our ICE content management system, meaning that we have papers, presentations and posters going in not just as PDF but as web pages. This is going to allow us to do much more with linking documents to data and providing rich interactive experiences. My last few items all have HTML as the primary object, with PDF available if you click through. There are a few glitches to sort out, but we’re getting there.

VTLS, of VITAL fame, had a small presence, pitching a bit of open source software for OAI-PMH. Nice to see them contributing in this way.

Remember, it’s not a software package, it’s a state of mind.
