CAIRSS Blog

2009/06/10

Open Repositories Conference 09 Part 2

Filed under: Open Repositories 09,The Fascinator — tpmccallum @ 3:08 pm

Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog

Performance

During some recreation/mingling time I learned of BASE, a multi-disciplinary search engine for academically relevant web resources. I spoke to a couple of its developers about two important issues relating to repository systems and solutions. The first issue was performance. We discussed how BASE uses a search technology called FAST. Microsoft acquired FAST Search & Transfer in 2008, and the product is now known as FAST ESP from Microsoft. BASE currently holds over 20 million records from 1,265 sources around the globe and is contributing to the Digital Repository Infrastructure Vision for European Research (DRIVER).

The average search that I did on generic topics produced about 150,000 hits out of 20,084,184 items, all in fractions of a second. Very impressive performance. I have never tested with item counts this large and certainly have not seen results like this with even a fraction of the content. It appears that BASE holds metadata, full text and precise bibliographic data, and uses OAI-PMH for harvesting. I searched for quite a while to get a datastream such as a PDF served via a BASE URL but was redirected every time. I am therefore assuming that no datastreams are stored locally (metadata only). Please correct me if I am wrong about this.
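For readers unfamiliar with OAI-PMH, the harvesting model is just HTTP requests with a `verb` parameter returning XML. The sketch below shows the two sides of that exchange: building a ListRecords request URL and pulling Dublin Core titles out of a response. The endpoint URL is hypothetical, and the sample XML is a minimal hand-written response, not actual BASE output.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

DC_NS = "{http://purl.org/dc/elements/1.1/}"

def build_listrecords_url(base_url, metadata_prefix="oai_dc", set_spec=None):
    """Build an OAI-PMH ListRecords request URL."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)

def extract_titles(oai_xml):
    """Pull dc:title values out of an OAI-PMH ListRecords response."""
    root = ET.fromstring(oai_xml)
    return [t.text for t in root.iter(DC_NS + "title")]

# Minimal hand-written sample response (not real BASE output).
SAMPLE = """<?xml version="1.0"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record><metadata>
      <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                 xmlns:dc="http://purl.org/dc/elements/1.1/">
        <dc:title>A Study of Watermills</dc:title>
      </oai_dc:dc>
    </metadata></record>
  </ListRecords>
</OAI-PMH>"""

# Hypothetical repository endpoint.
print(build_listrecords_url("http://repository.example.edu/oai"))
print(extract_titles(SAMPLE))
```

A real harvester would also follow the `resumptionToken` element across pages of results, which is how harvests of millions of records are batched.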

The Fascinator

I do not wish to make any performance comparisons with BASE, as The Fascinator has only been tested with a minute number of records by comparison. The interesting point I would like to raise is that The Fascinator is not only able to harvest and serve metadata, but can harvest and store datastream content locally as well. The Fascinator can be configured in two ways. The first is to have it engage directly with Fedora and harvest metadata as well as datastreams using Fedora's APIs. The second is to configure it to harvest using OAI-ORE: if there are references to datastreams in the resource maps, they will be downloaded and stored locally along with whatever metadata it was configured to harvest at the time. The University of Southern Queensland, in conjunction with the CAIRSS project, is getting ready to carry out a nationwide harvest called the Australian University Repository Census (AURC); this harvest will be carried out using The Fascinator software.
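The OAI-ORE side of this is worth a quick illustration. In the Atom serialization of an ORE resource map, each aggregated resource appears as an `atom:link` whose `rel` is the `ore:aggregates` term, so finding the datastreams to download is a matter of collecting those hrefs. This is a minimal sketch of that step, not The Fascinator's actual code; the sample resource map and its URLs are invented for illustration.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"

def datastream_urls(resource_map_xml):
    """Return the URLs of resources aggregated by an OAI-ORE resource map
    (Atom serialization): each is an atom:link with rel=ore:aggregates."""
    root = ET.fromstring(resource_map_xml)
    return [link.get("href")
            for link in root.iter(ATOM_NS + "link")
            if link.get("rel") == ORE_AGGREGATES]

# Hand-written sample resource map; URLs are hypothetical.
SAMPLE_MAP = """<?xml version="1.0"?>
<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Resource map for item 123</title>
  <link rel="http://www.openarchives.org/ore/terms/aggregates"
        href="http://repository.example.edu/items/123/thesis.pdf"/>
  <link rel="http://www.openarchives.org/ore/terms/aggregates"
        href="http://repository.example.edu/items/123/metadata.xml"/>
</entry>"""

print(datastream_urls(SAMPLE_MAP))
```

A harvester configured this way would simply fetch each returned URL and store the bytes alongside the harvested metadata.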

Normalization

As I mentioned above, another important issue was brought up in our casual conversation: normalization. It appears that this is a problem for everyone in the repository and harvesting space. Before the conference I was throwing a developer challenge idea around in my head: creating an application, or really more of a web service, that would harvest a repository's metadata and display it in a web browser, pointing out obvious mistakes first, followed by suggestions for normalization (all the while linking back to the item, so that the user could organize the editing of that item). I talked to Oliver Lucido briefly (I could not discuss it with Peter, as he was a judge for the challenge), and we came to the conclusion that this is pretty much what we are doing with AURC using The Fascinator. This being my first conference, I was unsure how much conference content I would miss by trying to code something up for two of the four days, so that idea kind of died.

Now that I am back I am revisiting that idea and wondering whether it is possible to combine some pieces that already exist with some software (plagiarism-detection style) in the hope of creating a web service capable of pointing out normalization problems on an institution-by-institution basis, giving suggestions for conforming with other institutions and/or repairing internal normalization issues. I think ultimately the best solution would be for each individual institution to be able to see and repair normalization issues in house.
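To make the idea concrete, here is a toy sketch of one such check: group harvested records by institution, tally the raw values of a field (dc:type here), and flag values that differ from the most common form only in case or whitespace. The record structure and field names are assumptions for illustration, not a real harvested schema, and a production service would obviously need fuzzier matching than this.

```python
from collections import Counter, defaultdict

def normalization_report(records):
    """For each institution, find the most common dc:type value and flag
    variants that differ from it only in case or surrounding whitespace."""
    by_inst = defaultdict(Counter)
    for rec in records:
        by_inst[rec["institution"]][rec["type"]] += 1
    report = {}
    for inst, counts in by_inst.items():
        canonical = counts.most_common(1)[0][0]
        variants = [v for v in counts
                    if v != canonical
                    and v.strip().lower() == canonical.strip().lower()]
        report[inst] = {"canonical": canonical, "variants": variants}
    return report

# Invented sample records, purely for illustration.
records = [
    {"institution": "Uni A", "type": "Thesis"},
    {"institution": "Uni A", "type": "Thesis"},
    {"institution": "Uni A", "type": "thesis "},
    {"institution": "Uni B", "type": "Journal Article"},
]
print(normalization_report(records))
```

The per-institution grouping matters: each institution sees only its own inconsistencies, which fits the goal of letting repositories repair normalization issues in house.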


2 Comments »

  1. One of the problems with normalization is that things need to be normalized for participation in a federation, but locally institutions want to see their own terminology. So while a service like the one you propose might be useful, I think we need to supply an easy way for people to adapt their OAI-PMH feeds, which is a possible new project as an outcome of the repository census, if we get the go-ahead from the CAIRSS steering committee to work on it.

    Comment by Peter Sefton — 2009/06/10 @ 4:44 pm

  2. >“developer challenge” … creating an application, well more of a web service really that would harvest a repositories metadata and then display it in a web browser

    Simply embed some of the basic metadata in the URL thus…

    http://www.myuni.edu/repository/public/full-text/history/history-of-technology/thesis/2009-adams-peter-on-watermills.pdf

    …and if there’s a strict standard set among universities for how to structure and name repository directories, then IT techies can’t break the URL by juggling the directories.
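    The scheme the commenter proposes could be sketched as a simple parser over the path segments. The path convention below (access level / subject / subtopic / type / year-surname-given-slug.ext) is inferred from the example URL above and is entirely hypothetical, as no such inter-university standard exists.

    ```python
    from urllib.parse import urlparse

    def metadata_from_url(url):
        """Recover basic metadata from a conventionally structured
        repository URL (hypothetical convention inferred from the
        example: .../public/<access>/<subject>/<subtopic>/<type>/
        <year>-<surname>-<given>-<slug>.<ext>)."""
        parts = urlparse(url).path.strip("/").split("/")
        access, subject, subtopic, item_type, filename = parts[2:7]
        stem, _, ext = filename.rpartition(".")
        year, surname, given, slug = stem.split("-", 3)
        return {
            "access": access, "subject": subject, "subtopic": subtopic,
            "type": item_type, "year": year,
            "author": f"{given.title()} {surname.title()}",
            "title": slug.replace("-", " "),
            "format": ext,
        }

    print(metadata_from_url(
        "http://www.myuni.edu/repository/public/full-text/history/"
        "history-of-technology/thesis/2009-adams-peter-on-watermills.pdf"))
    ```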

    Comment by Borrowind — 2009/06/16 @ 9:37 pm

