CAIRSS Blog

2009/06/10

Open Repositories Conference 09 Part 2

Filed under: Open Repositories 09,The Fascinator — tpmccallum @ 3:08 pm

Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog

Performance

During some recreation/mingling time I learned of BASE, a multi-disciplinary search engine for academically relevant web resources. I spoke to a couple of developers about two important issues relating to repository systems/solutions. The first issue was performance. We discussed how BASE uses a search technology called FAST. I believe that Microsoft acquired FAST Search & Transfer in 2008, and the product is now known as FAST ESP from Microsoft. BASE currently holds over 20 million records from 1265 sources around the globe and is contributing to the Digital Repository Infrastructure Vision for European Research (DRIVER).

The average search that I did on generic topics produced about 150,000 hits out of 20,084,184 items, all in fractions of a second. Very impressive performance. I have never tested with item counts this large, and I have certainly not seen results like this with even a fraction of the content. It appears that BASE holds metadata, full text and precise bibliographic data, and uses OAI-PMH for harvesting. I searched for quite a while to get a data stream such as a PDF served via the BASE URL but was redirected every time. I am therefore assuming that no data streams are stored locally (metadata only). Guys, please correct me if I am wrong about this.
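To make the OAI-PMH harvesting step a little more concrete, here is a minimal sketch in Python. It is not how BASE is implemented. The base URL, the function names and the sample response are all made up for illustration; only the OAI-PMH verb, arguments and XML namespaces come from the protocol itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}


def build_list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL (base_url is a hypothetical endpoint)."""
    if resumption_token:
        # A resumptionToken is an exclusive argument: only the verb goes with it.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, next_token) from one ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        titles = [t.text or "" for t in rec.iterfind(".//dc:title", NS)]
        records.append({"identifier": identifier, "titles": titles})
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token.strip() or None)


# Invented sample response, trimmed to the elements the parser reads.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example title</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>abc123</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    print(build_list_records_url("http://example.org/oai"))
    print(parse_list_records(SAMPLE))
```

A real harvester would fetch each URL, call the parser, and keep requesting with the returned token until no token comes back. Note that only metadata travels this way, which fits the observation that the data streams themselves stay with the source repository.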

The Fascinator

I do not wish to make any performance comparisons with BASE at all, as The Fascinator has only been tested with a tiny number of records compared to BASE. The interesting point that I would like to raise is that The Fascinator can not only harvest and provide metadata, but can also harvest and store data stream content locally. It is possible to configure The Fascinator to do this in two ways. The first way is to have it engage directly with Fedora and harvest metadata as well as data streams using Fedora's APIs. The second way is to configure The Fascinator to harvest using OAI-ORE. If there are references to data streams in the resource maps, they will be downloaded and stored locally along with the metadata it was configured to harvest at the time. The University of Southern Queensland, in conjunction with the CAIRSS project, is getting ready to carry out a nationwide harvest called the Australian University Repository Census (AURC). This harvest will be carried out using The Fascinator software.
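As a rough illustration of the OAI-ORE route, the sketch below pulls the aggregated resource URLs out of a resource map in the Atom serialization. It is an assumption-laden example, not The Fascinator's actual code. The function name and the sample entry are invented, and a real harvester would go on to download each URL and store it next to the metadata.

```python
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
# Link relation used by the ORE Atom serialization for aggregated resources.
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"


def aggregated_resources(resource_map_xml):
    """Return the href of every 'aggregates' link in an Atom resource map."""
    root = ET.fromstring(resource_map_xml)
    hrefs = []
    for link in root.iter("{%s}link" % ATOM):
        if link.get("rel") == ORE_AGGREGATES and link.get("href"):
            hrefs.append(link.get("href"))
    return hrefs


# Invented resource map with one splash-page link and one data stream.
SAMPLE_MAP = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://example.org/aggregation/1</id>
  <link rel="alternate" href="http://example.org/item/1"/>
  <link rel="http://www.openarchives.org/ore/terms/aggregates"
        href="http://example.org/item/1/thesis.pdf"/>
</entry>"""

if __name__ == "__main__":
    for url in aggregated_resources(SAMPLE_MAP):
        print("would download:", url)
```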

Normalization

As I mentioned above, there was another important issue that was brought up in our casual conversation: normalization. It appears that this is a problem for everyone in the repository space and in harvesting projects. Before the conference I was throwing a developer challenge idea around in my head about creating an application, well, more of a web service really, that would harvest a repository's metadata and then display it in a web browser, pointing out obvious mistakes first, followed by suggestions for normalization (all the while linking back to the item, so that the user could organize the editing of that item). I talked to Oliver Lucido briefly (I could not discuss it with Peter as he was a judge for the challenge). We came to the conclusion that this is pretty much what we are doing with AURC using The Fascinator. This being my first conference, I was unsure how much conference content I would miss out on by trying to code something up for 2 out of the 4 days, so that idea kind of died.

Now that I am back, I am revisiting that idea and wondering if it is possible to put together some pieces that exist already and combine them with some software (plagiarism detection style) in the hope of creating a web service that is capable of pointing out problems with normalization on an institution by institution basis, giving suggestions for conforming with other institutions and/or repairing internal normalization issues. I think ultimately the best solution would be for each individual institution to be able to see and repair normalization issues in house.
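The simplest form of the check described above can be sketched in a few lines of Python. This is only an illustration of the idea, not anything that exists in The Fascinator or AURC. The record layout, field names and function names are assumptions. It groups the values of one metadata field that differ only in case, spacing or punctuation, and reports which items use which spelling so that each one can be linked back to and fixed.

```python
import re
from collections import defaultdict


def normalize_key(value):
    """Collapse case, punctuation and whitespace so near-duplicates collide."""
    return re.sub(r"[^a-z0-9]+", " ", value.lower()).strip()


def find_variants(records, field):
    """Map each normalized value to its spellings, keeping only real conflicts.

    records is a list of dicts with an "id" and list-valued metadata fields
    (a hypothetical layout chosen for this sketch).
    """
    groups = defaultdict(lambda: defaultdict(list))
    for rec in records:
        for value in rec.get(field, []):
            groups[normalize_key(value)][value].append(rec["id"])
    report = {}
    for key, spellings in groups.items():
        if len(spellings) > 1:  # more than one spelling of the same value
            report[key] = dict(spellings)
    return report


if __name__ == "__main__":
    harvested = [
        {"id": "a", "type": ["Thesis"]},
        {"id": "b", "type": ["thesis "]},
        {"id": "c", "type": ["Journal Article"]},
    ]
    print(find_variants(harvested, "type"))
```

Run per institution, a report like this covers the "repair internal issues" case. Comparing the normalized keys across institutions would be one way to approach the "conform with other institutions" case, though fuzzy matching of the plagiarism detection kind would need something more than exact key comparison.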

2009/05/28

Open Repositories Conference 09

General Overview

This year's Open Repositories Conference was held in Atlanta, Georgia, USA, at the Georgia Institute of Technology Hotel and Conference Center. This was the 4th year of this annual international conference.

Participants and Sponsors

The conference had representatives from organizations including DSpace, EPrints, Fedora, VTLS, JISC, Microsoft Research, Sun Microsystems, @MIRE and NSF (National Science Foundation).

Microsoft Research

It was made quite clear to me throughout the conference that Microsoft Research is interested in carrying out research and development and is not concerned with directly profiting from its involvement. There were several open discussions during workshops about how best to create plug-in functionality for Microsoft products that would enable users to interact with repositories. There was a lot of constructive conversation about how SWORD would be integrated with new Microsoft Research products and plug-ins. There were also good discussions about whether the processing and converting of documents and metadata should be done on the client, as a web service, or directly by the repository software. The main challenge that I could see is deciding how much freedom to give to the user. Do they simply click a button upon completion of their work, or does the software allow them to interact at quite a low level with metadata and file types, allowing them to review their work in the different formats before the final submission? I am assuming that a researcher who has spent several years writing and researching would have a substantial amount of time to put the final touches on the master document to make sure that it renders correctly in HTML and PDF. It would be amazing if we could write software that handled everything behind the scenes. Perhaps eventually we will arrive at this point.

DSpace

Where is DSpace heading? Version 2.0 can be expected in early 2010.

In the meantime, 1.6 (due October 2009) will be released as a stepping stone to 2.0 and will include bug fixes.

I ran into Kim Shepherd from the Library Consortium of New Zealand on my way out of Atlanta. Kim is a DSpace committer, and we had a good conversation about DSpace 2.0 amongst other things. I will be sure to keep in touch and keep an eye on future development.

DuraSpace Organization

DuraSpace is the organization formed by the DSpace Foundation and Fedora Commons. The first technology to emerge from DuraSpace will be a product called DuraCloud. DuraCloud is a complete hosting service that uses DuraSpace partners (commercial cloud providers). While DuraSpace is offering a cloud computing solution as a service, it is also possible to download the code and create a cloud computing solution inside your own institution.

Components used by DuraSpace are Akubra (a pluggable file storage interface), Mulgara (a semantic store), and DuraCloud.

DuraSpace expects that more components will be considered for use as they are discovered.

@MIRE

I took a bit of time to talk to Bram Luyten from @MIRE. From what I understand, @MIRE is a commercial company that works very closely with the developers of DSpace, and its staff include DSpace committers. @MIRE provides services including preparing and implementing repository solutions, technical assistance, bug fixes, customizations and a support service for the DSpace product.

As I understand it, DSpace ships with a BSD license and is therefore very open to this sort of interaction and collaboration with a commercial company. To me this seems to be a fairly good approach to a repository solution, as it allows the flexibility of using an open source product with the option to request immediate assistance and support, at a price, should you need it.

Fedora

The Fedora developers want to shift to Akubra as a replacement for the old Fedora storage interface, starting with Fedora 3.2. The Akubra API is not turned on by default in Fedora 3.2, but it is hoped that developers will take an interest in it over time. This will allow the new technology to be tested and implemented gradually.

An interesting feature of Fedora 3.2 that was mentioned is that you are now able to run multiple Fedora instances within one Tomcat instance. This is a topic that I have heard raised a few times over the last couple of years.

Poster Presentations

Squire

The poster sessions included Squire. Squire, as you probably already know, is the Java version of the VTLS product VALET. It was developed with ARROW funding. It appears that VTLS has recently taken an interest in this product and it is possible that they will develop it further. Whether or not it stays open source remains to be seen.

The Fascinator

This poster was presented by Peter Sefton. The Fascinator is an Apache Solr front end to the Fedora Commons repository. I am again guessing that most of you already know that. You can find out more about The Fascinator here.

You can find a full list of the Open Repositories Poster Sessions here.

Photographs

The Open Repositories organisers have provided a Flickr slide show of the entire conference. You will see Peter and me in the Minute Madness Poster Presentations, as well as the two of us discussing the finer points of our posters in the ballroom.

Wrap up

I found that I got just as much information out of talking to people casually as I did from the formal presentations. I met so many people that I have a big job ahead of me going through my notes and contacting them all.

In my opinion there was a definite trend towards distributed systems rather than a single repository. There were even discussions about repository performance and how simply running the database components on separate servers had produced a marked increase in performance. I was surprised at how many people are using open source products and building their own applications over the top. Very few used fully proprietary solutions. One of the many examples of this was a Ruby on Rails application that incorporated Fedora using JRuby. Another, and of course one of the most impressive, was our very own The Fascinator, complete with multi-portal creation, harvesting framework, Solr indexing and security model, as well as installers for Linux, MacOS and Windows. Oliver Lucido has also recently created a screencast of the new desktop feature. Peter's presentation went down really well, and he got quite a few laughs with some witty humor. Overall I had a great time and can't wait until next time.
