Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog
During some recreation/mingling time I learned of BASE, a multi-disciplinary search engine for academically relevant web resources. I spoke to a couple of developers about two important issues relating to repository systems and solutions. The first issue was performance. We discussed how BASE uses a search technology called FAST. I believe that Microsoft acquired “FAST Search & Transfer” in 2008, and the product is now known as “FAST ESP from Microsoft”. BASE currently holds over 20 million records from 1265 sources around the globe and is contributing to the “Digital Repository Infrastructure Vision for European Research” (DRIVER).
The average search that I did on generic topics produced about 150,000 hits out of 20,084,184 items, all in a fraction of a second. Very impressive performance. I have never tested with item counts this large, and I have certainly not seen results like this even with a fraction of the content. It appears that BASE holds metadata, full text and precise bibliographic data, and uses OAI-PMH for harvesting. I searched for quite a while for a data stream, such as a PDF, served from a BASE URL but was redirected every time. I am therefore assuming that no data streams are stored locally (metadata only). Guys, please correct me if I am wrong about this.
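For readers who have not seen OAI-PMH harvesting up close, here is a minimal sketch of what a harvester does with a ListRecords response. It is not BASE's code. The endpoint URL, the sample record and the function names are made up for illustration; only the verb, the arguments and the XML namespaces come from the OAI-PMH protocol. Note that the Dublin Core identifier in the record is a link back to the source repository, which is typically all a metadata-only harvester keeps about the item itself.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response from a hypothetical repository, trimmed to one record.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>A sample thesis</dc:title>
          <dc:identifier>http://repository.example.org/view/1</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for an OAI-PMH endpoint."""
    if resumption_token:
        # resumptionToken is an exclusive argument: it is sent with the verb only.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "id": rec.findtext("oai:header/oai:identifier", default="", namespaces=NS),
            "titles": [e.text or "" for e in rec.iterfind(".//dc:title", NS)],
            # Links back to the item in its home repository.
            "links": [e.text or "" for e in rec.iterfind(".//dc:identifier", NS)],
        })
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS)
    return records, (token or None)


records, token = parse_records(SAMPLE)
print(records, token)
```

A real harvester would fetch each URL over HTTP and keep requesting pages until no resumption token comes back; the network part is left out here so the sketch runs on its own.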
I do not wish to make any performance comparisons with BASE, as The Fascinator has only been tested with a minute number of records by comparison. The interesting point I would like to raise is that The Fascinator can not only harvest and provide metadata, but can also harvest and store data stream content locally. It can be configured to do this in two ways. The first is to have it engage directly with Fedora and harvest metadata as well as data streams using Fedora’s APIs. The second is to configure The Fascinator to harvest using OAI-ORE: if the resource maps contain references to data streams, those data streams are downloaded and stored locally along with the metadata it was configured to harvest at the time. The University of Southern Queensland, in conjunction with the CAIRSS project, is getting ready to carry out a nationwide harvest called the “Australian University Repository Census” (AURC). This harvest will be carried out using The Fascinator software.
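To make the OAI-ORE route a little more concrete, the sketch below pulls data stream references out of a resource map and works out where each one could be stored locally. This is not The Fascinator's implementation. It assumes the resource map uses the Atom serialization of ORE, where aggregated resources appear as links with the `ore/terms/aggregates` relation; the sample resource map, the storage directory and the function names are all hypothetical.

```python
import posixpath
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

ATOM = "{http://www.w3.org/2005/Atom}"
# Link relation used by the ORE Atom serialization for aggregated resources.
ORE_AGGREGATES = "http://www.openarchives.org/ore/terms/aggregates"

# Hypothetical resource map for one repository item.
SAMPLE_REM = """<entry xmlns="http://www.w3.org/2005/Atom">
  <id>http://repository.example.org/rem/1</id>
  <title>Sample aggregation</title>
  <link rel="alternate" href="http://repository.example.org/view/1"/>
  <link rel="http://www.openarchives.org/ore/terms/aggregates"
        href="http://repository.example.org/files/1/thesis.pdf"
        type="application/pdf"/>
  <link rel="http://www.openarchives.org/ore/terms/aggregates"
        href="http://repository.example.org/files/1/data.csv"
        type="text/csv"/>
</entry>"""


def aggregated_resources(resource_map_xml):
    """Return (url, media_type) pairs for every aggregated resource."""
    root = ET.fromstring(resource_map_xml)
    found = []
    for link in root.iter(ATOM + "link"):
        if link.get("rel") == ORE_AGGREGATES and link.get("href"):
            found.append((link.get("href"), link.get("type")))
    return found


def plan_downloads(resource_map_xml, storage_dir):
    """Pair each data stream URL with a local path under storage_dir."""
    plan = []
    for href, _media_type in aggregated_resources(resource_map_xml):
        # Fall back to a generic name if the URL path has no file name.
        filename = posixpath.basename(urlparse(href).path) or "datastream"
        plan.append((href, posixpath.join(storage_dir, filename)))
    return plan


for url, path in plan_downloads(SAMPLE_REM, "store/1"):
    print(url, "->", path)
```

The actual download step is left out so the sketch runs without a network; each planned pair could be fetched with something like `urllib.request.urlretrieve(url, path)`.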
As I mentioned above, another important issue came up in our casual conversation: normalization. It appears to be a problem for everyone in the repository space and in harvesting projects. Before the conference I was throwing a “developer challenge” idea around in my head: an application, well, more of a web service really, that would harvest a repository’s metadata and display it in a web browser, pointing out obvious mistakes first, followed by suggestions for normalization (all the while linking back to the item, so that the user could organize the editing of that item). I talked to Oliver Lucido briefly (I could not discuss it with Peter as he was a judge for the challenge). We came to the conclusion that this is pretty much what we are doing with AURC using The Fascinator. This being my first conference, I was unsure how much conference content I would miss by trying to code something up for 2 of the 4 days… so that idea kind of died.
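The “obvious mistakes first” part of that idea could start out as small as the sketch below. It is only an illustration, not what AURC or The Fascinator does. The record layout (`title`, `date` and `url` keys) and the two checks are assumptions chosen for the example; a real service would need rules agreed with the repositories involved.

```python
import re

# Accept YYYY, YYYY-MM or YYYY-MM-DD; anything else is flagged for review.
YEAR_OR_ISO_DATE = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def check_record(record):
    """Return a list of human-readable problems found in one harvested record."""
    problems = []
    if not record.get("title", "").strip():
        problems.append("missing title")
    date = record.get("date", "").strip()
    if not date:
        problems.append("missing date")
    elif not YEAR_OR_ISO_DATE.match(date):
        problems.append("date not in YYYY, YYYY-MM or YYYY-MM-DD form: %r" % date)
    return problems


def build_report(records):
    """Keep only records with problems, each with a link back to the item."""
    report = []
    for record in records:
        problems = check_record(record)
        if problems:
            report.append({"url": record.get("url", ""), "problems": problems})
    return report


# Hypothetical harvested records.
harvested = [
    {"title": "A sample thesis", "date": "2008-03-12",
     "url": "http://repository.example.org/view/1"},
    {"title": "", "date": "12/03/2008",
     "url": "http://repository.example.org/view/2"},
]

for entry in build_report(harvested):
    print(entry["url"], entry["problems"])
```

Because every report entry carries the item's URL, whoever reads the report can go straight to the record that needs editing.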
Now that I am back I am revisiting that idea and wondering whether it is possible to put together some pieces that exist already and combine them with some software (plagiarism detection style) in the hope of creating a web service capable of pointing out normalization problems on an institution-by-institution basis, giving suggestions for conforming with other institutions and/or repairing internal normalization issues. I think ultimately the best solution would be for each individual institution to be able to see and repair normalization issues in house.
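As a rough sketch of how the similarity-matching part might work, the code below pools one metadata field across institutions, groups values that look alike, and suggests the most common spelling to each institution that uses a variant. It uses Python's standard `difflib` as a stand-in for plagiarism-detection-style software. The 0.85 threshold, the institution names and the sample values are assumptions made up for the example.

```python
from collections import Counter
from difflib import SequenceMatcher


def normalize_key(value):
    """Lower-case, drop dots and commas, and collapse whitespace."""
    return " ".join(value.lower().replace(".", " ").replace(",", " ").split())


def similar(a, b, threshold=0.85):
    """True when two values are close enough to be treated as variants."""
    ratio = SequenceMatcher(None, normalize_key(a), normalize_key(b)).ratio()
    return ratio >= threshold


def suggest_canonical(values, threshold=0.85):
    """Map each variant spelling to the most common similar spelling."""
    clusters = []
    # Most frequent values come first, so each cluster is led by its
    # most common spelling.
    for value, _count in Counter(values).most_common():
        for cluster in clusters:
            if similar(value, cluster[0], threshold):
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return {variant: cluster[0] for cluster in clusters for variant in cluster[1:]}


def report_by_institution(values_by_institution, threshold=0.85):
    """For each institution, list its variant values and the suggested form."""
    pooled = [v for values in values_by_institution.values() for v in values]
    suggestions = suggest_canonical(pooled, threshold)
    return {
        institution: {v: suggestions[v] for v in set(values) if v in suggestions}
        for institution, values in values_by_institution.items()
    }


# Hypothetical "type" field values from two hypothetical institutions.
harvest = {
    "Uni A": ["Journal Article", "Journal Article", "Thesis"],
    "Uni B": ["journal article", "Journal Artcle"],
}

for institution, fixes in report_by_institution(harvest).items():
    print(institution, fixes)
```

Picking the most common spelling as the suggestion is a simplification. It flags inconsistencies for a human to review and does not decide which form is actually correct.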