CAIRSS Blog

2010/01/20

How to export contents from an Institutional Repository to a Spreadsheet

Filed under: Digital Commons,DigiTool,DSpace,EPrints,Equella,ERA,Fedora,Fez,Java,OAI-PMH,SOLR — tpmccallum @ 5:21 pm

Please note – The CAIRSS blog has relocated to http://cairss.caul.edu.au/blog

The idea

A short time ago a Repository Manager from within the CAIRSS community asked CAIRSS for help exporting the contents of their repository to a spreadsheet. It became apparent that such an export would greatly assist with Institutional Repository management tasks and, most importantly, with ERA-related work.

The tools

There are many ways that data can be extracted, moved and converted. The wisest choice is to use tools that are interoperable. An example of this would be choosing OAI-PMH to extract data rather than attempting to communicate directly with an individual repository's data store or database.

The solution

Our CAIRSS Technical Officer, Tim McCallum, has completed a solution to this task in the form of a Java web application: FoREveR, the Flexible Repository Export Reporter.

Extracting the data

The data extraction is carried out using an OAI-PMH harvester. In this instance The Fascinator was used to accomplish this task. Given recent trends in Institutional Repository development and the use of SOLR, the next step was an easy choice: simply extract the data from The Fascinator using a SOLR query. As an added bonus, SOLR is able to supply the data in JSON (JavaScript Object Notation) format.
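To make the SOLR step concrete, here is a minimal sketch of how such a query URL could be put together. It is not taken from the FoREveR source. The host name and the class and method names are assumptions made up for illustration; `q`, `wt=json` and `rows` are standard SOLR request parameters.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SolrQueryUrl {

    // Builds a SOLR select URL that asks for the response in JSON.
    // solrBase is a hypothetical base URL; it will differ per installation.
    static String build(String solrBase, String query, int rows) {
        return solrBase + "/select?q="
                + URLEncoder.encode(query, StandardCharsets.UTF_8)
                + "&wt=json&rows=" + rows;
    }

    public static void main(String[] args) {
        // "*:*" is the SOLR query that matches every record.
        System.out.println(build("http://localhost:8080/solr", "*:*", 10));
    }
}
```

Fetching that URL with any HTTP client should return the matching records as a JSON document, which is the input to the conversion step.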

Converting the data

Overview

After testing different methods of converting the data, including XSLT and Python, some research revealed excellent JSON libraries written in Java. The final choice was Java, given that the JSON libraries could meet the requirements for this application and that the OAI-PMH harvester, The Fascinator and SOLR were all already written in Java.

Technical

The JSON data returned is the result of an HTTP request (which can be set to fetch all records by default). This data is converted to Java Maps and Java ArrayLists for further processing. The application loops through every record that has been returned and creates a Java Set of metadata field names (a unique master list). This Set is then displayed in the user's browser. This is a last-minute chance to select or deselect metadata fields before the final report is written. It is sometimes the case that a metadata field containing a large amount of content is best left out, as this can make the spreadsheet unmanageable from an end user's perspective.
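The master-list step can be sketched roughly as follows. This is an illustration rather than the FoREveR code itself: it assumes the JSON has already been parsed into one Map per record, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

public class FieldCollector {

    // Collects every field name that occurs in any record.
    // A TreeSet keeps the names unique and in alphabetical order.
    static Set<String> collectFieldNames(List<Map<String, Object>> records) {
        Set<String> fields = new TreeSet<>();
        for (Map<String, Object> record : records) {
            fields.addAll(record.keySet());
        }
        return fields;
    }

    // Drops the fields the user deselected, leaving the master list untouched.
    static Set<String> withoutDeselected(Set<String> master, Collection<String> deselected) {
        Set<String> selected = new TreeSet<>(master);
        selected.removeAll(deselected);
        return selected;
    }

    public static void main(String[] args) {
        Map<String, Object> first = new HashMap<>();
        first.put("title", "Example title");
        first.put("abstract", "A very long abstract...");
        Map<String, Object> second = new HashMap<>();
        second.put("title", "Another title");
        second.put("creator", "Example, A.");

        List<Map<String, Object>> records = new ArrayList<>();
        records.add(first);
        records.add(second);

        Set<String> master = collectFieldNames(records);
        System.out.println(master);
        System.out.println(withoutDeselected(master, List.of("abstract")));
    }
}
```

Records do not all carry the same fields, which is why the list is built by looping over every record rather than reading only the first one.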

Reporting the data

Once the selection is approved, the application creates an HTML file with all data saved to a table. The table includes table headings, table rows, table data cells and unordered lists for repeating information. This file can be opened in the Microsoft Excel and Open Office spreadsheet applications or viewed in a browser.
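As a rough sketch of that report format (again an illustration under assumptions, not the FoREveR source), the following writes one table row per record and turns repeating values into an unordered list inside the cell. The class and method names are hypothetical, and records are assumed to be Maps whose repeating fields hold a List.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class HtmlReport {

    // Escapes the characters that would otherwise break the HTML markup.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Renders one table data cell; a List value becomes an unordered list.
    static String cell(Object value) {
        if (value == null) {
            return "<td></td>";
        }
        if (value instanceof List) {
            StringBuilder sb = new StringBuilder("<td><ul>");
            for (Object item : (List<?>) value) {
                sb.append("<li>").append(escape(String.valueOf(item))).append("</li>");
            }
            return sb.append("</ul></td>").toString();
        }
        return "<td>" + escape(String.valueOf(value)) + "</td>";
    }

    // Renders a heading row followed by one row per record.
    static String toHtmlTable(List<String> fields, List<Map<String, Object>> records) {
        StringBuilder sb = new StringBuilder("<table>\n<tr>");
        for (String field : fields) {
            sb.append("<th>").append(escape(field)).append("</th>");
        }
        sb.append("</tr>\n");
        for (Map<String, Object> record : records) {
            sb.append("<tr>");
            for (String field : fields) {
                sb.append(cell(record.get(field)));
            }
            sb.append("</tr>\n");
        }
        return sb.append("</table>").toString();
    }

    public static void main(String[] args) {
        Map<String, Object> record = new LinkedHashMap<>();
        record.put("title", "Example title");
        record.put("creator", Arrays.asList("Example, A.", "Sample, B."));
        System.out.println(toHtmlTable(Arrays.asList("title", "creator"),
                Collections.singletonList(record)));
    }
}
```

A record that lacks one of the selected fields simply gets an empty cell, so every row keeps the same number of columns.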

Screen Shots

Optional SOLR Query

Note: It is not necessary to know SOLR query syntax; the application can be set to get everything by default. This may be an area to address with the community, and feedback is welcome.

Feedback


Small sample of spreadsheet output


Using the Flexible Repository Export Reporter (FoREveR)

As this software is in the very early stages of its life cycle, reports can be created by CAIRSS and emailed out to you. Please contact CAIRSS Central if you think that your institution can benefit from the use of this tool.

The source code is available at http://cairss.caul.edu.au/trac/browser/code/FoREveR for your interest; however, it has not been extensively tested. All feedback is welcome. CAIRSS will endeavor to improve and enhance the software to meet your needs.

2009/10/13

Australian repository software in use

Filed under: Digital Commons,DigiTool,DSpace,EPrints,Equella,Fedora,Fez,Software,VITAL — caulcairss @ 3:28 pm

CAIRSS now has a current list of Australian university research repositories (see: http://cairss.caul.edu.au/www/repository_software/repository_software.htm).

As outlined on this CAIRSS webpage, all 39 Australian universities have a research repository, with seven different repository software packages currently in use.

In future, CAIRSS will work to list which version of the software each installation uses.

2009/05/28

Open Repositories Conference 09

General Overview

This year's Open Repositories Conference was held in Atlanta, Georgia, USA. This was the 4th year of this annual international conference. The Conference was held at the Georgia Institute of Technology Hotel and Conference Center.

Participants and Sponsors

The Conference had representatives from organizations including DSpace, EPrints, Fedora, VTLS, JISC, Microsoft Research, Sun Microsystems, @MIRE and NSF (National Science Foundation).

Microsoft Research

It was made quite clear to me throughout the conference that Microsoft Research were interested in carrying out research and development, and were not concerned with directly profiting from their involvement. There were several open discussions during workshops about how they could best create plug-in functionality for their products that would enable their users to interact with repositories. There was a lot of constructive conversation around how SWORD would be integrated with new Microsoft Research products and plug-ins. There were good discussions about whether the processing and converting of documents and metadata should be done on the client, as a web service, or handled directly by the repository software. The main challenge that I could see is deciding how much freedom to give to the user. Do they simply click a button upon completion of their work, or does the software allow them to interact at quite a low level with metadata and file types, allowing them to review their work in the different formats before the final submission? I am assuming that a researcher who has spent several years writing and researching would be willing to spend a substantial amount of time putting the final touches on the master document to make sure that it rendered correctly in HTML and PDF. It would be amazing if we could write software that would handle everything behind the scenes; perhaps eventually we will arrive at this point.

DSpace

Where is DSpace heading? Version 2.0 can be expected in early 2010.

In the meantime, 1.6 will be released as a stepping stone to 2.0 and will include bug fixes (due October 2009).

I ran into Kim Shepherd from the Library Consortium of New Zealand on my way out of Atlanta. Kim is a DSpace committer, and we had a good conversation about DSpace 2.0 amongst other things. I will be sure to keep in touch and keep an eye on future development.

DuraSpace Organization

DuraSpace is the organization created by Fedora Commons and the DSpace Foundation joining together. The first technology to emerge from DuraSpace will be a product called DuraCloud. DuraCloud consists of a complete hosting service using DuraSpace partners (commercial cloud providers). While DuraSpace is offering a cloud computing solution as a service, it is also possible to download the code and create a cloud computing solution inside your own institution.

Components used by DuraSpace are Akubra (a pluggable file storage interface), Mulgara (a semantic store) and DuraCloud.

DuraSpace expects that more components will be considered for use as they are discovered.

@MIRE

I took a bit of time to talk to Bram Luyten from @MIRE. From what I understand, @MIRE is a commercial company that works very closely with the developers of DSpace, and their staff include DSpace committers. @MIRE provide services including preparing and implementing repository solutions, technical assistance, bug fixes, customizations and a support service for the DSpace product.

As I understand it, DSpace ships with a BSD license and is therefore very open to this sort of interaction and collaboration with a commercial company. To me this seems to be a fairly good approach to a repository solution, as it allows the flexibility of using an open source product with the option to request immediate assistance and support, at a price, should you need it.

Fedora

With version 3.2, Fedora aims to start shifting to Akubra as a replacement for the old Fedora storage interface. The Akubra API is not turned on by default in Fedora 3.2; it is hoped that developers will take an interest in it over time. This will allow the new technology to be tested and implemented gradually.

An interesting feature of Fedora 3.2 that was mentioned is that you are now able to run multiple Fedora instances within one Tomcat instance. This is a topic that I have heard raised a few times over the last couple of years.

Poster Presentations

Squire

The poster sessions included Squire. Squire, as you probably already know, is the Java version of the VTLS product VALET. It was developed with ARROW funding. It appears that VTLS has recently taken an interest in this product and it is possible that they will develop it further. Whether or not it remains open source remains to be seen.


The Fascinator

This poster was presented by Peter Sefton. The Fascinator is an Apache Solr front end to the Fedora Commons repository; I am again guessing that most of you probably already know that. You can find out more about The Fascinator here.


You can find a full list of the Open Repositories Poster Sessions here

Photographs

The Open Repositories organisers have provided a Flickr slide show of the entire conference. You will see Peter and me in the Minute Madness Poster Presentations, as well as the two of us discussing the finer points of our posters in the ballroom.

Wrap up

I found that I got just as much information out of talking to people casually as I did during the formal presentations. I met so many people that I have a big job ahead going through my notes and contacting them all.

In my opinion there was a definite trend towards distributed systems rather than a single repository. There were even discussions about repository performance, and how moving only the database components onto separate servers had markedly increased a repository's performance. I was surprised at how many people are using open source products and building their own applications over the top. Very few used fully proprietary solutions. One of the many examples was a Ruby on Rails application that incorporated Fedora using JRuby. One of the most impressive, of course, was our very own The Fascinator, complete with multi-portal creation, a harvesting framework, Solr indexing and a security model, as well as installers for Linux, MacOS and Windows. Oliver Lucido has also recently created a screencast of the new desktop feature. Peter's presentation went down really well; he got quite a few laughs with some witty humor. Overall I had a great time and can't wait until next time.

2009/05/13

What do you get when you combine DSpace and Fedora? … DuraSpace

Filed under: DSpace,Fedora — caulcairss @ 5:38 pm

Fedora Commons and DSpace Foundation join together to create DuraSpace organization

Further details available at: http://www.duraspace.org/pressrelease.html
