Digital Library Consulting: Making digital libraries easy

Archive for the ‘Articles’ Category

[Article] What’s METS/ALTO and should you care?

Monday, August 17th, 2009

As technologists we always get excited by digital library standards like METS/ALTO. What is it and does it matter to library professionals?

Over the past ten years Digital Library Consulting has built lots of digital collections, and those collections have been built from a wide range of digital objects. For example, we’ve built collections from PDF files, Microsoft Word documents, digital images, text files, HTML files, from digital images with associated text/HTML/PDF files, and from objects using various standards like TEI. In addition, many of the collections we’ve built have had existing metadata in Excel spreadsheets, XML and binary MARC formats, Microsoft Word files, and a whole range of different database formats.

It’s obviously much more complex and time-consuming to build a digital library from objects in formats we’ve not used before than it is to do so from some kind of “standard” format that we already have software to support. In many cases this can’t be helped – the digital files already exist, and the digital library software used to present them simply must be adapted to suit.

In the case of “new” digitization projects, where physical documents are being turned into new digital objects, any number of different digital formats can be selected. We regularly work with those planning new digitization projects to help them select the most appropriate format, and for textual documents we most often recommend METS/ALTO.

The Metadata Encoding and Transmission Standard (METS) has been around for some time, and is a standard with which many library professionals will be familiar. It’s a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, using XML. The METS standard is maintained by the Library of Congress, and is developed as an initiative of the Digital Library Federation.
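To make the "structural metadata" part a little more concrete, here is a minimal sketch in Python of walking a METS structural map. The sample record is hand-written for illustration (the newspaper title, issue date and article labels are invented), and the `outline` helper is a hypothetical name, not part of any METS toolkit. It assumes only the standard METS namespace and the `structMap`/`div` elements with their `TYPE` and `LABEL` attributes.

```python
import xml.etree.ElementTree as ET

# The standard METS namespace.
METS_NS = {"mets": "http://www.loc.gov/METS/"}

# Hand-written, illustrative METS fragment (not taken from a real project).
SAMPLE_METS = """<mets:mets xmlns:mets="http://www.loc.gov/METS/">
  <mets:structMap TYPE="LOGICAL">
    <mets:div TYPE="Newspaper" LABEL="Example Gazette">
      <mets:div TYPE="Issue" LABEL="1 January 1900">
        <mets:div TYPE="Article" LABEL="Harbour news"/>
        <mets:div TYPE="Article" LABEL="Shipping notices"/>
      </mets:div>
    </mets:div>
  </mets:structMap>
</mets:mets>"""


def outline(mets_xml):
    """Return an indented outline of every div in every structMap."""
    root = ET.fromstring(mets_xml)
    lines = []

    def walk(div, depth):
        # Record this div, then descend into its direct child divs.
        lines.append("  " * depth + f"{div.get('TYPE')}: {div.get('LABEL')}")
        for child in div.findall("mets:div", METS_NS):
            walk(child, depth + 1)

    for struct_map in root.findall("mets:structMap", METS_NS):
        for top in struct_map.findall("mets:div", METS_NS):
            walk(top, 0)
    return lines


if __name__ == "__main__":
    print("\n".join(outline(SAMPLE_METS)))
```

A real METS file would also carry descriptive and administrative sections and file pointers; the sketch leaves those out to keep the nesting of the document structure visible.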

While METS is great at describing the structure of a digital object, it’s missing the ability to describe the content and layout of each piece of the digital object. For that we need an extension to METS called ALTO (Analyzed Layout and Text Object). This combination of METS and ALTO was originally developed by the METAe project, and was later adopted by the Library of Congress for its large-scale National Digital Newspaper Program (NDNP). Since then METS/ALTO has been used for many large national newspaper digitization projects, as well as a number of projects digitizing books and journals.

METS/ALTO produces very rich digital objects, which in turn allow rich digital library interfaces to be built. For example, a typical METS/ALTO object encodes not only the complete logical and physical structure of a document (i.e. chapters, sections, articles, pages, etc., and their associated metadata), but also the full-text content of each section of the document and even the physical coordinates of every word on each page. The impact on the user’s search experience can be significant: search terms can, for instance, be highlighted exactly where they appear on the page image. Additionally, it doesn’t typically cost any more to digitize materials to METS/ALTO than to formats like HTML, which contain much less information.
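As a rough illustration of what word-level coordinates make possible, the Python sketch below reads an ALTO-style page and returns the bounding box of each word matching a search term, which is the information an interface would need to draw a highlight over the page image. The sample XML is hand-written, the namespace shown is the ALTO version 2 one (other versions use different namespaces, so the code ignores the namespace entirely), and the function names `word_boxes` and `find_hits` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written, illustrative ALTO fragment (not taken from a real project).
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2400" HEIGHT="3600">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Digital" HPOS="100" VPOS="200" WIDTH="310" HEIGHT="60"/>
            <SP/>
            <String CONTENT="library" HPOS="430" VPOS="200" WIDTH="290" HEIGHT="60"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix so the code is not tied to one ALTO version."""
    return tag.rsplit("}", 1)[-1]


def word_boxes(alto_xml):
    """Return (word, hpos, vpos, width, height) for every String element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for elem in root.iter():
        if local_name(elem.tag) == "String":
            boxes.append((
                elem.get("CONTENT"),
                float(elem.get("HPOS")),
                float(elem.get("VPOS")),
                float(elem.get("WIDTH")),
                float(elem.get("HEIGHT")),
            ))
    return boxes


def find_hits(boxes, query):
    """Case-insensitive whole-word match; returns the boxes to highlight."""
    q = query.lower()
    return [box for box in boxes if box[0].lower() == q]


if __name__ == "__main__":
    for hit in find_hits(word_boxes(SAMPLE_ALTO), "library"):
        print(hit)
```

A production system would normally index the words in a search engine rather than scan the XML on each query; the sketch only shows that the coordinates are available per word.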

Digital Library Consulting has completed several projects using the METS/ALTO standard. One example is the National Library of New Zealand’s Papers Past project, which contains approximately 1.2 million newspaper pages, or around 7 million individually searchable and viewable newspaper articles. We’ve also completed a project based on the standard for Cornell University Library, and are working on a major project with the National Library of Singapore.

If you would like to discuss the implications of using METS/ALTO in your digital collection projects please contact us at contact@dlconsulting.com.


Making legacy databases available online with Greenstone

Sunday, September 21st, 2008

Recently we were approached by a client seeking to put the information in their legacy Paradox 4.0 database online. Greenstone would make the information web-accessible, but we first needed to convert the database into a form that Greenstone can handle.

Digging on the internet revealed conversion software called Paradox dBase Reader, which can read DBF/DB files in Paradox format (and also supports other formats, such as dBase, FoxBase, Foxpro, Visual Foxpro and Visual DBase). This software converts Paradox databases into HTML, Text (CSV), Excel, or XML formats. We converted the database into XML, our preferred machine-readable format. Paradox dBase Reader was also able to extract images, stored internally as binary data, from the client’s database.

We then created a plugin to import the extracted XML and images into a Greenstone collection. Greenstone’s modular nature made this straightforward. The finishing touches will be to customize and brand the collection to the client’s specifications and then make it available on the client’s intranet.

This is an example of a Paradox database being converted for Greenstone, but many other database formats can have new life breathed into them by having Greenstone serve them up on the web.


Open-Source report available on site

Friday, June 1st, 2007

I finally remembered to add my paper on The Adoption of Open Source technologies in New Zealand to the main DLC site.  This was the final report for the first module of the Post Graduate Certificate in Professional Development (Electronics and ICT) that I’m currently undertaking.

While the report could probably do with some more polish, it does provide an interesting background to open source issues (such as intellectual property and copyright/copyleft) and provides statistics and references which could be used as evidence in future contracts. Earned me an A too!



