
[Article] What’s METS/ALTO and should you care?

As technologists we always get excited by digital library standards like METS/ALTO. What is it and does it matter to library professionals?

Over the past ten years Digital Library Consulting has built lots of digital collections, and those collections have been built from a wide range of digital objects. For example, we’ve built collections from PDF files, Microsoft Word documents, digital images, text files, HTML files, from digital images with associated text/HTML/PDF files, and from objects using various standards like TEI. In addition, many of the collections we’ve built have had existing metadata in Excel spreadsheets, XML and binary MARC formats, Microsoft Word files, and a whole range of different database formats.

It’s obviously much more complex and time-consuming to build a digital library from objects in formats we’ve not used before than it is to do so from some kind of “standard” format that we already have software to support. In many cases this can’t be helped – the digital files already exist, and the digital library software used to present them simply must be adapted to suit.

In the case of “new” digitization projects, where physical documents are being turned into new digital objects, any number of different digital formats can be selected. We regularly work with those planning new digitization projects to help them select the most appropriate format, and for textual documents we most often recommend METS/ALTO.

The Metadata Encoding and Transmission Standard (METS) has been around for some time, and is a standard with which many library professionals will be familiar. It’s a standard for encoding descriptive, administrative, and structural metadata regarding objects within a digital library, using XML. The METS standard is maintained by the Library of Congress, and is developed as an initiative of the Digital Library Federation.
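To make the "structural metadata" part concrete, here is a minimal sketch of reading a logical structure map with Python's standard library. The sample document, its TYPE and LABEL values, and the function name logical_outline are invented for illustration; real METS files are much larger and are usually constrained by a project-specific profile.

```python
import xml.etree.ElementTree as ET

# METS namespace; the sample document below is invented for illustration.
METS_NS = {"mets": "http://www.loc.gov/METS/"}

SAMPLE_METS = """<mets xmlns="http://www.loc.gov/METS/">
  <structMap TYPE="LOGICAL">
    <div TYPE="Newspaper" LABEL="Example Gazette">
      <div TYPE="Section" LABEL="Front page">
        <div TYPE="Article" LABEL="Harbour news"/>
        <div TYPE="Article" LABEL="Shipping notices"/>
      </div>
    </div>
  </structMap>
</mets>"""


def logical_outline(mets_xml):
    """Return (depth, TYPE, LABEL) for every div in the logical structMap."""
    root = ET.fromstring(mets_xml)
    struct_map = root.find("mets:structMap[@TYPE='LOGICAL']", METS_NS)
    outline = []

    def walk(node, depth):
        # Visit direct child divs only, so depth reflects the nesting level.
        for div in node.findall("mets:div", METS_NS):
            outline.append((depth, div.get("TYPE"), div.get("LABEL")))
            walk(div, depth + 1)

    if struct_map is not None:
        walk(struct_map, 0)
    return outline


if __name__ == "__main__":
    for depth, div_type, label in logical_outline(SAMPLE_METS):
        print("  " * depth + f"{div_type}: {label}")
```

The nested div elements are what let digital library software present a document as a browsable hierarchy rather than a flat list of files.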

While METS is great at describing the structure of a digital object, it’s missing the ability to describe the content and layout of each piece of the digital object. For that we need an extension to METS called ALTO (Analyzed Layout and Text Object). This combination of METS and ALTO was originally developed by the METAe project, and was later adopted by the Library of Congress for their large-scale National Digital Newspaper Program (NDNP). Since then METS/ALTO has been used for many large national newspaper digitization projects, as well as a number of projects digitizing books and journals.

METS/ALTO produces very rich digital objects, which in turn allow rich digital library interfaces to be built. For example, a typical METS/ALTO object encodes not only the complete logical and physical structure of a document (chapters, sections, articles, pages, and so on, along with their associated metadata), but also the full-text content of each section and even the physical coordinates of every word on the page. This can significantly improve the user’s search experience: for example, results can be returned at the level of an individual article, and search terms can be highlighted directly on the page image. Additionally, it doesn’t typically cost any more to digitize materials to METS/ALTO than to formats like HTML, which contain much less information.
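As a rough illustration of how word coordinates can be used, the sketch below looks up a search term in a tiny ALTO fragment and returns the bounding box of each match, which an interface could then draw over the page image. The sample page, its coordinate values, and the function name find_word_boxes are invented for illustration, and the namespace shown assumes ALTO version 2; other versions use a different namespace URI.

```python
import xml.etree.ElementTree as ET

# Assumption: an ALTO v2 file. Adjust the namespace for other ALTO versions.
ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"

# Invented sample: one page, one text block, one line of two words.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1" WIDTH="2400" HEIGHT="3600">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Digital" HPOS="100" VPOS="200" WIDTH="180" HEIGHT="40"/>
            <SP/>
            <String CONTENT="libraries" HPOS="300" VPOS="200" WIDTH="220" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def find_word_boxes(alto_xml, term):
    """Return (word, x, y, width, height) for each word matching term."""
    root = ET.fromstring(alto_xml)
    wanted = term.lower()
    boxes = []
    # Each String element holds one word plus its position on the page.
    for word in root.iter(f"{{{ALTO_NS}}}String"):
        content = word.get("CONTENT", "")
        if content.lower() == wanted:
            boxes.append((
                content,
                float(word.get("HPOS")),
                float(word.get("VPOS")),
                float(word.get("WIDTH")),
                float(word.get("HEIGHT")),
            ))
    return boxes


if __name__ == "__main__":
    for box in find_word_boxes(SAMPLE_ALTO, "libraries"):
        print(box)
```

The match here is a simple case-insensitive comparison; a production search would normally go through a full-text index and only use the ALTO coordinates at display time.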

Digital Library Consulting has completed several projects using the METS/ALTO standard. One example is the National Library of New Zealand’s Papers Past project, which contains approximately 1.2 million newspaper pages, or around 7 million individually searchable and viewable newspaper articles. We’ve also completed a project based on the standard for Cornell University Library, and are working on a major project with the National Library of Singapore.

If you would like to discuss the implications of using METS/ALTO in your digital collection projects, please contact us at contact@dlconsulting.com.
