Continued improvements to METS/ALTO support for newspaper collections

May 1st, 2008 by stefan

Here at DL Consulting we’re continuing to improve Greenstone’s support for importing and displaying METS/ALTO data. METS/ALTO is an XML schema published by the Library of Congress and used by the US National Digital Newspaper Program (NDNP) and many other newspaper digitization projects, as well as by some collections of books, journals, and other textual resources. In addition to extracting machine-readable text from the page, a process that produces METS/ALTO also records information about the individual articles within a page. This allows a user interface to be built where newspaper articles can be displayed on their own, as well as within the pages on which they were printed.

The Papers Past site we built last year with the National Library of New Zealand (and which uses METS/ALTO) continues to grow. There are now over 600,000 searchable pages (that’s about 6.5 million newspaper articles!) in the system. We’re happy with how well the system is scaling, but continue to work on further improvements, with the eventual goal being infinite scalability with large collections distributed across multiple computers. We’re making good progress towards that goal thanks to a research grant from the Foundation for Research, Science, and Technology.

In addition to the Papers Past collection, we’ve built two further METS/ALTO based newspaper collections over recent months. Unfortunately neither of these sites is accessible to the public yet, but we’ll post links on www.dlconsulting.com once they are.

  • Cornell University - The Cornell Daily Sun Digitization project. This project has been using a basic Greenstone system for some time (it is still online now), but we implemented a major upgrade so that the system can import METS/ALTO data, which Cornell have switched to for the digitization of all remaining newspaper issues, as well as the older (proprietary) data format used in the earlier digitization work. METS/ALTO is more flexible than the older format, but the system was implemented so that all the data (in both old and new formats) is displayed very similarly. The Cornell Daily Sun project also switched to generating web-accessible images on demand with image server software, similar to the way Papers Past does.
  • National Library Board of Singapore. We’ve also been working for many months on a large newspaper collection for the National Library of Singapore, building upon the software written for Papers Past. The Singapore collection will be released later this year, initially with around 600,000 pages of digitized content. That will grow to around 2 million pages over time. The Singapore project has some added complexity, including integration with a digital rights management system (because some of the digitized newspapers are still in copyright) and integration with automated concept (i.e. subject heading) extraction software. In addition, the Singapore project uses large grayscale JPEG2000 source images, as opposed to the black-and-white TIFF images used by Papers Past. We had to redevelop our image server software quite significantly to get good performance when processing these JPEG2000 images.

We’ve been asked several times if the code written to import and display METS/ALTO data is open source, and if it has been committed back to Greenstone. The answer is yes, of course it’s open source, but no it hasn’t yet been committed back to the Greenstone code base. The reasons for not committing it back are as follows.

  • It’s a lot of highly specialized code, and is only useful to those with METS/ALTO data. My personal belief is that at times we have too much highly specialized functionality added into Greenstone, and that Greenstone2 isn’t currently modular enough to make it easy to add these sorts of major changes.
  • We’ve worked with a number of METS/ALTO based projects and the data itself is always subtly different. That is, the code always needs to be modified to suit the METS/ALTO schema used, so is only useful as a starting point.

Having said all of the above, we are of course happy to make the code available to those who want it. Please contact us at contact@dlconsulting.com if you’re planning on building a METS/ALTO based Greenstone collection.

“Papers Past” newspaper digitization site released

September 18th, 2007 by stefan

DL Consulting are pleased to announce that the redesigned Papers Past has now been officially released. Papers Past is a collection of 19th and early 20th century newspapers belonging to the National Library of New Zealand. The previous version of the site made images of each of the 1,000,000+ newspaper pages in the collection available, but did not allow the contents of the newspapers to be searched. The new site is a complete redesign, and is based on a very heavily customised Greenstone installation. DL Consulting built the Greenstone-based delivery system, and have been working on it with the National Library of New Zealand since mid-2006.

The functionality of the updated Papers Past site includes the following.

  • Newspaper pages underwent an OCR process to produce METS/ALTO XML data. A new Greenstone plugin was written for importing this data.
  • Papers Past handles a mixture of searchable and unsearchable newspapers. At present about 250,000 of the 1.1 million total pages in the system are searchable, with more pages being converted to searchable format over time.
  • The use of METS/ALTO data allowed us to build a system where individual newspaper articles and advertisements can be extracted from pages. That is, the user may choose to view either full newspaper pages or to view larger versions of individual articles.
  • The collection features search term highlighting directly within digital images.
  • An image-server application was developed by DL Consulting, to allow images to be processed, cropped, and scaled at display time. That is, only the original archival TIFF images of each page are stored. When the Greenstone delivery system requires an article-level image, a web-friendly page-level image, or any other type of image, it requests it from the image server. The image server then generates the required image on-the-fly from the stored archival TIFF images.
  • At present the Papers Past collection uses the Lucene search engine. We chose Lucene for its proven ability to scale to very large indexes, and because of its “fuzzy search” capability. Fuzzy search allows the search engine to return hits for documents containing words similar but not identical to the search terms entered by the user. This is a useful feature in a delivery system for a newspaper digitization project, as newspapers are extremely difficult for OCR software to deal with. This invariably means that the searchable text produced by the OCR process is not perfect.
  • This collection is already very large, and will grow much larger over time. At the time of writing there are 254,000 searchable pages and around 3.1 million searchable articles. While 254,000 pages doesn’t sound like a lot, these pages each contain a huge amount of text. There’s more than 9GB of raw searchable text, and 27GB of metadata (we store coordinates for each article and word on every page, hence the enormous amount of metadata). We went to considerable effort to ensure that Greenstone scales sufficiently to support a collection of over one million pages, and we’re continuing this work with funding from a government R&D grant. We’re currently working on another newspaper digitization project which will eventually scale to more than two million pages.
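
To make the fuzzy search and in-image highlighting features above a little more concrete, here is a minimal Python sketch. It is not the Papers Past code (Greenstone plugins are written in Perl, and the real system uses Lucene for fuzzy matching). It assumes the word coordinates are available as ALTO String elements with CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes, that those coordinates are in pixels of the archival image, and it swaps in a plain edit-distance comparison in place of Lucene. The function names and the scale parameter are hypothetical.

```python
import xml.etree.ElementTree as ET


def edit_distance(a, b):
    """Levenshtein distance between two strings (illustrative stand-in for Lucene)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # delete a character from a
                            curr[j - 1] + 1,            # insert a character into a
                            prev[j - 1] + (ca != cb)))  # substitute (free if equal)
        prev = curr
    return prev[-1]


def highlight_boxes(alto_xml, term, scale=1.0, max_edits=1):
    """Return (word, x, y, width, height) for each OCR word close to `term`.

    `scale` is the ratio of the displayed image size to the archival image
    size, so the boxes can be drawn directly over the web-friendly image.
    Assumption: ALTO coordinates are pixels of the archival image.
    """
    root = ET.fromstring(alto_xml)
    boxes = []
    for element in root.iter():
        # Compare local names only, because the ALTO namespace differs by version.
        if element.tag.rsplit("}", 1)[-1] != "String":
            continue
        word = element.get("CONTENT", "")
        if edit_distance(word.lower(), term.lower()) > max_edits:
            continue
        x, y, w, h = (float(element.get(name, 0))
                      for name in ("HPOS", "VPOS", "WIDTH", "HEIGHT"))
        boxes.append((word, round(x * scale), round(y * scale),
                      round(w * scale), round(h * scale)))
    return boxes
```

Allowing one edit means an OCR misreading such as “Wellingtou” still matches a search for “Wellington”, and the returned boxes are already scaled to whatever image size the page is being displayed at.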

DLC involved in Greenstone pilot project in Southern Africa

September 18th, 2007 by stefan

DL Consulting recently became involved in a pilot project to introduce Greenstone to Southern Africa. Our commitment to this project includes the donation of time and expertise to help provide training in Greenstone (and in more general digital library and digitization subjects) to institutions in Southern Africa. As part of the first stage of this pilot project I recently visited the National University of Science and Technology in Zimbabwe, the Lesotho College of Education, the University of Lesotho, and Bunda College of Agriculture in Malawi. My time in Africa was extremely positive, with lots of interest in and excitement about potential applications for digital library technologies like Greenstone. Each of these institutions, as well as the University of Namibia (the sub-regional coordinating centre), is now working on Greenstone-based digital collections.

The next phase of the pilot project is a training course to be held at the University of Namibia in early October. This course will cover more advanced topics, following on from the basic training work we did when I visited each institution. Professor Ian Witten, head of the Greenstone project at the University of Waikato, will be the trainer.

For more information on the Southern African Greenstone project see the official project website.

Greenstone scalability research grant

July 3rd, 2007 by stefan

DL Consulting recently received a grant from the Foundation for Research, Science and Technology to further improve Greenstone’s performance when scaled up to very large collections. Of particular interest are scalability issues caused by large collections of uncorrected OCRed text (e.g. digitized newspaper collections). As part of this research we are testing and benchmarking the performance of a number of different search engine and metadata database options, as well as improving Greenstone’s ability to distribute a collection across multiple servers. The research grant runs until April 2008.

DL Consulting website updated

June 26th, 2007 by stefan

We’ve been gradually updating our website over the past few weeks, in an attempt to make it much clearer how DL Consulting relates to Greenstone and open source. Most of the content of the site has been changed, and a new page has been added to further clarify what our policies are with regard to open source.

XMP support in Greenstone

June 13th, 2007 by john

After an email request, we’ve developed a new plugin, MetadataXMPPlug, to extract eXtensible Metadata Platform (XMP) information from PDFs. The plugin is very basic, but it should read most XMP information as long as it adheres to the RDF standard.

The neatest part of this project is that it makes use of Greenstone’s relatively new multipass import functionality. The MetadataXMPPlug is used during the metadata_read pass to extract metadata from PDF files, while the files themselves are handled by PDFPlug during the later import pass.
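
For anyone curious what reading XMP out of a PDF involves, here is a rough Python sketch of the general idea. It is not the MetadataXMPPlug code (Greenstone plugins are Perl). It assumes the XMP packet is stored uncompressed in the PDF and wrapped in an x:xmpmeta element, which is common but not guaranteed, and the function name extract_xmp is made up for this example.

```python
import re
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumption: the packet is uncompressed and wrapped in <x:xmpmeta>...</x:xmpmeta>.
XMP_PACKET = re.compile(rb"<x:xmpmeta\b.*?</x:xmpmeta>", re.DOTALL)


def extract_xmp(pdf_bytes):
    """Return {property name: [values]} from the first XMP packet found.

    Property names are in ElementTree's "{namespace-uri}local-name" form,
    e.g. "{http://purl.org/dc/elements/1.1/}title".
    """
    match = XMP_PACKET.search(pdf_bytes)
    if match is None:
        return {}
    root = ET.fromstring(match.group(0))
    metadata = {}
    for description in root.iter(f"{{{RDF}}}Description"):
        # Simple properties can be written as attributes of rdf:Description.
        for name, value in description.attrib.items():
            if not name.startswith(f"{{{RDF}}}"):  # skip rdf:about and friends
                metadata.setdefault(name, []).append(value)
        # Other properties are child elements, often holding an rdf:Alt,
        # rdf:Seq or rdf:Bag container of rdf:li items.
        for prop in description:
            values = [li.text or "" for li in prop.iter(f"{{{RDF}}}li")]
            if not values and prop.text and prop.text.strip():
                values = [prop.text.strip()]
            metadata.setdefault(prop.tag, []).extend(values)
    return metadata
```

A caller would read the PDF in binary mode and pass the bytes in, for example extract_xmp(open(path, "rb").read()). PDFs that keep their metadata stream compressed would need a proper PDF library to decode it first.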

Open-Source report available on site

June 1st, 2007 by john


I finally remembered to add my paper on “The Adoption of Open Source technologies in New Zealand” to the main DLC site. This was the final report for the first module of the Post Graduate Certificate in Professional Development (Electronics and ICT) that I’m currently undertaking.

While the report could probably do with some more polish, it does provide an interesting background to open source issues (such as intellectual property and copyright/left issues) and provides statistics and references which could be used as evidence in future contracts. Earned me an A too!

Hello (World!)

May 30th, 2007 by johnr

We’re starting a blog in the hope of working more closely with the Greenstone community. In the past DL Consulting has contributed to the mailing lists and committed code to the Greenstone CVS tree. Along those lines, we see Planet Greenstone and this blog as a great way to get more involved and engaged with the community around Greenstone.

We work with Greenstone because it provides the most robust and scalable open source digital library system. I put Greenstone in the same category as Linux and Gnome - successful open source projects with a large number of everyday users. Like all successful software projects, Greenstone solves real world problems for many users out there - I know our clients are very happy with it.

However, like a lot of open source projects, Greenstone doesn’t get a lot of user contributions when compared to the number of users. Obviously most users can’t contribute code to the project, but contributions such as documentation, help and showing people what Greenstone can do for them are incredibly useful for any open source project.

The good news is that the University of Waikato is trying to make it easier for users to contribute code changes back. We hope this type of change will trickle down and encourage users to give back to Greenstone to make it better for everyone.