Digital Library Consulting: Making digital libraries easy

Archive for September, 2007

“Papers Past” newspaper digitization site released

Tuesday, September 18th, 2007

DL Consulting are pleased to announce that the redesigned Papers Past has now been officially released. Papers Past is a collection of 19th and early 20th century newspapers belonging to the National Library of New Zealand. The previous version of the site made images of each of the 1,000,000+ newspaper pages in the collection available, but did not allow the contents of the newspapers to be searched. The new site is a complete redesign, and is based on a very heavily customised Greenstone installation. DL Consulting built the Greenstone-based delivery system, and have been working on it with the National Library of New Zealand since mid-2006.

The functionality of the updated Papers Past site includes the following.

  • Newspaper pages underwent an OCR process to produce METS/ALTO XML data. A new Greenstone plugin was written for importing this data.
  • Papers Past handles a mixture of searchable and unsearchable newspapers. At present about 250,000 of the 1.1 million total pages in the system are searchable, with more pages being converted to searchable format over time.
  • The use of METS/ALTO data allowed us to build a system where individual newspaper articles and advertisements can be extracted from pages. That is, the user may choose to view either full newspaper pages or to view larger versions of individual articles.
  • The collection features search term highlighting directly within digital images.
  • DL Consulting developed an image-server application that processes, crops, and scales images at display time. Only the original archival TIFF image of each page is stored. When the Greenstone delivery system requires an article-level image, a web-friendly page-level image, or any other type of image, it requests it from the image server, which generates the required image on the fly from the stored archival TIFF.
  • At present the Papers Past collection uses the Lucene search engine. We chose Lucene for its proven ability to scale to very large indexes, and because of its “fuzzy search” capability. Fuzzy search allows the search engine to return hits for documents containing words similar but not identical to the search terms entered by the user. This is a useful feature in a delivery system for a newspaper digitization project: newspapers are extremely difficult for OCR software to process, so the searchable text produced by the OCR process is invariably imperfect.
  • This collection is already very large, and will grow much larger over time. At the time of writing there are 254,000 searchable pages and around 3.1 million searchable articles. While 254,000 pages doesn’t sound like a lot, these pages each contain a huge amount of text. There’s more than 9 GB of raw searchable text, and 27 GB of metadata (we store coordinates for each article and word on every page, hence the enormous amount of metadata). We went to considerable effort to ensure that Greenstone scales sufficiently to support a collection of over one million pages, and we’re continuing this work with funding from a government R&D Grant. We’re currently working on another newspaper digitization project which will eventually scale to more than two million pages.
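To give a feel for how per-word coordinates make in-image highlighting and article cropping possible, here is a minimal Python sketch. It is not the Papers Past code. It assumes ALTO-style String elements carrying CONTENT, HPOS, VPOS, WIDTH and HEIGHT attributes, and it assumes those coordinates are in pixels of the archival image. The sample XML and the function names (word_boxes, highlight_rects, bounding_box) are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written sample in the style of ALTO (hypothetical data, no namespace).
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="article1">
          <TextLine>
            <String CONTENT="Wellington" HPOS="100" VPOS="200" WIDTH="400" HEIGHT="50"/>
            <String CONTENT="Harbour" HPOS="520" VPOS="200" WIDTH="300" HEIGHT="50"/>
          </TextLine>
          <TextLine>
            <String CONTENT="wellington" HPOS="100" VPOS="260" WIDTH="380" HEIGHT="50"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip any XML namespace so the sketch works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def word_boxes(alto_xml):
    """Return (text, x, y, width, height) for every String element."""
    root = ET.fromstring(alto_xml)
    boxes = []
    for el in root.iter():
        if local_name(el.tag) == "String":
            boxes.append((
                el.get("CONTENT"),
                float(el.get("HPOS")), float(el.get("VPOS")),
                float(el.get("WIDTH")), float(el.get("HEIGHT")),
            ))
    return boxes


def highlight_rects(boxes, term, scale):
    """Rectangles (x, y, w, h) to draw over a displayed image scaled by `scale`."""
    term = term.lower()
    return [
        (round(x * scale), round(y * scale), round(w * scale), round(h * scale))
        for text, x, y, w, h in boxes
        if text.lower() == term
    ]


def bounding_box(boxes):
    """Smallest (left, top, right, bottom) rectangle enclosing all word boxes."""
    left = min(x for _, x, _, _, _ in boxes)
    top = min(y for _, _, y, _, _ in boxes)
    right = max(x + w for _, x, _, w, _ in boxes)
    bottom = max(y + h for _, _, y, _, h in boxes)
    return (left, top, right, bottom)


if __name__ == "__main__":
    boxes = word_boxes(SAMPLE_ALTO)
    print(highlight_rects(boxes, "wellington", 0.5))
    print(bounding_box(boxes))
```

In a setup like the one described above, the highlight rectangles would be drawn over the scaled page image, and the bounding box is the kind of crop region a delivery system could pass to an image server when it needs an article-level image.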
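The idea behind fuzzy search can be illustrated with edit distance, which counts the single-character insertions, deletions and substitutions needed to turn one word into another. The Python sketch below is only an illustration of that idea and is not how Lucene implements it internally. The function names and the small vocabulary of OCR-style misreadings are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


def fuzzy_matches(query, vocabulary, max_edits=1):
    """Words in `vocabulary` within `max_edits` edits of `query`, ignoring case."""
    q = query.lower()
    return [w for w in vocabulary if edit_distance(q, w.lower()) <= max_edits]


if __name__ == "__main__":
    # Hypothetical index terms, including typical OCR misreadings of "harbour".
    indexed_terms = ["harbour", "harbonr", "barbour", "harvest"]
    print(fuzzy_matches("harbour", indexed_terms))
```

With max_edits=1, a query for "harbour" also finds the misread forms "harbonr" and "barbour" while leaving out the unrelated "harvest", which is the behaviour that makes fuzzy matching attractive for imperfect OCR text.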

DLC involved in Greenstone pilot project in Southern Africa

Tuesday, September 18th, 2007

DL Consulting recently became involved in a pilot project to introduce Greenstone to Southern Africa. Our commitment to this project includes the donation of time and expertise in helping to provide training in Greenstone (and more general digital library and digitization subjects) to institutions in Southern Africa. As part of the first stage of this pilot project I recently visited the National University of Science and Technology in Zimbabwe, the Lesotho College of Education, the University of Lesotho, and Bunda College of Agriculture in Malawi. My time in Africa was extremely positive, with lots of interest in and excitement about potential applications for digital library technologies like Greenstone. Each of these institutions, as well as the University of Namibia (who are the sub-regional coordinating centre), is now working on Greenstone-based digital collections.

The next phase of the pilot project is a training course to be held at the University of Namibia in early October. This course will cover more advanced topics, following on from the basic training work we did when I visited each institution. Professor Ian Witten, head of the Greenstone project at the University of Waikato, will be the trainer.

For more information on the Southern African Greenstone project see the official project website.

