“Papers Past” newspaper digitization site released
Tuesday, September 18th, 2007
DL Consulting are pleased to announce that the redesigned Papers Past has now been officially released. Papers Past is a collection of 19th and early 20th century newspapers belonging to the National Library of New Zealand. The previous version of the site made images of each of the 1,000,000+ newspaper pages in the collection available, but did not allow the contents of the newspapers to be searched. The new site is a complete redesign, and is based on a very heavily customised Greenstone installation. DL Consulting built the Greenstone-based delivery system, and have been working on it with the National Library of New Zealand since mid-2006.
The updated Papers Past site includes the following functionality:
- Newspaper pages underwent an OCR process to produce METS/ALTO XML data. A new Greenstone plugin was written for importing this data.
- Papers Past handles a mixture of searchable and unsearchable newspapers. At present about 250,000 of the 1.1 million total pages in the system are searchable, with more pages being converted to searchable format over time.
- The use of METS/ALTO data allowed us to build a system where individual newspaper articles and advertisements can be extracted from pages. That is, the user may choose to view either full newspaper pages or larger versions of individual articles.
- The collection features search term highlighting directly within digital images.
- DL Consulting developed an image-server application that processes, crops, and scales images at display time. Only the original archival TIFF image of each page is stored. When the Greenstone delivery system requires an article-level image, a web-friendly page-level image, or any other type of image, it requests it from the image server, which generates the required image on the fly from the stored archival TIFF.
- At present the Papers Past collection uses the Lucene search engine. We chose Lucene for its proven ability to scale to very large indexes, and for its “fuzzy search” capability. Fuzzy search allows the search engine to return hits for documents containing words similar but not identical to the search terms entered by the user. This is a useful feature in a delivery system for a newspaper digitization project: newspapers are extremely difficult for OCR software to deal with, so the searchable text produced by the OCR process is never perfect.
- This collection is already very large, and will grow much larger over time. At the time of writing there are 254,000 searchable pages and around 3.1 million searchable articles. While 254,000 pages doesn’t sound like a lot, each page contains a huge amount of text. There’s more than 9 GB of raw searchable text, and 27 GB of metadata (we store coordinates for each article and for each word on every page, hence the enormous amount of metadata). We went to considerable effort to ensure that Greenstone scales sufficiently to support a collection of over one million pages, and we’re continuing this work with funding from a government R&D Grant. We’re currently working on another newspaper digitization project which will eventually scale to more than two million pages.
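
To give a feel for the word coordinates mentioned above, here is a minimal Python sketch of reading them out of ALTO data and finding the boxes to highlight for a search term. It is not the Greenstone plugin itself. The sample fragment is hand-written and heavily simplified, and the function names (`extract_words`, `find_highlights`) are made up for illustration. The only things taken from the ALTO format are the `String` element and its `CONTENT`, `HPOS`, `VPOS`, `WIDTH` and `HEIGHT` attributes.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified ALTO-style fragment (illustrative, not real Papers Past data).
SAMPLE_ALTO = """<alto>
  <Layout>
    <Page ID="P1" WIDTH="2000" HEIGHT="3000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Wellington" HPOS="100" VPOS="200" WIDTH="300" HEIGHT="40"/>
            <String CONTENT="harbour." HPOS="420" VPOS="200" WIDTH="210" HEIGHT="40"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Drop any '{namespace}' prefix so the code does not depend on the ALTO version."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return one dict per String element: its text and its box in page coordinates."""
    words = []
    for elem in ET.fromstring(alto_xml).iter():
        if local_name(elem.tag) != "String":
            continue
        words.append({
            "text": elem.get("CONTENT", ""),
            "x": float(elem.get("HPOS", 0)),
            "y": float(elem.get("VPOS", 0)),
            "w": float(elem.get("WIDTH", 0)),
            "h": float(elem.get("HEIGHT", 0)),
        })
    return words


def find_highlights(words, term):
    """Boxes (x, y, w, h) of words equal to term, ignoring case and edge punctuation."""
    wanted = term.lower()
    return [
        (word["x"], word["y"], word["w"], word["h"])
        for word in words
        if word["text"].strip(".,;:!?\"'()").lower() == wanted
    ]


if __name__ == "__main__":
    page_words = extract_words(SAMPLE_ALTO)
    print(find_highlights(page_words, "harbour"))
```

Storing one box per word is what makes highlighting inside the image possible, and it is also why the metadata ends up several times larger than the text.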
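
The geometry behind article-level images and in-image highlighting can be sketched in a few lines of Python. This is only an illustration of the idea, assuming an article is described by the rectangles of its text blocks in page coordinates. The function names, the padding value and the example numbers are all hypothetical, and the real image server also has to decode and resample the TIFF, which is left out here.

```python
def union_box(boxes):
    """Smallest rectangle (x, y, w, h) that contains every (x, y, w, h) box."""
    left = min(x for x, y, w, h in boxes)
    top = min(y for x, y, w, h in boxes)
    right = max(x + w for x, y, w, h in boxes)
    bottom = max(y + h for x, y, w, h in boxes)
    return (left, top, right - left, bottom - top)


def crop_region(block_boxes, page_width, page_height, padding=10):
    """Region of the archival page to cut out for one article, clamped to the page."""
    x, y, w, h = union_box(block_boxes)
    left = max(0, x - padding)
    top = max(0, y - padding)
    right = min(page_width, x + w + padding)
    bottom = min(page_height, y + h + padding)
    return (left, top, right - left, bottom - top)


def to_display(word_box, crop, display_width):
    """Map a word box from page coordinates into the cropped, scaled article image."""
    scale = display_width / crop[2]
    x, y, w, h = word_box
    return (
        round((x - crop[0]) * scale),
        round((y - crop[1]) * scale),
        round(w * scale),
        round(h * scale),
    )


if __name__ == "__main__":
    # Two text blocks making up one article on a 2000 x 3000 pixel page (made-up numbers).
    article_blocks = [(100, 200, 800, 400), (100, 620, 800, 300)]
    crop = crop_region(article_blocks, 2000, 3000, padding=20)
    # Where a matched word lands once the crop is scaled to 420 pixels wide.
    print(crop, to_display((420, 200, 210, 40), crop, 420))
```

Because every derived image is described by a crop rectangle and a scale factor, nothing but the archival TIFF needs to be kept on disk.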
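
Fuzzy search is easiest to understand through edit distance: the number of single-character insertions, deletions and substitutions needed to turn one word into another. In Lucene's query syntax a fuzzy term is written with a trailing tilde, as in `wellington~`. The Python below is a toy illustration of the matching idea only, not Lucene's implementation; the function names and the small vocabulary of OCR-style misreadings are invented for the example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def fuzzy_match(term, vocabulary, max_edits=1):
    """Indexed words within max_edits of the search term, in sorted order."""
    term = term.lower()
    return sorted(
        word for word in vocabulary
        if edit_distance(term, word.lower()) <= max_edits
    )


if __name__ == "__main__":
    # Made-up index vocabulary containing typical OCR misreadings.
    indexed_words = {"wellington", "welllngton", "we1lington", "auckland"}
    print(fuzzy_match("wellington", indexed_words))
```

A search for "wellington" with one edit allowed also finds the pages where OCR read the word as "welllngton" or "we1lington", which an exact-match search would miss.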