Large-scale Greenstone collections using DB2
Monday, August 25th, 2008
Greenstone has developed, rather unfairly we feel, a reputation as a ‘toy’ document system not capable of handling large-scale, enterprise-level collections. While our latest ‘million page’ newspaper collections should help change this preconception, large collections do run into some scalability issues. Similar problems in large-scale databases have been answered by distributed computing, where the processing and storage workload is shared and balanced between several computers instead of falling on just one. However, Greenstone didn’t provide this functionality… until now.
Although the work is still at an early stage of development, Greenstone has been integrated with the recently released IBM DB2 Express-C database. This database meets Greenstone’s requirements for metadata storage and, using the Net Search Extender add-on, full-text search, while its license allows users to download and install it for free. Most importantly, Greenstone can then leverage ‘Federation’, DB2’s implementation of distributed computing: the ‘front-end’ DB2 server transparently manages interaction with an arbitrary number of ‘back-end’ DB2 data servers. This offers the potential to dramatically increase Greenstone’s large-scale performance just by adding further ‘back-end’ servers, without having to drastically change Greenstone itself.
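To make the front-end/back-end arrangement concrete, here is a toy Python sketch of the general idea: one front end fans a search out to however many back ends it knows about and merges the results, so the caller never sees how many servers are involved. This is only an illustration of the pattern, not DB2’s Federation mechanism or Greenstone’s actual integration code; the class names `BackEnd` and `FrontEnd`, their methods, and the in-memory document store are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class BackEnd:
    """Hypothetical stand-in for one back-end data server."""
    name: str
    documents: dict = field(default_factory=dict)  # doc_id -> document text

    def search(self, term):
        """Return (doc_id, server_name) for each document containing term."""
        needle = term.lower()
        return [
            (doc_id, self.name)
            for doc_id, text in self.documents.items()
            if needle in text.lower()
        ]


class FrontEnd:
    """Hypothetical front end: the only object the caller talks to."""

    def __init__(self, back_ends):
        self.back_ends = list(back_ends)

    def add_back_end(self, back_end):
        # Scaling out means registering another back end;
        # code that calls search() does not change.
        self.back_ends.append(back_end)

    def search(self, term):
        # Fan the query out to every back end and merge the hits.
        hits = []
        for back_end in self.back_ends:
            hits.extend(back_end.search(term))
        return sorted(hits)
```

In this sketch, adding capacity is a single `add_back_end` call, while code that calls `FrontEnd.search` stays exactly the same. That mirrors the point above about adding further ‘back-end’ servers without drastically changing Greenstone itself.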