Greenstone tutorial exercise

Prerequisite: Scanned image collection
Sample files: niupepa.zip
Devised for Greenstone version: 3.12 (with gv extension)
Modified for Greenstone version: 3.12

OCRing scanned image content with Google Vision

In this exercise we build upon the collection created in the Scanned image collection exercise. When building the collection, we use the Google Vision API to apply Optical Character Recognition (OCR) to the scanned pages.

Activating Google Vision to apply its OCR capability to scanned pages

In the Scanned image collection exercise, the two items from "Te Waka o Te Iwi" that were added to the collection have scanned images but no text files. In this exercise we learn how to use the Google Vision extension to Greenstone to generate the text files automatically through its OCR capability.
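To give a feel for what an OCR request to Google Vision looks like, here is a minimal sketch based on the public Cloud Vision REST API (the `images:annotate` method with the `DOCUMENT_TEXT_DETECTION` feature). It is not taken from the Greenstone extension, whose internals are not described in this exercise; the function names `build_ocr_request` and `extract_text` are hypothetical, and actually sending the request needs Google Cloud credentials, which are not shown.

```python
import base64

# Public REST endpoint for Cloud Vision image annotation.
VISION_ENDPOINT = "https://vision.googleapis.com/v1/images:annotate"


def build_ocr_request(image_bytes: bytes) -> dict:
    """Build the JSON body asking Cloud Vision for dense-text OCR of one image.

    Hypothetical helper for illustration only; not part of Greenstone.
    """
    return {
        "requests": [
            {
                # The image is sent inline as base64-encoded bytes.
                "image": {"content": base64.b64encode(image_bytes).decode("ascii")},
                # DOCUMENT_TEXT_DETECTION is the feature aimed at dense text
                # such as scanned pages.
                "features": [{"type": "DOCUMENT_TEXT_DETECTION"}],
            }
        ]
    }


def extract_text(response: dict) -> str:
    """Return the recognised text from a parsed images:annotate response.

    Returns an empty string when the response contains no text annotation.
    """
    responses = response.get("responses") or [{}]
    return responses[0].get("fullTextAnnotation", {}).get("text", "")
```

The body returned by `build_ocr_request` would be sent as JSON in a POST to `VISION_ENDPOINT`, and the parsed reply passed to `extract_text` to obtain the text for one page.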

In the Librarian Interface, open the Paged Image collection that was created in the Scanned image collection exercise, if it is not already open (File → Open...).

In the Gather panel, refresh your memory of the scanned file content and the associated .item files that are present. In particular, review the 09 folder, which contains the two newspaper editions that do not have any accompanying text files, meaning that full-text searching of their content in the digital library is not possible. The only way they can be accessed is through the metadata provided in their respective .item files.
Inside the 09 folder you can see that there are 2 item files, 8 image files and 0 text files.
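If you would like to confirm these counts outside the Librarian Interface, the following is a minimal sketch that tallies the files in a folder by kind. The function names and the set of image extensions are assumptions made for illustration, so adjust them to match the files in your own copy of the collection.

```python
from collections import Counter
from pathlib import Path

# Assumed image extensions; adjust to match the scanned files in your collection.
IMAGE_EXTENSIONS = {".gif", ".png", ".jpg", ".jpeg", ".tif", ".tiff"}


def classify(path: Path) -> str:
    """Return 'item', 'text', 'image' or 'other' based on the file extension."""
    suffix = path.suffix.lower()
    if suffix == ".item":
        return "item"
    if suffix == ".txt":
        return "text"
    if suffix in IMAGE_EXTENSIONS:
        return "image"
    return "other"


def inventory(folder: Path) -> Counter:
    """Count the files under `folder` (searched recursively) by kind."""
    return Counter(classify(p) for p in folder.rglob("*") if p.is_file())


# Example usage (the path is a placeholder for the 09 folder in your import area):
#   counts = inventory(Path("path/to/import/09"))
#   print(counts["item"], counts["image"], counts["text"])
```

A folder with item and image files but a text count of zero is the situation this exercise addresses.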

Change to the Design panel and click on the Select plugin to add drop-down menu. Select GoogleVisionPagedImagePlugin from the list displayed, and click Add Plugin... to add it to the pipeline of plugins this collection uses.
Next, click on GoogleVisionPagedImagePlugin in the Assigned Plugins pane and use the Move Up button to reposition this plugin so that it is above PagedImagePlugin and below GreenstoneXMLPlugin.

If GLI is not already in Expert mode, use the Preferences... item on the File menu to change to this mode. In Expert mode, more output is generated when the collection is built, showing how Google Vision is being used to OCR the scanned image content.

Switch to the Create panel and press the build button to build the collection. Upon completion, preview the result.

Using the query box for the digital library collection, search for Kihikihi, a small town located in the North Island of New Zealand.
Previously (for the version of the collection built in the Scanned image collection exercise) this found only 2 matching newspapers. Now, with Google Vision used to OCR the scanned pages where no text file was provided, the search returns 3 matching newspapers. The additional match is in Volume 1, Number 2 of Te Waka o Te Iwi.

Copyright © 2005-2024 by the New Zealand Digital Library Project at the University of Waikato, New Zealand
Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is included in the section entitled “GNU Free Documentation License.”