Lessons learned from participating in the CITScribe Hackathon
Two members of BHL's Technical Advisory Group (TAG), BHL Technical Director William Ulate from Missouri Botanical Garden and Joe deVeer, Head of Technical Services of the Ernst Mayr Library at the Museum of Comparative Zoology, Harvard, "virtually" attended the CITScribe Hackathon in Gainesville, Florida from Dec. 16 to 20, 2013.
The hackathon was co-organized by iDigBio (www.idigbio.org) and Zooniverse's Notes from Nature Project (www.notesfromnature.org). Led by the co-organizers Rob Guralnick (University of Colorado, Boulder) and Austin Mast (Florida State University), the objective of the hackathon was to further enable public participation in the online transcription of biodiversity specimen labels and to produce new functionality and interoperability for Zooniverse's Notes from Nature and similar transcription tools. The event started by focusing on four areas of development, progressively addressed throughout the week: (1) interoperability between public participation tools and biodiversity data systems, (2) transcription quality assessment/quality control (QA/QC) and the reconciliation of replicate transcriptions, (3) integration of optical character recognition (OCR) into the transcription workflow, and (4) user engagement.
The Hackathon subgroup working on the integration of OCR.
For a while now, the BHL Technical Advisory Group (TAG) has been participating in iDigBio events, particularly in the Augmenting OCR Working Group, under Deb Paul and Bryan Heidorn's guidance. One of our BHL colleagues and TAG members, John Mignault from New York Botanical Garden, also participated last year in iDigBio's previous hackathon, which focused primarily on parsing OCR output into standard Darwin Core.
This time, we had a presentation by Cody Meche, an Agile trainer, who gave useful directions and tips on how to approach the hackathon so that we used our time as efficiently as possible while following an agile development process. Alex Thompson, from iDigBio, presented the digital resources that allow programmers to run a copy of the Notes from Nature web interface on their local machines using Vagrant and VirtualBox. Paul Kimberly also presented the Smithsonian transcription project (transcription.si.edu), and Yonggang Liu talked about iDigBio’s image ingestion tool, useful for getting images into the iDigBio cloud. Laura Whyte, Director of Citizen Science at Adler Planetarium, joined us virtually at the Hackathon to talk about Zooniverse and citizen science.
I also had a chance to briefly present our approach to OCR and transcription within our Purposeful Gaming project, funded by the Institute of Museum and Library Services (IMLS). Basically, we intended to leverage iDigBio's knowledge while tackling the challenge of improving the existing BHL OCR through Purposeful Gaming. Our interest was mainly twofold. On the one hand, we wanted to learn from the cumulative experience and skills that iDigBio brings together, and from the tools that are being adapted, built and used to generate the OCR and transcribe the text of their specimen labels, a challenge very similar to our own issues with OCR of books and journals in BHL. On the other hand, we wanted to learn from their workflow definition to improve the OCR generation process, the incorporation of their transcription outputs and the reconciliation of the existing versions into a single one (for example, read about using sequence aligning methods on this recent blogpost by Rob Guralnick), while at the same time taking advantage of the citizen scientists' help in an efficient manner that copes with the scale of the challenge.
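To give a feel for what reconciling several versions of the same text into a single one can look like, here is a minimal Python sketch. It is only an illustration under stated assumptions: it is not the sequence alignment method described in Rob Guralnick's post, nor anything built at the hackathon. The function names `similarity` and `pick_consensus` and the sample strings are hypothetical. It uses Python's standard `difflib.SequenceMatcher` to score how alike two strings are, and then simply keeps the replicate that agrees most with all the others.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a 0..1 similarity score between two strings."""
    return SequenceMatcher(None, a, b).ratio()


def pick_consensus(transcriptions: list[str]) -> str:
    """Pick the replicate that is, on average, most similar to the others.

    Hypothetical helper: a simple stand-in for a real reconciliation step,
    which would align the texts and merge them character by character.
    """
    if not transcriptions:
        raise ValueError("need at least one transcription")
    if len(transcriptions) == 1:
        return transcriptions[0]

    def mean_similarity(i: int) -> float:
        # Compare replicate i against every other replicate.
        others = [t for j, t in enumerate(transcriptions) if j != i]
        return sum(similarity(transcriptions[i], o) for o in others) / len(others)

    # On ties, max() keeps the first replicate with the best score.
    best = max(range(len(transcriptions)), key=mean_similarity)
    return transcriptions[best]


# Made-up example: two volunteers agree, one OCR pass is garbled.
replicates = ["Quercus alba L.", "Quercus alba L.", "Qvercus a1ba L."]
consensus = pick_consensus(replicates)
```

Choosing a whole replicate is the simplest possible design. A fuller approach would merge the best parts of each version, which is where sequence alignment comes in.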
High-level workflow proposed for the Purposeful Gaming project
Specifically, our short-term goals include incorporating recommendations and adapting existing tools developed by many of the participants in the past two hackathons. Jason Best's DarwinScore, Ben W. Brumfield's work on QC for Collaborative (Crowdsourced) Manuscript Transcription, and the QA/QC track developments that take outputs from the citizen science transcription products and assure the highest quality end result are just some of the many outputs and suggestions from these activities that can be incorporated into our process of improving the OCR texts in BHL. And if you have more ideas, I would definitely like to hear about them...
For more information on the recent CITScribe event, read iDigBio's blog post or Austin Mast's detailed blog post, or consult the Hackathon wiki page directly. You can access all information and results about iDigBio on their home page and follow our own BHL developments from the webpage of the Purposeful Gaming project, made possible in part by the Institute of Museum and Library Services [Grant number LG-05-13-0352-13].
William Ulate R.
BHL Global Coordinator and BHL-US/UK Technical Director
Missouri Botanical Garden