The team focused primarily on how to bring the algorithm work to a close. The IMA developed four algorithms for identifying which pages in the BHL corpus contain images. Those algorithms were run against a gold standard set of 40k pages to determine their accuracy and performance. Two of the four were deemed useful, with accuracy ratings above 80%. An environment is now being set up to run those two algorithms efficiently across the entire corpus of 40+ million pages. The team also discussed how to refine the functionality of the Macaw tool, which will be used to broadly classify the page images into groups such as illustrations, photographs, diagrams/tables, and maps. Further work on the schema included efforts to create an application profile based on the VRA Core and Darwin Core element sets. Finally, the team wrapped up the meeting by investigating how to bulk upload large numbers of images to Flickr and Wikimedia Commons, and how to then bring the resulting crowdsourced metadata back into the BHL portal.
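For readers curious how a comparison against a gold standard can work, here is a minimal sketch in Python. It is not the project's actual code. It assumes that accuracy means the fraction of gold standard pages an algorithm labels correctly (the post does not say how accuracy was measured), and every name in it (`accuracy`, `select_useful`, the page IDs, the toy detectors) is hypothetical.

```python
from typing import Callable, Dict

# Hypothetical representation: the gold standard maps a page ID to True
# when a human reviewer marked that page as containing an image.
GoldStandard = Dict[str, bool]
# A detector is any function that takes a page ID and guesses True/False.
Detector = Callable[[str], bool]


def accuracy(detector: Detector, gold: GoldStandard) -> float:
    """Fraction of gold standard pages the detector labels correctly."""
    if not gold:
        raise ValueError("gold standard set is empty")
    correct = sum(
        1 for page_id, has_image in gold.items()
        if detector(page_id) == has_image
    )
    return correct / len(gold)


def select_useful(
    detectors: Dict[str, Detector],
    gold: GoldStandard,
    threshold: float = 0.80,  # assumption: "useful" means accuracy above 80%
) -> Dict[str, float]:
    """Score every detector and keep those whose accuracy exceeds the threshold."""
    scores = {name: accuracy(det, gold) for name, det in detectors.items()}
    return {name: score for name, score in scores.items() if score > threshold}


# Toy data standing in for the 40k-page gold standard set.
toy_gold: GoldStandard = {
    "p1": True, "p2": False, "p3": True, "p4": False, "p5": True,
}
toy_detectors: Dict[str, Detector] = {
    "always_yes": lambda page_id: True,           # says every page has an image
    "lookup": lambda page_id: toy_gold[page_id],  # agrees with the gold standard
}
useful = select_useful(toy_detectors, toy_gold)
```

With the toy data, `always_yes` gets three of five pages right and is dropped, while `lookup` is kept. A real evaluation would likely also track precision, recall, and running time, since the post mentions performance as well as accuracy.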
Notes and action items from the meeting can be found on the Art of Life site at http://biodivlib.wikispaces.com/Art+of+Life.
Pictured left to right: Kyle Jaebker, Joel Richard, Mike Lichtenberg, Mike Blomberg, William Ulate, Doug Holland, Guarav Vaidya, Chris Freeland. Not pictured: Chuck Miller, Rob Guralnick.
Questions about the project’s goals, progress, and technology can be directed to the Principal Investigator, Trish Rose-Sandler, at trish.rose-sandler [at] mobot.org. The Art of Life project runs from May 1, 2012 through April 30, 2014 and is generously funded by the National Endowment for the Humanities.
Any views, findings, conclusions, or recommendations expressed in this blog post do not necessarily represent those of the National Endowment for the Humanities.