Meet our international partners to extract data from BHL
Mining Biodiversity (the MiBio project) is one of the winning projects of the third round of the transatlantic Digging Into Data Challenge, a competition that promotes the development of innovative computational techniques that can be applied to big data in the humanities and social sciences. The project is an international collaboration between the National Centre for Text Mining (UK), Missouri Botanical Garden (US) and Dalhousie University’s Big Data Analytics Institute (Canada) and Social Media Lab (Canada), along with colleagues from the Encyclopedia of Life and the Smithsonian Institution.
We will integrate novel text mining methods, visualization, crowdsourcing and social media into the BHL to provide a semantic search system that allows users to explore search results according to multiple information dimensions or facets. The goal is to transform BHL into a next-generation social digital library resource that facilitates the study and discussion (via social media integration) of legacy science documents on biodiversity by a worldwide community.
Figure: Relations between the Work Packages of the project.
The project has five major components, covered in 9 Work Packages (WP):
- Automatic correction of OCR errors using Google n-grams, by our colleagues at the Big Data Analytics Institute (WP2).
- Crowdsourcing the annotation of semantic metadata (concepts and events) in legacy texts (WP5).
- Automatic extraction of metadata (terms, concepts and significant events) and tracking of their change over time (WP3 & WP4) to facilitate semantic search (WP6), implemented with NaCTeM.
- Interactive visualization techniques to manage the search results, in collaboration with Dalhousie (WP7).
- Design of a social media layer as an environment for interaction and collaboration on science, education, awareness and outreach, led by our colleagues at the Social Media Lab (WP8).
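To give a feel for the first component, here is a minimal sketch of n-gram based OCR correction. It is not the method used by the Big Data Analytics Institute, which this post does not describe. The tiny `UNIGRAMS` and `BIGRAMS` tables are made-up stand-ins for Google n-gram counts, and all function names are hypothetical.

```python
from collections import Counter

# Toy counts standing in for Google n-gram frequencies (made-up values).
UNIGRAMS = Counter({"the": 1000, "of": 900, "plant": 200, "leaves": 120, "leases": 30})
BIGRAMS = Counter({("the", "leaves"): 50, ("the", "leases"): 2, ("leaves", "of"): 40})

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    """Return all strings one deletion, substitution or insertion away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {a + b[1:] for a, b in splits if b}
    replaces = {a + c + b[1:] for a, b in splits if b for c in ALPHABET}
    inserts = {a + c + b for a, b in splits for c in ALPHABET}
    return deletes | replaces | inserts


def correct_token(prev, token):
    """Pick the known word closest to token, preferring frequent bigrams with prev."""
    if token in UNIGRAMS:
        return token  # already a known word, leave it alone
    candidates = [w for w in edits1(token) if w in UNIGRAMS]
    if not candidates:
        return token  # nothing plausible found, keep the OCR output
    # Rank by bigram count with the previous word, then by unigram count.
    return max(candidates, key=lambda w: (BIGRAMS[(prev, w)], UNIGRAMS[w]))


def correct_text(text):
    """Correct a whitespace-tokenized, lowercased text from left to right."""
    corrected, prev = [], ""
    for token in text.lower().split():
        fixed = correct_token(prev, token)
        corrected.append(fixed)
        prev = fixed
    return " ".join(corrected)
```

With these toy counts, `correct_text("the leaues of the plant")` returns `"the leaves of the plant"`: both "leaves" and "leases" are one edit away from "leaues", but "the leaves" is the more frequent bigram.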
Figure: Manchester Town Hall at Albert Square, UK.
The National Centre for Text Mining (NaCTeM)
NaCTeM has developed text mining services based on a number of generic natural language processing tools, such as Argo, its Web-based workflow construction platform for text mining. Argo is implemented on top of the OASIS Unstructured Information Management Architecture (UIMA) standard for interoperability among information processing components.
Figure: Example of an Argo workflow that automatically extracts species and anatomical features using entity tagging components that NaCTeM has developed for this purpose.
Besides named entity recognizers, the workflows also include linkers, which automatically link names or concepts found in text to entries in external vocabularies via unique identifiers, using a string similarity method.
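As a rough illustration of what a linker does, the sketch below matches a mention against a small vocabulary using the similarity ratio from Python's standard `difflib` module. This is an assumption for illustration only: the post does not say which similarity measure NaCTeM's linkers use, and the identifiers, the `link` function and the 0.85 threshold are all made up.

```python
from difflib import SequenceMatcher

# Hypothetical external vocabulary: identifier -> preferred name (IDs are made up).
VOCABULARY = {
    "taxon:0001": "Quercus alba",
    "taxon:0002": "Quercus rubra",
    "taxon:0003": "Acer rubrum",
}


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def link(mention, vocabulary=VOCABULARY, threshold=0.85):
    """Return the identifier of the most similar entry, or None if none is close enough."""
    best_id, best_score = None, 0.0
    for entry_id, name in vocabulary.items():
        score = similarity(mention, name)
        if score > best_score:
            best_id, best_score = entry_id, score
    return best_id if best_score >= threshold else None
```

Here a mention with an OCR slip such as `link("Quercus alha")` still resolves to `"taxon:0001"`, while an unrelated name such as `link("Homo sapiens")` falls below the threshold and returns `None`.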
Argo allows workflows to be deployed as Web services so that external applications can invoke them, just as BHL currently invokes name-finding Web services to find the taxa within the text.
Figure: At NaCTeM commenting on the tools for the project. L to R: Mr. John McNaught, Dr. Anatoliy Gruzd, Dr. Sophia Ananiadou, Ms. Riza Batista-Navarro, Mr. Georgios Kontonatsios and Mr. Paul Thompson. Missing from the photo: Dr. Rafal Rak, Claudiu Mihăilă, and Dr. Ioannis Korkontzelos.
NaCTeM has also done substantial work on event extraction, i.e., the extraction of associations or interactions between concepts or entities. This experience will help us identify and extract the types of events that scientists, historians and other scholars have long wanted to extract from the BHL corpus (such as behavior, habitat, trophic relations and geographic range, among others) for our own named entities: species, people and places throughout time. Finally, NaCTeM's extensive experience developing customized semantic search engines such as KLEIO, ISHER and Europe PubMed Central EvidenceFinder will help us provide enhanced semantic search over the BHL corpus text, allowing users to explore results according to multiple information dimensions or facets.
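To make "exploring results by facets" concrete, here is a small sketch of faceted counting and refinement over annotated search hits. It does not reflect how KLEIO, ISHER or the planned BHL search are implemented. The hits, facet names and values are made up for illustration.

```python
from collections import Counter

# Made-up annotated hits; page IDs, facet names and values are illustrative only.
HITS = [
    {"page": "p1", "species": ["Quercus alba"], "place": ["Missouri"], "decade": ["1890s"]},
    {"page": "p2", "species": ["Quercus alba", "Acer rubrum"], "place": ["Ontario"], "decade": ["1890s"]},
    {"page": "p3", "species": ["Acer rubrum"], "place": ["Missouri"], "decade": ["1910s"]},
]


def facet_counts(hits, facet):
    """Count how many hits carry each value of the given facet."""
    counts = Counter()
    for hit in hits:
        counts.update(set(hit.get(facet, [])))  # count each value once per hit
    return counts


def refine(hits, facet, value):
    """Keep only the hits annotated with the given value for the given facet."""
    return [hit for hit in hits if value in hit.get(facet, [])]
```

A user could start from `facet_counts(HITS, "species")` to see which species appear in the results, then call `refine(HITS, "place", "Missouri")` to narrow the list and recompute the counts for the remaining facets.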
Additional information on the tools and services can be found at:
- Rak, R., Rowley, A., Black, W.J. and Ananiadou, S. (2012). Argo: an integrative, interactive, text mining-based workbench supporting curation. Database: The Journal of Biological Databases and Curation.
- Kolluru, B., Nakjang, S., Hirt, R. P., Wipat, A. and Ananiadou, S. (2011). Automatic extraction of microorganisms and their habitats from free text using text mining workflows. Journal of Integrative Bioinformatics, 8(2), 184.
Or read more details about these or some of the other service systems and tools that NaCTeM has developed.
The Social Media Lab
On Monday, February 17th, 2014, our group was invited to attend a talk by our colleague in the project, Dr. Anatoliy Gruzd, as part of the third Social Media Workshop, which covered the outreach and impact aspects of the International Centre for Social Media Research at Manchester University. He presented the research done at the Dalhousie University Social Media Lab: how to make sense of the huge quantity of data, and the new methods for collecting information when studying online social networks through analysis and visualization.
Figure: Social Media Lab at Dalhousie University, Canada. © All Rights Reserved
For more information, take a look at:
- Gruzd, A. & Haythornthwaite, C. (2013). Enabling Community through Social Media. Journal of Medical Internet Research 15(10):e248. [DOI:10.2196/jmir.2796]
I hope this gives an idea of the work ahead and a better sense of what the project attempts to do and how it aims to do it. We will keep you informed as results become available. In the meantime, let us know: how do you envision yourself using BHL as a social digital library? What information would you like to track, and how would you like to access it? Which vocabularies would you need to see included, and what types of named entities and associations would you want tagged in the BHL corpus?
William Ulate
BHL Technical Director
Missouri Botanical Garden
This project is made possible in part by a grant from the Institute of Museum and Library Services [Grant number LG-00-14-0032-14].