Data mining

David Morse, David King & Alistair Willis (OU)

Key resources
Willis, Alistair, Dave Roberts, David King, David Morse, Anton Dil, and Chris Lyal. "From XML to XML: The Why and How of Making the Biodiversity Literature Accessible to Researchers." In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10), edited by Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis and Mike Rosner. Valletta, Malta: European Language Resources Association (ELRA), 2010. http://www.lrec-conf.org/proceedings/lrec2010/index.html

Data mining refers to the extraction of useful information from published resources, fundamentally making published data re-usable.

To be effective, this requires isolating concepts, rather than words, and understanding the context in which they are used. Within ViBRANT we restrict our work to descriptions of organisms and immediately related information. Hence, when working on documents related to the cichlids of East Africa, we must not rely on a conventional gazetteer that identifies 'Lake George' as a town in North America (see the Wikipedia entry for Lake George as an example). Instead, we extract context from the document to guide the gazetteer to the correct recognition of 'Lake George' as a lake in contemporary Uganda.
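The idea of letting document context steer the gazetteer can be illustrated with a minimal sketch. This is not the ViBRANT implementation: the `GazetteerEntry` type, the toy `GAZETTEER` list, the hand-picked context terms and the simple term-overlap score are all assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GazetteerEntry:
    """One candidate place for a name (hypothetical structure)."""
    name: str
    feature_type: str
    country: str
    context_terms: frozenset  # words that tend to co-occur with this place


# Toy gazetteer with two senses of the same name. The context terms are
# hand-picked for illustration, not taken from any real gazetteer.
GAZETTEER = [
    GazetteerEntry("Lake George", "town", "United States",
                   frozenset({"new york", "adirondack", "united states"})),
    GazetteerEntry("Lake George", "lake", "Uganda",
                   frozenset({"uganda", "east africa", "cichlid", "lake edward"})),
]


def disambiguate(place_name, document_text, gazetteer=GAZETTEER):
    """Return the candidate whose context terms best match the document.

    Returns None when the name is unknown or when no candidate shares
    any context term with the document.
    """
    text = document_text.lower()
    candidates = [e for e in gazetteer if e.name.lower() == place_name.lower()]
    if not candidates:
        return None

    def score(entry):
        # Count how many of the entry's context terms occur in the text.
        return sum(1 for term in entry.context_terms if term in text)

    best = max(candidates, key=score)
    return best if score(best) > 0 else None


doc = "The cichlids of East Africa include species collected at Lake George."
match = disambiguate("Lake George", doc)
print(match.feature_type, match.country)
```

A production system would presumably draw context from a much richer model of the document than substring matching, but the principle is the same: the surrounding text, not the name alone, selects the gazetteer entry.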

Central to this process is the use of standardised vocabularies that list terms, their definitions and their relationships to other terms. Such vocabularies are being built within ViBRANT and elsewhere, for example AGROVOC. Ultimately these processes can be used to form linked data and realise the concept of the semantic web.
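As a rough sketch of what such a vocabulary might look like in practice, the following represents terms, definitions and relationships as a small data structure and flattens them into subject-predicate-object triples of the kind used for linked data. The `Term` class, the three example terms and their definitions are illustrative assumptions; the SKOS-style predicate names are used only as familiar labels and do not imply that ViBRANT publishes its vocabularies this way.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """A vocabulary term with a definition and links to other terms."""
    label: str
    definition: str
    broader: list = field(default_factory=list)  # keys of more general terms
    related: list = field(default_factory=list)  # keys of associated terms


# Illustrative entries only; not taken from any ViBRANT vocabulary.
VOCABULARY = {
    "elytron": Term("elytron", "Hardened forewing of a beetle.",
                    broader=["forewing"]),
    "forewing": Term("forewing", "Anterior wing of an insect.",
                     broader=["wing"]),
    "wing": Term("wing", "Appendage used for flight."),
}


def to_triples(vocabulary):
    """Flatten the vocabulary into (subject, predicate, object) triples."""
    triples = []
    for key, term in vocabulary.items():
        triples.append((key, "skos:prefLabel", term.label))
        triples.append((key, "skos:definition", term.definition))
        for parent in term.broader:
            triples.append((key, "skos:broader", parent))
        for other in term.related:
            triples.append((key, "skos:related", other))
    return triples


def ancestors(vocabulary, key):
    """Follow 'broader' links transitively from the given term."""
    seen = []
    stack = list(vocabulary[key].broader)
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.append(parent)
            if parent in vocabulary:
                stack.extend(vocabulary[parent].broader)
    return seen


print(ancestors(VOCABULARY, "elytron"))
for triple in to_triples(VOCABULARY):
    print(triple)
```

Because relationships are explicit, a search for a general term such as 'wing' can be expanded to documents that mention only the narrower term 'elytron'.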

Figure: Page taken from 'Biologia Centrali-Americana'.
Figure: Colour plate of beetles taken from 'Biologia Centrali-Americana'.

To progress our work in this area, we are collaborating with the INOTAXA project to build a large corpus of biodiversity literature to facilitate the development and testing of data mining tools. This corpus will be a major benefit to all who work in this area as currently only individual documents are available.

An example of semantic markup applied to a document is shown in the screenshot below. In this case, annotations of taxon rank have been applied to a page in one of the Biologia Centrali-Americana volumes on birds. The page was marked up as part of the testing of our toolsets for creating and using the corpus. While not in itself immediately useful within a taxonomist's workflow, the screenshot shows that we are achieving accurate, automatic identification of semantic content in historic literature, thereby rendering that content searchable and re-usable.

Figure: Screenshot of marked up page from 'Biologia Centrali-Americana'.