WP7: Biodiversity literature access and data mining

Workpackage 7 will be led by the UK’s Open University, whose staff are experts in data mining biodiversity literature, and will engage partners at the Karlsruhe Institute of Technology, Pensoft Publishing (covering data formats in contemporary literature) and the Natural History Museum, London. Key aspects of this work include developing i) the infrastructure to support the creation and ongoing maintenance of community-constructed digital bibliographies within the Scratchpad virtual research environment; ii) a robust, federated search mechanism with context-sensitive ranking of search results for biodiversity literature; iii) web services to recover certain content elements, such as taxonomic names, author names and localities, from within text blocks; iv) the means to identify structural elements (text blocks) of different types within published documents; and v) the infrastructure to support annotation and correction of documents by citizen scientists and others. These research activities build on the EU-funded INOTAXA project at the Natural History Museum, London and the Plazi project run from the Karlsruhe Institute of Technology.

WP7 is the key link with the agINFRA project. Slides from the agINFRA kick-off meeting are attached below.

M7.27 – Publish ViBRANT NLP corpus

To develop, refine and assess our data mining work, and the similar work of others aiming to mine biodiversity texts, a substantial gold standard corpus is required. None currently exists. This milestone addresses the community need.

M7.26 – Workpackage software packaged

For sustainability and to formalise our contractual agreement with the EC, we need to package the software in industry-standard formats, following accepted coding conventions and using version control. The software will be deposited in the Scratchpads git repository.

M7.25 Enhance reference parser to parse references in bulk uploads

Opening RefBank to uploads from the community requires enhancing the reference parser so that it can cope with the variety of formats that will be presented. The parser must handle not only well-structured references that adhere to reference format standards, but also references containing common mistakes and formatting errors. M7.25 builds on the functionality that will be developed in M7.21, M7.22, M7.23 and M7.24.

Actually completed earlier, but only got around to writing the report today, Tuesday 30 October 2012. Dauvit

M7.24 Upload service for complete bibliographies

In order to allow individuals to upload bibliographies that they have compiled, a bulk upload service is required. This milestone will address that requirement. M7.24 requires the functionality that will be developed in M7.21, M7.22 and M7.23.

This milestone was defined in M7.15.

----

Milestone original delivery date was Friday 29 June 2012.
Brought forward to Friday 1 June in line with changing emphasis within WP7 to favour progress on bibliographic reference handling at the expense of mark up processing. See milestones M7.16 and M7.23.

M7.23 Extend RefBank import routines to support other widely used bibliographic formats, e.g. BibTeX, RIS, etc.

As the milestone states, RefBank needs extending so that import from other widely used bibliographic formats, such as BibTeX and RIS, is supported. This will facilitate populating the database by bulk upload of personal bibliographies. Note that this milestone relies on there being a working attribution (origin) mechanism (see Milestone M7.21) if people who upload their bibliographies are to be credited. Bulk upload will be implemented by Milestone M7.24.
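To make the import requirement concrete, here is a minimal sketch of reading RIS records in Python. It is not the RefBank import routine; the function name parse_ris and the returned structure are assumptions made for illustration. It relies only on the general RIS layout: a two-character tag, two spaces, a hyphen and a value, with each record opened by a TY line and closed by an ER line.

```python
import re
from collections import defaultdict

# A RIS line is a two-character tag, two spaces, a hyphen, then the value.
RIS_LINE = re.compile(r"^([A-Z][A-Z0-9])  - ?(.*)$")


def parse_ris(text):
    """Return a list of records, each a dict mapping RIS tags to lists of values.

    Hypothetical helper for illustration only; lines that do not look like
    RIS tag lines are skipped rather than treated as errors.
    """
    records = []
    current = None
    for raw in text.splitlines():
        match = RIS_LINE.match(raw.rstrip())
        if match is None:
            continue  # blank or malformed line
        tag, value = match.groups()
        if tag == "TY":
            # TY opens a new record
            current = defaultdict(list)
            current["TY"].append(value)
        elif tag == "ER":
            # ER closes the record currently being built
            if current is not None:
                records.append(dict(current))
            current = None
        elif current is not None:
            # Repeated tags (e.g. several AU lines) accumulate in order
            current[tag].append(value)
    return records
```

Keeping every tag as a list preserves repeated fields such as multiple authors, which a later step could map onto whatever internal reference model is in use.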

This milestone was defined in M7.15.

----

Milestone original delivery date was Friday 29 June 2012.

M7.22 Import bibliographies from Pensoft to RefBank

Develop the infrastructure to import bibliographic information automatically from Pensoft to RefBank. Typically, this will be the metadata about each publication and the bibliography from the end of the paper. Note that this milestone relies on there being a working attribution (origin) mechanism - see Milestone M7.21.

This milestone was defined in M7.15.

M7.21 Add metadata to cover origin of bibliographies

Metadata and processing code added to RefBank to cover origin of bibliographies.

Use cases:
1) Bibliographies taken from a publication, in which case the origin is the bibliographic details of the original publication or (possibly) the RefBank ID of the original publication, which would be accurate but not user-friendly.
2) Bibliographies contributed by a particular author, in which case, attribution of the bibliography is appropriate.
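The two use cases above could be modelled along the following lines. This is a sketch under stated assumptions, not the RefBank schema: the class name Origin, its field names and the credit_line method are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Origin:
    """Hypothetical origin record for an uploaded bibliography."""

    kind: str  # "publication" (use case 1) or "contributor" (use case 2)
    source_citation: Optional[str] = None    # bibliographic details of the source paper
    source_refbank_id: Optional[str] = None  # alternative: ID of the source paper
    contributor: Optional[str] = None        # person to credit

    def __post_init__(self):
        # Check that the fields required by the chosen use case are present.
        if self.kind == "publication":
            if not (self.source_citation or self.source_refbank_id):
                raise ValueError("publication origin needs a citation or an ID")
        elif self.kind == "contributor":
            if not self.contributor:
                raise ValueError("contributor origin needs a contributor name")
        else:
            raise ValueError(f"unknown origin kind: {self.kind!r}")

    def credit_line(self) -> str:
        """Return a human-readable attribution string."""
        if self.kind == "contributor":
            return f"Contributed by {self.contributor}"
        # Prefer the readable citation; fall back to the ID.
        return f"Extracted from {self.source_citation or self.source_refbank_id}"
```

Preferring the citation over the ID in credit_line reflects the point made in use case 1 that an ID is accurate but not user-friendly.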

M7.19 - Review of pilot of reference de-duplication software

Working with community contributed references to RefBank means that our repository will have a large number of duplicate references arising from:

D7.3 - Literature search

This, the third and final deliverable of workpackage seven, was originally conceived of as an "Enhanced search facility to locate concepts based on linguistics and proximity rules." It was superseded during the project by the need to provide a breadth of coverage to enable a bibliography of life.

M7.20 - Workpackage software documentation produced

To encourage uptake and use, and to support future enhancement and maintenance after completion of ViBRANT, all software produced by the workpackage must be fully documented to a consistent quality, informed by the appropriate standards.

M7.16 - Mark-up modules delivering outline mark-up

E.g. for article boundaries, treatment boundaries, headings and authors

----

Milestone original delivery date was Thursday 31 May 2012.

Deferred to Friday 29 June 2012 in line with changing emphasis within WP7 to favour progress on bibliographic reference handling at the expense of mark up processing. See milestones M7.23 and M7.24.

Deferred again - see Rescheduling below.

M7.17 - Review of pilot mark up processes within the Scratchpad infrastructure

Originally planned for 31 July 2012. However, following confirmation at ManComm7 to pull forward bibliographic work in preference to data mining work, this milestone was deferred. The delivery date was further affected by a re-plan to include more enhancements to RefBank during year two than originally envisaged.

Eventually re-scheduled in line with revised date for M7.16 (http://vbrant.eu/content/m716-mark-modules-delivering-outline-mark). M7.16 was completed 28 September 2012.

D7.2 - Mark-up modules

This deliverable involved extending and integrating the GoldenGATE interactive mark-up tool (http://plazi.org/?q=GoldenGATE) within the Scratchpad infrastructure. GoldenGATE is our tool of choice because it has the mechanisms for handling the stylised structures common in taxonomic literature. Should integration of the complete tool prove difficult, GoldenGATE’s modular structure will permit it to be decomposed so that individual modules can be integrated into the Scratchpad infrastructure or deployed as web services.

M7.18 - First integration phase complete

Notes

This milestone represents implementation of sustainable links between the bibliography service and mark-up services developed by this work package, and Scratchpads.

The bibliography service integration is achieved through a Scratchpads-to-RefBank harvester program.

The mark-up modules integration is achieved through the standard OBOE interface.

D7.1 - Community contributed bibliography

A functional community-contributed bibliography with unique identifiers at publication unit level(s) and links to publicly available digital copies where possible.

The report contains a description of the community constructed bibliography, covering:

  • architectural and implementation issues,
  • a brief description of the functionality of the bibliography,
  • the way forward, including architectural and functional developments

The prototype is hosted at http://plazi2.cs.umb.edu:8080/RefBank/search

M7.15 - Define further milestones in the light of usage and feedback

Additional milestones defined to monitor and break down work programme for year 2. Milestones added to the list of milestones and deliverables on the ViBRANT website.

D6.3 - Data publication workflow

The present deliverable describes several workflows and tools developed or upgraded by Pensoft in the course of the ViBRANT project.

Attachment                                        Size
agINFRA_ViBRANT_Roberts.pdf                       1.79 MB
WP4_OCR_Morse.pdf                                 1.4 MB
WP4_agStor_Morse.pdf                              612.6 KB
WP4_Scratchpads_Roberts.pdf                       4.32 MB
WP5_BHL_Morse_opt.pdf                             1.07 MB
WP5_Policies_Morse.pdf                            1.32 MB
Report_BibliographyDataFormatsAndServices.odt     125.35 KB