M7.19 - Review of pilot of reference de-duplication software

Because RefBank works with community-contributed references, our repository will contain a large number of duplicate references arising from:

  • letting users load textual, as opposed to marked-up, references, which means that a reference can be entered in any style (Harvard, Chicago, etc.) as laid out in the published source.
  • near-identical references varying only by a comma or space, caused by the individual stylistic quirks of contributors.
  • near-identical references varying only by typographical errors, whether present in the original source or introduced later through re-keying the reference.

We consider it important to RefBank’s success that there are as few barriers as possible to user contributions: users should simply upload references as they are, without having to reformat them to suit RefBank. This design decision leads to the problem of duplicate references; however, we consider it preferable to resolve the duplicates within RefBank rather than to prevent these references from being loaded at all, which would hinder the workflow of our potential contributors.

De-duplication remains an unresolved problem in bibliographic reference management. We will need to develop a tool that automatically identifies the canonical form of a reference from among the many variants loaded into RefBank. Our approach is based on graph theory: each reference forms a node in a graph, and the emergent centroid is taken as the canonical form. Each reference will be decomposed into its component parts so that the most appropriate similarity algorithm can be applied to each part when calculating the centroid, for example Jaro-Winkler for author names. The canonical form will be returned in future searches; the other references will not be deleted but simply marked as unavailable to general searches. Manual curation will be enabled so that a user can override RefBank’s canonical form if necessary.
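
The centroid idea can be illustrated with a minimal sketch. It is not RefBank's implementation: it assumes the duplicates have already been grouped into one cluster, it treats the "centroid" as the reference with the highest total similarity to all the others, and it uses Python's standard-library difflib ratio on whole strings as a stand-in for the per-field algorithms (such as Jaro-Winkler) mentioned above. The function names and the sample references are hypothetical.

```python
from __future__ import annotations

from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1] (difflib ratio, NOT Jaro-Winkler)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def canonical_index(cluster: list[str]) -> int:
    """Index of the reference most similar, in total, to all the others.

    Assumption: the cluster is a fully connected graph whose edge weights
    are pairwise similarities; the best-connected node is the centroid.
    """
    if not cluster:
        raise ValueError("cluster must contain at least one reference")
    scores = [
        sum(similarity(ref, other) for j, other in enumerate(cluster) if j != i)
        for i, ref in enumerate(cluster)
    ]
    # max() returns the first index on ties, so the earliest reference wins.
    return max(range(len(cluster)), key=scores.__getitem__)


def resolve(cluster: list[str]) -> tuple[str, list[str]]:
    """Return (canonical form, variants kept but hidden from general search)."""
    best = canonical_index(cluster)
    hidden = [ref for i, ref in enumerate(cluster) if i != best]
    return cluster[best], hidden


# Invented sample data: two identical entries and one with a re-keying typo.
refs = [
    "Smith, J. (2001) Ants of Borneo. Zootaxa 12: 1-10.",
    "Smith, J. (2001) Ants of Borneo. Zootaxa 12: 1-10.",
    "Smith, J. (2001) Ants of Borneo. Zotaxa 12: 1-10.",
]
canonical, hidden = resolve(refs)
print(canonical)
print(len(hidden), "variants retained but hidden")
```

In this sketch nothing is deleted: the non-canonical variants are returned alongside the canonical form, mirroring the intention that they be kept but excluded from general searches. A manual override would simply replace the index chosen by canonical_index with one selected by a curator.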