# Notes on Future Citations #

## Executive Summary ##

A useful couple of days in which we:

1. explored the nature of citations
1. compared across datasets to highlight anomalies - for example, CrossRef's content is skewed from 2010 onwards because it depends on publishers' contributions, and from then on publishers *do* contribute, making sure that all this new material references their own publications correctly; however, this work skews the number of references per year, the years 2010 & 2011 being noticeably higher than previous years
1. investigated erroneous citations - for example, the correct reference "Kimura, M. (1980) A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16, 111–120" has a persistent variant form with *model* replacing *method*. In Google Scholar *method* has 11,614 citations and *model* has 1; in MS Academic Research *method* has 5,725 citations and *model* has 165
1. talked with people from other disciplines who are familiar with merging citations for documents with citations for data; DOIs and [DataCite](http://www.datacite.org/) are the preferred solution for unique identifiers
1. had good networking opportunities
1. gained potential access to more biodiversity citation references
1. discussed relevant toolsets being developed elsewhere, e.g. at least two people are working on PDF metadata extraction
1. linked up with Paul Stokes, JISC programme manager on citations; he is starting to look at bringing linked open data into the programme

## Background information ##

Thursday 27 and Friday 28 September 2012
Aston Business Centre, University of Aston

[Event page](http://devcsi.ukoln.ac.uk/upcoming-events/future-citations-hack-days/)
[Ideas page](http://devcsi.ukoln.ac.uk/upcoming-events/future-citations-hack-days/future-citations-hack-days-ideas-page/)

## Day 1 ##

### People ###

*Mahendra - UKOLN* event organiser; heads up [devcsi](http://devcsi.ukoln.ac.uk/) activities

*Max - MH Consulting* independent consultant on the JISC project looking at citations, whose work inspired this hackathon

*me - OU*

*Petr - OU* working on [CORE](http://core-project.kmi.open.ac.uk/); developing a tool to extract citation data and create a citation network to assist metrics; looking for use cases for mining citation networks (first day only)

*Catherine - Science Research Council* interested in linking publications to data; looking at [DataCite](http://www.datacite.org/), which is led by the BL and gives DOIs to data; just interested to see what other people do (first day only)

*Karl - CrossRef* mentioned CrossRef is looking at linking DOIs to ORCID ids once ORCID is up and running; has open source tools, e.g. to extract references from PDFs, though not to parse the references afterwards

*Sheng - Uni Birm* PhD student researching Weibo and censorship; colleagues are looking at how people cite; happy to get feedback on actual use

*Edward - Faculty of 1000* F1000 is a business whose product is a database of medical evaluations; looking to relate pieces of research, which is helped by scraping references; he is new to the job (four months), so wants to get a better idea of what work is being done in the area

*Geoff* observer; on the same JISC project as Max (first day only)

*Paul - JISC* programme manager for the citations programme; also looking at linked data

*Tanya - Uni Ox* working on the open citations dataset, which is downloadable; looking at timeline visualisation (developing something in D3?) or semantic browsing of the dataset

*Tim - Uni Soton* lead dev on the EPrints software; has tools from his PhD, again for reference extraction, but never developed further

### Brainstorm ###

1. big picture - what are citations and how do citations fit in context
2. work with data, comparing datasets
3. visualise data

### Working on ideas ###

1. With Karl and Petr, comparing their datasets - can we generalise about citation data, and what are its properties?
1. With Geoff, Paul and Catherine, discussing *What is a citation?* - the discussion extended to blur the lines between documents and data: are documents data (some disciplines, e.g. archaeology, think they are and that they even supersede original artefacts as citable evidence), and where does metadata fit in - do we need to describe both the data and the metadata as distinct entities, because both are usable in their own right?

## Day 2 ##

### day one round up ###

*Karl/Emma/me* - comparing datasets; similar patterns of citation across datasets; future work for Petr: add DOIs to CORE

*Max* - overview map

*Paul* - floating

*Edward/Tim/Tanya/Sheng* - visualising, learning node.js, SPARQL queries

*me* - I cut in to expand on the discussion: documents + data; data + metadata; DOIs for all vs LSIDs vs URLs (a DOI metadata lookup sketch appears at the end of these notes)

### working on ideas ###

Karl continued to prepare citations-by-year data (coding in Ruby), while Emma looked at applying the earth-mover algorithm (think Levenshtein distance for graphs) using an R package; a sketch of this kind of comparison appears at the end of this section. Sheng looked at drawing timelines using an R package, and I looked at the Google Charts API. Then Karl and I looked at matching DOIs to InChIs, but agreed that was a step too far for the time remaining at this workshop, so we looked at D3 for visualisation instead.

We found that the CrossRef data is affected by its creation from publishers' submitted material: the vast increase in references from 2010 onwards distorts the data, so we selected an earlier subset of the data to avoid this spike in the number of references. We also found that the reference rate by year does not necessarily follow the same shape as in other reference data sources, i.e. MS Academic Research, Google Scholar and CORE, but equally they can still share a common shape. This suggests that:

1. we must first understand the data to confirm there are no anomalies
1. once anomalies are allowed for, the datasets are comparable
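The workshop comparison used an R package; as a rough equivalent, here is a minimal Python sketch using SciPy's one-dimensional Wasserstein (earth-mover's) distance to compare two per-year reference-count profiles. The source names and yearly counts are invented for illustration, not taken from any of the datasets above.

```python
import numpy as np
from scipy.stats import wasserstein_distance

# Hypothetical per-year reference counts from two sources (invented numbers,
# restricted to a pre-2010 window to avoid the CrossRef spike noted above).
years = np.arange(2000, 2010)
source_a = np.array([120, 150, 170, 210, 260, 300, 340, 410, 470, 520])
source_b = np.array([100, 140, 180, 200, 240, 320, 330, 400, 480, 510])

# wasserstein_distance treats the weights as a distribution (it normalises
# them internally), so raw counts can be passed directly. The result is the
# minimum "work" needed to reshape one yearly profile into the other: small
# values mean the two sources share a common shape.
d = wasserstein_distance(years, years, u_weights=source_a, v_weights=source_b)
print(f"earth-mover's distance between yearly profiles: {d:.3f}")
```

Because the weights are normalised, this compares the *shape* of the yearly profiles rather than absolute counts, which is what lets differently sized datasets be compared at all.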
### good example problem ###

The reference should be:

Kimura, M. (1980) A simple method for estimating evolutionary rate of base substitutions through comparative studies of nucleotide sequences. J. Mol. Evol., 16, 111–120.

But there is also a persistent variant form with *model* replacing *method*. In Google Scholar *method* has 11,614 citations and *model* has 1; in MS Academic Research *method* has 5,725 citations and *model* has 165. (A sketch of how such near-duplicate references could be flagged appears at the end of these notes.)

### YouTube from James Jardine ###

**Exploring Citations and Themes with Qiqqa**

James couldn't be here, so he sent a YouTube video instead. He developed Qiqqa over the last three years as part of his PhD. The video was slightly disappointing - it was a sales pitch - but a thorough introduction. Qiqqa examines the first page of a PDF to extract the title, then searches Google Scholar using the title and lets you import the BibTeX reference; in automatic mode it matches the author retrieved in the BibTeX against the PDF, and if there is no match it halts. It can then build cross references using Google Scholar, so it is *heavily* dependent on Google Scholar. I was already familiar with the product after investigating it earlier this year. If we want to follow up, Mahendra has contact details, though I thought we do too.
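Closing note on the *method*/*model* problem above: variant references like these are near-identical strings, so even a simple similarity ratio flags them. A minimal sketch using only Python's standard library; the 0.9 threshold is an arbitrary assumption, not something tested at the workshop.

```python
from difflib import SequenceMatcher

correct = ("Kimura, M. (1980) A simple method for estimating evolutionary "
           "rate of base substitutions through comparative studies of "
           "nucleotide sequences. J. Mol. Evol., 16, 111-120")
variant = correct.replace("method", "model")

# Similarity ratio in [0, 1]; near-identical variant citations score very
# close to 1, so a high threshold catches them while leaving genuinely
# different references alone.
ratio = SequenceMatcher(None, correct.lower(), variant.lower()).ratio()
if ratio > 0.9:
    print(f"probable variant reference (similarity {ratio:.3f})")
```

In practice one would probably first group references by author and year to limit the number of pairwise comparisons, since comparing all pairs across a whole dataset is quadratic.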
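And on the DOIs-for-all discussion: one attraction of DOIs as the unique identifier is that doi.org supports content negotiation for both CrossRef (document) and DataCite (data) DOIs, so the same request returns clean metadata either way. A minimal sketch using the `requests` library; the DOI shown is assumed to resolve to the Kimura (1980) paper and any DOI can be substituted.

```python
import requests

def fetch_bibtex(doi: str) -> str:
    """Resolve a DOI to a BibTeX record via doi.org content negotiation.

    Works for both CrossRef (documents) and DataCite (data) DOIs, which
    fits the documents + data discussion from the workshop.
    """
    resp = requests.get(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/x-bibtex"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.text

# Example: assumed DOI for the Kimura (1980) reference discussed above.
print(fetch_bibtex("10.1007/BF01731581"))
```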