We have found some of the information in the cited-by data is inaccurate. The problem arises when crossref asserts a DOI, because sometimes these are just wrong. The problem centers on the fact that articles sometimes have multiple versions on the web, and these versions get cited differently depending on the context of the citation.
One reason multiple versions might exist is that a publisher sometimes places a page limit on what they accept, so authors post a longer, more complete version on a preprint server or their personal home page. It would be wrong to associate the DOI of the shorter version with the longer version if the longer version is what was cited. Similarly, these articles are produced at different times, and history can change in the intervening period. Sometimes the preprint version will correct errors in the earlier official version, and that is what causes it to get cited.
As an example, consider this item where there is a DOI that has the attribute “doi-asserted-by”: “crossref”. Unfortunately it’s pointing to this paper which has a different title and somewhat different content. They are really different, and the citation should not have been made to the later version in order to preserve historical accuracy.
I understand that this is a good-faith effort by crossref to match articles to their DOIs, and that the fact that Springer reports their bibliographic references in unstructured format doesn’t help. Still, the machine learning algorithm is going to occasionally make errors, and in my opinion the assertion of who associated the DOI to the citation should be propagated to the cited-by information. At present there is no field in the schema but that seems like a desirable thing to include. We don’t wish to include inaccurate information for citations to an article - we only want to show these if the article is actually cited. It might be ok if we could include source information for the reference, but that is not propagated to the cited-by information. The natural place to put it would be as an attribute on the <forward_link> element.
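To make the proposal concrete, here is a rough sketch of how a consumer might use such an attribute if it existed. The doi-asserted-by attribute on <forward_link> is hypothetical (it is the thing being requested, not part of the current schema), and the XML below is a simplified, hand-written stand-in that omits namespaces and most fields of a real cited-by response.

```python
import xml.etree.ElementTree as ET

# Simplified, hand-written stand-in for a cited-by response.
# ASSUMPTION: the doi-asserted-by attribute is the proposal from the
# post above; it does not exist in the current schema. Namespaces and
# most bibliographic fields are omitted for brevity.
SAMPLE = """
<body>
  <forward_link doi="10.5555/cited" doi-asserted-by="publisher">
    <journal_cite><doi>10.5555/citing-1</doi></journal_cite>
  </forward_link>
  <forward_link doi="10.5555/cited" doi-asserted-by="crossref">
    <journal_cite><doi>10.5555/citing-2</doi></journal_cite>
  </forward_link>
</body>
"""

def publisher_asserted(xml_text):
    """Return citing DOIs whose link was asserted by the publisher."""
    root = ET.fromstring(xml_text)
    return [
        link.findtext(".//doi")
        for link in root.iter("forward_link")
        if link.get("doi-asserted-by") == "publisher"
    ]

print(publisher_asserted(SAMPLE))
```

With the attribute on each forward link, a journal could decide in a single pass which citations to display, without any further lookups.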
Thanks for raising this. We have discussed issues with matching methods in a recent series of blog posts; see this one in particular. Unfortunately there will always be errors. Members can fix these by redepositing the metadata with the correct DOI.
We are also looking at improving how we perform matching, both to get a better idea of the error rate we should expect and to improve the matching methods themselves. We don’t use a machine learning method at present, and in the tests we’ve performed so far, more traditional methods have proved just as good, with better performance at scale.
You are correct that we don’t provide the source of the DOI in the XML API. However, this information is included in the REST API output. See the doi-asserted-by field, which is present in all references with a DOI field (such as in the example you provided above).
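As a minimal sketch of reading that field, the function below groups the matched references of one works record by the value of doi-asserted-by. It assumes the record has already been fetched and parsed, and that it is the "message" object of a REST API works response; the sample record and its DOIs are made up for illustration.

```python
from collections import defaultdict

def references_by_asserter(work):
    """Group the DOIs in a work's reference list by who asserted them."""
    groups = defaultdict(list)
    for ref in work.get("reference", []):
        doi = ref.get("DOI")
        if doi is None:
            continue  # unmatched reference: no DOI, so nothing was asserted
        groups[ref.get("doi-asserted-by", "unknown")].append(doi)
    return dict(groups)

# Hand-written sample shaped like the "message" object of a works record.
sample_work = {
    "DOI": "10.5555/citing",
    "reference": [
        {"key": "ref1", "DOI": "10.5555/a", "doi-asserted-by": "publisher"},
        {"key": "ref2", "DOI": "10.5555/b", "doi-asserted-by": "crossref"},
        {"key": "ref3", "unstructured": "A. Author, Some preprint, 2020."},
    ],
}

print(references_by_asserter(sample_work))
```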
I saw it in the REST API, but I was suggesting that you propagate the source of the doi-asserted-by into the cited-by service. Otherwise I have to fetch the citations and then fetch the information for each citation to see where the DOI came from. Instead of 1 fetch I might have to do a hundred, so it’s not a good use of crossref resources to do this.
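The extra work looks roughly like this. The sketch assumes the list of citing DOIs has already been obtained from the cited-by service, and it then makes one REST API request per citing work just to read a single field. The helper names (fetch_work, assertion_source, audit_citations) are illustrative, not part of any Crossref client library.

```python
import json
import urllib.parse
import urllib.request

def fetch_work(doi):
    """Fetch one works record from the public REST API (one HTTP request)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["message"]

def assertion_source(citing_work, cited_doi):
    """Return who asserted the citing work's reference to cited_doi, or None."""
    target = cited_doi.lower()  # DOIs are compared case-insensitively
    for ref in citing_work.get("reference", []):
        if ref.get("DOI", "").lower() == target:
            return ref.get("doi-asserted-by")
    return None

def audit_citations(cited_doi, citing_dois, fetch=fetch_work):
    """Map each citing DOI to the source of its link to cited_doi.

    Costs one fetch per citing work, which is the overhead described above.
    """
    return {d: assertion_source(fetch(d), cited_doi) for d in citing_dois}
```

The fetch parameter is there only so the lookup can be swapped for a cache or a stub. With the default, a hundred citing works means a hundred requests.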
Another thing that wasn’t clear was whether this information would ever be updated. Once the references are deposited, they are unlikely to change unless there is a redeposit by the original publisher. You might however change your matching method to infer a DOI - would you ever update this information that is returned by the REST API?
I see, sorry for the misunderstanding. The Cited-by service was designed for members to get new matches for their works and this isn’t a question that’s come up before from members, as far as I’m aware, but I could see how it could be useful. I’ll discuss it with colleagues and see whether we could make the change.
If we changed the matching method, we would be unlikely to go back and change historically matched items. If a different match was made, for example if the DOI was changed by the publisher, that would show up as a new forward link in Cited-by. One thing we don’t communicate well is when matches are removed, so at present you’d see a new link but wouldn’t see the old one removed. In the REST API the reference DOI would also change. The updated-date of the cited article would change, although you wouldn’t be able to access this history to see what the previous match was.
It’s possible that few people have examined the citations for their journal in any detail. I’m retired so I have a surplus of time to spend on such things.
People want these citations for two reasons:
- bibliometrics, to decide what receives the most citations
- helping readers navigate the space of literature.
I’m less concerned about bibliometrics than I am about helping readers navigate the literature. They are sometimes in conflict.
The proliferation of paywalls has caused a proliferation of preprint servers and a lot of near-duplicate articles to be made available. Authors are sometimes confused about which one to cite, because the devil is in the details of the article they actually got to read. This greatly complicates bibliometrics. I know that Google Scholar goes to a great deal of trouble to cluster different versions together and collect citations to those into a single cluster that has different versions. The only thing that allows them to do this is access to the actual content beyond title and authors, but it’s a computational nightmare. Sometimes there are multiple DOIs for the same cluster of versions (e.g., multiple arxiv DOIs plus a conference publisher DOI for a shortened version plus a journal DOI plus a repository DOI). Scholar aggregates all of these citations together and associates them to a cluster of similar papers. It results in a bit of a mess and I don’t think it is feasible for crossref to attempt this, but the matching you are doing now is helpful.
One of the problems is that authors often omit the DOIs from their references. In my discipline (CS) this is common because of page limits imposed by publishers. Authors have been crafty in cramming as much as possible into their allotted pages, and including DOIs may cause an article to spill over onto an extra page, so people have often omitted them. Once habits like this become common, they are very difficult to dislodge from the culture. As a publisher we are now demanding that authors supply a DOI for their references, but sometimes they have a reason to cite something without a DOI. That’s something that can only be decided during the peer review process. We usually ask for a URL if a DOI is not appropriate, but apparently that gets discarded by crossref.
In an ideal world authors would include DOIs in their citations, and articles would have only one DOI issued for them. I think it’s reasonable for crossref to try to guess the DOIs and cluster citations together, but in the interest of transparency I think you should also supply the source of the information. It’s much like when ORCIDs are supplied by a publisher, but the author(s) may not have authenticated themselves to show they really own that ORCID. The 5.3.1 schema for registering DOIs has an attribute to indicate whether it was authenticated by the owner or not. In reality, it’s not always feasible for authors to authenticate their ORCID (and some authors are refusing to use ORCIDs). There are also places for other kinds of assertions (like affiliations, grant support, etc). crossref doesn’t need to solve all of the problems associated with these identifiers, but it’s helpful when the provenance of an assertion can be propagated through your systems. Your blog said as much in section 2.