Public data file in the cloud

I’m currently partway through the torrent download of the 2022 public data dump. This will be a great start toward establishing a baseline for work we are doing with scientific publication linked data. However, it would be really great if these data were also stored in one or another public data cloud. Peer-to-peer tech like torrent is challenging: it is slow, and some of us face restrictions on torrent traffic over government or corporate networks. Cloud availability would be a far more efficient way to work with the bulk data. You could set this up as a “requester pays” arrangement in something like AWS, Azure, Google Cloud, or all of the above, so egress costs would be borne by the user. More importantly, though, it would be nice to simply run processing directly on the metadata in the cloud to get whatever we need done, including transformations off to other cloud assets.
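For what it’s worth, a requester-pays bucket would only require the caller to acknowledge the transfer cost on each request. Here’s a minimal sketch of that pattern with boto3, using a purely hypothetical bucket name and key layout (nothing like this exists today):

```python
import boto3

s3 = boto3.client("s3")

# With Requester Pays enabled on a bucket, the caller accepts the egress
# charges by passing RequestPayer="requester" on each request.
response = s3.get_object(
    Bucket="crossref-public-data-file",     # hypothetical bucket name
    Key="2022/public-data-file/0.json.gz",  # hypothetical key layout
    RequestPayer="requester",
)

with open("0.json.gz", "wb") as out:
    out.write(response["Body"].read())
```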

This isn’t something we currently support or have specific plans for. But it’s certainly a very interesting suggestion, and something we might consider supporting in the future.

I’m interested in understanding your use case. Could I ask which metadata formats you’re interested in, and what kind of transformation you’re doing?

Thanks for the quick reply. I’d love to see cloud availability happen at some point. If that were kept current with the metadata, we’d end up using it rather than hitting the API.

I’m with USGS, where we publish thousands of works a year with ID spaces in both CrossRef and DataCite. We have what’s essentially a corporate intelligence system based on graphable linked data that we interrogate to better understand the state of scientific knowledge in any given area within our domains and to analyze future capacity. Basic CrossRef metadata gives us useful details about pubs we already know about that we don’t have in our corporate catalogs (e.g., publisher keywords and references). We’re experimenting with what we can get from event data about where our publications are being used, but we haven’t gotten very far there yet.
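Concretely, the lookups involved are just calls to the public works endpoint followed by a bit of field extraction. A rough sketch of that kind of call (the DOI is a placeholder, and field availability varies by record):

```python
import requests

doi = "10.xxxx/example"  # placeholder; substitute a real DOI
resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=30)
resp.raise_for_status()
work = resp.json()["message"]

# Pull out a few of the details we don't hold in our own catalogs;
# not every record carries every field, hence the defaults.
details = {
    "title": work.get("title", []),
    "subjects": work.get("subject", []),      # subject terms, when present
    "references": work.get("reference", []),  # cited works, when deposited
    "container": work.get("container-title", []),
}
```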

We had a previous system that periodically used content negotiation on DOIs found in other parts of our system to retrieve and cache basic application/json-formatted metadata for everything. From that we parse out the useful details into an RDF-based structure inspired by Wikidata/Wikibase (entities and claims) that we use to drive search and discovery apps as well as graph-based analyses (clustering, etc.). That ran afoul of API throttling limits at one point. We’re just reinvigorating that part of the work, so we’re pulling a fresh baseline.
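That retrieval step is essentially just a content-negotiated GET against doi.org, roughly like this (placeholder DOI; the exact Accept value depends on what the registration agency serves, and more specific types such as application/vnd.citationstyles.csl+json are also available):

```python
import requests

doi = "10.xxxx/example"  # placeholder; substitute a real DOI

# Ask the DOI resolver for machine-readable metadata instead of the
# landing page by setting an Accept header; doi.org redirects to the
# registration agency, which returns the negotiated representation.
resp = requests.get(
    f"https://doi.org/{doi}",
    headers={"Accept": "application/json"},
    timeout=30,
)
resp.raise_for_status()
metadata = resp.json()
```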

I know USGS is a CrossRef member. I may follow up with our Library folks and figure out member access to your API if needed.

Thanks for providing that detail; it’s definitely a use case we’d like to talk about in a bit more depth.

And if anyone else reading has a similar use case, we’d love to hear from you.