Cursor-Based Pagination

Hi @ifarley ,

I am just getting started with the Crossref REST API.

I read your post on cursor-based paging: https://community.crossref.org/t/ticket-of-the-month-march-2022-getting-started-with-rest-api-queries/2587/5#deep-paging-with-cursors-2

I want to harvest all the works that have at least one author / contributor identified with an ORCID.

Initial request: https://api.crossref.org/works?filter=has-orcid:1&rows=1000&cursor=*

Then I obtain the cursor from .message.next-cursor in the JSON response (DnF1ZXJ5VGhlbkZldGNoBgAAAAAVkxo0FkhUQjJYNFlSVFVPT1k5bVZmVDQzQXcAAAAAEugiXhY0amo4YndCWVRadUx4QV9WQlVKWHdnAAAAAAZWz6oWMUZXWVdYT3hUZTZtQzNpVGM3NzZoUQAAAAAWv4Z_Fk5qWk1fWm1iUVV1V3Nwd3MxN3FqQ3cAAAAAFjFcdhZDNTBUdXhPeVM0bUg3UnZzcl9lR2N3AAAAABPuERgWSTFVWlpBeGRTWi1nRllxOU9nQUYydw==) and use it for the subsequent request(s), e.g., https://api.crossref.org/works?filter=has-orcid:1&rows=1000&cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAAVkxo0FkhUQjJYNFlSVFVPT1k5bVZmVDQzQXcAAAAAEugiXhY0amo4YndCWVRadUx4QV9WQlVKWHdnAAAAAAZWz6oWMUZXWVdYT3hUZTZtQzNpVGM3NzZoUQAAAAAWv4Z_Fk5qWk1fWm1iUVV1V3Nwd3MxN3FqQ3cAAAAAFjFcdhZDNTBUdXhPeVM0bUg3UnZzcl9lR2N3AAAAABPuERgWSTFVWlpBeGRTWi1nRllxOU9nQUYydw%3D%3D

I URL-encode the cursor for this purpose (the trailing “==” becomes “%3D%3D”).
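For reference, here is roughly what my harvesting loop looks like (a minimal Python sketch; `process()` is a stand-in for my own handling, and error handling is omitted):

```python
import requests

BASE = "https://api.crossref.org/works"
params = {
    "filter": "has-orcid:1",
    "rows": 1000,
    "cursor": "*",                    # initial request uses the wildcard cursor
    "mailto": "tobias@example.org",   # polite-pool etiquette from the docs
}

while True:
    resp = requests.get(BASE, params=params)  # requests URL-encodes the cursor
    resp.raise_for_status()
    message = resp.json()["message"]
    items = message["items"]
    if not items:
        break                          # an empty page marks the end of the set
    process(items)                     # stand-in for whatever stores the batch
    params["cursor"] = message["next-cursor"]
```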

The thing that I struggle to understand is the way cursors are handled. Normally, a cursor is a base64-encoded string that can be decoded into some kind of pointer, such as an ID. With each request, the pointer changes and so does the next cursor.

Here, however, the cursor remains the same once obtained (at least that is what I experienced). Does this mean that the server holds some sort of state for a given cursor that changes each time a request is made?

In other words, aside from the first request, all subsequent requests use the same URL, but each request returns different results?

I found this on your tips page (https://www.crossref.org/documentation/retrieve-metadata/rest-api/tips-for-using-the-crossref-rest-api/):

“The problem is this: if you are doing a long sequence of cursor requests, and the API (or your script) becomes unstable in the middle of the sequence, and you get an error, you will have to start from scratch with a new cursor.”

What I also noticed is that "start-index" (in the "query" object of the response) increments with offset-based pagination, but not when using a cursor.
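For comparison, this is the offset-based variant where I can watch start-index move (again a sketch; if I read the docs correctly, offsets are capped at 10'000, so this only works for shallow paging):

```python
import requests

BASE = "https://api.crossref.org/works"
rows = 1000

for offset in range(0, 5000, rows):
    resp = requests.get(BASE, params={"filter": "has-orcid:1",
                                      "rows": rows, "offset": offset})
    message = resp.json()["message"]
    # start-index mirrors the offset here, while it stays at 0 with cursors
    print(message["query"]["start-index"], len(message["items"]))
```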

Thanks for clarification and kind regards,

Tobias

Hi Tobias,

Thanks for writing, and for reading our docs.

It seems like you’ve sorted out how cursors work in the REST API. The cursor remains the same once obtained, and the server treats it as a stateful artifact. For each request with the same cursor, different results should be returned. Our cursor configuration is what we get out of the box with Elasticsearch.
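To illustrate what I mean by stateful: conceptually it behaves like Elasticsearch’s scroll API, where the initial search opens a server-side context and every subsequent call with the same scroll ID advances it. A rough sketch of that pattern (illustrative only, against a local index, not our actual setup):

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # stand-in for our search backend

# the initial search opens a server-side context, kept alive for 5 minutes
page = es.search(index="works", scroll="5m", size=1000,
                 query={"match_all": {}})
scroll_id = page["_scroll_id"]

while page["hits"]["hits"]:
    # the scroll ID stays the same, yet each call returns the next page,
    # because the context on the server advances with every request
    page = es.scroll(scroll_id=scroll_id, scroll="5m")

es.clear_scroll(scroll_id=scroll_id)  # free the server-side context
```

That server-side context is also why a cursor expires if you pause too long between requests.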

When we migrated from Solr to Elasticsearch, some other users had to adjust to this cursor behavior. You might find the comment thread on this old GitLab issue informative: “As a Metadata Plus user, I'd like the cursor timeout to be increased (a 5-minute expiration is too short)” (#649 in the crossref / DEPRECATED User stories project).

I hope this clarifies things, but let us know if you have any further questions.

Thanks,
Patrick


Hi Patrick,

Thanks a lot for your quick answer. Yes, it’s now clear to me how it works.

As for my use case, there are a lot of results (9’998’490 items, and I can fetch 1’000 per request). As you note on the tips page, it is quite likely that some request will eventually fail, which would require me to start over. Is the only other option to download the public data file and apply the filtering myself?

I’ve recently worked on a prototypical GraphQL interface backed by Elasticsearch, where I implemented cursor-based pagination using Elasticsearch’s search_after (see “Paginate search results” in the Elasticsearch Guide 8.6). It required a deterministic sort on the id field, and the cursor was simply the base64-encoded id of the last result on the fetched page. So with each request, the client would get the next page of results and obtain a new cursor.
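To sketch what I mean (hypothetical index and field names, talking to Elasticsearch’s REST search endpoint directly):

```python
import base64
import requests

ES = "http://localhost:9200/works/_search"  # hypothetical index

def fetch_page(cursor=None, size=1000):
    body = {"size": size, "sort": [{"id": "asc"}]}  # deterministic sort by id
    if cursor:
        # the cursor is nothing but the base64-encoded id of the last result
        body["search_after"] = [base64.b64decode(cursor).decode()]
    hits = requests.post(ES, json=body).json()["hits"]["hits"]
    if not hits:
        return [], None
    last_id = hits[-1]["_source"]["id"]
    return hits, base64.b64encode(last_id.encode()).decode()
```

Because such a cursor encodes the position itself rather than pointing at server-side state, a failed request could, if I understand correctly, simply be retried with the same cursor instead of starting over.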

Could this be an option, too? I am aware that, given the amount of data you host, this is just an untested idea, but let me know if I should provide more details.

Kind regards,

Tobias


Hi again,

Yes, your two best options for fetching a large volume of records are to iterate over a result set with cursors or to start from the public data file.

We’re aware that there is room for improvement in our cursor implementation and will consider the feature you’ve suggested for future enhancements. However, we’re not likely to develop it in the near term due to limited development resources.

Thanks for your suggestion, and for offering to provide additional details. The Elasticsearch documentation should be sufficient for us, but we’ll be sure to reach out if we need to better understand your use case.

Take care,
Patrick
