Date range search of index changes seems to retrieve too many records

Some months ago I retrieved the covid-19 dataset. Now I want to retrieve any records that have been added or changed since then.

I run this command to retrieve the first page of the set of records starting April 1:

https://0-api-crossref-org.library.alliant.edu/works?filter=from-index-date:2020-04-01,until-index-date:2020-11-11

I get total-records = 88403122.

88 million records seems like a lot of records for an incremental update.

So, out of curiosity, I run this command to see how many records had an index update yesterday:

https://0-api-crossref-org.library.alliant.edu/works?filter=from-index-date:2020-11-11,until-index-date:2020-11-11

I get total-records = 385706.

That’s a lot of records to be updated in one day!

What am I missing?

Hello @slnm. Thanks for your message. Welcome to the community forum!

One alternative here is that you could use the from-update-date filter instead of the from-index-date filter. The major difference between the two is that from-index-date also picks up updated citation counts, in addition to the changes covered by from-update-date. If that information isn’t of concern to you, then you can use the from-update-date filter, which will return far fewer results. The index date also reflects changes that we make to the record ourselves, so very occasionally that will include work we’ve done on bugs, along with those citation count updates I mentioned. The from-update-date filter will include all metadata changes made by our members to their records.

You’re right, 385,706 records is a lot to update in one day, but we’re always updating those citation counts by matching references with existing, cited DOIs, so the count returned by the from-index-date filter is going to seem high.

My best,
Isaac

Thanks, Isaac, for your help and engagement.

There are over 112,000 updates for Nov 11 which is better than nearly 386,000 records to fetch but still a large number.

https://0-api-crossref-org.library.alliant.edu/works?filter=from-update-date:2020-11-11,until-update-date:2020-11-11

And, nearly 20,000,000 records to fetch to update the covid-19 dataset to be current.

https://0-api-crossref-org.library.alliant.edu/works?filter=from-update-date:2020-05-01,until-update-date:2020-11-11

My aim is to maintain a relatively current dataset. Should I create a daily job to fetch new records, using your deep cursor? My concern with that approach is that a query takes roughly 30 seconds to return. At the rate of 2 queries per minute, 100,000 updates per day, and 1,000 results per page, it will take 50 minutes per day to fetch the incremental changes. I have tried using the mailto parameter (and https) to get into the preferred query pool but that doesn’t seem to speed up queries.
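The estimate above can be checked with a few lines of Python. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and the 30 seconds per query is the rough figure observed above, not a documented value.

```python
import math


def estimate_fetch_minutes(total_records, rows_per_page=1000, seconds_per_query=30):
    """Estimate the minutes needed to page sequentially through total_records."""
    # One query per page; a partial last page still costs a full query.
    pages = math.ceil(total_records / rows_per_page)
    return pages * seconds_per_query / 60


print(estimate_fetch_minutes(100_000))  # 50.0
```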

Thanks.

Hello again @slnm,

I think you’ll find that the 112,000 updates per day number is a little higher than the average, which should help with the overall time estimate for fetching these incremental changes. And, there’s no reason you can’t send us more than two queries per minute. You should be able to perform up to 50 per second and still be below our rate limits, as discussed here: https://github.com/CrossRef/rest-api-doc#rate-limits

I’d suggest using the Polite pool rather than the Public pool, as the Polite pool is the more performant of the two over the longer term.

If you need a higher rate limit or a more performant pool, our Plus pool, with its SLAs, is an option as well. You can learn more here: https://0-www-crossref-org.library.alliant.edu/services/metadata-retrieval/metadata-plus/. If you’re interested in learning more about the Plus service, I’d be happy to answer your questions or connect you with Jennifer Kemp, our Head of Partnerships.

Kind regards,
Isaac

Thanks again, @ifarley, for your help. I’m still not clear.

According to the REST API doc, I should use a cursor if I’m fetching a large number of rows, and offset can’t be used with cursors. So, I do an initial query with the parameter cursor=* to get the first cursor, then I take next-cursor from the first set of results and use that cursor for the next query, and so on. Given that the cursor changes for every subsequent query, I can’t parallelize those queries but need to get the next cursor before doing the next query. So, to get 112,000 updates with a max of 1,000 rows per query I’ll need to do 112 queries, and I don’t see how I can do anything but wait for one query to complete before doing the next one.
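To make the sequential dependency concrete, here is a minimal sketch of the loop described above. The base URL `https://api.crossref.org/works` and the helper names `fetch_page` and `iter_works` are assumptions for illustration; only the `filter`, `rows`, and `cursor` parameters and the `next-cursor` and `items` fields come from the discussion.

```python
import json
import urllib.parse
import urllib.request

# Assumed base URL; substitute whatever endpoint you actually query.
API = "https://api.crossref.org/works"


def fetch_page(filter_expr, cursor, rows=1000, base_url=API):
    """Fetch one page of results and return the 'message' part of the response."""
    query = urllib.parse.urlencode(
        {"filter": filter_expr, "rows": rows, "cursor": cursor}
    )
    with urllib.request.urlopen(f"{base_url}?{query}") as resp:
        return json.load(resp)["message"]


def iter_works(filter_expr, fetch=fetch_page):
    """Yield every record for one filter, one page at a time."""
    cursor = "*"  # the first request asks for a fresh cursor
    while True:
        message = fetch(filter_expr, cursor)
        items = message.get("items", [])
        if not items:
            return  # an empty page marks the end of the result set
        yield from items
        # The next request cannot be built until this response has arrived.
        cursor = message["next-cursor"]
```

Because each request needs the `next-cursor` from the previous response, a single filter can only be walked by one worker at a time. The `fetch` argument is there so the loop can be exercised with a stub instead of the live API.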

Back to the original question of how to efficiently retrieve all updates since April 1.

https://0-api-crossref-org.library.alliant.edu/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16 shows that there are nearly 24M records to fetch to get my covid-19 set current. That’s 24,000 queries which, unless I’m missing something, I can’t parallelize.

Let’s say I did want to parallelize them by fetching records for April through July in one set of queries and from August on in another set of queries.

https://0-api-crossref-org.library.alliant.edu/works?filter=from-update-date:2020-04-01,until-update-date:2020-07-31 shows 13,286,127 records.

https://0-api-crossref-org.library.alliant.edu/works?filter=from-update-date:2020-08-01,until-update-date:2020-11-16 shows 10,660,131 records.

So, I can run those two date range query sets in parallel and cut the time for fetching the records since April roughly in half. And, I can do more granular date range searches and parallelize them, but I’ll hit duplicate records (i.e. records that were updated in more than one date range).

But, I think I’m still missing something because you say that I can do up to 50 queries per second to fetch those 112,000 updates for one day.

Thanks, again.

Hi @slnm,

You’re right, my suggestion wasn’t well thought out. Sorry about that. You do need to wait for the cursor for each of your queries.

I’m not sure of a way around your dilemma, outside of becoming a Plus subscriber and being able to regularly pull the monthly Snapshots. That said, I’ve asked our technical team for any suggestions they may have. I’ll follow up as soon as I know more.

My best,
Isaac

My colleagues on the technical team have some suggestions:

You could divide the set you need to download by the date of creation, and download various creation date ranges in parallel. Creation date should be safe because it does not change, and every DOI has only one creation date. So a DOI should belong to exactly one creation date range, assuming all possible ranges are downloaded. The full range to cover is from 2002-07-25 (inclusive, this is the oldest creation date in our data) to the current date.

For example, I can download DOIs updated since April and created in 2020, in parallel download DOIs updated since April and created in 2019, … , and in parallel download DOIs updated since April and created in 2002, using parallel requests https://0-api-crossref-org.library.alliant.edu/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16,from-created-date:2020,until-created-date:2020&cursor=… and https://0-api-crossref-org.library.alliant.edu/works?filter=from-update-date:2020-04-01,until-update-date:2020-11-16,from-created-date:2019,until-created-date:2019&cursor=… and so on.

Or, I could use smaller ranges and separately download DOIs updated since April and created in 2020-11, DOIs updated since April and created in 2020-10, and so on down to 2002-07. Or use just a few days as the range. The smallest range is 1 day long, as this is the creation date filter “resolution”.

Those subsets may not be well balanced in terms of the numbers of DOIs, but it should allow you to speed the whole thing up a bit.
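As a rough illustration of the suggestion above, the yearly filters can be generated and handed to a small thread pool. This is a sketch under assumptions: `yearly_filters` and `download_in_parallel` are made-up names, and `download` stands for whatever function you write to walk one filter with its own cursor.

```python
from concurrent.futures import ThreadPoolExecutor


def yearly_filters(update_from, update_until, first_year=2002, last_year=2020):
    """Build one filter string per creation year, newest year first."""
    return [
        f"from-update-date:{update_from},until-update-date:{update_until},"
        f"from-created-date:{year},until-created-date:{year}"
        for year in range(last_year, first_year - 1, -1)
    ]


def download_in_parallel(filters, download, max_workers=4):
    """Run download(filter) for each filter concurrently.

    `download` is a hypothetical callable that pages through a single
    filter sequentially with its own cursor and returns its results.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(zip(filters, pool.map(download, filters)))


filters = yearly_filters("2020-04-01", "2020-11-16")
print(len(filters))  # 19 creation years, 2020 down to 2002
print(filters[0])
```

Each creation year gets its own independent cursor, so the years can run side by side while every DOI still lands in exactly one subset. Since the subsets are uneven, splitting the busiest years into months may balance the work better.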

Does that make sense?

@ifarley Yes, this all makes sense. Thank you! I’ll do some queries to get some counts to estimate the volume of searches needed and the time needed, then parallelize the whole process. Again, I appreciate your willingness to dig into this issue.

I’m always happy to help, @slnm. Thanks for posting this message here for all to benefit from the exchange.

A perfect and very useful question for many of us. Thanks for the answers and suggestions!

@ifarley Thanks for this great article! It looks like something goes wrong with the cursor. With &cursor=* the API returns a good next-cursor, but with &cursor=the_second_cursor_encoded the API returns the same next-cursor as the &cursor= value (while I’m not at the end of the list).

For example:
for me your example (right now) /works?filter=from-update-date:2020-04-01,until-update-date:2020-11-15,from-created-date:2020,until-created-date:2020&cursor=DnF1ZXJ5VGhlbkZldGNoBgAAAAADT3Z4FkhUQjJYNFlSVFVPT1k5bVZmVDQzQXcAAAAAAj_MxBY0amo4YndCWVRadUx4QV9WQlVKWHdnAAAAAALaP_IWR25aQ25BM09Rb0ctLThhTmdORDh1ZwAAAAACFDYDFkM1MFR1eE95UzRtSDdSdnNyX2VHY3cAAAAAAq9T1BZOalpNX1ptYlFVdVdzcHdzMTdxakN3AAAAAANPdncWSFRCMlg0WVJUVU9PWTltVmZUNDNBdw%3D%3D

returns:

{
status: "ok",
message-type: "work-list",
message-version: "1.0.0",
message: {
facets: { },
next-cursor: "DnF1ZXJ5VGhlbkZldGNoBgAAAAADT3Z4FkhUQjJYNFlSVFVPT1k5bVZmVDQzQXcAAAAAAj_MxBY0amo4YndCWVRadUx4QV9WQlVKWHdnAAAAAALaP_IWR25aQ25BM09Rb0ctLThhTmdORDh1ZwAAAAACFDYDFkM1MFR1eE95UzRtSDdSdnNyX2VHY3cAAAAAAq9T1BZOalpNX1ptYlFVdVdzcHdzMTdxakN3AAAAAANPdncWSFRCMlg0WVJUVU9PWTltVmZUNDNBdw==",
total-results: 4300774,
items: [

Or am I doing something wrong here?

Hi @ps80. Thanks for your message and welcome to the community forum.

We migrated our backend to Elasticsearch since I wrote the information above. Cursors in Elasticsearch are a little different from cursors in Solr, our previous backend, and thus some of the information in my post above is out of date. Sorry about that.

In Elasticsearch, which we moved to in August 2021, you get a cursor and the server remembers your position in the dataset you are iterating over. The cursor stays the same, but you should be getting different result pages on subsequent calls.
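In practice this means a client should not decide it is finished by comparing cursors. A small sketch of a loop that stops on an empty page instead is below. The function names are made up, and `make_fake_server` only simulates the behaviour described here (same cursor, advancing pages); it is not the real API.

```python
def drain_cursor(fetch, cursor="*"):
    """Collect all items, stopping on an empty page rather than a cursor change."""
    collected = []
    while True:
        message = fetch(cursor)
        items = message.get("items", [])
        if not items:
            break
        collected.extend(items)
        # The returned cursor may equal the one just sent; pass it along either way.
        cursor = message.get("next-cursor", cursor)
    return collected


def make_fake_server(pages):
    """Simulate a server that keeps returning one cursor but advances its position."""
    remaining = iter(pages)

    def fetch(cursor):
        return {"next-cursor": "same-token", "items": next(remaining, [])}

    return fetch


print(drain_cursor(make_fake_server([[1, 2], [3]])))  # [1, 2, 3]
```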

Is that what you are finding?

-Isaac

Hi @ifarley, thanks for your fast response! I’m sorry, I’m used to Solr so I expected a new cursor, but everything works as you’re describing. Works perfectly!

Peter

That’s great, Peter. Thanks for confirming.

-Isaac