Implement new iteration strategy #353

hadley · 2023-10-23T15:51:15Z

A first pass. Fixes #341. Fixes #298.

To do (possibly in future PRs)

What will req_chunk() (Add req_chunk() #307) look like?
Mechanism for signalling total number of pages (if known)
Do we need a way to return multiple pages (as in original)? — I don't think so? I think you could use req_perform_parallel() if you wanted.
Should we return a new data structure containing both reqs and resps? If so, we should also do that in req_perform_parallel() (but not multi_req_perform())

@mgirlich it's probably worth a glance if you have time, but it's likely to evolve quite a bit more as I work on it, so no need for a thorough review yet. It's a bit PR so I recommend reading iterate.R first, then iterate-helpers.R then iterate-responses.R

A first pass. Fixes #341. Fixes #298.

R/iterate-helpers.R

R/iterate.R

R/utils.R

mgirlich · 2023-10-24T04:18:51Z

I quickly skimmed over it and it looks quite nice. Let me know when this is ready for a proper review.

hadley · 2023-10-24T12:58:43Z

I think it's probably good for a proper review now. The part I'm least certain about is the return value — my gut feeling is that it should return a list of resps and a list of reqs, so that if there is an error you can track it down more easily. But I'm struggling with what to call that object, and it should go in a separate PR anyway.

R/iterate.R

mgirlich · 2023-10-25T05:28:45Z

I used the new interface in an internal package (for an absolutely terrible API with a very inconsistent interface. So, the perfect use case actually). Some things that might be difficult:

the token is sometimes in a nested field. This forced me to used purrr::pluck()
the token must be send in the body 🙄 which is quite ugly. I'm not sure this is often the case but it might be nice to add a helper to modify the body
I copied the part how to signal the number of pages. We probably want to add a helper for this.

api_paginate_token <- function(req,
                               field,
                               spec,
                               page_size,
                               n_results_field,
                               ...,
                               req_token_field = "nextToken",
                               resp_token_field = "nextToken",
                               pb_name = "Download",
                               n = Inf) {
  check_dots_empty()

  parse_resp <- function(resp) {
    parsed <- httr2::resp_body_json(resp, check_type = FALSE, bigint_as_char = TRUE)
    tibblify(parsed[[field]], spec)
  }

  known_total <- FALSE
  next_req <- function(resp, req) {
    parsed <- httr2::resp_body_json(resp, check_type = FALSE, bigint_as_char = TRUE)
    next_token <- purrr::pluck(parsed, !!!resp_token_field)

    if (is.null(next_token)) {
      return(NULL)
    }

    if (!known_total && !is.null(n_results_field)) {
      n_results <- purrr::pluck(parsed, !!!n_results_field)
      check_number_whole(n_results)
      known_total <<- TRUE
      signal("", class = "httr2_total_pages", n = ceiling(n_results / page_size))
    }

    data <- req$body$data
    data <- purrr::assign_in(data, req_token_field, value = u(next_token))
    req$body$data <- data
    req
  }

  responses <- httr2::req_perform_iteratively(
    req = req,
    next_req = next_req,
    max_reqs = ceiling(n / page_size),
    progress = pb_name
  )

  httr2::resps_combine(responses, parse_resp)
}

mgirlich · 2023-10-25T06:08:38Z

Oh and of course I couldn't use the nice helpers because of the pagination patterns used by the API...

mgirlich · 2023-10-25T06:11:59Z

R/iterate-helpers.R

+#' * `iterate_with_offset()` increments a query parameter, e.g. `?page=1`,
+#'   `?page=2`, or `?offset=1`, `offset=21`.
+#' * `iterate_with_cursor()` updates a query parameter with the value of a
+#'    cursor found somewhere in the response.


Restricting iterate_with_offset() and iterate_with_cursor() to a query parameter makes the implementation much simpler. I hope this covers most APIs but it definitely doesn't cover the main one I have to use...

I think it's a reasonable trade-off. My plan is to eventually have vignette that describes how to handle more complex situations.

R/iterate-helpers.R

mgirlich · 2023-10-25T06:19:03Z

R/iterate-helpers.R

+#' @param resp_param_value A callback function that takes a response (`resp`)
+#'   and returns the next cursor value. Return `NULL` if there are no further
+#'   pages.
+iterate_with_cursor <- function(param_name, resp_param_value) {


Is there a reason to not include a resp_pages argument?

I didn't include a resp_pages() for either of these two helpers because I think it's relatively rare that the total number of pages is available with these interfaces. But I guess there's no harm in adding it.

R/iterate-helpers.R

R/iterate-responses.R

mgirlich · 2023-10-25T06:31:48Z

R/iterate-responses.R

+#' @param resp_data A function with one argument `resp` that parses the
+#'   response and returns a list with the field `data` and other fields needed
+#'   to create the request for the next page.
+resps_combine <- function(resps, resp_data) {


CRAN might be more happy if you explicitly document the return value with @return

R/iterate.R

mgirlich · 2023-10-25T06:36:40Z

R/iterate.R

+  check_request(req)
+  check_function2(next_req, args = c("resp", "req"))
+  check_number_whole(max_reqs, allow_infinite = TRUE, min = 1)
+  check_string(path, allow_empty = FALSE, allow_null = TRUE)


Maybe check that the path actually includes {i}?

Thats a bit fiddly for me for now, particularly since I'm not sure if I'm super committed to this API. We can work on it in the future.

R/test.R

mgirlich · 2023-10-25T06:45:19Z

R/test.R


    res$set_status(200L)$send_json(
-      object = list(data = data, count = n),
+      object = list(data = data, count = n, pages = n / page_size),


Suggested change

object = list(data = data, count = n, pages = n / page_size),

object = list(data = data, count = n, pages = ceiling(n / page_size)),

mgirlich · 2023-10-25T06:50:51Z

Overall this looks good to me. Switching to the new interface was relatively quick (as written above there are some annoying things but I guess this is mainly due to the terrible API I have to use).

mgirlich · 2023-10-25T06:51:57Z

The part I'm least certain about is the return value — my gut feeling is that it should return a list of resps and a list of reqs, so that if there is an error you can track it down more easily. But I'm struggling with what to call that object, and it should go in a separate PR anyway.

Maybe add the request to each response as an attribute. Or add a helper last_iterative_requests().

Co-authored-by: Maximilian Girlich <[email protected]>

Part of #353

hadley added 6 commits October 23, 2023 10:47

Implement new iteration strategy

324920a

A first pass. Fixes #341. Fixes #298.

Eliminate old doc fragment

8385882

Add path argument

2dd5507

Test req_perform_iteratively() checks its argumetns

203d077

Add way to report total number of pages

eaed0e8

Don't use base pipe in tests

54f5c02

hadley commented Oct 23, 2023

View reviewed changes

R/iterate-helpers.R Show resolved Hide resolved

Tweaking formatting; add comment

efa2bec

hadley mentioned this pull request Oct 23, 2023

Add helper to perform a request in chunks #303

Closed

mgirlich reviewed Oct 24, 2023

View reviewed changes

R/iterate.R Outdated Show resolved Hide resolved

mgirlich reviewed Oct 24, 2023

View reviewed changes

R/utils.R Show resolved Hide resolved

This was referenced Oct 24, 2023

Add paginate_req_perform_continue() #306

Closed

Add req_chunk() #307

Closed

Don't increment when changing max pages

d2f597e