Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement new iteration strategy #353

Merged
merged 9 commits into from
Oct 25, 2023
Merged

Implement new iteration strategy #353

merged 9 commits into from
Oct 25, 2023

Conversation

hadley
Copy link
Member

@hadley hadley commented Oct 23, 2023

A first pass. Fixes #341. Fixes #298.

To do (possibly in future PRs)

  • What will req_chunk() (Add req_chunk() #307) look like?
  • Mechanism for signalling total number of pages (if known)
  • Do we need a way to return multiple pages (as in original)? — I don't think so? I think you could use req_perform_parallel() if you wanted.
  • Should we return a new data structure containing both reqs and resps? If so, we should also do that in req_perform_parallel() (but not multi_req_perform())

@mgirlich it's probably worth a glance if you have time, but it's likely to evolve quite a bit more as I work on it, so no need for a thorough review yet. It's a bit PR so I recommend reading iterate.R first, then iterate-helpers.R then iterate-responses.R

Sorry, something went wrong.

@mgirlich
Copy link
Collaborator

I quickly skimmed over it and it looks quite nice. Let me know when this is ready for a proper review.

This was referenced Oct 24, 2023
@hadley
Copy link
Member Author

hadley commented Oct 24, 2023

I think it's probably good for a proper review now. The part I'm least certain about is the return value — my gut feeling is that it should return a list of resps and a list of reqs, so that if there is an error you can track it down more easily. But I'm struggling with what to call that object, and it should go in a separate PR anyway.

@mgirlich
Copy link
Collaborator

I used the new interface in an internal package (for an absolutely terrible API with a very inconsistent interface. So, the perfect use case actually). Some things that might be difficult:

  • the token is sometimes in a nested field. This forced me to used purrr::pluck()
  • the token must be send in the body 🙄 which is quite ugly. I'm not sure this is often the case but it might be nice to add a helper to modify the body
  • I copied the part how to signal the number of pages. We probably want to add a helper for this.
api_paginate_token <- function(req,
                               field,
                               spec,
                               page_size,
                               n_results_field,
                               ...,
                               req_token_field = "nextToken",
                               resp_token_field = "nextToken",
                               pb_name = "Download",
                               n = Inf) {
  check_dots_empty()

  parse_resp <- function(resp) {
    parsed <- httr2::resp_body_json(resp, check_type = FALSE, bigint_as_char = TRUE)
    tibblify(parsed[[field]], spec)
  }

  known_total <- FALSE
  next_req <- function(resp, req) {
    parsed <- httr2::resp_body_json(resp, check_type = FALSE, bigint_as_char = TRUE)
    next_token <- purrr::pluck(parsed, !!!resp_token_field)

    if (is.null(next_token)) {
      return(NULL)
    }

    if (!known_total && !is.null(n_results_field)) {
      n_results <- purrr::pluck(parsed, !!!n_results_field)
      check_number_whole(n_results)
      known_total <<- TRUE
      signal("", class = "httr2_total_pages", n = ceiling(n_results / page_size))
    }

    data <- req$body$data
    data <- purrr::assign_in(data, req_token_field, value = u(next_token))
    req$body$data <- data
    req
  }

  responses <- httr2::req_perform_iteratively(
    req = req,
    next_req = next_req,
    max_reqs = ceiling(n / page_size),
    progress = pb_name
  )

  httr2::resps_combine(responses, parse_resp)
}

@mgirlich
Copy link
Collaborator

Oh and of course I couldn't use the nice helpers because of the pagination patterns used by the API...

#' * `iterate_with_offset()` increments a query parameter, e.g. `?page=1`,
#' `?page=2`, or `?offset=1`, `offset=21`.
#' * `iterate_with_cursor()` updates a query parameter with the value of a
#' cursor found somewhere in the response.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restricting iterate_with_offset() and iterate_with_cursor() to a query parameter makes the implementation much simpler. I hope this covers most APIs but it definitely doesn't cover the main one I have to use...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's a reasonable trade-off. My plan is to eventually have vignette that describes how to handle more complex situations.

#' @param resp_param_value A callback function that takes a response (`resp`)
#' and returns the next cursor value. Return `NULL` if there are no further
#' pages.
iterate_with_cursor <- function(param_name, resp_param_value) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to not include a resp_pages argument?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't include a resp_pages() for either of these two helpers because I think it's relatively rare that the total number of pages is available with these interfaces. But I guess there's no harm in adding it.

#' @param resp_data A function with one argument `resp` that parses the
#' response and returns a list with the field `data` and other fields needed
#' to create the request for the next page.
resps_combine <- function(resps, resp_data) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CRAN might be more happy if you explicitly document the return value with @return

check_request(req)
check_function2(next_req, args = c("resp", "req"))
check_number_whole(max_reqs, allow_infinite = TRUE, min = 1)
check_string(path, allow_empty = FALSE, allow_null = TRUE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe check that the path actually includes {i}?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thats a bit fiddly for me for now, particularly since I'm not sure if I'm super committed to this API. We can work on it in the future.

R/test.R Outdated

res$set_status(200L)$send_json(
object = list(data = data, count = n),
object = list(data = data, count = n, pages = n / page_size),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
object = list(data = data, count = n, pages = n / page_size),
object = list(data = data, count = n, pages = ceiling(n / page_size)),

@mgirlich
Copy link
Collaborator

Overall this looks good to me. Switching to the new interface was relatively quick (as written above there are some annoying things but I guess this is mainly due to the terrible API I have to use).

@mgirlich
Copy link
Collaborator

The part I'm least certain about is the return value — my gut feeling is that it should return a list of resps and a list of reqs, so that if there is an error you can track it down more easily. But I'm struggling with what to call that object, and it should go in a separate PR anyway.

Maybe add the request to each response as an attribute. Or add a helper last_iterative_requests().

Verified

This commit was created on GitHub.com and signed with GitHub’s verified signature. The key has expired.
Co-authored-by: Maximilian Girlich <[email protected]>
@hadley hadley merged commit 5524237 into main Oct 25, 2023
@hadley hadley deleted the iterate branch October 25, 2023 20:36
hadley added a commit that referenced this pull request Oct 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

iterate_next_request() -> resp_next_request() Store intermediate responses in req_paginate()
2 participants