Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New [[.tbl_df shouldn't use vec_slice() when doing [[i, j]] #681

Closed
DavisVaughan opened this issue Dec 2, 2019 · 11 comments
Closed

New [[.tbl_df shouldn't use vec_slice() when doing [[i, j]] #681

DavisVaughan opened this issue Dec 2, 2019 · 11 comments
Labels
vctrs ↗️ Requires vctrs package
Milestone

Comments

@DavisVaughan
Copy link
Member

CRAN tibble

library(tibble)

tbl <- tibble(x = structure(c(1, 2), extra = "hi"))

tbl$x
#> [1] 1 2
#> attr(,"extra")
#> [1] "hi"

tbl[[1, 1]]
#> [1] 1

Dev tibble

library(tibble)

tbl <- tibble(x = structure(c(1, 2), extra = "hi"))

tbl$x
#> [1] 1 2
#> attr(,"extra")
#> [1] "hi"

tbl[[1, 1]]
#> [1] 1
#> attr(,"extra")
#> [1] "hi"

I think this is definitely not right. [[ should always strip attributes from atomic vectors.

The issue comes from using vec_slice() in [[.tbl_df here:

vec_slice(x, i)

The issue is that vec_slice() works more like [ which won't strip the attributes from the column. [[ would. The vctrs equivalent for [[ doesn't exist yet, but would be vec_get()
r-lib/vctrs#626

Note that base:::[[.data.frame uses [[ for the row extraction after selecting the column, which is why we previously had the correct behavior:

> base:::`[[.data.frame`
function (x, ..., exact = TRUE) 
{
    na <- nargs() - !missing(exact)
    if (!all(names(sys.call()) %in% c("", "exact"))) 
        warning("named arguments other than 'exact' are discouraged")
    if (na < 3L) 
        (function(x, i, exact) if (is.matrix(i)) 
            as.matrix(x)[[i]]
        else .subset2(x, i, exact = exact))(x, ..., exact = exact)
    else {
        col <- .subset2(x, ..2, exact = exact)
        i <- if (is.character(..1)) 
            pmatch(..1, row.names(x), duplicates.ok = TRUE)
        else ..1
        col[[i, exact = exact]] # <- RIGHT HERE!
    }
}

vec_get() also behaves differently than vec_slice() with lists, so list column extraction is also different now.

CRAN tibble:

library(tibble)

tbl <- tibble(x = list(1, 2), y = 1:2)

tbl[[1, 1]]
#> [1] 1

Dev tibble:

library(tibble)

tbl <- tibble(x = list(1, 2), y = 1:2)

tbl[[1, 1]]
#> [[1]]
#> [1] 1
@krlmlr
Copy link
Member

krlmlr commented Dec 2, 2019

I agree vec_slice() isn't the right function to call here.

We define

x[[i, j]] is equal to x[i, ][[j]]

in the invariants vignette, because we want it to work with data frame and array columns. Is this wrong?

@DavisVaughan
Copy link
Member Author

I think that is wrong because it implies the attributes are kept.

library(tibble)

df <- tibble(x = 1:2, y = structure(1:2, extra = "stuff"))

# df[[1, 2]]
df[1,][[2]]
#> [1] 1
#> attr(,"extra")
#> [1] "stuff"

Created on 2019-12-02 by the reprex package (v0.3.0.9000)

I think it should be more like:

x[[i, j]] == vec_get(x[[j]], i)

Which is very close to x[[j]][[i]] except vec_get() would extract rows from data frames and arrays

@DavisVaughan
Copy link
Member Author

It also depends on what you want list column extraction to return

library(vctrs)
library(tibble)

tbl <- tibble(y = 1:2, x = list(1, 2))

# tbl[[1, 2]]

tbl[1,][[2]]
#> [[1]]
#> [1] 1

vctrs:::vec_get(tbl[[2]], 1)
#> [1] 1

Created on 2019-12-02 by the reprex package (v0.3.0.9000)

@DavisVaughan
Copy link
Member Author

A potential short term solution is to do x_col = x[[j]] and then use x_col[[i]] unless x_col is a data frame or array, in which case use vec_slice(x_col, i).

This isn't exactly right because it still would not remove extra attributes and row names from data frames and arrays, but would be close. Something like:

library(tibble)
library(vctrs)

vec_get_ish <- function(x, i) {
  can_use_bracket_bracket <- !is.data.frame(x) && !is.array(x)
  
  if (can_use_bracket_bracket) {
    return(x[[i]])
  }
  
  out <- vec_slice(x, i)
  
  row.names(out) <- NULL
  
  # Now strip all non-core attributes? Not sure how though
  # ...
  
  out
}

vec_get_ish(1:5, 1)
#> [1] 1

vec_get_ish(mtcars, 1)
#>   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> 1  21   6  160 110  3.9 2.62 16.46  0  1    4    4

vec_get_ish(list(1, 2), 1)
#> [1] 1

tbl_bracket_bracket_ij <- function(x, i, j) {
  vec_get_ish(x[[j]], i)
}

df <- tibble(x = tibble(a = 1:2))
df2 <- tibble(x = list(a = 1:2))

tbl_bracket_bracket_ij(df, 1, 1)
#> # A tibble: 1 x 1
#>       a
#>   <int>
#> 1     1

tbl_bracket_bracket_ij(df2, 1, 1)
#> [1] 1 2

Created on 2019-12-02 by the reprex package (v0.3.0.9000)

@krlmlr
Copy link
Member

krlmlr commented Dec 2, 2019

What should happen here?

tibble::tibble(hms::hms(1:3))[[1, 1]]
#> 00:00:01

Created on 2019-12-02 by the reprex package (v0.3.0)

I don't understand what's wrong with keeping attributes.

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Dec 3, 2019

Ah, you're right. I forgot that the main difference between [ and [[ with attributes is that [ keeps names, but both drop everything else.

Subsetting (except by an empty index) will drop all attributes except names, dim and dimnames.

The usual form of indexing is [. [[ can be used to select a single element dropping names, whereas [ keeps them, e.g., in c(abc = 123)[1].

library(vctrs)

x <- structure(1:5, extra = "stuff", names = letters[1:5])

# vec_slice keeps attributes
vec_slice(x, 1)
#> a 
#> 1 
#> attr(,"extra")
#> [1] "stuff"

# `[` drops attributes, except names
x[1]
#> a 
#> 1

# `[[` drops attributes, including names
x[[1]]
#> [1] 1

# vec_get(), then, should keep attributes, except names?
strip_names <- function(x) {
  if (is.data.frame(x) || is.array(x)) {
    row.names(x) <- NULL
  } else {
    names(x) <- NULL
  }
  x
}

vec_get <- function(x, i) {
  stopifnot(length(i) == 1L)
  if (is.list(x) && !is.data.frame(x)) {
    x <- x[[i]]
  } else {
    x <- vec_slice(x, i)
    x <- strip_names(x)
  }
  x
}

vec_get(x, 1)
#> [1] 1
#> attr(,"extra")
#> [1] "stuff"

vec_get(mtcars, 1)
#>   mpg cyl disp  hp drat   wt  qsec vs am gear carb
#> 1  21   6  160 110  3.9 2.62 16.46  0  1    4    4

vec_get(list(x = c(a = 1)), 1)
#> a 
#> 1

vec_get(hms::hms(1:3), 1)
#> 00:00:01

Created on 2019-12-03 by the reprex package (v0.3.0.9000)

@krlmlr
Copy link
Member

krlmlr commented Mar 18, 2020

@DavisVaughan: Do we need to do anything here?

@DavisVaughan
Copy link
Member Author

DavisVaughan commented Mar 18, 2020

I think the main question is what you want this behavior with lists to be.

library(tibble)

y <- tibble(z = list(1, 2))

i <- 1
j <- 1

# Current behavior. But is this right? It doesn't match CRAN tibble
y[[i, j]]
#> [[1]]
#> [1] 1

# I expect the invariant to be more like this, which
# is more accurately defined (for data frames) as `vec_get(y[[j]], i)`
y[[j]][[i]]
#> [1] 1

If you can definitively answer that, the answer to that also implies if names should be dropped here, as you would have to use vec_get(), which would drop names.

library(tibble)

y <- tibble(z = structure(1:2, stuff = "foo", names = letters[1:2]))

i <- 1
j <- 1

# Current behavior. But names should be dropped too if we index into lists with
# `[[` behavior. (Other attributes would be kept).
y[[i, j]]
#> a 
#> 1 
#> attr(,"stuff")
#> [1] "foo"

We still don't have a vec_get() in vctrs, but here is one you could use

library(tibble)
library(vctrs)

strip_vec_names <- function(x) {
  maybe_row_names <- is.data.frame(x) || is.array(x)
  
  if (maybe_row_names) {
    row.names(x) <- NULL
  } else {
    names(x) <- NULL
  }
  
  x
}

vec_get <- function(x, i) {
  # Assuming you already do this
  # i <- vec_as_location2(i, vec_size(x))
  
  if (vec_is_list(x)) {
    out <- x[[i]]
    return(out)
  }
  
  out <- vec_slice(x, i)
  out <- strip_vec_names(out)
  
  out
}

x <- tibble(z = list(1, 2))
y <- tibble(z = structure(1:2, stuff = "foo", names = letters[1:2]))

vec_get(x[[1]], 1)
#> [1] 1
vec_get(y[[1]], 1)
#> [1] 1
#> attr(,"stuff")
#> [1] "foo"

@krlmlr krlmlr added the vctrs ↗️ Requires vctrs package label Mar 20, 2020
@krlmlr
Copy link
Member

krlmlr commented Mar 20, 2020

It's in the invariants, https://tibble.tidyverse.org/dev/articles/invariants.html#definition-of-xi-j-1:

i must be a numeric vector of length 1. x[[i, j]] is equal to x[i, ][[j]].

It doesn't talk about names yet. IIUC, we need to return an unnamed scalar in the last example:

x <- tibble::tibble(a = c(b = 1))
x$a
#> b 
#> 1
x[["a"]]
#> b 
#> 1
x[["a"]][[1]]
#> [1] 1
x[[1, "a"]]
#> b 
#> 1

Created on 2020-03-20 by the reprex package (v0.3.0)

@krlmlr krlmlr closed this as completed in 5e811c9 Mar 20, 2020
@krlmlr
Copy link
Member

krlmlr commented Mar 20, 2020

Thanks!

@krlmlr krlmlr added this to the 3.0.0 milestone Mar 21, 2020
@github-actions
Copy link
Contributor

This old thread has been automatically locked. If you think you have found something related to this, please open a new issue and link to this old issue if necessary.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 22, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
vctrs ↗️ Requires vctrs package
Projects
None yet
Development

No branches or pull requests

2 participants