-
Couldn't load subscription status.
- Fork 34
WIP use Tables.Columns instead of columntable
#247
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
For which situations do we need |
| cols = termvars(formula) | ||
| materialize = Tables.materializer(data) | ||
| data = materialize(TableOperations.select(cols...)(data)) | ||
| drop = TableOperations.narrowtypes() ∘ TableOperations.dropmissing() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT TableOperations.dropmissing operates row-wise (it calls filter). I'm afraid this is going to kill performance for data frames.
Maybe an optimized method for column tables could be added? (EDIT: That's probably doable, as we can use a faster approach than filter since we know that the condition can be computed separately for each row.) Another solution would be to define dropmissing in DataAPI, say that dropmissing(::Any) is owned by TableOperations, but have dropmissing(::DataFrame) be defined in DataFrames.
Also, narrowtypes is a much more costly operation that just doing nonmissingtype(eltype(col)) as it requires going over all entries. DataFrames's dropmissing does that by default, maybe TableOperations could take a similar argument.
| function schema(ts::AbstractVector{<:AbstractTerm}, | ||
| data, | ||
| hints::Dict{Symbol}=Dict{Symbol,Any}()) | ||
| data = Tables.Columns(Tables.columns(data)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the advantage of wrapping the result of Tables.columns in a Tables.Columns object?
| # if the "hint" is already an AbstractTerm, use that | ||
| # need this specified to avoid ambiguity | ||
| concrete_term(t::Term, d::ColumnTable, hint::AbstractTerm) = hint | ||
| concrete_term(t::Term, d::Tables.Columns, hint::AbstractTerm) = hint |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not just
| concrete_term(t::Term, d::Tables.Columns, hint::AbstractTerm) = hint | |
| concrete_term(t::Term, d, hint::AbstractTerm) = hint |
This uses
Tables.Columnsas a (potentially) lightweight wrapper around input tables that does not convert them to the strongly typedNamedTupleofVectors representation. This might make some things easier on the compiler (e.g. #220 ).Requires Tables 1.6.0 since that's when
Columnsstopped being a lie ;)There's some design issues to work out here still, since a generic
NamedTuplecould be EITHER a column table (if it contains vectors) or a single row, and there are a handful of methods that specialize on that to provide special handling (most notablymodelcols(::InteractionTerm, ...)). What we PROBABLY will need to do is to add parallel methods forRowin a similar fashion, but I'm not sure about that. In the mean time, merging this would be breaking since you lose first class support for named tuples of singletons, which is part of the current public API. There may be a way around that but I haven't dug into it yet...