Include TabPFNv2.5 as a df-analyze model #43

Description

@DM-Berger

The Hutter lab has finally released TabPFN v2.5 (https://github.com/PriorLabs/tabpfn; [docs]). Their results now suggest it regularly beats LightGBM, and it is now an actually installable library as well. It should definitely be incorporated into df-analyze.

Caveats: multiclass classification beyond "roughly" 10 classes requires using the ManyClassClassifier. How this limit can be "rough" is unclear, and it may matter when implementing a df-analyze TabPFNEstimator class. Alternatively, we might just always use the ManyClassClassifier and expose a tunable hyperparameter that disables it when there are fewer than roughly 10 classes; a sketch of that branching is below.
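
A minimal sketch of what that branching could look like, assuming a scikit-learn-style interface. The `tabpfn.TabPFNClassifier` import matches the linked repo, but the `tabpfn_extensions.many_class` import path, the `ManyClassClassifier(estimator=...)` constructor, and the hard-coded threshold of 10 are assumptions that need to be verified against the actual packages:

```python
# Sketch only: the ManyClassClassifier import path and constructor signature
# are assumptions; the "roughly 10" limit is hard-coded here as a placeholder.
import numpy as np

from tabpfn import TabPFNClassifier
from tabpfn_extensions.many_class import ManyClassClassifier  # path assumed

MAX_NATIVE_CLASSES = 10  # "roughly" 10 per the TabPFN docs; treat as tunable


def make_tabpfn_classifier(y: np.ndarray, force_many_class: bool = False):
    """Return a bare TabPFNClassifier, or wrap it in ManyClassClassifier when
    the target has more classes than the (rough) native limit."""
    n_classes = np.unique(y).size
    base = TabPFNClassifier()
    if force_many_class or n_classes > MAX_NATIVE_CLASSES:
        return ManyClassClassifier(estimator=base)  # kwarg name assumed
    return base
```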

We will also need to test CPU-only compute times and memory costs, but these will likely be manageable.
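
For the CPU check, something like the following throwaway harness would give a first number. The dataset size is arbitrary, the `device="cpu"` keyword is an assumption about the TabPFNClassifier constructor, and tracemalloc only sees Python-heap allocations (not PyTorch tensor memory), so treat the output as rough:

```python
# Throwaway CPU-only benchmark sketch; the device="cpu" kwarg is assumed, and
# tracemalloc misses torch-allocated memory, so the numbers are approximate.
import time
import tracemalloc

from sklearn.datasets import make_classification
from tabpfn import TabPFNClassifier

X, y = make_classification(n_samples=2000, n_features=50, n_informative=20,
                           n_classes=5, random_state=0)

tracemalloc.start()
start = time.perf_counter()

clf = TabPFNClassifier(device="cpu")  # kwarg name assumed
clf.fit(X[:1500], y[:1500])
clf.predict(X[1500:])

elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"fit + predict: {elapsed:.1f} s, peak Python heap: {peak / 1e6:.0f} MB")
```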

Finally, and perhaps most trickily, we might need to alter the pre-processing pipeline if TabPFN doesn't like one-hot encoded DataFrames. In particular, it appears that TabPFN expects smoothly clipped values, in the manner of https://papers.nips.cc/paper_files/paper/2024/file/2ee1c87245956e3eaa71aaba5f5753eb-Paper-Conference.pdf (Section 3):

In the first step of RealMLP, we apply one-hot encoding to categorical columns with at most eight distinct values (not counting missing values). Binary categories are encoded to a single feature with values $\{-1, 1\}$. Missing values in categorical columns are encoded to zero. After that, all numerical columns, including the one-hot encoded ones, are preprocessed independently as follows: Let $x_1, ..., x_n \in \mathbb{R}$ be the values in column $i$, and let $q_p$ be the $p$-quantile of $(x_1, ..., x_n)$ for $p \in [0, 1]$. Then,

$$x_{j,\mathrm{processed}} = f(s_j \cdot (x_j - q_{1/2})),$$

where

$$f(x) = \frac{x}{\sqrt{1 + (x/3)^2}}$$

and

$$s_j = \begin{cases} \dfrac{1}{q_{3/4} - q_{1/4}}, & \text{if } q_{3/4} \neq q_{1/4}, \\[1ex] \dfrac{2}{q_1 - q_0}, & \text{if } q_{3/4} = q_{1/4} \text{ and } q_1 \neq q_0, \\[1ex] 0, & \text{otherwise.} \end{cases}$$

In scikit-learn, this corresponds to applying a RobustScaler (first case) or MinMaxScaler (second case), and then the function $f$, which smoothly clips its input to the range $(-3, 3)$.
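
For reference, a direct NumPy transcription of the quoted transformation, applied one column at a time; this is just the formula above, not code from TabPFN, RealMLP, or df-analyze:

```python
# Direct NumPy transcription of the quoted robust scale + smooth clip.
# Illustrative only; not taken from TabPFN, RealMLP, or df-analyze.
import numpy as np


def smooth_clip(x: np.ndarray) -> np.ndarray:
    """f(x) = x / sqrt(1 + (x/3)^2): smoothly clips values into (-3, 3)."""
    return x / np.sqrt(1.0 + (x / 3.0) ** 2)


def realmlp_scale_column(x: np.ndarray) -> np.ndarray:
    """Apply f(s * (x - median)) with the piecewise scale s from the quote."""
    q0, q25, q50, q75, q100 = np.quantile(x, [0.0, 0.25, 0.5, 0.75, 1.0])
    if q75 != q25:
        s = 1.0 / (q75 - q25)   # RobustScaler-like case
    elif q100 != q0:
        s = 2.0 / (q100 - q0)   # MinMaxScaler-like case
    else:
        s = 0.0                 # constant column: everything maps to 0
    return smooth_clip(s * (x - q50))
```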

We already do (hard) clipping and robust scaling in df-analyze, so this step may be redundant; however, the smooth clip probably still needs to be applied to the df-analyze pre-processed DataFrames to ensure the data matches the form TabPFN expects. Hopefully this is just done automatically.
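
If it turns out not to be automatic, one option would be to append the smooth clip as a final FunctionTransformer step after df-analyze's existing clipping/scaling. `existing_preprocessor` below is a hypothetical stand-in, not an actual df-analyze object:

```python
# Sketch: bolt the smooth clip onto the end of an existing preprocessing
# pipeline. `existing_preprocessor` is a hypothetical placeholder here.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, RobustScaler


def smooth_clip(x):
    # Same f as above: smoothly clips values into (-3, 3).
    return x / np.sqrt(1.0 + (x / 3.0) ** 2)


existing_preprocessor = RobustScaler()  # stand-in for df-analyze's current steps

tabpfn_preprocessor = make_pipeline(
    existing_preprocessor,
    FunctionTransformer(smooth_clip),
)
```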
