Include TabPFNv2.5 as a df-analyze model #43

Description

@DM-Berger

The Hutter lab has finally released TabPFN v2.5 (https://github.com/PriorLabs/tabpfn; [docs]). Their results now suggest it regularly beats LightGBM, and it is now an actually installable library as well. It should definitely be incorporated into df-analyze.

Caveats: multiclass classification beyond "roughly" 10 classes requires using the ManyClassClassifier. How this limit can be "rough" is unclear, and it may matter when implementing a df-analyze TabPFNEstimator class. Alternatively, we might just always use the ManyClassClassifier and expose a tunable hyperparameter that disables it when there are fewer than roughly 10 classes; a sketch of that branching is below.
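
A minimal sketch of what that branching could look like, assuming a scikit-learn-style interface. The `tabpfn.TabPFNClassifier` import matches the linked repo, but the `tabpfn_extensions.many_class` import path, the `ManyClassClassifier(estimator=...)` constructor, and the hard-coded threshold of 10 are assumptions that need to be verified against the actual packages:

```python
# Sketch only: the ManyClassClassifier import path and constructor signature
# are assumptions; the "roughly 10" limit is hard-coded here as a placeholder.
import numpy as np

from tabpfn import TabPFNClassifier
from tabpfn_extensions.many_class import ManyClassClassifier  # path assumed

MAX_NATIVE_CLASSES = 10  # "roughly" 10 per the TabPFN docs; treat as tunable


def make_tabpfn_classifier(y: np.ndarray, force_many_class: bool = False):
    """Return a bare TabPFNClassifier, or wrap it in ManyClassClassifier when
    the target has more classes than the (rough) native limit."""
    n_classes = np.unique(y).size
    base = TabPFNClassifier()
    if force_many_class or n_classes > MAX_NATIVE_CLASSES:
        return ManyClassClassifier(estimator=base)  # kwarg name assumed
    return base
```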

We will also need to test CPU-only compute times and memory costs, but these will likely be manageable.
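
For the CPU check, something like the following throwaway harness would give a first number. The dataset size is arbitrary, the `device="cpu"` keyword is an assumption about the TabPFNClassifier constructor, and tracemalloc only sees Python-heap allocations (not PyTorch tensor memory), so treat the output as rough:

```python
# Throwaway CPU-only benchmark sketch; the device="cpu" kwarg is assumed, and
# tracemalloc misses torch-allocated memory, so the numbers are approximate.
import time
import tracemalloc

from sklearn.datasets import make_classification
from tabpfn import TabPFNClassifier

X, y = make_classification(n_samples=2000, n_features=50, n_informative=20,
                           n_classes=5, random_state=0)

tracemalloc.start()
start = time.perf_counter()

clf = TabPFNClassifier(device="cpu")  # kwarg name assumed
clf.fit(X[:1500], y[:1500])
clf.predict(X[1500:])

elapsed = time.perf_counter() - start
_, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"fit + predict: {elapsed:.1f} s, peak Python heap: {peak / 1e6:.0f} MB")
```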

Finally, and perhaps most trickily, we might need to alter the pre-processing pipeline if TabPFN doesn't like one-hot encoded DataFrames. In particular, it appears that TabPFN expects smoothly clipped values, in the manner of https://papers.nips.cc/paper_files/paper/2024/file/2ee1c87245956e3eaa71aaba5f5753eb-Paper-Conference.pdf (Section 3):

In the first step of RealMLP, we apply one-hot encoding to categorical columns with at most eight distinct values (not counting missing values). Binary categories are encoded to a single feature with values $\{-1, 1\}$. Missing values in categorical columns are encoded to zero. After that, all numerical columns, including the one-hot encoded ones, are preprocessed independently as follows: Let $x_1, ..., x_n \in \mathbb{R}$ be the values in column $i$, and let $q_p$ be the $p$-quantile of $(x_1, ..., x_n)$ for $p \in [0, 1]$. Then,

$$x_{j,\mathrm{processed}} = f(s_j \cdot (x_j - q_{1/2})),$$

where

$$f(x) = \frac{x}{\sqrt{1 + (x/3)^2}}$$

and

$$s_j = \begin{cases} \dfrac{1}{q_{3/4} - q_{1/4}}, & \text{if } q_{3/4} \neq q_{1/4}, \\[1ex] \dfrac{2}{q_1 - q_0}, & \text{if } q_{3/4} = q_{1/4} \text{ and } q_1 \neq q_0, \\[1ex] 0, & \text{otherwise.} \end{cases}$$

In scikit-learn, this corresponds to applying a RobustScaler (first case) or MinMaxScaler (second case), and then the function $f$, which smoothly clips its input to the range $(-3, 3)$.
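
For reference, a direct NumPy transcription of the quoted transformation, applied one column at a time; this is just the formula above, not code from TabPFN, RealMLP, or df-analyze:

```python
# Direct NumPy transcription of the quoted robust scale + smooth clip.
# Illustrative only; not taken from TabPFN, RealMLP, or df-analyze.
import numpy as np


def smooth_clip(x: np.ndarray) -> np.ndarray:
    """f(x) = x / sqrt(1 + (x/3)^2): smoothly clips values into (-3, 3)."""
    return x / np.sqrt(1.0 + (x / 3.0) ** 2)


def realmlp_scale_column(x: np.ndarray) -> np.ndarray:
    """Apply f(s * (x - median)) with the piecewise scale s from the quote."""
    q0, q25, q50, q75, q100 = np.quantile(x, [0.0, 0.25, 0.5, 0.75, 1.0])
    if q75 != q25:
        s = 1.0 / (q75 - q25)   # RobustScaler-like case
    elif q100 != q0:
        s = 2.0 / (q100 - q0)   # MinMaxScaler-like case
    else:
        s = 0.0                 # constant column: everything maps to 0
    return smooth_clip(s * (x - q50))
```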

We already do (hard) clipping and robust scaling in df-analyze, so this step may be redundant; however, the smooth clip probably still needs to be applied to the df-analyze pre-processed DataFrames to ensure the data matches the form TabPFN expects. Hopefully this is just done automatically.
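
If it turns out not to be automatic, one option would be to append the smooth clip as a final FunctionTransformer step after df-analyze's existing clipping/scaling. `existing_preprocessor` below is a hypothetical stand-in, not an actual df-analyze object:

```python
# Sketch: bolt the smooth clip onto the end of an existing preprocessing
# pipeline. `existing_preprocessor` is a hypothetical placeholder here.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, RobustScaler


def smooth_clip(x):
    # Same f as above: smoothly clips values into (-3, 3).
    return x / np.sqrt(1.0 + (x / 3.0) ** 2)


existing_preprocessor = RobustScaler()  # stand-in for df-analyze's current steps

tabpfn_preprocessor = make_pipeline(
    existing_preprocessor,
    FunctionTransformer(smooth_clip),
)
```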
