Cubist #28

Open
gtalckmin opened this issue Nov 10, 2020 · 0 comments

Hi @topepo,

I am working with raster datasets and comparing several tree- and rule-based algorithms (CART, Cubist, bagged trees, boosted trees, and random forests, "RF") on a regression problem (biomass per area).

My initial expectation was that Cubist would predict accurately while needing little processing power and time at prediction, because the fitted relationship between the predictors and the response variable has low complexity.

In terms of accuracy, Cubist performed as well as RF (according to Dunn's test on the results of a repeated k-fold cross-validation). M5, on the other hand, is lightning-fast (3 seconds) but not as accurate as RF.

Quite surprisingly, however, Cubist took around one minute to predict a raster that RF predicted in 19 seconds. The same pattern was reported in this paper: https://doi.org/10.1016/j.neunet.2018.12.010

I would be happy to provide a reprex if someone can supply a mock-up raster suited to regression rather than classification (although computing time should not depend on the task). I have seen one of your talks in which you mentioned that Cubist should be faster than RF, given that it is coded in C and is far smaller and more optimized than RF.

A Cubist model is around 100 kB, whereas an RF model is around 5 MB. However, model size is not a limiting factor in my context.

Is there something I am doing wrong? I would argue that Cubist, rather than RF, should be the workhorse for tasks such as mine; as it stands, however, Cubist is limited by its prediction time.

Cheers, Gustavo
PS: I also posted this question on Stack Overflow, but I reckon it is useful to have it here as well, since I am using your package as the basis for these statements.
