Recommendation for minimum sample size for classification #8
Hi @mestaki, thanks for the kind words, glad to hear you found Codacore useful! Great question. I agree it can depend on the nature of the data, but as a rule of thumb I would make sure to use strong regularization ( Admittedly, we have only experimented with datasets of more than 100 samples, so I can't speak with too much confidence. But ultimately, I'd always make sure to keep at least a couple dozen samples in a held-out test set (so a 50/50 split when there are only 40-50 samples), and to check that the predictions of the trained model are better than simple baselines/established benchmarks. If you do observe a significant improvement, that may be enough to identify relevant biomarkers or at least to inform follow-up research. Thanks!
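To make the held-out-set advice concrete, here is a minimal Python sketch (not Codacore code, which is an R package) of a stratified 50/50 split and the majority-class baseline that a trained model should beat. The function names `stratified_split` and `majority_baseline_accuracy` and the toy labels are hypothetical, chosen only for illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.5, seed=0):
    """Split sample indices into train/test, keeping class proportions.

    Hypothetical helper for illustration; not part of any package.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def majority_baseline_accuracy(train_labels, test_labels):
    """Accuracy of always predicting the most common training class."""
    majority = Counter(train_labels).most_common(1)[0][0]
    return sum(y == majority for y in test_labels) / len(test_labels)


# Toy data (assumed): 48 samples, 30 cases and 18 controls.
labels = ["case"] * 30 + ["control"] * 18
train_idx, test_idx = stratified_split(labels, test_fraction=0.5)
train_labels = [labels[i] for i in train_idx]
test_labels = [labels[i] for i in test_idx]

# Any fitted model should exceed this accuracy on the same test set.
baseline = majority_baseline_accuracy(train_labels, test_labels)
print(len(test_idx), baseline)
```

With 48 samples this leaves 24 in the test set, i.e. the "couple dozen" mentioned above, and the baseline is simply the share of the majority class among them.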
Thank you very much @egr95! This is very useful. Just a follow-up on this issue of smaller sample sizes. In your CoDaCoRe tutorial, you use the same 80% of both groups for training, even though the CD group is about twice as large as the control group. In my experience this would be problematic with regular RF if the model is trained on a lot more of one group. Is this not an issue with CoDaCoRe? Or is there a general threshold to look out for? Again with the low sample size in mind, say the data is 50:200 control to case.
You're welcome! You are also right to be concerned about class imbalance. It's really up to the user to decide what's acceptable and whether class rebalancing is preferred. In the case of the Crohn's disease data given in the Guide, the sample size is fairly large (~1000) and the class imbalance (1:2) is not too severe, so I decided to keep things simple and avoid overcomplicating the tutorial with a discussion of class imbalance. After all, the main goal of the Guide is to showcase the functionality of the Codacore package itself. However, the example you mention, with a low sample size and a more severe imbalance (1:4), does sound riskier, so my advice would be to consider the usual techniques for imbalanced classification, e.g., precision-recall curves, resampling, etc.
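As one illustration of the resampling option, here is a minimal Python sketch of random oversampling applied to a training set with the 1:4 imbalance discussed above. It is a generic technique rather than anything specific to Codacore; `oversample_minority` is a hypothetical helper, and the 40:160 training labels assume an 80% split of the 50:200 example.

```python
import random
from collections import Counter


def oversample_minority(labels, seed=0):
    """Return sample indices in which every class is as large as the largest.

    Smaller classes are topped up by drawing their own samples with
    replacement. Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())

    indices = list(range(len(labels)))
    for cls, n in counts.items():
        pool = [i for i, y in enumerate(labels) if y == cls]
        # k is 0 for the largest class, so nothing is added for it.
        indices.extend(rng.choices(pool, k=target - n))
    return indices


# Assumed training labels: 80% of a 50:200 control-to-case dataset.
train_labels = ["control"] * 40 + ["case"] * 160
resampled = oversample_minority(train_labels)
balanced = Counter(train_labels[i] for i in resampled)
print(balanced)
```

Resampling like this should be applied to the training set only; the held-out test set keeps its original class proportions so that the evaluation reflects the real data.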
Hi folks,
I've been playing with CoDaCoRe for a few different projects now and wanted to give a big shout-out for this awesome tool; its speed in particular is just outrageous.
I have a few questions and maybe comments which I'll post as separate issues (so apologies if you get a few notifications in a row).
My first question: do you folks have a sense of what the minimum sample size would be for codacore to perform reliably? I'm sure it depends a lot on the nature of the data, but generally speaking, is this similar to regular supervised classification? For example, with random forest I usually wouldn't even bother without at least 40-50 samples. Perhaps more importantly, do you have any recommendations for the minimum size of the training set?