-
Notifications
You must be signed in to change notification settings - Fork 9
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
add RandomUnderSampling functionality #89
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contribution!
I can't access the branch security settings for this repo, but I'm guessing that 100% codecov is a required test. Do you have test cases for each option? These would also be a great help for me, since I'm less familiar with this repo.
Also, I would change the version "0.9.0-dev".
Project.toml
Outdated
@@ -1,23 +1,26 @@ | |||
name = "ClassImbalance" | |||
uuid = "04a18a73-7590-580c-b363-eeca0919eb2a" | |||
authors = ["Paul Stey <[email protected]>", "Dilum Aluthge <[email protected]>", "Brown Center for Biomedical Informatics <[email protected]>"] | |||
version = "0.8.7" | |||
version = "0.8.8" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be a minor release ("0.9.0"
or "0.9.0-dev"
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
alright, I'll do that
replacement::Bool = false | ||
) where T <: AbstractVector where S <: Integer where A <: Any | ||
# check if X implements getobs | ||
@assert Tables.istable(X) "$X is not implementing the MLUtils.jl getobs interface" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you have any example inputs? Here is what I tried to get around this @assert
:
pkg> activate --temp
pkg> add https://github.com/vmpyr/ClassImbalance.jl#vmpyr/random_undersampler
# Example from README:
julia> using ClassImbalance
julia> y = vcat(ones(20), zeros(180)); # 0 = majority, 1 = minority
julia> X = hcat(rand(200, 10), y)
julia> X2, y2 = smote(X, y, k = 5, pct_under = 100, pct_over = 200)
([0.8133030487718941 0.6399427612098323 … 0.18570608874157368 1.0; 0.7749414282075959 0.46482119778838893 … 0.17369768251844653 1.0; … ; 0.8348062387369848 0.04923338419004153 … 0.8706114262388865 1.0; 0.8582028352406533 0.39648688626915957 … 0.46337293622094944 1.0], [1.0, 1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0 … 1.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0])
julia> using Tables
julia> random_undersampler(Tables.table(X),y)
(Tables.DictColumnTable with 40 rows, 11 columns, and schema:
:Column1 Float64
:Column2 Float64
:Column3 Float64
:Column4 Float64
:Column5 Float64
:Column6 Float64
:Column7 Float64
:Column8 Float64
:Column9 Float64
:Column10 Float64
:Column11 Float64, [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
julia> random_undersampler(Tables.table(X2),y2)
(Tables.DictColumnTable with 80 rows, 11 columns, and schema:
:Column1 Float64
:Column2 Float64
:Column3 Float64
:Column4 Float64
:Column5 Float64
:Column6 Float64
:Column7 Float64
:Column8 Float64
:Column9 Float64
:Column10 Float64
:Column11 Float64, [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes I do have one I tried. I had .csv files for it too btw, you can find them here.
# import the required packages
import CSV
using ClassImbalance, DataFrames, DelimitedFiles
# load the data as defined in the task guidelines
# load X as a DataFrame
X = CSV.read("./data/pima-indians-diabetes-X.csv", DataFrame; header=["x1","x2","x3","x4","x5","x6","x7","x8"])
# load y as a CategoricalVector
y = CategoricalVector(vec(readdlm("./data/pima-indians-diabetes-y.csv", Int)))
# call the undersampler
Xover, yover = random_undersampler(X, y)
# print the results
show(Xover)
println()
show(IOContext(stdout, :limit => true), yover)
println()
# print the counts of the classes
println("initial count of class 0: ", count(y .== 0))
println("initial count of class 1: ", count(y .== 1))
println("count of class 0 after undersampling: ", count(yover .== 0))
println("count of class 1 after undersampling: ", count(yover .== 1))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I've written the function such that X
contains only the features and not the labels. I'm writing a overloaded function for a data matrix containing both features and labels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In your example, the function does work correctly anyway, you can access the columns and check the data or the count of it.
julia> Xover, yover = random_undersampler(Tables.table(X),y)
(Tables.DictColumnTable with 40 rows, 10 columns, and schema:
:Column1 Float64
:Column2 Float64
:Column3 Float64
:Column4 Float64
:Column5 Float64
:Column6 Float64
:Column7 Float64
:Column8 Float64
:Column9 Float64
:Column10 Float64, [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 … 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
julia> count(y .== 1)
20
julia> count(y .== 0)
180
julia> count(yover .== 0)
20
julia> count(yover .== 1)
20
julia> Xover
Tables.DictColumnTable with 40 rows, 10 columns, and schema:
:Column1 Float64
:Column2 Float64
:Column3 Float64
:Column4 Float64
:Column5 Float64
:Column6 Float64
:Column7 Float64
:Column8 Float64
:Column9 Float64
:Column10 Float64
julia> Xover.Column1
40-element Vector{Float64}:
0.3952844545871751
0.8704752637842621
0.6724007567263786
0.35193156297873285
0.7208683859930942
0.4178980161173723
0.33217406300791696
0.07224589352246602
0.09320295190634142
0.10316414877241253
0.6166656578236519
0.20575631236600955
0.5488182158068842
0.3156879765220615
0.8174789837160467
⋮
0.10887263478147902
0.6100936867437683
0.9340285563941725
0.11318763378056362
0.24546432607219804
0.2788921342943955
0.6103736533699433
0.05058497458051314
0.9006314393478084
0.3487415101299778
0.693372173075407
0.4980414320194523
0.9952081568741031
0.8699331124771522
0.11711882057394163
Just a quick comment. This package hasn't been updated in a while, so the CI in this repo is Travis CI. Of course, Travis CI no longer offers free CI for open source projects. So we need to migrate the CI to GitHub Actions CI. @vmpyr @AshlinHarris Would you two be willing to work together to migrate CI to GitHub Actions? You can use https://discourse.julialang.org/t/easy-workflow-file-for-setting-up-github-actions-ci-for-your-julia-package/49765 as a starting point. |
@DilumAluthge Could you please add me as an admin on this repo? I can't access the branch protection, but I'll probably need to remove the travis-ci requirement. |
Done. |
If possible, I'd recommend doing the Travis-to-GitHub-Actions migration in a separate PR. I tend to prefer multiple smaller PRs (instead of one large PR), because I find that the smaller PRs are each easier to review. |
Not as of now, I have written the functionality and will start writing the tests today most probably. |
Yeah, we should keep the requirement of having 100% code coverage. |
Yes, I'll happily work on that. Will I be needed to have admin privileges as well? If so, please add me as admin too. The repo needs a huge overhaul since it is very old now. |
It'll be sufficient for @AshlinHarris to have admin permissions; there are relatively few settings that need to be modified. |
Work underway. Have added functions to utils for future use as well