[BUG] method = 'rake' return AttributeError #73

Open
geniusjenny opened this issue Mar 5, 2024 · 18 comments
Labels
bug Something isn't working

Comments

@geniusjenny

geniusjenny commented Mar 5, 2024

Describe the bug

The same code runs without error with method = 'ipw' and method = 'cbps', but returns the error below with raking.
The code below returns the error:

sample_with_target.adjust(method = "rake",variables = variables) 
table_current.loc[feature, weight_col]                      
AttributeError: 'numpy.int64' object has no attribute 'loc'

###Update on 2023/03/08###
This error occurs because some of the bins that appear in the sample never appear in the target.
Once I add the sample to the target to make sure all bins appear in the target, the error disappears.
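
The cause described in the update can be checked before calling adjust. Below is a minimal pandas-only sketch, assuming the sample and target are plain DataFrames; the helper name levels_missing_from_target and the toy data are made up for illustration and are not part of balance.

```python
import pandas as pd


def levels_missing_from_target(sample_df, target_df, variables):
    """Map each variable to the sorted sample values that never occur in the target."""
    missing = {}
    for var in variables:
        sample_levels = set(sample_df[var].dropna().unique())
        target_levels = set(target_df[var].dropna().unique())
        unseen = sample_levels - target_levels
        if unseen:
            missing[var] = sorted(unseen)
    return missing


# Toy data (hypothetical): the sample has an "income" bin "high" that the target lacks.
sample_df = pd.DataFrame(
    {"income": ["low", "mid", "high"], "happiness": ["a", "b", "a"]}
)
target_df = pd.DataFrame(
    {"income": ["low", "mid", "mid"], "happiness": ["a", "b", "b"]}
)

missing = levels_missing_from_target(sample_df, target_df, ["income", "happiness"])
print(missing)
```

An empty result means every sample bin also occurs in the target for the listed variables.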

Session information

Please paste here the output of running the following in your notebook/terminal:

# Sessions info
import session_info
session_info.show(html=False, dependencies=True)

balance 0.9.1
balance_functions NA
boto3 1.28.28
dateutil 2.8.2
matplotlib 3.7.2
numpy 1.24.4
pandas 1.4.3
psutil 5.9.5
seaborn 0.12.2
session_info 1.0.0
tqdm 4.65.0

OpenSSL 23.2.0
PIL 10.0.0
anyio NA
arrow 1.2.3
asttokens NA
attr 23.1.0
attrs 23.1.0
babel 2.12.1
backcall 0.2.0
beta_ufunc NA
binom_ufunc NA
botocore 1.31.28
brotli NA
certifi 2023.05.07
cffi 1.15.1
charset_normalizer 3.2.0
cloudpickle 2.2.1
colorama 0.4.4
comm 0.1.3
coxnet NA
cryptography 41.0.2
cvcompute NA
cvelnet NA
cvfishnet NA
cvglmnet NA
cvglmnetCoef NA
cvglmnetPredict NA
cvlognet NA
cvmrelnet NA
cvmultnet NA
cycler 0.10.0
cython_runtime NA
debugpy 1.6.7
decorator 5.1.1
defusedxml 0.7.1
elnet NA
executing 1.2.0
fastjsonschema NA
fishnet NA
fqdn NA
fsspec 2023.6.0
glmnet NA
glmnetCoef NA
glmnetControl NA
glmnetPredict NA
glmnetSet NA
glmnet_python NA
google NA
hypergeom_ufunc NA
idna 3.4
ipfn NA
ipykernel 6.24.0
ipython_genutils 0.2.0
ipywidgets 8.0.7
isoduration NA
jedi 0.18.2
jinja2 3.1.2
jmespath 1.0.1
joblib 1.3.1
json5 NA
jsonpointer 2.4
jsonschema 4.18.4
jsonschema_specifications NA
jupyter_events 0.6.3
jupyter_server 2.7.0
jupyterlab_server 2.23.0
kiwisolver 1.4.4
loadGlmLib NA
lognet NA
markupsafe 2.1.3
matplotlib_inline 0.1.6
mpl_toolkits NA
mrelnet NA
nbformat 5.9.1
nbinom_ufunc NA
ncf_ufunc NA
overrides NA
packaging 21.3
parso 0.8.3
patsy 0.5.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.9.1
plotly 5.15.0
prometheus_client NA
prompt_toolkit 3.0.39
ptyprocess 0.7.0
pure_eval 0.2.2
pyarrow 12.0.1
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.15.1
pyparsing 3.0.9
pythonjsonlogger NA
pytz 2023.3
referencing NA
requests 2.31.0
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
rpds NA
s3fs 0.4.2
scipy 1.9.1
send2trash NA
six 1.16.0
sklearn 1.3.0
sniffio 1.3.0
socks 1.7.1
stack_data 0.6.2
statsmodels 0.14.0
tenacity NA
threadpoolctl 3.2.0
tornado 6.3.2
traitlets 5.9.0
typing_extensions NA
uri_template NA
urllib3 1.26.14
wcwidth 0.2.6
webcolors 1.13
websocket 1.6.1
wtmean NA
yaml 6.0
zmq 25.1.0

IPython 8.14.0
jupyter_client 8.3.0
jupyter_core 5.3.1
jupyterlab 4.0.3
notebook 6.5.4

Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:40:32) [GCC 12.3.0]
Linux-5.10.209-198.812.amzn2.x86_64-x86_64-with-glibc2.26

Session information updated at 2024-03-05 04:21

Screenshots

If applicable, add screenshots to help explain your problem.
image (4)
image (5)

Reproducible example

Please provide us with (any that apply):

  1. Code: code we can run to reproduce the issue (in terminal or python notebook)
    sample = Sample.from_frame(sample_df2[:50])
    target = Sample.from_frame(target_df2[:500])
    sample_with_target = sample.set_target(target)
    adjusted_ads_weight = sample_with_target.adjust(method="rake", variables=variables_subset2)
    sample_df2 and target_df2 are dataframes with two numerical columns.
    image (6)

  2. Reference: If the issue is in a tutorial, please provide the link to it, and the exact place in which the code fails.

Additional context

Add any other context about the problem here that might help us solve it.

@geniusjenny geniusjenny added the bug Something isn't working label Mar 5, 2024
@geniusjenny geniusjenny changed the title from "[BUG]" to "[BUG] method = 'rake' return AttributeError" Mar 5, 2024
@talgalili
Contributor

Hey @geniusjenny

Thanks for the bug report!

Could you please try to run the code from the rake tutorial:
https://import-balance.org/docs/tutorials/quickstart_rake/
And see whether the error also occurs there?

What would help me is a fully self-contained reproducible example that I could run in my env to reproduce the error - that would allow me to more easily iterate to get a solution.

Thanks upfront!

@geniusjenny
Author

Thanks for the replies!
The sample code runs smoothly with no error.
image (7)

@talgalili
Contributor

Thanks for checking @geniusjenny
Any way you could play around and try to find a way to reproduce the issue?
I suggest you look at the
sample.df.info()
And look at the data types, and maybe the hint could be there.

Once you find a way to reproduce the issue, I'll be able to work on it.
WDYT?
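
The dtype comparison suggested above can be done side by side. This is a minimal sketch on plain pandas DataFrames with made-up data; compare_dtypes is a hypothetical helper, not a balance function.

```python
import pandas as pd


def compare_dtypes(sample_df, target_df):
    """Return one row per column with its dtype in each frame (NaN if the column is absent)."""
    return pd.DataFrame(
        {
            "sample": sample_df.dtypes.astype(str),
            "target": target_df.dtypes.astype(str),
        }
    )


# Toy frames (hypothetical): "income" is integer in one frame and float in the other.
sample_df = pd.DataFrame({"income": [1, 2, 3], "happiness": ["a", "b", "a"]})
target_df = pd.DataFrame({"income": [1.0, 2.5, 3.0], "happiness": ["a", "b", "b"]})

table = compare_dtypes(sample_df, target_df)
# Keep only the columns whose dtypes differ between the two frames.
mismatched = table[table["sample"] != table["target"]]
print(mismatched)
```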

@geniusjenny
Author

geniusjenny commented Mar 5, 2024

Hi talgalili, I tried to reproduce the issue but couldn't. I tried using two numerical features ['income', 'happiness'], similar to what I have in my dataset, and the code runs smoothly.
I attached the sample data here so you can reproduce the issue. Sorry that I couldn't be more helpful.

Thank you so much.
sample_test2.csv
target_test2.csv
code:

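import pandas as pd
from balance import Sample

# Load the attached sample and target data; the first CSV column is the index.
s2 = pd.read_csv('sample_test2.csv', index_col=0)
t2 = pd.read_csv('target_test2.csv', index_col=0)
sample = Sample.from_frame(s2)
target = Sample.from_frame(t2)
sample_with_target = sample.set_target(target)
# Raking is the method that raises the AttributeError.
adjusted_ads_weight1 = sample_with_target.adjust(method="rake")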

@talgalili
Contributor

Thanks @geniusjenny

Just to double check, could you please paste the full output of you running the above code?
And please also include the output of:
sample.df.info()
target.df.info()

Thanks!

@geniusjenny
Author

Sure!
Full output:
image (8)
image (4)
image (5)

df.info:
image (9)

@talgalili
Contributor

talgalili commented Mar 6, 2024 via email

@geniusjenny
Author

geniusjenny commented Mar 6, 2024

Hi talgalili,
I just tried binning the numerical variables into categorical variables, but the code still returns the same error, while method='cbps' and method='ipw' run smoothly.

Here are the code and df.info:
image (10)
ERROR:
image (11)
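
The exact binning code is only visible in the screenshot, so the following is a sketch under assumptions: it bins both frames with pd.cut using the same made-up edges and labels. Sharing edges gives both frames the same label set, but it does not guarantee that every bin is populated in the target, which is consistent with binning alone not removing the error.

```python
import pandas as pd

# Hypothetical bin edges and labels, applied identically to both frames.
edges = [0, 10, 20, 30]
labels = ["0-10", "10-20", "20-30"]

sample_df = pd.DataFrame({"income": [5, 15, 25]})
target_df = pd.DataFrame({"income": [3, 8, 12]})

for df in (sample_df, target_df):
    binned = pd.cut(df["income"], bins=edges, labels=labels, include_lowest=True)
    # Store the bin labels as plain strings rather than a categorical dtype.
    df["income_bin"] = binned.astype(str)

# Bins that occur in the sample but are empty in the target.
unseen = set(sample_df["income_bin"]) - set(target_df["income_bin"])
print(unseen)
```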

@talgalili
Contributor

Thanks @geniusjenny
Interesting!
Could you please change the dtype of the bucketed variables from 'categorical' to 'object', and let me know if this resolves the error you get?
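
For reference, a minimal pandas sketch of that conversion; the column name income_bin and the data are made up for illustration.

```python
import pandas as pd

# A toy frame with one categorical column (hypothetical name and values).
df = pd.DataFrame({"income_bin": pd.Categorical(["low", "high", "low"])})

# Convert the categorical column to the generic object dtype.
df["income_bin"] = df["income_bin"].astype(object)
print(df["income_bin"].dtype)
```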

@geniusjenny
Author

I also tried that. Still getting the same error.

image

@geniusjenny
Author

I think I may have found the issue.
Some of the bins that appear in the sample never appear in the target, which causes this error.
Once I add the sample to the target, the error disappears.
I suggest the code take this edge case into consideration as well!

t2=pd.concat([s2,t2])
t2.reset_index(inplace=True)
t2['id']=t2.index.astype('str')
image

@talgalili
Contributor

Great catch - thanks a bunch @geniusjenny !

O.k., I'll leave this issue open - and we'll get to add a proper exception in the future.
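
Until such an exception exists in the library, a user-side guard could look roughly like this. It is only a sketch on plain pandas DataFrames: check_target_covers_sample is a hypothetical helper, not balance's API, and the data is made up.

```python
import pandas as pd


def check_target_covers_sample(sample_df, target_df, variables):
    """Raise ValueError if a sample value of any listed variable is absent from the target."""
    for var in variables:
        sample_levels = set(sample_df[var].dropna().unique())
        target_levels = set(target_df[var].dropna().unique())
        unseen = sample_levels - target_levels
        if unseen:
            raise ValueError(
                f"Variable {var!r} has values in the sample that never "
                f"appear in the target: {sorted(unseen)}"
            )


# Toy data (hypothetical): bin "c" exists only in the sample.
sample_df = pd.DataFrame({"bin": ["a", "b", "c"]})
target_df = pd.DataFrame({"bin": ["a", "b", "b"]})

try:
    check_target_covers_sample(sample_df, target_df, ["bin"])
    message = None
except ValueError as err:
    message = str(err)
print(message)
```

Running a guard like this before adjust(method="rake") would surface the mismatch with a readable message instead of the AttributeError.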

Thanks again.

@geniusjenny
Author

Thank you!

@EmanueleCeglia

EmanueleCeglia commented May 10, 2024

I'm jumping into this issue because I have the same problem.
In my case I have no missing data in the target. I am trying to use the marginal distribution with rake.
If there is no weight column in the "sample" dataframe and in "target_df_from_marginals", one is automatically created with values equal to 1.
Then I tried creating the "weight" column myself for both "target_df_from_marginals" and the dataframe used to create "sample", using 1.0 instead of 1 so that the dtype is float. This time the error message is:
AttributeError: 'numpy.float64' object has no attribute 'loc'
Do you have any suggestions? @talgalili

@talgalili
Contributor

Hey @EmanueleCeglia ,
Do you want to share the code you used?
My guess is that you need to add the weight column to the DataFrame of your data before using Sample.from_frame so it will inherit from pandas the relevant methods.

@EmanueleCeglia

image
df_sorted is a dataframe with two columns, ctrysize and ctrysect (sorted in alphabetical order). This is the df whose weights I have to calibrate.
ctrysize combines 12 EU countries with the firm size within each country (from 1 to 4).
ctrysect combines 12 EU countries with the firm's sector within each country (from A to D).

For each of these combinations I have the real EU totals, and I want to use them as margins for the calibration.
In the pictures below you can see how I used the totals to create the dictionary "a_dict_with_marginal_distributions".
image
image
then
image
image
image
Error
image
Hope it's clear enough.
In any case I can provide additional details.
Thanks @talgalili
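
The dictionary itself is only visible in the screenshots, so the following is a sketch under assumptions rather than the code used above. It assumes the marginal distributions are wanted as proportions that sum to 1 within each variable, and converts raw totals accordingly. The level names such as "AT_1" and all counts are made up.

```python
def totals_to_proportions(totals):
    """Convert a {level: total} mapping into {level: proportion} summing to 1."""
    grand_total = sum(totals.values())
    if grand_total <= 0:
        raise ValueError("Totals must sum to a positive number.")
    return {level: count / grand_total for level, count in totals.items()}


# Hypothetical population totals per level of each variable.
marginal_totals = {
    "ctrysize": {"AT_1": 300, "AT_2": 100},
    "ctrysect": {"AT_A": 150, "AT_B": 250},
}

marginal_proportions = {
    var: totals_to_proportions(totals) for var, totals in marginal_totals.items()
}
print(marginal_proportions)
```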

@talgalili
Contributor

Hi @EmanueleCeglia

  1. could you please open a new issue for this discussion? (this seems like a separate issue)
  2. If you run this tutorial, does it work? https://import-balance.org/docs/tutorials/quickstart_rake/
  3. Note that you have a huge number of tiny buckets. Regardless of this bug, are you sure you have values for each of them in your sample?

(please let's continue this discussion in the new bug you'll open - thanks)

@EmanueleCeglia

Hi @talgalili yes the tutorial works perfectly
image
I am going to open a new issue so we can continue there


3 participants