integer overflow #6

Open
crahal opened this issue Aug 24, 2024 · 5 comments

Comments

crahal commented Aug 24, 2024

Whenever my 'x' has more than ~2750 organisations, I get this error (with every model I've tried):

Error in if (machine == "localhost") "localhost" else getClusterOption("master",  : 
  missing value where TRUE/FALSE needed
In addition: Warning message:
In nrow(x) * nrow(y) : NAs produced by integer overflow

Again on Windows, R 3.4.

cjerzak (Owner) commented Aug 24, 2024

What's the dimensionality of 'y' in this case?

crahal (Author) commented Aug 25, 2024

~700k or so

cjerzak (Owner) commented Aug 25, 2024

There's an expand.grid of 1:2750 against 1:700k, and this is likely causing the overflow. I'll ponder a workaround and run some tests on this case. (So far, we've only tested merges of dimensionality ~100k.) More soon.
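For reference, a minimal sketch of the arithmetic behind the warning. The sizes are illustrative assumptions (the real nrow(y) was only given as "~700k or so"), and the variable names are made up for the example. R keeps an integer times an integer as an integer, so any pair count above .Machine$integer.max (2147483647) becomes NA, and that NA could then plausibly end up in an if() condition further along.

```r
# Illustrative sizes only; these are assumptions, not the reporter's exact data.
n_x <- 3100L
n_y <- 700000L

# integer * integer stays integer in R, so a product above
# .Machine$integer.max overflows to NA (with a warning, suppressed here).
n_pairs_int <- suppressWarnings(n_x * n_y)
is.na(n_pairs_int)

# Converting one operand to double first avoids the overflow.
n_pairs_dbl <- as.numeric(n_x) * n_y
n_pairs_dbl > .Machine$integer.max

# Largest nrow(x) whose product with n_y still fits in an integer.
# Division in R always returns a double, so this step cannot overflow.
max_x <- floor(.Machine$integer.max / n_y)
```

With exactly 700,000 rows in 'y' the limit would sit a little above 3,000 rows of 'x'; a slightly larger 'y' would bring it down toward the ~2750 reported.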

crahal (Author) commented Aug 25, 2024

How detrimental to linkage performance would it be to iterate through chunks of 1k 'x' at a time? Is any of the training holistic, or are all of the linkages one-shot?

cjerzak (Owner) commented Aug 26, 2024

Linkages are one-shot, so iterating through chunks in the way described should give the same results, with one qualification: the acceptable-match threshold may be set dynamically from the input data. To disable that, one can set AveMatchNumberPerAlias = NULL and set MaxDist = c for some floating-point constant c.

In general, it's hard to know what that c should be, but looking at a histogram of distances between matched and non-matched points, if available, can help.
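A rough sketch of the chunked approach, under stated assumptions: link_in_chunks, link_fn, and toy_link are hypothetical names for illustration, not part of LinkOrgs. link_fn stands in for whatever wrapper calls the package's linkage function; toy_link below is only an exact-match placeholder so the sketch runs on its own.

```r
# Split the rows of x into chunks, link each chunk against all of y,
# and stack the per-chunk results. link_fn is a hypothetical user-supplied
# wrapper taking (x_chunk, y) and returning a data frame of matches.
link_in_chunks <- function(x, y, link_fn, chunk_size = 1000L) {
  # Assign consecutive blocks of chunk_size row indices to each chunk.
  chunk_id <- ceiling(seq_len(nrow(x)) / chunk_size)
  row_chunks <- split(seq_len(nrow(x)), chunk_id)
  pieces <- lapply(row_chunks, function(rows) {
    link_fn(x[rows, , drop = FALSE], y)
  })
  out <- do.call(rbind, pieces)
  rownames(out) <- NULL
  out
}

# Toy stand-in for the real linkage call: exact match on a "name" column.
toy_link <- function(x_chunk, y) merge(x_chunk, y, by = "name")

x <- data.frame(name = sprintf("org%04d", 1:2500), stringsAsFactors = FALSE)
y <- data.frame(name = sprintf("org%04d", seq(2, 5000, by = 2)),
                id = 1:2500, stringsAsFactors = FALSE)
linked <- link_in_chunks(x, y, toy_link)
```

In real use, the wrapper passed as link_fn would presumably set AveMatchNumberPerAlias = NULL and a fixed MaxDist, so that the match threshold does not vary from chunk to chunk.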

You might also want to check out ZoomerJoin for a big matching task like this: it's designed specifically for very large merges and computes matches approximately, using locality-sensitive hashing. Ben (of ZoomerJoin) and I are in the process of adding ZoomerJoin capabilities to LinkOrgs, but in the meantime it wouldn't be too hard to output the machine-learned representations of the organizational aliases, which could then be fed into, e.g., ZoomerJoin.
