
The input checks seem too strict #10

Open
krassowski opened this issue Jul 24, 2019 · 4 comments

Comments

@krassowski

I've got a couple of cases where I wished to run o2m but could not because the input checks failed: data with NaN is not accepted; it is impossible to perform O2PLS-DA (the strict "less than" check of the number of components against the number of columns in the data; granted, it is a less common thing to do than OPLS-DA); in cross-validation, the sum of requested components is checked against the number of columns, which will of course work for omics data but not for many other datasets; etc.

I understand that some limitations may arise from implementation details (e.g. the use of SVD for PCA), but I wonder whether it would be possible to relax some of the checks. Do you plan to support the cases I mentioned above in this package?
Or would it be reasonable to provide a "force" argument that ignores the checks and lets the user take the risk of failing miserably (when the algorithm indeed does not support a specific case)?

@krassowski krassowski changed the title from "The input checks seems are too strict" to "The input checks seem too strict" Jul 25, 2019
@selbouhaddani
Owner

Thanks for the suggestions. Here is a point-by-point response.

data with NaN is not accepted

Since dealing with missing data is a whole field of research on its own, I did not want to include an arbitrary method to do this. Methods such as mice and missForest can be used to impute missing values prior to analysis.
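For example, a minimal sketch of the intended workflow, where simple column-mean imputation stands in for a proper method such as mice::mice() or missForest::missForest(), and the final o2m call is only illustrative:

```r
# Sketch: impute missing values *before* calling o2m().
# Column-mean imputation stands in here for mice / missForest.
impute_colmeans <- function(X) {
  for (j in seq_len(ncol(X))) {
    miss <- is.na(X[, j])
    X[miss, j] <- mean(X[, j], na.rm = TRUE)  # fill NAs with the column mean
  }
  X
}

set.seed(1)
X <- matrix(rnorm(50), nrow = 10)
X[c(3, 17)] <- NA
X_complete <- impute_colmeans(X)
# fit <- o2m(X_complete, Y, n = 1, nx = 0, ny = 0)  # then run O2PLS as usual
```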

it is impossible to perform O2PLS-DA (the strict "less than" check of the number of components against the number of columns in the data; granted, it is a less common thing to do than OPLS-DA)

O2PLS-DA can be understood in (at least) two ways. In the first case, the Y matrix consists of a few outcomes that are to be predicted with X. Here, specific variation is assumed to be only in X, hence O2PLS-DA is actually OPLS-DA. In the second case, X and Y are both general data matrices, but the inner relation U = TB + H is relaxed to a non-square B. I did not consider this case, as there are probably identifiability issues that arise with a non-diagonal B. In CFA, this is possible (I think), but there you have to specify which u_i are associated with which t_j, and how.

in cross-validation, the sum of requested components is checked against the number of columns

The number of components cannot be larger than the number of columns, because you can at most "fill up" a space whose dimension equals the number of columns (i.e. the (ncol(X)+1)-th component does not exist).

X <- matrix(rnorm(10*2), nrow = 10); Y <- jitter(X)
svd(crossprod(Y, X)) # at most 2 singular vectors possible

An additional restriction is that the number of components cannot exceed the number of rows. This has more to do with full-rank score matrices; it can be relaxed somewhat by using ginv instead of solve.

X <- matrix(rnorm(10*2), nrow = 2); Y <- jitter(X)
str(o2m_stripped(X, Y, n = 3, 0, 0)) # n = 3 exceeds nrow(X) = 2

Note, however, that the functions without input_checker, such as o2m_stripped, do not check the number of components, so custom functions can be built around them.
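A minimal sketch of such a custom wrapper (o2m_force is a hypothetical name; it assumes the positional arguments X, Y, n, nx, ny used above and simply bypasses input_checker, so the caller accepts the risk of a hard failure inside the algorithm):

```r
# Hypothetical wrapper around o2m_stripped that skips input_checker entirely;
# any error inside the algorithm is then the caller's responsibility.
o2m_force <- function(X, Y, n, nx = 0, ny = 0) {
  o2m_stripped(as.matrix(X), as.matrix(Y), n, nx, ny)
}
```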

Do you plan to support the cases I mentioned above in this package?

I don't have direct plans to extend O2PLS. However, what I do find useful is allowing n + nx > ncol(Y) as long as it is < ncol(X). This is possible in Probabilistic O2PLS (a package on my GitHub)! There, the idea is that the dimension of the specific variation in X is not restricted by the dimensions of Y.

I'll leave this issue open; perhaps others can contribute to this discussion.

@krassowski
Author

Thank you for the detailed responses.

For missing values, I did not mean imputing them. As far as my understanding goes, O(2)PLS is known to tolerate moderate amounts of missing values, and this is one of the huge benefits of the method in omics data integration. This assumes, of course, that the underlying implementation uses NIPALS, and since that is how this package is implemented, I was hoping to get this benefit. Or did you mean that the way NIPALS ignores NaNs is itself an arbitrary way of handling them and could be substituted by an imputation step?
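To make this point concrete, here is a hedged sketch of how NIPALS can extract a first component while skipping NA entries (nipals_pc1 is an illustrative function, not the package's implementation): each regression step of the iteration is simply restricted to the observed entries.

```r
# Illustrative NIPALS for the first principal component, tolerating NAs.
# Each inner regression is computed over observed entries only.
nipals_pc1 <- function(X, max_iter = 100, tol = 1e-9) {
  # initialize scores from the column with the most observed values
  t_vec <- X[, which.max(colSums(!is.na(X)))]
  t_vec[is.na(t_vec)] <- 0
  for (i in seq_len(max_iter)) {
    # loadings: regress each column of X on the scores, skipping NAs
    p <- apply(X, 2, function(x) {
      ok <- !is.na(x)
      sum(x[ok] * t_vec[ok]) / sum(t_vec[ok]^2)
    })
    p <- p / sqrt(sum(p^2))  # normalize loadings
    # scores: regress each row of X on the loadings, skipping NAs
    t_new <- apply(X, 1, function(x) {
      ok <- !is.na(x)
      sum(x[ok] * p[ok]) / sum(p[ok]^2)
    })
    if (sum((t_new - t_vec)^2) < tol) { t_vec <- t_new; break }
    t_vec <- t_new
  }
  list(scores = t_vec, loadings = p)
}

set.seed(1)
X <- matrix(rnorm(50), 10, 5)
X[c(3, 17)] <- NA
res <- nipals_pc1(X)
```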

However, what I do find useful is that n + nx > ncol(Y) as long as it is < ncol(X).

This is what I was trying to highlight; apologies for the vague description.

@krassowski
Author

krassowski commented Jul 25, 2019

I will have to think about it more; I am not quite convinced about these checks yet. I fully understand that the joint and orthogonal components should each be < ncol(X) and < ncol(Y) for X and Y respectively, but it eludes me why their sum has to be lower than the number of columns.

Edit: here is the error I get:

Error in (function (X, Y, n, nx, ny, stripped = FALSE, p_thresh = 3000,  : 
  n + max(nx, ny) =2 exceed # columns in X or Y

Here is what could be used instead:

if (max(ncol(X), ncol(Y)) < n)
    stop("n = ", n, " exceeds # columns in X or Y")
if (ncol(X) < nx || ncol(Y) < ny)
    stop("nx = ", nx, " or ny = ", ny, " exceed # columns in X or Y, respectively")

@selbouhaddani
Owner

selbouhaddani commented Apr 28, 2021

Firstly

Thanks again for the comments. I finally managed to deal with some limitations; see the new blog post on selbouhaddani.eu and the Dev branch:

  1. The missing data problem
  2. Cross-validation over a partially feasible grid

For 1, I added a function impute_matrix that does matrix completion based on SVD. For 2, I allowed for a more flexible grid, so that one can specify a grid of a, ax, ay that doesn't have to entirely satisfy the input checks.
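For reference, SVD-based matrix completion along these lines can be sketched as follows (impute_svd is illustrative; the actual impute_matrix in the package may differ): initialize the missing cells, then alternate between a rank-k SVD fit and refilling the missing cells from the low-rank reconstruction.

```r
# Illustrative SVD-based matrix completion (not the package's impute_matrix).
impute_svd <- function(X, k = 2, max_iter = 100, tol = 1e-8) {
  miss <- is.na(X)
  X_hat <- X
  # initialize missing cells with their column means
  X_hat[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]
  for (i in seq_len(max_iter)) {
    s <- svd(X_hat, nu = k, nv = k)
    low_rank <- s$u %*% diag(s$d[1:k], k, k) %*% t(s$v)  # rank-k reconstruction
    delta <- sum((X_hat[miss] - low_rank[miss])^2)
    X_hat[miss] <- low_rank[miss]  # refill only the missing cells
    if (delta < tol) break
  }
  X_hat
}

set.seed(2)
M <- matrix(rnorm(40), 8, 5)
M[c(2, 11, 23)] <- NA
M_complete <- impute_svd(M, k = 2)
```

Observed entries are never touched; only the missing cells are iteratively refined toward the low-rank fit.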

Secondly

I didn't change the n + max(nx, ny) requirement, since the algorithm depends on performing an SVD with n + max(nx, ny) components; see n2 <- n + max(nx, ny) and svd(t(Y) %*% X, nu = n2, nv = n2) in the code. Changing the requirement means changing the algorithm.

One could, however, do the initial SVD with just n components, and the second step with nx should then go fine.

I haven't looked at the statistical implications of that. If someone tries it out (as an MSc project or otherwise), I'm happy to support them! Maybe we can add support for supervised (O)PLS-DA and prediction where Y is treated as an outcome, which addresses the O2PLS limitation...

So the code in o2m will be something like

...... etc

if (nx + ny > 0) {
      # larger principal subspace
      n2 <- n #### Change this to n
      
      cdw <- svd(t(Y) %*% X, nu = n2, nv = n2) #### W,C,Tt will have n components
      C <- cdw$u
      W <- cdw$v
      
      Tt <- X %*% W
      
      if (nx > 0) {
        # Orthogonal components in Y
        E_XY <- X - Tt %*% t(W) #### rank p minus rank n subspace
        
        udv <- svd(t(E_XY) %*% Tt, nu = nx, nv = 0) #### Does this work? Tt doesn't contain all specific variation. 

....... etc
