
The input checks seem too strict #10

Open
krassowski opened this issue Jul 24, 2019 · 4 comments

Comments

@krassowski

I've got a couple of cases where I wished to run o2m but could not because the input checks failed: data with NaN is not accepted; it is impossible to perform O2PLS-DA (the strict "less than" check of the number of components against the number of columns in the data; granted, it is a less common thing to do than OPLS-DA); in cross-validation, the sum of requested components is checked against the number of columns, which will of course work for omics data but not for many other datasets; etc.

I understand that some limitations may arise from implementation details (e.g. the use of SVD for PCA), but I wonder whether it would be possible to relax some of the checks. Do you plan to support the cases I mentioned above in this package?
Or would it be reasonable to provide a "force" argument that ignores the checks and lets the user take the risk of failing miserably (when the algorithm indeed does not support a specific case)?

@krassowski krassowski changed the title from "The input checks seems are too strict" to "The input checks seem too strict" Jul 25, 2019
@selbouhaddani
Owner

Thanks for the suggestions. Here is a point-by-point response.

data with NaN is not accepted

Since dealing with missing data is a whole field of research on its own, I did not want to include an arbitrary method to do this. Methods such as mice and missForest can be used to impute missing values prior to analysis.
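For example, a minimal sketch of the intended workflow, where simple column-mean imputation stands in for a proper method such as mice::mice() or missForest::missForest(), and the final o2m call is only illustrative:

```r
# Sketch: impute missing values *before* calling o2m().
# Column-mean imputation stands in here for mice / missForest.
impute_colmeans <- function(X) {
  for (j in seq_len(ncol(X))) {
    miss <- is.na(X[, j])
    X[miss, j] <- mean(X[, j], na.rm = TRUE)  # fill NAs with the column mean
  }
  X
}

set.seed(1)
X <- matrix(rnorm(50), nrow = 10)
X[c(3, 17)] <- NA
X_complete <- impute_colmeans(X)
# fit <- o2m(X_complete, Y, n = 1, nx = 0, ny = 0)  # then run O2PLS as usual
```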

it is impossible to perform O2PLS-DA (the strict "less than" check of the number of components against the number of columns in the data; granted, it is a less common thing to do than OPLS-DA)

O2PLS-DA can be understood in (at least) two ways. In the first case, the Y matrix consists of a few outcomes that are to be predicted with X. Here, specific variation is assumed to be only in X, hence O2PLS-DA is actually OPLS-DA. In the second case, X and Y are both general data matrices, but the inner relation U = TB + H is relaxed to a non-square B. I did not consider this case, as there are probably identifiability issues that arise with a non-diagonal B. In CFA, this is possible (I think), but there you have to specify which u_i are associated with which t_j, and how.

in cross-validation, the sum of requested components is checked against the number of columns

The number of components cannot be larger than the number of columns, because you can at most "fill up" a space whose dimension equals the number of columns (i.e. the (ncol(X)+1)-th component does not exist).

X <- matrix(rnorm(10*2), nrow = 10); Y <- jitter(X)
svd(crossprod(Y, X)) # at most 2 singular vectors possible

An additional restriction is that the number of components cannot exceed the number of rows. This has more to do with full-rank score matrices; it can be relaxed somewhat by using ginv instead of solve.

X <- matrix(rnorm(10*2), nrow = 2); Y <- jitter(X)
str(o2m_stripped(X, Y, n = 3, 0, 0)) # n = 3 exceeds nrow(X) = 2

Note, however, that the functions without input_checker, such as o2m_stripped, do not check the number of components, so custom functions can be built around them.
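A minimal sketch of such a custom wrapper (o2m_force is a hypothetical name; it assumes the positional arguments X, Y, n, nx, ny used above and simply bypasses input_checker, so the caller accepts the risk of a hard failure inside the algorithm):

```r
# Hypothetical wrapper around o2m_stripped that skips input_checker entirely;
# any error inside the algorithm is then the caller's responsibility.
o2m_force <- function(X, Y, n, nx = 0, ny = 0) {
  o2m_stripped(as.matrix(X), as.matrix(Y), n, nx, ny)
}
```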

Do you plan to support the cases I mentioned above in this package?

I don't have direct plans to extend O2PLS. However, what I do find useful is allowing n + nx > ncol(Y) as long as it is < ncol(X). This is possible in Probabilistic O2PLS (a package on my GitHub)! There, the idea is that the dimension of the specific variation in X is not restricted by the dimensions of Y.

I'll leave this issue open; perhaps others can contribute to this discussion.

@krassowski
Author

Thank you for the detailed responses.

For missing values, I did not mean imputing them. As far as my understanding goes, O(2)PLS is known to tolerate moderate amounts of missing values, and this is one of the huge benefits of the method in omics data integration. This assumes, of course, that the underlying implementation uses NIPALS, and since that is how this package is implemented, I was hoping to get this benefit. Or did you mean that the way NIPALS ignores NaNs is itself an arbitrary way of handling them and could be substituted by an imputation step?
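To make this point concrete, here is a hedged sketch of how NIPALS can extract a first component while skipping NA entries (nipals_pc1 is an illustrative function, not the package's implementation): each regression step of the iteration is simply restricted to the observed entries.

```r
# Illustrative NIPALS for the first principal component, tolerating NAs.
# Each inner regression is computed over observed entries only.
nipals_pc1 <- function(X, max_iter = 100, tol = 1e-9) {
  # initialize scores from the column with the most observed values
  t_vec <- X[, which.max(colSums(!is.na(X)))]
  t_vec[is.na(t_vec)] <- 0
  for (i in seq_len(max_iter)) {
    # loadings: regress each column of X on the scores, skipping NAs
    p <- apply(X, 2, function(x) {
      ok <- !is.na(x)
      sum(x[ok] * t_vec[ok]) / sum(t_vec[ok]^2)
    })
    p <- p / sqrt(sum(p^2))  # normalize loadings
    # scores: regress each row of X on the loadings, skipping NAs
    t_new <- apply(X, 1, function(x) {
      ok <- !is.na(x)
      sum(x[ok] * p[ok]) / sum(p[ok]^2)
    })
    if (sum((t_new - t_vec)^2) < tol) { t_vec <- t_new; break }
    t_vec <- t_new
  }
  list(scores = t_vec, loadings = p)
}

set.seed(1)
X <- matrix(rnorm(50), 10, 5)
X[c(3, 17)] <- NA
res <- nipals_pc1(X)
```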

However, what I do find useful is that n + nx > ncol(Y) as long as it is < ncol(X).

This is what I was trying to highlight; apologies for the vague description.

@krassowski
Author

krassowski commented Jul 25, 2019

I will have to think about it more; I am not quite convinced about these checks yet. I fully understand that the joint and orthogonal components should each be < ncol(X) and < ncol(Y) for X and Y respectively, but it eludes me why their sum has to be lower than the number of columns.

Edit: here is the error I get:

Error in (function (X, Y, n, nx, ny, stripped = FALSE, p_thresh = 3000,  : 
  n + max(nx, ny) =2 exceed # columns in X or Y

Here is what could be used instead:

if (max(ncol(X), ncol(Y)) < n)
    stop("n = ", n, " exceeds # columns in X or Y")
if (ncol(X) < nx || ncol(Y) < ny)
    stop("nx = ", nx, " or ny = ", ny, " exceed # columns in X or Y, respectively")

@selbouhaddani
Owner

selbouhaddani commented Apr 28, 2021

Firstly

Thanks again for the comments. I finally managed to deal with some limitations; see the new blog post on selbouhaddani.eu and the Dev branch:

  1. The missing data problem
  2. Cross-validation over a partially feasible grid

For 1, I added a function impute_matrix that does matrix completion based on SVD. For 2, I allowed for a more flexible grid, so that one can specify a grid of a, ax, ay that doesn't have to entirely satisfy the input checks.
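For reference, SVD-based matrix completion along these lines can be sketched as follows (impute_svd is illustrative; the actual impute_matrix in the package may differ): initialize the missing cells, then alternate between a rank-k SVD fit and refilling the missing cells from the low-rank reconstruction.

```r
# Illustrative SVD-based matrix completion (not the package's impute_matrix).
impute_svd <- function(X, k = 2, max_iter = 100, tol = 1e-8) {
  miss <- is.na(X)
  X_hat <- X
  # initialize missing cells with their column means
  X_hat[miss] <- colMeans(X, na.rm = TRUE)[col(X)[miss]]
  for (i in seq_len(max_iter)) {
    s <- svd(X_hat, nu = k, nv = k)
    low_rank <- s$u %*% diag(s$d[1:k], k, k) %*% t(s$v)  # rank-k reconstruction
    delta <- sum((X_hat[miss] - low_rank[miss])^2)
    X_hat[miss] <- low_rank[miss]  # refill only the missing cells
    if (delta < tol) break
  }
  X_hat
}

set.seed(2)
M <- matrix(rnorm(40), 8, 5)
M[c(2, 11, 23)] <- NA
M_complete <- impute_svd(M, k = 2)
```

Observed entries are never touched; only the missing cells are iteratively refined toward the low-rank fit.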

Secondly

I didn't change the n + max(nx, ny) requirement, since the algorithm depends on performing an SVD with n + max(nx, ny) components; see n2 <- n + max(nx, ny) and svd(t(Y) %*% X, nu = n2, nv = n2) in the code. Changing the requirement means changing the algorithm.

One could, however, do the initial SVD with just n components, and the second step with nx should then go fine.

I haven't looked at the statistical implications of that. If someone tries it out (as an MSc project or otherwise), I'm happy to support them! Maybe we can add support for supervised (O)PLS-DA and prediction where Y is treated as an outcome, which addresses the O2PLS limitation...

So the code in o2m will be something like

...... etc

if (nx + ny > 0) {
      # larger principal subspace
      n2 <- n #### Change this to n
      
      cdw <- svd(t(Y) %*% X, nu = n2, nv = n2) #### W,C,Tt will have n components
      C <- cdw$u
      W <- cdw$v
      
      Tt <- X %*% W
      
      if (nx > 0) {
        # Orthogonal components in Y
        E_XY <- X - Tt %*% t(W) #### rank p minus rank n subspace
        
        udv <- svd(t(E_XY) %*% Tt, nu = nx, nv = 0) #### Does this work? Tt doesn't contain all specific variation. 

....... etc
