Skip to content
This repository was archived by the owner on Feb 2, 2024. It is now read-only.

Scale Corr #623

Merged
merged 7 commits into from
Feb 21, 2020
Merged

Scale Corr #623

merged 7 commits into from
Feb 21, 2020

Conversation

1e-to
Copy link
Contributor

@1e-to 1e-to commented Feb 19, 2020

image

elena.totmenina added 2 commits February 18, 2020 15:46
@pep8speaks
Copy link

pep8speaks commented Feb 19, 2020

Hello @1e-to! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-02-20 11:14:30 UTC

@@ -73,6 +74,10 @@ def nansum(self):
pass


def corr(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Duplicate function definition. Other definition in line 507.

@AlexanderKalistratov
Copy link
Collaborator

corr(x,y) = cov(x, y)/sqrt(var(x)*var(y))

Please note, that cov here is unbiased, while Series.cov calculates biased one

Biased cov calculation is here:
covariance

Suggestion for var implementation is here:
variance

Considering all of it, corr calculation would be something like this:

def hpat_pandas_series_corr_impl(self, other, min_periods=None):

    if min_periods is None or min_periods < 1:
        min_periods = 1

    min_len = min(len(self._data), len(other._data))

    if min_len == 0:
        return numpy.nan

    sum_y = 0.
    sum_x = 0.
    sum_xy = 0.
    sum_xx = 0.
    sum_yy = 0.
    total_count = 0
    for i in prange(min_len):
        x = self._data[i]
        y = other._data[i]
        if not (numpy.isnan(x) or numpy.isnan(y)):
            sum_x += x
            sum_y += y
            sum_xy += x*y
            sum_xx += x*x
            sum_yy += y*y
            total_count += 1

    if total_count < min_periods:
        return numpy.nan

    cov_xy = (sum_xy - sum_x*sum_y/total_count)
    var_x = (sum_xx - sum_x*sum_x/total_count)
    var_y = (sum_yy - sum_y*sum_y/total_count)

    corr_xy = cov_xy/sqrt(var_x*var_y)

    return corr_xy

Haven't verified it. So, please, double check it and fix any mistakes in calculation.

Also, if we had loops fusion, we'd be able to simply call three functions (cov(x,y), var(x) and var(y)).
But I don't think that at the moment it is possible

@1e-to
Copy link
Contributor Author

1e-to commented Feb 20, 2020

Also, if we had loops fusion, we'd be able to simply call three functions (cov(x,y), var(x) and var(y)).
But I don't think that at the moment it is possible

I tried to use our algorithms for var and cov to test fusing with Todd’s fix, but the unit tests stopped passing, the algorithm does not work correctly. Results have not needed accurancy. Does this mean that we need to rewrite the var and cov?
And also, what is the difference between biased from unbiased?

@AlexanderKalistratov
Copy link
Collaborator

Also, if we had loops fusion, we'd be able to simply call three functions (cov(x,y), var(x) and var(y)).
But I don't think that at the moment it is possible

I tried to use our algorithms for var and cov to test fusing with Todd’s fix, but the unit tests stopped passing, the algorithm does not work correctly. Results have not needed accurancy. Does this mean that we need to rewrite the var and cov?
And also, what is the difference between biased from unbiased?

It doesn't match because Series.conv calculates biased cov. And you need unbiased one.
Also, there could be a problem with nans. When calculating var we need to skip values if either of two is nan. Series.var skips only nans in it.

Have you tried the provided solution?

@AlexanderKalistratov AlexanderKalistratov merged commit 2a26041 into IntelPython:master Feb 21, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants