Review of Correlation#

Recall from the previous section that the correlation, \(r\), between two time series, \(x(t)\) and \(y(t)\), gives a measure of how well the two time series vary linearly together (or don’t).

We derived the correlation from the coefficient of determination between \(x(t)\) and \(y(t)\) (where \(x\) is the predictor and \(y\) is the predictand) by quantifying how well the variance explained by the linear relationship between \(x(t)\) and \(y(t)\) accounts for the total variance in \(y\):

\[\begin{split}
\begin{align}
r^2 & = \text{coefficient of determination}\\
& = \frac{\text{explained variance}}{\text{total variance}}\\
& = \frac{S_{\hat{y}\hat{y}}}{S_{yy}}\\
& \\
& \text{Plugging in the expression for $\hat{y}(t)$ and rearranging things, we get ...}\\
& \\
& = \frac{(\overline{x^{\prime}y^{\prime}})^2}{\overline{{x^\prime}^2}\,\overline{{y^\prime}^2}}
\end{align}
\end{split}\]

and the correlation coefficient is simply the square root of \(r^2\),

\[\begin{split}
\begin{align}
r & = \pm\sqrt{r^2}\\
& = \frac{\overline{x^{\prime}y^{\prime}}}{\sigma_x\sigma_y}
\end{align}
\end{split}\]
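To make the formula concrete, here is a minimal sketch (using synthetic data; this check is an addition, not part of the original notes) showing that computing \(r\) from the anomaly formula above matches NumPy’s built-in correlation:

# Synthetic example: y depends partly on x, plus noise
x = np.random.normal(size=1000)
y = 0.5*x + np.random.normal(size=1000)

# anomalies (primes) are deviations from the mean
xp = x - np.mean(x)
yp = y - np.mean(y)

# r = mean(x'y')/(sigma_x*sigma_y)
r_formula = np.mean(xp*yp)/(np.std(x)*np.std(y))
r_builtin = np.corrcoef(x,y)[0,1]

print(r_formula, r_builtin) # the two values agree to machine precision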

The key characteristics of these parameters are:

\(r^2\), the coefficient of determination (aka “r-squared”):

  • is the fraction of variance explained by the linear least-squares fit between the two variables

  • always lies between 0 and 1

\(r\), the correlation coefficient:

  • varies between -1 and 1

  • indicates the sign of the relationship between the two variables

Now, we will discuss techniques for testing the statistical significance of correlations. We will denote the sample correlation as \(r\) and the theoretical true value as \(\rho\).

Sampling Distributions of \(r\)#

Recall that the key result of the Central Limit Theorem was that the sampling distribution of sample means is normally distributed. This allowed us to use the power of the \(z\)-statistic to test statistical hypotheses associated with the sample mean.

For the case when \(\rho\) = 0 and \(N\) is large, the sampling distribution of \(r\) is normally distributed and we can make use of a test statistic that is analogous to the \(z\)-statistic. But when \(\rho \neq\) 0, then the sampling distribution becomes skewed and we have to use a transformation to establish statistical significance (more on that later).

First, let’s just take a look at the sampling distributions of \(r\) for different values of \(\rho\). To do this, we will generate some synthetic data.

When \(\rho\) = 0:#

For the case when \(\rho\) = 0, we will generate two random time series. These should be uncorrelated with one another for large \(N\).

# Generate two time series of random data (50,000 data points each)
import numpy as np
import scipy.stats as st

x = np.random.random(50000)
y = np.random.random(50000)

rho = st.pearsonr(x,y)
print(rho[0])
-0.006008060412954488
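As an aside (this snippet is an addition), st.pearsonr actually returns two values: the correlation and a two-sided p-value for the null hypothesis that \(\rho\) = 0. Unpacking both makes the return signature explicit:

# pearsonr returns a tuple: (correlation, two-sided p-value for rho = 0)
r0, pval = st.pearsonr(x,y)
print(r0, pval)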

We see that we get a near-zero value of the correlation coefficient (\(\rho\)) between our random time series \(x\) and \(y\). Now, let’s take samples of size \(N\) = 100 each of \(x\) and \(y\) and compute the sample correlation coefficient, \(r\). We will do this 50,000 times. You can click on the “Live Code” button to change this number and see how the PDF changes.

# Calculate sample correlations
N = 100 # sample size
r = []
Ns = 50000 # number of samples to draw

# calculate correlation 50,000 times for samples of size N
for i in range(Ns):
    r.append(st.pearsonr(np.random.choice(x,N), np.random.choice(y,N))[0])
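Before plotting, a quick sanity check (added here, not in the original notebook): for \(\rho\) = 0 and large \(N\), the standard deviation of the sampling distribution of \(r\) should be roughly \(1/\sqrt{N}\).

# empirical spread of the sample correlations vs. the ~1/sqrt(N) expectation for rho = 0
print(np.std(r), 1/np.sqrt(N))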

Now, let’s plot the distribution of these sample correlations.

from matplotlib import pyplot as plt

# calculate histogram
bins = np.linspace(-1,1,51)
hcorr0 = np.histogram(r,bins)

# bin centers for plotting
xbins = 0.5*(bins[:-1] + bins[1:])

# plot distribution
plt.figure(figsize=(10,5))
plt.plot(xbins,hcorr0[0]/Ns) # convert counts to frequency
plt.xlabel("Sample Correlation Coefficient ($r$)")
plt.ylabel("Frequency")
plt.title(r"Sampling Distribution of Correlation Coefficients ($\rho$ = 0)")
plt.show()
[Figure: Sampling Distribution of Correlation Coefficients ($\rho$ = 0)]

So, we see that the distribution of sample correlation coefficients looks approximately normal.
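We can check this visual impression by overlaying a normal PDF fitted to the sample correlations (a quick check added here; the fit simply uses the empirical mean and standard deviation of \(r\)):

# fit a normal distribution to the sample correlations
mu, sigma = np.mean(r), np.std(r)

# scale the PDF by the bin width so it is comparable to the plotted frequencies
plt.figure(figsize=(10,5))
plt.plot(xbins, hcorr0[0]/Ns, label="sampling distribution")
plt.plot(xbins, st.norm.pdf(xbins, mu, sigma)*(bins[1]-bins[0]), '--', label="fitted normal")
plt.xlabel("Sample Correlation Coefficient ($r$)")
plt.ylabel("Frequency")
plt.legend()
plt.show()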

When \(\rho \neq\) 0:#

Now, let’s look at an example where we know that \(\rho \neq\) 0. We are going to pick a very idealized case, but there may often be situations where a particular theoretical relationship indicates a non-zero population correlation.

We are going to construct a set of time series to illustrate how the distribution of sample correlation coefficients changes as \(\rho\) changes. We will generate time series such that we will obtain population correlations of 0.1, 0.4, 0.7, and 0.9. I am hiding the code used to generate these time series, but we will come back to this in an upcoming section on red-noise processes.

# Generate Red Noise Time Series: x[n] = a*x[n-1] + b*noise

# Define the time series length
t = np.arange(0,10000,1.) # time

# Define a vector of different autocorrelations, a
auto_vec = np.array([0.1, 0.4, 0.7, 0.9]) 

# initialize an empty array for x
x = np.empty((np.size(t),len(auto_vec)))

# loop over a
for ia,a in enumerate(auto_vec):

    b = np.sqrt(1. - a**2)
    x[0,ia] = np.random.normal(0.,1.) # initial condition drawn from a standard normal
    
    # loop over t
    for it,tval in enumerate(t[1:]):
        #generate x as a function of t for each value of a
        x[it+1,ia] = a*x[it,ia] + b*np.random.normal() # this is equivalent to the red noise equation above
        
    x[:,ia] = x[:,ia]/np.maximum(np.abs(np.min(x[:,ia])),np.max(x[:,ia])) #normalize by the largest absolute value
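As a quick sanity check (not part of the original hidden cell), we can confirm that each series has the intended lag-1 autocorrelation - this is what sets the population correlation between neighbouring values that we will sample below:

# the lag-1 autocorrelation of each series should be close to its target value a
for ia,a in enumerate(auto_vec):
    r1 = st.pearsonr(x[:-1,ia], x[1:,ia])[0]
    print("a =", a, ": sample lag-1 autocorrelation =", np.round(r1,3))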

In order to generate distributions of the sample correlation coefficient, we will add a bit of noise to the time series to represent the variability that we would likely see in nature.

Next, we will compute the correlation between samples of these noisy time series 50,000 times to obtain distributions of the sample correlation coefficient for each value of the population correlation (0.1, 0.4, 0.7, and 0.9). I have hidden the code again - the key message is in the histograms below.

# Calculate sample correlations
N = 100 # sample size
Ns = 50000 # number of samples to draw
noise = 0.1*np.random.rand(N)
r = np.zeros((4,Ns))

# loop over values of population correlation
for j in range(4):
    # calculate correlation 50,000 times for samples of size N
    for i in range(Ns):
        start = np.random.randint(0,9000) # random start index; keeps start+N+1 within the series
        r[j,i] = st.pearsonr(x[start:start+N,j]+noise, x[start+1:start+N+1,j]+noise)[0] # correlate each sample with a one-step-lagged copy
from matplotlib import pyplot as plt

# calculate histogram
bins = np.linspace(-0.5,1,31)
hcorr = np.zeros((4,len(bins)-1))

# loop over values of population correlation
for j in range(4):
    hcorr[j,:],bins = np.histogram(r[j,:],bins)

# bin centers for plotting
xbins = 0.5*(bins[:-1] + bins[1:])

# plot distribution
c = ['k','b','m','r'] # list of colours
labl = [r'$\rho$ = 0.1', r'$\rho$ = 0.4',r'$\rho$ = 0.7',r'$\rho$ = 0.9']

plt.figure(figsize=(10,5))

# loop over values of population correlation
for j in range(4):
    plt.plot(xbins,hcorr[j,:]/Ns,color=c[j],label = labl[j]) # convert counts to frequency
plt.xlabel("Sample Correlation Coefficient ($r$)")
plt.ylabel("Frequency")
plt.title(r'Sampling Distribution of Correlation Coefficients ($\rho \neq 0$)')
plt.legend(loc='upper left')
plt.show()
[Figure: Sampling Distribution of Correlation Coefficients ($\rho \neq 0$)]

We see that when we have a population correlation coefficient that is non-zero, we obtain sampling distributions that become more and more skewed as \(\rho\) increases. Thus, when \(\rho \neq\) 0, we cannot use a test statistic that is analogous to the \(z\)-statistic.
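We can quantify this (a quick check added here) with scipy’s sample skewness - the magnitude of the (negative) skew grows as \(\rho\) increases:

# sample skewness of each sampling distribution of r
for j in range(4):
    print(labl[j], ": skewness =", np.round(st.skew(r[j,:]), 2))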

Now, we will take a look at testing statistical significance for the above two cases: \(\rho\) = 0 and \(\rho \neq\) 0.