got into data science, there was a phrase that we’d all heard; everyone knows it, young and old:
“Correlation doesn’t imply causation.”
It is a catchy phrase. You’ve almost certainly said it once or twice, or nodded confidently when someone else did, especially when two unrelated datasets line up in a way that makes it funny and intriguing to imply causation!
Here are two very interesting facts:
- Countries that eat more pizza tend to have higher math scores.
- The more sunglasses sold, the more shark attacks occur.
Now, if that were all the information you have… what should you conclude?
Does eating pizza make you better at math? Will buying a new pair of sunglasses cause a shark attack?
Though it is funny to think about, the answer to those questions is “probably not”.
And yet, these are examples of something very real: Correlation.
The question worth asking now is: if correlation doesn’t equal causation, then what does it mean?
That’s where things get fuzzy.
We tend to treat correlation as a vague idea, as if it meant “They’re kind of related” or “They move together somehow”. But correlation isn’t just a feeling; it’s a precise mathematical measurement of how two variables move together.
Instead of just repeating the warning, let’s actually understand the concept. Once you do, those weird examples stop being surprising and start making sense.
So, let’s get into it!
What is correlation?
When people say two things are “correlated,” they usually mean one of three things:
- “Those two things seem related.”
- “Those two things move together.”
- “There’s some connection between those two things.”
On a surface level, none of these is wrong, but each misses some nuance.
Correlation is not a vibe. It’s a measurement! And like any measurement, it answers a very specific question.
Taking a step back, imagine you collect the data on how many hours students studied and their exam scores.
You plot it, and you see something like this:

Each point represents one student. The x-axis is how long they studied, and the y-axis is their score.
When you look at this plot, you notice that the points tend to move upward. So you conclude, “As study time increases, scores tend to increase too”, which is what we call a positive correlation.
But, is that just a trend or is the data telling you something more?
What the plot is really showing is this: when one variable is above its average, the other tends to be above its average too.
That’s the key idea most people miss: correlation isn’t about raw values, it’s about how variables move relative to their averages.
So, the question correlation answers is:
Do two variables move together in a consistent way?
That question has one of three answers:
- Up + up → positive correlation
- Up + down → negative correlation
- No consistent pattern → no correlation
The Math Behind Correlation
Let’s try to make thinking about correlation more intuitive. We will do that using the Pearson correlation coefficient, which we can define as:
$$r = \frac{\operatorname{cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
Okay, I know that equation isn’t what anyone thinks of when I say “intuitive”… But stick with me and let’s unpack it without turning it into a lecture.
Step 1: Covariance (AKA Do They Move Together?)
Covariance looks at how two variables move relative to their averages. For example, if both variables tend to be above their averages at the same time (or both below), we get positive covariance; if one tends to be above while the other is below, we get negative covariance.
Basically, covariance answers: “Are these variables aligned in how they deviate from their averages?”
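To make that concrete, here is a minimal sketch of covariance computed by hand and checked against NumPy. The `hours` and `scores` values are made up for illustration; they are not data from a real study.

```python
import numpy as np

# Hypothetical data: hours studied and exam scores for six students.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
scores = np.array([52.0, 58.0, 61.0, 70.0, 74.0, 81.0])

# How far each value sits from its own variable's average.
dev_hours = hours - hours.mean()
dev_scores = scores - scores.mean()

# A product is positive when both deviations share a sign
# (both above or both below average) and negative otherwise.
products = dev_hours * dev_scores

# Sample covariance: sum of the products divided by n - 1.
cov_manual = products.sum() / (len(hours) - 1)

# np.cov returns a 2x2 matrix; entry [0, 1] is cov(hours, scores).
cov_numpy = np.cov(hours, scores)[0, 1]

print(cov_manual, cov_numpy)
```

Every product is positive for this toy data, so the covariance comes out positive: the two variables deviate from their averages in the same direction.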
Step 2: Normalize It
Covariance alone is hard to interpret because it depends on scale. To overcome that, we divide by the standard deviations $\sigma_X$ and $\sigma_Y$. This rescales everything into a clean range: -1 to 1. Because the result no longer depends on the units of either variable, we can compare the strength of relationships across different pairs of variables.
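A quick way to see the scale problem is to measure the same thing in different units. The sketch below uses made-up `hours` and `scores` values (an assumption for illustration) and converts hours to minutes: the covariance changes by a factor of 60, while the normalized value does not move.

```python
import numpy as np

# Hypothetical data: hours studied and exam scores for six students.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
scores = np.array([52.0, 58.0, 61.0, 70.0, 74.0, 81.0])
minutes = hours * 60  # same study time, different unit

# Covariance depends on the unit of measurement.
cov_hours = np.cov(hours, scores)[0, 1]
cov_minutes = np.cov(minutes, scores)[0, 1]

# Dividing by both standard deviations removes the unit.
# ddof=1 matches the n - 1 divisor that np.cov uses by default.
r_hours = cov_hours / (hours.std(ddof=1) * scores.std(ddof=1))
r_minutes = cov_minutes / (minutes.std(ddof=1) * scores.std(ddof=1))

print(cov_hours, cov_minutes)
print(r_hours, r_minutes)
```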
After these two steps, we can now calculate the Pearson coefficient! If we get:
- +1 → perfect positive relationship.
- 0 → no linear relationship.
- -1 → perfect negative relationship.
This coefficient simply measures how consistently the two variables move together: not how big they are, but how well aligned they are.
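Putting the two steps together, a from-scratch version might look like the sketch below. `pearson_r` is a name chosen here for illustration, not a library function; in practice `np.corrcoef` gives the same number.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: covariance divided by both standard deviations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    # The 1 / (n - 1) factors in the covariance and in the two
    # standard deviations cancel, so they are left out here.
    return (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r_up = pearson_r(x, 2 * x + 1)      # exact increasing line
r_down = pearson_r(x, -3 * x + 10)  # exact decreasing line
r_big = pearson_r(x, 1000 * x)      # much larger values, same alignment

print(r_up, r_down, r_big)
```

The last case is the point of "not how big they are": multiplying one variable by 1000 changes its size but not how well the two line up.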
What Different Correlations Look Like

- Left: strong positive correlation → clear upward pattern
- Middle: no correlation → random scatter
- Right: strong negative correlation → downward pattern
Correlation measures consistency of movement, not just whether two variables are related.
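If you want to reproduce these three patterns yourself, here is one possible sketch using synthetic data. The seed, sample size, and noise level are arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
n = 500
x = rng.normal(size=n)
noise = rng.normal(size=n)

datasets = {
    "strong positive": x + 0.3 * noise,   # rises with x, plus a little noise
    "no correlation": noise,              # generated independently of x
    "strong negative": -x + 0.3 * noise,  # falls as x rises
}

results = {}
for label, y in datasets.items():
    results[label] = np.corrcoef(x, y)[0, 1]
    print(f"{label}: r = {results[label]:.2f}")
```

Increasing the `0.3` noise factor loosens the pattern and pulls r toward 0, which is what "consistency of movement" means in practice.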
What Correlation Actually Tells You
Correlation tells you: these variables move together in a structured way. It tells us that there is a pattern here to pay attention to.
But, it does NOT tell you why or how they do, or whether one causes the other.
The classic example is ice cream sales and drowning incidents, which are correlated.
In fact, we can plot the number of ice cream sales and drowning incidents to get:

We can see a clear upward relationship between these two variables… more ice cream sales lead to more drownings?…
But that’s misleading, because the real driver is temperature: hot weather means more ice cream sales, more people going to the beach, and more swimming, which in turn means more drownings.
So, though the correlation is real, its explanation is hidden.
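Here is a small simulation of that idea. Everything in it is invented for illustration (the coefficients, the noise levels, the temperature range); the only assumption that matters is that temperature drives both variables and neither one affects the other.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Simulated daily temperature, the hidden driver.
temperature = rng.uniform(10, 35, size=n)

# Both outcomes depend on temperature only, plus independent noise.
ice_cream = 20 * temperature + rng.normal(0, 50, size=n)
drownings = 0.3 * temperature + rng.normal(0, 1.5, size=n)

# The raw correlation looks like a real link between the two.
r_raw = np.corrcoef(ice_cream, drownings)[0, 1]

def residuals(y, x):
    """What is left of y after removing a straight-line fit on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)

# Correlate what remains once temperature's linear effect is removed.
r_adjusted = np.corrcoef(
    residuals(ice_cream, temperature),
    residuals(drownings, temperature),
)[0, 1]

print(f"raw: {r_raw:.2f}, after removing temperature: {r_adjusted:.2f}")
```

The raw correlation is strongly positive, but once temperature is accounted for, the leftover correlation is close to zero. This only works because we know the hidden variable here; with real data, finding it is the hard part.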
Correlation and Nonlinearity
Now consider this relationship:
y = x²

This is clearly a strong relationship: as x moves away from zero in either direction, y increases! But if you compute the correlation over values of x spread symmetrically around zero:
np.corrcoef(x, y)[0,1]
You’ll get something close to 0.
That is because Pearson correlation only measures how well a straight line fits the relationship. Here, the downward trend for negative x cancels the upward trend for positive x. This is a crucial limitation: if the relationship is curved, correlation can be near zero even when a strong relationship exists.
So, instead of thinking: “Correlation = relationship”, it’s better to think: “Correlation = how well a straight line explains the relationship.”
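A runnable version of the y = x² check might look like this. The ranges of x are my own choice for illustration; the result depends on them, which is exactly the point.

```python
import numpy as np

# x spread symmetrically around zero: the two halves of the parabola cancel.
x = np.linspace(-3, 3, 101)
y = x ** 2
r_symmetric = np.corrcoef(x, y)[0, 1]

# Only the right half of the same curve: now a straight line fits quite well.
x_pos = np.linspace(0, 3, 101)
y_pos = x_pos ** 2
r_one_sided = np.corrcoef(x_pos, y_pos)[0, 1]

print(f"symmetric x: r = {r_symmetric:.3f}")
print(f"positive x only: r = {r_one_sided:.3f}")
```

Same formula, very different r. Over the symmetric range the best straight line is flat, so r is essentially 0; over the positive range the curve is close enough to a rising line that r is high.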
The Misunderstanding
The vague way correlation is usually taught leads to some misunderstandings. Three very common ones are:
- Assuming causation: Just because two variables move together doesn’t mean one causes the other.
- Ignoring hidden variables: There may be a third factor driving both.
- Missing nonlinear relationships: Correlation only sees straight-line patterns.
You may be wondering now: if correlation is such a simple measure that doesn’t tell us much, why is it still important?
Because it’s incredibly useful as a first signal. It tells you:
“Something interesting might be happening here.”
From there, you investigate further. Correlation measures alignment; further investigation provides an explanation.
Final Takeaway
“Correlation doesn’t imply causation.” That is true. But here’s the problem: people hear this and think: “Correlation is meaningless.” That is not true!
Correlation measures how variables move together; it ranges from -1 to 1 and captures linear relationships, but it does NOT imply causation.
Correlation isn’t misleading. We just expect too much from it: it is not trying to explain the world. It is just a signal indicating:
“Hey… this looks interesting.”
Now, the real work starts, as we investigate why this is really interesting.