Causal Inference with Confounding Factors

Recently I read some articles about causal interpretation. When I first learned statistics, I was told that correlation doesn’t imply causation. Simpson’s paradox also shows that confounding factors can overturn a conclusion drawn from the pooled data. So when someone wants to use a correlation for causal inference, the first problem is to evaluate the effect of confounding factors.

First, we must be clear that a confounding factor is not an interaction between the variables. If our goal is to find the causal effect of X on Y, the confounding factor is a jerk U who affects both X and Y. So most of the time a third man, Z, is invited to solve this problem. Z has a strong relationship with X but no direct relationship with Y. The next step is to change Z and observe X and Y. Since any effect of Z has to pass through X to reach Y, we can assess how Y changes when Z changes. I mean, U is hidden but Z is in the light, so if Z can change Y via X, the relationship between X and Y will still be obvious. Meanwhile, if the relationship between X and Y comes only from confounding, it will be much weaker under changes of Z than it looked without them.

Most of the time the role of Z is taken by randomization. For example, smoking has a strong relationship with cancer, but the brand of cigarette may have no relationship with cancer. We also know that brand preference differs among smokers. Here we fix a single brand within a group of smokers who prefer different brands. The result will then show the relationship between smoking and cancer. After comparing it with a random survey, a clear relationship may appear. But, as you know, such a result is limited to that brand.
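The role of Z can be made concrete with a small simulation. This is only a sketch under assumptions of my own: all effects are linear, U and Z are standard normal, and the function name `simulate_iv` and the coefficients are made up for illustration. The naive slope of Y on X absorbs the hidden U, while the ratio cov(Z, Y) / cov(Z, X) uses only the part of X that is moved by Z.

```python
import random


def cov(a, b):
    """Sample covariance of two equal-length lists."""
    ma = sum(a) / len(a)
    mb = sum(b) / len(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (len(a) - 1)


def simulate_iv(n=20000, true_effect=2.0, seed=0):
    """Return (naive slope, Z-based estimate) for the effect of X on Y.

    Assumed toy model (not from any real data):
      X = Z + U + noise
      Y = true_effect * X + 3 * U + noise
    Z affects Y only through X; U is hidden and affects both.
    """
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    u = [rng.gauss(0, 1) for _ in range(n)]
    x = [zi + ui + rng.gauss(0, 1) for zi, ui in zip(z, u)]
    y = [true_effect * xi + 3.0 * ui + rng.gauss(0, 1) for xi, ui in zip(x, u)]

    naive = cov(x, y) / cov(x, x)      # regression slope of Y on X, biased by U
    via_z = cov(z, y) / cov(z, x)      # uses only the variation in X caused by Z
    return naive, via_z


if __name__ == "__main__":
    print("true effect 2:", simulate_iv(true_effect=2.0))
    print("true effect 0:", simulate_iv(true_effect=0.0))
```

In this toy model the naive slope should overshoot the true effect by about 1, because of U, while the Z-based estimate should stay close to it. With `true_effect=0.0` the naive slope still looks like a real relationship, but the Z-based estimate should fall to about zero.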

I think this problem may be solved by another method. Z must have an effect on X and no effect on Y. What if it affects both X and Y? Well, then Z might just be another U. So why not use random numbers? I mean, if there is a strong or causal relationship between X and Y, a random Z will not change it. So we could simulate a lot of related variables and see how the r-squared between X and Y changes. On the other hand, we could use a lot of suspected U to find their effects on X and Y. I suspect that the effects of confounding factors follow some distribution, and that there is also an impossible region of r-squared values. Using a computer, we might simulate those distributions.
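Here is a rough sketch of what such a simulation could look like. Everything in it is an assumption for illustration: X has no effect on Y at all, a single hidden U drives both with linear strengths drawn uniformly from [-1, 1], and the noise has unit variance. The helper names `r_squared` and `confounded_r2` are hypothetical.

```python
import random


def r_squared(a, b):
    """Squared Pearson correlation between two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    return sab * sab / (saa * sbb)


def confounded_r2(n_trials=200, n=1000, seed=0):
    """Sorted r-squared values between X and Y when only U links them.

    Assumed toy model: X = a*U + noise, Y = b*U + noise,
    with a and b drawn uniformly from [-1, 1] in each trial.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        a = rng.uniform(-1, 1)   # strength of U on X
        b = rng.uniform(-1, 1)   # strength of U on Y
        u = [rng.gauss(0, 1) for _ in range(n)]
        x = [a * ui + rng.gauss(0, 1) for ui in u]
        y = [b * ui + rng.gauss(0, 1) for ui in u]
        results.append(r_squared(x, y))
    return sorted(results)


if __name__ == "__main__":
    r2 = confounded_r2()
    print("median r-squared:", r2[len(r2) // 2])
    print("largest r-squared:", r2[-1])
```

Under these assumptions the population r-squared equals a²b² / ((a² + 1)(b² + 1)), which cannot exceed 0.25 when |a| and |b| are at most 1. So an observed r-squared well above that could not be produced by this family of confounders alone, which is one way to read the "impossible region". A different assumed range of strengths would give a different bound.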

Back to Simpson’s paradox: if we find the survey is strongly affected by U, we might improve the result by random sampling or by designing an experiment. Either way, the data must be randomized before drawing a conclusion. An experiment with a control variable must be designed so that the other variables have zero effect on average. But when we only have census data, some data analysis processes might be applied to let us focus on the relationship we care about.
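A toy example may make this clearer. It is a sketch with invented numbers: U is a hidden binary factor (say, severity), the treatment adds 0.1 to the recovery probability in both groups, and in the "survey" case people with U are far more likely to be treated. The function names are hypothetical.

```python
import random


def simulate(n=40000, randomized=False, seed=1):
    """Return a list of (u, treated, recovered) rows from an assumed toy model."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        u = rng.random() < 0.5                      # hidden confounder
        if randomized:
            p_treat = 0.5                           # coin flip, independent of U
        else:
            p_treat = 0.8 if u else 0.2             # survey: U drives treatment
        treated = rng.random() < p_treat
        base = 0.3 if u else 0.7                    # U also lowers recovery
        recovered = rng.random() < base + (0.1 if treated else 0.0)
        rows.append((u, treated, recovered))
    return rows


def effect(rows):
    """Recovery rate of the treated minus recovery rate of the untreated."""
    treated = [r for _, t, r in rows if t]
    control = [r for _, t, r in rows if not t]
    return sum(treated) / len(treated) - sum(control) / len(control)


if __name__ == "__main__":
    survey = simulate(randomized=False)
    print("survey, pooled:      ", effect(survey))
    print("survey, U present:   ", effect([r for r in survey if r[0]]))
    print("survey, U absent:    ", effect([r for r in survey if not r[0]]))
    print("randomized, pooled:  ", effect(simulate(randomized=True)))
```

In the pooled survey data the treatment should look harmful, while within each level of U it should look helpful: that reversal is Simpson’s paradox. Randomizing the treatment should recover the positive effect without knowing U. Splitting the survey by U is one example of the analysis one can do on census data, but only when U is actually observed.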

I believe causal inference can be done both by well-designed experiments and by mathematical analysis.