Statistical ramblings of a Moonbat

I will try to use this medium to talk about my statistical musings


Granger Causality


Granger causality is one of several tools used extensively to infer cause-and-effect (causal) relationships from data. However, there are some strong assumptions on the data that limit the applicability of Granger causality; these will be listed later. Let us now formally introduce this causality test for a simple linear model.

Let X be a variable that is of predictive interest to you. Assume you have access to historical values of X from some past time p up to the current time t . Additionally, let our system model currently have N control variables, represented by the matrix W , whose historical values W_{p:t} are available from time p to t . These control variables W_{p:t} , together with X_{p:t} , are assumed to form a linear predictive model for the future value of X , denoted X_{t+1} , as

X_{t+1}=\sum_{i=p}^{t}{\left\{\sum_{j=1}^{N}{\left[\alpha_{ij}*W_{ij}\right]} + \beta_i*X_i\right\}} .

In the above equation, \alpha_{ij};i\in\left\{p,p+1,\cdots,t\right\}, j\in\left\{1,2,\cdots,N\right\} and \beta_i;i\in\left\{p,p+1,\cdots,t\right\} are the coefficients of this linear predictive model, estimated using linear regression. Now, if we would like to verify whether a new control variable Y , with historical values Y_{p:t} , can improve the prediction of X_{t+1} , then the best way to test this hypothesis is to expand the above linear model as

\hat{X}_{t+1}=\sum_{i=p}^{t}{\left\{\sum_{j=1}^{N}{\left[\alpha_{ij}*W_{ij}\right]} + \beta_i*X_i + \gamma_i*Y_i \right\}}

where \gamma_i;i\in\left\{p,p+1,\cdots,t\right\} are the coefficients of the new control variable. If this new prediction, represented by \hat{X}_{t+1} , has a statistically significantly lower error variance than the original prediction of X_{t+1} , then Y is deemed to have a cause-and-effect relationship with X : in other words, Y is said to Granger-cause X . The way to check for lower variance is to test the joint null hypothesis \gamma_i=0;i\in\left\{p,p+1,\cdots,t\right\} . If the residuals from both of the above linear models are normally distributed, then an F-test reveals whether the addition of the coefficients \gamma_i has resulted in a significant reduction in variance.
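To make this two-model comparison concrete, here is a minimal sketch in Python (numpy/scipy). It uses hypothetical synthetic series x and y, omits the extra control variables W for brevity, and runs the F-test on the residual sums of squares of the restricted and unrestricted fits. Treat it as an illustration of the procedure above, not a reference implementation.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical synthetic data: y genuinely drives x with a one-step lag.
# (The extra control variables W from the post are omitted for brevity.)
T = 500
y = rng.normal(size=T)
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + 0.4 * y[t - 1] + rng.normal(scale=0.1)

lag = 2  # number of past values (t - p) used as regressors

def lag_matrix(s, lag):
    """Columns s[t-1], ..., s[t-lag] for each predicted time step t."""
    return np.column_stack([s[lag - j : len(s) - j] for j in range(1, lag + 1)])

target = x[lag:]  # the values X_{t+1} being predicted

# Restricted model: past values of x only (plus an intercept).
X_r = np.column_stack([np.ones(T - lag), lag_matrix(x, lag)])
# Unrestricted model: past values of x and of the candidate variable y.
X_f = np.column_stack([X_r, lag_matrix(y, lag)])

# Ordinary least squares fits and residual sums of squares.
beta_r, *_ = np.linalg.lstsq(X_r, target, rcond=None)
rss_r = np.sum((target - X_r @ beta_r) ** 2)
beta_f, *_ = np.linalg.lstsq(X_f, target, rcond=None)
rss_f = np.sum((target - X_f @ beta_f) ** 2)

# F-test of the joint hypothesis gamma_i = 0: does adding the lags
# of y significantly reduce the residual variance?
q = lag                    # number of restrictions being tested
n, k = X_f.shape           # samples and parameters in the full model
F = ((rss_r - rss_f) / q) / (rss_f / (n - k))
p_value = stats.f.sf(F, q, n - k)
print(f"F = {F:.2f}, p = {p_value:.4g}")
```

Since the synthetic y really does drive x here, the p-value should come out vanishingly small; shuffling y before the test should destroy the effect.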

However, the above method comes with some caveats:

  • The linear system model assumption for the output variable X .
  • The normality of the residuals from the linear models. The F-test is particularly sensitive to this normality requirement; non-normal residuals can skew its results.
  • Finally, the minimum number of data samples required to verify this causality can be significantly large, depending on the number of control variables N and the number of past values (lags) \left(t-p\right) of those control variables.
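As an aside, if you would rather not wire up the regressions yourself, statsmodels ships a packaged version of this exact test. A minimal usage sketch, assuming statsmodels is installed and reusing the hypothetical x and y series from the earlier sketch:

```python
import numpy as np
from statsmodels.tsa.stattools import grangercausalitytests

# Column order matters: the test asks whether the SECOND column
# Granger-causes the FIRST. maxlag plays the role of (t - p) above.
data = np.column_stack([x, y])
results = grangercausalitytests(data, maxlag=2)

# results[lag][0]['ssr_ftest'] holds the residual-sum-of-squares
# based F-test: (F statistic, p-value, df_denom, df_num).
f_stat, p_value, df_denom, df_num = results[2][0]['ssr_ftest']
```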

Written by ranabasheer

December 28, 2012 at 1:17 pm

Posted in Statistics, tutorial

Pitfalls of Data Driven Models


Defining a system from data is the poor man's version of research. If you don't understand a complicated system, then throw in enough data and see if you can build a predictive model for your desired output variables from a series of control variables. More often than not, what goes into the control variable pool is subjective, though occasionally there is some reasoning behind it. Usually the reasoned part arises from the desire to reduce the volatility in predicting future values of the output variables. The belief here is that if the addition of a new control variable to this cocktail pool of variables called the system model results in lower volatility when predicting future values of your output variables, then magically there is a cause-and-effect relationship between this new variable and your output variable. More often than not, the researcher doesn't elaborate on the basis of this causal relationship, or at best provides a perfunctory explanation that is an excuse for his belief in this magical causality.

This kind of data-driven research leads to an interesting set of conclusions, some of which are listed in Nate Silver's book The Signal and the Noise. However, the best example that I could find on the interwebs of failing to understand the dictum "correlation doesn't imply causation" goes to the research by Tatu Westling of the University of Helsinki, who found an inverted-U relationship between economic growth rates and reported average penis length. You may read about this interesting research here:

Westling, T. (2011). Male organ and economic growth: Does size matter? University of Helsinki Discussion Paper No. 335.

In the above article, the author states: "The existence and channel of causality remains obscure at this point but the correlations are robust." This is exactly the sort of reasoning that ends up with correlation masquerading as causality. However, this type of system modelling seems to be in vogue now. The 2003 Nobel prize in economics was won by C.W.J. Granger for his contribution to methods of analyzing time series data with common trends, the work behind Granger causality, and in 2011 the Nobel prize in economics was won by Christopher Sims for his contribution to the development of vector autoregression (VAR) models for analyzing cause and effect between statistical variables.

Written by ranabasheer

December 28, 2012 at 1:13 pm

Posted in Statistics