Sunday, August 16, 2015

Tsay Ch 11 - State-Space Models and Kalman Filter

Local trend model

For a univariate time series the local trend model is $y_t=\mu_t+\epsilon_t$ and $\mu_{t+1}=\mu_t+\eta_t$, where both error terms are assumed to be normally distributed with distinct variances $\sigma_e^2$ and $\sigma_{\eta}^2$, respectively. Notice that the first equation is the observed version of the trend in the second equation, with added measurement noise. This model can be used to analyze the realized volatility of an asset price if $\mu_t$ is taken to be the log volatility (which is not directly observable) and $y_t$ is the logarithm of realized volatility (which is observable, constructed from high-frequency transaction data contaminated by microstructure noise).

If there is no measurement error in the first equation ($\sigma_e=0$), the model reduces to an ARIMA(0,1,0) model. With the measurement error it is an ARIMA(0,1,1) model, which is also the simple exponential smoothing model. The form is $(1-B)y_t=(1-\theta B)a_t$, where $\theta$ and $\sigma_{a}^2$ are related to $\sigma^2_e$ and $\sigma^2_{\eta}$ as follows: $(1+\theta^2)\sigma^2_a=2\sigma^2_e+\sigma^2_{\eta}$ and $\theta \sigma^2_a=\sigma^2_e$. The resulting quadratic equation for $\theta$ has two solutions, and the one with $|\theta|<1$ is chosen. The reverse mapping is also possible for positive $\theta$. Both representations have pros and cons; the objective of the data analysis, substantive issues, and experience decide which one to use.
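As a quick numerical check, here is a minimal Python sketch (the function name and example variances are mine) that maps $(\sigma^2_e, \sigma^2_{\eta})$ to the invertible ARIMA(0,1,1) parameters by solving the quadratic above:

```python
import numpy as np

def local_trend_to_ima(sigma2_e, sigma2_eta):
    """Map the local trend variances to the invertible ARIMA(0,1,1) parameters
    (theta, sigma2_a) via moment matching:
    (1 + theta^2) sigma2_a = 2 sigma2_e + sigma2_eta,  theta sigma2_a = sigma2_e."""
    if sigma2_e == 0.0:
        # No measurement error: the model collapses to ARIMA(0,1,0).
        return 0.0, sigma2_eta
    # Eliminating sigma2_a gives the quadratic
    # sigma2_e * theta^2 - (2 sigma2_e + sigma2_eta) * theta + sigma2_e = 0.
    b = 2.0 * sigma2_e + sigma2_eta
    disc = np.sqrt(b * b - 4.0 * sigma2_e ** 2)
    theta = (b - disc) / (2.0 * sigma2_e)   # the root with |theta| < 1
    sigma2_a = sigma2_e / theta
    return theta, sigma2_a

theta, sigma2_a = local_trend_to_ima(sigma2_e=1.0, sigma2_eta=0.5)
print(theta, sigma2_a)  # 0.5, 2.0 for these example variances
```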

Statistical Inference

Three types (using the example of reading a handwritten note):
  1. Filtering - recover the state variable $\mu_t$ given $F_t$, to remove the measurement errors from the data (figuring out the word you are currently reading based on knowledge accumulated from the beginning of the note).
  2. Prediction - forecast $\mu_{t+h}$ or $y_{t+h}$ for $h>0$ given $F_t$, where $t$ is the forecast origin (guessing the next word).
  3. Smoothing - estimate $\mu_t$ given $F_T$, where $T>t$ (deciphering a particular word after you have read through the whole note).

The Kalman Filter

Let $\mu_{t|j}=E(\mu_t|F_j)$ and $\Sigma_{t|j}=Var(\mu_t|F_j)$ be, respectively, the conditional mean and variance of $\mu_t$ given the information set $F_j$. Similarly, $y_{t|j}$ denotes the conditional mean of $y_t$ given $F_j$. Furthermore, let $v_t=y_t-y_{t|t-1}$ and $V_t=Var(v_t|F_{t-1})$ be the 1-step-ahead forecast error of $y_t$ given $F_{t-1}$ and its variance. Note that $Var(v_t|F_{t-1})=Var(v_t)$, since the forecast error $v_t$ is independent of $F_{t-1}$. Further, $y_{t|t-1}=\mu_{t|t-1}$, giving $v_t=y_t-\mu_{t|t-1}$ and $V_t=\Sigma_{t|t-1}+\sigma^2_e$. Also, $E(v_t)=0$ and $Cov(v_t,y_j)=0$ for $j<t$. The information set $F_t \equiv \{F_{t-1},y_t\} \equiv \{F_{t-1},v_t\}$, hence $\mu_{t|t}=E(\mu_t|F_{t-1},v_t)$ and $\Sigma_{t|t}=Var(\mu_t|F_{t-1},v_t)$.

One can show that $Cov(\mu_t,v_t|F_{t-1})=\Sigma_{t|t-1}$, giving $$\begin{bmatrix} \mu_t \\ v_t \end{bmatrix}_{F_{t-1}} \sim N\left( \begin{bmatrix} \mu_{t|t-1} \\ 0 \end{bmatrix}, \begin{bmatrix} \Sigma_{t|t-1} & \Sigma_{t|t-1} \\ \Sigma_{t|t-1} & V_t\end{bmatrix} \right).$$ Applying the multivariate normal theorem we get $$\mu_{t|t} = \mu_{t|t-1}+(V_t^{-1}\Sigma_{t|t-1})v_t=\mu_{t|t-1}+K_tv_t,$$ $$\Sigma_{t|t}=\Sigma_{t|t-1}-\Sigma_{t|t-1}V_t^{-1}\Sigma_{t|t-1} = \Sigma_{t|t-1}(1-K_t),$$ where $K_t=V_t^{-1}\Sigma_{t|t-1}$ is referred to as the Kalman gain, which is the regression coefficient of $\mu_t$ on $v_t$, governing the contribution of the new shock $v_t$ to the state variable $\mu_t$. To predict $\mu_{t+1}$ given $F_t$ we have $$\mu_{t+1}|F_t \sim N(\mu_{t|t}, \Sigma_{t|t}+\sigma^2_{\eta}).$$ Once the new data point $y_{t+1}$ is observed, the above procedure can be repeated (once $\sigma_e$ and $\sigma_{\eta}$ have been estimated, generally by maximum likelihood). This is the famous Kalman filter algorithm (Kalman, 1960). The choice of priors $\mu_{1|0}$ and $\Sigma_{1|0}$ requires some attention.
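A minimal sketch of this recursion for the local trend model, assuming the variances are known and mimicking a diffuse prior with a large $\Sigma_{1|0}$ (the function and argument names are illustrative):

```python
import numpy as np

def kalman_filter_local_trend(y, sigma2_e, sigma2_eta, mu_init=0.0, Sigma_init=1e7):
    """Kalman filter for y_t = mu_t + e_t, mu_{t+1} = mu_t + eta_t.
    A diffuse prior on mu_1 is approximated by a large Sigma_init."""
    n = len(y)
    mu_pred, Sigma_pred = np.empty(n), np.empty(n)   # mu_{t|t-1}, Sigma_{t|t-1}
    mu_filt, Sigma_filt = np.empty(n), np.empty(n)   # mu_{t|t},   Sigma_{t|t}
    mu_p, Sig_p = mu_init, Sigma_init
    for t in range(n):
        v = y[t] - mu_p                  # 1-step-ahead forecast error v_t
        V = Sig_p + sigma2_e             # its variance V_t
        K = Sig_p / V                    # Kalman gain K_t
        mu_f = mu_p + K * v              # filtered state mu_{t|t}
        Sig_f = Sig_p * (1.0 - K)        # filtered variance Sigma_{t|t}
        mu_pred[t], Sigma_pred[t] = mu_p, Sig_p
        mu_filt[t], Sigma_filt[t] = mu_f, Sig_f
        mu_p = mu_f                      # prediction mu_{t+1|t}
        Sig_p = Sig_f + sigma2_eta       # prediction variance Sigma_{t+1|t}
    return mu_filt, Sigma_filt, mu_pred, Sigma_pred
```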

Properties of forecast error -
State error recursion -
State smoothing -
Missing Values -
Effect of Initialization -
Estimation - 

Regression assumptions

Everybody in finance knows that 90% of quant work is 'REGRESSION', and mostly LINEAR. The results of a linear regression are only as good as our understanding of its assumptions. For the univariate case we write $y_t = \alpha + \beta x_t + \epsilon_t$, where estimation is straightforward. The interesting case is multiple regression, written in matrix form as $$Y = X\pmb{\beta} +\pmb{\epsilon}.$$ To estimate the parameters we use the normal equations to get $$\hat{\pmb{\beta}} = (X^TX)^{-1}X^TY.$$ Now, how good an estimate is this? We want estimates to be (see the sketch after this list):
unbiased - The expected value of the estimate is the true value.
consistent - With more observations the distribution of the estimate becomes more concentrated near true value.
efficient - fewer observations are required to estimate the true value to a given confidence.
asymptotically normal - With a lot of observations the distribution of the estimate is a normal distribution.
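A small OLS sketch using the normal equations, on simulated data (the coefficients, sample size, and noise scale below are arbitrary choices for illustration):

```python
import numpy as np

# Simulated design: a column of ones for the intercept plus two regressors.
rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([0.5, 1.0, -2.0])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

# beta_hat = (X'X)^{-1} X'y, solved without forming the explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - X.shape[1])   # unbiased error-variance estimate
print(beta_hat, sigma2_hat)
```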

OLS is consistent when the regressors are exogenous and there is no perfect multicollinearity, and it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions OLS provides minimum-variance, mean-unbiased estimates when the errors have finite variances. Assuming the errors are normally distributed, OLS coincides with the MLE. An expanded version of OLS is the multi-fractional order estimator (such as the Kalman filter).

The 'random design' paradigm treats the regressors $x_i$ as random, sampled together with $y_i$ from some population. The 'fixed design' paradigm treats $X$ as known constants, with $y$ sampled conditionally on the values of $X$, as in an experiment. Practically, the distinction is unimportant and both lead to the same estimation formula.

Assumptions

  1. OLS minimizes the error in the dependent variable $y$ only, and hence assumes there is no error in the regressors $x$.
  2. The functional dependence being modeled is valid.
  3. Strict exogeneity - The regression errors have conditional mean zero: $E[\epsilon|X]=0$. This implies that the errors have mean zero, $E[\epsilon]=0$, and that the regressors are uncorrelated with the errors, $E[X^T\epsilon]=0$. If this fails, the OLS estimates are invalid; in that case use the method of instrumental variables.
  4. No linear dependence - The regressors in $X$ must be linearly independent, i.e. $X$ must have full rank almost surely. Sometimes we also assume that the regressors have finite moments up to second order, in which case the matrix $X^TX$ is finite and positive semi-definite. If this is violated, the regressors are called perfectly multicollinear; $\beta$ cannot be estimated, though prediction of $y$ is still possible.
  5. Spherical errors - It is assumed that $Var[\epsilon|X]=\sigma^2\pmb{I}_n$. If violated, the OLS estimates are still valid but no longer efficient. If the error terms do not have the same variance, i.e. they are not homoscedastic, weighted least squares (WLS) is used. If there is autocorrelation between the error terms, generalized least squares (GLS) is used.
  6. Normality - It is sometimes additionally assumed that the errors are normally distributed. This is not required; under this assumption OLS is equivalent to MLE and is asymptotically efficient in the class of all regular estimators.
A certain degree of correlation between the observations is very common, in which case OLS and WLS are inefficient. GLS is the right thing to do: $$Y = X\beta + \epsilon, \qquad E[\epsilon|X]=0, \quad Var[\epsilon|X]=\Omega.$$ GLS estimates $\beta$ by minimizing the squared Mahalanobis length of the residual vector, giving $$\hat{\beta}=(X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}Y.$$ The GLS estimator is unbiased, consistent, efficient, and asymptotically normal. It is equivalent to applying OLS to a linearly transformed version of the data, where the transformation standardizes and de-correlates the errors. WLS is a special case of GLS with a diagonal $\Omega$.
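A sketch of the GLS estimator for a known $\Omega$, together with the equivalent whitened-OLS route (the function names are mine):

```python
import numpy as np

def gls(X, y, Omega):
    """GLS estimate beta_hat = (X' Omega^{-1} X)^{-1} X' Omega^{-1} y,
    with Omega the (known) error covariance matrix."""
    Oinv_X = np.linalg.solve(Omega, X)
    Oinv_y = np.linalg.solve(Omega, y)
    return np.linalg.solve(X.T @ Oinv_X, X.T @ Oinv_y)

def gls_via_whitening(X, y, Omega):
    """Equivalent route: whiten the data with the Cholesky factor of Omega
    (which de-correlates the errors) and then run plain OLS."""
    L = np.linalg.cholesky(Omega)
    Xw = np.linalg.solve(L, X)   # L^{-1} X
    yw = np.linalg.solve(L, y)   # L^{-1} y
    return np.linalg.solve(Xw.T @ Xw, Xw.T @ yw)
```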

To compute GLS in practice we use feasible generalized least squares (FGLS) in two steps:
1) Estimate the model by OLS (consistent but inefficient) and use the residuals to build a consistent estimator of the error covariance matrix;
2) Use this estimated covariance matrix to compute the GLS estimator.

FGLS is preferred only for large sample sizes. For small samples it is better to stick with OLS; FGLS is not even guaranteed to be consistent in small samples.
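A two-step FGLS sketch, assuming for illustration that the errors follow an AR(1) process (other error models lead to different covariance estimators):

```python
import numpy as np

def fgls_ar1(X, y):
    """Two-step FGLS assuming AR(1) errors (an illustrative choice).
    Step 1: OLS residuals -> estimate the AR(1) coefficient rho.
    Step 2: build Omega from rho and apply GLS."""
    n = X.shape[0]
    beta_ols = np.linalg.solve(X.T @ X, X.T @ y)
    e = y - X @ beta_ols
    rho = (e[1:] @ e[:-1]) / (e[:-1] @ e[:-1])
    # AR(1) error covariance, up to a scale factor that cancels in GLS.
    Omega = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    Oinv_X = np.linalg.solve(Omega, X)
    Oinv_y = np.linalg.solve(Omega, y)
    return np.linalg.solve(X.T @ Oinv_X, X.T @ Oinv_y)
```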

Saturday, August 15, 2015

Tsay Ch 9 - Principal Component Analysis and Factor Models

Dimension reduction is essential in searching for the underlying structure of asset returns - the common factors.

Three types of factor models -
1) Macroeconomic factor models - observable factors such as GDP growth, interest rates, inflation, and unemployment, estimated via regression.
2) Fundamental factor models - firm size, book and market values, industrial classification.
3) Statistical factor models - non-observable or latent variables, e.g. via PCA.

General Factor Model

For $m$ factors, $k$ assets, and $T$ time periods, let $r_{it}$ be the return of asset $i$ at time period $t$. The factor model is
$$\pmb{r}_{t}=\pmb{\alpha}+\pmb{\beta}\pmb{f}_t+\pmb{\epsilon}_{t}, \qquad t = 1,...,T$$
where $\pmb{\beta}$ is a $k\times m$ factor-loading matrix, $\pmb{f}_t$ is the $m$-dimensional vector of factor realizations with $Cov(\pmb{f}_t)=\pmb{\Sigma}_f$, and $\pmb{\epsilon}_t$ is the error vector with $Cov(\pmb{\epsilon}_t)=\pmb{D}=diag[\sigma^2_1,...,\sigma^2_k]$, a $k\times k$ diagonal matrix. The covariance matrix of the returns $\pmb{r}_t$ is then given by:
$$Cov(\pmb{r}_t)=\pmb{\beta}\pmb{\Sigma}_f\pmb{\beta}^T+\pmb{D}$$
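A one-line sketch of this implied covariance (the array shapes are as assumed in the docstring):

```python
import numpy as np

def factor_covariance(beta, Sigma_f, d):
    """Implied return covariance Cov(r_t) = beta Sigma_f beta' + D.
    beta: (k, m) loadings, Sigma_f: (m, m) factor covariance,
    d: length-k vector of residual variances (the diagonal of D)."""
    return beta @ Sigma_f @ beta.T + np.diag(d)
```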

Macroeconomic factor models

Macroeconomic factors are observable. We can cast the general factor model as a multiple linear regression and estimate the factor loadings. This estimation does not impose the constraint that the $\epsilon_{it}$ be uncorrelated across assets, so it may not be efficient in general. The best known single-factor model is the market model (Sharpe, 1970). The $R^2$ can reach up to 50%, showing the significance of the common market factor. One simple way to compare a factor-based covariance matrix with the sample covariance matrix is via the global minimum variance portfolio (GMVP). For a given covariance matrix $\Sigma$, the GMVP $\omega$ solves $\min_{\omega}\,\sigma^2_{p,\omega}=\omega^T\Sigma\omega$ subject to $\omega^T\pmb{1}=1$, and is given by
$$\omega=\frac{\Sigma^{-1}\pmb{1}}{\pmb{1}^T\Sigma^{-1}\pmb{1}}.$$
It is also important to verify that the residual covariance matrix does not have large off-diagonal elements; otherwise the diagonal-error assumption of the factor model is questionable.
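A minimal GMVP sketch; the example covariance matrix is made up for illustration, and one would compare the weights obtained from the factor-based and sample covariance matrices:

```python
import numpy as np

def gmvp_weights(Sigma):
    """GMVP weights w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    ones = np.ones(Sigma.shape[0])
    w = np.linalg.solve(Sigma, ones)
    return w / w.sum()

Sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
print(gmvp_weights(Sigma))
```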

Ross (1986) considers a multi-factor model consisting of unexpected changes or surprises (e.g. the residuals after fitting a VAR(3) model to seasonally adjusted CPI and unemployment growth numbers). The explanatory power of such models is low.

Fundamental factor models

The BARRA factor method treats the observed asset-specific fundamentals as the factor betas $\beta_i$, and estimates the factors $\pmb{f}_t$ at each time index $t$ via regression. Fama and French construct their factors from hedge portfolios that depend on the fundamentals. The BARRA factor model is $$\widetilde{\pmb{r}}_t = \pmb{\beta} \pmb{f}_t+\pmb{\epsilon}_t,$$ where $\widetilde{\pmb{r}}_t$ is the mean-corrected return vector. A WLS setup is needed since the errors are not homoscedastic across assets; the estimate is $$\hat{\pmb{f}}_t=(\pmb{\beta}^T D^{-1}\pmb{\beta})^{-1}\pmb{\beta}^T D^{-1}\widetilde{\pmb{r}}_t.$$ We estimate the diagonal covariance matrix of the errors from OLS first and then use it to estimate the factors with the WLS equation (see the sketch below). Cross-correlations in the errors are ignored. The diagonal covariance matrix of the final errors $\hat{D}_g$ and the covariance matrix of the estimated factor realizations $\hat{\Sigma}_f$ can be used to derive the covariance matrix of the original returns as $$Cov(r_t)=\beta\hat{\Sigma}_f\beta^T+\hat{D}_g.$$ In practice, the sample means of returns are often not significantly different from zero, so one may not need to remove the sample mean before fitting the BARRA factor model.
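A sketch of this two-pass estimation for a BARRA-style model, assuming observed loadings `beta` and mean-corrected returns `r_tilde` (the names and shapes are mine):

```python
import numpy as np

def barra_factor_realizations(r_tilde, beta):
    """Two-pass estimation of factor realizations:
    Pass 1: OLS per time period to get residual variances (diag of D).
    Pass 2: WLS, f_hat_t = (beta' D^{-1} beta)^{-1} beta' D^{-1} r_tilde_t.
    r_tilde: (T, k) mean-corrected returns, beta: (k, m) observed loadings."""
    # Pass 1: OLS factor estimates and per-asset residual variances.
    f_ols = np.linalg.lstsq(beta, r_tilde.T, rcond=None)[0].T      # (T, m)
    resid = r_tilde - f_ols @ beta.T                                # (T, k)
    d = resid.var(axis=0, ddof=1)                                   # diag of D
    # Pass 2: WLS, re-weighting each asset by 1 / sigma_i^2.
    Bw = beta / d[:, None]                                          # D^{-1} beta
    f_wls = np.linalg.solve(beta.T @ Bw, Bw.T @ r_tilde.T).T        # (T, m)
    return f_wls, d
```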

The Fama-French approach uses a two-step procedure. First, they sort the assets on each fundamental (market cap for small vs. big, book-to-market for value vs. growth) and form a hedge portfolio that is long the top quintile and short the bottom quintile; the observed return on this hedge portfolio is the factor realization for that fundamental. Together with the market excess return this gives three factors. Second, given the factor realizations, the betas of each asset are estimated via regression.

Principal component analysis

We look for linear combinations that explain the most variance and are mutually orthogonal, with the squared weights of each combination summing to one. This is done on the covariance or correlation matrix, which is non-negative definite and hence has a spectral decomposition. For the covariance matrix the proportion of variance explained by the $i$-th component is $\lambda_i/\sum_j \lambda_j$, which becomes $\lambda_i/k$ for the correlation matrix, since $Tr(\pmb{\rho}_r)=k$.
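A sketch of PCA via the spectral decomposition of the covariance (or correlation) matrix; note that the eigenvectors returned by `numpy.linalg.eigh` already have unit norm, matching the squared-weights constraint above:

```python
import numpy as np

def pca_explained(returns, use_correlation=False):
    """Principal components of a (T, k) return matrix from the spectral
    decomposition of its covariance or correlation matrix."""
    M = np.corrcoef(returns, rowvar=False) if use_correlation else np.cov(returns, rowvar=False)
    eigval, eigvec = np.linalg.eigh(M)          # eigh returns ascending order
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = eigval / eigval.sum()           # lambda_i / sum_j lambda_j
    return eigval, eigvec, explained
```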

Statistical factor analysis

The aim is to identify a few factors that can account for most of the variation in the covariance or correlation matrix of the data. The assumption of no serial correlation is alright for low-frequency data but not accurate at higher frequencies; serial correlations should first be removed parametrically. We then construct an orthogonal factor model. Since both the loadings and the factors are unobservable, it differs from the other factor models. For the statistical factor model $r_t - \mu = \beta f_t + \epsilon_t$, we assume $E[f_t]=0$, $Cov[f_t]=\pmb{I}_m$, $E[\epsilon_t]=0$, $Cov[\epsilon_t]=D=diag(\sigma^2_1,...,\sigma^2_k)$, and $E[f_t\epsilon_t^T]=0$. The loadings and factors are not uniquely determined. The model can be estimated either by the principal component method or by maximum likelihood, with the number of factors $m$ specified. Factor rotation (e.g. using the varimax criterion) can be used to aid interpretation.
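A sketch of the principal component method for the orthogonal factor model, with the number of common factors $m$ specified (the scaling of the factor scores is one common convention, not the only one):

```python
import numpy as np

def pc_factor_model(returns, m):
    """Principal component estimate of r_t - mu = beta f_t + eps_t."""
    R = returns - returns.mean(axis=0)            # mean-correct the (T, k) returns
    S = np.cov(R, rowvar=False)                   # sample covariance
    eigval, eigvec = np.linalg.eigh(S)
    order = np.argsort(eigval)[::-1][:m]          # keep the m largest eigenvalues
    lam, V = eigval[order], eigvec[:, order]
    beta_hat = V * np.sqrt(lam)                   # loadings: sqrt(lambda_i) * v_i
    D_hat = np.diag(S - beta_hat @ beta_hat.T)    # residual variances (diag of D)
    f_hat = R @ V / np.sqrt(lam)                  # standardized factor scores
    return beta_hat, D_hat, f_hat
```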

Left out sections: 9.6