Friday, September 4, 2015

Tsay Ch10 - Multivariate Volatility Models and Their Applications

This chapter generalizes the univariate volatility models of Chapter 3 to the multivariate case and simplifies the dynamic relationship between the volatility processes of multiple asset returns, to address the curse of dimensionality and time-varying correlations. Consider a multivariate return series $\pmb{r}_t$ given by $$\pmb{r}_t=\pmb{\mu}_t+\pmb{a}_t,$$ where $\pmb{\mu}_t=E(\pmb{r}_t|F_{t-1})$ is the conditional expectation of $\pmb{r}_t$ given the past information and $\pmb{a}_t$ is the shock at time $t$. The mean equation of $\pmb{r}_t$ can be modeled as a multivariate time series process (ch 8), e.g. a simple VARMA process $$\pmb{\mu}_t=\pmb{\Gamma}\pmb{x}_t+\sum_{i=1}^{p}\pmb{\Phi}_i\pmb{r}_{t-i}-\sum_{i=1}^{q}\pmb{\Theta}_i\pmb{a}_{t-i},$$ where $\pmb{x}_t$ denotes the $m$-dimensional vector of exogenous (explanatory) variables with $x_{1t}=1$, $\pmb{\Gamma}$ is a $k\times m$ matrix, and $p$ and $q$ are non-negative integers.

The conditional covariance matrix of $\pmb{a}_t$ given $F_{t-1}$ is a $k\times k$ positive definite matrix $\pmb{\Sigma}_t$ defined by $Cov(\pmb{a}_t|F_{t-1})$. Multivariate volatility modeling is concerned with the time evolution of $\pmb{\Sigma}_t$. This is referred to as the volatility model equation of $\pmb{r}_t$.

Exponentially weighted estimate

An equally weighted estimate of the unconditional covariance matrix of the innovations is $$\hat{\Sigma}=\frac{1}{t-1}\sum_{j=1}^{t-1}a_j a_j^T.$$ To allow for a time-varying covariance matrix with emphasis on recent information one can use exponential smoothing, $$\hat{\Sigma}_t=\frac{1-\lambda}{1-\lambda^{t-1}}\sum_{j=1}^{t-1}\lambda^{j-1}a_{t-j}a_{t-j}^T,$$ where $0<\lambda<1.$ For a sufficiently large $t$ such that $\lambda^{t-1}\approx 0,$ the equation becomes $$\hat{\Sigma}_t=(1-\lambda)a_{t-1}a_{t-1}^T+\lambda \hat{\Sigma}_{t-1}.$$ This is called the EWMA estimate of the covariance matrix. The model parameters, along with $\lambda$, can be jointly estimated by maximizing the log-likelihood, which can be evaluated recursively. A $\lambda$ of about 0.94 (30 days) commonly comes out as optimal.
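As a concrete illustration, here is a minimal numpy sketch of the recursive EWMA update above; the function name `ewma_cov`, the sample-covariance initialization, and the simulated data are my own choices rather than anything prescribed by the text.

```python
import numpy as np

def ewma_cov(returns, lam=0.94):
    """Recursive EWMA estimate of the innovation covariance matrix.

    `returns` is a (T, k) array of (mean-corrected) shocks a_t; `lam` is the
    smoothing parameter lambda. Sigma_t uses data up to t-1 only."""
    T, k = returns.shape
    sigma = np.cov(returns.T, bias=True)          # initialize with the sample covariance
    history = np.empty((T, k, k))
    for t in range(T):
        history[t] = sigma                        # Sigma_t given F_{t-1}
        a = returns[t][:, None]
        sigma = (1 - lam) * (a @ a.T) + lam * sigma
    return history

# Example: two correlated return series
rng = np.random.default_rng(0)
a = rng.multivariate_normal([0, 0], [[1.0, 0.3], [0.3, 2.0]], size=500)
sigmas = ewma_cov(a, lam=0.94)
print(sigmas[-1])   # most recent covariance estimate
```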

Some multivariate GARCH models

  1. Diagonal Vectorization model (VEC): a generalization of the exponentially weighted moving-average approach. Each element is a GARCH(1,1)-type model. It may not produce a positive-definite covariance matrix and does not model the dynamic dependence between volatility series. 
  2. BEKK model: the Baba-Engle-Kraft-Kroner model (1995), designed to guarantee positive definiteness of the covariance matrix. It has too many parameters but does model the dynamic dependence between the volatility series. 

Reparameterization

$\Sigma_{t}$ is reparameterized by making use of the symmetric property.
  1. Use of correlations - The covariance matrix can be represented in terms of variances and lower-triangular correlations, which can be jointly modeled. Specifically, we write $\Sigma_t$ as $D_t\rho_tD_t$, where $\rho_t$ is the conditional correlation matrix of $a_t$, and $D_t$ is a $k \times k$ diagonal matrix consisting of the conditional standard deviations of the elements of $a_t$. To model the volatility of $a_t$, it suffices to consider the conditional variances and correlation coefficients of $a_{it}$. The parameter vector is the $k(k+1)/2$-dimensional vector $\Xi_t= (\sigma_{11,t},...,\sigma_{kk,t}, \varrho_t^T)^T$, where $\varrho_t$ is a $k(k-1)/2$-dimensional vector obtained by stacking the columns of the correlation matrix $\rho_t$, using only the elements below the main diagonal, i.e. $\varrho_t=(\rho_{21,t},...,\rho_{k1,t}|\rho_{32,t},...,\rho_{k2,t}|...|\rho_{k,k-1,t})^T$. To illustrate, for $k=2$, we have $\varrho_t=\rho_{21,t}$ and $\Xi_t=(\sigma_{11,t},\sigma_{22,t},\rho_{21,t})^T$, which is a 3-dimensional vector. The approach has weaknesses: the likelihood function becomes complicated when the dimension is greater than 2, and the approach requires constrained maximization to ensure positive definiteness of the correlation matrix. 
  2. Cholesky decomposition - This requires no constrained maximization. It is an orthogonal transformation, so the resulting likelihood is extremely simple. Because $\Sigma_t$ is positive definite, there exist a lower triangular matrix $L_t$ with unit diagonal elements and a diagonal matrix $G_t$ with positive diagonal elements such that $\Sigma_t=L_tG_tL_t^T.$ A feature of the decomposition is that the lower off-diagonal elements of $L_t$ and the diagonal elements of $G_t$ have close connections with linear regression. Using the Cholesky decomposition amounts to doing an orthogonal transformation from $a_t$ to $b_t$, where $b_{1t}=a_{1t}$, and $b_{it}$, for $1<i \le k$, is defined recursively by the least-squares regression $a_{it}=q_{i1,t}b_{1t}+q_{i2,t}b_{2t}+...+q_{i(i-1),t}b_{(i-1)t}+b_{it}$, where $q_{ij,t}$ is the $(i,j)$th element of the lower triangular matrix $L_t$ for $1\le j <i$. We can write this transformation as $a_t=L_tb_t$, where $L_t$ is the lower triangular matrix with unit diagonal elements. The covariance matrix of $b_t$ is $G_t$. The parameter vector relevant to volatility modeling under such a transformation becomes $\Xi_t=(g_{11,t},...,g_{kk,t},q_{21,t},q_{31,t},q_{32,t},...,q_{k1,t},...,q_{k(k-1),t})^T$, which is also a $k(k+1)/2$-dimensional vector. The likelihood function also simplifies drastically. There are several advantages of this transformation. First, $\Sigma_t$ can be kept positive definite simply by modeling $ln(g_{ii,t})$. Second, the elements of $\Xi_t$ are simply the coefficients and residual variances of multiple linear regressions that orthogonalize the shocks to the returns. Third, the correlation coefficient between $a_{1t}$ and $a_{2t}$, which is simply $q_{21,t}\sqrt{\sigma_{11,t}}/\sqrt{\sigma_{22,t}}$, is time-varying. Finally, we get $\sigma_{ij,t}=\sum_{v=1}^{j}q_{iv,t}q_{jv,t}g_{vv,t}$ for $i\ge j$.
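A small numpy sketch of the $\Sigma_t=L_tG_tL_t^T$ reparameterization for a single covariance matrix, obtained by rescaling the standard Cholesky factor (the helper name `ldl_decompose` and the example matrix are hypothetical):

```python
import numpy as np

def ldl_decompose(sigma):
    """Decompose a positive-definite Sigma as L G L^T with unit-diagonal
    lower-triangular L and diagonal G, via the standard Cholesky factor."""
    m = np.linalg.cholesky(sigma)      # sigma = m m^T, m lower triangular
    d = np.diag(m)                     # positive diagonal of m
    L = m / d                          # scale each column -> unit diagonal
    G = np.diag(d ** 2)
    return L, G

sigma = np.array([[4.0, 1.2], [1.2, 1.0]])
L, G = ldl_decompose(sigma)
print(L)                               # unit lower triangular
print(np.allclose(L @ G @ L.T, sigma)) # True; Cov(b_t) = G with b_t = L^{-1} a_t
```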

GARCH models for bivariate returns

Thursday, September 3, 2015

Multivariate Normal distribution

This was forthcoming, especially if you want to understand the Kalman filter.

A $k$-dimensional random vector $\pmb{x}=(x_1,...,x_k)^T$ follows a multivariate normal distribution with mean $\pmb{\mu}=(\mu_1,...,\mu_k)^T$ and positive-definite covariance matrix $\pmb{\Sigma}=[\sigma_{ij}]$ if its probability density function is $$f(x|\mu, \Sigma)=\frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}.$$ This is denoted by $x\sim N_k(\mu,\Sigma).$ A square matrix $A$ ($m\times m$) is positive definite if $A$ is symmetric and all eigenvalues of $A$ are positive. Alternatively, $A$ is positive definite if for any nonzero $m$-dimensional vector $b$, we have $b^TAb>0.$ A positive-definite matrix $A$ has all eigenvalues positive and can be decomposed as $A=P\Lambda P^T,$ where $\Lambda$ is a diagonal matrix consisting of the eigenvalues of $A$ and $P$ is an $m\times m$ matrix whose columns are the corresponding right eigenvectors of $A$; $P$ can be taken to be orthogonal.

For a symmetric matrix $A$, there exists a lower triangular matrix $L$ with diagonal elements being 1 and a diagonal matrix $G$ such that $A=LGL^T$. If $A$ is positive definite, then the diagonal elements of $G$ are positive. In this case we can write $A=(L\sqrt{G})(L\sqrt{G})^T$, where $L\sqrt{G}$ again is a lower triangular matrix. Such a decomposition is called the Cholesky decomposition of $A$. This shows that a positive-definite matrix $A$ can be diagonalized as $L^{-1}A(L^T)^{-1}=L^{-1}A(L^{-1})^T=G.$

Let $c=(c_1,...,c_k)^T$ be a nonzero vector, and partition $x$ as $x=[x_1^T,x_2^T]^T$, with the first block of size $p$ and the second of size $k-p$, such that $$\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left( \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{bmatrix} \right).$$ Some properties of $x$ are:

  1. $c^Tx \sim N\left( c^T\mu, c^T\Sigma c \right)$, any nonzero linear combination of $x$ is univariate normal and vice-versa.
  2. The marginal distribution of $x_i$ is normal, $x_i \sim N \left( \mu_i, \Sigma_{ii}\right)$.
  3. $\Sigma_{12}=0$ if and only if $x_1$ and $x_2$ are independent.
  4. The variable $(x-\mu)^T\Sigma^{-1}(x-\mu)$ follows a chi-squared distribution with $k$ degrees of freedom.
  5. The conditional distribution of $x_1$ given $x_2=b$ is also normally distributed as $$(x_1|x_2=b)\sim N \left( \mu_1+\Sigma_{12}\Sigma_{22}^{-1}(b-\mu_2), \Sigma_{11}-\Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right).$$
Suppose that $x$, $y$, and $z$ are three random vectors such that their joint distribution is multivariate normal. In addition, assume that the diagonal block covariance matrix $\Sigma_{ww}$ is nonsingular for $w=x,y,z$, and $\Sigma_{yz}=0$. Then,

  1. $(x|y) \sim N \left( \mu_x+\Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y), \Sigma_{xx}-\Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)$
  2. $(x|y,z) \sim N\left( E(x|y)+\Sigma_{xz}\Sigma_{zz}^{-1}(z-\mu_z), Var(x|y)-\Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}\right)$
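A minimal sketch of property 5 above (and equivalently item 1 of this list), computing the conditional mean and covariance of one block given the other; the helper `conditional_mvn` and the example numbers are made up for illustration:

```python
import numpy as np

def conditional_mvn(mu, sigma, idx1, idx2, b):
    """Mean and covariance of x1 | x2 = b for a jointly normal x.

    idx1, idx2 are index arrays selecting the two blocks of x."""
    mu1, mu2 = mu[idx1], mu[idx2]
    s11 = sigma[np.ix_(idx1, idx1)]
    s12 = sigma[np.ix_(idx1, idx2)]
    s21 = sigma[np.ix_(idx2, idx1)]
    s22_inv = np.linalg.inv(sigma[np.ix_(idx2, idx2)])
    cond_mean = mu1 + s12 @ s22_inv @ (b - mu2)
    cond_cov = s11 - s12 @ s22_inv @ s21
    return cond_mean, cond_cov

mu = np.array([0.0, 1.0, -1.0])
sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
m, v = conditional_mvn(mu, sigma, idx1=np.array([0]), idx2=np.array([1, 2]),
                       b=np.array([1.5, -0.5]))
print(m, v)
```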


Sunday, August 16, 2015

Tsay Ch 11 - State-space models and Kalman filter

Local trend model

For a univariate time series $y_t=\mu_t+\epsilon_t$ and $\mu_{t+1}=\mu_t+\eta_t$, both error terms are assumed to be normally distributed with distinct variances $\sigma_e^2$ and $\sigma_{\eta}^2$, respectively. Notice the first equation is the observed version of the second (trend) equation with added noise. This model can be used to analyze the realized volatility of an asset price if $\mu_t$ is taken to be the log volatility (which is not directly observable) and $y_t$ is the logarithm of realized volatility (which is observable, constructed from high-frequency transaction data contaminated by microstructure noise).

If there is no measurement error in the first equation ($\sigma_e=0$) this becomes an ARIMA(0,1,0) model. With the measurement error it is an ARIMA(0,1,1) model, which is also the simple exponential smoothing model. The form is $(1-B)y_t=(1-\theta B)a_t$, where $\theta$ and $\sigma_{a}^2$ are related to $\sigma^2_e$ and $\sigma^2_{\eta}$ as follows: $(1+\theta^2)\sigma^2_a=2\sigma^2_e+\sigma^2_{\eta}$ and $\theta \sigma^2_a=\sigma^2_e.$ The quadratic equation for $\theta$ gives two solutions, of which the one with $|\theta|<1$ is chosen. The reverse mapping is also possible for positive $\theta$. Both representations have pros and cons; the objective of the data analysis, substantive issues, and experience decide which to use.
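A small worked sketch of the mapping from $(\sigma^2_e,\sigma^2_{\eta})$ to the reduced-form $(\theta,\sigma^2_a)$: dividing the two relations above gives the quadratic $\theta^2-(2+q)\theta+1=0$ with $q=\sigma^2_{\eta}/\sigma^2_e$, and we keep the root with $|\theta|<1$ (the function name and example values are my own):

```python
import numpy as np

def local_trend_to_ma1(sigma_e2, sigma_eta2):
    """Map local-trend variances to the reduced-form ARIMA(0,1,1) parameters.

    Solves theta^2 - (2 + q) theta + 1 = 0 with q = sigma_eta^2 / sigma_e^2,
    keeps the root inside the unit circle, and recovers sigma_a^2 = sigma_e^2 / theta."""
    q = sigma_eta2 / sigma_e2
    theta = ((2 + q) - np.sqrt((2 + q) ** 2 - 4)) / 2   # |theta| < 1
    sigma_a2 = sigma_e2 / theta
    return theta, sigma_a2

theta, sigma_a2 = local_trend_to_ma1(sigma_e2=1.0, sigma_eta2=0.5)
print(theta, sigma_a2)                     # 0.5, 2.0
# check: (1 + theta^2) * sigma_a2 equals 2*sigma_e^2 + sigma_eta^2
print((1 + theta ** 2) * sigma_a2)         # 2.5
```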

Statistical Inference

Three types (using the example of reading a handwritten note):
  1. Filtering - recover the state variable $\mu_t$ given $F_t$, to remove the measurement errors from the data (figuring out the word you are reading based on knowledge accumulated from the beginning of the note).
  2. Prediction - forecast $\mu_{t+h}$ or $y_{t+h}$ for $h>0$ given $F_t$, where $t$ is the forecast origin. (guess the next word).
  3. Smoothing - estimate $\mu_t$ given $F_T$, where $T>t$. (deciphering a particular word once you have read through the note).

The Kalman Filter

Let $\mu_{t|j}=E(\mu_t|F_j)$ and $\Sigma_{t|j}=Var(\mu_t|F_j)$ be, respectively, the conditional mean and variance of $\mu_t$ given information $F_j$. Similarly $y_{t|j}$ denotes the conditional mean of $y_t$ given $F_j$. Furthermore, let $v_t=y_t-y_{t|t-1}$ and $V_t=Var(v_t|F_{t-1})$ be the 1-step-ahead forecast error of $y_t$ given $F_{t-1}$ and its variance. Note that $Var(v_t|F_{t-1})=Var(v_t)$, since the forecast error $v_t$ is independent of $F_{t-1}$. Further, $y_{t|t-1}=\mu_{t|t-1}$, giving $v_t=y_t-\mu_{t|t-1}$ and $V_t=\Sigma_{t|t-1}+\sigma^2_e$. Also, $E(v_t)=0$ and $Cov(v_t,y_j)=0$ for $j<t$. The information set $F_t \equiv \{F_{t-1},y_t\} \equiv \{F_{t-1},v_t\}$, hence $\mu_{t|t}=E(\mu_t|F_{t-1},v_t)$ and $\Sigma_{t|t}=Var(\mu_t|F_{t-1},v_t)$.

One can show that $Cov(\mu_t,v_t|F_{t-1})=\Sigma_{t|t-1}$ giving, $$\begin{bmatrix} \mu_t \\ v_t \end{bmatrix}_{F_{t-1}} \sim N\left( \begin{bmatrix} \mu_{t|t-1} \\ 0 \end{bmatrix}, \begin{bmatrix} \Sigma_{t|t-1} & \Sigma_{t|t-1} \\ \Sigma_{t|t-1} & V_t\end{bmatrix} \right).$$ Applying the multivariate normal theorem we get $$\mu_{t|t} = \mu_{t|t-1}+(V_t^{-1}\Sigma_{t|t-1})v_t=\mu_{t|t-1}+K_tv_t,$$ $$\Sigma_{t|t}=\Sigma_{t|t-1}-\Sigma_{t|t-1}V_t^{-1}\Sigma_{t|t-1} = \Sigma_{t|t-1}(1-K_t),$$ where $K_t=V_t^{-1}\Sigma_{t|t-1}$ is referred to as the Kalman gain, which is the regression coefficient of $\mu_t$ on $v_t$ and governs the contribution of the new shock $v_t$ to the state variable $\mu_t$. To predict $\mu_{t+1}$ given $F_t$ we have $$(\mu_{t+1}|F_t) \sim N(\mu_{t+1|t},\Sigma_{t+1|t})=N(\mu_{t|t}, \Sigma_{t|t}+\sigma^2_{\eta}).$$ Once the new data $y_{t+1}$ is observed, the above procedure can be repeated (once $\sigma_e$ and $\sigma_{\eta}$ are estimated, generally by maximum likelihood). This is the famous Kalman filter algorithm (Kalman, 1960). The choice of priors $\mu_{1|0}$ and $\Sigma_{1|0}$ requires some attention.
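A minimal scalar Kalman filter sketch for the local trend model, following the recursions above (the function `kalman_local_trend`, the prior choices, and the simulated data are assumptions of mine, not the book's code):

```python
import numpy as np

def kalman_local_trend(y, sigma_e2, sigma_eta2, mu0, p0):
    """Kalman filter for y_t = mu_t + e_t, mu_{t+1} = mu_t + eta_t.

    mu0, p0 are the priors mu_{1|0} and Sigma_{1|0}. Returns filtered means mu_{t|t}."""
    mu_pred, p_pred = mu0, p0
    filtered = np.empty(len(y))
    for t, yt in enumerate(y):
        v = yt - mu_pred                  # 1-step-ahead forecast error v_t
        V = p_pred + sigma_e2             # its variance V_t
        K = p_pred / V                    # Kalman gain K_t
        mu_filt = mu_pred + K * v         # mu_{t|t}
        p_filt = p_pred * (1 - K)         # Sigma_{t|t}
        filtered[t] = mu_filt
        mu_pred = mu_filt                 # mu_{t+1|t}
        p_pred = p_filt + sigma_eta2      # Sigma_{t+1|t}
    return filtered

# simulate the local trend model and filter it
rng = np.random.default_rng(1)
T, se2, sn2 = 200, 1.0, 0.1
mu = np.cumsum(rng.normal(0, np.sqrt(sn2), T))
y = mu + rng.normal(0, np.sqrt(se2), T)
mu_hat = kalman_local_trend(y, se2, sn2, mu0=y[0], p0=10.0)
```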

Properties of forecast error -
State error recursion -
State smoothing -
Missing Values -
Effect of Initialization -
Estimation - 

Regression assumptions

Everybody in finance knows that 90% of quant work is 'REGRESSION', and mostly LINEAR. The results of a linear regression are only as good as our understanding of its assumptions. For the univariate case we write $y_t = \alpha + \beta x_t + \epsilon_t$, where estimation is straightforward. The interesting case is multivariate regression, where we write $$Y_t = \pmb{\beta}X_t +\pmb{\epsilon}_t.$$ To estimate the parameters we use the normal equations to get $$\pmb{\beta} = (X^TX)^{-1}X^TY.$$ Now, how good is this estimate? We want these estimates to be (a minimal estimation sketch follows the list below):
unbiased - The expected value of the estimate is the true value.
consistent - With more observations the distribution of the estimate becomes more concentrated near true value.
efficient - fewer observations are required to establish the true value for a given confidence.
asymptotically normal - With a lot of observations the distribution of the estimate is a normal distribution.
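As referenced above, here is a minimal sketch of the normal-equation estimate; in practice `np.linalg.lstsq` or a QR-based solver is the numerically safer route, and the data here are simulated purely for illustration:

```python
import numpy as np

def ols(X, y):
    """OLS estimate beta = (X^T X)^{-1} X^T y via the normal equations."""
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    return beta, resid

rng = np.random.default_rng(2)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 regressors
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)
beta_hat, _ = ols(X, y)
print(beta_hat)          # close to beta_true
```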

OLS is consistent when the regressors are exogenous and there is no perfect multicollinearity, and it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, OLS provides minimum-variance, mean-unbiased estimates when the errors have finite variances. Assuming the errors have a normal distribution, OLS is the same as MLE. An expanded version of OLS is the multi-fractional order estimator (like the Kalman filter).

The 'random design' paradigm treats the regressors $x_i$ as random and sampled together with $y_i$ from some population. The 'fixed design' paradigm treats $X$ as known constants and $y$ is sampled conditionally on the values of $X$ as in an experiment. Practically, the distinction is unimportant and results in the same formula for estimation. 

Assumptions

  1. OLS minimizes error in dependent variable $y$ only and hence assumes there is no error in $x$.
  2. The functional dependence being modeled is valid.
  3. Strict exogeneity - The errors in the regression have conditional mean zero: $E[\epsilon|X]=0$, which implies that the errors have mean zero, $E[\epsilon]=0$, and that the regressors are uncorrelated with the errors, $E[X^T\epsilon]=0$. If this fails, the OLS estimates are invalid; in that case use the method of instrumental variables.
  4. No linear dependence - The regressors in $X$ must be linearly independent, i.e. $X$ must have full rank almost surely. Sometimes we also assume that the regressors have finite moments up to second order, in which case the matrix $X^TX$ is finite and positive semi-definite. If violated, the regressors are called perfectly multicollinear and $\beta$ cannot be estimated, though prediction of $y$ is still possible.
  5. Spherical errors - It is assumed that $Var[\epsilon|X]=\sigma^2\pmb{I}_n$. If violated, OLS estimates are still valid but no longer efficient. If the error terms do not have the same variance, i.e. they are not homoscedastic, weighted least squares is used. If there is autocorrelation between the error terms, generalized least squares is used.
  6. Normality - It is sometimes additionally assumed that errors have normal distribution. This is not required. Under this assumption OLS is equivalent to MLE and is asymptotically efficient in the class of all regular estimators.
A certain degree of correlation between the observations is very common, and under it OLS and WLS are inefficient. GLS is the right thing to do: $$Y = X\beta + \epsilon \qquad E[\epsilon|X]=0, \quad Var[\epsilon|X]=\Omega.$$ GLS estimates $\beta$ by minimizing the squared Mahalanobis length of the residual vector, giving $$\hat{\beta}=(X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}Y.$$ The GLS estimator is unbiased, consistent, efficient, and asymptotically normal. It is equivalent to applying OLS to a linearly transformed version of the data, which standardizes and de-correlates the errors. WLS is a special case of GLS.
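A small sketch of the GLS estimator above, with a heteroscedastic (diagonal $\Omega$) example so that it reduces to WLS; the helper name `gls` and the toy data are my own:

```python
import numpy as np

def gls(X, y, omega):
    """GLS estimate beta = (X^T Omega^{-1} X)^{-1} X^T Omega^{-1} y.

    Equivalent to OLS on data premultiplied by Omega^{-1/2}."""
    omega_inv = np.linalg.inv(omega)
    xtoi = X.T @ omega_inv
    return np.linalg.solve(xtoi @ X, xtoi @ y)

# Example: heteroscedastic errors (WLS as a special case of GLS)
rng = np.random.default_rng(3)
n = 200
X = np.column_stack([np.ones(n), rng.normal(size=n)])
var = np.linspace(0.5, 5.0, n)                       # known error variances
y = X @ np.array([1.0, 2.0]) + rng.normal(scale=np.sqrt(var))
print(gls(X, y, omega=np.diag(var)))
```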

To estimate GLS we use Feasible Generalized Least squares (FGLS) in two steps:
1) Model is estimated using OLS (consistent but inefficient) estimator, and the residuals are used to build a consistent estimator of the error covariance matrix;
2) Using these we estimate GLS.

FGLS is preferred only for large sample sizes. For small samples it is better to stick to OLS, since FGLS is not always consistent in small samples.

Saturday, August 15, 2015

Tsay Ch9 - Principal Component Analysis and Factor Models

Dimension reduction is essential to search for the underlying structure of the assets - called factors.

Three types of factor models -
1) Macroeconomic factor models - GDP growth, interest rates, inflation, unemployment - observable factors using regression
2) Fundamental factor models - firm size, book and market values, industrial classification.
3) Statistical factor models - non-observable or latent variables, e.g. PCA

General Factor Model

For $m$ factors, $k$ assets, and $T$ time periods, let $r_{it}$ be the return of asset $i$ in time period $t$. The factor model is
$$\pmb{r}_{t}=\pmb{\alpha}+\pmb{\beta}\pmb{f}_t+\pmb{\epsilon}_{t}, \qquad t = 1,...,T$$
where $\pmb{\beta}$ is a $k\times m$ factor-loading matrix and $\pmb{\epsilon}_t$ is the error vector with $Cov(\pmb{\epsilon}_t)=\pmb{D}=diag[\sigma^2_1,...,\sigma^2_k]$, a $k\times k$ diagonal matrix. The covariance matrix of the returns $\pmb{r}_t$ is then given by:
$$Cov(\pmb{r}_t)=\pmb{\beta}\pmb{\Sigma}_f\pmb{\beta}^T+\pmb{D}$$

Macroeconomic factor models

Macroeconomic factors are observable. We can convert the general factor model into a multiple linear regression setup and estimate the loadings. This estimation does not impose the constraint that the $\epsilon_{it}$ be uncorrelated, so it may not be efficient in general. The best known single-factor model is the market model (Sharpe 1970). The $R^2$ can reach up to 50%, showing the significance of the common market factor. One simple trick to compare a factor-based covariance matrix with the sample covariance matrix is to use the global minimum variance portfolio (GMVP). For a given covariance matrix $\Sigma$, the GMVP $\omega$ solves $\min_{\omega}\sigma^2_{p,\omega}=\omega^T\Sigma\omega$ subject to $\omega^T\pmb{1}=1$, and is given by
$$\omega=\frac{\Sigma^{-1}\pmb{1}}{\pmb{1}^T\Sigma^{-1}\pmb{1}}.$$
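A one-function sketch of the GMVP weights, which can be fed either the factor-based or the sample covariance matrix for the comparison described above (the example matrix is made up):

```python
import numpy as np

def gmvp_weights(sigma):
    """Global minimum variance portfolio: w = Sigma^{-1} 1 / (1^T Sigma^{-1} 1)."""
    ones = np.ones(sigma.shape[0])
    w = np.linalg.solve(sigma, ones)
    return w / w.sum()

sigma = np.array([[0.04, 0.01, 0.00],
                  [0.01, 0.09, 0.02],
                  [0.00, 0.02, 0.16]])
print(gmvp_weights(sigma))          # weights sum to 1
```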
It is also important to verify that the residual covariance matrices do not have large off-diagonal elements, to fit the factor model criteria.

Ross (1986) considers a multi-factor model consisting of unexpected changes or surprises (e.g. residuals after fitting a VAR(3) model to seasonally adjusted CPI and unemployment growth numbers). The explanatory power is low.

Fundamental factor models

The BARRA factor method treats the observed asset-specific fundamentals as the factor betas $\beta_i$, and estimates the factors $f_t$ at each time index $t$ via regression. Fama and French construct their factors from hedge portfolios that depend on the fundamentals. For the BARRA factor model $$\widetilde{\pmb{r}}_t = \pmb{\beta} \pmb{f}_t+\pmb{\epsilon}_t,$$ where $\widetilde{\pmb{r}}_t$ is the mean-corrected return vector. We need a WLS setup since the error variances are not homogeneous; the estimate is $$\hat{\pmb{f}}_t=(\pmb{\beta}^T D^{-1}\pmb{\beta})^{-1}\pmb{\beta}^T D^{-1}\widetilde{\pmb{r}}_t.$$ We estimate the diagonal covariance matrix of errors from OLS first and then use it to estimate the factors with the WLS equation. Cross-correlations in the errors are ignored. The diagonal covariance matrix of the final errors $\hat{D}_g$ and the covariance matrix of the estimated factor realizations $\hat{\Sigma}_f$ can be used to derive the covariance matrix of the original returns as $$Cov(r_t)=\beta\hat{\Sigma}_f\beta^T+\hat{D}_g.$$ In practice, the sample means of the returns are not significantly different from zero, so one may not need to remove the sample mean before fitting the BARRA factor model.
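A minimal sketch of the WLS step for one cross-section, assuming the residual variances $D$ have already been obtained from a first-pass OLS; the function `barra_factor_realization` and the simulated fundamentals are hypothetical:

```python
import numpy as np

def barra_factor_realization(r_tilde, beta, d):
    """WLS estimate f_t = (beta^T D^{-1} beta)^{-1} beta^T D^{-1} r_tilde
    for one cross-section of mean-corrected returns r_tilde (length k).

    beta is the k x m matrix of observed fundamentals, d the k-vector of
    residual variances (diagonal of D)."""
    btd = beta.T * (1.0 / d)        # beta^T D^{-1}
    return np.linalg.solve(btd @ beta, btd @ r_tilde)

rng = np.random.default_rng(4)
k, m = 50, 3
beta = rng.normal(size=(k, m))        # asset-specific fundamentals
f_true = np.array([0.5, -0.2, 0.1])
d = rng.uniform(0.5, 2.0, size=k)
r = beta @ f_true + rng.normal(scale=np.sqrt(d))
print(barra_factor_realization(r, beta, d))   # close to f_true
```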

Fama-French approach used a two-step procedure. First, they sorted the assets based on the value of three fundamentals (market excess returns, small vs big cap, value vs growth stocks). They formed the hedge portfolio which is long top quintile and short the bottom quintile. The observed return on this hedge portfolio is the factor realization for the given asset. Finally, given the factor realizations calculate betas using regression.

Principal component analysis

We look for linear combinations that explain the most variance and are orthogonal to each other, with each weight vector normalized to unit length. This is done on the covariance or correlation matrix, which is non-negative definite and hence has a spectral decomposition. For the covariance matrix the variance explained by the $i$th component is $\lambda_i/\sum_j \lambda_j$, which becomes $\lambda_i/k$ for a correlation matrix, since $Tr(\rho_r)=k$.
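A short sketch of PCA via the spectral decomposition of the sample covariance or correlation matrix (function name and simulated returns are my own; `np.linalg.eigh` returns eigenvalues in ascending order, hence the reordering):

```python
import numpy as np

def pca(returns, use_correlation=False):
    """Principal components via the spectral decomposition of the sample
    covariance (or correlation) matrix. Returns eigenvalues (descending),
    unit-norm loading vectors, and the proportion of variance explained."""
    mat = np.corrcoef(returns.T) if use_correlation else np.cov(returns.T)
    eigval, eigvec = np.linalg.eigh(mat)          # ascending order
    order = np.argsort(eigval)[::-1]
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = eigval / eigval.sum()             # lambda_i / sum(lambda); = lambda_i/k for correlation
    return eigval, eigvec, explained

rng = np.random.default_rng(5)
r = rng.multivariate_normal(np.zeros(3),
                            [[1, .8, .2], [.8, 1, .3], [.2, .3, 1]], size=1000)
vals, vecs, expl = pca(r, use_correlation=True)
print(expl)
```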

Statistical factor analysis

The aim is to identify a few factors that can account for most of the variation in the covariance or correlation matrix of the data. The assumption of no serial correlations is all right for low-frequency data but not accurate for higher frequencies; serial correlations should first be removed parametrically. We then construct an orthogonal factor model. Since both the loadings and the factors are unobservable, it differs from the other factor models. For the statistical factor model $r_t - \mu = \beta f_t + \epsilon_t$, we have the assumptions $E[f_t]=0$, $Cov[f_t]=\pmb{I}_m$, $E[\epsilon_t]=0$, $Cov[\epsilon_t]=D=diag(\sigma^2_1,...,\sigma^2_k)$ and $E[f_t\epsilon_t^T]=0$. The loadings and factors are not uniquely determined. The model can be estimated either by the principal component method or by maximum likelihood, with the number of factors $m$ specified. Factor rotation (e.g. the varimax criterion) can be used for interpretation.

Left out sections: 9.6

Tuesday, July 7, 2015

Tsay Ch4 - Nonlinear Models and their applications

I was going to leave out this chapter but may be not. I need to see what regime switching is all about.

Remember a purely stochastic time series $x_t$ is linear if it can be decomposed into a Wold form; anything else is nonlinear. A completely general nonlinear function is impractical because it has too many parameters. We can restrict the model to the form $$x_t=g(F_{t-1})+\sqrt{h(F_{t-1})}\epsilon_t,$$ where $\epsilon_t=a_t/\sigma_t$ is the standardized innovation. ARMA is a linear model for the mean, while ARCH/GARCH models are nonlinear models of the variance (because they model the square of volatility and not volatility itself). For stationary volatility series, shocks are uncorrelated but dependent. The models in this chapter extend nonlinearity to the model of the mean.

The following four nonlinear models let the conditional mean $\mu_t$ evolve over time according to some simple parametric nonlinear function - bilinear, threshold autoregressive, state dependent, and Markov switching models. More recent nonlinear models are more data driven - nonlinear state-space, functional coefficient autoregressive, nonlinear additive autoregressive, and multivariate adaptive regression splines. Finally, nonparametric methods like kernel regression and artificial neural networks have also been used. Test statistics - parametric (based on the Lagrange multiplier or likelihood) and nonparametric (based on higher-order spectra or the correlation dimension) - are also discussed.

Nonlinear models: Some basic ones for financial time series.

1) Bilinear Model: The linear model is simply the first-order Taylor series expansion of the $f(.)$ function. As such, a natural extension to nonlinearity is to employ the second-order terms in the expansion to improve the approximation. Hence the model is
$$x_t=c+\sum_{i=1}^p\phi_ix_{t-i}-\sum_{j=1}^q\theta_j a_{t-j}+\sum_{i=1}^m\sum_{j=1}^s\beta_{ij}x_{t-i}a_{t-j}+a_t.$$
This is generally analyzed by putting the model in state-space form. ARCH models generally seem to fit the data better, when quadratic terms of innovations are involved.

2) Threshold Autoregressive Model (TAR): Motivated by asymmetry in declining and rising patterns, the model is piecewise linear (not in time but in threshold space). This can be used to model regimes, e.g. here is a two-regime model
\[
x_t=
\begin{cases}
\phi_1^{(1)}x_{t-1}+a_t, & \text{if } x_{t-l}<\delta\\
\phi_1^{(2)}x_{t-1}+a_t, & \text{if } x_{t-l}\ge \delta
\end{cases}
\]
Here $l$ is the delay parameter and $\delta$ is the threshold, which bifurcates the two regimes. There are several interesting characteristics of TAR models:

a) The process $x_t$ is geometrically ergodic and stationary under the conditions $\phi_1^{(1)}<1$, $\phi_1^{(2)}<1$, and $\phi_1^{(1)}\phi_1^{(2)}<1$. A process is ergodic if its statistical properties (such as mean and variance) can be deduced from a single, sufficiently long sample (realization) of the process. Ergodic theorem: the sample mean $\bar{x}=(\sum_t^Tx_t)/T$ of $x_t$ converges to the mean of $x_t$. This is regarded as the counterpart of the central limit theorem for the iid case. 

b) The series exhibits an asymmetric increasing and decreasing pattern due to different coefficients in the two regimes, hence different regimes will have different number of observations. The series is hence not time-reversible. 

c) The model, in the example, has no constant terms, but $E(x_t)$ is not zero. $E(x_t)$ is a weighted average of the conditional means of the two regimes, which are nonzero. The weight of each regime is simply the probability that $x_t$ is in that regime under its stationary distribution. Hence, for a TAR model to have zero mean, nonzero constant terms are needed in some regimes, unlike in a stationary linear model where a nonzero constant implies a nonzero mean for $x_t$.

Properties of TAR models are hard to obtain, but estimation is not difficult. US unemployment can be modeled using TAR models, where the regime changes after a sudden increase in unemployment triggers monetary intervention. TAR models are also used to model asymmetric responses in volatility between positive and negative returns, and to study arbitrage trading between index futures and cash prices. 
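A minimal simulation sketch of the two-regime TAR(1) model above; the parameter values $\phi_1^{(1)}=-1.5$, $\phi_1^{(2)}=0.5$ are chosen to satisfy the stationarity conditions in (a), and the function name is my own:

```python
import numpy as np

def simulate_tar(n, phi1, phi2, delta=0.0, lag=1, sigma=1.0, seed=0):
    """Simulate the two-regime TAR(1): coefficient phi1 if x_{t-lag} < delta,
    phi2 otherwise (delta is the threshold, lag the delay parameter)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(lag, n):
        phi = phi1 if x[t - lag] < delta else phi2
        x[t] = phi * x[t - 1] + rng.normal(scale=sigma)
    return x

x = simulate_tar(1000, phi1=-1.5, phi2=0.5)
print(x.mean())     # nonzero even without constant terms, as noted in (c)
```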

3) Smooth transition AR model (STAR): The mean is discontinuous for TAR, so using logistic, exponential or cumulative functions it can be made continuous and differentiable. These are hard to estimate with large standard errors. 

4) Markov switching autoregressive model (MSA): Uses a probability of switching, emphasizing aperiodic transitions between states (Hamilton, 1989). 
\[
x_t=
\begin{cases}
c_1+\sum_{i=1}^p\phi_{1,i}x_{t-i}+a_{1t}, & \text{if } s_{t}=1\\
c_2+\sum_{i=1}^p\phi_{2,i}x_{t-i}+a_{2t}  & \text{if } s_{t}=2
\end{cases}
\]
where $s_t$ are the two states and is a first order Markov-chain with transition probability matrix
\[
\begin{bmatrix}
1-w_1 & w_1 \\
w_2 & 1-w_2
\end{bmatrix}
\]
The expected duration of the process staying in state $i$ is $1/w_i$. The TAR model uses a deterministic scheme of transition while MSA uses a stochastic scheme. Hence, in MSA one is never certain about which state $x_t$ belongs to. This has important implications: MSA forecasts are always a linear combination of the two states, while the TAR model picks a particular state. These models are estimated using EM or MCMC algorithms. The model can be further generalized by modeling the transition probabilities as logistic, probit, or some other function of explanatory variables available at time $t-1$. Markov switching can hence be used to choose among many models. Taking the example of US quarterly GDP and fitting an MSA model, we see the following:
a) The states correspond to growth and contraction.
b) The large standard deviation of state 2 implies that there are relatively fewer observations in the contraction state.
c) It is more likely for US GNP to get out of contraction period than to jump into one.
d) Expected duration of expansion is 11 quarters while contraction is 4 quarters.

5) Non-parametric methods: The essence is smoothing, as these methods are highly data dependent and may otherwise overfit. If we were given many realizations of $y_t$ for a given $x_t$, an asymptotically accurate value of the functional dependence $m(x)$ of $Y$ on $X$ at $X=x$ would be given by the average of the $y_t$. In a time series only one observation of $y_t$ is available for a given $x_t$. But if the functional form is sufficiently smooth, then the values of $Y$ for which $x_t\approx x$ continue to provide accurate approximations of $m(x)$. We can use a weighted average to estimate $m(x)$. Hence,
$$\hat{m}(x)=\frac{1}{T}\sum_{t=1}^Tw_t(x)y_t$$

Kernel Regression: The kernel $K(x)$ is the weighting function, satisfying $K(x)\ge0$ and $\int K(z)dz=1$. To rescale the distance measure, one can use a bandwidth $h>0$, making the kernel $K_h(x)=K(x/h)/h$ with $\int K_h(z)dz=1$. The weight function simply becomes:
$$w_t(x)=\frac{K_h(x-x_t)}{\sum_{t=1}^T K_h(x-x_t)},$$ which gives the Nadaraya-Watson kernel estimator (1964). Kernels chosen are generally Gaussian or Epanechnikov. If $h$ is very small, then the weights focus on the few observations that are in the neighborhood of $x$. If $h$ is very large, then the weights spread over a larger neighborhood of $x$. Bandwidth selection is done by either minimizing the mean integrated squared error (using some preliminary smoothing estimates) or leave-one-out cross-validation. Fan and Yao (2003) give the rule-of-thumb bandwidth for the Gaussian kernel as $1.06\sigma_xT^{-0.2}$.
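A small sketch of the Nadaraya-Watson estimator with a Gaussian kernel and the rule-of-thumb bandwidth quoted above (the data and function name are made up for illustration):

```python
import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    """Nadaraya-Watson estimate m_hat(x) = sum_t K_h(x - x_t) y_t / sum_t K_h(x - x_t),
    evaluated at each point of x_grid, with a Gaussian kernel and bandwidth h."""
    def k_h(u):
        return np.exp(-0.5 * (u / h) ** 2) / (h * np.sqrt(2 * np.pi))
    weights = k_h(x_grid[:, None] - x[None, :])       # shape (grid, T)
    return (weights @ y) / weights.sum(axis=1)

rng = np.random.default_rng(6)
x = rng.uniform(-3, 3, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=300)
h = 1.06 * x.std() * len(x) ** (-0.2)                 # rule-of-thumb bandwidth
grid = np.linspace(-3, 3, 50)
m_hat = nadaraya_watson(grid, x, y, h)
```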

Local Linear Regression Method:- Uses the kernel weights but minimizes a weighted least-squares criterion to estimate the parameters of a local linear fit.
Functional coefficient AR model:- Treat each AR coefficient as a function and estimate it using kernels. Shown to have better 1-step-ahead forecasts.
Nonlinear additive AR model:- To avoid the curse of dimensionality, use a GAM-style AR model.
Nonlinear State-Space model:- Instead of linear matrix evolution, use nonlinear evolution of states, estimated using MCMC.

Left out sections: 4.1.9, 4.2-4.5

Tuesday, June 30, 2015

Tsay Ch8 - Multivariate Time Series Analysis and Its Applications

How markets are interrelated is important to understand the lead-lag relationship and under what circumstances they reverse or do not work. Various previous methods can be applied directly to the vector case, but some need attention.

Weak stationarity and cross-correlation matrices

A k-dimensional time series $\pmb{r_t}=[r_{1t},...,r_{kt}]^T$ is weakly stationary if its first and second moments are time-invariant, $\pmb{\mu}=E(\pmb{r}_t)$ and $\pmb{\Gamma}_0=E[(\pmb{r}_t-\pmb{\mu})(\pmb{r}_t-\pmb{\mu})^T]$. Let $\pmb{D}$ be a $k\times k$ diagonal matrix consisting of the standard deviations of $r_{it}$. The lag-zero cross-correlation matrix is defined as $\pmb{\rho}_0=\pmb{D}^{-1}\pmb{\Gamma}_0\pmb{D}^{-1}$, which is the correlation matrix. The lag-$l$ cross-covariance matrix of $\pmb{r}_t$ is defined as $\pmb{\Gamma}_l=E[(\pmb{r}_t-\pmb{\mu})(\pmb{r}_{t-l}-\pmb{\mu})^T]$. For a weakly stationary series, the cross-covariance matrix $\pmb{\Gamma}_l$ is a function of $l$, not the time index $t$. The lag-$l$ cross-correlation matrix (CCM) is defined as $\pmb{\rho}_l=\pmb{D}^{-1}\pmb{\Gamma}_l\pmb{D}^{-1}$. To understand this better, notice
$$\rho_{ij}(l)=\frac{\Gamma_{ij}(l)}{\sqrt{\Gamma_{ii}(0)\Gamma_{jj}(0)}}=\frac{Cov(r_{it},r_{j,t-l})}{std(r_{it})std(r_{jt})}$$
which is the correlation between $r_{it}$ and $r_{j,t-l}$. If $\rho_{ij}(l)\ne 0$ and $l>0$, we say that the series $r_{jt}$ leads the series $r_{it}$ at lag $l$. Similarly, if $\rho_{ji}(l)\ne 0$ and $l>0$ we say that the series $r_{it}$ leads the series $r_{jt}$ at lag $l$. The diagonal element $\rho_{ii}(l)$ is simply the lag-$l$ autocorrelation coefficient of $r_{it}$.

Some remarks are ($l>0$):
1) In general $\rho_{ij}(l) \ne \rho_{ji}(l)$ for $i\ne j$, because they are measuring two different lag relationships, implying $\pmb{\Gamma}_l$ and $\pmb{\rho}_l$ are in general not symmetric.
2) It is easy to see that $\pmb{\Gamma}_l=\pmb{\Gamma}_{-l}^T$ and $\pmb{\rho}_l=\pmb{\rho}_{-l}^T$. Hence it suffices in practice to consider the cross-correlation matrices $\pmb{\rho}_l$ for $l \ge 0$.

For the cross-correlation matrices {$\pmb{\rho}_l|l=0,1,...$}, the diagonal elements {$\rho_{ii}(l)|l=0,1,...$} are the autocorrelation function of $r_{it}$, the off-diagonal element $\rho_{ij}(0)$ measures the concurrent linear relationship between $r_{it}$ and $r_{jt}$, and for $l>0$ the off-diagonal element $\rho_{ij}(l)$ measures the linear dependence of $r_{it}$ on the past value $r_{j,t-l}$. Depending on the values in these matrices one can identify -
1) no linear relationship ($\rho_{ij}(l)=\rho_{ji}(l)=0, \forall l\ge0$),
2) concurrent correlation ($\rho_{ij}(0)\ne0$),
3) no lead-lag relationship ($\rho_{ij}(l)=\rho_{ji}(l)=0, \forall l>0$),
4) unidirectional relationship ($\rho_{ij}(l)=0 \ \forall l>0$, but $\rho_{ji}(v)\ne0$ for some $v>0$), or
5) feedback relationship ($\rho_{ij}(l)\ne0$ for some $l>0$, and $\rho_{ji}(v)\ne0$ for some $v>0$).

Sample cross-correlation matrices can be estimated using $\hat{\pmb{\rho}}_l=\hat{\pmb{D}}^{-1}\hat{\pmb{\Gamma}}_l\hat{\pmb{D}}^{-1}$, where
$$\hat{\Gamma}_l=\frac{1}{T}\sum_{t=l+1}^{T}(\pmb{r}_t-\bar{\pmb{r}})(\pmb{r}_{t-l}-\bar{\pmb{r}})^T, \qquad l\ge 0.$$ Bootstrapping can be used to get confidence intervals on finite samples.
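A minimal sketch of the sample lag-$l$ cross-covariance and cross-correlation matrices defined above (function name and simulated data are my own; element $(i,j)$ of $\hat{\pmb{\rho}}_l$ pairs $r_{it}$ with $r_{j,t-l}$):

```python
import numpy as np

def cross_correlation_matrices(r, max_lag):
    """Sample lag-l cross-correlation matrices rho_l = D^{-1} Gamma_l D^{-1},
    l = 0, ..., max_lag, for a (T, k) return matrix r."""
    T, k = r.shape
    x = r - r.mean(axis=0)
    d_inv = np.diag(1.0 / x.std(axis=0, ddof=0))
    ccms = []
    for l in range(max_lag + 1):
        gamma_l = (x[l:].T @ x[:T - l]) / T    # sum_{t=l+1}^T x_t x_{t-l}^T / T
        ccms.append(d_inv @ gamma_l @ d_inv)
    return ccms

rng = np.random.default_rng(7)
r = rng.multivariate_normal([0, 0], [[1, .5], [.5, 1]], size=500)
rho = cross_correlation_matrices(r, max_lag=2)
print(rho[1])       # element (i, j): correlation of r_{i,t} with r_{j,t-1}
```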

Multivariate Portmanteau tests: the multivariate Ljung-Box test with statistic $Q(m)$ has the null hypothesis $H_0:\pmb{\rho}_1=...=\pmb{\rho}_m=\pmb{0}$ and $H_a: \pmb{\rho}_i\ne\pmb{0}$ for some $i \in \{1,...,m\}$. The test statistic assumes the form
$$Q_k(m)=T^2\sum_{l=1}^m\frac{1}{T-l} tr(\hat{\pmb{\Gamma}}^T_l\hat{\pmb{\Gamma}}^{-1}_0\hat{\pmb{\Gamma}}_l\hat{\pmb{\Gamma}}^{-1}_0)$$ and, under some regularity conditions, follows a chi-squared distribution with $k^2m$ degrees of freedom asymptotically.

Vector autoregressive models (VAR)

A multivariate time series ${\pmb{r}_t}$ is a VAR process of order 1, or VAR(1) for short, if it follows the model $$\pmb{r}_t=\pmb{\phi}_0+\pmb{\Phi}\pmb{r}_{t-1}+\pmb{a}_t,$$ where $\pmb{\phi}_0$ is a k-dimensional vector, $\pmb{\Phi}$ is a $k\times k$ matrix, and ${\pmb{a}_t}$ is a sequence of uncorrelated random vectors with mean zero and covariance matrix $\pmb{\Sigma}$, which is positive definite (generally assumed to be multivariate normal).

A positive-definite matrix is a symmetric matrix with all eigenvalues positive. Equivalently, for any nonzero vector $\pmb{b}$, we have $\pmb{b}^T\pmb{Ab}>0$. These types of matrices can be decomposed as $\pmb{A}=\pmb{P\Lambda P}^T$, where $\pmb{\Lambda}$ is a diagonal matrix consisting of the eigenvalues of $\pmb{A}$ and $\pmb{P}$ is a square matrix consisting of eigenvectors of $\pmb{A}$. These eigenvectors are orthogonal to each other, so $\pmb{P}$ is orthogonal, and this decomposition is referred to as the spectral decomposition. For a symmetric matrix $\pmb{A}$, there exists a lower triangular matrix $\pmb{L}$ with diagonal elements being 1 and a diagonal matrix $\pmb{G}$ such that $\pmb{A}=\pmb{LGL}^T$. If $\pmb{A}$ is positive definite, then the diagonal elements of $\pmb{G}$ are positive. In this case, we have $\pmb{A=L\sqrt{G}\sqrt{G}L}^T=\pmb{MM}^T$, where $\pmb{M=L\sqrt{G}}$ is a lower triangular matrix. This is called the Cholesky decomposition. Notice that this implies $\pmb{L}^{-1}\pmb{A}(\pmb{L}^{-1})^{T}=\pmb{G}$.

Reduced and structural form: In general the off-diagonal elements of the matrix $\pmb{\Sigma}$ show the concurrent relationship between $r_{1t}$ and $r_{2t}$, while the matrix $\pmb{\Phi}$ measures the dynamic dependence of $\pmb{r}_t$. This is called the reduced-form model because it does not show explicitly the concurrent dependence between the component series. The explicit expression of the concurrent relationship (for the last series, and hence any series by rearrangement) can be deduced by a simple linear transformation. Using the Cholesky decomposition (possible because $\pmb{\Sigma}$ is a positive-definite symmetric matrix) we can find a lower triangular matrix $\pmb{L}$ with unit diagonal elements such that $\pmb{\Sigma=LGL}^T$, where $\pmb{G}$ is a diagonal matrix. If we define $\pmb{b}_t = \pmb{L}^{-1}\pmb{a}_t$, then $E(\pmb{b}_t)=\pmb{L}^{-1}E(\pmb{a_t})=\pmb{0}$ and $Cov(\pmb{b}_t)=E(\pmb{b}_t\pmb{b}_t^T)=\pmb{L}^{-1}\pmb{\Sigma}(\pmb{L}^T)^{-1}=\pmb{G}$. Since $\pmb{G}$ is a diagonal matrix, the components of $\pmb{b}_t$ are uncorrelated.

Pre-multiplying the reduced-form with $\pmb{L}^{-1}$, to uncouple the equations, we get
$$\pmb{L}^{-1}\pmb{r}_t=\pmb{L}^{-1}\pmb{\phi}_0+\pmb{L}^{-1}\pmb{\Phi}\pmb{r}_{t-1}+\pmb{L}^{-1}\pmb{a}_t=\pmb{\phi}_0^*+\pmb{\Phi}^*\pmb{r}_{t-1}+\pmb{b}_t.$$
The last row of $\pmb{L}^{-1}$ has 1 as the last element, let it be $(w_{k1},w_{k2},..,w_{k,k-1},1)$, and hence the structural equation for the last ($k^{th}$) time series becomes:
$$r_{kt}+\sum_{i=1}^{k-1}w_{ki}r_{it}=\phi_{k,0}^*+\sum_{i=1}^k\Phi_{ki}^* r_{i,t-1}+b_{kt}.$$
This is possible because the covariance matrix of $\pmb{b}_t$ is diagonal, so its components are uncoupled. The reduced form is commonly used for two reasons - ease of estimation, and the fact that concurrent correlations cannot be used in forecasting.

Stationarity condition and moments of a VAR(1) model: All eigenvalues of $\pmb{\Phi}$ should be less than 1 in modulus for weak stationarity for $\pmb{r}_t$, provided the covariance matrix of $\pmb{a}_t$ exists. Further we have $\pmb{\Gamma}_l=\pmb{\Phi \Gamma}_{l-1}$, for $l>0$, where $\pmb{\Gamma}_j$ is the lag-j cross-covariance matrix of $\pmb{r}_t$. By repeated substitutions we get $\pmb{\Gamma}_l=\pmb{\Phi}^l \pmb{\Gamma}_0$. Further for $\pmb{\Upsilon}=\pmb{D}^{-1/2}\pmb{\Phi}\pmb{D}^{1/2}$, we get $\pmb{\rho}_l=\pmb{\Upsilon}^l\pmb{\rho}_0$. A VAR(p) model is generally converted to a VAR(1) model using companion matrix and then analyzed like a VAR(1) model.
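A small numpy/scipy sketch of the VAR(1) facts above: checking weak stationarity through the eigenvalues of $\pmb{\Phi}$ and obtaining $\pmb{\Gamma}_0$ from the discrete Lyapunov equation $\pmb{\Gamma}_0=\pmb{\Phi}\pmb{\Gamma}_0\pmb{\Phi}^T+\pmb{\Sigma}_a$ (the coefficient values are made up):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

phi = np.array([[0.5, 0.1],
                [0.4, 0.3]])
sigma_a = np.array([[1.0, 0.2],
                    [0.2, 0.5]])

# weak stationarity: all eigenvalues of Phi inside the unit circle
eigvals = np.linalg.eigvals(phi)
print(np.all(np.abs(eigvals) < 1))        # True -> weakly stationary

# Gamma_0 solves Gamma_0 = Phi Gamma_0 Phi^T + Sigma_a
gamma0 = solve_discrete_lyapunov(phi, sigma_a)
gamma1 = phi @ gamma0                     # Gamma_l = Phi Gamma_{l-1}
print(gamma1)
```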

To find the order of a VAR model one can generally use the multivariate equivalent of the PACF, with hypothesis tests on the successive residuals. The $i^{th}$ equation in this sequence is given by $\pmb{r}_t=\pmb{\phi}_0+\pmb{\Phi}_1\pmb{r}_{t-1}+...+\pmb{\Phi}_i\pmb{r}_{t-i}+\pmb{a}_t$. Parameters of these equations can be estimated by the OLS method. For the $i^{th}$ equation let the OLS estimates of the coefficients be $\hat{\pmb{\Phi}}_j^{(i)}$ and $\hat{\pmb{\phi}}_0^{(i)}$, where the superscript $(i)$ denotes the VAR(i) model. Then the residual is $\hat{\pmb{a}}_t^{(i)}=\pmb{r}_t-\hat{\pmb{\phi}}_0^{(i)}-\hat{\pmb{\Phi}}_1^{(i)}\pmb{r}_{t-1}-...-\hat{\pmb{\Phi}}_i^{(i)}\pmb{r}_{t-i}$. We then test hypotheses sequentially to identify the order of the VAR model. For the $i^{th}$ and $(i-1)^{th}$ equations we test $H_0:\pmb{\Phi}_i=0$ versus $H_a:\pmb{\Phi}_i\ne0$; the test statistic is
$$M(i)=-\Bigg(T-k-i-\frac{3}{2}\Bigg)ln\Bigg(\frac{|\hat{\Sigma}_i|}{|\hat{\Sigma}_{i-1}|}\Bigg).$$ Asymptotically, $M(i)$ is distributed as a chi-squared distribution with $k^2$ degrees of freedom, where $k$ is the dimensionality of the asset universe.

Equivalent AIC and BIC methods can also be employed. OLS or ML is generally used to estimate the parameters, the two methods being asymptotically equivalent. Once a model is fit, the residuals should be tested for model inadequacy using the $Q_k(m)$ statistic (with $k^2m-g$ degrees of freedom). Forecasting is similar to the univariate case. The impulse response function is the MA representation and can be derived to look at the decay rate. The MA equation is premultiplied by $\pmb{L}^{-1}$ to get the impulse response function of $\pmb{r}_t$ with respect to the orthogonal innovations $\pmb{b}_t$. Different orderings of the components may lead to different response functions, which is a drawback.

Vector moving-average models (VMA): A VMA(1) model is given by $$\pmb{r}_t=\pmb{\theta}_0+\pmb{a}_t-\pmb{\Theta}\pmb{a}_{t-1},$$ where $\pmb{\theta}_0$ is a k-dimensional vector and $\pmb{\Theta}$ is a $k\times k$ matrix. Like the univariate case, the cross-correlations cut off at lag 1 and can be used to identify the order. Estimation of VMA models is a lot more involved; the conditional or exact MLE approaches can be used.
Vector ARMA models (VARMA): In generalizing from univariate ARMA to vector ARMA, one encounters the issue of identifiability - the model may not be uniquely defined. Some constraints need to be imposed (structural specification). These models are rarely used.
Marginal models of components: Given the vector model, the component models are called the marginal models. For a k-dimensional ARMA(p,q) model, the marginal models are ARMA(kp,(k-1)p+q) models.

Unit root nonstationarity and cointegration: When modeling several unit-root nonstationary time series jointly, one may encounter cointegration, which arises when the series share a common trend or unit root, so that some linear combination of them is stationary. Let $h$ be the number of unit roots (or common trends) in the $k$-dimensional series $\pmb{x}_t$. Cointegration exists if $0<h<k$, and the quantity $k-h$ is called the number of cointegrating factors; these are the distinct linear combinations that are unit-root stationary. The linear combinations producing these unit-root stationary processes are called the cointegrating vectors. Two price series, if cointegrated, have a common underlying trend, and we lose this information if we take the first difference of each price series: differencing once per unit root preserves the information, but under cointegration there are fewer unit roots than nonstationary series, so differencing every series overdifferences. Overdifferencing leads to unit roots in the MA matrix polynomial and problems with invertibility and estimation. Also, cointegration can exist after adjusting for transaction costs and exchange-rate risk, which may be artificial.

Error correction form: To overcome the difficulty of noninvertible VARMA models one can use this form. A VARMA(p,q) model is $$\pmb{x}_t=\sum_{i=1}^{p}\pmb{\Phi}_i\pmb{x}_{t-i}+\pmb{a}_t-\sum_{j=1}^q\pmb{\Theta}_j\pmb{a}_{t-j}.$$
For $\Delta \pmb{x}_t=\pmb{x}_t-\pmb{x}_{t-1}$, we can subtract $\pmb{x}_{t-1}$ from both side of VARMA equation to get the error correction form. \[\Delta\pmb{x}_t=\pmb{\alpha\beta}^T\pmb{x}_{t-1}+\sum_{i=1}^{p-1}\pmb{\Phi}_i^* \Delta\pmb{x}_{t-i}+\pmb{a}_t-\sum_{j=1}^q\pmb{\Theta}_j\pmb{a}_{t-j}\]
where $\pmb{\alpha}$ and $\pmb{\beta}$ are $k\times m$ full-rank matrices, $k$ is the number of assets, and $m$ is the number of cointegrating factors ($m<k$). The term $\pmb{\alpha\beta}^T\pmb{x}_{t-1}$ is called the error-correction term, as it compensates for the over-differencing. Further, $\pmb{\beta}^T\pmb{x}_{t-1}$ is stationary. Also, $$\pmb{\Phi}_j^*=-\sum_{i=j+1}^{p}\pmb{\Phi}_i, \qquad j=1,...,p-1$$ $$\pmb{\alpha\beta}^T=\pmb{\Phi}_p+...+\pmb{\Phi}_1-\pmb{I}$$
The time series $\pmb{\beta}^T\pmb{x_t}$ is unit-root stationary, and the columns of $\pmb{\beta}$ are the cointegrating vectors of $\pmb{x}_t$.

Co-integrated VAR models: We focus on VAR models for their simplicity in estimation, to better understand cointegration here. For a k-dimensional VAR(p) model
\[\pmb{x}_t=\pmb{\mu}_t+\pmb{\Phi}_1\pmb{x}_{t-1}+...+\pmb{\Phi}_p\pmb{x}_{t-p}+\pmb{a}_t,\]
where $\pmb{\mu}_t=\pmb{\mu}_0+\pmb{\mu}_1t$. Or equivalently, using the backshift operator $B$,
\[(\pmb{I}-\pmb{\Phi}_1B-...-\pmb{\Phi}_pB^p)\pmb{x}_t=\pmb{\mu}_t+\pmb{a}_t.\]
The characteristic polynomial in the above is represented as $\pmb{\Phi}(B)$. For a unit-root nonstationary process, 1 is a root, making $|\pmb{\Phi}(1)|=0$. An error-correction form for this can be obtained by subtracting $\pmb{x}_{t-1}$ from both sides of the equation, giving
\[\Delta \pmb{x}_t=\pmb{\mu}_t+\pmb{\Pi}\pmb{x}_{t-1}+\pmb{\Phi}_1^*\Delta\pmb{x}_{t-1}+...+\pmb{\Phi}_{p-1}^*\Delta \pmb{x}_{t-p+1}+\pmb{a}_t \]
where $\pmb{\Pi}=\pmb{\Phi}_1+...+\pmb{\Phi}_p-\pmb{I}=-\pmb{\Phi}(1)$ and $\pmb{\Phi}_j^*=-\sum_{i=j+1}^p\pmb{\Phi}_i$ for $j=1,...,p-1$. Further, Rank($\pmb{\Pi}$)$=0$ implies that $\pmb{x}_t$ is not cointegrated, and Rank($\pmb{\Pi}$)$=k$ implies that the ECM is not informative and one studies $\pmb{x}_t$ directly. Finally, if $0<$Rank($\pmb{\Pi}$)$=m<k$ then one can write $\pmb{\Pi}=\pmb{\alpha\beta}^T$, where $\pmb{\alpha}$ and $\pmb{\beta}$ are both of rank $m$. We have the case of cointegration with $m$ linearly independent cointegrating vectors $\pmb{w}_t=\pmb{\beta}^T\pmb{x}_t$, and $k-m$ unit roots or common trends.

To obtain the $(k-m)\times 1$ common-trend vector $\pmb{y}_t=\pmb{\alpha}^T_{\bot}\pmb{x}_t$, we use the orthogonal complement matrix $\pmb{\alpha}_{\bot}$ of size $k \times (k-m)$ satisfying $\pmb{\alpha}^T_{\bot}\pmb{\alpha}=\pmb{0}$. To uniquely identify $\pmb{\alpha}$ and $\pmb{\beta}$ we require that $\pmb{\beta}^T=[\pmb{I}_m,\pmb{\beta}^T_1]$, where $\pmb{I}_m$ is the $m\times m$ identity matrix and $\pmb{\beta}_1$ is a $(k-m)\times m$ matrix. There are a few more constraints for the process $\pmb{w}_t=\pmb{\beta}^T\pmb{x}_t$ to be unit-root stationary.

The rank of $\pmb{\Pi}$ in the ECM is the number of cointegrating vectors. Thus, to test for cointegration, one can examine the rank of $\pmb{\Pi}$, which is the approach taken in the Johansen test.

Deterministic function: Limiting distributions of cointegration tests depend on the deterministic function $\pmb{\mu}_t$.
1) $\pmb{\mu}_t=\pmb{0}$: All components series of $\pmb{x}_t$ are $I(1)$ without drift and the stationary series $\pmb{w}_t=\pmb{\beta}^T\pmb{x}_t$ has mean zero.
2) $\pmb{\mu}_t=\pmb{\alpha}\pmb{c}_0$: Components of $\pmb{x}_t$ are $I(1)$ without drift, but $\pmb{w}_t$ has a nonzero mean $-\pmb{c}_0$; this is called the restricted constant case.
3) $\pmb{\mu}_t=\pmb{\mu}_0$: Component series are $I(1)$ with drift $\pmb{\mu}_0$ and $\pmb{w}_t$ may have a nonzero mean.
4)  $\pmb{\mu}_t=\pmb{\mu}_0+\pmb{\alpha c}_1 t$: Components of $\pmb{x}_t$ are $I(1)$ with drift $\pmb{\mu}_0$ and $\pmb{w}_t$ has a linear time trend, called restricted trend.
5) $\pmb{\mu}_t=\pmb{\mu}_0+\pmb{\mu}_1t$: Both the constant and trend are unrestricted. The components of $\pmb{x_t}$ are $I(1)$ and have a quadratic time trend and $\pmb{w}_t$ have a linear trend.

Maximum likelihood estimation: Estimation of the cointegrated VAR(p) model is quite involved. Both $\Delta\pmb{x}_t$ and $\pmb{x}_{t-1}$ are first regressed on the deterministic term and the lagged differences $\Delta\pmb{x}_{t-i}$, giving residual series $\pmb{u}_t$ and $\pmb{v}_t$, respectively. A relevant eigenvalue problem, when solved, leads to a likelihood which, when maximized, gives the estimates of the coefficients.

Johansen test for cointegration: This essentially tests the rank of the matrix $\pmb{\Pi}$, for a specified deterministic term $\pmb{\mu}_t$. The number of non-zero eigenvalues of $\pmb{\Pi}$ can be obtained if a consistent estimate of $\pmb{\Pi}$ is available. Looking at the ECM equation, it is clear that $\pmb{\Pi}$ is related to the covariance matrix between $\pmb{x}_{t-1}$ and $\Delta\pmb{x}_t$ after adjusting for the effects of the deterministic trend term and $\Delta\pmb{x}_{t-i}$ for $i=1,...,p-1$. Using canonical correlation analysis between the two adjusted series, the squared canonical correlations $\hat{\lambda}_i$ are computed. There are two versions of the Johansen test:

1) Trace cointegration test:- $H_0$: Rank($\pmb{\Pi}$) = $m$ versus $H_a$: Rank($\pmb{\Pi}$)>$m$. The Likelihood ratio (LR) statistic is
$$LK_{tr}(m) = -(T-p)\sum_{i=m+1}^{k}ln(1-\hat{\lambda}_i)$$
Due to the presence of unit-roots, the asymptotic distribution of statistic is not chi-squared, but a function of standard Brownian motions. Thus, the critical values must be obtained via simulation.

2) Sequential test:- $H_0$: Rank($\pmb{\Pi}$) = $m$ versus $H_a$: Rank($\pmb{\Pi}$)=$m+1$. The $LK$ ratio test statistic, called the maximum eigenvalue statistic, is
$$LK_{max}(m)=-(T-p)ln(1-\hat{\lambda}_{m+1})$$
Again, the critical values of the test statistics are nonstandard and must be evaluated via simulation.
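For completeness, a hedged sketch of running the Johansen test on simulated cointegrated series using `coint_johansen` from statsmodels (I believe the signature and the `lr1`/`lr2`/`cvt` attributes are as shown, but treat this as an assumption to verify against the statsmodels documentation):

```python
import numpy as np
from statsmodels.tsa.vector_ar.vecm import coint_johansen

# Two cointegrated random walks sharing one common trend (simulated data).
rng = np.random.default_rng(8)
T = 500
trend = np.cumsum(rng.normal(size=T))               # common unit-root trend
x1 = trend + rng.normal(scale=0.5, size=T)
x2 = 0.8 * trend + rng.normal(scale=0.5, size=T)
data = np.column_stack([x1, x2])

# det_order=0 corresponds to a constant term; k_ar_diff is the number of
# lagged differences in the VECM.
result = coint_johansen(data, det_order=0, k_ar_diff=1)
print(result.lr1)   # trace statistics for H0: rank <= m
print(result.cvt)   # corresponding critical values (90/95/99%)
print(result.lr2)   # maximum eigenvalue statistics
```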

Left out sections: 8.7, 8.8