
Friday, September 4, 2015

Tsay Ch10 - Multivariate Volatility Models and their applications

Generalize the univariate volatility models of Chapter 3 to the multivariate case, and simplify the dynamic relationship between the volatility processes of multiple asset returns, to address the curse of dimensionality and time-varying correlations. Consider a multivariate return series $r_t$ given by $r_t = \mu_t + a_t$, where $\mu_t = E(r_t|F_{t-1})$ is the conditional expectation of $r_t$ given the past information and $a_t$ is the shock at time $t$. The mean equation of $r_t$ can be modeled as a multivariate time series process (Ch 8), e.g. a simple VARMA process $\mu_t = \Gamma x_t + \sum_{i=1}^{p}\Phi_i r_{t-i} - \sum_{i=1}^{q}\Theta_i a_{t-i}$, where $x_t$ denotes the m-dimensional vector of exogenous (explanatory) variables with $x_{1t}=1$, $\Gamma$ is a $k\times m$ matrix, and $p$ and $q$ are non-negative integers.

The conditional covariance matrix of $a_t$ given $F_{t-1}$ is a $k\times k$ positive-definite matrix $\Sigma_t$ defined by $\Sigma_t = \mathrm{Cov}(a_t|F_{t-1})$. Multivariate volatility modeling is concerned with the time evolution of $\Sigma_t$. This is referred to as the volatility equation of $r_t$.

Exponentially weighted estimate

An equally weighted estimate of the unconditional covariance matrix of the innovations is $\hat{\Sigma} = \frac{1}{t-1}\sum_{j=1}^{t-1} a_j a_j^T$. To allow for a time-varying covariance matrix with emphasis on recent information, one can use exponential smoothing: $\hat{\Sigma}_t = \frac{1-\lambda}{1-\lambda^{t-1}}\sum_{j=1}^{t-1}\lambda^{j-1} a_{t-j} a_{t-j}^T$, where $0<\lambda<1$. For sufficiently large $t$ such that $\lambda^{t-1}\approx 0$, this becomes $\hat{\Sigma}_t = (1-\lambda)a_{t-1}a_{t-1}^T + \lambda\hat{\Sigma}_{t-1}$. This is called the EWMA estimate of the covariance matrix. The parameters, along with $\lambda$, can be jointly estimated by maximum likelihood, where the log-likelihood can be evaluated recursively. A $\lambda$ of 0.94 (roughly 30 days of effective memory) commonly comes out as optimal.
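A minimal numpy sketch of the recursive EWMA update above; the sample-covariance initialization and the 0.94 default are assumptions for illustration, not from the text:

import numpy as np

def ewma_cov(innovations, lam=0.94):
    """Recursive EWMA covariance: Sigma_t = (1 - lam) a_{t-1} a_{t-1}' + lam Sigma_{t-1}."""
    T, k = innovations.shape
    sigma = np.cov(innovations, rowvar=False)   # assumed starting value: sample covariance
    for t in range(T):
        a = innovations[t][:, None]             # k x 1 shock vector a_t
        sigma = (1.0 - lam) * (a @ a.T) + lam * sigma
    return sigma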

Some multivariate GARCH models

  1. Diagonal vectorization model (VEC): a generalization of the exponentially weighted moving-average approach. Each element follows a GARCH(1,1) type model. It may not produce a positive-definite covariance matrix and does not model the dynamic dependence between volatility series. 
  2. BEKK model: the Baba-Engle-Kraft-Kroner model (1995), built to guarantee the positive-definiteness constraint. It has too many parameters but does model dynamic dependence between the volatility series. 

Reparameterization

$\Sigma_t$ is reparameterized by making use of its symmetry.
  1. Use of correlations - The covariance matrix can be represented by the variances and the lower-triangle correlations, which can be jointly modeled. Specifically, we write $\Sigma_t$ as $D_t\rho_t D_t$, where $\rho_t$ is the conditional correlation matrix of $a_t$, and $D_t$ is a $k\times k$ diagonal matrix consisting of the conditional standard deviations of the elements of $a_t$. To model the volatility of $a_t$, it suffices to consider the conditional variances and correlation coefficients of $a_{it}$. This gives the $k(k+1)/2$ dimensional vector $\Xi_t = (\sigma_{11,t},\dots,\sigma_{kk,t},\varrho_t^T)^T$, where $\varrho_t$ is a $k(k-1)/2$ dimensional vector obtained by stacking the columns of the correlation matrix $\rho_t$, but using only the elements below the main diagonal, i.e. $\varrho_t = (\rho_{21,t},\dots,\rho_{k1,t}|\rho_{32,t},\dots,\rho_{k2,t}|\dots|\rho_{k,k-1,t})^T$. To illustrate, for $k=2$ we have $\varrho_t = \rho_{21,t}$ and $\Xi_t = (\sigma_{11,t},\sigma_{22,t},\rho_{21,t})^T$, a 3-dimensional vector. The approach has weaknesses: the likelihood function becomes complicated when the dimension is greater than 2, and the approach requires constrained maximization to ensure positive definiteness. 
  2. Cholesky decomposition - This requires no constrained maximization. It is an orthogonal transformation, so the resulting likelihood is extremely simple. Because $\Sigma_t$ is positive definite, there exist a lower triangular matrix $L_t$ with unit diagonal elements and a diagonal matrix $G_t$ with positive diagonal elements such that $\Sigma_t = L_t G_t L_t^T$. A feature of the decomposition is that the lower off-diagonal elements of $L_t$ and the diagonal elements of $G_t$ have close connections with linear regression. Using the Cholesky decomposition amounts to an orthogonal transformation from $a_t$ to $b_t$, where $b_{1t} = a_{1t}$, and $b_{it}$, for $1 < i \le k$, is defined recursively by the least-squares regression $a_{it} = q_{i1,t}b_{1t} + q_{i2,t}b_{2t} + \dots + q_{i(i-1),t}b_{(i-1)t} + b_{it}$, where $q_{ij,t}$ is the $(i,j)$th element of the lower triangular matrix $L_t$ for $1 \le j < i$. We can write this transformation as $a_t = L_t b_t$, where $L_t$ is the lower triangular matrix with unit diagonal elements. The covariance matrix of $b_t$ is $G_t$. The parameter vector relevant to volatility modeling under such a transformation becomes $\Xi_t = (g_{11,t},\dots,g_{kk,t},q_{21,t},q_{31,t},q_{32,t},\dots,q_{k1,t},\dots,q_{k(k-1),t})^T$, which is also a $k(k+1)/2$ dimensional vector. The likelihood function also simplifies drastically. There are several advantages of this transformation. First, $\Sigma_t$ can be kept positive definite simply by modeling $\ln(g_{ii,t})$. Second, the elements of $\Xi_t$ are simply the coefficients and residual variances of multiple linear regressions that orthogonalize the shocks to the returns. Third, the correlation coefficient between $a_{1t}$ and $a_{2t}$, which is $q_{21,t}\sqrt{\sigma_{11,t}/\sigma_{22,t}}$, is time-varying. Finally, we get $\sigma_{ij,t} = \sum_{v=1}^{j} q_{iv,t}q_{jv,t}g_{vv,t}$ for $j \le i$.
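A small numpy sketch of the $\Sigma_t = L_t G_t L_t^T$ decomposition for a single covariance matrix; the example matrix is made up, and numpy's Cholesky factor is rescaled to obtain the unit-diagonal $L$ and the diagonal $G$:

import numpy as np

sigma = np.array([[4.0, 1.2],
                  [1.2, 2.0]])        # hypothetical 2x2 covariance matrix

C = np.linalg.cholesky(sigma)         # sigma = C C^T with C lower triangular
d = np.diag(C)                        # square roots of the diagonal of G
L = C / d                             # unit-diagonal lower triangular L
G = np.diag(d ** 2)                   # diagonal G with positive elements

assert np.allclose(L @ G @ L.T, sigma)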

GARCH models for bivariate returns

Thursday, September 3, 2015

Multivariate Normal distribution

This was forthcoming, especially if you want to understand the Kalman filter.

A k-dimensional random vector $x = (x_1,\dots,x_k)^T$ follows a multivariate normal distribution with mean $\mu = (\mu_1,\dots,\mu_k)^T$ and positive-definite covariance matrix $\Sigma = [\sigma_{ij}]$ if its probability density function is $f(x|\mu,\Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}}\, e^{-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)}$. This is denoted by $x \sim N_k(\mu,\Sigma)$. A square matrix $A$ ($m\times m$) is a positive-definite matrix if $A$ is symmetric and all eigenvalues of $A$ are positive. Alternatively, $A$ is positive definite if for any nonzero m-dimensional vector $b$ we have $b^T A b > 0$. A positive-definite matrix $A$ has all eigenvalues positive and can be decomposed as $A = P\Lambda P^T$, where $\Lambda$ is a diagonal matrix consisting of the eigenvalues of $A$ and $P$ is an $m\times m$ matrix consisting of the m right eigenvectors of $A$; $P$ is an orthogonal matrix if the eigenvalues are distinct.

For a symmetric matrix $A$, there exists a lower triangular matrix $L$ with diagonal elements equal to 1 and a diagonal matrix $G$ such that $A = LGL^T$. If $A$ is positive definite, then the diagonal elements of $G$ are positive. In this case we can write $A = (L\sqrt{G})(L\sqrt{G})^T$, where $L\sqrt{G}$ is again a lower triangular matrix. Such a decomposition is called the Cholesky decomposition of $A$. It shows that a positive-definite matrix $A$ can be diagonalized as $L^{-1}A(L^T)^{-1} = L^{-1}A(L^{-1})^T = G$.

Let $c = (c_1,\dots,c_k)^T$ be a nonzero vector, and partition $x$ as $x = [x_1^T, x_2^T]^T$, with the first block of size $p$ and the second of size $k-p$, such that $\begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \sim N\left(\begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}\right)$. Some properties of $x$ are:

  1. $c^T x \sim N(c^T\mu, c^T\Sigma c)$: any nonzero linear combination of $x$ is univariate normal, and vice versa.
  2. The marginal distribution of $x_i$ is normal, $x_i \sim N(\mu_i, \Sigma_{ii})$.
  3. $\Sigma_{12} = 0$ if and only if $x_1$ and $x_2$ are independent.
  4. The variable $(x-\mu)^T\Sigma^{-1}(x-\mu)$ follows a chi-squared distribution with $k$ degrees of freedom.
  5. The conditional distribution of $x_1$ given $x_2 = b$ is also normal: $(x_1|x_2=b) \sim N(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(b-\mu_2),\ \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21})$.
Suppose that $x$, $y$, and $z$ are three random vectors such that their joint distribution is multivariate normal. In addition, assume that the diagonal block covariance matrix $\Sigma_{ww}$ is nonsingular for $w = x, y, z$, and that $\Sigma_{yz} = 0$. Then,

  1. $(x|y) \sim N\left(\mu_x + \Sigma_{xy}\Sigma_{yy}^{-1}(y-\mu_y),\ \Sigma_{xx} - \Sigma_{xy}\Sigma_{yy}^{-1}\Sigma_{yx}\right)$
  2. $(x|y,z) \sim N\left(E(x|y) + \Sigma_{xz}\Sigma_{zz}^{-1}(z-\mu_z),\ \mathrm{Var}(x|y) - \Sigma_{xz}\Sigma_{zz}^{-1}\Sigma_{zx}\right)$
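For concreteness, a small numpy sketch of these conditional-normal formulas (the function name and the block-index interface are mine, not from the text):

import numpy as np

def conditional_normal(mu, sigma, idx1, idx2, x2):
    """Mean and covariance of x1 | x2 for a jointly normal vector."""
    s11 = sigma[np.ix_(idx1, idx1)]
    s12 = sigma[np.ix_(idx1, idx2)]
    s22 = sigma[np.ix_(idx2, idx2)]
    w = s12 @ np.linalg.inv(s22)                  # Sigma_12 Sigma_22^{-1}
    cond_mean = mu[idx1] + w @ (x2 - mu[idx2])
    cond_cov = s11 - w @ s12.T
    return cond_mean, cond_cov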


Sunday, August 16, 2015

Tsay Ch 11 - State-space models and Kalman filter

Local trend model

For a univariate time series, the local trend model is $y_t = \mu_t + \epsilon_t$ and $\mu_{t+1} = \mu_t + \eta_t$, where both error terms are assumed normally distributed with distinct variances $\sigma_e^2$ and $\sigma_\eta^2$ respectively. Notice the first equation is the observed version of the second (trend) equation with added noise. This model can be used to analyze the realized volatility of an asset price if $\mu_t$ is taken to be the log volatility (which is not directly observable) and $y_t$ is the logarithm of realized volatility (which is observable, constructed from high-frequency transaction data with microstructure noise).

If there is no measurement error term in the first equation ($\sigma_e = 0$) this becomes an ARIMA(0,1,0) model. With the error term it is an ARIMA(0,1,1) model, which is also the simple exponential smoothing model. The form is $(1-B)y_t = (1-\theta B)a_t$, where $\theta$ and $\sigma_a^2$ are related to $\sigma_e^2$ and $\sigma_\eta^2$ as follows: $(1+\theta^2)\sigma_a^2 = 2\sigma_e^2 + \sigma_\eta^2$ and $\theta\sigma_a^2 = \sigma_e^2$. The quadratic equation for $\theta$ gives two solutions, and the one with $|\theta|<1$ is chosen. The reverse mapping is also possible for positive $\theta$. Both representations have pros and cons; the objective of the data analysis, substantive issues, and experience decide which to use.

Statistical Inference

Three types (using the example of reading a handwritten note):
  1. Filtering - recover the state variable $\mu_t$ given $F_t$ to remove the measurement errors from the data (figuring out the word you are reading based on knowledge accumulated from the beginning of the note).
  2. Prediction - forecast $\mu_{t+h}$ or $y_{t+h}$ for $h>0$ given $F_t$, where $t$ is the forecast origin (guess the next word).
  3. Smoothing - estimate $\mu_t$ given $F_T$, where $T>t$ (deciphering a particular word once you have read through the whole note).

The Kalman Filter

Let $\mu_{t|j} = E(\mu_t|F_j)$ and $\Sigma_{t|j} = \mathrm{Var}(\mu_t|F_j)$ be, respectively, the conditional mean and variance of $\mu_t$ given the information $F_j$. Similarly, $y_{t|j}$ denotes the conditional mean of $y_t$ given $F_j$. Furthermore, let $v_t = y_t - y_{t|t-1}$ and $V_t = \mathrm{Var}(v_t|F_{t-1})$ be the 1-step-ahead forecast error of $y_t$ given $F_{t-1}$ and its variance. Note that $\mathrm{Var}(v_t|F_{t-1}) = \mathrm{Var}(v_t)$, since the forecast error $v_t$ is independent of $F_{t-1}$. Further, $y_{t|t-1} = \mu_{t|t-1}$, giving $v_t = y_t - \mu_{t|t-1}$ and $V_t = \Sigma_{t|t-1} + \sigma_e^2$. Also, $E(v_t) = 0$ and $\mathrm{Cov}(v_t, y_j) = 0$ for $j<t$. The information sets satisfy $F_t = \{F_{t-1}, y_t\} = \{F_{t-1}, v_t\}$, hence $\mu_{t|t} = E(\mu_t|F_{t-1}, v_t)$ and $\Sigma_{t|t} = \mathrm{Var}(\mu_t|F_{t-1}, v_t)$.

One can show that $\mathrm{Cov}(\mu_t, v_t|F_{t-1}) = \Sigma_{t|t-1}$, giving $\begin{bmatrix}\mu_t \\ v_t\end{bmatrix}\Big| F_{t-1} \sim N\left(\begin{bmatrix}\mu_{t|t-1} \\ 0\end{bmatrix}, \begin{bmatrix}\Sigma_{t|t-1} & \Sigma_{t|t-1} \\ \Sigma_{t|t-1} & V_t\end{bmatrix}\right)$. Applying the multivariate normal theorem we get $\mu_{t|t} = \mu_{t|t-1} + (\Sigma_{t|t-1}V_t^{-1})v_t = \mu_{t|t-1} + K_t v_t$ and $\Sigma_{t|t} = \Sigma_{t|t-1} - \Sigma_{t|t-1}V_t^{-1}\Sigma_{t|t-1} = \Sigma_{t|t-1}(1-K_t)$, where $K_t = \Sigma_{t|t-1}V_t^{-1}$ is referred to as the Kalman gain, which is the regression coefficient of $\mu_t$ on $v_t$, governing the contribution of the new shock $v_t$ to the state variable $\mu_t$. To predict $\mu_{t+1}$ given $F_t$ we have $\mu_{t+1}|F_t \sim N(\mu_{t|t}, \Sigma_{t|t} + \sigma_\eta^2)$. Once the new data point $y_{t+1}$ is observed, the above procedure can be repeated (once $\sigma_e$ and $\sigma_\eta$ are estimated, generally by maximum likelihood). This is the famous Kalman filter algorithm (Kalman 1960). The choice of priors $\mu_{1|0}$ and $\Sigma_{1|0}$ requires some attention.
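A minimal sketch of this scalar Kalman recursion for the local trend model (the variable names and the prior arguments mu0, p0 are assumptions):

import numpy as np

def local_trend_filter(y, sigma_e2, sigma_eta2, mu0, p0):
    """Kalman filter for y_t = mu_t + e_t, mu_{t+1} = mu_t + eta_t."""
    mu, p = mu0, p0                    # prior mean and variance of mu_1 given F_0
    filtered = []
    for obs in y:
        v = obs - mu                   # 1-step-ahead forecast error v_t
        V = p + sigma_e2               # its variance V_t
        K = p / V                      # Kalman gain K_t
        mu = mu + K * v                # filtered mean mu_{t|t}
        p = p * (1.0 - K)              # filtered variance Sigma_{t|t}
        filtered.append(mu)
        p = p + sigma_eta2             # prediction step: Sigma_{t+1|t}
    return np.array(filtered)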

Properties of forecast error -
State error recursion -
State smoothing -
Missing Values -
Effect of Initialization -
Estimation - 

Regression assumptions

Everybody in finance knows that 90% of quant work is 'REGRESSION', and mostly LINEAR. The results of a linear regression are only as good as our understanding of its assumptions. For the univariate case we write $y_t = \alpha + \beta x_t + \epsilon_t$, where the estimation is straightforward. The interesting case is multivariate regression, where we write $Y = X\beta + \epsilon$. To estimate the parameters we use the normal equations to get $\hat\beta = (X^T X)^{-1}X^T Y$. Now, how good is this estimate? We want these estimates to be:
unbiased - The expected value of the estimate is the true value.
consistent - With more observations the distribution of the estimate becomes more concentrated near true value.
efficient - fewer observations are required to establish the true value at a given confidence level.
asymptotically normal - With a lot of observations the distribution of the estimate is a normal distribution.

OLS is consistent when the regressors are exogenous and there is no perfect multicollinearity, and it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, OLS provides minimum-variance, mean-unbiased estimates when the errors have finite variances. Assuming the errors have a normal distribution, OLS is the same as MLE. An expanded version of OLS is the multi-fractional order estimator (like the Kalman filter).

The 'random design' paradigm treats the regressors xi as random and sampled together with yi from some population. The 'fixed design' paradigm treats X as known constants and y is sampled conditionally on the values of X as in an experiment. Practically, the distinction is unimportant and results in the same formula for estimation. 

Assumptions

  1. OLS minimizes error in dependent variable y only and hence assumes there is no error in x.
  2. The functional dependence being modeled is valid.
  3. Strict exogeneity - The errors in the regression have conditional mean zero: $E[\epsilon|X] = 0$, which implies that the errors have mean zero, $E[\epsilon] = 0$, and that the regressors are uncorrelated with the errors, $E[X^T\epsilon] = 0$. If this fails, the OLS estimates are invalid; in that case use the method of instrumental variables.
  4.  No linear dependence - The regressors in $X$ must be linearly independent, i.e. $X$ must have full rank almost surely. Sometimes we also assume that the regressors have finite moments up to second order, in which case the matrix $X^T X$ will be finite and positive semi-definite. If this is violated the regressors are called perfectly multicollinear; $\beta$ can't be estimated, though prediction of $y$ is still possible.
  5. Spherical errors - It is assumed that $\mathrm{Var}[\epsilon|X] = \sigma^2 I_n$. If violated, OLS estimates are still valid but no longer efficient. If the error terms do not have the same variance, i.e. they are not homoscedastic, weighted least squares is used. If there is autocorrelation between error terms, generalized least squares is used.
  6. Normality - It is sometimes additionally assumed that errors have normal distribution. This is not required. Under this assumption OLS is equivalent to MLE and is asymptotically efficient in the class of all regular estimators.
A certain degree of correlation between the observations is very common, under which OLS and WLS are inefficient. GLS is the right thing to do: $Y = X\beta + \epsilon$, $E[\epsilon|X] = 0$, $\mathrm{Var}[\epsilon|X] = \Omega$. GLS estimates $\beta$ by minimizing the squared Mahalanobis length of the residual vector, giving $\hat\beta = (X^T\Omega^{-1}X)^{-1}X^T\Omega^{-1}Y$. The GLS estimator is unbiased, consistent, efficient, and asymptotically normal. It is equivalent to applying OLS to a linearly transformed version of the data, which standardizes and de-correlates the regressors. WLS is a special case of GLS.

To estimate GLS we use Feasible Generalized Least squares (FGLS) in two steps:
1) Model is estimated using OLS (consistent but inefficient) estimator, and the residuals are used to build a consistent estimator of the error covariance matrix;
2) Using these we estimate GLS.

FGLS is preferred only for large sample size. For small sample size it is better to stick to OLS. FGLS is not always consistent for small sample.
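A minimal numpy sketch of the two-step FGLS idea, under the simplifying assumption of a purely diagonal (heteroscedastic) error covariance so that the second step reduces to weighted least squares:

import numpy as np

def fgls_diagonal(X, y):
    """Step 1: OLS; step 2: GLS with an estimated diagonal error covariance."""
    beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_ols
    w = 1.0 / np.maximum(resid ** 2, 1e-8)        # crude per-observation variance estimates
    Xw = X * w[:, None]                           # diag(w) X
    beta_fgls = np.linalg.solve(Xw.T @ X, Xw.T @ y)
    return beta_fgls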

Saturday, August 15, 2015

Tsay Ch9 - Principal Component Analysis and Factor Models

Dimension reduction is essential to search for the underlying structure of the assets - called factors.

Three types of factor models -
1) Macroeconomic factor models - GDP growth, interest rates, inflation, unemployment - observable factors using regression
2) Fundamental factor models - firm size, book and market values, industrial classification.
3) Statistical factor models - non-observable  or latent variables e.g. PCA

General Factor Model

For $m$ factors, $k$ assets, and $T$ time periods, let $r_{it}$ be the return of asset $i$ in time period $t$. The factor model is
$r_t = \alpha + \beta f_t + \epsilon_t, \quad t = 1,\dots,T$
where $\beta$ is a $k\times m$ factor-loading matrix and $\epsilon_t$ is the error vector with $\mathrm{Cov}(\epsilon_t) = D = \mathrm{diag}[\sigma_1^2,\dots,\sigma_k^2]$, a $k\times k$ diagonal matrix. The covariance matrix of the returns $r_t$ is then given by:
$\mathrm{Cov}(r_t) = \beta\Sigma_f\beta^T + D$

Macroeconomic factor models

Macroeconomic factors are observable. We can convert the general factor model into a multiple linear regression setup and estimate the factors. This estimation does not impose the constraint that the $\epsilon_{it}$ be uncorrelated, so it may not be efficient in general. The best known single-factor model is the market model (Sharpe 1970). The $R^2$ can reach up to 50%, showing the significance of the common market factor. One simple trick to compare a factor-based covariance matrix with the sample covariance matrix is to use the global minimum variance portfolio (GMVP). For a given covariance matrix $\Sigma$, the GMVP $\omega$ solves $\min_\omega \sigma_p^2 = \omega^T\Sigma\omega$ subject to $\omega^T\mathbf{1} = 1$, with solution
$\omega = \frac{\Sigma^{-1}\mathbf{1}}{\mathbf{1}^T\Sigma^{-1}\mathbf{1}}$.
It is also important to verify that the residual covariance matrices do not have large off-diagonal elements, to fit the factor model criteria.
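A small numpy sketch of the GMVP formula above (the function name is mine):

import numpy as np

def gmvp_weights(sigma):
    """Global minimum variance portfolio: w = Sigma^{-1} 1 / (1' Sigma^{-1} 1)."""
    ones = np.ones(sigma.shape[0])
    w = np.linalg.solve(sigma, ones)   # Sigma^{-1} 1
    return w / (ones @ w)              # normalize so the weights sum to one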

Ross (1986) considers a multi-factor model consisting of unexpected changes or surprises (e.g. residuals after fitting a VAR(3) model to seasonally adjusted CPI and unemployment growth numbers). The explanatory power is low.

Fundamental factor models

The BARRA factor method treats the observed asset-specific fundamentals as the factor betas $\beta_i$, and estimates the factor $f_t$ at each time index $t$ via regression. Fama and French construct their factors from hedge portfolios that depend on the fundamentals. The BARRA factor model is $\tilde r_t = \beta f_t + \epsilon_t$, where $\tilde r_t$ is the mean-corrected return vector. We need a WLS setup since the regression errors are not homogeneous; the estimate is $\hat f_t = (\beta^T D^{-1}\beta)^{-1}\beta^T D^{-1}\tilde r_t$. We estimate the diagonal covariance matrix of the errors from OLS first and then use it to estimate the factors with the WLS equation. Cross-correlations in the errors are ignored. The diagonal covariance matrix of the final errors $\hat D_g$ and the covariance matrix of the estimated factor realizations $\hat\Sigma_f$ can be used to derive the covariance matrix of the original returns as $\mathrm{Cov}(r_t) = \beta\hat\Sigma_f\beta^T + \hat D_g$. In practice, the sample means of returns are not significantly different from zero, so one may not need to remove the sample mean before fitting the BARRA factor model.

Fama-French approach used a two-step procedure. First, they sorted the assets based on the value of three fundamentals (market excess returns, small vs big cap, value vs growth stocks). They formed the hedge portfolio which is long top quintile and short the bottom quintile. The observed return on this hedge portfolio is the factor realization for the given asset. Finally, given the factor realizations calculate betas using regression.

Principal component analysis

We look for linear combinations that explain the most variance and are orthogonal to each other, with each weight vector normalized to unit length. This is done on the covariance or correlation matrix, which is non-negative definite and hence has a spectral decomposition. For the covariance matrix, the variance explained by the $i$th component is $\lambda_i/\sum_j\lambda_j$, which becomes $\lambda_i/k$ for a correlation matrix, since $\mathrm{tr}(\rho) = k$.
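A minimal sketch of PCA via the spectral decomposition of the covariance (or correlation) matrix, returning the fraction of variance explained by each component:

import numpy as np

def pca(returns, use_corr=False):
    """Principal components of a (T, k) return matrix."""
    x = returns - returns.mean(axis=0)
    mat = np.corrcoef(x, rowvar=False) if use_corr else np.cov(x, rowvar=False)
    eigval, eigvec = np.linalg.eigh(mat)           # ascending eigenvalues
    order = np.argsort(eigval)[::-1]               # sort descending
    eigval, eigvec = eigval[order], eigvec[:, order]
    explained = eigval / eigval.sum()              # lambda_i / sum_j lambda_j
    return eigval, eigvec, explained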

Statistical factor analysis

The aim is to identify a few factors that can account for most of the variation in the covariance or correlation matrix of the data. The assumption of no serial correlation is all right for low-frequency data but not accurate at higher frequencies; serial correlations should first be removed parametrically. We then construct an orthogonal factor model. Since both the loadings and the factors are unobservable, it differs from the other factor models. For the statistical factor model $r_t - \mu = \beta f_t + \epsilon_t$, we assume $E[f_t] = 0$, $\mathrm{Cov}[f_t] = I_m$, $E[\epsilon_t] = 0$, $\mathrm{Cov}[\epsilon_t] = D = \mathrm{diag}(\sigma_1^2,\dots,\sigma_k^2)$ and $E[f_t\epsilon_t^T] = 0$. The loadings and factors are not uniquely determined. The model can be estimated either by the principal component method or by maximum likelihood, with the number of factors specified. Factor rotation (e.g. the varimax criterion) can be used for interpretation.

Left out sections: 9.6

Tuesday, July 7, 2015

Tsay Ch4 - Nonlinear Models and their applications

I was going to leave out this chapter, but maybe not. I need to see what regime switching is all about.

Remember that a purely stochastic time series $x_t$ is linear if it can be decomposed into a Wold form; anything else is nonlinear. A fully general nonlinear function is impractical because of too many parameters, so we restrict the model to the form $x_t = g(F_{t-1}) + h(F_{t-1})\epsilon_t$, where $\epsilon_t = a_t/\sigma_t$ is the standardized innovation. ARMA is a linear model for the mean, while ARCH/GARCH models are nonlinear models of the variance (because they model the square of volatility and not volatility). For a stationary volatility series, shocks are uncorrelated but dependent. The models in this chapter extend to nonlinear models of the mean.

The following four nonlinear models let the conditional mean $\mu_t$ evolve over time according to some simple parametric nonlinear function - bilinear, threshold autoregressive, state dependent, Markov switching. More recent nonlinear models are more data driven - nonlinear state-space, functional coefficient autoregressive, nonlinear additive autoregressive, multivariate adaptive regression splines. Finally, nonparametric methods like kernel regression and artificial neural networks have also been used. Test statistics - parametric (based on the Lagrange multiplier or likelihood) and nonparametric (based on higher-order spectra or the correlation dimension) - are also discussed.

Nonlinear models: Some basic ones for financial time series.

1) Bilinear Model: The linear model is simply the first-order Taylor series expansion of the $f(\cdot)$ function. As such, a natural extension to nonlinearity is to employ the second-order terms in the expansion to improve the approximation. Hence the model is
$x_t = c + \sum_{i=1}^{p}\phi_i x_{t-i} - \sum_{j=1}^{q}\theta_j a_{t-j} + \sum_{i=1}^{m}\sum_{j=1}^{s}\beta_{ij} x_{t-i}a_{t-j} + a_t$.
This is generally analyzed by putting the model in state-space form. ARCH models generally seem to fit the data better, when quadratic terms of innovations are involved.

2) Threshold Autoregressive Model (TAR): Motivated by asymmetry in declining and rising patterns, and hence piecewise linear (not in time space but in threshold space). This can be used to model regimes, e.g. here is a two-regime model
$x_t = \begin{cases} \phi_1^{(1)} x_{t-1} + a_t, & \text{if } x_{t-l} < \delta \\ \phi_1^{(2)} x_{t-1} + a_t, & \text{if } x_{t-l} \ge \delta \end{cases}$
Here $l$ is the delay parameter and $\delta$ is the threshold, which separates the two regimes. There are several interesting characteristics of TAR models:

a) The process $x_t$ is geometrically ergodic and stationary under the conditions $\phi_1^{(1)} < 1$, $\phi_1^{(2)} < 1$, and $\phi_1^{(1)}\phi_1^{(2)} < 1$. A process is ergodic if its statistical properties (such as mean and variance) can be deduced from a single, sufficiently long sample (realization) of the process. Ergodic theorem: the sample mean $\bar x = (\sum_{t=1}^{T} x_t)/T$ of $x_t$ converges to the mean of $x_t$. This is regarded as the counterpart of the central limit theory for the iid case. 

b) The series exhibits an asymmetric increasing and decreasing pattern due to the different coefficients in the two regimes; hence the regimes will have different numbers of observations, and the series is not time-reversible. 

c) The model in the example has no constant terms, but $E(x_t)$ is not zero. $E(x_t)$ is a weighted average of the conditional means of the two regimes, which are nonzero; the weight of each regime is simply the probability that $x_t$ is in that regime under its stationary distribution. Hence for a TAR model to have zero mean, nonzero constant terms are needed in some regimes, unlike in a stationary linear model where a nonzero constant implies a nonzero mean for $x_t$.

Properties of TAR models are hard to obtain, but estimation is not difficult. US unemployment can be modeled using TAR models, where the regime changes after a sudden increase in unemployment triggers monetary intervention. TAR models are also used to model asymmetric responses in volatility between positive and negative returns, and to study arbitrage trading between index futures and cash prices. 
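A small simulation sketch of the two-regime TAR(1) model above, using the asymmetric coefficients $\phi_1^{(1)} = -1.5$ and $\phi_1^{(2)} = 0.5$; the delay of 1 and threshold of 0 are choices for illustration:

import numpy as np

def simulate_tar(n, phi1=-1.5, phi2=0.5, delta=0.0, seed=0):
    """Two-regime TAR(1) with delay 1 and threshold delta."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    for t in range(1, n):
        phi = phi1 if x[t - 1] < delta else phi2   # pick the regime
        x[t] = phi * x[t - 1] + rng.standard_normal()
    return x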

3) Smooth transition AR model (STAR): The mean is discontinuous for TAR, so using logistic, exponential or cumulative functions it can be made continuous and differentiable. These are hard to estimate with large standard errors. 

4) Markov switching autoregressive Model (MSA): Using probability of switching, emphasizing aperiodic transition between states, Hamilton (1989). 
$x_t = \begin{cases} c_1 + \sum_{i=1}^{p}\phi_{1,i} x_{t-i} + a_{1t}, & \text{if } s_t = 1 \\ c_2 + \sum_{i=1}^{p}\phi_{2,i} x_{t-i} + a_{2t}, & \text{if } s_t = 2 \end{cases}$
where $s_t$ denotes the state and follows a first-order Markov chain with transition probability matrix (row $i$ holds the transition probabilities out of state $i$)
$\begin{bmatrix} 1-w_1 & w_1 \\ w_2 & 1-w_2 \end{bmatrix}$
The expected duration of the process staying in state $i$ is $1/w_i$. The TAR model uses a deterministic scheme of transition while MSA uses a stochastic scheme. Hence, in MSA one is never certain about which state $x_t$ belongs to. This has an important implication: MSA forecasts are always some linear combination of the two states, while a TAR model picks a particular state. These models are estimated using EM or MCMC algorithms. The model can be further generalized by modeling the transition probabilities as logistic, probit, or some other function of explanatory variables available at time $t-1$. Markov switching can hence be used to choose among many models. Taking the example of US quarterly GNP and fitting an MSA model, we see the following observations (a small simulation sketch follows the list):
a) The states correspond to growth and contraction.
b) The large standard deviation of state 2 implies that there are relatively fewer observations in the contraction state.
c) It is more likely for US GNP to get out of contraction period than to jump into one.
d) Expected duration of expansion is 11 quarters while contraction is 4 quarters.
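As referenced above, a small simulation sketch of a two-state MSA(1) model; all parameter values are made up for illustration:

import numpy as np

def simulate_msa(n, c=(1.0, -0.5), phi=(0.3, 0.4), sd=(0.8, 2.0), w=(0.1, 0.25), seed=0):
    """Two-state Markov switching AR(1): the chain leaves state i with probability w[i]."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    s = 0                                          # start in state 1 (index 0)
    for t in range(1, n):
        if rng.random() < w[s]:                    # switch regime
            s = 1 - s
        x[t] = c[s] + phi[s] * x[t - 1] + sd[s] * rng.standard_normal()
    return x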

5) Non-parametric methods: The essence is smoothing, as they are highly data dependent and may otherwise overfit. If we had many realizations of $y_t$ for a given $x_t$, an asymptotically accurate value of the functional dependence $m(x)$ of $Y$ on $X$ at $X = x$ would be given by the average of the $y_t$. In a time series only one observation of $y_t$ is available for a given $x_t$. But if the functional form is sufficiently smooth, then the values of $Y$ for which $X_t \approx x$ continue to provide an accurate approximation of $m(x)$. We can use a weighted average to estimate $m(x)$. Hence,
$\hat m(x) = \frac{1}{T}\sum_{t=1}^{T} w_t(x) y_t$

Kernel Regression: The kernel $K(x)$ is the weighting function, satisfying $K(x) \ge 0$ and $\int K(z)\,dz = 1$. To rescale the distance measure, one can use a bandwidth $h>0$, making the kernel $K_h(x) = K(x/h)/h$ with $\int K_h(z)\,dz = 1$. The weight function simply becomes:
$w_t(x) = \frac{K_h(x - x_t)}{\frac{1}{T}\sum_{t=1}^{T} K_h(x - x_t)}$, which gives the Nadaraya-Watson kernel estimator (1964). The kernels chosen are generally Gaussian or Epanechnikov. If $h$ is very small, the weights focus on the few observations in the neighborhood of $x$. If $h$ is very large, the weights spread over a larger neighborhood of $x$. Bandwidth selection is done either by minimizing the mean integrated squared error (using some preliminary smoothing estimates) or by leave-one-out cross-validation. Fan and Yao (2003) give the bandwidth for the Gaussian kernel as $1.06\,\sigma_x T^{-0.2}$.
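A minimal sketch of the Nadaraya-Watson estimator with a Gaussian kernel (grid points, data and bandwidth are all passed in; nothing here is tied to a particular dataset):

import numpy as np

def nadaraya_watson(x_grid, x, y, h):
    """m_hat(x0) = sum_t K_h(x0 - x_t) y_t / sum_t K_h(x0 - x_t), Gaussian kernel."""
    x_grid = np.asarray(x_grid, dtype=float)
    est = np.empty_like(x_grid)
    for i, x0 in enumerate(x_grid):
        w = np.exp(-0.5 * ((x0 - x) / h) ** 2)     # kernel weights (constant factor cancels)
        est[i] = np.sum(w * y) / np.sum(w)
    return est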

Local Linear Regression Method:- Takes the weighted kernel function but minimizes the least square error to estimate the parameters of the equation.
Functional coefficient AR model:- Treat each AR equation coefficient as function and estimate, using kernels for functions. Shown to have better 1-step ahead forecasting.
Nonlinear additive AR model:- To avoid curse of dimensionality use GAM style AR model.
Nonlinear State-Space model:- instead of linear matrix evolution, use nonlinear evolution of states, using MCMC.

Left out sections: 4.1.9, 4.2-4.5

Tuesday, June 30, 2015

Tsay Ch8 - Multivariate Time Series Analysis and Its Applications

How markets are interrelated is important to understand the lead-lag relationship and under what circumstances they reverse or do not work. Various previous methods can be applied directly to the vector case, but some need attention.

Weak stationarity and cross-correlation matrices

A k-dimensional time series $r_t = [r_{1t},\dots,r_{kt}]^T$ is weakly stationary if its first and second moments are time-invariant: $\mu = E(r_t)$ and $\Gamma_0 = E[(r_t-\mu)(r_t-\mu)^T]$. Let $D$ be a $k\times k$ diagonal matrix consisting of the standard deviations of the $r_{it}$. The lag-zero cross-correlation matrix is defined as $\rho_0 = D^{-1}\Gamma_0 D^{-1}$, which is the correlation matrix. The lag-$l$ cross-covariance matrix of $r_t$ is defined as $\Gamma_l = E[(r_t-\mu)(r_{t-l}-\mu)^T]$. For a weakly stationary series, the cross-covariance matrix $\Gamma_l$ is a function of $l$ only, not of the time index $t$. The lag-$l$ cross-correlation matrix (CCM) is defined as $\rho_l = D^{-1}\Gamma_l D^{-1}$. To understand this better, notice
$\rho_{ij}(l) = \frac{\Gamma_{ij}(l)}{\sqrt{\Gamma_{ii}(0)\Gamma_{jj}(0)}} = \frac{\mathrm{Cov}(r_{it}, r_{j,t-l})}{\mathrm{std}(r_{it})\,\mathrm{std}(r_{jt})}$
which is the correlation between $r_{it}$ and $r_{j,t-l}$. If $\rho_{ij}(l) \ne 0$ and $l>0$, we say that the series $r_{jt}$ leads the series $r_{it}$ at lag $l$. Similarly, if $\rho_{ji}(l) \ne 0$ and $l>0$, we say that the series $r_{it}$ leads the series $r_{jt}$ at lag $l$. The diagonal element $\rho_{ii}(l)$ is simply the lag-$l$ autocorrelation coefficient of $r_{it}$.

Some remarks (for $l>0$):
1) In general $\rho_{ij}(l) \ne \rho_{ji}(l)$ for $i \ne j$, because they measure two different lag relationships, implying that $\Gamma_l$ and $\rho_l$ are in general not symmetric.
2) It is easy to see that $\Gamma_l = \Gamma_{-l}^T$ and $\rho_l = \rho_{-l}^T$. Hence it suffices in practice to consider the cross-correlation matrices $\rho_l$ for $l \ge 0$.

For the cross-correlation matrices $\{\rho_l\,|\,l = 0,1,\dots\}$, the diagonal elements $\{\rho_{ii}(l)\,|\,l = 0,1,\dots\}$ are the autocorrelation function of $r_{it}$, the off-diagonal element $\rho_{ij}(0)$ measures the concurrent linear relationship between $r_{it}$ and $r_{jt}$, and for $l>0$ the off-diagonal element $\rho_{ij}(l)$ measures the linear dependence of $r_{it}$ on the past value $r_{j,t-l}$. Depending on the values in these matrices one can identify -
1) no linear relationship ($\rho_{ij}(l) = \rho_{ji}(l) = 0$ for all $l \ge 0$),
2) concurrent correlation ($\rho_{ij}(0) \ne 0$),
3) no lead-lag relationship ($\rho_{ij}(l) = \rho_{ji}(l) = 0$ for all $l > 0$),
4) unidirectional relationship ($\rho_{ij}(l) = 0$ for all $l > 0$, but $\rho_{ji}(v) \ne 0$ for some $v > 0$), or
5) feedback relationship ($\rho_{ij}(l) \ne 0$ for some $l > 0$ and $\rho_{ji}(v) \ne 0$ for some $v > 0$).

Sample cross-correlation matrices can be estimated using $\hat\rho_l = \hat D^{-1}\hat\Gamma_l\hat D^{-1}$, where
$\hat\Gamma_l = \frac{1}{T}\sum_{t=l+1}^{T}(r_t - \bar r)(r_{t-l} - \bar r)^T, \quad l \ge 0$. Bootstrapping can be used to get confidence intervals for finite samples.
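A small numpy sketch of the sample lag-$l$ CCM $\hat\rho_l = \hat D^{-1}\hat\Gamma_l\hat D^{-1}$, normalizing by $T$ as in the formula above:

import numpy as np

def sample_ccm(r, lag):
    """Sample lag-l cross-correlation matrix of a (T, k) return series."""
    r = np.asarray(r, dtype=float)
    x = r - r.mean(axis=0)
    T = x.shape[0]
    gamma_l = (x[lag:].T @ x[:T - lag]) / T        # hat Gamma_l
    d_inv = np.diag(1.0 / x.std(axis=0))           # hat D^{-1}
    return d_inv @ gamma_l @ d_inv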

Multivariate Portmanteau tests: also called the multivariate Ljung-Box test, with statistic $Q_k(m)$, null hypothesis $H_0: \rho_1 = \dots = \rho_m = 0$, and $H_a: \rho_i \ne 0$ for some $i \in \{1,\dots,m\}$. The test statistic assumes the form
$Q_k(m) = T^2\sum_{l=1}^{m}\frac{1}{T-l}\mathrm{tr}\left(\hat\Gamma_l^T\hat\Gamma_0^{-1}\hat\Gamma_l\hat\Gamma_0^{-1}\right)$ and under some regularity conditions it asymptotically follows a chi-squared distribution with $k^2 m$ degrees of freedom.

Vector autoregressive models (VAR)

A multivariate time series $r_t$ is a VAR process of order 1, or VAR(1) for short, if it follows the model $r_t = \phi_0 + \Phi r_{t-1} + a_t$, where $\phi_0$ is a k-dimensional vector, $\Phi$ is a $k\times k$ matrix, and $a_t$ is a sequence of serially uncorrelated random vectors with mean zero and covariance matrix $\Sigma$, which is positive definite (and generally assumed to be multivariate normal).

A positive definite matrix is a symmetric matrix with all eigenvalues positive. Also, for any nonzero vector $b$, we have $b^T A b > 0$. These matrices can be decomposed as $A = P\Lambda P^T$, where $\Lambda$ is a diagonal matrix consisting of the eigenvalues of $A$ and $P$ is a square matrix consisting of the eigenvectors of $A$. These eigenvectors are orthogonal to each other, so $P$ is an orthogonal matrix, and this decomposition is referred to as the spectral decomposition. For a symmetric matrix $A$, there exists a lower triangular matrix $L$ with diagonal elements equal to 1 and a diagonal matrix $G$ such that $A = LGL^T$. If $A$ is positive definite, then the diagonal elements of $G$ are positive. In this case, we have $A = L\sqrt{G}\sqrt{G}L^T = MM^T$, where $M = L\sqrt{G}$ is a lower triangular matrix. This is called the Cholesky decomposition. Notice that this implies $L^{-1}A(L^{-1})^T = G$.

Reduced and Structural form: In general the off-diagonal elements of the matrix $\Sigma$ show the concurrent relationship between $r_{1t}$ and $r_{2t}$, while the matrix $\Phi$ measures the dynamic dependence of $r_t$. This is called the reduced-form model because it does not show explicitly the concurrent dependence between the component series. The explicit-form expression of the concurrent relationship (for the last series, and hence any series by rearrangement) can be deduced by a simple linear transformation. Using the Cholesky decomposition (possible because $\Sigma$ is a positive definite symmetric matrix) we can find a lower triangular matrix $L$ with unit diagonal elements such that $\Sigma = LGL^T$, where $G$ is a diagonal matrix. If we define $b_t = L^{-1}a_t$, then $E(b_t) = L^{-1}E(a_t) = 0$ and $\mathrm{Cov}(b_t) = E(b_t b_t^T) = L^{-1}\Sigma(L^{-1})^T = G$. Since $G$ is a diagonal matrix, the components of $b_t$ are uncorrelated.

Pre-multiplying the reduced form by $L^{-1}$, to uncouple the equations, we get
$L^{-1}r_t = L^{-1}\phi_0 + L^{-1}\Phi r_{t-1} + L^{-1}a_t = \phi_0^* + \Phi^* r_{t-1} + b_t$.
The last row of $L^{-1}$ has 1 as its last element; let it be $(w_{k1}, w_{k2},\dots,w_{k,k-1}, 1)$, and hence the structural equation for the last ($k$th) time series becomes:
$r_{kt} + \sum_{i=1}^{k-1} w_{ki} r_{it} = \phi_{k,0}^* + \sum_{i=1}^{k}\Phi_{ki}^* r_{i,t-1} + b_{kt}$.
This is possible because the covariance matrix of $b_t$ is diagonal, so the equations are uncoupled. The reduced form is commonly used for two reasons - ease of estimation, and the fact that concurrent correlations cannot be used in forecasting.

Stationarity condition and moments of a VAR(1) model: All eigenvalues of $\Phi$ should be less than 1 in modulus for weak stationarity of $r_t$, provided the covariance matrix of $a_t$ exists. Further, we have $\Gamma_l = \Phi\Gamma_{l-1}$ for $l>0$, where $\Gamma_j$ is the lag-$j$ cross-covariance matrix of $r_t$. By repeated substitution we get $\Gamma_l = \Phi^l\Gamma_0$. Further, for $\Upsilon = D^{-1/2}\Phi D^{1/2}$ (with $D$ here the diagonal matrix of variances), we get $\rho_l = \Upsilon^l\rho_0$. A VAR(p) model is generally converted to a VAR(1) model using the companion matrix and then analyzed like a VAR(1) model.

To find the order of a VAR model one can use the multivariate equivalent of the PACF with hypothesis tests on the successive residuals. The $i$th equation in the sequence is $r_t = \phi_0 + \Phi_1 r_{t-1} + \dots + \Phi_i r_{t-i} + a_t$. The parameters of these equations can be estimated by OLS. For the $i$th equation, let the OLS estimates of the coefficients be $\hat\Phi_j^{(i)}$ and $\hat\phi_0^{(i)}$, where the superscript $(i)$ denotes the VAR($i$) model. Then the residual is $\hat a_t^{(i)} = r_t - \hat\phi_0^{(i)} - \hat\Phi_1^{(i)}r_{t-1} - \dots - \hat\Phi_i^{(i)}r_{t-i}$. We then test hypotheses sequentially to identify the order of the VAR model. For the $i$th and $(i-1)$th equations we test $H_0: \Phi_i = 0$ versus $H_a: \Phi_i \ne 0$; the test statistic is
$M(i) = -\left(T - k - i - \frac{3}{2}\right)\ln\left(\frac{|\hat\Sigma_i|}{|\hat\Sigma_{i-1}|}\right)$. Asymptotically, $M(i)$ is distributed as chi-squared with $k^2$ degrees of freedom, where $k$ is the dimensionality of the asset universe.

Equivalent AIC and BIC methods can also be employed. OLS or ML is generally used to estimate the parameters, the two methods being asymptotically equivalent. Once a model is fit, the residuals should be tested for model inadequacy using the $Q_k(m)$ statistic (with $k^2 m - g$ degrees of freedom, where $g$ is the number of estimated AR parameters). Forecasting is similar to the univariate case. The impulse response function is the MA representation and can be derived to look at the decay rate. The MA equation is pre-multiplied by $L^{-1}$ to get the impulse response function of $r_t$ with respect to the orthogonalized innovations $b_t$. Different orderings of the components may lead to different response functions, which is a drawback.

Vector moving-average models (VMA): A VMA(1) model is given by $r_t = \theta_0 + a_t - \Theta a_{t-1}$, where $\theta_0$ is a k-dimensional vector and $\Theta$ is a $k\times k$ matrix. As in the univariate case, the cross-correlations cut off after lag 1 and can be used to identify the order. Estimation of VMA models is a lot more involved; conditional or exact MLE approaches can be used.
Vector ARMA models (VARMA): In generalizing from univariate series to vectors, ARMA encounters the issue of identifiability - the model may not be uniquely defined. Some constraints need to be imposed - structural specification. These are hardly used.
Marginal models of components: Given the vector model, the component models are called the marginal models. For a k-dimensional VARMA(p,q) model, the marginal models are ARMA[kp, (k-1)p+q] models.

Unit root nonstationarity and cointegration: When modeling several unit-root nonstationary time series jointly, one may encounter cointegration. This may be due to a common trend or unit root shared by the components; in other words, one can find a linear combination that is stationary. Let $h$ be the number of unit roots (or common trends) in the k-dimensional series $x_t$. Cointegration exists if $0 < h < k$, and the quantity $k-h$ is called the number of cointegrating factors; these are the different linear combinations that are unit-root stationary. The linear combinations resulting in these unit-root stationary processes are called the cointegrating vectors. Two price series, if cointegrated, have a common underlying trend, and we lose this information if we take the first difference of each price series: one difference per unit root preserves the useful information, and under cointegration there are more nonstationary series than unit roots, so differencing everything loses information. Overdifferencing can lead to the problem of unit roots in the MA matrix polynomial, and hence to invertibility and estimation issues. Also, cointegration can exist after adjusting for transaction costs and exchange-rate risk, which is artificial.

Error correction form: To overcome the difficulty of noninvertible VARMA models one can use this form. A VARMA(p,q) model is $x_t = \sum_{i=1}^{p}\Phi_i x_{t-i} + a_t - \sum_{j=1}^{q}\Theta_j a_{t-j}$.
For $\Delta x_t = x_t - x_{t-1}$, we can subtract $x_{t-1}$ from both sides of the VARMA equation to get the error correction form: $\Delta x_t = \alpha\beta^T x_{t-1} + \sum_{i=1}^{p-1}\Phi_i^*\Delta x_{t-i} + a_t - \sum_{j=1}^{q}\Theta_j a_{t-j}$
where $\alpha$ and $\beta$ are $k\times m$ full-rank matrices, $k$ is the total asset dimension, and $m$ is the number of cointegrating factors ($m<k$). The term $\alpha\beta^T x_{t-1}$ is called the error-correction term, as it compensates for the over-differencing. Further, $\beta^T x_{t-1}$ is stationary. Also, $\Phi_j^* = -\sum_{i=j+1}^{p}\Phi_i$, $j = 1,\dots,p-1$, and $\alpha\beta^T = \Phi_p + \dots + \Phi_1 - I$.
The time series $\beta^T x_t$ is unit-root stationary, and the columns of $\beta$ are the cointegrating vectors of $x_t$.

Co-integrated VAR models: We focus on VAR models for their simplicity in estimation, to better understand cointegration. For a k-dimensional VAR(p) model
$x_t = \mu_t + \Phi_1 x_{t-1} + \dots + \Phi_p x_{t-p} + a_t$,
where $\mu_t = \mu_0 + \mu_1 t$. Or equivalently, using the backshift operator $B$,
$(I - \Phi_1 B - \dots - \Phi_p B^p)x_t = \mu_t + a_t$.
The characteristic polynomial above is denoted $\Phi(B)$. For a unit-root nonstationary process, 1 is a root of this polynomial, making $|\Phi(1)| = 0$. An error-correction form can be obtained by subtracting $x_{t-1}$ from both sides of the equation, giving
$\Delta x_t = \mu_t + \Pi x_{t-1} + \Phi_1^*\Delta x_{t-1} + \dots + \Phi_{p-1}^*\Delta x_{t-p+1} + a_t$
where $\Pi = \Phi_1 + \dots + \Phi_p - I = -\Phi(1)$ and $\Phi_j^* = -\sum_{i=j+1}^{p}\Phi_i$ for $j = 1,\dots,p-1$. Further, $\mathrm{Rank}(\Pi) = 0$ implies that $x_t$ is not cointegrated, while $\mathrm{Rank}(\Pi) = k$ implies that the ECM is not informative and one studies $x_t$ directly. Finally, if $0 < \mathrm{Rank}(\Pi) = m < k$, then one can write $\Pi = \alpha\beta^T$, where $\alpha$ and $\beta$ are both of rank $m$. We have the case of cointegration with $m$ linearly independent cointegrating vectors $w_t = \beta^T x_t$, and $k-m$ unit roots or common trends.

To obtain the $(k-m)\times 1$ vector of common trends $y_t = \alpha_\perp^T x_t$, we compute the $k\times(k-m)$ orthogonal complement matrix $\alpha_\perp$ satisfying $\alpha_\perp^T\alpha = 0$. To uniquely identify $\alpha$ and $\beta$ we require that $\beta^T = [I_m, \beta_1^T]$, where $I_m$ is the $m\times m$ identity matrix and $\beta_1$ is a $(k-m)\times m$ matrix. There are a few more conditions for the process $w_t = \beta^T x_t$ to be unit-root stationary.

The rank of $\Pi$ in the ECM is the number of cointegrating vectors. Thus, to test for cointegration, one can examine the rank of $\Pi$, which is the approach taken in the Johansen test.

Deterministic function: The limiting distributions of cointegration tests depend on the deterministic function $\mu_t$.
1) $\mu_t = 0$: All component series of $x_t$ are I(1) without drift and the stationary series $w_t = \beta^T x_t$ has mean zero.
2) $\mu_t = \alpha c_0$: Components of $x_t$ are I(1) without drift, but $w_t$ has a nonzero mean $c_0$; this is called the restricted constant case.
3) $\mu_t = \mu_0$: Component series are I(1) with drift $\mu_0$ and $w_t$ may have a nonzero mean.
4) $\mu_t = \mu_0 + \alpha c_1 t$: Components of $x_t$ are I(1) with drift $\mu_0$ and $w_t$ has a linear time trend; this is called the restricted trend case.
5) $\mu_t = \mu_0 + \mu_1 t$: Both the constant and the trend are unrestricted. The components of $x_t$ are I(1) with a quadratic time trend and $w_t$ has a linear trend.

Maximum likelihood estimation: Estimation of the cointegrated VAR(p) model is quite involved. Both $\Delta x_t$ and $x_{t-1}$ are first regressed on the deterministic term and the lagged differences $\Delta x_{t-i}$, giving residual series $u_t$ and $v_t$ respectively. A related eigenvalue problem is then solved, leading to a likelihood which, when maximized, gives the estimates of the coefficients.

Johansen test for cointegration: This essentially tests the rank of the matrix $\Pi$, for a specified deterministic term $\mu_t$. The number of non-zero eigenvalues of $\Pi$ can be obtained if a consistent estimate of $\Pi$ is available. Looking at the ECM equation it is clear that $\Pi$ is related to the covariance matrix between $x_{t-1}$ and $\Delta x_t$ after adjusting for the effects of the deterministic trend term and $\Delta x_{t-i}$ for $i = 1,\dots,p-1$. Using canonical correlation analysis between the two adjusted series, the squared correlations $\hat\lambda_i$ are computed. There are two versions of the Johansen test:

1) Trace cointegration test:- $H_0$: $\mathrm{Rank}(\Pi) = m$ versus $H_a$: $\mathrm{Rank}(\Pi) > m$. The likelihood ratio (LR) statistic is
$LK_{tr}(m) = -(T-p)\sum_{i=m+1}^{k}\ln(1-\hat\lambda_i)$
Due to the presence of unit roots, the asymptotic distribution of the statistic is not chi-squared, but a function of standard Brownian motions. Thus, the critical values must be obtained via simulation.

2) Sequential test:- $H_0$: $\mathrm{Rank}(\Pi) = m$ versus $H_a$: $\mathrm{Rank}(\Pi) = m+1$. The LR test statistic, called the maximum eigenvalue statistic, is
$LK_{max}(m) = -(T-p)\ln(1-\hat\lambda_{m+1})$
Again, the critical values of the test statistic are nonstandard and must be evaluated via simulation.

Left out sections: 8.7, 8.8

Tsay Ch5 - High Frequency Data Analysis and Market Microstructure

Non-synchronous trading

For daily stock returns, non-synchronous trading can introduce
1) cross correlations between stock returns
2) serial correlation in a portfolio return
3) sometimes negative serial correlations in the return series of a stock.

Bid-ask spread

Introduces lag-1 serial correlation in an asset return, called bid-ask bounce.

Empirical characteristics

Aggregated data do not show some of the characteristics of transactions data:
1) unequally spaced time intervals - duration between trades might contain useful information about market micro-structure e.g. trading intensity
2) discrete-valued prices - along with limits
3) Existence of a daily periodic pattern - thinner during lunch hour.
4) Multiple transactions within a single second

Overnight stock returns differ substantially from intraday returns (Stoll and Whaley 1990). Intraday trading has exploded, with multiple transactions within a second.

Models for price change

The discreteness and the concentration on 'no change' make it difficult to model intraday price changes. There are two models - the ordered probit model (Hausman, Lo and MacKinlay 1992) and a decomposition model (McCulloch and Tsay 2000). These models find prediction challenging and are used more for understanding.
Ordered Probit model - With $P_t^*$ the fundamental value of the asset in a frictionless market and $P_t$ the observed price, we define the unobserved change $y_i^* = P_{t_i}^* - P_{t_{i-1}}^*$ and model it as a continuous random variable $y_i^* = x_i\beta + \epsilon_i$. The observed price change $y_i$ is categorized into an ordered set $s_1,\dots,s_k$ according to where $y_i^*$ falls. Generally a normal distribution is assumed. The model can be estimated by maximum likelihood or MCMC methods. The explanatory variables $x_i$ can be the time duration, lagged prices, the lagged S&P 500 price, the bid-ask spread and direction, and lagged volume. Volatility can be explained using duration and bid-ask spread as well.
A decomposition model (ADS) - an indicator for a price change, the direction of the price change, and the size of the price change: $y_i = P_{t_i} - P_{t_{i-1}} = A_i D_i S_i$, where the ordering of the conditioning is important. Each of these terms is modeled as a logistic regression using explanatory variables and estimated by log likelihood.

Duration models - ACD

Concerned with the time intervals between trades. Longer durations indicate a lack of trading activity, which in turn means no new information. Before the duration can be modeled, the diurnal pattern has to be removed from the time series. This is done by computing the adjusted time duration $\Delta t_i^* = \Delta t_i / f(t_i)$, where $f(t_i)$ captures the deterministic daily pattern (e.g. the first 5 minutes, the last 30 minutes, and the mid-day period, depending on the asset and its profile). We can then fit an autoregressive conditional duration model to $\Delta t_i^*$. $f(t_i)$ is commonly estimated using smoothing splines; one way is to use a combination of quadratic functions and indicator variables to take care of the deterministic components of daily trading activity:
$f(t_i) = e^{d(t_i)}, \quad d(t_i) = \beta_0 + \sum_{j=1}^{7}\beta_j f_j(t_i)$
where $f_1, f_2, f_3, f_4$ are quadratic functions fitted to the specific data (Tsay p. 225), $f_5$ and $f_6$ are indicator variables for the first and second 5 minutes after the market open, and $f_7$ is the indicator for the last 30 minutes of daily trading. The coefficients can be determined by the least squares method:
$\ln(\Delta t_i) = \beta_0 + \sum_{j=1}^{7}\beta_j f_j(t_i) + \epsilon_i$
The autoregressive conditional duration (ACD) model uses the idea of GARCH models to study the dynamic structure of the adjusted duration $\Delta t_i^*$. For $x_i = \Delta t_i^*$ and $\psi_i = E(x_i|F_{i-1})$, the model is defined as $x_i = \psi_i\epsilon_i$, where $\epsilon_i$ follows a standard exponential (EACD) or a standard Weibull (WACD) distribution. Further, similar to GARCH, we have the ACD(r,s) model
$\psi_i = \omega + \sum_{j=1}^{r}\gamma_j x_{i-j} + \sum_{j=1}^{s}\omega_j\psi_{i-j}$
with $\gamma_j = 0$ for $j>r$ and $\omega_j = 0$ for $j>s$. For stationarity, $\omega > 0$ and $1 > \sum_j(\gamma_j + \omega_j)$.

EACD(1,1) model: $\epsilon_i$ has a standard exponential distribution, $x_i = \psi_i\epsilon_i$ and $\psi_i = \omega + \gamma_1 x_{i-1} + \omega_1\psi_{i-1}$. We have $E(\epsilon_i) = 1$ and $\mathrm{Var}(\epsilon_i) = 1$, implying $E(\epsilon_i^2) = 2$. Assuming weak stationarity,
$E(x_i) = \mu_x = \frac{\omega}{1 - \gamma_1 - \omega_1}$,
$\mathrm{Var}(x_i) = \mu_x^2\,\frac{1 - \omega_1^2 - 2\gamma_1\omega_1}{1 - \omega_1^2 - 2\gamma_1^2 - 2\gamma_1\omega_1}$.
Hence, $1 > 2\gamma_1^2 + \omega_1^2 + 2\gamma_1\omega_1$ is needed for the variance to exist.
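A small simulation sketch of an EACD(1,1) model; the parameter values are made up and satisfy $\gamma_1 + \omega_1 < 1$:

import numpy as np

def simulate_eacd11(n, omega=0.1, gamma1=0.2, omega1=0.7, seed=0):
    """EACD(1,1): x_i = psi_i * eps_i with eps_i ~ Exp(1)."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    psi = np.full(n, omega / (1.0 - gamma1 - omega1))   # start at the unconditional mean duration
    for i in range(1, n):
        psi[i] = omega + gamma1 * x[i - 1] + omega1 * psi[i - 1]
        x[i] = psi[i] * rng.exponential()
    return x, psi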

Bivariate models for price change and duration - PCD

Jointly modeling the price change and the associated duration process. Focus on transactions that result in a price change: $P_{t_i} = P_{t_{i-1}} + D_i S_i$, where $D_i$ is the direction-of-change dummy and $S_i$ is the size of the change. This reduces the number of data points, and there is no diurnal pattern in the time durations between price changes. The PCD model decomposes the joint distribution of $(\Delta t_i, N_i, D_i, S_i)$ given $F_{i-1}$ as
$f(\Delta t_i, N_i, D_i, S_i|F_{i-1}) = f(S_i|D_i, N_i, \Delta t_i, F_{i-1})\,f(D_i|N_i, \Delta t_i, F_{i-1})\,f(N_i|\Delta t_i, F_{i-1})\,f(\Delta t_i|F_{i-1})$
where the $i$th transaction consists of $\Delta t_i$ the duration, $N_i$ the number of trades in the period, $D_i$ the direction of the price change, and $S_i$ the size of the price change in ticks. There are many ways to specify the conditional distributions depending on the asset under study. Following McCulloch and Tsay (2000), generalized linear models are used for the discrete-valued variables and a time series model for the continuous variable $\ln(\Delta t_i)$:
$\ln(\Delta t_i) = \beta_0 + \beta_1\ln(\Delta t_{i-1}) + \beta_2 S_{i-1} + \sigma\epsilon_i$.
The log transformation ensures positiveness. Due to the concentration of $N_i$ at 0, we partition the model for $N_i$ into two parts:
$p(N_i = 0|\Delta t_i, F_{i-1}) = \mathrm{logit}[\alpha_0 + \alpha_1\ln(\Delta t_i)]$
where $\mathrm{logit}(x) = e^x/(1+e^x)$, whereas the second part of the model is
$N_i|(N_i > 0, \Delta t_i, F_{i-1}) \sim 1 + g(\lambda_i), \quad \lambda_i = \frac{e^{\gamma_0 + \gamma_1\ln(\Delta t_i)}}{1 + e^{\gamma_0 + \gamma_1\ln(\Delta t_i)}}$
where $g(\lambda)$ denotes a geometric distribution with parameter $\lambda$ in the interval (0,1). The model for the direction $D_i$ is
$D_i|(N_i, \Delta t_i, F_{i-1}) = \mathrm{sign}(\mu_i + \sigma_i\epsilon)$
where $\epsilon$ is a N(0,1) random variable, and
$\mu_i = \omega_0 + \omega_1 D_{i-1} + \omega_2\ln(\Delta t_i)$,
$\ln(\sigma_i) = \beta|D_{i-1} + D_{i-2} + D_{i-3} + D_{i-4}|$
To allow for different dynamics between positive and negative price movements, we use different models for the size of a price change:
$S_i|(D_i = -1, N_i, \Delta t_i, F_{i-1}) \sim p(\lambda_{d,i}) + 1, \quad \ln(\lambda_{d,i}) = \eta_{d,0} + \eta_{d,1}N_i + \eta_{d,2}\ln(\Delta t_i) + \eta_{d,3}S_{i-1}$
$S_i|(D_i = 1, N_i, \Delta t_i, F_{i-1}) \sim p(\lambda_{u,i}) + 1, \quad \ln(\lambda_{u,i}) = \eta_{u,0} + \eta_{u,1}N_i + \eta_{u,2}\ln(\Delta t_i) + \eta_{u,3}S_{i-1}$
where $p(\lambda)$ denotes a Poisson distribution with parameter $\lambda$, and 1 is added to the size because the minimum size is 1 tick when there is a price change. Estimation can be done either by maximum likelihood or by MCMC methods.

Left out sections: 5.6

Tsay Ch3 - Conditional Heteroscedastic Models


Concerned with modeling the volatility of an asset return. Volatility modeling, apart from being useful in options pricing, provides a simple approach to calculating VaR in risk management. It also plays an important role in asset allocation under the mean-variance framework, and it can improve the efficiency of parameter estimation and the accuracy of interval forecasts. Implied volatility tends to be larger than that obtained from a GARCH type of volatility model. Some characteristics are:
1) there exist volatility clusters
2) volatility jumps are rare
3) does not diverge to infinity
These imply that volatility is generally stationary.
4) leverage effect - volatility reacts differently to big price increases vs big price drops

Returns are generally serially independent, but their squares are not. The general mean equation is $r_t = \mu_t + a_t$ where
$\mu_t = \phi_0 + \sum_{i=1}^{k}\beta_i x_{it} + \sum_{i=1}^{p}\phi_i r_{t-i} - \sum_{i=1}^{q}\theta_i a_{t-i}$
where $k$, $p$ and $q$ are non-negative integers, and $x_{it}$ are explanatory variables (e.g. exogenous variables or dummy variables). The conditional variance (or volatility) is given by
$\sigma_t^2 = \mathrm{Var}(r_t|F_{t-1}) = \mathrm{Var}(a_t|F_{t-1})$
The model of volatility (called volatility equation) can be of two types - one with exact function and the other with stochastic equation.

The first step is always to model the mean equation (e.g. ARMA) and check the residuals for ARCH effects. We then fit volatility models and finally jointly estimate the mean and volatility equations. Let the mean equation be $r_t = \mu + a_t$, with $\mu$ the simple average of the series. The squared series $a_t^2$ is then used to check for conditional heteroscedasticity or ARCH effects. Two tests:
1) Ljung-Box test on the $a_t^2$ series: $H_0$: the first m lags of the ACF of the $a_t^2$ series are zero.
2) Lagrange multiplier test of Engle: equivalent to the F-statistic for testing $\alpha_i = 0$ in the linear regression $a_t^2 = \alpha_0 + \alpha_1 a_{t-1}^2 + \dots + \alpha_m a_{t-m}^2 + e_t$, for $t = m+1,\dots,T$.

ARCH model: The shock $a_t$ is serially uncorrelated but dependent at the quadratic level: $a_t = \sigma_t\epsilon_t$, where
$\sigma_t^2 = \alpha_0 + \alpha_1 a_{t-1}^2 + \dots + \alpha_m a_{t-m}^2$
where $\epsilon_t$ is a sequence of iid random variables with mean zero and variance 1, $\alpha_0 > 0$ and $\alpha_i \ge 0$ for $i > 0$. $\epsilon_t$ can be modeled using a standard normal, student-t or generalized error distribution. Using a heavy-tailed distribution will reduce the estimated ARCH effect. This models volatility clustering.

Properties of ARCH(1): $E[a_t] = E[E[a_t|F_{t-1}]] = E[\sigma_t E[\epsilon_t]] = 0$. Also, $\mathrm{Var}(a_t) = \alpha_0/(1-\alpha_1)$. Requiring higher-order moments to be finite imposes further conditions; e.g. the unconditional kurtosis is
$\frac{E[a_t^4]}{[\mathrm{Var}(a_t)]^2} = 3\,\frac{1-\alpha_1^2}{1-3\alpha_1^2} > 3$
Thus, the ARCH(1) model admits heavier tails than the normal distribution. For the kurtosis to be finite, $\alpha_1^2 < 1/3$. The model assumes that positive and negative shocks have the same effect. ARCH models are likely to over-predict volatility because they respond slowly to large isolated shocks.

Order determination: using the PACF of $a_t^2$ one can select the ARCH order, unless the sample size is small. Estimation is done by maximum likelihood. After fitting, the standardized residuals $a_t/\sigma_t$ should be checked for iid behavior using Ljung-Box statistics, skewness, kurtosis and a QQ plot. Forecasting is similar to an AR model.

GARCH model: To reduce the number of parameters we use the GARCH model. A GARCH(m,s) model is given by $a_t = \sigma_t\epsilon_t$ and
$\sigma_t^2 = \alpha_0 + \sum_{i=1}^{m}\alpha_i a_{t-i}^2 + \sum_{j=1}^{s}\beta_j\sigma_{t-j}^2$
where again $\epsilon_t$ is iid with mean 0 and variance 1, $\alpha_0 > 0$, $\alpha_i \ge 0$, $\beta_j \ge 0$, and $\sum_{i=1}^{\max(m,s)}(\alpha_i + \beta_i) < 1$. $\alpha_i$ and $\beta_j$ are referred to as the ARCH and GARCH parameters, respectively. The model shows similar properties to ARCH, with ARMA-like characteristics.
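A minimal simulation sketch of a GARCH(1,1) process with Gaussian innovations (the parameter values are illustrative only):

import numpy as np

def simulate_garch11(n, alpha0=0.01, alpha1=0.08, beta1=0.90, seed=0):
    """a_t = sigma_t * eps_t with sigma_t^2 = alpha0 + alpha1 a_{t-1}^2 + beta1 sigma_{t-1}^2."""
    rng = np.random.default_rng(seed)
    a = np.zeros(n)
    sigma2 = np.full(n, alpha0 / (1.0 - alpha1 - beta1))   # start at the unconditional variance
    for t in range(1, n):
        sigma2[t] = alpha0 + alpha1 * a[t - 1] ** 2 + beta1 * sigma2[t - 1]
        a[t] = np.sqrt(sigma2[t]) * rng.standard_normal()
    return a, sigma2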

The model is difficult to specify; only low-order models are used in practice. The log likelihood is used for estimation, and the fitted model is checked using the standardized residuals $a_t/\sigma_t$ and their squared process. $a_{h+1}^2$ is a consistent estimate of $\sigma_{h+1}^2$ but not an accurate one, because a single observation of a random variable with known mean cannot provide an accurate estimate of its variance.

IGARCH model: the unit-root GARCH model, with $\sum_i\alpha_i + \sum_i\beta_i = 1$. The unconditional variance of $a_t$, and hence that of $r_t$, is not defined under the IGARCH model. Theoretically, this might be caused by occasional level shifts in volatility. For the case $\alpha_0 = 0$, the IGARCH(1,1) model becomes the exponential smoothing model for volatility.

Stochastic volatility model: To ensure positiveness of the conditional variance, SV models use $\ln(\sigma_t^2)$ instead of $\sigma_t^2$. It is defined as $a_t = \sigma_t\epsilon_t$,
$\ln(\sigma_t^2) = \alpha_0 + \alpha_1\ln(\sigma_{t-1}^2) + \nu_t$
where the $\epsilon_t$ are iid N(0,1), the $\nu_t$ are iid $N(0,\sigma_\nu^2)$, $\epsilon_t$ and $\nu_t$ are independent, $\alpha_0$ is a constant, and all zeros of the characteristic polynomial are greater than 1 in modulus.

Quasi-likelihood methods via Kalman filtering or MCMC methods are used for estimation. Limited experience shows that SV models often provided improvements in model fitting, but their contributions to out-of-sample volatility forecasts received mixed results.

Alternative approaches: Two approaches
1) High-Frequency Data - Use of high-frequency data to calculate the volatility of low-frequency returns. The model for daily returns is unknown, which can complicate the monthly estimation. If the $r_i$ are the $n$ daily log returns making up the monthly return $r_m$, we have $r_m = \sum_{i=1}^{n} r_i$. For white noise, $\mathrm{Var}(r_m) = n\,\mathrm{Var}(r_i)$; for an MA(1) process it is $\mathrm{Var}(r_m) = n\,\mathrm{Var}(r_i) + 2(n-1)\mathrm{Cov}(r_i, r_{i+1})$. Further, the sample size is only about 21 days, making the accuracy questionable. If the kurtosis and serial correlations are high, the sample estimate is not even consistent.
This concept can be used to calculate daily volatility from intraday log returns. At this level the realized volatility ($\sum r^2$ over the day) approximately follows a Gaussian ARIMA(0,1,q) model, which can be used to produce forecasts. Intuitively one would use the smallest possible interval, but anything finer than about 15 minutes brings in microstructure noise. Overnight returns also need to be taken into account for correct daily volatility estimation.
2) Daily OHLC Prices - Shown to improve volatility estimates. Let $C_t$, $O_t$, $H_t$, and $L_t$ be the close, open, high and low for day $t$. Also let $f$ be the fraction of the day that trading is closed. The conventional volatility estimate is $\sigma_t^2 = E[(C_t - C_{t-1})^2|F_{t-1}]$. Assuming the price follows a simple diffusion model without drift, various estimates of volatility have been proposed, one being $0.5(H_t - L_t)^2 - [2\ln(2) - 1](C_t - O_t)^2$. Yang and Zhang (2000) also proposed an estimate of the form $\hat\sigma_{yz}^2 = \hat\sigma_o^2 + k\hat\sigma_c^2 + (1-k)\hat\sigma_{rs}^2$,
where
$\hat\sigma_o^2 = \frac{1}{n-1}\sum_{t=1}^{n}(o_t - \bar o)^2, \quad o_t = \ln(O_t) - \ln(C_{t-1})$
$\hat\sigma_c^2 = \frac{1}{n-1}\sum_{t=1}^{n}(c_t - \bar c)^2, \quad c_t = \ln(C_t) - \ln(O_t)$
$\hat\sigma_{rs}^2 = \frac{1}{n}\sum_{t=1}^{n}[u_t(u_t - c_t) + d_t(d_t - c_t)], \quad u_t = \ln(H_t/O_t), \quad d_t = \ln(L_t/O_t)$
$k = \frac{0.34}{1.34 + (n+1)/(n-1)}$
The quantity $k$ is chosen to minimize the variance of the estimator $\hat\sigma_{yz}^2$.
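A small numpy sketch of the Yang-Zhang estimator exactly as written above; it assumes aligned arrays of daily open/high/low/close prices and returns a per-day (not annualized) volatility:

import numpy as np

def yang_zhang_vol(open_, high, low, close):
    """Yang-Zhang (2000) volatility estimate from daily OHLC prices."""
    open_, high, low, close = map(np.asarray, (open_, high, low, close))
    o = np.log(open_[1:] / close[:-1])             # overnight log returns o_t
    c = np.log(close[1:] / open_[1:])              # open-to-close log returns c_t
    u = np.log(high[1:] / open_[1:])               # u_t
    d = np.log(low[1:] / open_[1:])                # d_t
    n = len(o)
    k = 0.34 / (1.34 + (n + 1) / (n - 1))
    var_o = np.var(o, ddof=1)
    var_c = np.var(c, ddof=1)
    var_rs = np.mean(u * (u - c) + d * (d - c))    # Rogers-Satchell component
    return np.sqrt(var_o + k * var_c + (1 - k) * var_rs)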

Left out sections: 3.7-3.11, 3.13-3.14, 3.16

Tsay Ch2 - Linear Time Series Analysis and Its Applications



We work with log returns $r_t$, viewed as a collection of random variables over time, i.e. as a time series $\{r_t\}$.

Stationarity: The first step in time series analysis should always be checking stationarity.
1) Strictly stationary: a time series $r_t$ is strictly stationary if the joint distribution of $(r_{t_1},\dots,r_{t_k})$ is identical to that of $(r_{t_1+t},\dots,r_{t_k+t})$ for all $t$ and $k$, i.e. the joint distribution is invariant under time shifts.
2) Weak stationarity: both the mean of $r_t$ and the covariance between $r_t$ and $r_{t-l}$ are time-invariant. Weak stationarity implies partial predictability. In other words, the first two moments of $r_t$ are finite and constant. A strictly stationary series is weakly stationary but not vice versa, unless $r_t$ is normally distributed.

Autocovariance: $\gamma_l = \mathrm{Cov}(r_t, r_{t-l})$ with properties $\gamma_0 = \mathrm{Var}(r_t)$ and $\gamma_{-l} = \gamma_l$

Autocorrelation function (ACF): uncorrelated means $\rho_{x,y} = 0$; it implies independence under normality. $\rho_l = \gamma_l/\gamma_0$, giving $\rho_0 = 1$ and $\rho_{-l} = \rho_l$. The sample autocorrelation
$\hat\rho_l = \frac{\sum_{t=l+1}^{T}(r_t - \bar r)(r_{t-l} - \bar r)}{\sum_{t=1}^{T}(r_t - \bar r)^2}$
is asymptotically normal with mean zero and variance $1/T$, if $r_t$ is an iid sequence with finite second moment (Brockwell and Davis 1991).

Bartlett's formula: For a weakly stationary time series of the form $r_t = \mu + \sum_{i=0}^{q}\psi_i a_{t-i}$, where $\psi_0 = 1$, $q$ is a non-negative integer, and the $a_j$ are a Gaussian white noise series, $\hat\rho_l$ is asymptotically normal with mean zero and variance $(1 + 2\sum_{i=1}^{q}\rho_i^2)/T$ for $l > q$ (Box, Jenkins, Reinsel 1994). Hence, to test the autocorrelation at a given lag $l$, we can construct the corresponding test statistic.

Ljung and Box (1978) increase the power for finite samples by constructing a statistic that follows a chi-squared distribution with m degrees of freedom:
$Q(m) = T(T+2)\sum_{l=1}^{m}\frac{\hat\rho_l^2}{T-l}$.
In practice, the selection of m may affect the performance of the Q(m) statistic. $m \approx \ln(T)$ provides better power performance.
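A minimal sketch of the $Q(m)$ statistic for a univariate series (compare against a chi-squared distribution with m degrees of freedom; a scipy call would supply the p-value):

import numpy as np

def ljung_box(x, m):
    """Ljung-Box Q(m) = T(T+2) * sum_{l=1}^m rho_l^2 / (T - l)."""
    x = np.asarray(x, dtype=float) - np.mean(x)
    T = len(x)
    denom = np.sum(x ** 2)
    q = 0.0
    for l in range(1, m + 1):
        rho_l = np.sum(x[l:] * x[:-l]) / denom     # sample lag-l autocorrelation
        q += rho_l ** 2 / (T - l)
    return T * (T + 2) * q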

Monthly returns of the value-weighted index seem to have stronger serial dependence than individual stock returns. CAPM suggests there is no autocorrelation in financial series. One has to be careful of the autocorrelation induced by the way index returns or stock prices are determined - these are pseudo relationships.

White noise: a time series $r_t$ is white noise if the sequence is iid with finite mean and variance. If, in addition, $r_t$ is normally distributed with mean 0 and variance $\sigma^2$, it is called Gaussian white noise. For a white noise series all ACFs are zero. E.g. monthly individual stock returns seem to be white noise, but not the value-weighted index return series.

Wold decomposition: A time series $r_t$ is said to be linear if it can be written in the Wold form
$r_t = \mu + \sum_{i=0}^{\infty}\psi_i a_{t-i}$,
where $\mu$ is the mean of $r_t$, $\psi_0 = 1$, and $a_t$ is a white noise series. If $r_t$ is weakly stationary, we can obtain its mean, variance and covariances easily. For a weakly stationary series, $\rho_l$ converges to zero as $l$ increases.

AR(p) model: The current conditional expectation of the return depends on the last p lags:
$r_t = \phi_0 + \phi_1 r_{t-1} + \dots + \phi_p r_{t-p} + a_t$
The necessary and sufficient condition for weak stationarity of an AR(1) process is $|\phi_1| < 1$. The ACF decays exponentially (it extends beyond lag 1!). For higher-order AR processes, the inverses of the solutions of the characteristic equation are called the characteristic roots. Complex roots correspond to damped oscillations related to business cycles. The stationarity condition is that the characteristic roots are less than one in modulus (equivalently, all solutions of the characteristic equation are greater than one in modulus).

Identification of AR models: order determination. Two approaches (see the sketch after this list):
1) Partial autocorrelation function (PACF): fit nested AR models of increasing order. For a true order $p$, the estimates $\hat{\phi}_{l,l}$ converge to zero for all $l > p$ with asymptotic variance $1/T$. These can be plotted against $l$ to determine the cutoff (a PACF plot).
2) Information criteria: likelihood-based criteria, e.g. the Akaike information criterion (AIC, 1973):
$$AIC = -\frac{2}{T}\ln(\text{likelihood}) + \frac{2}{T}(\text{number of parameters}).$$
For a Gaussian AR(l) model, AIC reduces to $\ln(\hat{\sigma}_l^2) + 2l/T$, where $\hat{\sigma}_l^2$ is the maximum likelihood estimate of $\sigma_a^2$ and $T$ is the sample size. Another common criterion is the Bayesian information criterion (BIC), which for a Gaussian AR(l) model reduces to $\ln(\hat{\sigma}_l^2) + l\ln(T)/T$. BIC tends to select a lower-order AR model when the sample size is moderate to large.
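As a rough illustration of both approaches (statsmodels assumed available; the AR(2) coefficients below are purely illustrative), one can compare the PACF cutoff with AIC/BIC order selection on a simulated series. Note that statsmodels reports AIC/BIC on the -2 log-likelihood scale rather than the per-observation form above, but the idea of picking the minimizing order is the same.

import numpy as np
from statsmodels.tsa.stattools import pacf
from statsmodels.tsa.ar_model import AutoReg

rng = np.random.default_rng(2)
T = 1000
r = np.zeros(T)
for t in range(2, T):                             # simulate r_t = 0.5 r_{t-1} - 0.3 r_{t-2} + a_t
    r[t] = 0.5 * r[t - 1] - 0.3 * r[t - 2] + rng.standard_normal()

print(pacf(r, nlags=5))                           # should drop inside +/- 1.96/sqrt(T) after lag 2
for p in range(1, 6):
    fit = AutoReg(r, lags=p).fit()
    print(p, fit.aic, fit.bic)                    # pick the order minimizing AIC or BIC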

Parameter estimation: OLS or likelihood methods. Likelihood estimation is very similar to OLS but gives a slightly smaller estimate of $\sigma_a^2$ in finite samples.

Model checking: the residual series should be white noise - check the ACF of the residuals and the Ljung-Box statistic, which is asymptotically chi-squared with $m - g$ degrees of freedom, where $m \approx \ln(T)$ and $g$ is the number of AR coefficients used in the model.

Goodness of fit: For a stationary series R2 can be used as a goodness of fit. For unit-root non-stationary series R2 converges to one regardless of the underlying model. Adjusted R2 should be used to penalize the increased number of parameters.

Forecasting: longer-horizon forecasts converge to the unconditional mean (mean reversion), and their error variances converge to the unconditional variance. The speed of mean reversion is measured by the half life, $k = \ln(0.5)/\ln(|\phi_1|)$ for an AR(1); e.g. $\phi_1 = 0.9$ gives $k \approx 6.6$ periods.

MA(q) model: moving-average models. These are a simple extension of white noise series, or equivalently an infinite-order AR model with parameter constraints. Bid-ask bounce in stock trading may introduce an MA(1) structure in a return series. The general form is
$$r_t = c_0 + a_t - \theta_1 a_{t-1} - ... - \theta_q a_{t-q}.$$
It is a finite-memory model with ACF equal to zero beyond lag $q$. An MA(1) is invertible if $|\theta_1| < 1$, otherwise noninvertible; an MA model is always weakly stationary. The coefficients of MA models are also called the impulse response function.
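A small sketch (numpy only) of the finite-memory property: the sample ACF of a simulated MA(1) series is close to $-\theta_1/(1+\theta_1^2)$ at lag 1 and close to zero beyond it; the value $\theta_1 = 0.7$ is illustrative.

import numpy as np

rng = np.random.default_rng(9)
a = rng.standard_normal(5001)
r = a[1:] - 0.7 * a[:-1]                          # MA(1): r_t = a_t - 0.7 a_{t-1}
d = r - r.mean()
rho = [np.sum(d[l:] * d[:len(r) - l]) / np.sum(d ** 2) for l in (1, 2, 3)]
print(rho)                                        # ~ -0.47 at lag 1, ~ 0 at lags 2 and 3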

Identification of MA models: identified using the ACF, which provides information on the nonzero MA lags of the model.

Estimation: Likelihood methods:
1) conditional likelihood method: assumes the initial shocks are zero.
2) exact likelihood method: initial shocks are estimated jointly with other parameters.

Forecasting: for an MA(q) model, forecasts beyond q steps ahead equal the mean and stay there.

ARMA(p,q) model: to reduce the number of parameters of a high-order AR or MA model, one can use an ARMA model. Generally not needed for return series themselves but heavily used for volatility models. An ARMA(1,1) is
$$r_t - \phi_1 r_{t-1} = \phi_0 + a_t - \theta_1 a_{t-1}.$$
For the model to be meaningful we need $\phi_1 \neq \theta_1$. The ACF of an ARMA(1,1) behaves very much like that of an AR(1) model, except that the exponential decay starts at lag 2 instead of lag 1; consequently the ACF does not cut off at any finite lag. The PACF of an ARMA(1,1) does not cut off at any finite lag either, behaving like that of an MA(1) except that the exponential decay starts at lag 2. The stationarity condition for an ARMA(1,1) is the same as that of an AR(1) model. A general ARMA(p,q) model is
$$r_t = \phi_0 + \sum_{i=1}^{p}\phi_i r_{t-i} + a_t - \sum_{j=1}^{q}\theta_j a_{t-j}.$$
There should be no common factors between the AR and MA polynomials, otherwise the order of the model can be reduced. For stationarity, all solutions of the AR polynomial should be greater than one in modulus (i.e. the characteristic roots less than one in modulus), as for the AR model.
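A hedged end-to-end sketch (statsmodels assumed available): simulate an ARMA(1,1) series and fit it with ARIMA(order=(1, 0, 1)); the coefficients are illustrative, and ArmaProcess uses lag-polynomial sign conventions, so the signs below refer to the polynomials rather than to the $\phi_1$, $\theta_1$ notation above.

import numpy as np
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.tsa.arima.model import ARIMA

ar = np.array([1, -0.6])                          # AR polynomial 1 - 0.6 B
ma = np.array([1, 0.3])                           # MA polynomial 1 + 0.3 B
r = ArmaProcess(ar, ma).generate_sample(nsample=1000)

fit = ARIMA(r, order=(1, 0, 1)).fit()             # p=1, d=0, q=1
print(fit.summary())                              # check the AR and MA terms are not a common factor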

Identification: the EACF (Tsay and Tiao, 1984) can be used; the vertex of the triangle of zeros in the EACF table identifies the order. Estimation can then be done using conditional or exact likelihood, and the Ljung-Box statistic of the residuals can be checked for the adequacy of the model.

Unit-root nonstationarity: price series, interest rates, and FX rates are generally nonstationary. They are called unit-root nonstationary series in the time series literature; the best-known example is the random walk.

Random walk: a time series $p_t$ is a random walk if it satisfies
$$p_t = p_{t-1} + a_t.$$
Viewed as an AR(1) process, it is nonstationary because the coefficient equals 1; we call it a unit-root nonstationary time series. It is non-mean-reverting: the point forecast of the price equals the value at the forecast origin, with forecast error variance $l\sigma_a^2$, where $l$ is the number of look-ahead steps, which diverges to infinity as $l \to \infty$. The series has strong memory, as the sample ACFs all approach 1 as the sample size increases.
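A quick numerical check (numpy only) of the "strong memory" claim: the sample ACF of a simulated random walk stays near 1 even at fairly long lags.

import numpy as np

rng = np.random.default_rng(3)
a = rng.standard_normal(2000)
p = np.cumsum(a)                                  # p_t = p_{t-1} + a_t with p_0 = 0
d = p - p.mean()
rho = [np.sum(d[l:] * d[:len(p) - l]) / np.sum(d ** 2) for l in (1, 5, 20)]
print(rho)                                        # all close to 1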

Random walk with drift: The market index tends to have a small and positive mean. The model for log price becomes
$$p_t = \mu + p_{t-1} + a_t$$
Both the expected value tμ and variance tσ2a increase with time t.

Trend-stationary time series: $p_t = \beta_0 + \beta_1 t + r_t$, where $r_t$ is a stationary series.
The mean $\beta_0 + \beta_1 t$ is time-varying, but the variance is finite and equal to $\mathrm{Var}(r_t)$, unlike the random walk with drift, whose variance grows with time. The series can be transformed into a stationary one by removing the time trend via a simple linear regression.

ARIMA models: a conventional way to handle unit-root nonstationarity is differencing. A time series $y_t$ is an ARIMA(p,1,q) process if the change series $c_t = y_t - y_{t-1}$ is a stationary and invertible ARMA(p,q) process. For example, a log price series is typically modeled as an ARIMA process. Series with multiple unit roots need repeated differencing.
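A brief sketch (statsmodels assumed available) of the differencing idea: build a unit-root log-price series from simulated returns and fit it with d=1, so the model is estimated on the change series; the order (1,1,0) is illustrative only.

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(4)
ret = 0.0005 + 0.01 * rng.standard_normal(1500)   # placeholder log returns
logp = np.cumsum(ret)                             # log price: unit-root nonstationary

print(ARIMA(logp, order=(1, 1, 0)).fit().summary())   # d=1 differences the series internally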

Dickey-Fuller unit-root test: $H_0: \phi_1 = 1$ versus $H_a: \phi_1 < 1$. The test statistic is the t-ratio of the least-squares estimate of $\phi_1$ from the model $p_t = \phi_1 p_{t-1} + e_t$, with
$$\hat{\phi}_1 = \frac{\sum_{t=1}^{T} p_{t-1} p_t}{\sum_{t=1}^{T} p_{t-1}^2}, \qquad \hat{\sigma}_e^2 = \frac{\sum_{t=1}^{T}(p_t - \hat{\phi}_1 p_{t-1})^2}{T-1},$$
where $p_0 = 0$ and $T$ is the sample size. The t-ratio is
$$\frac{\hat{\phi}_1 - 1}{\mathrm{std}(\hat{\phi}_1)} = \frac{\sum_{t=1}^{T} p_{t-1} e_t}{\hat{\sigma}_e \sqrt{\sum_{t=1}^{T} p_{t-1}^2}},$$
which under $H_0$ does not follow the usual t-distribution but the Dickey-Fuller distribution.
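In practice one would use an augmented Dickey-Fuller test rather than the raw t-ratio; a hedged sketch with statsmodels (assumed available) on a simulated random walk, where the unit-root null should not be rejected:

import numpy as np
from statsmodels.tsa.stattools import adfuller

rng = np.random.default_rng(5)
p = np.cumsum(rng.standard_normal(1000))          # random walk, so H0 (unit root) holds
stat, pvalue, usedlag, nobs, crit, icbest = adfuller(p)
print(stat, pvalue, crit)                         # large p-value: cannot reject the unit root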

Left out sections: 2.8 Seasonal models, 2.9 Regression models with time series errors, 2.10 consistent covariance matrix estimation, 2.11 long-memory models

Tsay Ch1 - Financial time series and their characteristics


Here are my notes from the book 'Analysis of Financial Time Series' by Tsay. It's a good introductory book, mostly empirical with decent maths. Someday I will read Hamilton or Green and write that up too. 

Ch1: Financial Time series and their characteristics

Concerned with asset valuation. The uncertainty in the definition and statistical inference of returns and volatility plays an important role.

Asset returns - various definitions. Let Pt be the price of an asset at time index t. Assume no dividends.
1) One-period simple return: $R_t = \frac{P_t}{P_{t-1}} - 1$
2) Multi-period simple return: $1 + R_t[k] = \prod_{j=0}^{k-1}(1 + R_{t-j})$
3) Continuously compounded or log return: $r_t = \ln(1 + R_t) = \ln(P_t/P_{t-1}) = p_t - p_{t-1}$, and $r_t[k] = \sum_{i=0}^{k-1} r_{t-i}$ (checked numerically in the sketch after this list)
4) Portfolio return: $R_{p,t} = \sum_{i=1}^{N} w_i R_{it}$ and $r_{p,t} \approx \sum_{i=1}^{N} w_i r_{it}$
5) Dividend payment: $R_t = \frac{P_t + D_t}{P_{t-1}} - 1$ and $r_t = \ln(P_t + D_t) - \ln(P_{t-1})$
6) Excess return: $Z_t = R_t - R_{0t}$ and $z_t = r_t - r_{0t}$
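A tiny numeric check (numpy only) of definitions 1-3 on a toy price path, confirming that multi-period simple returns compound multiplicatively while log returns add:

import numpy as np

P = np.array([100.0, 102.0, 101.0, 105.0])        # toy prices, no dividends
R = P[1:] / P[:-1] - 1                            # one-period simple returns
r = np.log(P[1:] / P[:-1])                        # log returns: r_t = p_t - p_{t-1}
R_k = np.prod(1 + R) - 1                          # multi-period simple return
r_k = np.sum(r)                                   # multi-period log return
print(R, r, R_k, r_k, np.log(1 + R_k) - r_k)      # last term ~ 0: consistency check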

Distributional properties - N assets and T time indices.
Joint distribution:
$$F_{X,Y}(x, y; \theta) = P(X \le x, Y \le y; \theta) = \int_{-\infty}^{x}\int_{-\infty}^{y} f_{x,y}(w, z; \theta)\, dz\, dw$$
Marginal distributions: marginal distribution of X is obtained by integrating out Y.
Conditional distributions:
$$F_{X|Y \le y}(x; \theta) = \frac{P(X \le x, Y \le y; \theta)}{P(Y \le y; \theta)}$$
$$f_{x,y}(x, y; \theta) = f_{x|y}(x; \theta) \times f_y(y; \theta)$$
Independent variables imply $f_{x,y}(x, y; \theta) = f_x(x; \theta) f_y(y; \theta)$.
Moments of a random variable:
$$m'_l = E[X^l] = \int_{-\infty}^{\infty} x^l f(x)\, dx$$
The lth central moment of X is defined as
$$m_l = E[(X - \mu_x)^l] = \int_{-\infty}^{\infty} (x - \mu_x)^l f(x)\, dx$$
Sample central moments:
sample mean: $\hat{\mu}_x = \frac{1}{T}\sum_{t=1}^{T} x_t$
sample variance: $\hat{\sigma}_x^2 = \frac{1}{T-1}\sum_{t=1}^{T}(x_t - \hat{\mu}_x)^2$
sample skewness: $\hat{S}_x = \frac{1}{(T-1)\hat{\sigma}_x^3}\sum_{t=1}^{T}(x_t - \hat{\mu}_x)^3$, asymptotically normal with mean 0 and variance $6/T$ (Snedecor and Cochran 1980).
sample kurtosis: $\hat{K}_x = \frac{1}{(T-1)\hat{\sigma}_x^4}\sum_{t=1}^{T}(x_t - \hat{\mu}_x)^4$, asymptotically normal with mean 3 and variance $24/T$.

JB normality test: both $\hat{S}(r)$ and $\hat{K}(r) - 3$ are asymptotically normal, so each can be tested individually for normality. Jarque and Bera (1987) combine the two into the test statistic
$$JB = \frac{\hat{S}^2(r)}{6/T} + \frac{(\hat{K}(r) - 3)^2}{24/T},$$
which is asymptotically distributed as a chi-squared random variable with 2 degrees of freedom.
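A hedged sketch (numpy and scipy assumed available) of the sample skewness, sample kurtosis, and JB statistic on a placeholder series; with real return data, normality would typically be rejected.

import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(6)
r = rng.standard_normal(1000)                     # placeholder returns
T = len(r)
mu, sig = r.mean(), r.std(ddof=1)
S = np.sum((r - mu) ** 3) / ((T - 1) * sig ** 3)  # sample skewness
K = np.sum((r - mu) ** 4) / ((T - 1) * sig ** 4)  # sample kurtosis (normal value is 3)
JB = S ** 2 / (6 / T) + (K - 3) ** 2 / (24 / T)
print(JB, chi2.sf(JB, df=2))                      # small p-value rejects normality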

Distribution of returns - the most general model for the log returns $r_{it}$, $i = 1,...,N$, $t = 1,...,T$, is their joint distribution function
$$F_r(r_{11},...,r_{N1}; r_{12},...,r_{N2}; ...; r_{1T},...,r_{NT}; Y; \theta),$$
where Y is a state vector summarizing the environment in which asset returns are determined and θ are parameters that uniquely determine the distribution function Fr(.). Many times Y is assumed given and the main concern is the conditional distribution of rit given Y. Some applications focus on the joint distribution of N returns at a single time index t, i.e. r1t,...,rNt. Other theories emphasize the dynamic structure of individual asset returns - the uni-variate case - ri1,...,riT for a given asset i. For the uni-variate case the joint distribution is given as
$$F(r_1,...,r_T; \theta) = F(r_1; \theta)\prod_{t=2}^{T} F(r_t | r_{t-1},...,r_1; \theta).$$
The main concern then is to specify the conditional distribution F(rt|rt1,..,r1;θ). Different distributional specifications lead to different theories. For instance, if the conditional distribution is equal to the marginal distribution it is a random-walk hypothesis. It is customary to treat asset returns as continuous random variables, and use their probability density functions.
$$f(r_1,...,r_T; \theta) = f(r_1; \theta)\prod_{t=2}^{T} f(r_t | r_{t-1},...,r_1; \theta).$$

Different distributions on marginals give different models:
Normal distribution: $R_t$ iid normal with fixed mean and variance - the lower bound is wrong (simple returns cannot fall below $-1$ while the normal is unbounded below), multi-period returns are not normal because they compound multiplicatively, and empirical returns show excess kurtosis.
Lognormal distribution: $r_t$ iid normal with fixed mean $\mu$ and variance $\sigma^2$. The simple returns $R_t$ are then lognormal with
$$E[R_t] = e^{\mu + \frac{\sigma^2}{2}} - 1, \qquad \mathrm{Var}[R_t] = e^{2\mu + \sigma^2}\left[e^{\sigma^2} - 1\right].$$
Multi-period log returns remain normal under summation and there is no lower-bound problem, but the model is not consistent with the observed excess kurtosis.
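The lognormal moment formulas can be checked by simulation (numpy only; the values of $\mu$ and $\sigma$ below are arbitrary illustrative choices):

import numpy as np

mu, sigma = 0.01, 0.05
r = np.random.default_rng(7).normal(mu, sigma, 1_000_000)               # iid normal log returns
R = np.exp(r) - 1                                                        # implied simple returns
print(R.mean(), np.exp(mu + sigma ** 2 / 2) - 1)                         # E[R_t]
print(R.var(), np.exp(2 * mu + sigma ** 2) * (np.exp(sigma ** 2) - 1))   # Var[R_t]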
Stable distributions: distributions for which linear combinations keep the same distribution. Summation and the lower bound are handled, but the variance is infinite, e.g. the Cauchy distribution $f(x) = 1/(\pi(1 + x^2))$.
Scale mixture of normal distributions: finite mixtures of normals, $r_t \sim (1 - X)\,N(\mu, \sigma_1^2) + X\,N(\mu, \sigma_2^2)$, where $X$ is a Bernoulli random variable and $\sigma_2^2$ is relatively large. Mixtures maintain the tractability of the normal under addition, capture excess kurtosis, and have finite higher moments, but the mixture parameters are hard to estimate. For multivariate returns the covariance matrix becomes critical and everything becomes vectors and matrices.

Log-likelihood function of returns under conditional normality:
$$\ln f(r_1,...,r_T; \theta) = \ln f(r_1; \theta) - \frac{1}{2}\sum_{t=2}^{T}\left(\ln(2\pi) + \ln(\sigma_t^2) + \frac{(r_t - \mu_t)^2}{\sigma_t^2}\right)$$
This uses the conditional decomposition above, with each conditional density normal with mean $\mu_t$ and variance $\sigma_t^2$; for iid returns $\mu_t$ and $\sigma_t^2$ are constant.
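For the constant mean and variance case, the expression collapses to the usual iid Gaussian log-likelihood; a minimal sketch (numpy only) that evaluates it at plug-in estimates on a placeholder series:

import numpy as np

r = np.random.default_rng(8).normal(0.001, 0.02, 500)    # placeholder returns
mu, sig2 = r.mean(), r.var()                             # plug-in estimates of mu, sigma^2
ll = -0.5 * np.sum(np.log(2 * np.pi) + np.log(sig2) + (r - mu) ** 2 / sig2)
print(ll)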

Empirical properties of returns:
a) Daily returns have higher kurtosis than monthly returns. For monthly returns market indexes have higher kurtosis than individual stocks.
b) For daily returns market indexes have smaller standard deviation than individual stocks.
c) The empirical return density is taller at the center and has fatter tails than the corresponding normal density.

Processes considered: return series, conditional volatility (e.g. clustering), extreme behaviors (frequency, size, impact).