### Principle Component Analysis of Stock Market and Factor Model with R

Source Code : https://github.com/WenkaiCui/PCA-and-Factor-Model

### Introduction

Principle Component Analysis is often used as a tool of dimension reduction and helps visualize high dimension data set. It generates a set of factors that linearly explain a large portion of total variation of data.

In quantitative finance, PCA has interesting applications. For example, when using PCA to decompose return of future contract or fix income security with different term to maturity, the first principle component resembles the level of overall contract, the second resembles the slope and the third is very similar to the curvature.

In addition, PCA is closely related to factor analysis, whose idea is very similar to the factor model in finance. The return in factor model is driven by orthogonal factors, just like what PCA does.

The estimation of covariance matrix has always been one of the top topics in quantitative asset management. Factor model provides a way to restrict the covariance matrix of stocks to factors, thus reducing noise of estimation. The simplest way to test if a factor model works is to backtest if the Global Minimum Variance Portfolio estimated by factor model is smaller than before.

In this project I will first review some PCA theories and related linear algebras, then, its application in stock market and the construction of Orthogonal factor model. Later on, I will get things more serious and do a backtest to see if this factor model works.

Without further ado, let’s get started.

### PCA of Stock Market

I use the data downloaded from Ken French’s website.
This is the monthly return of 48 industry portfolios. you can download the data at http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html

I start with most recent 10 year data, so the data frame has 120 * 48 elements.

library(xts)
library(dplyr)
library(ggplot2)
library(reshape2)
library(ggvis)
library(ggthemes)

df= read.csv('48_Industry_Portfolios.csv',skip=11,stringsAsFactors=F)
df= df[1:1098,]
df = data.frame(lapply(df,as.numeric))
mask = (df[,1]>200000)&(df[,1]< 201800)
df = df[mask,]
rownames(df) = df$X df = df[,-1]/100 n = dim(df)[1] k = dim(df)[2]  Suppose our data $X$ has $n$ observations and $k$ stocks, so the covariance matrix $\Omega$ of stocks should be a $k\times k$ matrix. We have: $X^TX=\Omega = P\Lambda P^{T}$ in which P is columns of orthogonal eigenvectors, a $k\times k$ matrix. $\Lambda$ is a diagonal matrix of eigenvalues. Suppose $X^* = XP$ Then: $(X^*)^T X^*= P^TX^T XP= P^T \Omega P = P^T P\Lambda P^T P=\Lambda$ In finance words, eigenvector $P$ contains $k$ columns of weights. For each weight $P_i$, we construct a new portfolio called eigen-portfolios $X^*_i$ from $X$, so now we have $k$ eigen-portfolios instead of $k$ stocks. Because $X^*$ has a diagonal covariance matrix, the eigen-portfolios $X^*$ are uncorrelated with each other and have variance $\Lambda_{i,i}$ The eigenvectors are constructed in a way to keep as more variance as possible. As a result, the first few eigen-portfolios contains a great amount of fraction of total variance. To start PCA, we need to normalize data by subtracting the mean and dividing by standard diviation. This will make our result not affacted by the scale of data. To make $P$ real weights, we need to normalize so that they sum to 1  pca = prcomp(df,scale.=T,center=T) omega = cov(df) # calculate covariance matrix of X P = pca$rotation #eigenvector
lambda = pca$sdev^2 # eigenvalue Pnorm = apply(P,2,function(x) x/sum(x)) #normalize weights df_price = apply(df+1,2,cumprod) #temp eigen.port = df_price %*% Pnorm #calculate eigen portfolios  Now let’s visulize the weights of first 10 eigen-portfolios.  par(mfrow=c(2,5),mar=c(3.5,2,2,2)) for (i in 1:10){ barplot(Pnorm[,i],main=colnames(P)[i],col='steelblue') #$text(plt, par("usr")[3], labels = rownames(P), srt = 45, adj = c(1.1,1.1), xpd = TRUE, cex=0.7)
}



Different PCs in the plot are actually trying to capture different kinds of variation. PC1 is capturing all variation as much as possible so it is not surprising that most time PC1 behave like a equal weight portfolio (at least compared with other portfolios).

The remaining PCs are trying to capture variantion that orthogonal to PC1, which means they are uncorrelated with the equal-weighted portfolio.

As a result, we can invest in P uncorrelated eigen-portfolios instead of P correlated stocks. In such way, it is much easier to make investment decision.

After transfering to return of portfolios to price, this is how them look like:

Grey lines are prices of actual 48 industries portfolios.

We can also check how many eigen-portfolios represent how much variance by $\frac{\sum_{i=1}^m\lambda_i}{\sum_{i=1}^k \lambda_i}$


frac =  rep(0,k)
for (i in 1:k){
frac[i] = sum(lambda[1:i])/sum(lambda)
}
options(repr.plot.width=5, repr.plot.height=3)
p = qplot(1:k,frac,geom='line',ylab='Fraction',xlab='Number of Eigen-Portfolios')+geom_point()+theme_bw()
ggplotly(p)



The first 4 principle components explain almost 70% of total variance! That is a lot, since all portfolios are from different industries that are not closely correlated. For example, if we use future contracts with different term to maturity here, the first 4 PCs can explain more than 99% of total variance.

### Orthogonal Factor Model

Below are some properties of orthogonal multi-factor model:

$R_i = \alpha_i + \sum_{j=1}^m \beta_{ij}F_j + \epsilon_i \\E(FF^T)=I\\ E(\epsilon \epsilon^T)=D\\E(\epsilon F^T) = 0\\E(F)=0\\E(\epsilon)=0$

Intuitively speaking, the returns are driven by a constant mean and multiple factors and idiosyncratic random term. The factors are uncorrelated to each other, which means the returns are driven by independent factors. The  idiosyncratic terms could be related to each other. For example, a market shock impact stocks in the same industries.

The factors are unobservable or unknown. They could be the SMB or HML factors from Fama-French model, or could be other macroeconomics factors as well, so we use PCA to estimate factors from data.

By definition:

$E(RR^T)=\Omega=\beta E(FF^T) \beta^T+ E(\epsilon \epsilon^T)\\\Omega=\beta \beta^T + D$

If we keep all eigen-vectors, which is to take $\hat\beta = \beta$. Then:

$\Omega = P\Lambda P^{T} = P\Lambda^{0.5} \Lambda^{0.5} (P)^{T} := \beta \beta^T + 0$

In this model, all the variance from data is explained and the noise is 0, the numbers of factor are equal to numbers of stock. Apparently this is useless because we highly over-fit the model.

However, if choose the first m Principle Components that explain most of variance to be the factor and the remaining factors become part of noise $\epsilon$, the model become legit.

Thus, $\hat D = diag(\hat \Omega - \hat \beta \hat \beta^T)$


m = 4
betahat = (P %*% sqrt(diag(lambda)))[,1:m]

Dhat = diag(diag(omega - betahat %*% t(betahat)))



We have $\hat \beta$ and $\hat D$ now. Next we need to estimate $F$. For each time $t$, regress $R_t$ on $\hat\beta$ with noise $\hat D_t$.

Assume $\epsilon \sim N(0,\hat D)$

Because $\epsilon$ is heteroskedatic, we need GLS to estimate $F_t$ for each t cross-sectionally.
$\hat F_t=(\hat \beta^T \hat D^{-1} \hat \beta)^{-1}(\hat \beta^T \hat D^{-1} R_t)$

Because we choose $\hat\beta$ on our own. The estimation of factor suffers Errors-in-variables.


DhatInv = solve(Dhat)
A = solve(t(betahat)%*%DhatInv%*%betahat)
fhat = matrix(0,nrow=dim(df)[1],ncol=m)
for (i in 1:dim(df)[1]){
fhat[i,] = t(A%*%(t(betahat)%*%DhatInv%*%t(as.matrix(df[i,]))))
}



$\hat F$contains the factor returns of orthogonal factor.

Visualization of the factor prices:

### Backtesting

In the previous section, the main purpose is to show what an orthogonal factor model looks like.

From now on, let’s make it more serious and see if the model can indeed helps us construct a more efficient portfolio.

The idea is simple. Remember the purpose of our factor model is to reduce the estimation error in covariance matrix. We can calculate the minimum variance portfolio using in sample data first with simple estimated covariance matrix then with factor constrained covariance matrix and compare their out of sample result. If factor constrained covariance matrix indeed outperform simply estimation, the model is proved to improve estimation accuracy.

For continued backtesting results, check this blog:

https://wenkaicui.com/2018/12/08/interactive-backtesting-report-by-plotly/