Principal Component Analysis

Principal Component Analysis (PCA) derives an orthogonal projection that converts a given set of observations into linearly uncorrelated variables, called principal components.

This package defines a PCA type to represent a PCA model, and provides a set of methods to access its properties.

Properties

Let M be an instance of PCA, d be the dimension of observations, and p be the output dimension (i.e. the dimension of the principal subspace).

indim(M)

Get the input dimension d, i.e. the dimension of the observation space.

outdim(M)

Get the output dimension p, i.e. the dimension of the principal subspace.

mean(M)

Get the mean vector (of length d).

projection(M)

Get the projection matrix (of size (d, p)). Each column of the projection matrix corresponds to a principal component.

The principal components are arranged in descending order of the corresponding variances.

principalvars(M)

The variances of principal components.

tprincipalvar(M)

The total variance of principal components, which is equal to sum(principalvars(M)).

tresidualvar(M)

The total residual variance.

tvar(M)

The total observation variance, which is equal to tprincipalvar(M) + tresidualvar(M).

principalratio(M)

The ratio of variance preserved in the principal subspace, which is equal to tprincipalvar(M) / tvar(M).
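
For example, these properties can be queried on a fitted model. This is a minimal sketch; the random matrix X and the exact values are purely illustrative:

using MultivariateStats, LinearAlgebra

# illustrative data: 100 observations of dimension 5, one per column
X = randn(5, 100)

# fit a PCA model with at most 3 output dimensions
M = fit(PCA, X; maxoutdim=3)

indim(M)             # input dimension d (here 5)
outdim(M)            # output dimension p (at most 3)
mean(M)              # mean vector of length d
P = projection(M)    # projection matrix of size (d, p)
P' * P ≈ I           # its columns are orthonormal
principalvars(M)     # variances in descending order
principalratio(M)    # equals tprincipalvar(M) / tvar(M)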

Transformation and Reconstruction

Given a PCA model M, one can use it to transform observations into principal components, as

\mathbf{y} = \mathbf{P}^T (\mathbf{x} - \boldsymbol{\mu})

or use it to reconstruct (approximately) the observations from principal components, as

\tilde{\mathbf{x}} = \mathbf{P} \mathbf{y} + \boldsymbol{\mu}

Here, \mathbf{P} is the projection matrix.

The package provides methods to do so:

transform(M, x)

Transform observations x into principal components.

Here, x can be either a vector of length d or a matrix where each column is an observation.

reconstruct(M, y)

Approximately reconstruct observations from the principal components given in y.

Here, y can be either a vector of length p or a matrix where each column gives the principal components for an observation.
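
The correspondence with the formulas above can be checked directly. This is an illustrative sketch; the data X and model M are set up as in the property example earlier:

using MultivariateStats

X = randn(5, 100)                        # illustrative data, one observation per column
M = fit(PCA, X; maxoutdim=3)

x = X[:, 1]                              # a single observation of length d
y = transform(M, x)                      # y = P'(x - μ)
y ≈ projection(M)' * (x - mean(M))       # true

xr = reconstruct(M, y)                   # x̃ = P y + μ
xr ≈ projection(M) * y + mean(M)         # true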

Data Analysis

One can use the fit method to perform PCA over a given dataset.

fit(PCA, X; ...)

Perform PCA over the data given in a matrix X. Each column of X is an observation.

This method returns an instance of PCA.

Keyword arguments:

Let (d, n) = size(X) be respectively the input dimension and the number of observations:

  • method – The choice of methods (default: :auto):
      • :auto: use :cov when d < n, or :svd otherwise
      • :cov: based on the covariance matrix
      • :svd: based on SVD of the input data
  • maxoutdim – Maximum output dimension (default: min(d, n)).
  • pratio – The ratio of variance preserved in the principal subspace (default: 0.99).
  • mean – The mean vector (default: nothing), which can be either of:
      • 0: the input data has already been centralized
      • nothing: this function will compute the mean
      • a pre-computed mean vector

Notes:

  • The output dimension p depends on both maxoutdim and pratio, as follows: suppose the first k principal components preserve at least pratio of the total variance, while the first k-1 preserve less than pratio; then the actual output dimension is min(k, maxoutdim) (see the sketch after these notes).
  • This function calls pcacov or pcasvd internally, depending on the choice of method.
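
The selection rule can be made concrete with a small computation. Here choose_outdim is a hypothetical helper written for illustration, not a function of the package:

# hypothetical helper illustrating how p follows from pratio and maxoutdim
function choose_outdim(vars, pratio, maxoutdim)
    # vars: all d principal variances in descending order,
    # so that sum(vars) is the total variance
    r = cumsum(vars) ./ sum(vars)        # cumulative ratio of preserved variance
    k = findfirst(>=(pratio), r)         # smallest k preserving at least pratio
    k = k === nothing ? length(vars) : k
    min(k, maxoutdim)
end

choose_outdim([4.0, 2.0, 1.0, 0.5], 0.8, 3)   # returns 2, since (4 + 2) / 7.5 = 0.8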

Example:

using MultivariateStats

# suppose Xtr and Xte are training and testing data matrix,
# with each observation in a column

# train a PCA model
M = fit(PCA, Xtr; maxoutdim=100)

# apply PCA model to testing set
Yte = transform(M, Xte)

# reconstruct testing observations (approximately)
Xr = reconstruct(M, Yte)

Example with iris dataset and plotting:

using MultivariateStats, RDatasets, Plots
plotly() # using plotly for 3D-interactive graphing

# load iris dataset
iris = dataset("datasets", "iris")

# split half to training set
Xtr = Matrix(iris[1:2:end, 1:4])'
Xtr_labels = Vector(iris[1:2:end, 5])

# split other half to testing set
Xte = Matrix(iris[2:2:end, 1:4])'
Xte_labels = Vector(iris[2:2:end, 5])

# train a PCA model, allowing up to 3 dimensions
M = fit(PCA, Xtr; maxoutdim=3)

# apply PCA model to testing set
Yte = transform(M, Xte)

# reconstruct testing observations (approximately)
Xr = reconstruct(M, Yte)

# group results by testing set labels for color coding
setosa = Yte[:,Xte_labels.=="setosa"]
versicolor = Yte[:,Xte_labels.=="versicolor"]
virginica = Yte[:,Xte_labels.=="virginica"]

# visualize first 3 principal components in 3D interactive plot
p = scatter(setosa[1,:],setosa[2,:],setosa[3,:],marker=:circle,linewidth=0)
scatter!(versicolor[1,:],versicolor[2,:],versicolor[3,:],marker=:circle,linewidth=0)
scatter!(virginica[1,:],virginica[2,:],virginica[3,:],marker=:circle,linewidth=0)
plot!(p,xlabel="PC1",ylabel="PC2",zlabel="PC3")

Core Algorithms

Two algorithms are implemented in this package: pcacov and pcasvd.

pcacov(C, mean; ...)

Compute PCA based on eigenvalue decomposition of a given covariance matrix C.

Parameters:
  • C – The covariance matrix.
  • mean – The mean vector of original samples, which can be a vector of length d, or an empty vector Float64[] indicating a zero mean.
Returns:

The resultant PCA model.

Note:

This function accepts two keyword arguments: maxoutdim and pratio.
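
For example, pcacov can be applied to a covariance matrix computed with standard Julia tools. This is an illustrative sketch; the data are random:

using MultivariateStats, Statistics

X = randn(5, 200)              # illustrative data, one observation per column
μ = vec(mean(X, dims=2))       # mean vector of length d
C = cov(X, dims=2)             # covariance matrix of size (d, d)

M = pcacov(C, μ; maxoutdim=3, pratio=0.99)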

pcasvd(Z, mean, tw; ...)

Compute PCA based on singular value decomposition of a centralized sample matrix Z.

Parameters:
  • Z – provides centralized samples.
  • mean – The mean vector of the original samples, which can be a vector of length d, or an empty vector Float64[] indicating a zero mean.
  • tw – The total sample weight, which equals the number of samples when all samples have unit weight.
Returns:

The resultant PCA model.

Note:

This function accepts two keyword arguments: maxoutdim and pratio.
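
A corresponding sketch for pcasvd, assuming tw is the total sample weight (i.e. the number of samples when all samples have unit weight):

using MultivariateStats, Statistics

X = randn(5, 200)              # illustrative data, one observation per column
μ = vec(mean(X, dims=2))       # mean vector of length d
Z = X .- μ                     # centralized samples
n = size(X, 2)                 # assumed total sample weight (unit weights)

M = pcasvd(Z, μ, n; maxoutdim=3, pratio=0.99)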