Probabilistic Principal Component Analysis¶

Probabilistic Principal Component Analysis (PPCA) represents a constrained form of the Gaussian distribution in which the number of free parameters can be restricted while still allowing the model to capture the dominant correlations in a data set. It is expressed as the maximum likelihood solution of a probabilistic latent variable model [BSHP06].

This package defines a PPCA type to represent a probabilistic PCA model, and provides a set of methods to access the properties.

Properties¶

Let M be an instance of PPCA, d be the dimension of observations, and p be the output dimension (i.e the dimension of the principal subspace)

indim(M)¶: Get the input dimension d, i.e the dimension of the observation space.

outdim(M)¶: Get the output dimension p, i.e the dimension of the principal subspace.

mean(M)¶: Get the mean vector (of length d).

projection(M)¶

Get the projection matrix (of size (d, p)). Each column of the projection matrix corresponds to a principal component.

The principal components are arranged in descending order of the corresponding variances.

loadings(M)¶: The factor loadings matrix (of size (d, p)).

var(M)¶: The total residual variance.

Transformation and Construction¶

Given a probabilistic PCA model M, one can use it to transform observations into latent variables, as

$\mathbf{z} = (\mathbf{W}^T \mathbf{W} + \sigma^2 \mathbf{I}) \mathbf{W}^T (\mathbf{x} - \boldsymbol{\mu})$

or use it to reconstruct (approximately) the observations from latent variables, as

$\tilde{\mathbf{x}} = \mathbf{W} \mathbb{E}[\mathbf{z}] + \boldsymbol{\mu}$

Here, $\mathbf{W}$ is the factor loadings or weight matrix.

The package provides methods to do so:

transform(M, x)¶

Transform observations x into latent variables.

Here, x can be either a vector of length d or a matrix where each column is an observation.

reconstruct(M, z)¶

Approximately reconstruct observations from the latent variable given in z.

Here, y can be either a vector of length p or a matrix where each column gives the latent variables for an observation.

Data Analysis¶

One can use the fit method to perform PCA over a given dataset.

fit(PPCA, X; ...)¶

Perform probabilistic PCA over the data given in a matrix X. Each column of X is an observation.

This method returns an instance of PCA.

Keyword arguments:

Let (d, n) = size(X) be respectively the input dimension and the number of observations:

name	description	default
method	The choice of methods: `:ml`: use maximum likelihood version of probabilistic PCA `:em`: use EM version of probabilistic PCA `:bayes`: use Bayesian PCA	`:ml`
maxoutdim	Maximum output dimension.	`d-1`
mean	The mean vector, which can be either of: `0`: the input data has already been centralized `nothing`: this function will compute the mean a pre-computed mean vector	`nothing`
tol	Convergence tolerance	`1.0e-6`
tot	Maximum number of iterations	`1000`

Notes:

This function calls ppcaml, ppcaem or bayespca internally, depending on the choice of method.

Example:

using MultivariateStats

# suppose Xtr and Xte are training and testing data matrix,
# with each observation in a column

# train a PCA model
M = fit(PPCA, Xtr; maxoutdim=100)

# apply PCA model to testing set
Yte = transform(M, Xte)

# reconstruct testing observations (approximately)
Xr = reconstruct(M, Yte)

Core Algorithms¶

Three algorithms are implemented in this package: ppcaml, ppcaem, and bayespca.

ppcaml(Z, mean, tw; ...)¶

Compute probabilistic PCA using on maximum likelihood formulation for a centralized sample matrix Z.

Parameters:	Z – provides centralized samples. mean – The mean vector of the original samples, which can be a vector of length `d`, or an empty vector `Float64[]` indicating a zero mean.
Returns:	The resultant PPCA model.
Note:	This function accepts two keyword arguments: `maxoutdim` and `tol`.

ppcaem(S, mean, n; ...)¶

Compute probabilistic PCA based on expectation-maximization algorithm for a given sample covariance matrix S.

Parameters:	S – The sample covariance matrix. mean – The mean vector of original samples, which can be a vector of length `d`, or an empty vector `Float64[]` indicating a zero mean. n – The number of observations.
Returns:	The resultant PPCA model.
Note:	This function accepts two keyword arguments: `maxoutdim`, `tol`, and `tot`.

bayespca(S, mean, n; ...)¶

Compute probabilistic PCA based on Bayesian algorithm for a given sample covariance matrix S.

Parameters:	S – The sample covariance matrix. mean – The mean vector of original samples, which can be a vector of length `d`, or an empty vector `Float64[]` indicating a zero mean. n – The number of observations.
Returns:	The resultant PPCA model.
Note:	This function accepts two keyword arguments: `maxoutdim`, `tol`, and `tot`.

Additional notes:

Function uses the maxoutdim parameter as an upper boundary when it automatically determines the latent space dimensionality.

References¶

[BSHP06]

Bishop, C. M. Pattern Recognition and Machine Learning, 2006.