Content from Clustering Introduction


Last updated on 2025-01-27 | Edit this page

Estimated time: 12 minutes

Download Chapter notebook (ipynb)

Download Chapter PDF

Mandatory Lesson Feedback Survey

Overview

Questions

  • How to search for multiple distributions in a dataset?
  • How to use Scikit-learn to perform clustering?
  • How is data labelled in unsupervised learning?
  • How can we score clustering predictions?

Objectives

  • Understanding Multiple Gaussian distributions in a dataset.
  • Demonstrating Scikit-learn functionality for Gaussian Mixture Models.
  • Learning automated labelling of dataset.
  • Obtaining a basic quality score using a ground truth.



Import functions

PYTHON

from numpy import arange, asarray, linspace, zeros, c_, mgrid, meshgrid, array, dot, percentile
from numpy import histogram, cumsum, around
from numpy import vstack, sqrt, logspace, amin, amax, equal, invert, count_nonzero
from numpy.random import uniform, seed, randint, randn, multivariate_normal

from matplotlib.pyplot import subplots, scatter, xlabel, ylabel, axis, figure, colorbar, title, show
from matplotlib.colors import LogNorm

from pandas import read_csv

Example


Import the patients data, scatter the data for Weight and Height and get a summary statistics for both.

PYTHON

df = read_csv("data/patients_data.csv")

# Weigth to kg and height to cm
pound_kg_conversion = 0.45
inch_cm_conversion  = 2.54

df['Weight'] = pound_kg_conversion*df['Weight']
df['Height'] = inch_cm_conversion *df['Height']


fig, ax = subplots()

ax.scatter(df['Weight'], df['Height'])

ax.set_xlabel('Weight (kg)', fontsize=16)
ax.set_ylabel('Height (cm)', fontsize=16)

df[['Weight', 'Height']].describe()

show()

OUTPUT

           Weight      Height
count  100.000000  100.000000
mean    69.300000  170.357800
std     11.957139    7.204631
min     49.950000  152.400000
25%     58.837500  165.100000
50%     64.125000  170.180000
75%     81.112500  175.895000
max     90.900000  182.880000

Looking at the data, one might expect that there are two distinct groups, visually identified as two clouds separated e.g. by a vertical line at Weight \(\approx\) 70 (kg). A consequence is that the mean value of 69.3 kg (which was calculated over all samples) should better be replaced by two mean values, one for each of the clouds. Visually, these can be estimated at around 60 and 80 kg.

We can make this even clearer by looking at the two individual distributions.

PYTHON

fig, ax = subplots(ncols=2)

ax[0].hist(df['Weight'], bins=20);
ax[0].set_xlabel('Weight', fontsize=16)
ax[0].set_ylabel('Count', fontsize=16)

ax[1].hist(df['Height'], bins=12);
ax[1].set_xlabel('Height', fontsize=16);

show()

The Weight histogram shows two distributions (at the chosen number of bins) which now also points to the two mean values as guessed above. The Height histogram is at least not compatible with the assumption of a normal distribution as would be expected for a typically noisy variable.

Thus, visual inspection suggests to analyse the data in terms of more than one underlying distribution. The automated assignment of data points to distinct groups is called clustering.

We now want to learn to use the Gaussian Mixture Model approach to find those groups. As we will not provide any labels for the training, this presents an example of unsupervised machine learning. Algorithms of this type of machine learning are designed to learn to optimally assign labels through training. As a result, we will be able to separate a dataset into groups and be able to predict the labels of new, unlabelled data.

Gaussian Mixture Models


A Gaussian Mixture Models (GMM) approach assumes that the data are composed of two or more normal distributions that may overlap. In a scatter plot that means that there is more than one centre in the density distribution of the data (see scatter plot above). The task is to find the centres and the spread of each distribution in the mixture. The GMM algorithm thus belongs to the category of (probability) Density Estimators. Another way of grouping is to find a curve that splits the plane into two areas.

The GMM assumes normally distributed data structure from at least two sources. Other than that it does not make assumptions about the data.

GMM is a parametric learning approach as it optimises the parameters of a normal distributions, i.e. the mean and the covariance matrix of each group. It is therefore an example of a model fitting method.

As its name suggests, it assumes that the distribution of each group is normal. If the groups are known to have a non-normal distribution, it may not be the optimal approach.

GMM is one example of clustering or cluster analysis. Whenever we suspect that a data set contains contributions of qualitatively different types, we can consider doing a cluster analysis to separate those types. However, this is a vague notion and clustering is therefore a complex field. We can only provide an introduction to its basic components. The main point to keep in mind is that the algorithms provided e.g. by Scikit-learn will always give some result but that it is not easy to assess the quality of the results. Scikit-learn has a good overview of clustering methods showing advantages and disadvantages of each. Here is a link to a readable introduction about the cautious application and the pitfalls of clustering.

Work Through Example


Creating test data

Let us create synthetic data for testing of the clustering algorithm. We do this according to the assumptions of GMM: we create two Gaussian data sets with different means and different standard distributions and add them together. For illustration we only use two features.

The example is adapted from a Scikit-learn example. It uses the concept of covariance matrix which is the extension of variance (or standard deviation) to multivariate datasets.

PYTHON

n_samples = 500
m_features = 2

# Seed the random number generator
RANDOM_NUMBER = 1
seed(RANDOM_NUMBER)

# Data set 1, centered at (20, 20)
mean_11 = 20
mean_12 = 20

gaussian_1 = randn(n_samples, m_features) + array([mean_11, mean_12])


# Data set 2, zero centered and stretched with covariance matrix C
C = array([[1, -0.7], [3.5, .7]])

gaussian_2 = dot(randn(n_samples, m_features), C)


# Concatenate the two Gaussians to obtain the training data set
X_train = vstack([gaussian_1, gaussian_2])

print(X_train.shape)

fig, ax = subplots()

ax.scatter(X_train[:, 0], X_train[:, 1]);

show()

OUTPUT

(1000, 2)

The scatter plot showes that this method allows the adjustment of the centres of the distributions as well as the elliptic shape of the distribution.

Now we fit a GMM. Note that the GMM needs to be told how many components one wants to fit. Modifications that estimate the optimal number of components exist but we will restrict the demonstration to the method that directly sets the number.

Analogous to the classifier in supervised learning, we instantiate the model from the imported class GaussianMixture. The instantiation takes the number of independent data sets (clusters) as an argument. By default, the classifier tries to fit the full covariance matrix of each group. The fitting is done using the method fit.

PYTHON

from sklearn.mixture import GaussianMixture

# Fit a Gaussian Mixture Model with two components

components = 2

clf = GaussianMixture(n_components=components)

clf.fit(X_train)
GaussianMixture(n_components=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

After the fitting of the model, we first create a meshgrid of the (two-dimensional) state space. For each point in this state space, we obtain the predicted scores using the method .score_samples. These are the weighted logarithmic probabilities which show the predicted distribution of points in the state space.

PYTHON

resolution = 100

vec_a = linspace(-60., 80., resolution)
vec_b = linspace(-40., 50., resolution)

grid_a, grid_b = meshgrid(vec_a, vec_b)

XY_statespace = c_[grid_a.ravel(), grid_b.ravel()]

Z_score = clf.score_samples(XY_statespace)

Z_s = Z_score.reshape(grid_a.shape)

print(Z_s.shape)

OUTPUT

(100, 100)

Now we can display the predicted scores as a contour plot. Typically, the negative log-likelihood or density estimation is used for this. In this case, the highest probabilities are shown as a landscape with two minima.

PYTHON

fig, ax = subplots(figsize=(8, 6))

cax = ax.contour(grid_a, grid_b, -Z_s,
           norm=LogNorm(vmin=1.0, vmax=1000.0),
           levels=logspace(0, 3, 10),
           cmap='magma'
          )

fig.colorbar(cax);

ax.scatter(X_train[:, 0], X_train[:, 1], .8)

title('Negative log-likelihood of Prediction', fontsize=16)
axis('tight');

show()

You can change the number of components to see the impact it has on the result. E.g. picking 3 components:

PYTHON

clf_3 = GaussianMixture(n_components=3)

clf_3.fit(X_train)

Z_score_3 = clf_3.score_samples(XY_statespace)

Z_s_3 = Z_score_3.reshape(grid_a.shape)

fig, ax = subplots(figsize=(8, 6))

cax = ax.contour(grid_a, grid_b, -Z_s_3,
           norm=LogNorm(vmin=1.0, vmax=1000.0),
           levels=logspace(0, 3, 10),
           cmap='magma'
          )

fig.colorbar(cax);

title('Negative log-likelihood (3 components)', fontsize=16)
axis('tight');

show()
GaussianMixture(n_components=3)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

For the choice of 3 components it does not lead to a probability distribution with 3 distinct maxima. This is because two of the maxima coincide or at least nearly coincide.

In our example, the choice of 2 components is very obvious because as done above, we could visualise the complete state space and there was a visually discernible structure in the data. In high-dimensional data the task is difficult and while methods exist to automatically find the optimal number of components for some clustering methods, the success of these depends very much on the problem.

Getting optimal model parameters

Now that the estimator is fitted, we can obtain the optimal parameters for the fitted components. They are stored in the model attributes. We can extract (i) the .weights_, the share of each of the components (Gaussians) in the mixture; (ii) the .means_, the coordinates of the mean values; and (iii) .covariances_, the covariance matrix of each component.

PYTHON

components = 2

clf = GaussianMixture(n_components=components);

clf.fit(X_train)

print('Model Weights: ')
print(clf.weights_)
print('')

print('Mean coordinates: ')
print(clf.means_)
print('')
print('Covariance Matrices: ')
print(clf.covariances_)
GaussianMixture(n_components=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

OUTPUT

Model Weights:
[0.5 0.5]

Mean coordinates:
[[ 2.00467649e+01  2.00308601e+01]
 [ 1.10681138e-01 -6.87868023e-03]]

Covariance Matrices:
[[[ 0.95442218 -0.06641459]
  [-0.06641459  0.97019156]]

 [[13.78557368  1.81677876]
  [ 1.81677876  1.04931994]]]

The fit returns a model where the two components have equal weight. The means and covariance matrices can be compared directly to the values chosen to create the data. They are not identical but good estimates are obtained from a fit to 500 data points in each group.

Create data from optimal model

The result of the fitting are the parameters for two Gaussian distributions with two features each. These parameters can be used to create further model data with the same characteristics. In our demonstration we know the original sources but if the parameters are obtained from experimental or clinical data, it is useful to visualise the predicted distributions using as many samples as necessary.

If we know the mean and the covariance matrix of a Gaussian, the function multivariate_normal can be used to create data from that Gaussian.

PYTHON

model1_mean, model2_mean = clf.means_[0], clf.means_[1]
model1_cov, model2_cov =  clf.covariances_[0], clf.covariances_[1]

samples = 100

model1_data = multivariate_normal(model1_mean, model1_cov, samples)
model2_data = multivariate_normal(model2_mean, model2_cov, samples)

fig, ax = subplots()

ax.scatter(model1_data[:, 0], model1_data[:, 1], c='b');
ax.scatter(model2_data[:, 0], model2_data[:, 1], c='r');

show()

Predicting Labels

Now we can apply what we have discussed in supervised machine learning and use the trained model to predict.

We can get the predictions of the group for new data. Here, for simplicity, we create test data from the same distribution as the train data. The label is obtained from the method .predict.

PYTHON

n_samples = 10
m_features = 2

# Seed the random number generator
RANDOM_NUMBER = 111
seed(RANDOM_NUMBER)

# Data set 1, centered at (20, 20)

mean_11 = 20
mean_12 = 20

gaussian_1 = randn(n_samples, m_features) + array([mean_11, mean_12])


# Data set 2, zero centered and stretched with covariance matrix C

C = array([[1, -0.7], [3.5, .7]])

gaussian_2 = dot(randn(n_samples, m_features), C)


# Concatenate the two Gaussians to obtain the training data set
X_test = vstack([gaussian_1, gaussian_2])


# Predict group
y_test = clf.predict(X_test)

print(y_test)

OUTPUT

[0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1]

For simplicity, fit and predict can be combined with the .fit_predict method to directly get the labels for each sample. Here is an example where we fit the model to the test data and directly extract their predicted labels.

PYTHON

components = 2

clf_2 = GaussianMixture(n_components=components, covariance_type='full')

labels = clf_2.fit_predict(X_test)
print(labels)

OUTPUT

[1 1 1 1 1 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0]

The probabilities of the predictions are obtained from the method .predict_proba. In this case, all probabilities are 0 and 1 respectively. The model is sure about their group signature.

PYTHON

y_proba = clf.predict_proba(X_test)

fig, ax = subplots()

ax.hist(y_proba, bins=10);

show()

The .sample_ method produces individual samples from the trained model. It takes the number of required samples as an input argument and yields the sample values as well as the group for each sample. Samples for each group are given with probability according to the group weights.

PYTHON

samples = clf.sample(5)

print(samples[0])
print('')
print(samples[1])

OUTPUT

[[20.06076245 20.69983729]
 [21.67317947 21.86459529]
 [ 4.90658144 -0.99981225]
 [ 4.74232451  0.2655483 ]
 [ 0.92148265  2.32782549]]

[0 0 1 1 1]

We can now redo the example with two distributions that lie closer together, i.e. making the clustering task harder.

PYTHON

n_samples = 500
m_features = 2

# Seed the random number generator
RANDOM_NUMBER = 1
seed(RANDOM_NUMBER)

# Data set 1, centered at (20, 20)

mean_11 = 2
mean_12 = 2

gaussian_1 = randn(n_samples, m_features) + array([mean_11, mean_12])


# Data set 2, zero centered and stretched with covariance matrix C

C = array([[1, -0.7], [3.5, .7]])

gaussian_2 = dot(randn(n_samples, m_features), C)


# Concatenate the two Gaussians to obtain the training data set
X_train = vstack([gaussian_1, gaussian_2])

print(X_train.shape)

fig, ax = subplots()

ax.scatter(X_train[:, 0], X_train[:, 1]);

show()

OUTPUT

(1000, 2)

PYTHON

components = 2

clf2 = GaussianMixture(n_components=components)

clf2.fit(X_train)

resolution = 100

vec_a = linspace(-40., 60., resolution)
vec_b = linspace(-20., 30., resolution)

grid_a, grid_b = meshgrid(vec_a, vec_b)

XY_statespace = c_[grid_a.ravel(), grid_b.ravel()]

Z_score = clf2.score_samples(XY_statespace)

Z_s = Z_score.reshape(grid_a.shape)

fig, ax = subplots(figsize=(8, 6))

cax = ax.contour(grid_a, grid_b, -Z_s,
           norm=LogNorm(vmin=1.0, vmax=1000.0),
           levels=logspace(0, 3, 10),
           cmap='magma'
          )

fig.colorbar(cax);

ax.scatter(X_train[:, 0], X_train[:, 1], .8)

title('Negative log-likelihood of Prediction', fontsize=16)
axis('tight');

show()
GaussianMixture(n_components=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

PYTHON

print('Model Weights: ')
print(clf2.weights_)
print('')

print('Mean coordinates: ')
print(clf2.means_)
print('')

print('Covariance Matrices: ')
print(clf2.covariances_)
print('')

y_predict = clf2.predict(X_train)

print('Predicted Labels')
print(y_predict)

OUTPUT

Model Weights:
[0.51626883 0.48373117]

Mean coordinates:
[[ 2.02822009  2.0218213 ]
 [ 0.06535904 -0.06576508]]

Covariance Matrices:
[[[ 0.95990893 -0.03850552]
  [-0.03850552  0.94445254]]

 [[14.15940548  1.77380246]
  [ 1.77380246  0.97555916]]]

Predicted Labels
[0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 1 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0
 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 1
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1
 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 0 1 1 1 1 0 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 1 0 1 1 1 1 0
 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 0 0 1
 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 0 1 1 0 0 1 1 1 0 1
 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 0 1 1 1 0 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1
 1]

Scoring of Predictions

Knowing the origin of the data we can now compare the predicted labels with the true labels (the ground truth) and obtain a scoring. A function provided by Scikit-learn is the (adjusted or unadjusted) Rand index. It measures the similarity of the predicted and the true assignments. However, random assignment of labels will (by chance) lead to a number of correct predictions. To adjust for this fact and ensure that randomly assigned labels get a scoring close to zero, the function to use is adjusted_rand_score:

PYTHON

from sklearn.metrics.cluster import adjusted_rand_score

y_true = zeros(2*n_samples)
y_true[n_samples:] = 1

scoring = adjusted_rand_score(y_true, y_predict)

print(scoring)

OUTPUT

0.6174145036659798

The result shows that even though the two distributions are strongly overlapping, there is still a reasonable score based on the known ground truth.

It is important to remember that the ground truth is typically not known. There are therefore also measures to score the outcome based on within-data criteria. See internal evaluation of the wikipedia article for some techniques.

In general, the outcome of clustering is not easy to assess with confidence and specific measures need to be developed based on additional knowledge about the source of the data.

Application to Example Data


Let us now apply the GMM approach to the example at the beginning of the lesson.

PYTHON

from pandas import read_csv

df = read_csv("data/patients_data.csv")

df['Weight'] = 0.45*df['Weight']
df['Height'] = 2.54*df['Height']

X_train = df[['Weight', 'Height']]
X_train = X_train.to_numpy()

print(X_train.shape)

OUTPUT

(100, 2)

Now we can fit the GMM classifier using the suspected number of two components.

PYTHON

clf = GaussianMixture(n_components=2)

clf.fit(X_train)

resolution = 100

vec_a = linspace(0.8*min(X_train[:,0]), 1.2*max(X_train[:,0]), resolution)
vec_b = linspace(0.8*min(X_train[:,1]), 1.2*max(X_train[:,1]), resolution)

grid_a, grid_b = meshgrid(vec_a, vec_b)

XY_statespace = c_[grid_a.ravel(), grid_b.ravel()]

Z_score = clf.score_samples(XY_statespace)

Z_s = Z_score.reshape(grid_a.shape)


fig, ax = subplots(figsize=(8, 6))

cax = ax.contour(grid_a, grid_b, -Z_s,
           norm=LogNorm(vmin=1.0, vmax=1000.0),
           levels=logspace(0, 3, 10),
           cmap='magma'
          )

fig.colorbar(cax);

ax.scatter(X_train[:, 0], X_train[:, 1], .8)

title('Negative log-likelihood of Prediction', fontsize=16)
axis('tight');

show()
GaussianMixture(n_components=2)
In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook.
On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.

These predictions can now be compared with labels in the data, for example the Gender. To check the outcome of the fitted model versus the gender, we obtain the predicted labels from the model. We can compare this with the Gender labels in the data:

PYTHON

y_predict = clf.predict(X_train)

gender_boolean = df['Gender'] == 'Female'

y_gender = gender_boolean.to_numpy()

scoring = adjusted_rand_score(y_gender, y_predict)

print(scoring)

OUTPUT

1.0

In this case, the predictions from the GMM coincide 100 % with the gender label in the data. The outcome is therefore perfect in both cases.

We can also compare the predictions with the smoker labels:

PYTHON

y_smoker = df['Smoker']

scoring = adjusted_rand_score(y_smoker, y_predict)

print(scoring)

OUTPUT

0.039367492745118096

This result shows that the GMM labelling is arbitrary when compared the smoker labels in the data.

From the trained model we create the individual predicted distributions for each group.

PYTHON

group1_mean = clf.means_[0]
group1_cov  = clf.covariances_[0]

group2_mean = clf.means_[1]
group2_cov  = clf.covariances_[1]

samples = 1000


group1_data = multivariate_normal(group1_mean, group1_cov, samples)
group2_data = multivariate_normal(group2_mean, group2_cov, samples)

fig, ax = subplots(ncols=2, figsize=(12, 6))

ax[1].scatter(group1_data[:, 0], group1_data[:, 1], c='r');
ax[1].scatter(group2_data[:, 0], group2_data[:, 1], c='b');
ax[1].set_xlabel('Height', fontsize=16)

ax[0].scatter(df['Weight'], df['Height']);
ax[0].set_xlabel('Height', fontsize=16)
ax[0].set_ylabel('Weight', fontsize=16)

fig.suptitle('Scatter plot from Data (left) and Model (right)', fontsize=16);

show()

Exercises


End of chapter Exercises

Create the training and prediction workflow as above for a data set with two other features, namely: Diastole and Systole values from the ‘patients_data.csv’ file.

  1. Extract the Diastole and Systole columns.

  2. Use the data to fit a Gaussian model with 2 components and create a state space contour plot of the negative log likelihood with scattered data superimposed.

  3. Extract the model weights, the means of the two Gaussians and their corresponding covariance matrices.

  4. Calculate the adjusted random score for the labels ‘gender’ and ‘smoker’ in the data to estimate whether these have som overlap with the model fit.

  5. Compare the original scatter plot versus the model generated scatter plot. Use a total of 100 samples for the model generated data and distribute them according to the model weights.

  6. Repeat the plot multiple times to see how the degree of overlap in the model output changes with each choice of samples from the fitted distribution.

  7. Create corresponding histograms of the Diastolic and Systolic blood pressure values from data and model. Try to guess where the differences in appearance come from.

The data show systematic gaps in the histogram meaning that some values do not occur (integer values only). In contrast, the model data from the random number generator can take any value. Therefore the counts per bin are generally lower for the model.

Key Points

  • The automated assignment of data points to distinct groups is called clustering.
  • Gaussian Mixture Models (GMM) is one example of cluster analysis.
  • .fit, ..score_samples and .predict are some of the key methods in GMM clustering.
  • adjusted_rand_score method randomly assigns labels for prediction scoring.

Content from Clustering Images


Last updated on 2025-01-27 | Edit this page

Estimated time: 12 minutes

Download Chapter notebook (ipynb)

Download Chapter PDF

Mandatory Lesson Feedback Survey

Overview

Questions

  • What makes image data unique for machine learning?
  • How can MR images be clustered and segmented?
  • How can segmentation be improved?
  • How do we visualise clustered image data?

Objectives

  • Learning to work with a domain-specific Python package.
  • Clustering with Gaussian Mixture Models (GMM) to segment a medical image.
  • Combining information from different imaging modalities for improved segmentation.
  • Understanding strategies to visualise clustered output.



Prerequisite

Concept


Image Segmentation with Clustering

Medical imaging techniques are a valuable tool for probing diseases and disorders non-invasively. Processing medical images often consists of an expert, such as a radiologist, looking at the images and identifying, labelling and characterising potential lesions. This process can be time-consuming and reliant on a well-trained expert’s eye. To make medical imaging techniques feasible in circumstances where there may be insufficient time or resources for expert labelling, there is current research into using artificial intelligence to label images. There are many supervised learning techniques that utilise previously labelled data from experts to train a computer algorithm to recognise certain features of an image. However, this may require large amounts of data that were previously labelled, which again is not always available. An alternative approach is to use unsupervised learning strategies, such as clustering, to group images into different regions. The interpretation of these regions may be ambiguous, but with some previous knowledge, we may still use it to infer information from an image.

Medical Image Example

The example used in this lesson is part of the National Cancer Institute’s Clinical Proteomic Tumor Analysis Consortium Glioblastoma Multiforme (CPTAC-GBM) cohort and is available at The Cancer Imaging Archive: https://wiki.cancerimagingarchive.net/display/Public/CPTAC-GBM. For each subject in this study, several different brain MRI scans were performed, each of which gives different contrast in the brain. Each subject has been diagnosed with glioblastoma, and a tumour is visible in the MRI scans. To analyse the images and to, for example, estimate the size of the tumour, we may wish to segment the brain into healthy tissue and tumour tissue. The figure below shows four images in the different modalities.

Figure 1: Example image, different MRI modalities imaging glioblastoma
Figure 1: Example image, different MRI modalities imaging glioblastoma

Work Through Example


Code Preparation

We first import the modules needed for this lesson. We use Numpy to store and process images and we use nibabel to read the MRI images, which have a file type called ‘nifti’. Nibabel is freely available for download here: https://nipy.org/nibabel/

PYTHON

from numpy import zeros, sum, stack
import nibabel as nib
from matplotlib.pyplot import subplots, tight_layout, show

Note

Note how we import the nibabel package as ‘nib’. You can use any abbreviation to access the package’s functions from within your programme.

To familiarise yourself with the nibabel package, try the Getting started tutorial using an example image file.

Reading Images into Numpy Arrays

Next, we want to use the nibabel package to read the MRI images into Numpy arrays. In this example, we use four different images that were acquired with different MRI protocols.

PYTHON

img_3d = nib.load('fig/t1.nii')
img1 = img_3d.get_fdata()

img_3d = nib.load('fig/t1_contrast.nii')
img2 = img_3d.get_fdata()

img_3d = nib.load('fig/flair.nii')
img3 = img_3d.get_fdata()

img_3d = nib.load('fig/adc.nii')
img4 = img_3d.get_fdata()

Let’s have a look at the data shape:

PYTHON

print(img1.shape)
(256, 256, 32)

For plotting, we select a slice from the images. In this example we will view axial slices, i.e. slices from the last dimension. Thus, we choose a slice number between 0 and 31, here we go with slice 20 and plot it.

PYTHON

img_slice = 20

PYTHON

fig, ax = subplots(nrows=1, ncols=4, figsize=(20, 10))

ax[0].imshow(img1[:, :, img_slice], cmap='gray')
ax[0].set_title("T1", fontsize=16)
ax[1].imshow(img2[:, :, img_slice], cmap='gray')
ax[1].set_title("T1 with contrast agent", fontsize=16)
ax[2].imshow(img3[:, :, img_slice], cmap='gray')
ax[2].set_title("FLAIR", fontsize=16)
ax[3].imshow(img4[:, :, img_slice], cmap='gray')
ax[3].set_title("Apparent diffusion coefficient", fontsize=16);

show()

Data Pre-processing

To analyse the images, we need to do a bit of pre-processing. First of all, let us plot the histogram of the voxel (volume pixel) intensities.

PYTHON

fig, ax = subplots(nrows=1, ncols=4, figsize=(20, 4))

ax[0].hist(img1.flatten(), bins=50);
ax[0].set_title("T1", fontsize=16)
ax[0].set_xlabel("Intensity", fontsize=16)

ax[1].hist(img2.flatten(), bins=50);
ax[1].set_title("T1 with contrast agent", fontsize=16)
ax[1].set_xlabel("Intensity", fontsize=16)

ax[2].hist(img3.flatten(), bins=50);
ax[2].set_title("FLAIR", fontsize=16)
ax[2].set_xlabel("Intensity", fontsize=16)

ax[3].hist(img4.flatten(), bins=50);
ax[3].set_title("Apparent diffusion coefficient", fontsize=16)
ax[3].set_xlabel("Intensity", fontsize=16)

tight_layout()

print('Number of voxels with intensity equal to 0 is: %d'%sum(img1==0))
print('')

show()

OUTPUT

Number of voxels with intensity equal to 0 is: 1848804

As we can see from these histograms, a large number of the values are zero. This corresponds to the background voxels shown in black. We want to remove these voxels, as they are not useful for our analysis. For this, we create a binary mask and apply it to the images.

Note

Note the use of tight_layout in the cell above. It is a Matplotlib function to pad between the figure edge and the edges of subplots. This can be useful to avoid overlap of figures and labels. The keyword parameter pad is set to 1.08 by default.

PYTHON

mask = (img1>0) & (img2>0) & (img3>0) & (img4>0)

img1_nz = img1[mask]
img2_nz = img2[mask]
img3_nz = img3[mask]
img4_nz = img4[mask]

With the mask applied, let us plot the histograms of the non-zero voxels again:

PYTHON

fig, ax = subplots(1, 4, figsize=(20, 4))

ax[0].hist(img1_nz, bins=50);
ax[0].set_title("T1", fontsize=16)
ax[0].set_xlabel("Intensity", fontsize=16)

ax[1].hist(img2_nz, bins=50);
ax[1].set_title("T1 with contrast agent", fontsize=16)
ax[1].set_xlabel("Intensity", fontsize=16)

ax[2].hist(img3_nz, bins=50);
ax[2].set_title("FLAIR", fontsize=16)
ax[2].set_xlabel("Intensity", fontsize=16)

ax[3].hist(img4_nz, bins=50);
ax[3].set_title("Apparent diffusion coefficient", fontsize=16)
ax[3].set_xlabel("Intensity", fontsize=16)

tight_layout()

show()

We can see that the data is no longer confounded by the zero-valued background voxels. The distribution of relevant intensities now becomes apparent.

Image scaling

In many machine learning applications (both supervised and unsupervised) an additional step of data preparation consists in normalising or scaling, i.e. adjustment of the values under certain conditions. For example, the numbers in a data file are all positive and very large but the algorithms work best for numbers with mean zero and variance 1. In Scikit-learn this can be done by using fit_transform for an instance of the StandardScaler.

PYTHON

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

img1_scaled = scaler.fit_transform(img1_nz.reshape(-1, 1))
img2_scaled = scaler.fit_transform(img2_nz.reshape(-1, 1))
img3_scaled = scaler.fit_transform(img3_nz.reshape(-1, 1))
img4_scaled = scaler.fit_transform(img4_nz.reshape(-1, 1))

fig, ax = subplots(1, 4, figsize=(20, 4))

ax[0].hist(img1_scaled, bins=50);
ax[0].set_title("T1", fontsize=16)
ax[0].set_xlabel("Intensity", fontsize=16)

ax[1].hist(img2_scaled, bins=50);
ax[1].set_title("T1 with contrast agent", fontsize=16)
ax[1].set_xlabel("Intensity", fontsize=16)

ax[2].hist(img3_scaled, bins=50);
ax[2].set_title("FLAIR", fontsize=16)
ax[2].set_xlabel("Intensity", fontsize=16)

ax[3].hist(img4_scaled, bins=50);
ax[3].set_title("Apparent diffusion coefficient", fontsize=16)
ax[3].set_xlabel("Intensity", fontsize=16)

tight_layout()

show()

If you compare the histograms, you can see that the values in the data have changed (horizontal axis) but the shapes of the distributions are the same.

We are not pursuing this further here but you are encouraged to re-do the clustering below with the scaled images and check if there are any differences.

Image Segmentation with Clustering

After this data cleaning step, we can proceed with our analysis. We want to segment the images into brain tissue and tumour tissue. It is not obvious how to do this, as the intensity values in the above histogram are continuous with only one major peak in intensity. We will nonetheless attempt to cluster the images using a Gaussian Mixture Model (GMM).

First, we import the GMM class from Scikit-learn.

PYTHON

from sklearn.mixture import GaussianMixture

We then fit the instantiated model with a few different numbers of clusters (argument n_components, between n = 2-4) individually for each image. We use the fit_predict method to simultaneously fit and label the images. Note that we also add 1 to each image label. This is because each data point gets labelled with a number from 0 to n-1 where n is the number of clusters used to fit the model. At the plotting stage, we do not want any of our labels to be equal to 0, as this corresponds to the background.

PYTHON

RANDOM_SEED = 123

gmm_2 = gmm_2 = GaussianMixture(2, random_state=RANDOM_SEED)
img1_n2_labels = gmm_2.fit_predict(img1_nz.reshape(-1, 1))
img1_n2_labels += 1

gmm_3 = GaussianMixture(3, random_state=RANDOM_SEED)
img1_n3_labels = gmm_3.fit_predict(img1_nz.reshape(-1, 1))
img1_n3_labels += 1

gmm_4 = GaussianMixture(4, random_state=RANDOM_SEED)
img1_n4_labels = gmm_4.fit_predict(img1_nz.reshape(-1, 1))
img1_n4_labels += 1

PYTHON

gmm_2 = GaussianMixture(2, random_state=RANDOM_SEED)
img2_n2_labels = gmm_2.fit_predict(img2_nz.reshape(-1, 1))
img2_n2_labels += 1

gmm_3 = GaussianMixture(3, random_state=RANDOM_SEED)
img2_n3_labels = gmm_3.fit_predict(img2_nz.reshape(-1, 1))
img2_n3_labels += 1

gmm_4 = GaussianMixture(4, random_state=RANDOM_SEED)
img2_n4_labels = gmm_4.fit_predict(img2_nz.reshape(-1, 1))
img2_n4_labels += 1

PYTHON

gmm_2 = GaussianMixture(2, random_state=RANDOM_SEED)
img3_n2_labels = gmm_2.fit_predict(img3_nz.reshape(-1, 1))
img3_n2_labels += 1

gmm_3 = GaussianMixture(3, random_state=RANDOM_SEED)
img3_n3_labels = gmm_3.fit_predict(img3_nz.reshape(-1, 1))
img3_n3_labels += 1

gmm_4 = GaussianMixture(4, random_state=RANDOM_SEED)
img3_n4_labels = gmm_4.fit_predict(img3_nz.reshape(-1, 1))
img3_n4_labels += 1

PYTHON

gmm_2 = GaussianMixture(2, random_state=RANDOM_SEED)
img4_n2_labels = gmm_2.fit_predict(img4_nz.reshape(-1, 1))
img4_n2_labels += 1

gmm_3 = GaussianMixture(3, random_state=RANDOM_SEED)
img4_n3_labels = gmm_3.fit_predict(img4_nz.reshape(-1, 1))
img4_n3_labels += 1

gmm_4 = GaussianMixture(4, random_state=RANDOM_SEED)
img4_n4_labels = gmm_4.fit_predict(img4_nz.reshape(-1, 1))
img4_n4_labels += 1

Once we have all our image labels, we map the labels back to the two-dimensional image array and plot the result.

PYTHON

img1_n2_labels_mapped = zeros(img1.shape)
img1_n2_labels_mapped[mask] = img1_n2_labels

img1_n3_labels_mapped = zeros(img1.shape)
img1_n3_labels_mapped[mask] = img1_n3_labels

img1_n4_labels_mapped = zeros(img1.shape)
img1_n4_labels_mapped[mask] = img1_n4_labels

PYTHON

img2_n2_labels_mapped = zeros(img2.shape)
img2_n2_labels_mapped[mask] = img2_n2_labels

img2_n3_labels_mapped = zeros(img2.shape)
img2_n3_labels_mapped[mask] = img2_n3_labels

img2_n4_labels_mapped = zeros(img2.shape)
img2_n4_labels_mapped[mask] = img2_n4_labels

PYTHON

img3_n2_labels_mapped = zeros(img3.shape)
img3_n2_labels_mapped[mask] = img3_n2_labels

img3_n3_labels_mapped = zeros(img3.shape)
img3_n3_labels_mapped[mask] = img3_n3_labels

img3_n4_labels_mapped = zeros(img3.shape)
img3_n4_labels_mapped[mask] = img3_n4_labels

PYTHON

img4_n2_labels_mapped = zeros(img4.shape)
img4_n2_labels_mapped[mask] = img4_n2_labels

img4_n3_labels_mapped = zeros(img4.shape)
img4_n3_labels_mapped[mask] = img4_n3_labels

img4_n4_labels_mapped = zeros(img4.shape)
img4_n4_labels_mapped[mask] = img4_n4_labels

PYTHON

fig, ax = subplots(3, 4, figsize=(20, 15))

ax[0, 0].imshow(img1_n2_labels_mapped[:, :, img_slice], cmap='viridis')
ax[1, 0].imshow(img1_n3_labels_mapped[:, :, img_slice], cmap='viridis')
ax[2, 0].imshow(img1_n4_labels_mapped[:, :, img_slice], cmap='viridis')
ax[0, 1].imshow(img2_n2_labels_mapped[:, :, img_slice], cmap='viridis')
ax[1, 1].imshow(img2_n3_labels_mapped[:, :, img_slice], cmap='viridis')
ax[2, 1].imshow(img2_n4_labels_mapped[:, :, img_slice], cmap='viridis')

ax[0, 2].imshow(img3_n2_labels_mapped[:, :, img_slice], cmap='viridis')
ax[1, 2].imshow(img3_n3_labels_mapped[:, :, img_slice], cmap='viridis')
ax[2, 2].imshow(img3_n4_labels_mapped[:, :, img_slice], cmap='viridis')

ax[0, 3].imshow(img4_n2_labels_mapped[:, :, img_slice], cmap='viridis')
ax[1, 3].imshow(img4_n3_labels_mapped[:, :, img_slice], cmap='viridis')
ax[2, 3].imshow(img4_n4_labels_mapped[:, :, img_slice], cmap='viridis')

ax[0, 0].set_ylabel("2 clusters", fontsize=18)
ax[1, 0].set_ylabel("3 clusters", fontsize=18)
ax[2, 0].set_ylabel("4 clusters", fontsize=18)

ax[0, 0].set_title("Image 1", fontsize=18)
ax[0, 1].set_title("Image 2", fontsize=18)
ax[0, 2].set_title("Image 3", fontsize=18)
ax[0, 3].set_title("Image 4", fontsize=18)

tight_layout()

show()

This figure shows the labels acquired from each of the images, using different numbers of clusters. We see that using Image 3 (acquired with FLAIR protocol), the lesion is segmented quite well from the rest of the brain. The other images are less effective at clearly identifying the lesion. However, some of these images, e.g. Image 4 (apparent diffusion coefficient) performs better at segmenting brain tissue from surrounding cerebrospinal fluid (CSF). CSF is not part of brain tissue and can contaminate our results. Ideally, we want to segment three key areas: brain, lesion and CSF.

Combining Contrast from Different Images

So far, we only used the intensities of each image individually, i.e. using only one feature. We now try to combine the images into a single Numpy array containing four columns, one for each image.

PYTHON

all_img = stack([img1_nz, img2_nz, img3_nz, img4_nz], axis=1)
all_img.shape

(240391, 4)

PYTHON

gmm_3 = GaussianMixture(3, random_state=RANDOM_SEED)

all_img_n3_labels = gmm_3.fit_predict(all_img)
all_img_n3_labels += 1

PYTHON

all_img_n3_labels_mapped = zeros(img1.shape)
all_img_n3_labels_mapped[mask] = all_img_n3_labels

PYTHON

fig, ax = subplots(1, 5, figsize=(20, 5))
ax[0].imshow(img1_n3_labels_mapped[:, :, img_slice], cmap='viridis')
ax[1].imshow(img2_n3_labels_mapped[:, :, img_slice], cmap='viridis')
ax[2].imshow(img3_n3_labels_mapped[:, :, img_slice], cmap='viridis')
ax[3].imshow(img4_n3_labels_mapped[:, :, img_slice], cmap='viridis')
ax[4].imshow(all_img_n3_labels_mapped[:, :, img_slice], cmap='viridis')

ax[0].set_ylabel("3 clusters", fontsize=18)

ax[0].set_title("Image 1", fontsize=18)
ax[1].set_title("Image 2", fontsize=18)
ax[2].set_title("Image 3", fontsize=18)
ax[3].set_title("Image 4", fontsize=18)
ax[4].set_title("All images", fontsize=18)

tight_layout()

show()

Here, the last column shows the cluster results when all four images are used in the Gaussian mixture model. These results seem to be better than the individual images, and with three clusters, the lesion, CSF and brain tissue seem clearly identified.

Let’s plot some of the other image slices to check that the segmentation performs well on the whole 3D image.

PYTHON

fig, ax = subplots(5, 5, figsize=(20, 20))

ax[0, 0].imshow(img1[:, :, 16], cmap='gray')
ax[0, 1].imshow(img1[:, :, 18], cmap='gray')
ax[0, 2].imshow(img1[:, :, 20], cmap='gray')
ax[0, 3].imshow(img1[:, :, 22], cmap='gray')
ax[0, 4].imshow(img1[:, :, 24], cmap='gray')

ax[1, 0].imshow(img2[:, :, 16], cmap='gray')
ax[1, 1].imshow(img2[:, :, 18], cmap='gray')
ax[1, 2].imshow(img2[:, :, 20], cmap='gray')
ax[1, 3].imshow(img2[:, :, 22], cmap='gray')
ax[1, 4].imshow(img2[:, :, 24], cmap='gray')

ax[2, 0].imshow(img3[:, :, 16], cmap='gray')
ax[2, 1].imshow(img3[:, :, 18], cmap='gray')
ax[2, 2].imshow(img3[:, :, 20], cmap='gray')
ax[2, 3].imshow(img3[:, :, 22], cmap='gray')
ax[2, 4].imshow(img3[:, :, 24], cmap='gray')

ax[3, 0].imshow(img4[:, :, 16], cmap='gray')
ax[3, 1].imshow(img4[:, :, 18], cmap='gray')
ax[3, 2].imshow(img4[:, :, 20], cmap='gray')
ax[3, 3].imshow(img4[:, :, 22], cmap='gray')
ax[3, 4].imshow(img4[:, :, 24], cmap='gray')

ax[4, 0].imshow(all_img_n3_labels_mapped[:, :, 16], cmap='viridis')
ax[4, 1].imshow(all_img_n3_labels_mapped[:, :, 18], cmap='viridis')
ax[4, 2].imshow(all_img_n3_labels_mapped[:, :, 20], cmap='viridis')
ax[4, 3].imshow(all_img_n3_labels_mapped[:, :, 22], cmap='viridis')
ax[4, 4].imshow(all_img_n3_labels_mapped[:, :, 24], cmap='viridis')

ax[0, 0].set_title("Slice 16", fontsize=16)
ax[0, 1].set_title("Slice 18", fontsize=16)
ax[0, 2].set_title("Slice 20", fontsize=16)
ax[0, 3].set_title("Slice 22", fontsize=16)
ax[0, 4].set_title("Slice 24", fontsize=16)

ax[0, 0].set_ylabel("Image 1", fontsize=16)
ax[1, 0].set_ylabel("Image 2", fontsize=16)
ax[2, 0].set_ylabel("Image 3", fontsize=16)
ax[3, 0].set_ylabel("Image 4", fontsize=16)
ax[4, 0].set_ylabel("Clustered labels", fontsize=16);

show()

Overall, the lesion, shown in yellow, seems to be segmented well across the volume.

Checking the GMM Labels

To investigate how the image intensities were clustered, we can look at the scatter plots for each combination of images. The diagonal plots show histograms of each image. This type of plot can be very useful in exploratory data analysis.

Note

Note that this plot might take a bit longer to run, as there are a very large number of data points.

In a python plotting library called seaborn, such plots are called pairplots and can be very easily plotted if your data is in a pandas dataframe.

You can install it at the command prompt (Windows) or terminal (MacOS, Linux) using:

conda install seaborn

To use it, import the required functions in your Python kernel, e.g.:

PYTHON

from seaborn import pairplot

ModuleNotFoundError: No module named ‘seaborn’

We don’t use this library here, but encourage you to look up further information in the seaborn documentation.

PYTHON

fig, ax = subplots(4, 4, figsize=(20, 20))

ax[0, 0].hist(img1_nz, bins=50);
ax[0, 0].set_title('Image 1', fontsize=16)
ax[0, 1].scatter(img1_nz, img2_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[0, 1].set_xlabel('Image 1', fontsize=16)
ax[0, 1].set_ylabel('Image 2', fontsize=16)
ax[0, 2].scatter(img1_nz, img3_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[0, 2].set_xlabel('Image 1', fontsize=16)
ax[0, 2].set_ylabel('Image 3', fontsize=16)
ax[0, 3].scatter(img1_nz, img4_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[0, 3].set_xlabel('Image 1', fontsize=16)
ax[0, 3].set_ylabel('Image 4', fontsize=16)

ax[1, 0].scatter(img2_nz, img1_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[1, 0].set_xlabel('Image 2', fontsize=16)
ax[1, 0].set_ylabel('Image 1', fontsize=16)
ax[1, 1].hist(img2_nz, bins=50);
ax[1, 1].set_title('Image 2', fontsize=16)
ax[1, 2].scatter(img2_nz, img3_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[1, 2].set_xlabel('Image 2', fontsize=16)
ax[1, 2].set_ylabel('Image 3', fontsize=16)
ax[1, 3].scatter(img2_nz, img4_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[1, 3].set_xlabel('Image 2', fontsize=16)
ax[1, 3].set_ylabel('Image 4', fontsize=16)

ax[2, 0].scatter(img3_nz, img1_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[2, 0].set_xlabel('Image 3', fontsize=16)
ax[2, 0].set_ylabel('Image 1', fontsize=16)
ax[2, 1].scatter(img3_nz, img2_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[2, 1].set_xlabel('Image 3', fontsize=16)
ax[2, 1].set_ylabel('Image 2', fontsize=16)
ax[2, 2].hist(img3_nz, bins=50);
ax[2, 2].set_title('Image 3', fontsize=16)
ax[2, 3].scatter(img3_nz, img4_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[2, 3].set_xlabel('Image 3', fontsize=16)
ax[2, 3].set_ylabel('Image 4', fontsize=16)

ax[3, 0].scatter(img4_nz, img1_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[3, 0].set_xlabel('Image 4', fontsize=16)
ax[3, 0].set_ylabel('Image 1', fontsize=16)
ax[3, 1].scatter(img4_nz, img2_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[3, 1].set_xlabel('Image 4', fontsize=16)
ax[3, 1].set_ylabel('Image 2', fontsize=16)
ax[3, 2].scatter(img4_nz, img3_nz, c=all_img_n3_labels, cmap='viridis', vmin=0);
ax[3, 2].set_xlabel('Image 4', fontsize=16)
ax[3, 2].set_ylabel('Image 3', fontsize=16)
ax[3, 3].hist(img4_nz, bins=50);
ax[3, 3].set_title('Image 4', fontsize=16)

fig.tight_layout()
show()

The colours in the scatter plots above correspond to the labels we extracted using all four images and three clusters. I.e. green corresponds to healthy brain tissue, yellow corresponds to CSF and blue corresponds to the lesion. This figure shows reasonably well how CSF (yellow) and lesion (blue) can be clustered. However, it is not as easy to see how the healthy tissue was separated from CSF and lesion tissue. To investigate further, we can plot the above slightly differently, using a 2-dimensional histogram instead of a scatter plot.

A 2-dimensional histogram plots the counts of values in bins for two variables. The results are displayed as a heatmap. An intuitive example with code using the Matplotlib function hist2d is available here.|

PYTHON

import matplotlib.colors as mcolors

fig, ax = subplots(4, 4, figsize=(20, 20))

ax[0, 0].hist(img1_nz, bins=50);
ax[0, 1].hist2d(img1_nz, img2_nz, bins=100, norm=mcolors.PowerNorm(0.2));
ax[0, 2].hist2d(img1_nz, img3_nz, bins=100, norm=mcolors.PowerNorm(0.2));
ax[0, 3].hist2d(img1_nz, img4_nz, bins=100, norm=mcolors.PowerNorm(0.2));

ax[1, 0].hist2d(img2_nz, img1_nz, bins=100, norm=mcolors.PowerNorm(0.2));
ax[1, 1].hist(img2_nz, bins=50);
ax[1, 2].hist2d(img2_nz, img3_nz, bins=100, norm=mcolors.PowerNorm(0.2));
ax[1, 3].hist2d(img2_nz, img4_nz, bins=100, norm=mcolors.PowerNorm(0.2));

ax[2, 0].hist2d(img3_nz, img1_nz, bins=100, norm=mcolors.PowerNorm(0.2));
ax[2, 1].hist2d(img3_nz, img2_nz, bins=100, norm=mcolors.PowerNorm(0.2));
ax[2, 2].hist(img3_nz, bins=50);
ax[2, 3].hist2d(img3_nz, img4_nz, bins=100, norm=mcolors.PowerNorm(0.2));

ax[3, 0].hist2d(img4_nz, img1_nz, bins=100, norm=mcolors.PowerNorm(0.2));
ax[3, 1].hist2d(img4_nz, img2_nz, bins=100, norm=mcolors.PowerNorm(0.2));
ax[3, 2].hist2d(img4_nz, img3_nz, bins=100, norm=mcolors.PowerNorm(0.2));
ax[3, 3].hist(img4_nz, bins=50);

show()

Note in the plot, we used a PowerNorm normalisation on the image intensities. This is just to aid with visualisation, and you are welcome to change or completely remove the normalisation.

The plots show that there is a bright, high-density region corresponding to the clustered healthy tissue region. This gives us a better idea how the GMM algorithm found the three regions. Healthy tissue has low signal variance in all 4 images. Signal intensity in CSF and the lesion have much higher variance making it possible to distinguish them from healthy tissue. Furthermore, the relative intensities of CSF and lesion tissue are different as shown in the scatter plots, making it possible for the GMM to distinguish between the two.

Exercises


End of chapter Exercises

In this assignment, we ask you to use the same set of images as in the work through example. However, instead of GMM, we want you to try a different clustering method called KMeans. The documentation for KMneans is available here: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html. Some examples of how kmeans clustering can go wrong are shown in this example code.

Using KMeans from ‘sklearn.cluster’, do the following tasks:

  1. Investigate different numbers of clusters, similarly to what we did in th work through example.

  2. Use different combinations of the 4 images to see how the clustering performs in different cases.

  3. The labelled results using all four images may not look as clean as the ones in the work-through example. Try scaling the images e.g. using the sklearn standard scaler, and combining the scaled images. Do the results change? If yes, explore and comment on why you think scaling may be advantageous in this clustering example.

  4. Compare the behaviour of KMeans to the outcome with GaussianMixture.

Further Reading


If after this lesson you want to deepen your understanding of clustering and, in particular, want to compare the performance of different clustering methods when dealing with images, try the article Clustering techniques for neuroimaging applications. It is paywalled and you will need an institutional access to download.

Key Points

  • Image analysis almost always requires a bit of pre-processing.
  • Image scaling is performed by using fit_transform method from module StandardScaler in Scikit-learn.
  • GMM offers a good startig point in image clustering.
  • Diagonal plots are very useful in exploratory data analysis.

Content from Dimensionality Reduction


Last updated on 2025-01-27 | Edit this page

Estimated time: 12 minutes

Download Chapter notebook (ipynb)

Download Chapter PDF

Mandatory Lesson Feedback Survey

Overview

Questions

  • Why is it important to perform dimensionality reduction?
  • How is dimensionality reduction performed?
  • How PCA is used to determine variance?
  • When does PCA fail?

Objectives

  • Understanding high-dimensional datasets.
  • Applying dimensionality reduction to extract features.
  • Learning to use PCA for dimensionality reduction.
  • Estimating the optimal number of features.




Code Preparation

PYTHON

from numpy import reshape, append, mean, pi, linspace, var, array, cumsum, arange
from numpy.random import seed, multivariate_normal, binomial
from numpy.linalg import norm

from math import cos, sin, atan2
from matplotlib.pyplot import subplots, xlabel, ylabel, scatter, title
from matplotlib.pyplot import figure, plot, xticks, xlabel, ylabel, suptitle, show

import glob
from pathlib import Path
from pandas import read_csv, DataFrame
import csv

Introduction


Clinical and scientific datasets often contain measurements of many physiological variables, expression of many genes or metabolic markers. These datasets, where each individual observation/sample is described by many variables are referred to as high-dimensional. However, some of these features can be highly correlated or redundant, and we sometimes want a more condensed description of the data. This process of describing the data by a reduced number of new combined features that reduces redundancy, is called dimensionality reduction.

Goal of decomposition
Goal of decomposition

There are several reasons for performing dimensionality reduction, both for data exploration and for pre-processing:

  • Data compression, to optimize memory usage/loading speed and visualisation of “important features”.

  • To find the individual variable or combinations of variables that give a condensed but accurate description of the data (sometimes also called feature extraction).

  • It can also be used for “denoising” - removing features that only account for small variability.

  • It is also useful to perform this step before other supervised (regression, classification) or unsupervised learning procedures (like clustering), by reducing data to “main features” to improve robustness of learning, especially with a limited number of independent observations.

Example: Predicting outcome from large-scale genetic or molecular studies


Clinical research can involve doing an exhaustive measurement of all tentative diagnostic markers, for example levels of various cytokines and hormones, or searching for genetic markers of a disease. There is good reason to expect that these share regulatory/signalling mechanisms, and that their interaction determines disease outcome. Building a regression model including all measurements is problematic because of multicollinearity and the demand for a much larger number of observations.

By performing dimensionality reduction, we first find features that capture the major patterns of covariation of these factors in the sample population. Then we will use these compact features, rather than individual measurements, to train our classifier or regression model, to study outcomes. With fewer features, we train models with less data.

Loading example data: Childhood Onset Rheumatic Disease gene expression profile

Gene expression data (for selected markers) is available for 83 individuals, including some patients with varying forms of early-onset rheumatic disease (available at EMBL-Bioinformatics database, also included in folder). This is a typical example of high dimensional datasets, with transcript levels for >50,000 genes measured!

Here, the data are read from individual files in the ‘Dataset’ folder and combined into a Pandas dataframe called ‘geneData’.

PYTHON

# Load all data
nGenes = 1000 # keep only these genes for now

filectr=0

datapath = Path("data")    #/relative/path/to/folder/with/datsets

for txt in datapath.glob("*.txt"):
    b=read_csv(txt, sep='\t', usecols=[1])
    b=reshape(b.values, (1,54675)) # each file is 1 observation sample i.e. 1 row.
    if filectr==0:
        coln =  read_csv(txt, sep='\t', usecols=[0])
        a = b
    else:
        a=append( a, b, axis=0)
    filectr+=1

# convert 2D array to dataframe
geneData = DataFrame( data=a[:,range(nGenes)]- mean(a[:,range(nGenes)], axis=0), columns = coln.values[range(nGenes)])
The dataframe has 80 samples and 1000 features.

Apart from the large number of variables (compared to number of samples), we also have strongly correlated expression of various markers. We can see that by plotting the correlation matrix of the dataframe.

PYTHON

fig, ax = subplots(figsize=(8,8))
im = ax.matshow(geneData.corr())
cbar = fig.colorbar(im, shrink=0.82)
xlabel('Locus #', fontsize=14);
ylabel('Locus #', fontsize=14);

show()

The little yellow blocks indicate that various features are strongly correlation (values around 1). Further analyses (e.g. checking if we find clusters mapping to healthy individuals and patients, respectively, or whether patients with different patterns of marker expression can be identified for future diagnostic purposes) can be thus be improved if we perform a dimensionality reduction, i.e. decrease the number of features while maintaining relevant information.

Principal Component Analysis


Principal component analysis (PCA) is one of the most commonly used dimensionality reduction methods. Instead of describing the data using the original measurements, it finds a new set of ranked features (the “basis”). The new “features” are linear combinations of the original variables, such that they are all orthogonal to each other, and the “importance” of a feature (called the principal component) is generally given by how much of the variability in the dataset it captures (the so-called explained variance). Thus, we first transform the data from the original variables into a new set of features or principal components (PCs).

To get a lower dimensional description of the data, we retain only the top features (PCs) such that the largest variations or ‘patterns’ in the dataset are preserved.

To apply PCA from Scikit-learn, we need to specify the number of components (n_components) we want to reduce the data to. n_components must be less than the rank of the dataset, i.e. less than the smaller of the number of samples and number of variables measured.

Simple example with 2D generated dataset

To first explore what dimensionality reduction with PCA looks like, we will work with a simple 2-dimensional (2D) dataset, where we want to find a 1D description instead.

We start with generating some observations from a two-dimensional (multivariate) Gaussian distribution, with mean and covariance specified.

PYTHON

RND_SEED = 7890
seed(RND_SEED)

# Set up 2D multivariate Gaussians
means = [20, 20]

cov_mat = [[1, .85], [.85, 1.5]] # 2x2 covariance matrix must be symmetric and positive-semidefinite(>=0)

# Generate data
nSamples = 300

data = multivariate_normal(means, cov_mat, nSamples)

print(data.shape)

(300, 2)

We can check the distribution of the two features in histograms, and we can see how the two data dimensions relate to each other in the scatter plot.

PYTHON

# Center data
zdata = data - mean(data, axis=0)

fig, ax = subplots(ncols=3, nrows=1, figsize=(15, 4))

ax[0].hist(zdata[:, 0]);
ax[0].set_title('Dim 1', fontsize=16, fontweight='bold');
ax[0].set_xlabel('Values', fontsize=14);
ax[0].set_ylabel('Frequency', fontsize=14);

ax[1].hist(zdata[:, 1]);
ax[1].set_title('Dim 2', fontsize=16, fontweight='bold');
ax[1].set_xlabel('Values', fontsize=14);


ax[2].scatter(zdata[:, 0], zdata[:, 1]);
ax[2].set_title('Sample Data', fontsize=16, fontweight='bold')
ax[2].set_xlabel('Dim 1', fontsize=14);
ax[2].set_ylabel('Dim 2', fontsize=14);

show()

Note

The last plot, with the samples plotted as scatter in the space of measured variables is called a “state space plot”. The main axes correspond to individual variables.

Now if we want to represent data by a single feature, we can see that the individual variables are not ideal. If we only keep Dim 1, we lose substantial variability along Dim 2. We see that the data points are stretched along the main diagonal. This implies that the two variables co-vary to some extent. Thus, a linear combination of the two might capture this relevant pattern.

A linear combination corresponds to a vector (or direction) in this 2D state space.

For example, the x-axis vector (1, 0) = [1 x Dim1 + 0 x Dim2], i.e. it contains Dim 1 only. More generally, (a, b) represents the linear combination [a x Dim1 + b x Dim2], i.e. a mixture of the two variables. Typically, we ensure that the length of these vectors is 1, to avoid artefactual rescaling of data.

PCA as projection or rotation of basis

Instead of representing data by their values along Dim 1 and Dim 2, we now try to describe them along different directions. We do so by projecting 2D data onto that vector. To quantify the importance of each direction, we measure the variance of the projected samples.

Let us examine different vectors that point at various angles compared to the x-axis. Remember that the vector denoted by the angle = ( cos(angle), sin(angle) ) denotes the following linear combination of the 2D sample d=[d1, d2]:

f(d1,d2) = cos(angle) * d1 + sin(angle) * d2

PYTHON

## Centre data at origin by subtracting mean
zdata = data - mean(data, axis=0)

# Define a feature as a direction in the 2D space using the angle
angle = pi/4

# Obtain the vector
vec = [cos(angle), sin(angle)] # has norm=1 by definition, otherwise:  vec = vec/norm(vec)

# Project 2D data onto this single dimension
dim1_data = zdata.dot(vec)


fig, ax = subplots(ncols=2, nrows=1, figsize=(15, 6))

# Plot data and the feature direction
ax[0].scatter(zdata[:, 0], zdata[:, 1]);
ax[0].quiver([0], [0], vec[0], vec[1], scale=3);

ax[0].set_title('Sample Data', fontsize=16, fontweight='bold')
ax[0].set_xlabel('Dim 1', fontsize=14); ax[0].set_xlim([-4,4]);
ax[0].set_ylabel('Dim 2', fontsize=14); ax[0].set_ylim([-4,4]);

# Plot histogram of new projected data
ax[1].hist(dim1_data, bins = linspace(-3,3,9));
ax[1].set_title('Along vec', fontsize=16, fontweight='bold');
ax[1].set_xlabel('Values', fontsize=14);
ax[1].set_ylabel('Frequency', fontsize=14);

show()

Do it Yourself

Change the angle in the code above and see how distribution of projected data changes. Use e.g. 0, \(\pi\) /2, 2/3*\(\pi\), and \(\pi\). Try to estimate how much the variance changes.

Now let’s formally calculate the variance by projecting along a set of different directions. We use angles between 0 and \(\pi\) (i.e. between 0 and 180 degrees). For each direction (projection) we calculate the variance and plot it against the angle.

PYTHON

allAngles = linspace( start=0, stop=pi, num=20)
allVar = []
for angle in allAngles:
    vec = array([cos(angle), sin(angle)]) #Unit length vector on which data will be projected
    dim1_data = data.dot(vec)
    allVar.append( var(dim1_data))

# Plot variance captured by different features
scatter( allAngles, allVar );
title('Variance along different directions', fontsize=16, fontweight='bold')
xlabel('Direction (angle in radians)', fontsize=14);
ylabel('Variance of projected data', fontsize=14);

It can be seen there is a particular direction that retains maximal variance. This would be the optimal feature to give a 1D description of the 2D data while retaining maximum variability. PCA is the method that directly find this optimal direction. This makes it especially powerful for high dimensional datasets.

Find 1-D description using PCA

Let us now use dimensionality reduction PCA to directly find the direction that captures maximal variance. For an n-dimensional dataset, PCA finds a set of n new features, ranked by the variance along each feature. See the Scikit-learn documentation for information.

To get the best 1D representation of data, we instantiate the class with one component (n_components = 1) to be returned, and transform the data to the feature space of that component.

PYTHON

from sklearn.decomposition import PCA

# Initialize the PCA model
pcaResults = PCA(n_components=1, whiten=False) # specify no. of components, and whether to standardize

# Fit to data
pcaResults = pcaResults.fit(data)

# Transform the data, using our learned PCA representation
dim1_data = pcaResults.transform(data)

print( "Size of new dataset: ", dim1_data.shape )

Size of new dataset: (300, 1)

SH

Size of new dataset:  (300, 1)

From the fitted model ‘pcaResults’ we can now get:

  • The fractional variance using explained_variance_ratio_. It tells us what fraction of the total variance is retained by th reduced 1D description of the data.

  • The direction (angle) of this first PC as calculate from the arcus tangens atan of the vector. You can compare it directly to the angle in the above plot.

PYTHON

# How much variance was captured compared to original data?
print("Fractional variance captured by first PC: {:1.4f}.".format(pcaResults.explained_variance_ratio_[0]))

Fractional variance captured by first PC: 0.8722.

PYTHON

vec1 = pcaResults.components_[0,:]
print("Direction of PC1 is at angle = {:1.2f} radians".format(atan2(vec1[1],vec1[0])))

Direction of PC1 is at angle = 0.99 radians

Do it Yourself

Exercise (DIY): Change the covariance of the synthetic data, and repeat the above steps. Try to infer the relationship between the shape of the scatter, the strength of correlation between the two dimensions, and the variance captured by the first PC.

Find 2-D description using PCA

Our dataset has two features. We can thus obtain two principal components. The two components must be orthogonal to each other (form a right angle). Let us get the corresponding vectors and plot them on the scatterplot.

PYTHON

# Initialize the PCA model
pcaResults = PCA(n_components=2, whiten=False) # specify no. of components, and whether to standardize

# Fit to data
pcaResults = pcaResults.fit(data)

# Transform the data, using our learned PCA representation
dim1_data = pcaResults.transform(data)

print( "Size of new dataset: ", dim1_data.shape )

Size of new dataset: (300, 2)

PYTHON

print(pcaResults.components_)

[[ 0.54864625 0.8360546 ] [ 0.8360546 -0.54864625]]

PYTHON

vec1 = pcaResults.components_[0,:]
vec2 = pcaResults.components_[1,:]


fig, ax = subplots(figsize=(6, 6))

# Plot data and the feature direction
ax.scatter(zdata[:, 0], zdata[:, 1]);
ax.quiver([0], [0], vec1[0], vec1[1], scale=3, color='r');
ax.quiver([0], [0], vec2[0], vec2[1], scale=3, color='r');

ax.set_title('Sample Data', fontsize=16, fontweight='bold')
ax.set_xlabel('Dim 1', fontsize=14); ax.set_xlim([-4,4]);
ax.set_ylabel('Dim 2', fontsize=14); ax.set_ylim([-4,4]);

show()

The red arrows show the new coordinate system, defined by the principal components. It also shows that the more correlated the data are, the less information we lose by reducing the data representation using PCA.

Important

When reading about PCA you will come across the term Singular Value Decomposition (SVD). As mentioned in the Scikit-learn documentation for PCA, when you apply PCA the algorithm in the background uses SVD to return the results. The difference between the two is technical but SVD is the more general method. Wikipedia has as mathematical introduction to PCA which also discusses the relationship to SVD.

PCA of the Gene Expression Data


Now, we will use PCA to find a few features that capture most of the variability in the gene expression dataset. we arbitrarily start with 50 components assuming that this will include the important features.

PYTHON

nComp = 50 # Number of PCs to be returned
#trainIndx = binomial(1,0.9,size=filectr)

GenePCA = PCA(n_components=nComp, whiten=True)
GenePCA = GenePCA.fit(geneData) #.values[trainIndx==1,:]

How many features should we retain? To investigate this question, we plot the total (cumulative) variance for retaining the top k modes. The more we use, the higher the cumulative variance.

PYTHON

cumExpVar = cumsum(GenePCA.explained_variance_ratio_)
fig = figure(figsize=(5, 5))
im  = plot( range(nComp), cumExpVar )
xticks(arange(0,nComp,5));
xlabel( 'Number of PCs', fontsize=14);
ylabel('Cumulative explained variance', fontsize=14);

show()

A common heuristic for choosing the number of components is by defining a set threshold for the total explained variance. Thresholds commonly vary between 0.8-0.9, depending on the structure of the PCs, e.g. depending on whether there are a few top PCs or many small PCs, but also depending on the expected noise in data, and on the desirable accuracy of the reduced data set. While a dimensionality reduction is convenient it always loses some information.

As an example, we can check how many PCs we need to retain 99 % of the variance.

PYTHON

threshold = 0.99

keepPC = [pc for pc in range(nComp) if cumExpVar[pc]>=threshold][0]

print('Number of features needed to explain {:1.2f} fraction of total variance is {:2d}. '.format(threshold, keepPC) )

Number of features needed to explain 0.99 fraction of total variance is 19.

There are many alternative methods for estimating the number of features. These include:

  • Plot explained variance of individual PCs and cut off when there is a sharp drop/saturation in the values.

  • Cross-validation - Find number of PCs that maximize explained variance on heldout data (using a bi-cross validation scheme).

  • Visual inspection or interpretable PCs

Visualization of reduced dataset

Now that we have selected the number of features to be used, we can see what the data along those features looks like.

PYTHON

newGeneData = GenePCA.transform(geneData)[:,range(keepPC)]

fig, ax = subplots(ncols=3, nrows=2, figsize=(18, 8))
suptitle('Data in new Feature space', fontsize=20)

ax[0, 0].scatter(newGeneData[:,0], newGeneData[:,1], c=range(filectr), cmap='summer');
ax[0, 0].set_xlabel('PC1', fontsize=14); ax[0, 0].set_xlim([-2,2]);
ax[0, 0].set_ylabel('PC2', fontsize=14); ax[0, 0].set_ylim([-2,2]);


ax[0, 1].scatter(newGeneData[:,0], newGeneData[:,2], c=range(filectr), cmap='summer');
ax[0, 1].set_xlabel('PC1', fontsize=14); ax[0, 1].set_xlim([-2,2]);
ax[0, 1].set_ylabel('PC3', fontsize=14); ax[0, 1].set_ylim([-2,2]);


ax[0, 2].scatter(newGeneData[:,0], newGeneData[:,3], c=range(filectr), cmap='summer');
ax[0, 2].set_xlabel('PC1', fontsize=14); ax[0, 2].set_xlim([-2,2]);
ax[0, 2].set_ylabel('PC4', fontsize=14); ax[0, 2].set_ylim([-2,2]);



ax[1, 0].scatter(newGeneData[:,1], newGeneData[:,2], c=range(filectr), cmap='summer');
ax[1, 0].set_xlabel('PC2', fontsize=14); ax[1, 0].set_xlim([-2,2]);
ax[1, 0].set_ylabel('PC3', fontsize=14); ax[1, 0].set_ylim([-2,2]);


ax[1, 1].scatter(newGeneData[:,1], newGeneData[:,3], c=range(filectr), cmap='summer');
ax[1, 1].set_xlabel('PC2', fontsize=14); ax[1, 1].set_xlim([-2,2]);
ax[1, 1].set_ylabel('PC4', fontsize=14); ax[1, 1].set_ylim([-2,2]);


ax[1, 2].scatter(newGeneData[:,3], newGeneData[:,2], c=range(filectr), cmap='summer');
ax[1, 2].set_xlabel('PC4', fontsize=14); ax[1, 2].set_xlim([-2,2]);
ax[1, 2].set_ylabel('PC3', fontsize=14); ax[1, 2].set_ylim([-2,2]);

show()

Beyond dimensionality reduction

After reducing dataset from hundreds or thousands of genes to fewer features (PCs), it can be used:

  • To find groups of genes that seem to be co-regulated (using clustering).

  • To build classifiers to predict the likelihood of juvenile or late-onset rheumatoid disease. For example, one feature can be highly variable in the sampled population and therefore has high explained variance, but it may not predictive for the current objective of predicting disease phenotype (e.g. eye pigmentation genes).

  • To find what combination of genes the top features correspond to.

  • To suggest what individual genes are most variable.

  • To discover co-expression patterns of multiple genes.

Optional Exercises

  1. The dataset contains two types of samples, expression in neutrophil and expression in peripheral blood mononuclear cells (PBMC). Download the sample-data table to get the category of each sample. Make a scatter plot of two (original) features and set the colour of the scatter based on the sample’s origin. Do a PCA, scatter the top scoring components and colour the samples in the same way. Do the top PCs capture gene expression differences in the two types of cells?

  2. The dataset similarly contains samples from individuals with differing phenotypes - healthy, as well as 2 different disease characteristics. Similar to the previous exercise, colour the scatter based on the disease phenotypes. Do the top PCs capture variability in gene expression for different phenotypes?

When does PCA fail?


Although PCA is the first and simplest method of exploring high-dimensional datasets, there are important caveats to keep in mind:

  1. PCA is a linear method

Nonlinear patterns of co-expression will be seemingly broken up into many features. Thus, the ability to find whether genes are co-regulated may be reduced if the effects are nonlinear.

  1. PCA depends on scaling of individual measurements.

As PCA measures variance along different directions, changing the scale of one variable (for example from cm to mm, or fold-change to actual levels) may drastically change the dominant “features” extracted.

The lesson on clustering of images with Gaussian mixture models contained instructions to scale data. You can explore the original dataset of this lesson, to investigate how scaling of variables impacts the results.

  1. Estimating number of features is heuristic, and depends on the purpose.

The definition of what is the ‘relevant’ number of features can depend on:

  • the signal-to-noise ratio in data. For example, noisier or smaller datasets make it harder to accurately estimate smaller PCs. In such cases, a more conservative threshold should be used.
  • the classifier or regression performance following the dimensionality reduction.

Resources


Dataset from:

Frank MB, Wang S, Aggarwal A, et al. Disease-associated pathophysiologic structures in pediatric rheumatic diseases show characteristics of scale-free networks seen in physiologic systems: implications for pathogenesis and treatment. BMC Med Genomics. 2009;2:9. Published 2009 Feb 23. doi:10.1186/1755-8794-2-9

Download data from: https://www.ebi.ac.uk/arrayexpress/experiments/E-GEOD-11083/

Key Points

  • Reduced features in a dataset reduce redundancy and process is called dimensionality reduction.
  • Principal component analysis (PCA) is one of the commonly used dimensionality reduction methods.
  • n_components is used specify the number of components in Scikit-learn.
  • PCA can be helpful to find groups of genes that seem to be co-regulated.