# Unsupervised Learning - Introduction to Dimensionality Reduction

This practical work explores applications of dimensionality reduction techniques on datasets of varying complexity.

Some parts are marked as "bonus"; you can postpone them to the end of the session or keep them as personal work at home.

Take the time to understand what you are implementing or visualizing.


In this practical work, we will use dimensionality reduction techniques to i) visualize data and ii) compress data while keeping as much relevant information as possible.

We will start with a classic dataset: the Iris dataset. Load the dataset and check its dimensions. How can you display this data in 2D?

In [None]:
from sklearn.datasets import load_iris
X,y = load_iris(return_X_y=True)
n,p = ...
print(f"The dataset has {n} observations, each observation being represented by {p} values")

Use the `PCA` class from the module `sklearn.decomposition` to perform a 2D projection of the data that preserves as much information as possible. It is always good advice to check the documentation!

Plot the new 2D coordinates of the data.

In [None]:
import matplotlib.pyplot as plt  # needed by plt.scatter below
from sklearn.decomposition import PCA
pca = PCA(...)
X_pca = ...
plt.scatter(X_pca[:,0], X_pca[:,1],c=y)

What percentage of the information (i.e. the variance) is kept by this projection?

In [None]:
percentage_explained_variance = ...
print(f"The two first components explain {percentage_explained_variance:.2f}% of the information")
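For reference, the sketch below shows how a fitted `PCA` object exposes the share of variance carried by each component through its `explained_variance_ratio_` attribute. It runs on a small synthetic dataset (`X_toy` is an assumption made for illustration, not the Iris data), so the numbers it prints do not answer the question above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy data: 200 points in 4D whose variance is concentrated
# along the first two axes (standard deviations 5, 2, 0.5 and 0.1).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 4)) * np.array([5.0, 2.0, 0.5, 0.1])

pca_toy = PCA(n_components=2)
X_toy_2d = pca_toy.fit_transform(X_toy)  # shape (200, 2)

# One ratio per kept component, each between 0 and 1.
ratios = pca_toy.explained_variance_ratio_
kept = 100 * ratios.sum()
print(f"Per-component ratios: {ratios}")
print(f"Variance kept by the 2D projection: {kept:.2f}%")
```

The ratios are sorted in decreasing order, so the first component always carries the largest share.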


Perform a 2D projection using the $t$-SNE method, implemented in the module `sklearn.manifold` by the class `TSNE`. Discuss the results.

In [None]:
from sklearn.manifold import TSNE
X_tsne = ...
plt.scatter(X_tsne[:,0], X_tsne[:,1],c=y)
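As a point of comparison with the `PCA` API, here is a minimal sketch of a `TSNE` call on synthetic clusters. The data (`X_toy`, generated with `make_blobs`) and the perplexity value are assumptions chosen for illustration only.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Hypothetical toy data: 90 points in 5D grouped in 3 clusters.
X_toy, y_toy = make_blobs(n_samples=90, n_features=5, centers=3, random_state=0)

# perplexity must be smaller than the number of samples;
# random_state makes the stochastic optimization reproducible.
tsne = TSNE(n_components=2, perplexity=15, random_state=0)
X_toy_tsne = tsne.fit_transform(X_toy)

print(X_toy_tsne.shape)
```

Unlike `PCA`, `TSNE` has no `transform` method for new points: the embedding is always computed with `fit_transform` on the whole dataset.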

The main limitation of PCA is that it is restricted to linear transformations. We will therefore compare PCA and $t$-SNE on a non-linear dataset.
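To make "linear" concrete, the sketch below checks that a PCA coordinate is just a dot product between the centered data and a fixed axis. It uses two concentric circles from `make_circles` as a hypothetical non-linear example, not the dataset used in the rest of this section.

```python
import numpy as np
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA

# Hypothetical toy data: two concentric circles (not linearly separable).
X_toy, y_toy = make_circles(n_samples=200, factor=0.5, random_state=0)

pca_1d = PCA(n_components=1)
X_toy_1d = pca_1d.fit_transform(X_toy)  # shape (200, 1)

# The PCA coordinate is a linear function of the input:
# center the data, then take the dot product with the first principal axis.
manual = (X_toy - pca_1d.mean_) @ pca_1d.components_[0]
print(np.allclose(X_toy_1d[:, 0], manual))
```

Because every point is projected onto the same straight line, points from the inner and outer circles can land on the same 1D coordinate.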

Create two "moons" using the `make_moons` function without noise. 

In [None]:
from sklearn.datasets import make_moons

X, y = ...
plt.scatter(X[:,0], X[:,1],c=y)

Perform a 1D projection of this 2D dataset using PCA and $t$-SNE (use a perplexity of 10). Can you spot what the non-linear method does differently?

In [None]:

import numpy as np  # needed by np.zeros_like below

X_pca = ...

X_tsne = ...

fig, axes = plt.subplots(1,2, figsize=(10, 5))
axes[0].scatter(X_pca, np.zeros_like(X_pca), c=y, alpha=0.5)
axes[1].scatter(X_tsne, np.zeros_like(X_tsne),c=y, alpha=0.5)
axes[0].set_title("PCA")
axes[1].set_title(r"$t$-SNE")

Let's now try a much more complex dataset. The Olivetti dataset is composed of 400 face images of different people, each image being 64 x 64 pixels. We will perform a 2D projection of this data and plot the faces at their projected coordinates to see whether the method is able to identify a good latent space.

First, let's load the dataset and check the dimensions.

In [None]:
from sklearn.datasets import fetch_olivetti_faces
X,y = fetch_olivetti_faces(return_X_y=True)
n,p = X.shape
print(f"The dataset has {n} observations, each observation being represented by {p} values")

The following function `plot_faces` will help you plot the images at the coordinates computed by your 2D projection.


In [None]:
import matplotlib.pyplot as plt
from matplotlib.offsetbox import OffsetImage, AnnotationBbox

def plot_faces(faces, coords):

    # Setup figure and axis, adjust figsize for a larger plot
    fig, ax = plt.subplots(figsize=(10, 10))  # Adjust figsize as needed

    for face, coord in zip(faces, coords):
        img = face.reshape(64, 64)
        imagebox = OffsetImage(img, zoom=.25, cmap="gray")  # Adjust zoom as needed
        ab = AnnotationBbox(imagebox, coord, frameon=False)
        ax.add_artist(ab)

    # Adjust limits if needed, based on the range of PCA components
    ax.set_xlim(coords[:, 0].min() - 5, coords[:, 0].max() + 5)
    ax.set_ylim(coords[:, 1].min() - 5, coords[:, 1].max() + 5)
    plt.show()

Compute the PCA on the Olivetti dataset, and display the images at the coordinates computed with PCA.

Do the same with $t$-SNE. Tune the perplexity parameter. What do you think of the two 2D projections?
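One possible way to organize the tuning is to compute one embedding per candidate perplexity and compare them visually. The sketch below does this on synthetic clusters; the data (`X_toy`) and the list of candidate values are assumptions for illustration, not recommended settings for the faces.

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

# Hypothetical toy data: 80 points in 10D grouped in 4 clusters.
X_toy, y_toy = make_blobs(n_samples=80, n_features=10, centers=4, random_state=0)

# Candidate perplexities (illustrative values; each must stay below n_samples).
perplexities = [5, 15, 30]

# Store one 2D embedding per perplexity value.
embeddings = {}
for perplexity in perplexities:
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0)
    embeddings[perplexity] = tsne.fit_transform(X_toy)

for perplexity, coords in embeddings.items():
    print(f"perplexity={perplexity}: embedding of shape {coords.shape}")
```

The same loop applied to the Olivetti data would produce one coordinate array per perplexity, each of which could be passed to `plot_faces` in turn.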

**Bonus application: image compression**

PCA can be used to compress images. The goal is to find a good trade-off between the amount of data stored and the amount of information retained. We will test PCA on this task.

First, we will load an image and convert it to gray levels.

In [None]:
import numpy as np
from matplotlib.image import imread
image = imread("chevreuil.png")
coeff_conversion_gray = [0.2989, 0.5870, 0.1140]
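# Convert the color image to gray levels with weights matching human perception.
# Only the first three channels (RGB) are used, so a PNG with an alpha channel also works.
image = np.dot(image[..., :3], coeff_conversion_gray)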
plt.imshow(image,cmap="gray")
plt.axis('off')

Considering the image as a series of observations, perform a full PCA (with the maximum number of components) and check how much information is kept by the first $d$ dimensions (make a plot).

Determine graphically and numerically how many components you need to keep 90% of the information. What about 75%?

In [None]:
pca = PCA()
pca.fit(...)
explained_variance = ...
plt.plot(...)

info_threshold = 0.9

nb_min_components = ... 
print(f"We need {nb_min_components} components to encode {info_threshold*100:.2f}% of the information")
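As a cross-check for this kind of computation, the sketch below compares a manual count based on the cumulative explained variance with the count that `PCA` selects itself when `n_components` is a float between 0 and 1. The matrix `X_toy` is a synthetic stand-in for an image (an assumption for illustration), with rows treated as observations.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy matrix standing in for an image: 100 rows (observations)
# of 40 values, with a variance that decays quickly from column to column.
rng = np.random.default_rng(0)
scales = 1.0 / np.arange(1, 41)
X_toy = rng.normal(size=(100, 40)) * scales

# Manual route: cumulative sum of the explained variance ratios.
pca_full = PCA().fit(X_toy)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
threshold = 0.9
# Index of the first cumulative value reaching the threshold, plus one.
k_manual = int(np.argmax(cumulative >= threshold)) + 1

# Built-in route: a float n_components asks PCA to pick the count itself.
pca_auto = PCA(n_components=threshold, svd_solver="full").fit(X_toy)

print(k_manual, pca_auto.n_components_)
```

Plotting `cumulative` against the component index gives the graphical counterpart of the same count.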


Perform a PCA using the minimal number of components you just computed. Compute the projection of your image. How much data do you have now? (Don't forget the data required to reconstruct the image!)

Compare with the amount of data in the uncompressed image.

In [None]:

pca = PCA(n_components=...)
pca.fit(...)
reduced_image = ...
nb_pixels_original = ...
print(f"Amount of data before compression: {nb_pixels_original}")

nb_info_compresse =  ...
print(f"Amount of data after compression: {nb_info_compresse}")
print(f"That is a compression ratio of x {...}")
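The bookkeeping can be sketched with plain arithmetic. The helper `compressed_size` below is hypothetical, and it assumes that each image row is one observation and that the projected image, the principal axes and the per-column mean all have to be stored.

```python
def compressed_size(n_rows: int, n_cols: int, n_components: int) -> int:
    """Number of values stored after PCA, assuming rows are observations."""
    scores = n_rows * n_components      # projected image
    components = n_components * n_cols  # principal axes
    mean = n_cols                       # per-column mean removed by PCA
    return scores + components + mean

# Hypothetical 512 x 512 image compressed with 50 components.
original = 512 * 512
compressed = compressed_size(512, 512, 50)
print(original, compressed, round(original / compressed, 2))
# prints 262144 51712 5.07
```

The count only compares numbers of stored values; it ignores how many bits each value takes.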


Now, we will reconstruct the image after compression. To do so, we can rely on the `inverse_transform` method of the `PCA` class. Display the original image and the reconstructed one side by side. Try different levels of compression.

In [None]:
reconstructed_image = ... 

fig, ax = plt.subplots(1,2, figsize=(20, 10))
ax[0].imshow(image,cmap="gray")
ax[1].imshow(reconstructed_image,cmap="gray")
ax[0].axis('off')
ax[1].axis('off')
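To see what `inverse_transform` does under the hood, the sketch below reproduces it by hand on a synthetic matrix (`X_toy`, an assumption standing in for a gray-level image). It relies on the default `whiten=False` setting of `PCA`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical toy matrix standing in for a gray-level image.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 30))

pca_k = PCA(n_components=10).fit(X_toy)
scores = pca_k.transform(X_toy)          # shape (60, 10)
X_rec = pca_k.inverse_transform(scores)  # back to shape (60, 30)

# Without whitening, inverse_transform is scores @ components_ + mean_.
manual = scores @ pca_k.components_ + pca_k.mean_
print(np.allclose(X_rec, manual))

# Mean squared reconstruction error for this number of components.
print(np.mean((X_toy - X_rec) ** 2))
```

Repeating the last lines with a different `n_components` gives one way to quantify the loss at each level of compression.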
