# Unsupervised Learning - Introduction to Clustering

In this practical session, you will explore applications of clustering techniques on datasets of varying complexity.

Some parts are marked as "bonus", so you can postpone them to the end of the session, or keep them for your personal work at home.

Take time to understand what you are implementing or visualizing.

In the first part of this practical work, we will implement a clustering on toy datasets, to clearly understand what happens. Then, we will try our implementation on more complex datasets.


In [None]:
import numpy as np
import matplotlib.pyplot as plt


### Clustering on toy dataset

First, we will use scikit-learn to create a simple dataset composed of 3 blobs, i.e. 3 regions of dense points. To do so, we will use the `make_blobs` function of the `sklearn.datasets` module.

Complete the parameters `centers` and `n_features` to create a 2D point cloud with 3 blobs. It's a good idea to check the documentation!

In [None]:
from sklearn.datasets import make_blobs


X, y = make_blobs(n_samples=100, 
                  centers= , 
                  n_features= , 
                  random_state=42)
plt.scatter(X[:,0], X[:,1],c=y)

Now, we will forget the `y` variable for our learning process. We will try to recover this information by performing a KMeans on the data `X`.

What is a good value for `K`?

Referring to the documentation, complete the following code to find the clusters. How do you interpret the colors?

In [None]:
K = ...
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters= ...)
kmeans.fit( ... )
clusters = ...
plt.scatter(X[:,0], X[:,1],c=clusters)

Previously, we guessed the value of $K$ by looking at the data. Suppose that you cannot look at it. How can you find a good value for $K$?

As seen in class, a good strategy is to evaluate the quality of the clustering for different values of $K$. What metric can you use? Find it in the scikit-learn library.

Then, complete the following code to implement your test and conclude on the best $K$ value.

In [None]:
from sklearn.metrics import ...

scores = []
for k in range(2,10):
    kmeans = KMeans(...)
    kmeans.fit(X)
    clusters = kmeans.predict(X)
    score = ...
    scores.append(score)

plt.plot(range(2, 10), scores, "bo-")
plt.xlabel("$k$", fontsize=14)
plt.ylabel("Score", fontsize=14)
plt.show()
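
The evaluation loop above can also be tried on data where the right answer is known in advance. The sketch below is only an illustration under stated assumptions: it uses the silhouette coefficient (`silhouette_score`), which is one possible metric among several, and a hypothetical demo dataset `X_demo` with four well-separated blobs rather than the dataset used above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical demo data: four blobs placed far apart on purpose.
demo_centers = [(-10, -10), (-10, 10), (10, -10), (10, 10)]
X_demo, _ = make_blobs(n_samples=300, centers=demo_centers,
                       cluster_std=1.0, random_state=0)

# Score each candidate k; a higher silhouette means tighter, better separated clusters.
silhouette_by_k = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    silhouette_by_k[k] = silhouette_score(X_demo, labels)

best_k = max(silhouette_by_k, key=silhouette_by_k.get)
print(f"best k according to the silhouette: {best_k}")
```

On real data the curve is rarely this clear-cut, so the score should be read together with a visual check whenever one is possible.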

Now, we will test KMeans on a more difficult dataset where the clusters are less obvious. This new dataset will highlight some problems with the initialization of the KMeans centroids.

Load and visualize the data. What value for $K$ may suit this dataset?

In [None]:

X = np.loadtxt('george.dat')
plt.scatter(X[:,0], X[:,1])


To show the influence of the initialization of KMeans, we will go into the details of the method. Check the documentation to understand how the initialization is managed by the sklearn library. What are the different strategies?

Implement a KMeans method with a single random initialization, and run it several times. What do you observe? Compute the intra-cluster inertia of each clustering. Which clustering is the best?

In [None]:
from sklearn.cluster import KMeans
...
clusters = ...
plt.scatter(X[:,0], X[:,1], c=clusters)
inertia = ... 
plt.title(f"{inertia:.2f}")

Now, we will reproduce the scikit-learn strategy for initializing the centroids. Check the behavior and default parameters of `KMeans`, especially when `init` is equal to "random". Then reproduce this strategy with your own implementation.

Going back to the default sklearn parameters, check that running KMeans several times on this dataset is now less sensitive to the random initialization. Make sure you understand why!
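
As a point of comparison, here is a minimal sketch of the "several random starts, keep the best" idea, assuming that "best" means lowest inertia (which is how the scikit-learn documentation describes `n_init`). The helper `best_of_n_runs` and the dataset `X_demo` are hypothetical names introduced for illustration. The default value of `n_init` depends on the scikit-learn version, so check the documentation of the version you have installed.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs


def best_of_n_runs(X, n_clusters, n_runs=10, seed=0):
    """Hypothetical helper: run KMeans n_runs times, each with a single
    random initialization, and keep the model with the lowest inertia."""
    best_model, inertias = None, []
    for i in range(n_runs):
        model = KMeans(n_clusters=n_clusters, init="random", n_init=1,
                       random_state=seed + i).fit(X)
        inertias.append(model.inertia_)
        if best_model is None or model.inertia_ < best_model.inertia_:
            best_model = model
    return best_model, inertias


# Hypothetical demo data (not the dataset loaded above).
X_demo, _ = make_blobs(n_samples=300, centers=5, random_state=1)
best_model, inertias = best_of_n_runs(X_demo, n_clusters=5)
print(f"single runs: min={min(inertias):.1f}, max={max(inertias):.1f}")
print(f"kept run:    {best_model.inertia_:.1f}")
```

Comparing the spread between the minimum and maximum inertia gives a rough idea of how much a single random start can hurt on a given dataset.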

You probably chose a value for $K$ based on the number of letters. Repeat the evaluation of different $K$ values as before, and discuss the results.

**A nice application**

A nice application of clustering methods like KMeans is to cluster the colors of an image. An image is composed of pixels, each encoded by a vector of red, green and blue (RGB) values. Since each channel is encoded on 8 bits, a pixel can take any of $256^3$ (about 16.7 million) colors, and a natural image typically contains many slightly different ones. A nice way to compress an image is to reduce this set of colors. For instance, if we are able to reduce the number of colors to 32, each pixel needs only 5 bits (an index into the 32-color palette, since $2^5 = 32$) instead of the 24 bits of a full RGB pixel.
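
The arithmetic behind these numbers can be checked in a few lines. This is a small sketch; the variable names are illustrative, and it ignores the cost of storing the palette itself (32 colors of 24 bits each), which is negligible for a large image.

```python
import math

bits_per_channel = 8
n_channels = 3  # red, green, blue

# Number of distinct colors a full RGB pixel can take: (2^8)^3 = 256^3.
full_rgb_colors = (2 ** bits_per_channel) ** n_channels
full_bits_per_pixel = bits_per_channel * n_channels

# With a palette, each pixel only stores the index of its color.
palette_size = 32
palette_bits_per_pixel = math.ceil(math.log2(palette_size))

print(f"{full_rgb_colors} possible colors, {full_bits_per_pixel} bits per pixel")
print(f"{palette_size} colors, {palette_bits_per_pixel} bits per pixel")
```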

The question is now: how do we find these 32 colors? KMeans is here to help you!

First, let's load a natural image.

In [None]:

from matplotlib.image import imread
# Make your choice here ! 
# image = imread("oiseau.png")
image = imread("chevreuil.png")
#image = imread("foubassan.png")
plt.imshow(image)
plt.axis('off')


Our data are pixels, each one a 3D (RGB) vector. Perform a `KMeans` on the pixel data encoded as an $N \times 3$ matrix. Then, extract the new set of colors.

In [None]:
import matplotlib.pyplot as plt

def display_colors(colors):
    nb_couleurs = len(colors)
    fig, ax = plt.subplots(1, nb_couleurs, figsize=(12, 2))

    # Loop over the colors and display each one in its own subplot
    for i,color in enumerate(colors):
        ax[i].imshow([[color]], aspect='auto')
        ax[i].axis('off')

    plt.show()

pixels = image.reshape(-1,3)
nb_couleurs = ...
...
new_colors = ...
display_colors(new_colors)

Now, compute your new image only using the new colors and the result of your clustering. Try different numbers of colors.

In [None]:
pixels_new = ...
image_new = pixels_new.reshape(image.shape)
plt.imshow(image_new)
plt.axis('off')
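
The mechanics of rebuilding an image from a palette and one label per pixel can be seen on a tiny hand-made example. This sketch involves no clustering at all: `palette`, `labels` and the 2 x 3 image shape are hypothetical values chosen for illustration, and it relies only on NumPy integer-array indexing.

```python
import numpy as np

# Hypothetical palette of 2 colors (red and blue), values in [0, 1].
palette = np.array([[1.0, 0.0, 0.0],
                    [0.0, 0.0, 1.0]])

# One palette index per pixel, for a flattened 2 x 3 image.
labels = np.array([0, 1, 1,
                   0, 0, 1])

# Indexing the palette with the label array yields one RGB row per pixel.
pixels_quantized = palette[labels]                    # shape (6, 3)
image_quantized = pixels_quantized.reshape(2, 3, 3)   # back to height x width x RGB

print(image_quantized.shape)
```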

**Going further**

To go further on clustering, you can explore two things:
 * Accelerate computations using `MiniBatchKMeans` from the `scikit-learn` library
 * Explore more advanced clustering algorithms like `DBSCAN` or others : https://scikit-learn.org/stable/modules/clustering.html
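
As a starting point for the first item, here is a minimal sketch of `MiniBatchKMeans` on fake pixel data. The array `pixels_demo` is randomly generated for illustration, and the values of `n_clusters`, `batch_size` and `n_init` are arbitrary assumptions rather than recommended settings.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# Hypothetical stand-in for image pixels: 10,000 random RGB vectors in [0, 1].
rng = np.random.default_rng(0)
pixels_demo = rng.random((10_000, 3))

# Each update step uses a small random batch instead of the whole dataset.
mbk = MiniBatchKMeans(n_clusters=8, batch_size=1024, n_init=3, random_state=0)
labels = mbk.fit_predict(pixels_demo)

print(mbk.cluster_centers_.shape)  # one RGB center per cluster
```

Timing this against `KMeans` on the same data (for instance with `%timeit` in a notebook) is a simple way to measure the speed-up and to compare the resulting inertia.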