#  TP Classifier and CV 


The purpose of this practical work is to get familiar with the `scikit-learn` library and to implement a classifier with a cross-validation strategy to optimize its hyperparameters. Take care not to introduce bias when you tune your (hyper)parameters.


## Load some data

First, we will load some data into our environment to perform a classification. `scikit-learn` includes some dataset loading utilities within the `sklearn.datasets` module [link](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets). Take a few minutes to browse the list of datasets and identify the classification tasks.

For this TP, we will first load the wine dataset [link](https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-dataset)

In [None]:
from sklearn.datasets import load_wine
X,y = ...

In [None]:
from sklearn.datasets import load_wine
X,y = load_wine(return_X_y=True)
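
Before training anything, it can help to look at what was loaded. The short sketch below only prints the shape of the data and the number of samples per class; the names `classes` and `counts` are illustrative and not part of the exercise.

```python
import numpy as np
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

# X holds one row per wine sample and one column per measured feature
print(X.shape)

# y holds the integer class label of each sample
classes, counts = np.unique(y, return_counts=True)
print(dict(zip(classes.tolist(), counts.tolist())))
```

If the classes turn out to be unevenly represented, keep that in mind when interpreting accuracy later on.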

## A first SVM classifier

`sklearn` provides implementations of many machine learning algorithms. As seen during the class, we will focus on classification problems. We will start with SVM.
This method is implemented by the class `SVC` within the module `sklearn.svm`. Check the docs to get familiar and convince yourself that it corresponds to the SVM classification method seen in class [link](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). 

 1) Create a variable `model` corresponding to a `SVC`.

In [None]:
from sklearn.svm import SVC
model = ...

In [None]:
from sklearn.svm import SVC
model = SVC()

2) Fit the model to the dataset `X` and predict the corresponding classes.

In [None]:
model.fit(X,y)
pred = model.predict(X)


3) Check the accuracy of your prediction. You can use the `sklearn` function `accuracy_score` of module `sklearn.metrics` [link](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
from sklearn.metrics import accuracy_score
print(accuracy_score(y,pred))

In [None]:
# The accuracy score is simply the proportion of correct classifications.
import numpy as np
acc = np.mean(y==pred)
print(acc)

## Predict unseen data
Congrats, you trained a first model. However, it is rather useless as is, since it has only predicted data already seen during the training phase. Let's predict unseen data.

1) Split your dataset into train and test sets using the `train_test_split` function of module `sklearn.model_selection`. We will use a `test_size` of 50% of the whole dataset, and fix the `random_state` to `42` for the sake of reproducibility. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
2) Learn your model on training set
3) Predict your test set and print the performance

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = ...
model = ...
model.fit(...)
perf_test = ...
print(f"Performance on test is {perf_test:.2f}")

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=42)

In [None]:
model = SVC()
model.fit(X_train,y_train)
pred = model.predict(X_test)
perf_test = accuracy_score(y_test,pred)
print(f"Performance on test is {perf_test:.2f}") # differs from the performance on the training set

## Tune the `C` hyperparameter 
When creating a new `SVC` variable, we can tune the `C` value, which controls how strongly training errors are penalized in the optimization process. Now, the objective is to find the best `C` value, i.e. the one which performs best on unseen data.

1) Compute the performance (accuracy) for different values of `C` on both train and test sets. We will use 25% of the dataset as the training set and a log scale (`np.logspace`) for the values of `C`.

    a) Split into train and test sets, using 25% of dataset as train.

    b) Normalize the data using the `StandardScaler` class of module `sklearn.preprocessing`. Take care not to use the test set to compute the scaling parameters!
    
    c) Compute the classification error rate for each `C` in `np.logspace(-2,1,25)`

In [None]:
from sklearn.svm import SVC
import numpy as np
from sklearn.model_selection import train_test_split
import  matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score as score
from sklearn.preprocessing import StandardScaler


X_train, X_test, y_train, y_test = ...

In [None]:
from sklearn.svm import SVC
import numpy as np
from sklearn.model_selection import train_test_split
import  matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score as score
from sklearn.preprocessing import StandardScaler


X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.25, random_state=42)

scaler = StandardScaler()
X_train_norm = scaler.fit_transform(X_train)
X_test_norm = scaler.transform(X_test)

error_train = []
error_test = []
Cs = np.logspace(-2,1,25) # have the students print it to see what the vector looks like
for C in Cs:
    model = SVC(C=C)

    model.fit(X_train_norm,y_train)
    
    y_hat_train = model.predict(X_train_norm)
    y_hat_test = model.predict(X_test_norm)
    
    error_train.append(1-score(y_train, y_hat_train)) # we want errors for the plot: error = 1 - accuracy
    error_test.append(1-score(y_test, y_hat_test))


2) Plot the errors on the train and test sets as a function of model complexity. Reminder: low `C` values correspond to simple models, high `C` values to more complex models that make fewer training errors. Verify that the training error decreases as the complexity of the model increases. Comment on the test error curve. What is the best `C` value?

**Hint** : use the `plot` function from module `matplotlib.pyplot`

In [None]:
import matplotlib.pyplot as plt
plt.plot(...)
plt.xscale('log') # Recommended to make the plot readable

In [None]:
# An annotated plot is not required here, only the curves and their interpretation.
# Note: the random state and the split ratio were chosen so that the plot is telling.
fig,ax = plt.subplots()
ax.plot(Cs,error_train)
ax.plot(Cs,error_test)
ax.set_xscale('log') # or plt.xscale('log'), but ax.set_xscale is cleaner


arrow_style = { 'arrowstyle' : "->", 'connectionstyle' : "arc3"}

ax.legend(["error on train", "error on test"])
ax.annotate("underfitting",xy=(0.05,0.64),xytext=(0.01, 0.4),
            arrowprops=arrow_style)
ax.annotate("overfitting",xy=(7.0,0.0),xytext=(1,0.4),
            arrowprops=arrow_style)

best_idx = np.argmin(error_test)
ax.annotate("Optimal",xy=(Cs[best_idx],error_test[best_idx]),
            xytext=(.5,0.2),
            arrowprops=arrow_style)

ax.set_ylabel("Error")
ax.set_xlabel("Model complexity (C)")

print(f"The best C value is {Cs[best_idx]:.2f}")

## Complete protocol

In the previous exercise, we chose the best `C` value a posteriori, given its performance on the test set. However, in real conditions, we do not know how the model will perform on unseen data. To avoid this bias, we need to use train, validation and test sets.
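
To make the three roles concrete, here is a minimal sketch of a single hold-out split into train, validation and test sets, obtained by calling `train_test_split` twice. The 25% validation ratio and the names `X_rest` and `X_val` are assumptions chosen for illustration. Cross-validation follows the same idea but rotates the validation part over several folds instead of fixing it once.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)

# First set aside the test set: it is never used to choose hyperparameters
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Then split the remainder into a training set (to fit the model)
# and a validation set (to compare hyperparameter values)
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```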

1) Split the original dataset into train and test sets, using 30% for the test set.

2) Use the `GridSearchCV` class to compute a cross validation and find the optimal `C` value.

In [None]:
from sklearn.svm import SVC
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV



In [None]:
from sklearn.svm import SVC
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV

scaler = StandardScaler()
X_norm =  scaler.fit_transform(X)

X_train_norm, X_test_norm, y_train, y_test = train_test_split(X_norm, y, test_size=0.3, random_state=42)

params={'C' : np.logspace(-2,2,25)}
model=SVC()
cv = GridSearchCV(model,param_grid=params)

cv.fit(X_train_norm,y_train)
cv.cv_results_ # for the students to explore

In [None]:
cv.best_estimator_

4) Predict your final performance on test set.

In [None]:

perf_test = ...
print(f"Accuracy on test = {perf_test}")

In [None]:
y_hat_test = cv.predict(X_test_norm)
perf_test = cv.score(X_test_norm,y_test)
print(f"Accuracy on test = {perf_test}")

5. We made a mistake in our protocol: we broke the first rule, which is never to use the test set during training. Can you spot the problem?
To solve this problem, have a look at the `Pipeline` class of scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

In [None]:
# The scaler was fit on the whole dataset before the split, so test samples were used to normalize the data.

In [None]:
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=42)
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('classifier', SVC())])
params={'classifier__C' : np.logspace(-2,2,25)}
cv = GridSearchCV(pipe,param_grid=params)
cv.fit(X_train, y_train)
cv.score(X_test, y_test)
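
Once the search has been fit, its results can be inspected. The sketch below rebuilds the same pipeline under the name `search` (an illustrative name, chosen to avoid clashing with `cv` above) and reads a few attributes of `GridSearchCV`. It also checks that the scaler of the refitted pipeline was computed from the training split only.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

pipe = Pipeline(steps=[("scaler", StandardScaler()), ("classifier", SVC())])
search = GridSearchCV(pipe, param_grid={"classifier__C": np.logspace(-2, 2, 25)})
search.fit(X_train, y_train)

# Best hyperparameter value and its mean accuracy over the validation folds
print(search.best_params_)
print(f"Mean CV accuracy: {search.best_score_:.3f}")

# cv_results_ is a dict of arrays with one entry per candidate value of C
mean_scores = search.cv_results_["mean_test_score"]
print(mean_scores.shape)

# The scaler of the refitted pipeline was fit on the training split only
fitted_scaler = search.best_estimator_.named_steps["scaler"]
scaler_uses_train_only = np.allclose(fitted_scaler.mean_, X_train.mean(axis=0))
print(scaler_uses_train_only)
```

During the search itself, the pipeline is refit on each training fold, so the validation folds do not influence the scaling either.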

6. **Bonus 1** Extend the learning process to also select the best kernel.

In [None]:
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline

X_train, X_test, y_train, y_test = train_test_split(X, y,test_size=0.3,random_state=42)
kernels= ['linear','poly','rbf','sigmoid'] # the hyperparameters specific to each kernel could also be tuned (bonus)

C_values = np.logspace(-2,2,25)
# beware of computation time!
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('classifier', SVC())])
params={'classifier__C' : C_values,
       'classifier__kernel' : kernels,
       'classifier__gamma' : ['scale','auto', 1e-5,1e-3,1e-1]
       }
cv = GridSearchCV(pipe,param_grid=params)
cv.fit(X_train, y_train)
cv.score(X_test, y_test)

7. **Bonus 2** Extend the learning process to compare different classifiers

In [None]:
# Long exercise, meant to keep the fastest students busy
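
One possible starting point for this bonus, given as a sketch rather than a reference solution: compare a few classifiers with `cross_val_score` on the training set, each wrapped in the same scaling pipeline, then evaluate only the selected one on the test set. The choice of candidates (`KNeighborsClassifier`, `LogisticRegression`) and their default hyperparameters are assumptions; each candidate could be tuned with its own `GridSearchCV`.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Candidate classifiers (illustrative choice, default hyperparameters)
candidates = {
    "svc": SVC(),
    "knn": KNeighborsClassifier(),
    "logreg": LogisticRegression(max_iter=1000),
}

# Mean cross-validated accuracy of each candidate, on the training set only
cv_scores = {}
for name, clf in candidates.items():
    pipe = Pipeline(steps=[("scaler", StandardScaler()), ("classifier", clf)])
    scores = cross_val_score(pipe, X_train, y_train, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name}: mean CV accuracy = {cv_scores[name]:.3f}")

# Refit the selected candidate on the whole training set, then use the test set once
best_name = max(cv_scores, key=cv_scores.get)
best_pipe = Pipeline(
    steps=[("scaler", StandardScaler()), ("classifier", candidates[best_name])]
)
best_pipe.fit(X_train, y_train)
test_accuracy = best_pipe.score(X_test, y_test)
print(f"Selected {best_name}, accuracy on test = {test_accuracy:.3f}")
```

The test set is touched only after the model has been chosen, which keeps the final accuracy an honest estimate.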