# TP Classifier and CV


The purpose of this practical work is to get familiar with the `scikit-learn` library and to implement a classifier with a cross-validation strategy to optimize the hyperparameters. Take care not to introduce bias when you tune your hyperparameters.


## Load some data

First, we will load some data into our environment to perform a classification. `scikit-learn` includes some dataset loading utilities within the `sklearn.datasets` module [link](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.datasets). Take a few minutes to look at the list of datasets and identify the classification tasks.

For this TP, we will first load the wine dataset [link](https://scikit-learn.org/stable/datasets/toy_dataset.html#wine-dataset).

In [None]:
from sklearn.datasets import load_wine
X,y = ...
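As an illustration of the loading utilities, here is a minimal sketch of one possible way to get the features and labels as arrays. It assumes the `return_X_y=True` option of `load_wine`; the printed shapes are only there to inspect the data.

```python
import numpy as np
from sklearn.datasets import load_wine

# return_X_y=True returns the feature matrix and the label vector directly,
# instead of a Bunch object holding data, target and metadata
X, y = load_wine(return_X_y=True)

print(X.shape)       # (number of samples, number of features)
print(np.unique(y))  # the distinct class labels
```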

## A first SVM classifier

`sklearn` provides implementations of many machine learning algorithms. As seen in class, we will focus on classification problems, starting with SVM.
This method is implemented by the class `SVC` within the module `sklearn.svm`. Check the docs to get familiar with it and convince yourself that it corresponds to the SVM classification method seen in class [link](https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html).

1) Create a variable `model` corresponding to an `SVC`.

In [None]:
from sklearn.svm import SVC
model = ...

2) Fit the model to the dataset `X` and predict the corresponding class labels.

3) Check the accuracy of your prediction. You can use the function `accuracy_score` from the `sklearn.metrics` module [link](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

In [None]:
from sklearn.metrics import accuracy_score
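The three steps above can be sketched as follows. This is only one possible approach: it assumes the default `SVC` hyperparameters and, as in the exercise, it evaluates on the same data used for fitting, so the score is an optimistic estimate.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Default hyperparameters (assumption: no tuning at this stage)
model = SVC()
model.fit(X, y)

# Predict the labels of the very samples used for training
y_pred = model.predict(X)
train_accuracy = accuracy_score(y, y_pred)
print(f"Accuracy on training data: {train_accuracy:.2f}")
```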

## Predict unseen data
Congrats, you trained a first model. However, it is of little use so far, since it has only predicted data already seen during the training phase. Let's predict unseen data.

1) Split your dataset into train and test sets using the `train_test_split` function of module `sklearn.model_selection`. We will use a `test_size` of 50% of the whole dataset, and fix the `random_state` to `42` for the sake of reproducibility. The documentation is [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
2) Train your model on the training set.
3) Predict the labels of your test set and print the performance.

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = ...
model = ...
model.fit(...)
perf_test = ...
print(f"Performance on test is {perf_test:.2f}")
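A minimal sketch of this hold-out protocol, assuming the wine data and a default `SVC`. The split parameters follow the text (`test_size=0.5`, `random_state=42`); everything else is an illustrative choice.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# Hold out half of the samples; random_state makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=42
)

# Fit on the training half only
model = SVC()
model.fit(X_train, y_train)

# Evaluate on samples the model has never seen
perf_test = accuracy_score(y_test, model.predict(X_test))
print(f"Performance on test is {perf_test:.2f}")
```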

## Tune the `C` hyperparameter 
When creating a new `SVC` instance, we can set the `C` value, which controls how heavily training errors are penalized in the optimization problem. The objective now is to find the best `C` value, i.e. the one that performs best on unseen data.

1) Compute the performance (accuracy) for different values of `C` on both the train and test sets. We will use 25% of the dataset as the training set and a log scale (`np.logspace`) for the values of `C`.

    a) Split the data into train and test sets, using 25% of the dataset for training.

    b) Normalize the data using the `StandardScaler` class of the `sklearn.preprocessing` module. Take care not to use the test set to compute the scaling parameters!

    c) Compute the classification error rate for each `C` in `np.logspace(-2, 1, 25)`.

In [None]:
from sklearn.svm import SVC
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score as score
from sklearn.preprocessing import StandardScaler


X_train, X_test, y_train, y_test = ...
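Steps a) to c) could look like the sketch below. It is one possible approach under a few assumptions that the text does not fix: `random_state=42` for the split, a default RBF kernel, and the error rate computed as one minus the accuracy.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# a) Keep only 25% of the samples for training (random_state is an assumption)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.25, random_state=42
)

# b) Estimate the mean and standard deviation on the training set only,
#    then apply the same transformation to both sets
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)

# c) Error rate (1 - accuracy) on both sets for each value of C
C_values = np.logspace(-2, 1, 25)
train_errors, test_errors = [], []
for C in C_values:
    model = SVC(C=C).fit(X_train_s, y_train)
    train_errors.append(1 - accuracy_score(y_train, model.predict(X_train_s)))
    test_errors.append(1 - accuracy_score(y_test, model.predict(X_test_s)))
```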

2) Plot the errors on the train and test sets as a function of the complexity of the model. Reminder: low `C` values correspond to simple models, high `C` values to complex models with fewer training errors. Verify that the training error decreases as the complexity of the model increases. Comment on the plot of the test error. What is the best `C` value?

**Hint** : use the `plot` function from module `matplotlib.pyplot`

In [None]:
import matplotlib.pyplot as plt
plt.plot(...)
plt.xscale('log') # Recommended so that the plot is readable
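A possible way to draw both curves on one figure is sketched below. The block recomputes the errors so that it runs on its own; the split's `random_state`, the axis labels and the legend text are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.25, random_state=42
)

# Scaling parameters come from the training set only
scaler = StandardScaler().fit(X_train)
X_train_s, X_test_s = scaler.transform(X_train), scaler.transform(X_test)

C_values = np.logspace(-2, 1, 25)
train_errors, test_errors = [], []
for C in C_values:
    model = SVC(C=C).fit(X_train_s, y_train)
    train_errors.append(1 - accuracy_score(y_train, model.predict(X_train_s)))
    test_errors.append(1 - accuracy_score(y_test, model.predict(X_test_s)))

# One curve per set, with C on a logarithmic axis
fig, ax = plt.subplots()
ax.plot(C_values, train_errors, label="train error")
ax.plot(C_values, test_errors, label="test error")
ax.set_xscale("log")
ax.set_xlabel("C")
ax.set_ylabel("error rate")
ax.legend()
```

In a notebook the figure is displayed at the end of the cell; in a plain script, a call to `plt.show()` would be needed.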

## Complete protocol

In the previous exercise, we chose the best `C` value a posteriori, given its performance on the test set. However, in real conditions, we do not know in advance how the model will perform on unseen data. To limit this bias, we need to use train, validation and test sets.

1) Split the original dataset into train and test sets, using 30% for the test set.

2) Use the `GridSearchCV` class to run a cross-validation on the training set and find the optimal `C` value.

In [None]:
from sklearn.svm import SVC
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV



4) Compute your final performance on the test set.

In [None]:

perf_test = ...
print(f"Accuracy on test = {perf_test}")
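The `GridSearchCV` workflow can be sketched as below. This is a hedged illustration, not the expected solution: preprocessing is deliberately left out, and `cv=5`, `scoring="accuracy"`, `random_state=42` and the grid of `C` values are assumptions.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)

# The test set is put aside and not touched during the search
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Each candidate C is scored by 5-fold cross-validation on the training set
param_grid = {"C": np.logspace(-2, 1, 25)}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)
print(f"Best C = {search.best_params_['C']:.3f}")

# By default the best model is refit on the whole training set,
# so score() evaluates that refit model on the held-out test set
perf_test = search.score(X_test, y_test)
print(f"Accuracy on test = {perf_test:.2f}")
```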

5. We made a mistake in our protocol: we broke the first rule, which is not to use the test set during training. Can you spot the problem?
To solve this problem, check the `Pipeline` class of scikit-learn: https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html

6. **Bonus 1** Extend the learning process to also select the best kernel.

7. **Bonus 2** Extend the learning process to compare different classifiers.
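As a starting point for the `Pipeline` question and the kernel bonus, here is a minimal sketch, assuming a `StandardScaler` followed by an `SVC`. The step names (`"scaler"`, `"svc"`), the candidate kernels and the reduced grid of `C` values are illustrative choices. Because the scaler is a step of the pipeline, it is refit on the training part of each cross-validation fold.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Scaling and classification are chained into a single estimator
pipe = Pipeline([("scaler", StandardScaler()), ("svc", SVC())])

# Parameters of a pipeline step are addressed as <step name>__<parameter>
param_grid = {
    "svc__C": np.logspace(-2, 1, 10),
    "svc__kernel": ["linear", "rbf", "poly"],
}
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print(search.best_params_)
perf_test = search.score(X_test, y_test)
print(f"Accuracy on test = {perf_test:.2f}")
```

To compare different classifiers, the same pattern can be reused by running one search per candidate estimator and comparing their cross-validated scores before looking at the test set.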