# Lab 10: kNN

Write your name in the *markdown* cell below [2 points]

### Name: 

We will start by looking again at the handwritten digits dataset bundled with scikit-learn (8x8-pixel images, a small MNIST-style dataset).

### Load the data

In [None]:
from sklearn import datasets

# load data set and extract the features (X) and target values (y)
digits = datasets.load_digits()
X = digits.data
y = digits.target

### Visualize the data

In [None]:
import matplotlib.pyplot as plt
# visualize the data by plotting the first 30 images
figure, axes = plt.subplots(3,10, figsize = (15,6))
for ax,image,number in zip(axes.ravel(), digits.images, y) :
    ax.axis('off')
    ax.imshow(image, cmap = plt.cm.gray_r)
    ax.set_title('Number: ' + str(number))

### Question 1 <span style = 'font-size:80%'>[5 points]</span>

Display the 200th image (and only this image), along with its target value.

### Question 2 <span style = 'font-size:80%'>[5 points]</span>

How many digits in the target list are 4s? Your answer should display only the number of 4s. Hint: you can find this answer by using a Counter and displaying the number of 4s, or by creating a list of target values containing only 4s, and then displaying the length of this list.
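
Both approaches from the hint can be sketched on a small made-up list. The list `toy_targets` below is hypothetical and is not the digits target list, so the printed counts are not the answer to this question.

```python
from collections import Counter

# hypothetical target values, NOT taken from the digits dataset
toy_targets = [0, 4, 1, 4, 9, 4, 2]

# approach 1: a Counter maps each distinct value to how often it occurs
counts = Counter(toy_targets)
n_fours = counts[4]
print(n_fours)

# approach 2: keep only the 4s, then take the length of that list
fours_only = [t for t in toy_targets if t == 4]
print(len(fours_only))
```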

### Split data into training and testing sets, fit the model, and make predictions in the *test* dataset

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=99, stratify = y)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

### Look at the *confusion matrix*

In [None]:
from sklearn.metrics import confusion_matrix
import seaborn as sns

confusion = confusion_matrix(y_true = y_test, y_pred = y_pred)
s = sns.heatmap(confusion, annot = True, cmap = 'nipy_spectral_r')
s.set_title('Confusion matrix for MNIST dataset')
s.set_ylabel('true value')
s.set_xlabel('predicted value')
None

### Question 3 <span style = 'font-size:80%'>[6 points]</span>

As discussed previously, remember that
- the *recall* for a target value is the proportion of samples with that true value that are predicted correctly
- the *precision* for a target value is the proportion of samples predicted to have that value that are correct
- the $f_1$ *score* for a target value is the harmonic mean of *precision* and *recall* for that target value
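
As an illustration of these three definitions, here is a minimal sketch on a made-up 3-class confusion matrix. The matrix `toy` is hypothetical and unrelated to the digits heatmap; it assumes the same layout as that heatmap, with rows as true values and columns as predicted values.

```python
# hypothetical confusion matrix: rows = true value, columns = predicted value
toy = [
    [8, 1, 1],  # true class 0
    [2, 6, 2],  # true class 1
    [0, 5, 5],  # true class 2
]

c = 1  # the target value of interest

true_pos = toy[c][c]                  # samples of class c predicted as c
n_true = sum(toy[c])                  # row sum: all samples whose true value is c
n_pred = sum(row[c] for row in toy)   # column sum: all samples predicted as c

recall = true_pos / n_true
precision = true_pos / n_pred
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(recall, precision, f1)
```

Recall divides by a row sum and precision divides by a column sum, which is why the two values can differ for the same target value.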

(a) Using Python as a calculator, calculate the *recall* for the target value of 8 using the confusion matrix above.

(b) Using Python as a calculator, calculate the *precision* for the target value of 8 using the confusion matrix above.

### Calculate the *balanced accuracy*

In [None]:
from sklearn import metrics
# calculate the balanced accuracy using metrics.balanced_accuracy_score
metrics.balanced_accuracy_score(y_test, y_pred)
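
Balanced accuracy is the average of the per-class recall values. The sketch below computes it by hand on made-up labels (the lists `y_true_toy` and `y_pred_toy` are hypothetical) and compares it with plain accuracy on an imbalanced example.

```python
# hypothetical labels for an imbalanced two-class problem (eight 0s, two 1s)
y_true_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred_toy = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0]

pairs = list(zip(y_true_toy, y_pred_toy))

# recall for each class: correct predictions of class c / true members of class c
recalls = []
for c in sorted(set(y_true_toy)):
    n_true = sum(1 for t, _ in pairs if t == c)
    n_correct = sum(1 for t, p in pairs if t == c and p == c)
    recalls.append(n_correct / n_true)

# balanced accuracy: unweighted mean of the per-class recalls
balanced_acc = sum(recalls) / len(recalls)

# plain accuracy: proportion of all samples predicted correctly
plain_acc = sum(1 for t, p in pairs if t == p) / len(pairs)

print(balanced_acc, plain_acc)
```

In this toy example the plain accuracy is 0.9 while the balanced accuracy is 0.75, because the single error falls in the small class and each class counts equally in the average.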

### Question 4 <span style = 'font-size:80%'>[4 points]</span>

Using the training and testing sets from above, calculate the balanced accuracy when using *kNN* with *k = 7*. Which value of *k* performs better, 3 or 7?

### K-fold cross validation
The process of k-fold cross-validation involves splitting the dataset into _k_ groups, then using the first group for testing and the remaining *k-1* groups for training; this process is then repeated using the second group for testing and the remaining *k-1* groups for training, then the third group for testing, and so on. 
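
The splitting procedure described above can be sketched in plain Python. The helper `kfold_indices` is a hypothetical name written for this illustration; it is not scikit-learn's implementation, and it does not shuffle the data.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_indices, test_indices) for each fold, without shuffling."""
    indices = list(range(n_samples))
    # when n_samples is not divisible by n_splits, the first folds get one extra sample
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]                 # current group is for testing
        train = indices[:start] + indices[start + size:]   # remaining groups are for training
        yield train, test
        start += size

# 10 samples split into 5 folds: each sample is used for testing exactly once
folds = list(kfold_indices(n_samples=10, n_splits=5))
for train, test in folds:
    print('test:', test, 'train:', train)
```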

A common choice is *k = 10* (note that this is a different _k_ from the one in *kNN*).

Cross-validation is useful for *hyperparameter tuning*, the process of finding the best values for model hyperparameters such as the value of _k_ in *kNN*.

The code below demonstrates how to use the *KFold* class and *cross_val_score* function to find the best value of *k*.


In [None]:
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# create an empty list to store accuracies
acc = []

# values of 'k' to iterate through
kvals = list(range(1,27,2))

# for each value of 'k', create a knn estimator and find the mean balanced accuracy using 10-fold cross validation
for k in kvals :

    # create kNN model (estimator)
    knn = KNeighborsClassifier(n_neighbors=k)
    
    # create the k = 10 folds
    kfold = KFold(n_splits=10, random_state=99, shuffle = True)
    
    # return an array of scores that contains the balanced accuracy for each fold
    scores = cross_val_score(estimator = knn, X = digits.data, y = digits.target, cv = kfold, scoring = "balanced_accuracy" )
    print('mean balanced accuracy with k = ', k, ': ', scores.mean(), sep = '')

    # add the mean balanced accuracy (averaged over the 10 folds) for the current value of 'k' to the list
    acc.append(scores.mean())

### Question 5 <span style = 'font-size:80%'>[4 points]</span>

We want to find the value of *k* that yields the highest mean balanced accuracy across the cross-validation folds. This can be done by visualizing the results as shown below. What value of _k_ is the best?

In [None]:
s = sns.pointplot(x=kvals, y=acc)
s.set_xlabel('k')
s.set_ylabel('balanced accuracy')
s.set_title('Balanced accuracy for 10-fold cross-validation using kNN in MNIST dataset')
None

## Breast Cancer Classification 

The code below loads the *Breast Cancer Wisconsin dataset*, which contains data for 569 images (though the data does not consist of the actual images). Each image is a breast mass that is either malignant (cancerous) or benign (non-cancerous). The features are measurements of the cell nuclei in each image, such as the radius of the nuclei (the interpretation of the features is not important, but feel free to ask questions if you are interested). This application has important medical implications: the goal is to diagnose breast (and other) cancers more accurately and more quickly.

In [None]:
# load the data
bc = datasets.load_breast_cancer()

# extract the feature data into 'X'
X = bc.data

# extract the target data into 'y'
y = bc.target

### Question 6 <span style = 'font-size:80%'>[7 points]</span>

(a) Use Python to display the number of samples in this dataset, based on *X*. You should only output the number of samples. Hint: `X.shape` returns a *tuple*.

(b) Use Python to display the number of features in this dataset, based on *X*. You should only output the number of features.

(c) Display the names of the features (and only these names), by accessing the appropriate value of the *bc* object.

### Question 7 <span style = 'font-size:80%'>[5 points]</span>

Use the seaborn (*sns*) module to plot the mean radius on the x-axis and the mean smoothness on the y-axis, with the samples color coded by target value. Your plot should include appropriate x- and y-labels, as well as a title. This plot should demonstrate that these features can separate the '0' and '1' target values, which suggests that *kNN* will be effective.

### Question 8 <span style = 'font-size:80%'>[10 points]</span>

The code below generates training and testing data for the Breast Cancer dataset. 

(a) Add code to make predictions in the testing dataset using *kNN* with *k = 3*. Then generate a confusion matrix and a heatmap (set the argument *fmt* to '.3g' in the *heatmap* function, to allow for 3 significant digits).

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=99, stratify = y)

(b) Calculate the balanced accuracy.

### Question 9 <span style = 'font-size:80%'>[9 points]</span>

Generate a classification report, using the target names, and answer the questions below:

(a) If this classifier is used, what proportion of malignant tumors would be identified (this is the *recall* for malignant images)?

(b) If this classifier is used, what proportion of healthy individuals would be predicted to have cancer (this is the *false positive* rate and is equal to 1 - the *recall* for benign images)?
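
The quantities in parts (a) and (b) can all be read off a 2x2 confusion matrix. Below is a minimal sketch with made-up counts; the four numbers are hypothetical and are not results for this dataset.

```python
# hypothetical counts, treating 'malignant' as the positive class
tp = 40  # malignant, predicted malignant
fn = 10  # malignant, predicted benign
fp = 5   # benign, predicted malignant
tn = 95  # benign, predicted benign

# proportion of malignant tumors that are identified
recall_malignant = tp / (tp + fn)

# proportion of benign cases that are correctly predicted benign
recall_benign = tn / (tn + fp)

# false positive rate: proportion of benign cases predicted malignant
false_positive_rate = fp / (fp + tn)

# for comparison: proportion of malignant predictions that are wrong
one_minus_precision = 1 - tp / (tp + fp)

print(recall_malignant, false_positive_rate, one_minus_precision)
```

With these counts the false positive rate (0.05) equals 1 - the recall for the benign class, while 1 - precision (about 0.11) is a different number because it divides by the number of malignant predictions rather than the number of benign cases.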