# K-Nearest Neighbors

The *k-nearest neighbors* (*knn*) classification method predicts the *class* of a sample *x_new* as follows:
1. Calculate the distance between *x_new* and each training sample (for each training sample, its features and class are known)
1. Find the *k* nearest neighbors
1. Assign *x_new* to the class that occurs most frequently among the *k* nearest neighbors
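
The three steps above can be sketched directly in *numpy*. This is a minimal illustration, not how *scikit-learn* implements the method: the function name `knn_predict`, the toy data, and the choice of Euclidean distance are all assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the class of x_new by majority vote among its k nearest neighbors."""
    # Step 1: Euclidean distance from x_new to every training sample (assumed metric)
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # Step 2: indices of the k training samples with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Step 3: the class that occurs most frequently among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Toy data (made up for illustration): two features, two classes
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([1.1, 1.0]), k=3))  # 0
print(knn_predict(X_toy, y_toy, np.array([5.1, 5.0]), k=3))  # 1
```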

In this classifier, _k_ is a *hyperparameter*, which is a parameter of the classifier that must be assigned by the user. 

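Typically, odd values of _k_ are chosen in order to avoid tied votes. With two classes, an odd _k_ guarantees a majority; with three or more classes, ties are less likely but still possible.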
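
A small vote-counting sketch shows when ties can occur. The helper `tied_winners` is hypothetical and the neighbor labels are made up; the point is only that an odd _k_ always yields a majority with two classes, while three or more classes can still tie.

```python
from collections import Counter

def tied_winners(neighbor_labels):
    """Return every class that shares the highest vote count."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    return sorted(label for label, n in counts.items() if n == top)

# Two classes, k = 3: one class always has a majority
print(tied_winners([0, 1, 1]))      # [1]
# Two classes, k = 4: a 2-2 tie is possible
print(tied_winners([0, 0, 1, 1]))   # [0, 1]
# Three classes, k = 3: a 1-1-1 tie is possible even though k is odd
print(tied_winners([0, 1, 2]))      # [0, 1, 2]
```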

We will use *scikit-learn* for classification and for loading sample datasets. We will start by loading the *iris* dataset, a 'famous' dataset, widely used to test classification methods. The iris dataset contains measurements for 3 species of the iris flower.

## Loading and understanding the data


In [None]:
from sklearn import datasets
iris = datasets.load_iris()

print(iris.DESCR)

### Features and target values

- The *data* contains the *features* (the _X_ values used for making predictions). 
- The *target* contains the class values (the labels or categories for each sample of *X*)
- Each is stored as a *numpy* array, which is a collection of elements of the same type. 
    - if an array has one dimension, you can think of it as a list of values
    - if an array has two dimensions, you can think of it as a table of values, with rows and columns
    - arrays can have more than two dimensions
    
By convention, we use uppercase letters (e.g., *X*), to indicate a matrix (a table/array with rows and columns) and lowercase letters (e.g., *y*) to indicate a vector (an array with 1 dimension).
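
As a quick illustration of array dimensions, the sketch below builds arrays with one, two, and three dimensions and inspects them. The variable names and values are made up for the example.

```python
import numpy as np

v = np.array([1, 2, 3])                 # 1 dimension: think of a list of values
M = np.array([[1, 2, 3], [4, 5, 6]])    # 2 dimensions: a table with 2 rows, 3 columns
T = np.zeros((2, 3, 4))                 # 3 dimensions

# ndim is the number of dimensions; shape is the size of each dimension
print(v.ndim, v.shape)  # 1 (3,)
print(M.ndim, M.shape)  # 2 (2, 3)
print(T.ndim, T.shape)  # 3 (2, 3, 4)

# All elements of an array share one type
print(M.dtype)
```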

In [None]:
X = iris.data
y = iris.target

The *shape* attribute of a *numpy* array returns the number of elements in each dimension. *X* has 150 rows and 4 columns:

In [None]:
X.shape

*y* has 150 elements:

In [None]:
y.shape

For *numpy* arrays with 1 dimension, such as *y*, list slicing rules apply. For *numpy* arrays with 2 dimensions, such as *X*, we can access elements using 

```python
X[row_slice, column_slice]
```
Use the cell below to view X and y:

In [None]:
y[:3]
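
The `[row_slice, column_slice]` pattern can be tried on a small array first. The array `A` below is made up for illustration; the same expressions work on *X*.

```python
import numpy as np

A = np.arange(12).reshape(4, 3)   # 4 rows, 3 columns holding the values 0..11

print(A[:2, :])     # first two rows, all columns
print(A[:, 0])      # all rows, first column (a 1-dimensional result)
print(A[1:3, 1:])   # rows 1 and 2, columns 1 and 2
print(A[2, 1])      # a single element: row 2, column 1 -> 7
```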

### Feature and target names

Labels for the data are described in the following properties of the *iris* object:
- *feature_names*:  the column names of the *data* which describe the features
- *target_names*: labels corresponding to the integer values of the *target* 

In [None]:
iris.feature_names

In [None]:
iris.target_names

For clarity, let's make a data frame with labeled columns:

In [None]:
import pandas as pd
iris_df = pd.DataFrame(X, columns=iris.feature_names)
iris_df

So that our understanding of the data is clear, let's add the target to the data frame. This is straightforward, because we can treat a data frame as a dictionary where the *keys* are the column names. We add a column the same way we add a *key:value* pair to a dictionary, where the *value* is a list or array with one entry per row. 

In [None]:
iris_df['species'] = y
iris_df
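
The integer codes can also be translated to species names with the same dictionary-style assignment. This is a sketch that rebuilds the data frame so it runs on its own; the column name `species_name` is an assumption chosen for the example.

```python
import pandas as pd
from sklearn import datasets

iris = datasets.load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df['species'] = iris.target

# target_names is a numpy array, so indexing it with the integer codes
# returns the matching name for every row
df['species_name'] = iris.target_names[iris.target]

print(df['species_name'].value_counts())
```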

## Visualizing the data 

We will use the *seaborn* module to generate a scatterplot of the data. A scatterplot is a plot of *x-* and *y-* values.
- the *x* values are the sepal length (the first column of _X_, which has index 0), 
- the *y* values are the sepal width (the 2nd column of _X_, which has index 1)

A *hue* can be specified to color the points, which will automatically add a legend.

In [None]:
import seaborn as sns
s = sns.scatterplot(x = X[:,0], y = X[:,1], hue = y)
s.set(xlabel = 'Sepal length', ylabel = 'Sepal width', title = 'Sepal length vs. sepal width for 3 iris species')
None

## K-nearest neighbors

Scikit-learn provides a simple framework for working with classifiers (which *scikit-learn* calls *estimators*), that involves 3 basic steps:

1. Create an estimator or *model*, such as a *KNeighborsClassifier*
1. Train the model using *model.fit()*
1. Make predictions using *model.predict()*

### Create a KNN classifier

Note that you need to specify the value of the hyperparameter here. We use *k = 3*.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=3)
knn

### Train the classifier

In general this step has the form

```python
model.fit(X_train,y_train)
```

where 
- *model* is a model such as one obtained by *KNeighborsClassifier*
- *X_train* is an array of features for the training data, with rows corresponding to samples and columns corresponding to features
- *y_train* is an array of class labels corresponding to each row of *X_train*

Here we fit the model using the complete dataset, using *X* and *y*. Below, we will see how to split the data into separate training and testing datasets.

In [None]:
knn.fit(X,y)

### Use the classifier to make predictions

Make predictions using

```python
model.predict(xnew)
```
where 
*xnew* is a 2-dimensional array that contains the features for the new samples.

In the example below, we are predicting the species for a flower with the following measurements:
- sepal length: 6 cm
- sepal width: 2 cm
- petal length: 4.9 cm
- petal width: 1.5 cm

This flower is predicted to have a value of 1 (*versicolor*)

In [None]:
import numpy as np
xnew = np.array(  [  [6, 2, 4.9, 1.5] ] )

# make the prediction
knn.predict(xnew)
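
Because *predict* expects a 2-dimensional array, a single sample stored as a 1-dimensional array has to be reshaped into one row first. The sketch below fits its own classifier (named `model` here, an assumption for the example) so that it runs independently of the cells above.

```python
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()
model = KNeighborsClassifier(n_neighbors=3).fit(iris.data, iris.target)

sample = iris.data[0]            # one flower as a 1-dimensional array, shape (4,)
batch = sample.reshape(1, -1)    # one row, four columns: shape (1, 4)

labels = model.predict(batch)    # one predicted label per row
print(labels, iris.target_names[labels])

# Several samples can be predicted in one call: one row per sample
print(model.predict(iris.data[[0, 100]]).shape)  # (2,)
```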

## Using training and testing datasets

When evaluating a model, it is critical that you have both *training* data and *testing* data. The training and testing data sets should be independent -- we want to evaluate how well a classifier performs on data that it has not seen previously. If a testing data set is not used, we will not know if the classifier is *overfitting* the data. Overfitting occurs when a classifier works really well on the training dataset but performs poorly on new data.

Scikit-learn makes it easy to split a single data set into training and testing sets, by providing the function

```python
train_test_split(X,y, test_size, random_state, stratify)
```

where

- *X* is a matrix containing the feature data
- *y* is the corresponding vector containing the class labels
- *test_size* is the proportion of data to reserve for testing
- *random_state* is the random number seed; set this so results will be reproducible
- *stratify* is an array of labels to stratify by (each label will appear in the training and testing datasets in roughly the same proportion as in the full data)

The function returns a tuple of the form
```python
(X_train, X_test, y_train, y_test)
```
that contains the training and testing data for *X* and _y_.

Note that we *stratify* by the class label, to ensure that our datasets are balanced.

If we did not do this, then it is possible by chance for the training dataset to have few target values of a certain type, in which case the classifier would likely do a poor job predicting observations that belong to that class.
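
The effect of *stratify* can be seen by comparing the class counts of two test sets, one split without it and one with it. This is a sketch; the `test_size` and `random_state` values are arbitrary choices for the illustration.

```python
from collections import Counter
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Plain random split: class counts in the test set depend on chance
_, _, _, y_test_plain = train_test_split(X, y, test_size=0.2, random_state=0)

# Stratified split: each class keeps its share of the full dataset
_, _, _, y_test_strat = train_test_split(X, y, test_size=0.2, random_state=0,
                                         stratify=y)

print('plain:     ', sorted(Counter(y_test_plain.tolist()).items()))
print('stratified:', sorted(Counter(y_test_strat.tolist()).items()))
```

With 150 samples and `test_size=0.2`, both test sets contain 30 samples; the stratified one contains exactly 10 of each species.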

### Split the data into training and testing sets

In [None]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=99, stratify = y)

We want our data to be balanced with respect to the classes. For example, if there were very few *setosa* samples in the training set, the classifier would probably not do well on these in the test dataset. In this case the full dataset is balanced (50 observations for each species), and because we stratified by *y*, the training and testing sets are nearly balanced as well, as seen below:

In [None]:
# use a Counter to see how many of each class we have in the training and testing datasets
from collections import Counter
print('Balance of training set: ', Counter(y_train))
print('Balance of testing set: ', Counter(y_test))

### Fit the model

In [None]:
knn.fit(X_train, y_train)

### Make predictions in the *test* dataset

In [None]:
pred = knn.predict(X_test)

### Evaluate the results by generating a *classification report*  which calculates various performance measures

Using *scikit-learn*, we can generate a *classification report* that contains commonly used performance measures

The *classification_report* function takes the true classes from the test data, the predicted values, and optionally the target names. The columns are defined below:

- precision: the proportion of samples predicted to be in a class that truly belong to that class
- recall: the proportion of samples that truly belong to a class that were predicted to be in that class
- F1 score: the harmonic mean of precision and recall
- support: the number of samples for each group

In [None]:
from sklearn.metrics import classification_report
report = classification_report(y_test, pred, target_names = iris.target_names)
print(report)
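
To make the definitions concrete, precision, recall, and the F1 score for one class can be computed by hand and compared with the *scikit-learn* functions. The labels below are made up for illustration and are not the iris predictions.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Made-up true and predicted labels for 10 samples in 3 classes
y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 2, 2, 2, 2, 1])

cls = 1  # the class we are evaluating
tp = np.sum((y_pred == cls) & (y_true == cls))  # predicted cls, truly cls
fp = np.sum((y_pred == cls) & (y_true != cls))  # predicted cls, truly another class
fn = np.sum((y_pred != cls) & (y_true == cls))  # truly cls, predicted another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(precision, recall, f1)

# average=None returns one score per class, so index by the class
print(precision_score(y_true, y_pred, average=None)[cls],
      recall_score(y_true, y_pred, average=None)[cls],
      f1_score(y_true, y_pred, average=None)[cls])
```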

### Evaluate the results by looking at the *confusion matrix*

A *confusion matrix* is a matrix in which each row corresponds to a true class and each column to a predicted class, so that each entry counts how many observations of the row's class were assigned to the column's class. As the name implies, confusion matrices are useful for identifying areas where the classifier may be "confused" (i.e., where it consistently misclassifies a particular category).

In [None]:
from sklearn.metrics import confusion_matrix
confusion = confusion_matrix(y_true = y_test, y_pred = pred)
confusion
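
A small made-up example can help with reading the matrix: rows are true classes, columns are predicted classes, and correct predictions lie on the diagonal. The labels below are invented for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for 10 samples in 3 classes
y_true = [0, 0, 0, 1, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 2, 2, 2, 1]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 1 0]
#  [0 2 1]
#  [0 1 3]]

# Overall accuracy: correct predictions (the diagonal) over all samples
print(np.trace(cm) / cm.sum())        # 0.7

# Recall per class: diagonal divided by the row totals (true counts)
print(np.diag(cm) / cm.sum(axis=1))
```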

Let's create a data frame so that we can label the rows and columns 

In [None]:
import pandas as pd
confusion_df = pd.DataFrame(confusion, columns=iris.target_names, index=iris.target_names)
confusion_df

We can visualize the confusion matrix as a heatmap using the seaborn *heatmap* function:

In [None]:
import matplotlib.pyplot as plt
sns.heatmap(confusion_df, annot = True, cmap = 'nipy_spectral_r')
plt.ylabel('True Value')
plt.xlabel('Predicted Value')
None