Assignment: Butterfly flu (kNN)

Butterfly flu is a fictitious yet annoying disease that causes a lot of people to seek medical advice.

There is no direct test for butterfly flu, but three markers, M1, M2, and M3, can be measured from a blood sample. Based on these markers, it may be possible to predict whether an individual has butterfly flu.

Your goal is to find out whether butterfly flu can be reliably diagnosed based on markers M1, M2, and M3.

Use kNN to solve the problem. Include an estimate of the reliability of the diagnostic test.

Task: Load the data

Load the generated data set. Each row starts with an id, followed by the three markers.
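The loading step could look roughly like the sketch below, which uses only the Python standard library. The column names (id, M1, M2, M3, diagnosis) and every value in the inline sample are assumptions made for illustration. The description above only mentions the id and the three markers, so the diagnosis column is a guess at how the class label is stored; adjust the names to match the real file.

```python
import csv
import io

# Hypothetical stand-in for the real file: the column names and all
# values below are made up for illustration, not taken from the data set.
SAMPLE_CSV = """id,M1,M2,M3,diagnosis
1,4.2,1.1,0.7,1
2,3.9,0.9,0.8,1
3,1.0,2.5,3.1,0
4,1.2,,2.9,0
"""

MARKERS = ("M1", "M2", "M3")

def load_rows(handle):
    """Parse CSV rows into dicts; empty marker fields become None."""
    rows = []
    for rec in csv.DictReader(handle):
        row = {"id": rec["id"], "diagnosis": int(rec["diagnosis"])}
        for name in MARKERS:
            text = rec[name].strip()
            row[name] = float(text) if text else None
        rows.append(row)
    return rows

rows = load_rows(io.StringIO(SAMPLE_CSV))
print(len(rows), "rows loaded; first row:", rows[0])
```

To read an actual file instead of the inline sample, pass an open file handle, for example `with open("butterfly_flu.csv", newline="") as f: rows = load_rows(f)`, where the file name is a placeholder.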

Butterfly flu data set
Material Link Reference
Data set csv

Task: Examine the data

Examine the data and print the basic statistics. Look for potential errors and outliers.
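One possible way to produce the basic statistics and flag suspicious values is sketched below. The readings are invented, and the 1.5 × IQR rule is just one common convention for spotting outliers, chosen here as an assumption; the assignment does not prescribe a particular rule.

```python
import statistics

def summarize(values):
    """Return basic descriptive statistics for a list of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

# Made-up M1 readings; 97.0 imitates a data-entry error.
m1 = [4.2, 3.9, 4.1, 4.4, 3.8, 4.0, 97.0, 4.3]

stats = summarize(m1)
outliers = iqr_outliers(m1)

for name, value in stats.items():
    print(f"{name:>6}: {value}")
print("possible outliers:", outliers)
```

A large gap between the mean and the median, as in this made-up sample, is often a quick hint that a few extreme values deserve a closer look. Missing values would need to be filtered out before calling these functions.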

Task: Apply a kNN classifier

Apply the k-nearest neighbours (kNN) classification method to the data.

What is the accuracy estimate when it is computed on the training set? (Note: this estimate is prone to overfitting.)
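To make the overfitting risk concrete, here is a minimal from-scratch kNN sketch evaluated on its own training data. The data is synthetic (two overlapping classes drawn from Gaussians), and the helper names `knn_predict`, `accuracy`, and `make_data` are hypothetical. With k = 1, every training point is its own nearest neighbour, so the training accuracy is perfect unless identical points carry conflicting labels, which says nothing about how well the classifier generalises.

```python
import math
import random
from collections import Counter

def knn_predict(train_X, train_y, point, k):
    """Majority vote among the k training points closest to `point`."""
    ranked = sorted(zip(train_X, train_y),
                    key=lambda pair: math.dist(pair[0], point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def accuracy(train_X, train_y, test_X, test_y, k):
    """Fraction of test points whose predicted label matches the true one."""
    hits = sum(knn_predict(train_X, train_y, x, k) == y
               for x, y in zip(test_X, test_y))
    return hits / len(test_y)

def make_data(n, seed):
    """Synthetic (M1, M2, M3) triples for two overlapping classes (assumed)."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        centre = 1.0 if label else 0.0
        X.append(tuple(rng.gauss(centre, 1.0) for _ in range(3)))
        y.append(label)
    return X, y

X, y = make_data(200, seed=0)

# Evaluating on the training data itself: optimistic by construction.
train_acc_k1 = accuracy(X, y, X, y, k=1)
train_acc_k5 = accuracy(X, y, X, y, k=5)
print("training accuracy, k=1:", train_acc_k1)
print("training accuracy, k=5:", train_acc_k5)
```

Because kNN relies on distances, markers measured on very different scales may need to be standardised first so that no single marker dominates the vote.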

Task: Validate your results

Use split validation to correct for model overfitting. Use 67% of the data as the training set and the remaining 33% as the validation set.

What is the corrected accuracy estimate? Is it OK to use this as a diagnostic tool?
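The 67/33 split validation could be sketched as follows. The data is again synthetic, and `knn_predict`, `split_validate`, and `make_rows` are hypothetical helper names, so the printed numbers say nothing about the real data set. The idea is to shuffle the rows, fit on the first 67%, and measure accuracy on the held-out 33% that the classifier has never seen.

```python
import math
import random
from collections import Counter

def knn_predict(train, point, k):
    """Majority vote among the k labelled training rows nearest to `point`."""
    ranked = sorted(train, key=lambda row: math.dist(row[0], point))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def split_validate(rows, k, train_fraction=0.67, seed=0):
    """Shuffle, split, and return accuracies plus the two subset sizes."""
    shuffled = rows[:]                      # leave the caller's list intact
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train, valid = shuffled[:cut], shuffled[cut:]

    def acc(subset):
        hits = sum(knn_predict(train, x, k) == y for x, y in subset)
        return hits / len(subset)

    return acc(train), acc(valid), len(train), len(valid)

def make_rows(n, seed):
    """Synthetic ((M1, M2, M3), label) rows for two overlapping classes."""
    rng = random.Random(seed)
    rows = []
    for i in range(n):
        label = i % 2
        centre = 1.0 if label else 0.0
        rows.append((tuple(rng.gauss(centre, 1.0) for _ in range(3)), label))
    return rows

rows = make_rows(300, seed=1)
train_acc, valid_acc, n_train, n_valid = split_validate(rows, k=1)
print(f"train: {n_train} rows, accuracy {train_acc:.3f}")
print(f"valid: {n_valid} rows, accuracy {valid_acc:.3f}")
```

The validation accuracy is the figure to report as the corrected estimate. Whether it is good enough for a diagnostic tool is a separate judgment: overall accuracy hides how errors divide between missed cases and false alarms, and a single split gives a noisy estimate that can shift with a different random seed.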
