Butterfly flu is a fictitious yet annoying disease that causes a lot of people to seek medical advice.
There is no direct test for butterfly flu, but three markers, M1, M2, and M3, can be measured from a blood sample. Based on these markers it may be possible to predict whether an individual's disease is butterfly flu or not.
Your goal is to find out whether butterfly flu can be reliably diagnosed based on markers M1, M2, and M3.
Use kNN to solve the problem. Include an estimate of the reliability of the diagnostic test.
Load the generated data set. Each row starts with an id, followed by the three markers.
Material | Link | Reference |
Data set | csv |
Examine the data and print the basic statistics. Look for potential errors and outliers.
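One possible way to approach this step is sketched below using only the Python standard library. The inline sample, its header names (`id`, `M1`, `M2`, `M3`), and all of its values are hypothetical stand-ins for the real file, whose exact layout may differ. The 1.5 × IQR rule used to flag outliers is a common convention and only one of several reasonable choices.

```python
import csv
import io
import statistics

# Hypothetical sample in the assumed layout: id, M1, M2, M3.
# These values are made up; replace SAMPLE_CSV with the contents of the real file.
SAMPLE_CSV = """id,M1,M2,M3
1,4.1,0.52,120
2,3.9,0.48,118
3,4.3,0.55,125
4,4.0,0.50,119
5,4.2,0.51,9999
6,3.8,0.49,121
"""

def load_columns(text):
    """Parse CSV text into a dict mapping marker name -> list of floats."""
    reader = csv.DictReader(io.StringIO(text))
    columns = {}
    for row in reader:
        for name, value in row.items():
            if name == "id":  # the id is a row label, not a measurement
                continue
            columns.setdefault(name, []).append(float(value))
    return columns

def summarize(values):
    """Return basic statistics for one column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }

def iqr_outliers(values, k=1.5):
    """Return the values lying outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

columns = load_columns(SAMPLE_CSV)
for name, values in columns.items():
    print(name, summarize(values), "outliers:", iqr_outliers(values))
```

In this made-up sample the M3 value 9999 is flagged, which is the kind of entry worth checking by hand before modelling. If pandas is available, `pandas.read_csv` followed by `DataFrame.describe()` gives a comparable summary with less code.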
Apply k Nearest Neighbours (kNN) classification method to the data.
What is the accuracy estimate, computed from the training set? (Note: there's a danger of model overfitting)
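To illustrate why a training-set accuracy is optimistic, here is a minimal kNN sketch in plain Python. It runs on synthetic data produced by the hypothetical helper `make_synthetic`, not on the exercise data, and it assumes each sample carries a binary label (1 for butterfly flu, 0 otherwise). The choice of Euclidean distance and simple majority voting is an assumption as well.

```python
import math
import random
from collections import Counter

def make_synthetic(n=200, seed=0):
    """Generate hypothetical (markers, label) pairs; NOT the exercise data."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        # Class means differ by 1.0 in each marker; noise has unit variance.
        markers = tuple(rng.gauss(float(label), 1.0) for _ in range(3))
        data.append((markers, label))
    return data

def knn_predict(train, point, k):
    """Predict by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

data = make_synthetic()
for k in (1, 5, 15):
    # Scoring on the same samples used for training inflates the estimate.
    print(f"k={k:2d}  training accuracy = {accuracy(data, data, k):.3f}")
```

With k = 1 every training point is its own nearest neighbour, so the training accuracy is 100% regardless of how well the markers actually separate the classes. That is the overfitting danger mentioned in the note above.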
Use split validation to correct for model overfitting. Use 67% as training data and 33% as the validation set.
What is the corrected accuracy estimate? Is it OK to use this as a diagnostic tool?
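Split validation could be sketched as follows, again on synthetic data from the hypothetical helper `make_synthetic` rather than the real measurements. The value k = 5, the random seeds, and the binomial standard error used as a rough uncertainty for the validation accuracy are all illustrative assumptions.

```python
import math
import random
from collections import Counter

def make_synthetic(n=200, seed=0):
    """Generate hypothetical (markers, label) pairs; NOT the exercise data."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        markers = tuple(rng.gauss(float(label), 1.0) for _ in range(3))
        data.append((markers, label))
    return data

def knn_predict(train, point, k):
    """Predict by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def accuracy(train, test, k):
    """Fraction of test samples whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

def split(data, train_fraction=0.67, seed=1):
    """Shuffle a copy of the data and cut it into training and validation parts."""
    shuffled = data[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

data = make_synthetic()
train, valid = split(data)
k = 5  # illustrative choice; odd k avoids tied votes with two classes

train_acc = accuracy(train, train, k)
valid_acc = accuracy(train, valid, k)  # model never saw the validation samples
std_err = math.sqrt(valid_acc * (1 - valid_acc) / len(valid))

print(f"training accuracy:   {train_acc:.3f}")
print(f"validation accuracy: {valid_acc:.3f} +/- {std_err:.3f}")
```

The validation accuracy is the figure to report, since it is measured on samples the classifier did not see during training. Whether it is good enough for a diagnostic tool also depends on which kind of error dominates: for a medical test, missed cases and false alarms usually carry different costs, so a single accuracy number may not tell the whole story.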