Assignment: Phishing websites (decision tree)

Phishing is a family of online frauds in which an internet user is lured into submitting sensitive data, which the attacker then uses for malicious purposes.

Your goal is to construct a decision tree model that accurately decides whether or not a website is a phishing site.

Task: Load and examine the data

Load the data set.

Phishing Websites data set
Material	Link	Reference
Data set	csv	Mohammad, Rami, Thabtah, Fadi Abdeljaber and McCluskey, T.L. (2015) Phishing Websites Dataset. Downloadable via Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
Documentation	pdf

Study the contents of the data set.

Note: The interpretation of the values -1 and 1 in the Result column seems to be missing from the documentation. It may be helpful to know that 1 corresponds to a phishing site and -1 to a legitimate site.
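A minimal sketch of loading and inspecting the data with pandas is given below. The inline sample and its column names feature_a and feature_b are made up for illustration; only the Result column and its coding come from the note above. With the real file, pass its path to the function instead of the in-memory sample.

```python
import io

import pandas as pd

# Tiny made-up stand-in for the real CSV file (an assumption for illustration).
# Only the "Result" column name and its 1 / -1 coding come from the assignment.
SAMPLE_CSV = """feature_a,feature_b,Result
1,-1,1
-1,0,-1
1,1,1
0,-1,-1
"""


def load_and_summarize(source):
    """Read the CSV and return the DataFrame plus class counts with readable labels."""
    df = pd.read_csv(source)
    # Per the note above: 1 = phishing, -1 = legitimate.
    labels = df["Result"].map({1: "phishing", -1: "legitimate"})
    counts = labels.value_counts().to_dict()
    return df, counts


df, counts = load_and_summarize(io.StringIO(SAMPLE_CSV))
print(df.shape)      # number of rows and columns
print(df.dtypes)     # data type of each column
print(counts)        # how many sites fall into each class
```

Checking the class counts early shows whether the two classes are roughly balanced, which affects how an accuracy figure should be read later.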

Task: Create a decision tree

Construct a decision tree that classifies the websites into phishing sites and legitimate sites.
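One way to approach this step is sketched below with scikit-learn's DecisionTreeClassifier. The data here is synthetic and the labelling rule is invented purely so that the example runs on its own; with the real data set, X would hold the feature columns and y the Result column.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (an assumption for illustration):
# four features coded -1/0/1 and a label that follows a made-up rule.
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(400, 4))
y = np.where((X[:, 0] == 1) & (X[:, 1] >= 0), 1, -1)  # 1 = phishing, -1 = legitimate

# Hold out a quarter of the rows so the tree is scored on data it has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

tree = DecisionTreeClassifier(random_state=0)
tree.fit(X_train, y_train)

test_accuracy = tree.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
print(f"tree depth: {tree.get_depth()}, leaves: {tree.get_n_leaves()}")
```

The near-perfect score here only reflects the noise-free synthetic rule; the real data set should not be expected to behave this way.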

Task: Validate the tree

Estimate the classifier's performance using cross-validation.
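A possible sketch of this step uses scikit-learn's cross_val_score with stratified 10-fold splitting. The number of folds, the synthetic data and the 10 % label noise are assumptions chosen for illustration, not part of the assignment.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(400, 4))
y = np.where((X[:, 0] == 1) & (X[:, 1] >= 0), 1, -1)  # 1 = phishing, -1 = legitimate

# Flip about 10 % of the labels so the scores are not trivially perfect.
flip = rng.random(400) < 0.1
y = np.where(flip, -y, y)

# Stratified folds keep the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(
    DecisionTreeClassifier(random_state=0), X, y, cv=cv, scoring="accuracy"
)

print("accuracy per fold:", np.round(scores, 3))
print(f"mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Reporting the spread across folds alongside the mean gives a rough idea of how stable the estimate is.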

Task: Interpret the tree

Experiment with the parameters that control the tree size. Try to build a tree that is:

small enough to be understandable
accurate enough to be useful
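The trade-off between size and accuracy can be explored by sweeping a size-related parameter and cross-validating each setting. The sketch below varies max_depth only; the candidate depths, the 5-fold setting and the synthetic data are assumptions for illustration. Other parameters such as min_samples_leaf or max_leaf_nodes could be swept in the same way.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the real data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(400, 4))
y = np.where((X[:, 0] == 1) & (X[:, 1] >= 0), 1, -1)  # 1 = phishing, -1 = legitimate
flip = rng.random(400) < 0.1  # about 10 % label noise
y = np.where(flip, -y, y)

# Cross-validated accuracy for each candidate depth; None means unlimited depth.
results = {}
for depth in (1, 2, 3, 5, None):
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    results[depth] = cross_val_score(model, X, y, cv=5).mean()
    print(f"max_depth={str(depth):>4}: mean accuracy {results[depth]:.3f}")
```

A reasonable choice is often the smallest depth beyond which the accuracy stops improving noticeably.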

Interpret the resulting tree. How would you instruct an internet analyst to detect a phishing website, based on your decision tree?
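To turn a small tree into written instructions, it can help to print it as nested if/else rules. The sketch below uses scikit-learn's export_text for this; the feature names and the synthetic data are placeholders, so the printed rules say nothing about real phishing sites.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in for the real data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.integers(-1, 2, size=(400, 4))
y = np.where((X[:, 0] == 1) & (X[:, 1] >= 0), 1, -1)  # 1 = phishing, -1 = legitimate

# Placeholder names; with the real data, use the column names of the feature table.
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]

# Keep the tree shallow so that the rules stay readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

rules = export_text(tree, feature_names=feature_names)
print(rules)
```

Each path from the root to a leaf reads as one rule: a sequence of feature checks ending in a predicted class, which can be rephrased as a step-by-step checklist for the analyst.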
