| Name | ID | Points | Task2_1 | Task2_2 | Task2_3 | Task3_1 | Task3_2 | Total |
|---|---|---|---|---|---|---|---|---|
| Burkay Genç | 11111111111 | Max | 10 | 15 | 10 | 20 | 20 | 75 |
| | | Given | 0 | 0 | 0 | 0 | 0 | 0 |
In this assignment you will work on the Dry Bean Dataset from the UCI Machine Learning Repository to explore the properties of the dataset, such as identifying the shapes of the distributions of its features.
Do not change anything in this document other than the student_name and student_id variables in the chunk above and the Answer sections below. You will submit a PDF file at the end using this form. Your solution should assume that the raw data is imported from the Dry_Bean_Dataset.xlsx file in the same folder as your Rmd file.
Your solution should never install new packages! Only the packages we have shown in the course are allowed, and these are already installed on my computer. So, do not try to reinstall them (please!).
Good luck!
Suggest and apply a good discretization of the data features. You should analyze “Area”, “AspectRatio”, and “ShapeFactor1” separately and use your own judgment to suggest good discretization intervals for each of them. Your analysis must include a complete discussion of why you have chosen those intervals for each feature. You are expected to use plots when making and justifying your decisions.
NOTE: You are not expected/asked to use automated means of discretization. NOTE2: You will only suggest (not apply) the discretization on the dataset.
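As a rough illustration of the workflow described above, the sketch below inspects the shape of a single feature and then counts how many observations fall into hand-picked intervals. It is only a sketch under stated assumptions: the feature `x` is synthetic, and the cut points and labels are hypothetical values chosen by eye, not values taken from the Dry Bean Dataset.

```r
# Synthetic right-skewed, two-group feature standing in for a real column (assumption).
set.seed(42)
x <- c(rlnorm(800, meanlog = 10, sdlog = 0.3),
       rlnorm(200, meanlog = 11.5, sdlog = 0.15))

# Inspect the shape: histogram counts (use plot = TRUE to draw it) and quantiles.
h <- hist(x, breaks = 40, plot = FALSE)
print(quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1)))

# Hypothetical cut points picked manually after looking at the histogram.
breaks <- c(-Inf, 30000, 60000, Inf)
labels <- c("small", "medium", "large")
x_binned <- cut(x, breaks = breaks, labels = labels)

# Check how many observations each proposed interval would hold.
bin_counts <- table(x_binned)
print(bin_counts)
```

Looking at the counts per interval is one way to argue that the proposed intervals are neither nearly empty nor dominated by a single bin.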
Standardize all numeric columns and apply the kmeans algorithm on the dataset. Use k=3..10 and max_iter = 1000. Which k value gives the best silhouette result?
Repeat the previous task with DBSCAN. This time compute a DBSCAN clustering using eps values of 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and MinPts of 10, 20, 30, 50, 100. Do you get any reasonable clusterings? Explain your findings.
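The parameter sweeps in the two clustering tasks above could be organised roughly as follows. This is a minimal sketch, not a reference solution. It assumes the built-in iris data as a stand-in for the bean data, and it assumes the `cluster` package for `silhouette()` and the `dbscan` package for DBSCAN. The packages used in the course may differ, and `nstart = 10` is an extra setting that the task does not ask for.

```r
library(cluster)  # silhouette(); assumed to be among the course packages

set.seed(1)
# Stand-in data (assumption): numeric columns of the built-in iris set, standardized.
X <- scale(iris[, sapply(iris, is.numeric)])
d <- dist(X)

# k-means sweep: average silhouette width for each k.
ks <- 3:10
avg_sil <- sapply(ks, function(k) {
  fit <- kmeans(X, centers = k, iter.max = 1000, nstart = 10)  # nstart is an assumption
  mean(silhouette(fit$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks
best_k <- ks[which.max(avg_sil)]
print(round(avg_sil, 3))

# DBSCAN sweep: every eps/MinPts combination, counting clusters and noise points.
grid <- expand.grid(eps = seq(0.3, 0.8, by = 0.1),
                    MinPts = c(10, 20, 30, 50, 100))
if (requireNamespace("dbscan", quietly = TRUE)) {  # package choice is an assumption
  res <- lapply(seq_len(nrow(grid)), function(i) {
    cl <- dbscan::dbscan(X, eps = grid$eps[i], minPts = grid$MinPts[i])$cluster
    # Cluster label 0 marks noise points in the dbscan package.
    c(n_clusters = length(setdiff(unique(cl), 0)), n_noise = sum(cl == 0))
  })
  grid <- cbind(grid, do.call(rbind, res))
  print(grid)
}
```

Note that `dist()` stores all pairwise distances, so its memory use grows quadratically with the number of rows. On a dataset with many thousands of rows, computing the silhouette on a random subsample may be necessary.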
In this task, you are asked to train a Decision Tree model. You are expected to fine-tune the hyperparameters cp, maxdepth and minbucket, and apply a proper train-validate-test (70-15-15) framework. You should report the test accuracy, not the training and validation accuracies. Provide confusion matrices and/or plots to enrich your presentation.
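One possible shape for the train-validate-test framework described above is sketched here. It rests on several assumptions: the `rpart` package is used for the tree, the built-in iris data stands in for the bean data, and the search grid values are hypothetical and purely illustrative. The hyperparameters are selected on the validation set only, and the test set is touched once at the end.

```r
library(rpart)  # decision trees; assumed to be among the course packages

set.seed(7)
data <- iris  # stand-in data (assumption); Species is the class label
n <- nrow(data)
idx <- sample(n)
n_train <- floor(0.70 * n)
n_valid <- floor(0.15 * n)
train <- data[idx[1:n_train], ]
valid <- data[idx[(n_train + 1):(n_train + n_valid)], ]
test  <- data[idx[(n_train + n_valid + 1):n], ]

# Share of correctly classified rows of df.
accuracy <- function(fit, df) {
  mean(predict(fit, df, type = "class") == df$Species)
}

# Hypothetical search grid; the values are illustrative only.
grid <- expand.grid(cp = c(0.001, 0.01, 0.05),
                    maxdepth = c(2, 4, 8),
                    minbucket = c(1, 5, 10))
grid$valid_acc <- sapply(seq_len(nrow(grid)), function(i) {
  ctrl <- rpart.control(cp = grid$cp[i], maxdepth = grid$maxdepth[i],
                        minbucket = grid$minbucket[i])
  fit <- rpart(Species ~ ., data = train, method = "class", control = ctrl)
  accuracy(fit, valid)
})

# Refit with the best validation setting, then evaluate once on the test set.
best <- grid[which.max(grid$valid_acc), ]
best_ctrl <- rpart.control(cp = best$cp, maxdepth = best$maxdepth,
                           minbucket = best$minbucket)
final <- rpart(Species ~ ., data = train, method = "class", control = best_ctrl)

test_acc <- accuracy(final, test)
conf_mat <- table(predicted = predict(final, test, type = "class"),
                  actual = test$Species)
print(test_acc)
print(conf_mat)
```

Here the final model is refit on the training split alone. Refitting on the combined training and validation rows before testing is a common alternative, and either choice should be stated in the report.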
In this task, you are free to choose your model algorithm and hyperparameters. You have one goal: beat the decision tree model in test accuracy!