CMP713 Data Mining

2023-2024 Spring - Assignment 2

Given 07/05/2024, Due 21/05/2024 (excluded)

Name         ID           Points  Task2_1  Task2_2  Task2_3  Task3_1  Task3_2  Total
Burkay Genç  11111111111  Max     10       15       10       20       20       75
                          Given   0        0        0        0        0        0

In this assignment you will work on the Dry Bean Dataset from the UCI Machine Learning Repository to explore the properties of the dataset, such as identifying the shapes of the distributions of its features.

Do not change anything in this document other than the student_name and student_id variables in the above chunk and the Answer sections below. You will submit a PDF file at the end using this form. Your solution should assume that the raw data is imported from the Dry_Bean_Dataset.xlsx file located in the same folder as your Rmd file.

Your solution should never install new packages! Only the packages we have shown in the course are allowed, and these are already installed on my computer. So, do not try to reinstall them (please!).

Good luck!

TASK 2_1

Suggest a good discretization for the data features. You should analyze “Area”, “AspectRatio”, and “ShapeFactor1” separately and use your own judgment to suggest good discretization intervals for these features. Your analysis must include a complete discussion of why you have chosen those intervals for each feature. You are expected to use plots when making and justifying your decisions.

NOTE: You are not expected or asked to use automated means of discretization.

NOTE2: You will only suggest (not apply) the discretization on the dataset.
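As a hedged illustration of how manually chosen intervals can be written down and checked in R, the sketch below uses base R `cut()` on a made-up numeric vector. The values, break points, and labels are all hypothetical and are not derived from the Dry Bean Dataset; in practice the breaks would come from inspecting plots such as histograms or density curves.

```r
# Hypothetical values; NOT taken from the Dry Bean Dataset.
x <- c(1.2, 1.5, 1.9, 2.4, 2.6, 3.1, 3.8, 4.4)

# Hand-picked break points (hypothetical), e.g. chosen after looking at
# hist(x) or plot(density(x)).
breaks <- c(-Inf, 2, 3, Inf)
labels <- c("low", "medium", "high")

# cut() assigns each value to a right-closed interval:
# (-Inf, 2], (2, 3], (3, Inf).
x_binned <- cut(x, breaks = breaks, labels = labels)

# Count how many observations fall into each suggested interval.
bin_counts <- table(x_binned)
print(bin_counts)
```

Looking at the counts per interval is one simple way to argue whether the suggested bins are balanced or whether a bin isolates a meaningful group.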

Answer

TASK 2_2

Standardize all numeric columns and apply the kmeans algorithm on the dataset.

  • Try with k=3..10 and max_iter = 1000. Which k value gives the best silhouette result?
  • For each k value, compute the frequency table of assigned cluster versus actual class value. Which k value provides the best-looking table?
  • Repeat this task a few times with different seeds. How are the results affected? Discuss your findings.
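The steps above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not a reference solution: the built-in `iris` data stands in for the assignment's dataset, the `cluster` package is assumed to be among the packages already installed, and `max_iter` is assumed to correspond to the `iter.max` argument of R's `kmeans()`.

```r
library(cluster)  # assumed available; provides silhouette()

# Stand-in data: the built-in iris dataset replaces the assignment's data here.
X <- scale(iris[, 1:4])   # standardize the numeric columns
d <- dist(X)              # pairwise distances, reused for every k

set.seed(1)               # arbitrary seed; change it to study stability
ks <- 3:10

# Average silhouette width for each k.
avg_sil <- sapply(ks, function(k) {
  km <- kmeans(X, centers = k, iter.max = 1000)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- ks

best_k <- ks[which.max(avg_sil)]

# Frequency table of assigned cluster versus actual class for one k.
km_best <- kmeans(X, centers = best_k, iter.max = 1000)
freq_table <- table(cluster = km_best$cluster, class = iris$Species)

print(avg_sil)
print(freq_table)
```

Rerunning the block with a different value in `set.seed()` is one way to see how sensitive the silhouette ranking and the tables are to the random initial centers.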

Answer

TASK 2_3

Repeat the previous task with DBSCAN. This time compute a DBSCAN clustering using eps values of 0.3, 0.4, 0.5, 0.6, 0.7, 0.8 and MinPts of 10, 20, 30, 50, 100. Do you get any reasonable clusterings? Explain your findings.
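One possible way to organize the 30 parameter combinations is a grid built with base R `expand.grid()`, as in the sketch below. It assumes the `dbscan` package (with its `dbscan()` function) is the implementation in use and is already installed; if it is not, the loop is skipped and only the grid is produced. The built-in `iris` data again stands in for the assignment's dataset.

```r
# Parameter grid from the task: 6 eps values x 5 MinPts values = 30 rows.
grid <- expand.grid(eps = c(0.3, 0.4, 0.5, 0.6, 0.7, 0.8),
                    minPts = c(10, 20, 30, 50, 100))
grid$n_clusters <- NA_integer_
grid$n_noise <- NA_integer_

# Stand-in data: standardized iris columns replace the assignment's data here.
X <- scale(iris[, 1:4])

# Assumption: the dbscan package is already installed; skip quietly otherwise.
if (requireNamespace("dbscan", quietly = TRUE)) {
  for (i in seq_len(nrow(grid))) {
    fit <- dbscan::dbscan(X, eps = grid$eps[i], minPts = grid$minPts[i])
    # In the dbscan package, cluster label 0 marks noise points.
    grid$n_clusters[i] <- length(setdiff(unique(fit$cluster), 0))
    grid$n_noise[i] <- sum(fit$cluster == 0)
  }
}

print(grid)
```

Summarizing each run by its number of clusters and number of noise points makes it easier to spot combinations that label almost everything as noise or merge everything into one cluster.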

Answer

TASK 3_1

In this task, you are asked to train a Decision Tree model. You are expected to fine-tune the hyperparameters cp, maxdepth, and minbucket, and to apply a proper train-validate-test (70-15-15) framework. Report the test accuracy as your final result, not the training or validation accuracy. Provide confusion matrices and/or plots to enrich your presentation.

  • Report test data accuracy.
  • Plot the most successful tree you found.
  • Compare your validation and test accuracies. Which is better, and why?
  • Compare the test accuracy with the results you observed in the unsupervised learning tasks.
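A 70-15-15 framework with validation-based tuning might look roughly like the sketch below. It is an illustration under assumptions rather than the expected solution: `iris` stands in for the assignment's data, the `rpart` package is assumed to be the decision tree implementation (its `rpart.control()` accepts cp, maxdepth, and minbucket), and the grid values and the `accuracy()` helper are hypothetical.

```r
library(rpart)  # assumed available; provides rpart() and rpart.control()

set.seed(42)                 # arbitrary seed for a reproducible split
data <- iris                 # stand-in for the assignment's data
n <- nrow(data)
idx <- sample(n)             # random permutation of the row indices
n_train <- floor(0.70 * n)
n_valid <- floor(0.15 * n)
train <- data[idx[1:n_train], ]
valid <- data[idx[(n_train + 1):(n_train + n_valid)], ]
test  <- data[idx[(n_train + n_valid + 1):n], ]

# Hypothetical helper: share of correctly predicted class labels.
accuracy <- function(model, newdata) {
  mean(predict(model, newdata, type = "class") == newdata$Species)
}

# Small illustrative grid; the values are hypothetical, not recommendations.
grid <- expand.grid(cp = c(0.001, 0.01, 0.1),
                    maxdepth = c(2, 4, 8),
                    minbucket = c(1, 5, 10))

# Select hyperparameters using the validation set only.
grid$valid_acc <- sapply(seq_len(nrow(grid)), function(i) {
  ctrl <- rpart.control(cp = grid$cp[i], maxdepth = grid$maxdepth[i],
                        minbucket = grid$minbucket[i])
  fit <- rpart(Species ~ ., data = train, method = "class", control = ctrl)
  accuracy(fit, valid)
})
best <- grid[which.max(grid$valid_acc), ]

# Refit with the selected settings and evaluate on the test set once.
best_ctrl <- rpart.control(cp = best$cp, maxdepth = best$maxdepth,
                           minbucket = best$minbucket)
best_fit <- rpart(Species ~ ., data = train, method = "class",
                  control = best_ctrl)
test_acc <- accuracy(best_fit, test)
conf_mat <- table(predicted = predict(best_fit, test, type = "class"),
                  actual = test$Species)

print(best)
print(test_acc)
print(conf_mat)
```

Because the winning row of the grid is picked for its validation accuracy, that number tends to be optimistic, which is why the test set is held back until the end.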

Answer

TASK 3_2

In this task, you are free to choose your model algorithm and hyperparameters. You have one goal: beat the decision tree model in test accuracy!

  • Try whatever model you want to beat DT.
  • You should work on the same training, validation and test sets you used above.
  • Report test accuracy and discuss your results.
  • Your code should clearly display hyperparameter tuning and model selection efforts.

Answer