CMP713 Data Mining

2023-2024 Spring - Assignment 1

Given 02/04/2023, Due 09/04/2023 (excluded)

Name          ID           Points  Task1  Task2  Task3  Task4  Total
Burkay Genç   11111111111  Max     3      6      6      10     25
                           Given   0      0      0      0      0

In this assignment, you will work with the Dry Bean Dataset from the UCI Machine Learning Repository and explore its properties, such as the shapes of the distributions of its features.

Do not change anything in this document other than the student_name and student_id variables in the chunk above and the Answer sections below. You will submit your Rmd file at the end. Your solution should assume that the raw data is imported from the file Dry_Bean_Dataset.xlsx, located in the same folder as your Rmd file.

Your solution should never install new packages! Only the packages we have shown in the course are allowed, and these are already installed on my computer. So, do not try to reinstall them (please!).

Good luck!

TASK 1

Import the data from the file into R. Be careful with the extent of the data; do not accidentally trim it. You should read 13611 data rows and 16 + 1 columns (16 features plus the target column).

After importing the data, print the number of rows and the number of columns. Also read (literally, with your eyes) the explanation of each feature on the dataset's page in the UCI repository.
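As a rough illustration of the size check described above, the sketch below reads the file and compares its dimensions with the numbers stated in this task. It assumes that the readxl package is among the course packages (an assumption, not something this document confirms) and that the data sits on the first sheet with a header row. The names path, expected_dim and beans are hypothetical.

```r
# File name and expected size are taken from the assignment text.
path <- "Dry_Bean_Dataset.xlsx"
expected_dim <- c(rows = 13611, cols = 16 + 1)

# Assumption: readxl is one of the course packages; it is used only if present.
if (requireNamespace("readxl", quietly = TRUE) && file.exists(path)) {
  beans <- readxl::read_excel(path)  # first sheet, first row used as column names
  cat("Rows:", nrow(beans), " Columns:", ncol(beans), "\n")
  if (!all(dim(beans) == expected_dim)) {
    warning("Unexpected size: check the sheet, range, and header settings.")
  }
} else {
  message("readxl or the data file is not available; nothing was read.")
}
```

If the warning appears, the import has probably dropped or added rows or columns, which is the trimming problem this task warns about.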

Answer

TASK 2

Draw a histogram for each feature of the data (except the target column at the end).

  • Discuss the shapes of the distributions
  • Do you notice anything weird?
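One possible way to draw a histogram per feature is a loop over all columns except the target, sketched here with base R graphics only. The data frame demo and its columns feature_a and feature_b are made-up stand-ins for the imported bean data; only the column name Class comes from this assignment.

```r
# Synthetic stand-in data; replace with the imported data set in a real solution.
set.seed(1)
demo <- data.frame(
  feature_a = c(rnorm(200, mean = 50, sd = 5), rnorm(50, mean = 120, sd = 10)),
  feature_b = rbeta(250, shape1 = 8, shape2 = 2),
  Class = rep(c("common", "rare"), times = c(200, 50))
)

# Every column except the target column.
features <- setdiff(names(demo), "Class")

# One panel per feature, side by side.
old_par <- par(mfrow = c(1, length(features)))
for (f in features) {
  hist(demo[[f]], breaks = 30, main = f, xlab = f)
}
par(old_par)  # restore the previous plotting layout
```

The number of breaks is a free choice; trying a few values helps separate genuine features of a distribution from binning artifacts.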

Answer

TASK 3

Draw a boxplot for each feature of the data (except the target column at the end).

  • Discuss the shapes of the plots
  • Do you notice anything weird?
  • Why do you have so many “outliers”?
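When thinking about the “outliers” question, it may help to recall how a default boxplot decides which points to draw individually: anything more than 1.5 times the interquartile range beyond the hinges. The sketch below applies that rule to synthetic data made of two groups of different size. It is an illustration of the rule under made-up numbers, not a statement about what happens in the bean data; x, group and flagged are hypothetical names.

```r
# Synthetic data: a large group and a smaller group with a shifted mean.
set.seed(42)
x <- c(rnorm(900, mean = 50, sd = 5), rnorm(100, mean = 120, sd = 10))
group <- rep(c("common", "rare"), times = c(900, 100))

# boxplot.stats() uses the same rule as boxplot(): coef = 1.5 by default.
stats <- boxplot.stats(x)
flagged <- x %in% stats$out

# Cross-tabulate group membership against being drawn as an "outlier".
print(table(group, flagged))

boxplot(x, main = "Pooled values")
```

In this toy example the quartiles are dominated by the larger group, so the fences sit close to it and the smaller group falls outside them as a whole.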

Answer

TASK 4

Draw a boxplot of each feature again, but this time facet the data with respect to the classes in the target feature (Class).

  • Explain your findings after this new insight into the data
  • Try to use the plots from Task 2, Task 3 and Task 4 together to come up with some understanding of the data.
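A per-class boxplot can be sketched in base R with the formula interface, as below. The example again uses synthetic stand-in data (demo and feature_a are hypothetical; Class is the target column named in this task) and counts the flagged points before and after splitting by class. If ggplot2 was shown in the course, its faceting functions would be an alternative.

```r
# Synthetic stand-in data with a single feature and two classes.
set.seed(7)
demo <- data.frame(
  feature_a = c(rnorm(300, mean = 50, sd = 5), rnorm(60, mean = 120, sd = 10)),
  Class = rep(c("common", "rare"), times = c(300, 60))
)

# Points flagged by the 1.5 * IQR rule when all classes are pooled.
pooled_out <- length(boxplot.stats(demo$feature_a)$out)

# Points flagged when the rule is applied within each class separately.
per_class_out <- tapply(
  demo$feature_a, demo$Class,
  function(v) length(boxplot.stats(v)$out)
)

cat("Flagged when pooled:", pooled_out, "\n")
print(per_class_out)

# One box per class for the same feature.
boxplot(feature_a ~ Class, data = demo, main = "feature_a by Class")
```

Comparing the two counts for each real feature is one way to connect the plots from the earlier tasks with the per-class view.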

Answer