CMP713 Data Mining
2021F Assignment 1 - due 10 November 2021
Question 1
Write a function that takes a single argument, a data frame, and outputs the following on the console:
- Number of rows
- Number of columns
- A list of column names with numeric data type
- A list of column names with character data type
- A list of column names with logical data type
- A list of column names with factor data type
- Number of NAs in the data frame
- Names of columns which contain NAs and the corresponding number of NAs in each
- Indices of rows containing NAs
- For each numerical column, the mean and standard deviation of the column (be careful with NAs)
- For each logical column, the percentages of TRUE and FALSE values
- For each categorical column, the number of levels, and the names and frequencies of the top 3 most frequent levels, in descending order. If there are fewer than three levels, then just report them all.
- For each pair of numerical columns, the correlation (Pearson) between them
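The NA-sensitive items in the list above can be sketched with base R as follows. This is only an illustrative sketch, not a complete or required solution: the data frame `df` and its column names (`x1`, `x2`, `flag`, `colour`) are made up for the example, and dropping NAs before computing each statistic is an assumption about what "be careful with NAs" is meant to require.

```r
# Small hypothetical data frame, used only for illustration.
df <- data.frame(
  x1 = c(1, 2, NA, 4),
  x2 = c(2, 4, 6, NA),
  flag = c(TRUE, FALSE, TRUE, NA),
  colour = factor(c("red", "blue", "red", "green"))
)

# Column names by data type.
numeric_cols <- names(df)[sapply(df, is.numeric)]
factor_cols  <- names(df)[sapply(df, is.factor)]

# NA counts per column, keeping only the columns that contain NAs.
na_counts <- colSums(is.na(df))
na_counts <- na_counts[na_counts > 0]

# Indices of rows that contain at least one NA.
na_rows <- which(!complete.cases(df))

# Mean and standard deviation with NAs dropped (assumed handling).
x1_mean <- mean(df$x1, na.rm = TRUE)
x1_sd   <- sd(df$x1, na.rm = TRUE)

# Percentage of TRUE among the non-NA values of a logical column.
pct_true <- 100 * mean(df$flag, na.rm = TRUE)

# Level frequencies of a factor, most frequent first, at most three.
level_freq <- sort(table(df$colour), decreasing = TRUE)
top_levels <- head(level_freq, 3)

# Pearson correlation using only rows where both values are present.
r_x1_x2 <- cor(df$x1, df$x2, use = "complete.obs", method = "pearson")
```

Without `na.rm = TRUE`, `mean()` and `sd()` return `NA` as soon as a column contains a single missing value, and `cor()` behaves the same way unless `use` is set.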
Your output should be clean and clear: it should contain everything that is necessary and nothing more. For example:
Number of rows: 114
Number of columns: 14
Numeric columns: X1, X4, X6
Character columns: X2, X3
...
Columns with NAs: X1(4), X5(12), X8(1)
...
Categorical column levels:
X5[6] - red(32), green(23), blue(18)
X7[2] - tall(74), short(40)
...
Correlation of X1 and X4: 0.56748
Correlation of X1 and X6: -0.86834
...
The output above is incomplete and is given only as an example. Your output may look different, but it must be as concise as the example.
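One possible way to build lines in the compact style of the example is with `paste0()`, `sprintf()` and `cat()`. The sketch below is only an illustration: the vectors `na_counts` and `freq` are hypothetical inputs typed in by hand (including the made-up level `other`), not values computed from a real data frame.

```r
# Hypothetical NA counts, named by column.
na_counts <- c(X1 = 4, X5 = 12, X8 = 1)

# Build "name(count)" pieces and join them with commas.
na_line <- paste0(names(na_counts), "(", na_counts, ")", collapse = ", ")
cat("Columns with NAs: ", na_line, "\n", sep = "")

# Hypothetical level frequencies for one factor column.
freq <- sort(c(red = 32, green = 23, blue = 18, other = 10), decreasing = TRUE)
top <- head(freq, 3)
level_line <- sprintf("%s[%d] - %s", "X5", length(freq),
                      paste0(names(top), "(", top, ")", collapse = ", "))
cat(level_line, "\n", sep = "")

# Fixed number of decimals for a correlation value.
cor_line <- sprintf("Correlation of %s and %s: %.5f", "X1", "X4", 0.56748)
cat(cor_line, "\n", sep = "")
```

Using `cat()` rather than `print()` avoids the `[1]` index prefix and the quotation marks that `print()` adds to character strings.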
You should prepare the function in an R script and submit it on this webpage.