WAN WENBO
Data Warehousing and Data Mining
a. Construct different decision trees based on different partitions of each data set into a training set and a test set.
b. Compare the structures and classification performances of these different trees.
c. For a selected training/test set partition in a., generate different pruned versions of the decision tree, and compare its classification performance before and after pruning.
d. For the Vertebral Column data set, observe the classification performance associated with the different classes, and determine which pair(s) of classes are likely to be confused with each other.
e. For a selected confused class pair in d., identify the corresponding leaf node(s) and analyze the sequence of decisions that lead to the misclassification.
a. Construct different decision trees based on different partitions of each data set into a training set and a test set.
Answer:
First, install and load the packages:
install.packages(\
library(rpart)
library(rpart.plot)
Then import the Statlog (Heart) data set. I read it directly from its web address.
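The import step can be sketched as follows. The report does not give the address, so heart_url below is an assumed location on the UCI repository, and name_columns is a hypothetical helper; the two sample rows are made up purely to show the 14-column layout offline.

```r
# Assumed location of the Statlog (Heart) file; the report does not state the URL.
heart_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/statlog/heart/heart.dat"

# Hypothetical helper: label the columns v1..v14 to match the attribute list.
name_columns <- function(d) {
  names(d) <- paste0("v", seq_len(ncol(d)))
  d
}

# With network access, the whitespace-separated file (no header row) could be read as:
# heart <- name_columns(read.table(heart_url, header = FALSE))

# Offline illustration with two made-up rows in the same layout.
sample_txt <- "70 1 4 130 322 0 2 109 0 2.4 2 3 3 2
67 0 3 115 564 0 2 160 0 1.6 2 0 7 1"
demo <- name_columns(read.table(text = sample_txt, header = FALSE))
print(dim(demo))
```

Reading with header = FALSE matters here because the file is assumed to have no header line; otherwise the first record would be consumed as column names.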
Attribute Information:
v1. age
v2. sex
v3. chest pain type (4 values)
v4. resting blood pressure
v5. serum cholesterol in mg/dl
v6. fasting blood sugar > 120 mg/dl
v7. resting electrocardiographic results (values 0, 1, 2)
v8. maximum heart rate achieved
v9. exercise induced angina
v10. oldpeak = ST depression induced by exercise relative to rest
v11. the slope of the peak exercise ST segment
v12. number of major vessels (0-3) colored by fluoroscopy
v13. thal: 3 = normal; 6 = fixed defect; 7 = reversible defect
v14 (variable to be predicted): absence (1) or presence (2) of heart disease
I renamed the v14 attribute to “RES”. It holds the result to be predicted: absence (1) or presence (2) of heart disease.
We can view the structure of the data set. It has 270 observations and 14 variables.
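A minimal sketch of the renaming and structure check, assuming the class label sits in the 14th column. prepare_heart is a hypothetical helper and toy is a placeholder data frame, not the real data; on the full data set, dim() would be expected to report 270 rows and 14 columns.

```r
# Hypothetical helper: rename column 14 to RES and store it as a factor,
# so that a classification tree (rather than a regression tree) is fitted later.
prepare_heart <- function(d) {
  names(d)[14] <- "RES"
  d$RES <- factor(d$RES, levels = c(1, 2))  # 1 = absence, 2 = presence
  d
}

# Placeholder frame with 14 columns, standing in for the imported data.
toy <- as.data.frame(matrix(1, nrow = 4, ncol = 14))
toy[[14]] <- c(1, 2, 2, 1)
toy <- prepare_heart(toy)

str(toy)         # variable names and types
print(dim(toy))  # number of rows and columns
```

Converting RES to a factor is a design choice in this sketch; the report itself does not say whether the column was kept numeric.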
Next we can build the decision tree. Because the rows of the Statlog (Heart) data set are in no particular order, I simply took the first 80% for training and the last 20% for testing.
(1). I split the data set into an 80% training set and a 20% test set. By default, rpart uses the Gini index to build the tree.
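The split and the tree fit could look roughly like this. As a stand-in, the sketch uses the kyphosis data bundled with rpart so that it runs without the heart file; split_by_order is a hypothetical helper. For the 270-row heart data the same split would give 216 training rows and 54 test rows, and parts$train would play the role of stree_train.

```r
library(rpart)

# Hypothetical helper: split by row order, first `frac` for training, the rest for testing.
split_by_order <- function(d, frac = 0.8) {
  n_train <- floor(frac * nrow(d))
  list(train = d[seq_len(n_train), , drop = FALSE],
       test  = d[seq(n_train + 1, nrow(d)), , drop = FALSE])
}

# Stand-in data: kyphosis ships with rpart (81 rows, class label Kyphosis).
parts <- split_by_order(kyphosis, 0.8)

# Classification tree; split = "gini" is rpart's default, written out for clarity.
fit <- rpart(Kyphosis ~ ., data = parts$train, method = "class",
             parms = list(split = "gini"))

# Predict on the held-out rows and tabulate predictions against the true classes.
pred <- predict(fit, parts$test, type = "class")
print(table(predicted = pred, actual = parts$test$Kyphosis))
```

Passing parms = list(split = "information") instead would switch the splitting criterion to information gain, which is one way to obtain a differently structured tree on the same partition.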
The training set for the Statlog data, named stree_train, looks like the following: