HW1
Due Date: Nov. 8
Submission requirements:
Please submit your solutions to our class website.
Part I: written part
1. Suppose that a data warehouse consists of four dimensions, date, spectator, location, and game, and
two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game], what specific OLAP
operations should one perform in order to list the total charge paid by student spectators in Los Angeles?
(c) Bitmap indexing is a very useful optimization technique. Please present the pros and cons of
using bitmap indexing in this given data warehouse.
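As background for part (c), the following is a minimal sketch of how a bitmap index works: one bitmap per distinct value of a column, with conjunctive selections answered by a bitwise AND. The rows, column names, and values are hypothetical and are not part of the assignment data.

```python
from collections import defaultdict

def build_bitmap_index(values):
    """Return {value: int}; bit i of the int is set when row i holds that value."""
    index = defaultdict(int)
    for row_id, value in enumerate(values):
        index[value] |= 1 << row_id
    return dict(index)

def matching_rows(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical fact-table columns (illustrative only, not assignment data).
category = ["student", "adult", "student", "senior", "student"]
city = ["Los Angeles", "Los Angeles", "Chicago", "Los Angeles", "Los Angeles"]
charge = [10.0, 20.0, 10.0, 15.0, 10.0]

category_index = build_bitmap_index(category)
city_index = build_bitmap_index(city)

# A single bitwise AND selects the rows that satisfy both predicates.
hits = category_index["student"] & city_index["Los Angeles"]
rows = matching_rows(hits)
total_charge = sum(charge[i] for i in rows)
```

The sketch needs one bitmap per distinct value, so a column with few distinct values (such as the spectator category) yields few bitmaps, while a column with many distinct values yields many.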
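2. An e-mail database stores a large number of e-mail messages. Design the structure of a data warehouse so that users can query and mine the data along multiple dimensions.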
3. Suppose a hospital tested the age and body fat data for 18 randomly selected adults with the following result:
age    23    23    27    27    39    41    47    49    50
%fat   9.5   26.5  7.8   17.8  31.4  25.9  27.4  27.2  31.2
age    52    54    54    56    57    58    58    60    61
%fat   34.6  42.5  28.8  33.4  30.2  34.1  32.9  41.2  35.7
(a) Calculate the mean, median, and standard deviation of age and %fat.
(b) Draw the boxplots for age and %fat.
(c) Draw a scatter plot based on these two variables.
(d) Normalize age based on min-max normalization.
(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables positively or negatively correlated?
(f) Smooth the fat data by bin means, using a bin depth of 6.
(g) Smooth the fat data by bin boundaries, using a bin depth of 6.
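The techniques named in parts (d) to (g) can be sketched with the Python standard library as follows. This is one possible reading, not a reference solution: the function names are illustrative, the binning assumes equal-depth partitioning of the sorted values, and ties in boundary smoothing are sent to the lower boundary by assumption.

```python
from statistics import mean

age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6,
       42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so that min -> new_min and max -> new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def pearson(x, y):
    """Pearson's product moment correlation coefficient of two equal-length lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def equal_depth_bins(values, depth):
    """Sort the values and cut them into consecutive bins of `depth` items."""
    ordered = sorted(values)
    return [ordered[i:i + depth] for i in range(0, len(ordered), depth)]

def smooth_by_means(bins):
    """Replace every value in a bin with the bin mean."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        lo, hi = b[0], b[-1]  # bins are sorted, so these are the boundaries
        smoothed.append([lo if v - lo <= hi - v else hi for v in b])
    return smoothed

normalized_age = min_max(age)
r = pearson(age, fat)  # a positive value indicates positive correlation
bins = equal_depth_bins(fat, 6)
by_means = smooth_by_means(bins)
by_boundaries = smooth_by_boundaries(bins)
```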
4. Consider the data set shown in Table 1 (min_sup = 60%, min_conf = 70%).
(a) Find all frequent itemsets using Apriori by treating each transaction ID as a market basket.
(b) Use the results in part (a) to compute the confidence for the association rules {A, B} → {C} and {C} → {A, B}. Is confidence a symmetric measure?
(c) List all of the strong association rules (with support s and confidence c) matching the following
metarule, where X is a variable representing customers, and item_i denotes variables representing items (e.g., “A”, “B”, etc.):
Table 1. Example of market basket transactions.
PAGE 1 11/21/2017
TID   Items-bought
T1    {A, D, B, C}
T2    {D, A, C, E, B}
T3    {A, B, E}
T4    {A, B, D}
5. Consider the data set shown in Table 1 (min_sup = 60%).
(a) Find all frequent itemsets using FP-Growth. Please present all the FP-trees and all the conditional pattern bases.
(b) Compare the efficiency of Apriori and FP-Growth.
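For comparison with FP-Growth, the level-wise candidate generation that Apriori performs can be sketched as below, using the four transactions of Table 1. This is a minimal, unoptimized sketch; the function names are illustrative, and it rescans all transactions for every candidate, which is the cost FP-Growth is designed to avoid.

```python
from itertools import combinations

# Transactions T1..T4 from Table 1.
transactions = [
    {"A", "D", "B", "C"},
    {"D", "A", "C", "E", "B"},
    {"A", "B", "E"},
    {"A", "B", "D"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def apriori(transactions, min_sup):
    """Return {frozenset: support} for all itemsets with support >= min_sup."""
    items = sorted({i for t in transactions for i in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_sup]
    frequent = {}
    k = 1
    while level:
        for s in level:
            frequent[s] = support(s, transactions)
        k += 1
        prev = set(level)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: one full pass over the transactions per candidate.
        level = [c for c in candidates if support(c, transactions) >= min_sup]
    return frequent

frequent = apriori(transactions, min_sup=0.6)
```

With four transactions, a 60% threshold means an itemset must appear in at least three of them.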
Part II: lab part
Question 1: Learn to use market basket analysis to make product purchase recommendations to customers.
The data set contains transactions from a large supermarket. Each transaction was made by a loyalty-card holder. For simplicity, we limited the number of product categories in this supermarket data to 20. The field value for a category in a transaction basket is 1 if the customer bought from it and 0 otherwise. The file named “Transactions” has data for 46243 transactions.
The data are available from the class web page.
Your submission should consist only of those deliverables marked “Hand-in”.
Market basket analysis has the objective to discover individual products, or groups of products that tend to occur together in transactions. The knowledge obtained from a market basket analysis can be employed by a business to recognize products frequently sold together in order to determine recommendations and cross-sell and up-sell opportunities. It can also be used to improve the efficiency of a promotional campaign.
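The three measures used to rank rules in this lab (support, confidence, and lift) can be sketched on 0/1 flag data as follows. The category names and rows are hypothetical, and the function name is illustrative. Here "support" means the fraction of transactions containing both sides of the rule, which differs from the antecedent support used as a threshold in the modeling node.

```python
def rule_metrics(rows, antecedent, consequent):
    """Compute (support, confidence, lift) for antecedent -> consequent.

    rows: list of dicts mapping a category name to a 0/1 flag.
    antecedent, consequent: lists of category names.
    """
    n = len(rows)
    has_a = [all(r[c] == 1 for c in antecedent) for r in rows]
    has_c = [all(r[c] == 1 for c in consequent) for r in rows]
    n_a = sum(has_a)
    n_c = sum(has_c)
    n_both = sum(a and c for a, c in zip(has_a, has_c))
    support = n_both / n                           # P(antecedent and consequent)
    confidence = n_both / n_a if n_a else 0.0      # P(consequent | antecedent)
    lift = confidence / (n_c / n) if n_c else 0.0  # confidence / P(consequent)
    return support, confidence, lift

# Hypothetical baskets (illustrative only, not the lab data).
rows = [
    {"bread": 1, "milk": 1, "beer": 0},
    {"bread": 1, "milk": 0, "beer": 1},
    {"bread": 1, "milk": 1, "beer": 0},
    {"bread": 0, "milk": 1, "beer": 0},
    {"bread": 0, "milk": 0, "beer": 1},
]
support, confidence, lift = rule_metrics(rows, ["bread"], ["milk"])
```

A lift above 1 means the consequent is bought more often in baskets containing the antecedent than in baskets overall; a lift below 1 means the opposite.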
Run Apriori on the “Transactions” data set. Set the “Type” of “COD” to “Typeless”; for the 20 category fields, set the “direction” to “Both” and the “Type” to “Flag”. In the modeling node (Apriori node), set “Minimum antecedent support” to 7%, “Minimum confidence” to 45%, and “Maximum number of antecedents” to 4. In general, you should try different values of these parameters to see what type of rules you get.
Hand-in: The list of association rules generated by the model.
Sort the rules by lift, support, and confidence, respectively, to see the rules identified.
Hand-in: For each case, choose the top 5 rules (note: make sure there are no redundant rules among the 5) and give 2-3 lines of comments. Many of the rules will be logically redundant and therefore will have to be eliminated after you think carefully about them.
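One way to think about redundancy is sketched below: a rule is treated as redundant when a more general rule (a proper subset of its antecedent, same consequent) has confidence at least as high. This is only one possible definition, assumed here for illustration; the `Rule` type, the function names, and the example rules are all hypothetical and do not come from the modeling tool.

```python
from typing import FrozenSet, NamedTuple

class Rule(NamedTuple):
    antecedent: FrozenSet[str]
    consequent: str
    support: float
    confidence: float
    lift: float

def drop_redundant(rules):
    """Keep rules that are not dominated by a more general rule."""
    kept = []
    for r in rules:
        redundant = any(
            o.consequent == r.consequent
            and o.antecedent < r.antecedent   # proper subset: o is more general
            and o.confidence >= r.confidence
            for o in rules
        )
        if not redundant:
            kept.append(r)
    return kept

def top_k(rules, key, k=5):
    """Top k non-redundant rules, sorted descending by 'lift', 'support' or 'confidence'."""
    ranked = sorted(drop_redundant(rules), key=lambda r: getattr(r, key), reverse=True)
    return ranked[:k]

# Hypothetical rules (illustrative only).
rules = [
    Rule(frozenset({"milk"}), "bread", 0.30, 0.60, 1.5),
    Rule(frozenset({"milk", "eggs"}), "bread", 0.10, 0.55, 1.4),
    Rule(frozenset({"beer"}), "chips", 0.08, 0.50, 2.0),
]
best_by_lift = top_k(rules, "lift", k=2)
```

An automatic filter like this is only a starting point; the hand-in still asks for a judgment about which rules are logically redundant.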