Homework Index: To See If The Questions Have Been Changed, or If You Are Required To Use Different Data or Examples
Chapter 1:
1, 2, 3, 4, 5, 6
Chapter 2:
2, 3, 4, 6
Chapter 3:
1, 2, 3, 4, 7, 11
Chapter 4:
1, 2, 3, 4, 5, 16
Chapter 5:
Chapter 6:
6.6, 6.14c
Chapter 7:
Chapter 8:
7, 12
Chapter 9:
1, 3, 4, 6
Chapter 10:
1, 2, 6, 10, 16
Data: 14, 17, 19, 19, 23, 25, 27, 31, 31, 32, 33, 37, 37, 37, 37, 45, 45, 48, 48, 49, 49, 53, 75, 79, 80
Freq.: 53, 37, 65, 18, 12, 13, 2
Rel. Freq.:
Variables for the approximate median: L_1, n, freq_l, freq_median, width, median:
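For reference, these variables plug into the textbook's interpolated (grouped-data) median formula, where L_1 is the lower boundary of the median interval, n the total number of values, freq_l the frequencies of the intervals below the median interval, freq_median the frequency of the median interval, and width the interval width:

\[ \text{median} \approx L_1 + \left( \frac{n/2 - \sum \text{freq}_l}{\text{freq}_{\text{median}}} \right) \times \text{width} \]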
Problem 2.4
Using the data given in the book (also below), plot a scatter plot of %Fat v. Age.
Also, plot a q-q plot of these two variables.
See instructions below for plotting a q-q plot.
(Those instructions are for plotting vs. a normal dist. NOT the case here.)
Note: If you have the same number of observations in both datasets (which you do in this case),
you can plot a q-q plot by simply plotting a scatter plot of the sorted datasets against each other.
If you are using an Excel trendline, it cannot be customized to run through Q1 and Q3.
For this assignment, you don't have to plot the line.
If you were using a statistical package that supports q-q plots, the line would be drawn.
Alternatively, you could paste the chart into Word or PPT and draw the line there.
Detailed instructions on plotting a q-q plot are below.
Those instructions are for plotting a distribution against a normal distribution,
to see if the distribution is normal. They can be modified for other q-q plots.
AGE   % FAT
23    9.5
23    26.5
27    7.8
27    17.8
39    31.4
41    25.9
47    27.4
49    27.2
50    31.2
52    34.6
54    42.5
54    28.8
56    33.4
57    30.2
58    34.1
58    32.9
60    41.2
61    35.7
These instructions are for plotting against a normal distribution, using z-scores. However, the same instructions can be modified for plotting other datasets against each other.
1. Place or load your data values into the first column. Leave the first row blank for labeling the columns. Sort the data in ascending order (look under the Data menu).
2. Label the second column as Rank. Enter the ranks, starting with 1 in the row right below the label. Each following row will be one more than the last (note: you can use an expression and copy it down).
3. Label the third column as Rank Proportion. This column shows the rank proportion of each value. Use this expression for the first data value: =(B2 - 0.5) / COUNT(B$2:B$N), where N is the last data row, copying the first data expression to the remaining rows. Check to make sure your percentiles look like they are correct!
4. Label the fourth column as Rank-based z-scores. Excel provides these values with the NORMSINV function. Use this function to create the values in the fourth column.
5. Copy the first column to the fifth column. The Excel chart wizard works better if the x-axis values are just to the left of the y-axis values.
6. Select the fourth and fifth columns. Select the chart wizard and then the scatter plot. The default data values should be good, but you should provide good labels.
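As a sketch of the equal-sample-size shortcut described above (sort both samples and scatter one against the other), here is a minimal Python example; the use of matplotlib is my choice, not part of the assignment:

```python
import numpy as np
import matplotlib.pyplot as plt

# Data from Problem 2.4 (age and %fat for 18 subjects)
age = [23, 23, 27, 27, 39, 41, 47, 49, 50, 52, 54, 54, 56, 57, 58, 58, 60, 61]
fat = [9.5, 26.5, 7.8, 17.8, 31.4, 25.9, 27.4, 27.2, 31.2, 34.6,
       42.5, 28.8, 33.4, 30.2, 34.1, 32.9, 41.2, 35.7]

# Scatter plot of %fat vs. age
plt.figure()
plt.scatter(age, fat)
plt.xlabel("Age")
plt.ylabel("% Fat")
plt.title("%Fat v. Age")

# q-q plot: with equal sample sizes, sort each sample and plot the
# sorted values (quantiles) against each other
plt.figure()
plt.scatter(np.sort(age), np.sort(fat))
plt.xlabel("Age quantiles")
plt.ylabel("% Fat quantiles")
plt.title("q-q plot of %Fat vs. Age")
plt.show()
```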
12, 0, 17, 9
Tuple i = (22, 1, 42, 10)
Tuple j = (20, 0, 36, 8)

Euclidean Distance:
(xi - xj)^2 = 4, 1, 36, 4
SUM = 45
SQRT(SUM) = 6.708

Manhattan Distance:
ABS(xi - xj) = 2, 1, 6, 2
SUM = 11

Minkowski Distance (h = 3):
ABS(xi - xj)^3 = 8, 1, 216, 8
SUM = 233
h-root = 6.153449

Supremum Distance: the maximum distance for one of the attributes:
ABS(xi - xj) = 2, 1, 6, 2
MAX = 6
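A small Python check of these hand computations (a sketch using only NumPy so it stays self-contained):

```python
import numpy as np

xi = np.array([22, 1, 42, 10])
xj = np.array([20, 0, 36, 8])
d = np.abs(xi - xj)

euclidean = np.sqrt(np.sum(d ** 2))        # L2 norm
manhattan = np.sum(d)                      # L1 norm
minkowski_h3 = np.sum(d ** 3) ** (1 / 3)   # Minkowski with h = 3
supremum = np.max(d)                       # L-infinity norm

print(euclidean, manhattan, minkowski_h3, supremum)
# expected: 6.708..., 11, 6.1534..., 6
```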
Problem 3.1:
From the textbook, but you must use different examples than what is in the Instructor's Manual.
And don't tell me that you don't have access to the Instructor's Manual. Or to previous students' work.
Answers must be in your own words.
Problem 3.2:
Describe the methods and how they were or could have been implemented in the
Mean
New Bin
Problem 3.4
Regarding data integration:
a. List some domain-specific (e.g., business, scientific, etc., NOT technical) reasons why heterogeneous data exist.
b. List some ways that data can be heterogeneous and require integration. Give a one-sentence example of each:
synonyms, homonyms, formatting issues, levels of granularity, etc.
c. What is a schema? What is the difference between static schema integration and partial dynamic integration?
Give one example of when each would be appropriate.
Problem 3.7
DATA
14
17
19
19
23
25
27
31
31
32
33
37
37
37
37
45
45
48
48
49
49
53
75
79
80
a.
Transform the value 53 for this dataset onto the range [0.0, 1.0]
v' = [(v - minA)/(maxA - minA)] * (new_maxA - new_minA) + new_minA
v = 53
minA = 14
maxA = 80
new_minA = 0.0
new_maxA = 1.0
v' =
b.
c.
d.
Which of the normalization methods would be appropriate for the IRIS data?
(Review pages 113-115.)
Discuss one of the numeric attributes, such as PetalWidth.
(Consider the limitations of the various methods.)
decimal scaling: (… for the scaling)
min-max:
z-score: (use 18.18 for the standard deviation)
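A quick Python sketch of the three normalizations applied to the value 53 from the data above; the 18.18 standard deviation and the divide-by-100 decimal scaling fall out of this data, but treat the printed numbers as something to verify by hand:

```python
import numpy as np

data = np.array([14, 17, 19, 19, 23, 25, 27, 31, 31, 32, 33, 37, 37, 37, 37,
                 45, 45, 48, 48, 49, 49, 53, 75, 79, 80])
v = 53

# min-max normalization onto [0.0, 1.0]
v_minmax = (v - data.min()) / (data.max() - data.min())

# z-score normalization (sample standard deviation, ddof=1, is about 18.18)
v_zscore = (v - data.mean()) / data.std(ddof=1)

# decimal scaling: divide by 10^j where j is the smallest integer
# such that max(|v'|) < 1 (here j = 2, i.e., divide by 100)
j = int(np.ceil(np.log10(np.abs(data).max())))
v_decimal = v / 10 ** j

print(v_minmax, v_zscore, v_decimal)
```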
Problem 3.11
Use the data listed for Part a. Use the IRIS dataset for the other parts.
Data
Bins
14
17
19
19
23
25
27
31
31
32
33
37
37
37
37
45
45
48
48
49
49
53
75
79
80
20
30
40
50
60
70
80
Problem 3.11
a.
Plot an equi-width histogram of width 10.
b.
Using the IRIS dataset, sketch examples of sampling (for your results, show the observation #; put your answers on a separate worksheet):
SRSWOR: Obsv. #, PetalWidth, Class
SRSWR: Obsv. #, PetalWidth, Class
Clustered: Obsv. # (also 5 clusters)
Stratified: what would you use as strata for a stratified sample?
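A minimal Python sketch of part (a) and the sampling schemes, assuming the IRIS data is loaded into a pandas DataFrame from a file named iris.csv with columns ObsNum, SepalLength, SepalWidth, PetalLength, PetalWidth, Class (the file name and column names are my assumptions; any equivalent layout works):

```python
import pandas as pd
import matplotlib.pyplot as plt

iris = pd.read_csv("iris.csv")   # assumed file with the columns listed above

# (a) equal-width histogram of width 10 for the data listed above
data = [14, 17, 19, 19, 23, 25, 27, 31, 31, 32, 33, 37, 37, 37, 37,
        45, 45, 48, 48, 49, 49, 53, 75, 79, 80]
plt.hist(data, bins=range(10, 91, 10), edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Count")
plt.show()

# (b) sampling sketches, sample size 5
srswor = iris.sample(n=5, replace=False, random_state=1)   # SRSWOR
srswr = iris.sample(n=5, replace=True, random_state=1)     # SRSWR

# cluster sampling: split the rows into 5 clusters and keep one cluster
iris["cluster"] = pd.qcut(iris["ObsNum"], q=5, labels=False)
clustered = iris[iris["cluster"] == 0]

# stratified sampling: the species (Class) are natural strata
stratified = iris.groupby("Class").sample(n=2, random_state=1)

print(srswor[["ObsNum", "PetalWidth", "Class"]])
```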
[IRIS data, observations 45 through 150, one row per observation: Observation #, SepalLength, SepalWidth, PetalLength, PetalWidth, Class.]
[IRIS data listed by class (Iris-setosa, then Iris-versicolor, then Iris-virginica), one row per observation: Observation #, SepalLength, SepalWidth, PetalLength, PetalWidth, Class.]
Class counts: 50 Iris-setosa, 50 Iris-versicolor, 50 Iris-virginica.
Problem 4.1
from the textbook
Problem 4.2
Problem 4.3
Consider the following situation:
The following attributes are stored in the data warehouse for each instance of equipment assignment:
(hours used and amount charged)
Problem 4.4
Refer to the problem in the text.
Part-a (the snowflake schema) is completed below.
Using this schema, starting with the base cuboid [student, course, semester, instructor], what specific OLAP operations
(e.g., roll up from semester to year) should one perform in order to list the following:
to list the average grade for all students in COMP 300
to list the average grade for all of my students in COMP 300 (I am instructor #007)
to list the average grade for all students taking courses in the CS department
to list the average grade for all math majors in 2013
to list the average grade for all students in 2013
to list the average grade of English courses for each student
to list the average grade for each student in the year 2012
to list the average grade for all students in the year 2012
Problem 4.5:
a.
b.
c.
d.
Consider the scenario in the textbook for this problem, and the given star schema as an example.
Problem 4.16:
Answer the questions in the text, but use as an example the 3-D data cube shown in Figure 4.3.
That is the same as the left-most cuboid in Figure 4.4, but without the dimension of Supplier.
Although the book states that Fig 4.3 is not a base cuboid (rather, they state it is a cuboid by Supplier), it displays the same data as the first base cuboid in Figure 4.4.
So for the purposes of this assignment, treat Fig 4.3 as a base cuboid.
Modify it as follows:
The company also does business in Los Angeles.
The company has recently expanded into wiring.
The company has decided to track its sales data by fifths of the year instead of by quarters.
(It's a strange company. They call them Q1, Q2, Q3, Q4, and Q5.)
This means that instead of four distinct values for each dimension in the base cuboid, there are now five.
A data cube, C, has n dimensions, and each dimension has exactly p distinct values in the base cuboid
Assume that there are no concept hierarchies associated with the dimensions.
For our example, n=3 and p=5
a.
What is the maximum number of cells possible in the base cuboid?
(To visualize, the cuboid is pictured in Figure 4.3.)
b.
What is the minimum number of cells possible in the base cuboid?
Give an example, listing those cells that constitute an example of a minimum number.
For instance, cells (1,1,1), (1,1,2), (1,1,3), (1,1,4), and (1,1,5). (This is an incorrect example.)
Then, give the values of those cells. For instance, (Q1, home entertainment, Vancouver) (Fig. 4.3).
c.
What is the maximum number of cells possible (including both base cells and aggregate cells) in the data cube, C?
Again, using the example of Figure 4.3, what is the maximum number of cells?
How many cells in the base cuboid?
How many cells in the 3-D cuboids?
How many cells in the 2-D cuboids?
How many cells in the 1-D cuboids?
How many cells in the Apex?
d.
What is the minimum number of cells possible in the data cube, C?
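For reference when working parts (a) and (c), the standard counts for a cube with n dimensions and p distinct values per dimension (no concept hierarchies) are:

\[ \text{max base-cuboid cells} = p^{n}, \qquad \text{max total cells (base + aggregate)} = (p+1)^{n} \]

With n = 3 and p = 5 these evaluate to 125 and 216.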
Problem 5.2:
Assuming each dimension has only one level, draw the complete lattice of the cube.
b.
c.
What is the total size of the computed cube if the cube is dense?
d.
State the order for computing the chunks in the cube that requires the least amount of space, and
compute the total amount of main memory space required for computing the 2-D planes.
The order of computation that requires the least amount of space is C-A-B.
Problem 5.6
Suppose that there are only 2 base cells in a 20-dimensional base cuboid:
{(a1, a2, a3, a4, a5, . . . , a19, a20), (a1, a2, a3, a4, b5, . . . , b19, b20)}
Compute the # of non-empty aggregate cells.
List the overlapped cells.
When computing a cube of high dimensionality, we encounter the inherent curse of dimensionality
problem: there exists a huge number of subsets of combinations of dimensions.
a.
Suppose that there are only two base cells, {(a1, a2, a3, . . . , a100), (a1, a2, b3, . . . , b100)}, in a
100-dimensional base cuboid. Compute the number of nonempty aggregate cells. Comment on the
storage space and time required to compute these cells.
Each base cell generates 2^100 - 1 aggregate cells. (We subtract 1 because, for example, (a1, a2, a3, . . . , a100) itself
is not an aggregate cell.) Thus, the two base cells generate 2(2^100 - 1) = 2^101 - 2 aggregate cells;
however, four of these cells are counted twice. These four cells are: (a1, a2, *, . . . , *), (a1, *, *, . . . , *), (*, a2, *, . . . , *),
and (*, *, . . . , *). Therefore, the total number of cells generated is 2^101 - 6.
NOTE: there are 2 elements in common, so you subtract 2^2 from the number of aggregate cells.
If there were 5 elements in common, you would subtract 2^5 from the number of aggregate cells.
Note that any cell that has a3, . . . , a100 or b3, . . . , b100 in it will NOT be a duplicate cell.
So the only possible duplicate cells are those with ONLY a1 and/or a2 (or all *'s).
b.
Suppose we are to compute an iceberg cube from the above. If the minimum support count in
the iceberg condition is two, how many aggregate cells will there be in the iceberg cube? Show
the cells.
There are 4: {(a1, a2, *, . . . , *), (a1, *, *, . . . , *), (*, a2, *, . . . , *), (*, *, *, . . . , *)}.
Note that this is 2^2. The exponent is the same as the number of common elements.
c.
Introducing iceberg cubes will lessen the burden of computing trivial aggregate cells in a data
cube. However, even with iceberg cubes, we could still end up having to compute a large number
of trivial uninteresting cells (i.e., with small counts). Suppose that a database has 20 tuples that
map to (or cover) the two following base cells in a 100-dimensional base cuboid, each with a cell
count of 10: {(a1, a2, a3, . . . , a100) : 10, (a1, a2, b3, . . . , b100) : 10}.
Let the minimum support be 10. How many distinct aggregate cells will there be like the following
{(a1, a2, a3, a4, . . . , a99, *) : 10, . . . , (a1, a2, *, a4, . . . , a99, a100) : 10, . . . , (a1, a2, a3, *, . . . , *) : 10}?
There will be 2^101 - 6, as shown above.
That's because the base cells already have a count of 10, so all of the aggregate cells based on them
will have a count of at least 10 also.
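The 2^101 - 6 count is easy to sanity-check by brute force in a lower dimension. This sketch uses two base cells in a 6-dimensional cuboid that share their first two values, where the same argument predicts 2^7 - 6 = 122 distinct non-empty aggregate cells:

```python
from itertools import combinations

def aggregate_cells(base_cell):
    """All aggregate cells of a base cell: replace any non-empty
    subset of positions with '*' (the base cell itself is excluded)."""
    n = len(base_cell)
    cells = set()
    for k in range(1, n + 1):
        for positions in combinations(range(n), k):
            cell = list(base_cell)
            for p in positions:
                cell[p] = "*"
            cells.add(tuple(cell))
    return cells

# two base cells in a 6-D cuboid sharing the first 2 dimension values
c1 = ("a1", "a2", "a3", "a4", "a5", "a6")
c2 = ("a1", "a2", "b3", "b4", "b5", "b6")

distinct = aggregate_cells(c1) | aggregate_cells(c2)
print(len(distinct))   # 2^7 - 6 = 122
```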
[Overlap-counting worksheet for a 5-D example with base cells c1, c2, and c3: for each of the 31 aggregate cells of c1 and of c2, the columns record which aggregate cells of c2/c3 coincide with it ("none", "c2:3", "c3:5", "c2:7, c3:7", ...), with example cells such as (a1,*,*,*,*), (d1,*,*,*,*), and (*,*,*,*,*), followed by tallies of c1 single overlaps, c1 double overlaps, additional c2 single overlaps, and the total single overlaps used to reach the distinct-cell count.]
Problem 5.1
Consider a base cuboid of 4 dimensions with 3 base cells:
base cell 1
(a1, a2, a3, a4)
base cell 2
(a1, a2, a3, b4)
base cell 3
(a1, c2, c3, b4)
NOTES:
1. Notice that there is one overlapped element across all 3 cells.
2. Notice that there are NO additional overlapped elements between cell 1 and cell 3.
This is different from our examples in class.
3. Notice that there is 1 additional (single overlapped) element between cell 2 and cell 3.
4. Notice that there are 2 additional (single overlapped) elements between cell 1 and cell 2.
A good way to proceed might be:
1. How many double overlapped cells are there?
2. How many single overlapped should be subtracted for overlaps between cell2 and cell3?
3. Now consider Cell 1 and Cell 2. If you were considering them separately, just those two cells,
how many overlapped cells would there be (they have 3 elements in common)?
So how many would you subtract?
But wait!!! You already subtracted some of those, because they are double overlaps.
So how many do you have left to subtract?
This problem is a little bit different from in-class examples, just to see if you really "get it".
Even if you generate all of the cells, and highlight them to show the overlaps, answer the
questions in the steps above, so that I know that you understand why.
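If you want to check your hand count for Problem 5.1, a brute-force enumeration works; this is only a verification aid, and the reasoning in the numbered steps above is still what the assignment asks for:

```python
from itertools import product

def aggregate_cells(base):
    """Aggregate cells: each position keeps its value or becomes '*';
    the all-kept combination (the base cell itself) is excluded."""
    cells = set(product(*[(v, "*") for v in base]))
    cells.discard(tuple(base))
    return cells

base_cells = [("a1", "a2", "a3", "a4"),
              ("a1", "a2", "a3", "b4"),
              ("a1", "c2", "c3", "b4")]

distinct = set().union(*(aggregate_cells(c) for c in base_cells))
print(len(distinct))   # distinct non-empty aggregate cells across all 3 base cells
```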
Problem 6.6 Find frequent itemsets, using both apriori and FP-tree
For apriori: show each C_k and L_k, as demonstrated in class
For FP: show each tree iteration
T100
T200
T300
T400
T500
{H, O, A, R, D, S, E}
{C, O, A, R, S, E}
{E, C, A, R, D, S}
{R, O, A, D, S}
{H, O, U, S, E}
min_sup = 60%
60% of 5 transactions = 3
Create the strong association rules that can be inferred from L_2.
Create the strong association rules for the set SOR.
To create association rules where min_sup = 60% and min_conf = 80%:
For each set, L, generate all non-empty sets. For each non-empty subset, s:
support_count is simply how often it appears in the list.
support is support_count over total # of transactions.
confidence = support_count(L) / support_count (s)
BTW, this is P(Y and K) / P(K) — it's conditional probability.
In the book's itemset notation the same quantity is written P(Y U K) / P(K), where the union means both itemsets appear together.
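A small self-contained sketch of the Apriori counting for these five transactions with min_sup = 3; it simply enumerates candidate itemsets and their support counts (it is not the full textbook algorithm with candidate pruning), so you can check each C_k and L_k by hand:

```python
from itertools import combinations

transactions = [
    {"H", "O", "A", "R", "D", "S", "E"},   # T100
    {"C", "O", "A", "R", "S", "E"},        # T200
    {"E", "C", "A", "R", "D", "S"},        # T300
    {"R", "O", "A", "D", "S"},             # T400
    {"H", "O", "U", "S", "E"},             # T500
]
min_sup = 3   # 60% of 5 transactions

items = sorted(set().union(*transactions))
for k in range(1, len(items) + 1):
    Lk = {}
    for cand in combinations(items, k):
        count = sum(1 for t in transactions if set(cand) <= t)
        if count >= min_sup:
            Lk[cand] = count
    if not Lk:          # no frequent k-itemsets, so no larger ones exist
        break
    print(f"L_{k}:", Lk)
```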
hamburgers
~hamburgers
Once your contingency table is complete, and you have also computed the above conditional probabilities,
it is probably easier to use the formulae that express each measure in terms of conditional probabilities
(the second form shown for each measure below):
all_conf = sup(Hamburgers U Hot Dogs) / max(sup(Hamburgers), sup(Hot Dogs))
all_conf = min[P(A|B), P(B|A)]
lift = P(Hamburgers U Hot Dogs) / (P(Hot Dogs) * P(Hamburgers))
lift = P(B|A) / P(B)
max_confidence = max( sup(ab)/sup(a), sup(ab)/sup(b) )
max_confidence = max[P(A|B), P(B|A)]
Kulczynski = sup(ab)/2 * ( 1/sup(a) + 1/sup(b) )
Kulczynski = 1/2 [P(A|B) + P(B|A)]
cosine = sup(ab) / sqrt(sup(a) * sup(b))
S' = {1,2,3,4,5,6}
S' = {1,2,3,4,5,6}
S' = {2,3,4,5,6,7,8,9,10,11}
S' = {1,2,3,4,5}
S' = {1,2,3,4,5}
S' = {1,2,3,4,5}
Using the above cases as examples, prove by counter-example, or demonstrate with an example,
whether the following rule constraints are antimonotonic or monotonic.
a.
V ∈ S
b.
S ⊆ V
c.
min(S) <= V
d.
max(S) <= V
e.
max(S) >= V
Part 2:
Consider the data for problem 8.7 on page 387.
Calculate (outside of RapidMiner) the Gain and the Gain Ratio for the attribute department.
Step 3:
Calculate Gain_Dept
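A sketch of the Gain / Gain Ratio arithmetic for a categorical attribute; the department names and class counts below are placeholders (read the actual counts from the problem 8.7 data table):

```python
import math

def entropy(counts):
    """Info(D) for a list of class counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

# Placeholder counts: {department: [class-1 count, class-2 count]}.
# Replace the names and numbers with the values from the problem 8.7 data.
dept_counts = {
    "dept_A": [8, 2],
    "dept_B": [7, 1],
    "dept_C": [4, 2],
    "dept_D": [2, 4],
}

n = sum(sum(c) for c in dept_counts.values())
class_totals = [sum(c[0] for c in dept_counts.values()),
                sum(c[1] for c in dept_counts.values())]

info_d = entropy(class_totals)                                          # Info(D)
info_dept = sum(sum(c) / n * entropy(c) for c in dept_counts.values())  # Info_dept(D)
split_info = entropy([sum(c) for c in dept_counts.values()])            # SplitInfo_dept(D)

gain = info_d - info_dept
gain_ratio = gain / split_info
print(gain, gain_ratio)
```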
Problem 8.12
Complete the following table, and then plot the ROC curve:
TPR = TP/P
FPR = FP/N
P = 5 and N = 5 (we know, as we are looking at training data)
Tuple #   Class   Prob.   TP   FP   TN   FN   TPR   FPR
1         P       0.91
2         N       0.83
3         P       0.72
4         N       0.66
5         N       0.60
6         N       0.55
7         P       0.53
8         P       0.52
9         N       0.45
10        P       0.37
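A short sketch that fills the table cumulatively (at each row, everything at or above that row's probability is predicted positive) and plots the ROC curve; plotting with matplotlib is my choice, not a course requirement:

```python
import matplotlib.pyplot as plt

# tuples already sorted by decreasing probability (table above)
classes = ["P", "N", "P", "N", "N", "N", "P", "P", "N", "P"]
P = classes.count("P")   # 5
N = classes.count("N")   # 5

tpr_list, fpr_list = [0.0], [0.0]
tp = fp = 0
for c in classes:        # lower the threshold one tuple at a time
    if c == "P":
        tp += 1
    else:
        fp += 1
    print(tp, fp, N - fp, P - tp, tp / P, fp / N)   # TP, FP, TN, FN, TPR, FPR
    tpr_list.append(tp / P)
    fpr_list.append(fp / N)

plt.plot(fpr_list, tpr_list, marker="o")
plt.xlabel("FPR")
plt.ylabel("TPR")
plt.title("ROC curve for Problem 8.12")
plt.show()
```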
You may use any programming tool you wish (or none at all), but your results
should be presented in a format similar to the tables below,
as demonstrated in class.
Initial input, weight, and bias values:
x1 = 1, x2 = 1, x3 = 0
w14 = 0.1921, w15 = -0.3059, w24 = 0.4, w25 = 0.1, w34 = -0.5079, w35 = 0.1941, w46 = -0.2608, w56 = -0.138
θ4 = -0.4079, θ5 = 0.1941, θ6 = 0.2181

Net input and output calculations:
Unit   Net Input   Output
4
5
6

Error at each unit:
Unit   Error
6
5
4

Weight and bias updates (after presenting the training tuple):
Item   Old_w     New weight
w14    0.1921
w15    -0.3059
w24    0.4
w25    0.1
w34    -0.5079
w35    0.1941
w46    -0.2608
w56    -0.138
θ4     -0.4079
θ5     0.1941
θ6     0.2181
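A sketch of one forward and backward pass with these numbers so you can check the table entries; it assumes the textbook's setup with sigmoid activations, and the target class label (1) and learning rate (0.9) are assumptions — use whatever values the assignment specifies:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# inputs and current weights/biases from the worksheet
x1, x2, x3 = 1, 1, 0
w14, w15 = 0.1921, -0.3059
w24, w25 = 0.4, 0.1
w34, w35 = -0.5079, 0.1941
w46, w56 = -0.2608, -0.138
t4, t5, t6 = -0.4079, 0.1941, 0.2181
target = 1      # assumed class label of the training tuple
lr = 0.9        # assumed learning rate

# forward pass
net4 = w14 * x1 + w24 * x2 + w34 * x3 + t4
net5 = w15 * x1 + w25 * x2 + w35 * x3 + t5
o4, o5 = sigmoid(net4), sigmoid(net5)
net6 = w46 * o4 + w56 * o5 + t6
o6 = sigmoid(net6)

# backward pass (errors)
err6 = o6 * (1 - o6) * (target - o6)
err5 = o5 * (1 - o5) * err6 * w56
err4 = o4 * (1 - o4) * err6 * w46

# weight and bias updates
w46 += lr * err6 * o4
w56 += lr * err6 * o5
w14 += lr * err4 * x1; w24 += lr * err4 * x2; w34 += lr * err4 * x3
w15 += lr * err5 * x1; w25 += lr * err5 * x2; w35 += lr * err5 * x3
t6 += lr * err6; t5 += lr * err5; t4 += lr * err4

print(o4, o5, o6, err4, err5, err6)
```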
HW Problems, Chapter 10
Homework Problems 10.1, 10.2, 10.6, 10.12, 10.16 (from the textbook, no changes)