HW1
Submission requirements:
Please submit your solutions to our class website.
Q1.Suppose that a data warehouse consists of four dimensions, date, spectator, location, and game, and two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.
(a) Draw a star schema diagram for the data warehouse.
(b) Starting with the base cuboid [date, spectator, location, game],what specific OLAP operations should one perform in order to list the total charge paid by student spectators in Los Angeles?
step 1. Roll-up on date from date_key to all
step 2. Roll-up on spectator from spectator_key to status
step 3. Roll-up on location from location_key to location_name
step 4. Roll-up on game from game_key to all
step 5. Dice with "status=student" and "location_name=Los Angeles"
© Bitmap indexing is a very useful optimization technique. Please present the pros and cons of using bitmap indexing in this given data warehouse.
优点
位图索引是一种高效的索引结构,在查询、过滤等方面上,由于进行的是位运算,所以比常规的查询方式快很多。例如在本仓库中,假设对于spectator
表的子列status
,我们有:
spectator_key | status | gender |
---|---|---|
0 | 学生 | 男 |
1 | 成人 | 女 |
2 | 学生 | 男 |
3 | 学生 | 女 |
4 | 老人 | 女 |
status
就可以建立以下位图索引:
status="学生" : 10110
status="成人" : 01000
status="老人" : 00001
gender
可以建立以下位图索引:
gender="男": 10100
gender="女": 01011
例如,我们想要查询学生,只需要用10110
去过滤原始数据就行。
我们想混合查询,比如同时查询status="学生"
和gender="男"
的数据,只需要进行并操作就行了:
10110 & 10100 = 10100
可以大大提高计算速度。
此外,位图索引可以在一定程度上绕开
原始数据,进一步提高处理速度。例如,我们想统计满足上面条件的人数,只需要:
ans=0
x=(10110&10100)
while x:
x&=(x-1)
ans+=1
缺点
位图索引比较适合枚举类型,也就是离散型变量,对于连续变量,位图索引并不适用,往往需要先做离散化。比如本仓库中,phone number
字段可能就不太适合(也许这个字段没有存在的必要?)
而当属性列非常多时,我们做位图索引的开销也比较大。
Q2.某电子邮件数据库中存储了大量的电子邮件。请设计数据仓库的结构,以便用户从多个维度进行查询和挖掘。
Q3. Suppose a hospital tested the age and body fat data for 18 random selected adults with the following result:
age | 23 | 23 | 27 | 27 | 39 | 41 | 47 | 49 | 50 | 52 | 54 | 54 | 56 | 57 | 58 | 58 | 60 | 61 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
%fat | 9.5 | 26.5 | 7.8 | 17.8 | 31.4 | 25.9 | 27.4 | 27.2 | 31.2 | 34.6 | 42.5 | 28.8 | 33.4 | 30.2 | 34.1 | 32.9 | 41.2 | 35.7 |
(a) Calculate the mean, median, and standard deviation of age and %fat.
age %fat
mean 46.444444 28.783333
std 13.218624 9.254395
median 51.0 30.7
(b) Draw the boxplots for age and %fat.
© Draw a scatter plot based on these two variables.
(d) Normalize age based on min-max normalization.
x=data["age"]
y=data['%fat']
X=(x-x.min())/(x.max()-x.min())
Y=(y-y.min())/(y.max()-y.min())
print(X,Y)
Result is:
0 0.000000
1 0.000000
2 0.105263
3 0.105263
4 0.421053
5 0.473684
6 0.631579
7 0.684211
8 0.710526
9 0.763158
10 0.815789
11 0.815789
12 0.868421
13 0.894737
14 0.921053
15 0.921053
16 0.973684
17 1.000000
(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables positively or negatively correlated?
print(np.corrcoef(x,y))
print("相关系数" ,stats.pearsonr(x,y)[0])
Result is
[[1. 0.8176188]
[0.8176188 1. ]]
相关系数 0.8176187964565874
I think they are positively correlated.
(f) Smooth the fat data by bin means, using a bin depth of 6.
def mean(x):
return round(sum(x)/len(x),2)
N_y=sorted(y)
bins=[[]]
for j in N_y:
bins[-1].append(j)
if len((v:=bins[-1]))==6:
v[:]=[mean(v)]*len(v)
bins.append([])
for i,j in enumerate(bins[:-1]):
print("bin %d is :"%(i+1),j)
bin 1 is : [19.12, 19.12, 19.12, 19.12, 19.12, 19.12]
bin 2 is : [30.32, 30.32, 30.32, 30.32, 30.32, 30.32]
bin 3 is : [36.92, 36.92, 36.92, 36.92, 36.92, 36.92]
(g) Smooth the fat data by bin boundaries, using a bin depth of 6.文章来源:https://www.toymoban.com/news/detail-635554.html
这里因为我们是对排好序的数据做处理,所以可以通过二分法进行优化,获取中间分界。文章来源地址https://www.toymoban.com/news/detail-635554.html
def close(x,a,b):
# 是否靠近下界
return (x-a)<=(b-x)
def boundary(x):
Min=x[0]
Max=x[-1]
l,r=0,len(x)-1
while l<=r:
mid=(r-l)//2+l
if close(x[mid],Min,Max):
if not close(x[mid+1],Min,Max):
l=mid
break
l=mid+1
else:
if close(x[mid-1],Min,Max):
l=mid
break
r=mid-1
return [[Min]*l+[Max]*(len(x)-l)]
N_y=sorted(y)
bins=[[]]
for j in N_y:
bins[-1].append(j)
if len((v:=bins[-1]))==6:
v[:]=boundary(v)
bins.append([])
for i,j in enumerate(bins[:-1]):
print("bin %d is :"%(i+1),j)
bin 1 is : [[7.8, 7.8, 27.2, 27.2, 27.2, 27.2]]
bin 2 is : [[27.4, 27.4, 32.9, 32.9, 32.9, 32.9]]
bin 3 is : [[33.4, 33.4, 33.4, 33.4, 42.5, 42.5]]
到了这里,关于国科大数据挖掘课程HW1的文章就介绍完了。如果您还想了解更多内容,请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章,希望大家以后多多支持TOY模板网!