国科大数据挖掘课程HW1-Toy模板网

这篇具有很好参考价值的文章主要介绍了国科大数据挖掘课程HW1。希望对大家有所帮助。如果存在错误或未考虑完全的地方，请大家不吝赐教，您也可以点击"举报违法"按钮提交疑问。

HW1

Submission requirements:

Please submit your solutions to our class website.

Q1.Suppose that a data warehouse consists of four dimensions, date, spectator, location, and game, and two measures, count and charge, where charge is the fare that a spectator pays when watching a game on a given date. Spectators may be students, adults, or seniors, with each category having its own charge rate.

(a) Draw a star schema diagram for the data warehouse.

中科宏一数据挖掘服务,Python,数据挖掘,数据仓库,数据库

(b) Starting with the base cuboid [date, spectator, location, game]，what specific OLAP operations should one perform in order to list the total charge paid by student spectators in Los Angeles?

step 1. Roll-up on date from date_key to all
step 2. Roll-up on spectator from spectator_key to status
step 3. Roll-up on location from location_key to location_name
step 4. Roll-up on game from game_key to all

step 5. Dice with "status=student" and "location_name=Los Angeles"

优点

位图索引是一种高效的索引结构，在查询、过滤等方面上，由于进行的是位运算，所以比常规的查询方式快很多。例如在本仓库中，假设对于spectator表的子列status，我们有：

spectator_key	status	gender
0	学生	男
1	成人	女
2	学生	男
3	学生	女
4	老人	女

status就可以建立以下位图索引：

status="学生" : 10110
status="成人" : 01000
status="老人" : 00001

gender可以建立以下位图索引：

gender="男": 10100
gender="女": 01011

例如，我们想要查询学生，只需要用10110去过滤原始数据就行。

我们想混合查询，比如同时查询status="学生"和gender="男"的数据，只需要进行并操作就行了：

10110 & 10100 = 10100

可以大大提高计算速度。

此外，位图索引可以在一定程度上绕开原始数据，进一步提高处理速度。例如，我们想统计满足上面条件的人数，只需要:

ans=0
x=(10110&10100)
while x:
	x&=(x-1)
	ans+=1

缺点

位图索引比较适合枚举类型，也就是离散型变量，对于连续变量，位图索引并不适用，往往需要先做离散化。比如本仓库中，phone number字段可能就不太适合(也许这个字段没有存在的必要？)

而当属性列非常多时，我们做位图索引的开销也比较大。

Q2．某电子邮件数据库中存储了大量的电子邮件。请设计数据仓库的结构，以便用户从多个维度进行查询和挖掘。

中科宏一数据挖掘服务,Python,数据挖掘,数据仓库,数据库

Q3. Suppose a hospital tested the age and body fat data for 18 random selected adults with the following result:

age	23	23	27	27	39	41	47	49	50	52	54	54	56	57	58	58	60	61
%fat	9.5	26.5	7.8	17.8	31.4	25.9	27.4	27.2	31.2	34.6	42.5	28.8	33.4	30.2	34.1	32.9	41.2	35.7

(a) Calculate the mean, median, and standard deviation of age and %fat.

             age       %fat
mean   46.444444  28.783333
std    13.218624   9.254395
median      51.0       30.7

(b) Draw the boxplots for age and %fat.

中科宏一数据挖掘服务,Python,数据挖掘,数据仓库,数据库

(d) Normalize age based on min-max normalization.

x=data["age"]
y=data['%fat']
X=(x-x.min())/(x.max()-x.min())
Y=(y-y.min())/(y.max()-y.min())
print(X,Y)

Result is:

0     0.000000
1     0.000000
2     0.105263
3     0.105263
4     0.421053
5     0.473684
6     0.631579
7     0.684211
8     0.710526
9     0.763158
10    0.815789
11    0.815789
12    0.868421
13    0.894737
14    0.921053
15    0.921053
16    0.973684
17    1.000000

(e) Calculate the correlation coefficient (Pearson’s product moment coefficient). Are these two variables positively or negatively correlated?

print(np.corrcoef(x,y))
print("相关系数" ,stats.pearsonr(x,y)[0])

Result is

[[1.        0.8176188]
 [0.8176188 1.       ]]
相关系数 0.8176187964565874

I think they are positively correlated.

(f) Smooth the fat data by bin means, using a bin depth of 6.

def mean(x):
    return round(sum(x)/len(x),2)

N_y=sorted(y)
bins=[[]]
for j in N_y:
    bins[-1].append(j)
    if len((v:=bins[-1]))==6:
        v[:]=[mean(v)]*len(v)
        bins.append([])
for i,j  in enumerate(bins[:-1]):
    print("bin %d is :"%(i+1),j)

bin 1 is : [19.12, 19.12, 19.12, 19.12, 19.12, 19.12]
bin 2 is : [30.32, 30.32, 30.32, 30.32, 30.32, 30.32]
bin 3 is : [36.92, 36.92, 36.92, 36.92, 36.92, 36.92]

(g) Smooth the fat data by bin boundaries, using a bin depth of 6.

这里因为我们是对排好序的数据做处理，所以可以通过二分法进行优化，获取中间分界。文章来源地址https://www.toymoban.com/news/detail-635554.html

def close(x,a,b):
    # 是否靠近下界
    return (x-a)<=(b-x)

def boundary(x):
    Min=x[0]
    Max=x[-1]

    l,r=0,len(x)-1
    while l<=r:
        mid=(r-l)//2+l
        if close(x[mid],Min,Max):
            if not close(x[mid+1],Min,Max):
                l=mid
                break
            l=mid+1
        else:
            if close(x[mid-1],Min,Max):
                l=mid
                break
            r=mid-1
    return [[Min]*l+[Max]*(len(x)-l)]

N_y=sorted(y)
bins=[[]]
for j in N_y:
    bins[-1].append(j)
    if len((v:=bins[-1]))==6:
        v[:]=boundary(v)
        bins.append([])
for i,j  in enumerate(bins[:-1]):
    print("bin %d is :"%(i+1),j)

bin 1 is : [[7.8, 7.8, 27.2, 27.2, 27.2, 27.2]]
bin 2 is : [[27.4, 27.4, 32.9, 32.9, 32.9, 32.9]]
bin 3 is : [[33.4, 33.4, 33.4, 33.4, 42.5, 42.5]]

到了这里，关于国科大数据挖掘课程HW1的文章就介绍完了。如果您还想了解更多内容，请在右上角搜索TOY模板网以前的文章或继续浏览下面的相关文章，希望大家以后多多支持TOY模板网！

国科大数据挖掘课程HW1

HW1

Submission requirements:

Please submit your solutions to our class website.

觉得文章有用就打赏一下文章作者

支付宝扫一扫打赏

微信扫一扫打赏

支付宝扫一扫领取红包，优惠每天领

二维码1

二维码2