Quiz 1
Answer Problems 1-2 based on the following training set, where $A$, $B$, $C$ are the describing attributes, and $D$ is the class attribute:
| $A$ | $B$ | $C$ | $D$ |
| --- | --- | --- | --- |
| 1 | 1 | 1 | y |
| 1 | 0 | 1 | y |
| 0 | 1 | 1 | y |
| 1 | 1 | 0 | y |
| 0 | 1 | 1 | n |
| 1 | 1 | 1 | n |
| 0 | 0 | 0 | n |
| 0 | 1 | 0 | n |
Problem 1 (20%). Describe an (arbitrary) decision tree that correctly classifies 6 of the 8 records in the training set. Furthermore, based on your decision tree, what is the predicted class for a record with $A=1$, $B=0$, $C=0$?
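No model answer is given here in the source, but one tree that works is a single split on $A$: predict y when $A=1$ and n when $A=0$, which misclassifies only rows 3 and 6 of the table. The Python sketch below (our own addition, with our own names) checks this tree against the table and prints its prediction for $A=1$, $B=0$, $C=0$:

```python
# A hedged sketch (not part of the original quiz answer): verify that the
# one-split tree "A = 1 -> y, A = 0 -> n" classifies 6 of the 8 records
# correctly, and predict the class of (A, B, C) = (1, 0, 0).
training_set = [
    ((1, 1, 1), "y"), ((1, 0, 1), "y"), ((0, 1, 1), "y"), ((1, 1, 0), "y"),
    ((0, 1, 1), "n"), ((1, 1, 1), "n"), ((0, 0, 0), "n"), ((0, 1, 0), "n"),
]

def tree(record):
    a, b, c = record
    return "y" if a == 1 else "n"   # single split on attribute A

correct = sum(tree(x) == d for x, d in training_set)
print(correct, "of", len(training_set), "classified correctly")  # 6 of 8
print("prediction for (1, 0, 0):", tree((1, 0, 0)))              # y
```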
Problem 2 (40%). Suppose that we apply Bayesian classification with the following conditional independence assumption: conditioned on a value of $C$ and a value of $D$, attributes $A$ and $B$ are independent. What is the predicted class for a record with $A=1$, $B=1$, $C=1$? You must show the details of your reasoning.
Answer: By Bayes' theorem,

$$
\begin{aligned}
\operatorname{Pr}[D \mid A, B, C] &= \frac{\operatorname{Pr}[A, B, C \mid D] \cdot \operatorname{Pr}[D]}{\operatorname{Pr}[A, B, C]} \\
&= \frac{\operatorname{Pr}[A, B \mid C, D] \cdot \operatorname{Pr}[C \mid D] \cdot \operatorname{Pr}[D]}{\operatorname{Pr}[A, B, C]} \\
&= \frac{\operatorname{Pr}[A \mid C, D] \cdot \operatorname{Pr}[B \mid C, D] \cdot \operatorname{Pr}[C \mid D] \cdot \operatorname{Pr}[D]}{\operatorname{Pr}[A, B, C]}
\end{aligned}
$$

where the last step applies the assumption that $A$ and $B$ are independent conditioned on $C$ and $D$. Estimating each factor from the training set:

$$
\begin{aligned}
& \operatorname{Pr}[A=1, B=1, C=1] \cdot \operatorname{Pr}[D=\mathrm{y} \mid A=1, B=1, C=1] \\
&\quad = \operatorname{Pr}[A=1 \mid C=1, D=\mathrm{y}] \cdot \operatorname{Pr}[B=1 \mid C=1, D=\mathrm{y}] \cdot \operatorname{Pr}[C=1 \mid D=\mathrm{y}] \cdot \operatorname{Pr}[D=\mathrm{y}] \\
&\quad = \frac{2}{3} \cdot \frac{2}{3} \cdot \frac{3}{4} \cdot \frac{1}{2} = \frac{1}{6} \\
& \operatorname{Pr}[A=1, B=1, C=1] \cdot \operatorname{Pr}[D=\mathrm{n} \mid A=1, B=1, C=1] \\
&\quad = \operatorname{Pr}[A=1 \mid C=1, D=\mathrm{n}] \cdot \operatorname{Pr}[B=1 \mid C=1, D=\mathrm{n}] \cdot \operatorname{Pr}[C=1 \mid D=\mathrm{n}] \cdot \operatorname{Pr}[D=\mathrm{n}] \\
&\quad = \frac{1}{2} \cdot 1 \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}
\end{aligned}
$$

Since $\frac{1}{6} > \frac{1}{8}$, the predicted class for a record with $A=1$, $B=1$, $C=1$ is $D=\mathrm{y}$.
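As a sanity check (our own addition, not part of the quiz answer), the sketch below re-derives the two scores directly from the eight training records; all names are ours:

```python
# Hedged sketch: recompute the class scores for (A, B, C) = (1, 1, 1) under
# the assumption that A and B are independent given C and D.
from fractions import Fraction

records = [
    (1, 1, 1, "y"), (1, 0, 1, "y"), (0, 1, 1, "y"), (1, 1, 0, "y"),
    (0, 1, 1, "n"), (1, 1, 1, "n"), (0, 0, 0, "n"), (0, 1, 0, "n"),
]

def pr(pred, cond=lambda r: True):
    """Empirical Pr[pred | cond] over the training records."""
    base = [r for r in records if cond(r)]
    return Fraction(sum(pred(r) for r in base), len(base))

for d in ("y", "n"):
    score = (pr(lambda r: r[0] == 1, lambda r: r[2] == 1 and r[3] == d)    # Pr[A=1 | C=1, D=d]
             * pr(lambda r: r[1] == 1, lambda r: r[2] == 1 and r[3] == d)  # Pr[B=1 | C=1, D=d]
             * pr(lambda r: r[2] == 1, lambda r: r[3] == d)                # Pr[C=1 | D=d]
             * pr(lambda r: r[3] == d))                                    # Pr[D=d]
    print(d, score)   # y 1/6, n 1/8 -> predict y
```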
Problem 3 (40%). The following figure shows a training set of 5 points. Use the Perceptron algorithm to find a plane that (i) passes through the origin, and (ii) separates the black points from the white ones. Recall that this algorithm starts with a vector $\vec{c}=\vec{0}$ and iteratively adjusts it using a point from the training set. You need to show the value of $\vec{c}$ after every adjustment.
Answer:

| round | $\vec{c}$ | $\vec{p}$ |
| --- | --- | --- |
| 1 | $(0,0)$ | $A(0,2)$ |
| 2 | $(0,2)$ | $C(2,0)$ |
| 3 | $(2,2)$ | |
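The figure with the 5 points is not reproduced here, so the sketch below (our own addition) only replays the two adjustments visible in the trace; that $A$ and $C$ are black, with black labeled $+1$, is an assumption implied by the signs of the updates:

```python
import numpy as np

# Hedged sketch of the Perceptron update rule. Only A(0,2) and C(2,0) are
# known from the trace above; both are assumed to be black (label +1). The
# full point set comes from the quiz figure, which is not reproduced here.
points = [(np.array([0.0, 2.0]), +1),   # A, assumed black
          (np.array([2.0, 0.0]), +1)]   # C, assumed black

c = np.zeros(2)                          # start with the zero vector
for p, label in points:
    if label * np.dot(c, p) <= 0:        # p violates the plane c . x = 0
        c = c + label * p                # adjust c using p
        print("adjusted c to", c)        # (0, 2), then (2, 2)
```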
Quiz 2
Problem 1 (30%). The figure below shows the boundary lines of 5 half-planes. Find the point with the smallest $y$-coordinate in the intersection of these half-planes using the linear programming algorithm that we discussed in class. Assume the algorithm (randomly) permutes the boundary lines into $\ell_1, \ell_2, \ldots, \ell_5$ and processes them in that order. Starting from the second round, give the point maintained by the algorithm at the end of each round.
Answer: Let $H_1, \ldots, H_5$ be the half-planes whose boundary lines are $\ell_1, \ldots, \ell_5$, respectively. Let $p$ be the point maintained by the algorithm. At the end of the second round, $p$ is the intersection $A$ of $\ell_1$ and $\ell_2$. At Round 3, because $p=A$ does not fall in $H_3$, the algorithm computes a new $p$ as the lowest point on $\ell_3$ that satisfies all of $H_1, \ldots, H_3$. As a result, $p$ is set to $B$. At Round 4, because $p=B$ does not fall in $H_4$, the algorithm computes a new $p$ as the lowest point on $\ell_4$ that satisfies all of $H_1, \ldots, H_4$. As a result, $p$ is set to $C$. Finally, the last round processes $H_5$. As $p=C$ falls in $H_5$, $p$ does not change, and the algorithm finishes with $C$ as the final answer.
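The figure's actual boundary lines are not reproduced here. As a hedged illustration of the step taken at Rounds 3 and 4, the sketch below (our own, with our own encoding of half-planes and made-up example constraints) solves the 1D subproblem of finding the lowest point on a violated line, assuming no boundary line is vertical and ignoring ties/unbounded cases:

```python
import math

def lowest_on_line(line, halfplanes):
    """Lowest point (minimum y) on the boundary of `line` satisfying every
    half-plane in `halfplanes`. A half-plane (a, b, c) means a*x + b*y <= c;
    its boundary is a*x + b*y = c, assumed non-vertical (b != 0)."""
    ai, bi, ci = line
    lo, hi = -math.inf, math.inf              # feasible range of t, with x = t
    for aj, bj, cj in halfplanes:
        # substitute y = (ci - ai*t) / bi into aj*x + bj*y <= cj
        alpha = aj - bj * ai / bi
        beta = cj - bj * ci / bi
        if alpha > 0:
            hi = min(hi, beta / alpha)
        elif alpha < 0:
            lo = max(lo, beta / alpha)
        elif beta < 0:
            return None                       # never satisfiable on this line
    if lo > hi:
        return None                           # 1D subproblem infeasible
    slope = -ai / bi                          # dy/dt along the boundary line
    t = lo if slope > 0 else hi               # move toward smaller y
    return (t, (ci - ai * t) / bi)

# Example round: the current point violates x + y >= 1 (written -x - y <= -1),
# so re-optimize on its boundary subject to y >= 0 and x <= 3:
print(lowest_on_line((-1, -1, -1), [(0, -1, 0), (1, 0, 3)]))   # (1.0, 0.0)
```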
Problem 2 (30%). Consider a set $P$ of red points $A(2,1)$, $B(2,-2)$ and blue points $C(-2,1)$, $D(-2,-3)$. Give an instance of quadratic programming for finding a separation line with the maximum margin.
Answer: Minimize $w_1^2 + w_2^2$ subject to the following constraints:
$$
\begin{aligned}
2w_1 + w_2 &\geq 1 \\
2w_1 - 2w_2 &\geq 1 \\
-2w_1 + w_2 &\leq -1 \\
-2w_1 - 3w_2 &\leq -1
\end{aligned}
$$
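As a numerical sanity check (our own addition, not part of the quiz answer), the QP above can be handed to a general-purpose solver; a minimal sketch with scipy.optimize.minimize (SLSQP), assuming scipy is available:

```python
import numpy as np
from scipy.optimize import minimize

# Hedged sketch (our own check): solve the QP above numerically.
# SLSQP expects inequality constraints in the form g(w) >= 0, so the two
# "<= -1" constraints are rewritten as ">= 1" constraints.
cons = [
    {"type": "ineq", "fun": lambda w: 2 * w[0] + w[1] - 1},      # A(2, 1)
    {"type": "ineq", "fun": lambda w: 2 * w[0] - 2 * w[1] - 1},  # B(2, -2)
    {"type": "ineq", "fun": lambda w: 2 * w[0] - w[1] - 1},      # C(-2, 1)
    {"type": "ineq", "fun": lambda w: 2 * w[0] + 3 * w[1] - 1},  # D(-2, -3)
]
res = minimize(lambda w: w[0] ** 2 + w[1] ** 2, x0=np.array([1.0, 1.0]),
               constraints=cons, method="SLSQP")
print(res.x)   # approximately (0.5, 0): the line w . (x, y) = 0 is the y-axis
```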
Problem 3 (40%). Let $P$ be a set of points as shown in the figure below. Assume $k=3$, and that the distance metric is Euclidean distance.
Run the $k$-center algorithm we discussed in class on $P$. If the first center is (randomly) chosen as point $a$, what are the second and third centers?
Answer: The second center is $h$, and the third is $d$.
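Since the figure with $P$ is not reproduced, the sketch below (our own addition) only illustrates the greedy farthest-point rule the algorithm uses; the coordinates dictionary is a made-up placeholder, to be replaced with the points read off the figure:

```python
import math

# Hedged sketch of the greedy (farthest-point) k-center algorithm. The real
# coordinates of a, ..., q come from the quiz figure (not reproduced); this
# placeholder dictionary just illustrates the input format.
points = {"a": (0.0, 0.0), "h": (9.0, 9.0), "d": (8.0, 0.0)}  # placeholder

def k_center(points, k, first):
    centers = [first]
    while len(centers) < k:
        # next center: the point whose distance to its nearest center is largest
        def dist_to_centers(name):
            px, py = points[name]
            return min(math.hypot(px - points[c][0], py - points[c][1])
                       for c in centers)
        centers.append(max(points, key=dist_to_centers))
    return centers

print(k_center(points, k=3, first="a"))   # ['a', 'h', 'd'] on this placeholder
```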
Quiz 3
Problem 1 (30%). Consider the dataset as shown in the figure below. What is the covariance matrix of the dataset?
Answer: Let $A=\left[\begin{array}{ll}\sigma_{xx} & \sigma_{xy} \\ \sigma_{yx} & \sigma_{yy}\end{array}\right]$ be the covariance matrix, where $\sigma_{xx}$ ($\sigma_{yy}$) is the variance along the $x$- ($y$-) dimension, and $\sigma_{xy}$ ($=\sigma_{yx}$) is the covariance of the $x$- and $y$-dimensions. Since the means along both the $x$- and $y$-dimensions are 0, we have that:
$$
\begin{aligned}
\sigma_{xx} &= \frac{1}{4}\left((-3)^2 + (-2)^2 + 1^2 + 4^2\right) = 30/4 = 7.5 \\
\sigma_{yy} &= \frac{1}{4}\left(4^2 + 1^2 + (-2)^2 + (-3)^2\right) = 30/4 = 7.5 \\
\sigma_{xy} &= \frac{1}{4}\left((-3) \times 4 + (-2) \times 1 + 1 \times (-2) + 4 \times (-3)\right) = -28/4 = -7
\end{aligned}
$$
Therefore, $A=\left[\begin{array}{rr}7.5 & -7 \\ -7 & 7.5\end{array}\right]$.
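Reading the coordinates off the sums above, the four points are $(-3,4)$, $(-2,1)$, $(1,-2)$, $(4,-3)$. A quick numpy check (our own addition) confirms the matrix; `bias=True` divides by $N=4$, matching the definition used above:

```python
import numpy as np

# Hedged check (our addition): the four points implied by the sums above.
x = np.array([-3.0, -2.0, 1.0, 4.0])
y = np.array([4.0, 1.0, -2.0, -3.0])

# bias=True divides by N (here 4) rather than N - 1.
print(np.cov(x, y, bias=True))   # [[ 7.5 -7. ] [-7.   7.5]]
```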
Problem 2 (30%). Use PCA to find the line passing through the origin on which the projections of the points in Problem 1 have the greatest variance.
Answer: Let $\lambda$ be an eigenvalue of $A$, which implies that the determinant of $\left[\begin{array}{cc}7.5-\lambda & -7 \\ -7 & 7.5-\lambda\end{array}\right]$ is 0. By expanding the determinant, we get the following equation:
$$
(7.5-\lambda)^2 - 49 = 0 .
$$
It follows that $\lambda_1 = 14.5$ and $\lambda_2 = 0.5$ are the eigenvalues of $A$, where $\lambda_1$ is the larger one.
Let $\vec{v}=\left[\begin{array}{l}x \\ y\end{array}\right]$ be an eigenvector corresponding to $\lambda_1$, which satisfies

$$
\left[\begin{array}{rr}7.5 & -7 \\ -7 & 7.5\end{array}\right]\left[\begin{array}{l}x \\ y\end{array}\right]=\left[\begin{array}{l}14.5x \\ 14.5y\end{array}\right]
$$
Note that the above equation is satisfied by any pair of $x$ and $y$ satisfying $x + y = 0$. As the line chosen by PCA has the same direction as $\vec{v}$, the line is $x + y = 0$.
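As a quick numerical check (our own addition), numpy's symmetric eigensolver recovers the same eigenvalues and top eigenvector:

```python
import numpy as np

# Hedged check (our addition): eigendecomposition of the covariance matrix.
A = np.array([[7.5, -7.0],
              [-7.0, 7.5]])
vals, vecs = np.linalg.eigh(A)   # eigh returns eigenvalues in ascending order
print(vals)                      # [ 0.5 14.5]
print(vecs[:, -1])               # top eigenvector, proportional to (1, -1) up to sign
```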
Problem 3 (40%). Run DBSCAN on the set of points shown in the figure below with $\epsilon = 1$ and minpts $= 4$. What are the core points and the clusters?
Answer: The core points are $b$, $e$, $g$, $j$, $k$ and $o$. There are three clusters:
- Cluster 1: $a, b, c, d, e, f$
- Cluster 2: $f, g, h, i, j, k, l$
- Cluster 3: $m, n, o, p, q$
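The figure with the points is not reproduced, so the sketch below (our own addition) only shows how the core points would be identified; the coordinates dictionary is a made-up placeholder, to be replaced with the points read off the figure:

```python
import math

# Hedged sketch: identify DBSCAN core points with eps = 1 and minpts = 4.
# The real coordinates of a, ..., q come from the quiz figure (not
# reproduced); this placeholder dictionary just illustrates the input format.
points = {"a": (0.0, 0.0), "b": (0.5, 0.0), "c": (1.0, 0.0), "d": (0.5, 0.5)}

def core_points(points, eps=1.0, minpts=4):
    core = []
    for name, (px, py) in points.items():
        # a point is core if at least minpts points (itself included) lie
        # within distance eps of it
        neighbours = sum(math.hypot(px - qx, py - qy) <= eps
                         for qx, qy in points.values())
        if neighbours >= minpts:
            core.append(name)
    return core

print(core_points(points))   # ['a', 'b', 'c', 'd'] on this placeholder data
```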