Reading Notes on Mining of Massive Datasets, 3rd Edition (English): Chapter 11 Dimensionality Reduction


Source: the publicly available English book and slides for Mining of Massive Datasets, 3rd Edition.

Chapter 11 Dimensionality Reduction

Let M be a square matrix. Let λ be a constant and e a nonzero column vector with the same number of rows as M. Then λ is an eigenvalue of M and e is the corresponding eigenvector of M if $Me = \lambda e$.

Start with any unit vector v of the appropriate length and compute $M^i v$ iteratively until it converges. When M is a stochastic matrix, the limiting vector is the principal eigenvector (the eigenvector with the largest eigenvalue), and its corresponding eigenvalue is 1. This method for finding the principal eigenvector, called power iteration, works quite generally, although if the principal eigenvalue (the eigenvalue associated with the principal eigenvector) is not 1, then as i grows, the ratio of $M^{i+1}v$ to $M^i v$ approaches the principal eigenvalue while $M^i v$ approaches a vector (probably not a unit vector) with the same direction as the principal eigenvector.
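A minimal NumPy sketch of power iteration, assuming a symmetric example matrix; the starting vector, tolerance, and iteration cap are illustrative choices, not from the book. The iterate is re-normalized each step so it stays a unit vector.

```python
import numpy as np

def power_iteration(M, tol=1e-10, max_iter=10_000):
    """Approximate the principal eigenpair of M by repeated multiplication."""
    n = M.shape[0]
    v = np.ones(n) / np.sqrt(n)          # any starting unit vector
    for _ in range(max_iter):
        Mv = M @ v
        v_new = Mv / np.linalg.norm(Mv)  # re-normalize to keep a unit vector
        if np.linalg.norm(v_new - v) < tol:
            v = v_new
            break
        v = v_new
    lam = v @ M @ v                      # Rayleigh quotient: the eigenvalue once v has converged
    return lam, v

M = np.array([[3.0, 2.0],
              [2.0, 6.0]])
lam1, x1 = power_iteration(M)
print(lam1, x1)                          # ~7.0 and the corresponding unit eigenvector
```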

To find the second eigenpair we create a new matrix $M^* = M - \lambda_1 x x^T$. Then, use power iteration on $M^*$ to compute its largest eigenpair; the resulting $x^*$ and $\lambda^*$ are the eigenvector and eigenvalue of M with the second-largest eigenvalue. Intuitively, what we have done is eliminate the influence of the first eigenvector by setting its associated eigenvalue to zero (a small code sketch of this deflation step follows the two observations below). The formal justification is the following two observations. If $M^* = M - \lambda x x^T$, where x and λ are the eigenpair with the largest eigenvalue, then:

  1. x is also an eigenvector of $M^*$, and its corresponding eigenvalue is 0. In proof, observe that

    $M^*x = (M - \lambda xx^T)x = Mx - \lambda xx^Tx = Mx - \lambda x = 0$

    At the next-to-last step we use the fact that $x^Tx = 1$ because x is a unit vector.

  2. Conversely, if v and $\lambda_v$ are an eigenpair of a symmetric matrix M other than the first eigenpair (x, λ), then they are also an eigenpair of $M^*$.
    Proof:

    $M^*v = (M^*)^Tv = (M - \lambda xx^T)^Tv = M^Tv - \lambda x(x^Tv) = M^Tv = \lambda_v v$

    This sequence of equalities needs the following justifications:
    (a) If M is symmetric, then $M = M^T$.
    (b) The eigenvectors of a symmetric matrix are orthogonal. That is, the dot product of any two distinct eigenvectors of a matrix is 0. We do not prove this statement here.
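Continuing the power-iteration sketch above, the deflation step described here is a few lines of NumPy. This reuses the `power_iteration` helper and the symmetric matrix `M` defined in that sketch.

```python
# Second eigenpair of the symmetric matrix M via deflation.
lam1, x1 = power_iteration(M)
M_star = M - lam1 * np.outer(x1, x1)   # M* = M - λ1 x x^T
lam2, x2 = power_iteration(M_star)     # principal eigenpair of M* ...
print(lam2, x2)                        # ... is the second eigenpair of M (~2.0 here)
```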

Principal-component analysis, or PCA, is a technique for taking a dataset consisting of a set of tuples representing points in a high-dimensional space and finding the directions along which the tuples line up best. The idea is to treat the set of tuples as a matrix M and find the eigenvectors for $MM^T$ or $M^TM$. The matrix of these eigenvectors can be thought of as a rigid rotation in a high-dimensional space. When you apply this transformation to the original data, the axis corresponding to the principal eigenvector is the one along which the points are most "spread out." More precisely, this axis is the one along which the variance of the data is maximized. Put another way, the points can best be viewed as lying along this axis, with small deviations from this axis. Likewise, the axis corresponding to the second eigenvector (the eigenvector corresponding to the second-largest eigenvalue) is the axis along which the variance of distances from the first axis is greatest, and so on.
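A small illustrative sketch of this idea in NumPy (the toy points are mine): form the matrix M of points, take the eigenvectors of $M^TM$ sorted by decreasing eigenvalue, and multiply M by that eigenvector matrix to rotate the points onto the principal axes.

```python
import numpy as np

# Rows are points, columns are dimensions.
M = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])

# Eigenpairs of the symmetric matrix M^T M; eigh returns eigenvalues in
# ascending order, so reverse to put the principal eigenvector first.
eigvals, eigvecs = np.linalg.eigh(M.T @ M)
order = np.argsort(eigvals)[::-1]
E = eigvecs[:, order]      # columns are eigenvectors, principal first

transformed = M @ E        # coordinates of each point in the rotated axes
print(transformed)         # first column: positions along the axis of greatest spread
```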

Any matrix of orthonormal vectors (unit vectors that are orthogonal to one another) represents a rotation and/or reflection of the axes of a Euclidean space.

We conclude that the eigenvalues of $MM^T$ are the eigenvalues of $M^TM$ plus additional 0's. If the dimension of $MM^T$ were less than the dimension of $M^TM$, then the opposite would be true; the eigenvalues of $M^TM$ would be those of $MM^T$ plus additional 0's.
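A quick numeric check of this claim, assuming NumPy (here M is a hypothetical 3 × 2 matrix, so $MM^T$ is 3 × 3 and $M^TM$ is 2 × 2):

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

print(np.round(np.linalg.eigvalsh(M @ M.T), 6))  # three eigenvalues of MM^T
print(np.round(np.linalg.eigvalsh(M.T @ M), 6))  # two eigenvalues of M^TM
# The first list is the second list plus one extra eigenvalue that is (numerically) 0.
```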

Let M be an $m \times n$ matrix, and let the rank of M be r. Recall that the rank of a matrix is the largest number of rows (or equivalently columns) we can choose for which no nonzero linear combination of the rows is the all-zero vector 0 (we say a set of such rows or columns is independent). Then we can find matrices U, Σ, and V as shown in Fig. 11.5 with the following properties (a short verification sketch appears below the list):

  1. U is an $m \times r$ column-orthonormal matrix; that is, each of its columns is a unit vector and the dot product of any two columns is 0.
  2. V is an $n \times r$ column-orthonormal matrix. Note that we always use V in its transposed form, so it is the rows of $V^T$ that are orthonormal.
  3. Σ is a diagonal matrix; that is, all elements not on the main diagonal are 0. The elements of Σ are called the singular values of M.

[Fig. 11.5: the form of the singular-value decomposition of M into U, Σ, and $V^T$ (image not reproduced)]
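A short sketch checking these properties with NumPy's SVD. The rating-style matrix below is illustrative; note that `full_matrices=False` returns min(m, n) columns rather than exactly r, with zero singular values making up the difference.

```python
import numpy as np

M = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [3.0, 3.0, 3.0, 0.0, 0.0],
              [4.0, 4.0, 4.0, 0.0, 0.0],
              [5.0, 5.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 4.0, 4.0],
              [0.0, 0.0, 0.0, 5.0, 5.0],
              [0.0, 0.0, 0.0, 2.0, 2.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)    # s holds the singular values
print(np.allclose(M, U @ np.diag(s) @ Vt))          # M = U Σ V^T
print(np.allclose(U.T @ U, np.eye(U.shape[1])))     # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(Vt.shape[0])))  # rows of V^T are orthonormal
print(np.round(s, 3))                               # only two are nonzero: rank(M) = 2
```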

Suppose we want to represent a very large matrix M by its SVD components U, Σ, and V, but these matrices are also too large to store conveniently. The best way to reduce the dimensionality of the three matrices is to set the smallest of the singular values to zero. If we set the s smallest singular values to 0, then we can also eliminate the corresponding s columns of U and V.

How Many Singular Values Should We Retain?

A useful rule of thumb is to retain enough singular values to make up 90% of the energy in Σ. That is, the sum of the squares of the retained singular values should be at least 90% of the sum of the squares of all the singular values.

The choice of the lowest singular values to drop when we reduce the number of dimensions can be shown to minimize the root-mean-square error between the original matrix M and its approximation.
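A minimal sketch of the 90%-energy rule and the resulting truncation, assuming NumPy; the function name and threshold parameter are mine, and the small matrix is only a toy.

```python
import numpy as np

def truncate_svd(M, energy=0.90):
    """Keep the fewest singular values whose squares sum to >= `energy` of the total."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    cumulative = np.cumsum(s ** 2)
    k = int(np.searchsorted(cumulative, energy * cumulative[-1])) + 1
    return U[:, :k], s[:k], Vt[:k, :]

M = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [3.0, 3.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 4.0, 4.0],
              [0.0, 0.0, 0.0, 5.0, 5.0]])
U_k, s_k, Vt_k = truncate_svd(M)
approx = U_k @ np.diag(s_k) @ Vt_k                    # low-rank approximation of M
print(len(s_k), np.sqrt(np.mean((M - approx) ** 2)))  # retained values and RMSE
```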

This relation ($M^TM = V\Sigma^2V^T$) says that V is the matrix of eigenvectors of $M^TM$ and $\Sigma^2$ is the diagonal matrix whose entries are the corresponding eigenvalues.

Thus, the same algorithm that computes the eigenpairs for $M^TM$ gives us the matrix V for the SVD of M itself. It also gives us the singular values for this SVD; just take the square roots of the eigenvalues of $M^TM$. U is the matrix of eigenvectors of $MM^T$.
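A sketch of that route, assuming NumPy. Rather than computing U separately from the eigenvectors of $MM^T$ (which would require consistent sign choices), this sketch recovers U from the relation $MV = U\Sigma$, restricted to the nonzero singular values; that shortcut is my choice, not the book's wording.

```python
import numpy as np

def svd_via_eigenpairs(M, tol=1e-10):
    """Build a (reduced) SVD of M from the eigenpairs of M^T M."""
    eigvals, V = np.linalg.eigh(M.T @ M)   # symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
    eigvals, V = eigvals[order], V[:, order]
    keep = eigvals > tol                   # drop numerically zero eigenvalues
    sigma = np.sqrt(eigvals[keep])         # singular values = sqrt of the eigenvalues
    V = V[:, keep]
    U = (M @ V) / sigma                    # from M V = U Σ: column j divided by σ_j
    return U, sigma, V

M = np.array([[1.0, 1.0],
              [3.0, 3.0],
              [0.0, 2.0]])
U, sigma, V = svd_via_eigenpairs(M)
print(np.allclose(M, U @ np.diag(sigma) @ V.T))   # True: M is reconstructed
```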

Definition of CUR

Let M be a matrix of m rows and n columns. Pick a target number of "concepts" r to be used in the decomposition. A CUR-decomposition of M is a randomly chosen set of r columns of M, which form the $m \times r$ matrix C, and a randomly chosen set of r rows of M, which form the $r \times n$ matrix R. There is also an $r \times r$ matrix U that is constructed from C and R as follows (a short sketch of this construction appears after the list):

  1. Let W be the $r \times r$ matrix that is the intersection of the chosen columns of C and the chosen rows of R. That is, the element in row i and column j of W is the element of M whose column is the jth column of C and whose row is the ith row of R.
  2. Compute the SVD of W; say $W = X\Sigma Y^T$.
  3. Compute $\Sigma^+$, the Moore-Penrose pseudoinverse of the diagonal matrix Σ. That is, if the ith diagonal element of Σ is $\sigma \ne 0$, then replace it by 1/σ. But if the ith element is 0, leave it as 0.
  4. Let $U = Y(\Sigma^+)^2X^T$.
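A sketch of steps 2 to 4 for a given square intersection matrix W, assuming NumPy; the example W is hypothetical, and the non-square case is covered further below.

```python
import numpy as np

def cur_middle_matrix(W, tol=1e-10):
    """Construct U = Y (Σ+)^2 X^T from the SVD of the intersection matrix W."""
    X, s, Yt = np.linalg.svd(W)            # W = X Σ Y^T
    s_plus = np.zeros_like(s)
    s_plus[s > tol] = 1.0 / s[s > tol]     # Moore-Penrose: invert nonzero entries, keep zeros
    return Yt.T @ np.diag(s_plus ** 2) @ X.T

W = np.array([[5.0, 5.0],
              [5.0, 0.0]])                 # hypothetical 2x2 intersection matrix
print(cur_middle_matrix(W))
```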

Having selected each of the columns of M, we scale each column by dividing its elements by the square root of the expected number of times this column would be picked. That is, we divide the elements of the jth column of M, if it is selected, by $\sqrt{rq_j}$, where $q_j$ is the probability of picking column j: the sum of the squares of the elements of the jth column divided by the sum of the squares of all the elements of M. The scaled column of M becomes a column of C.
Rows of M are selected for R in the analogous way. For each row of R we select from the rows of M, choosing row i with probability $p_i$. Recall that $p_i$ is the sum of the squares of the elements of the ith row divided by the sum of the squares of all the elements of M. We then scale each chosen row by dividing by $\sqrt{rp_i}$ if it is the ith row of M that was chosen.
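A sketch of this random selection and scaling, assuming NumPy; the helper name, the toy matrix, and the seed are mine. Sampling is with replacement, so a row or column can indeed be picked more than once.

```python
import numpy as np

def choose_scaled(M, r, what, rng):
    """Pick r columns (what='columns') or r rows of M with the squared-norm
    probabilities described in the text, scaled by 1/sqrt(r * prob)."""
    sq = M ** 2
    frob_sq = sq.sum()
    if what == 'columns':
        q = sq.sum(axis=0) / frob_sq                      # q_j for each column
        idx = rng.choice(M.shape[1], size=r, p=q)         # may repeat
        return M[:, idx] / np.sqrt(r * q[idx]), idx       # column j scaled by 1/sqrt(r q_j)
    p = sq.sum(axis=1) / frob_sq                          # p_i for each row
    idx = rng.choice(M.shape[0], size=r, p=p)
    return M[idx, :] / np.sqrt(r * p[idx])[:, None], idx  # row i scaled by 1/sqrt(r p_i)

rng = np.random.default_rng(0)
M = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [3.0, 3.0, 3.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 4.0, 4.0],
              [0.0, 0.0, 0.0, 5.0, 5.0]])
C, col_idx = choose_scaled(M, r=2, what='columns', rng=rng)
R, row_idx = choose_scaled(M, r=2, what='rows', rng=rng)
W = M[np.ix_(row_idx, col_idx)]     # unscaled intersection, fed to the sketch above
print(C, R, W, sep="\n")
```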

It is quite possible that a single row or column is selected more than once. However, it is also possible to combine k rows of R that are each the same row of the matrix M into a single row of R, thus leaving R with fewer rows. Likewise, k columns of C that each come from the same column of M can be combined into one column of C. However, for either rows or columns, the remaining vector should have each of its elements multiplied by $\sqrt{k}$.
When we merge some rows and/or columns, it is possible that R has fewer rows than C has columns, or vice versa. As a consequence, W will not be a square matrix. However, we can still take its pseudoinverse by decomposing it into $W = X\Sigma Y^T$, where Σ is now a diagonal matrix with some all-0 rows or columns, whichever it has more of. To take the pseudoinverse of such a diagonal matrix, we treat each element on the diagonal as usual (invert nonzero elements and leave 0 as it is), but then we must transpose the result.
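A sketch of taking the pseudoinverse of a non-square W this way, assuming NumPy. The reduced SVD keeps Σ square of size min(m, n), so the transpose of the rectangular Σ is implicit in the shapes; the result should agree with `np.linalg.pinv`.

```python
import numpy as np

def pseudoinverse(W, tol=1e-10):
    """Pseudoinverse of W via its SVD: invert the nonzero diagonal entries of Σ."""
    X, s, Yt = np.linalg.svd(W, full_matrices=False)   # W = X Σ Y^T
    s_plus = np.zeros_like(s)
    s_plus[s > tol] = 1.0 / s[s > tol]
    return Yt.T @ np.diag(s_plus) @ X.T                # W+ = Y Σ+ X^T

W = np.array([[1.0, 2.0, 3.0],
              [0.0, 0.0, 0.0]])                        # hypothetical 2x3 W with a zero row
print(np.allclose(pseudoinverse(W), np.linalg.pinv(W)))   # True
```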

Summary of Chapter 11

  • Dimensionality Reduction: The goal of dimensionality reduction is to replace a large matrix by two or more other matrices whose sizes are much smaller than the original, but from which the original can be approximately reconstructed, usually by taking their product.
  • Eigenvalues and Eigenvectors: A matrix may have several eigenvectors such that when the matrix multiplies the eigenvector, the result is a constant multiple of the eigenvector. That constant is the eigenvalue associated with this eigenvector. Together the eigenvector and its eigenvalue are called an eigenpair.
  • Finding Eigenpairs by Power Iteration: We can find the principal eigenvector (eigenvector with the largest eigenvalue) by starting with any vector and repeatedly multiplying the current vector by the matrix to get a new vector. When the changes to the vector become small, we can treat the result as a close approximation to the principal eigenvector. By modifying the matrix, we can then use the same iteration to get the second eigenpair (that with the second-largest eigenvalue), and similarly get each of the eigenpairs in turn, in order of decreasing value of the eigenvalue.
  • Principal-Component Analysis: This technique for dimensionality reduction views data consisting of a collection of points in a multidimensional space as a matrix, with rows corresponding to the points and columns to the dimensions. The product of this matrix and its transpose has eigenpairs, and the principal eigenvector can be viewed as the direction in the space along which the points best line up. The second eigenvector represents the direction in which deviations from the principal eigenvector are the greatest, and so on.
  • Dimensionality Reduction by PCA: By representing the matrix of points by a small number of its eigenvectors, we can approximate the data in a way that minimizes the root-mean-square error for the given number of columns in the representing matrix.
  • Singular-Value Decomposition: The singular-value decomposition of a matrix consists of three matrices, U, Σ, and V. The matrices U and V are column-orthonormal, meaning that as vectors, the columns are orthogonal, and their lengths are 1. The matrix Σ is a diagonal matrix, and the values along its diagonal are called singular values. The product of U, Σ, and the transpose of V equals the original matrix.
  • Concepts: SVD is useful when there are a small number of concepts that connect the rows and columns of the original matrix. For example, if the original matrix represents the ratings given by movie viewers (rows) to movies (columns), the concepts might be the genres of the movies. The matrix U connects rows to concepts, Σ represents the strengths of the concepts, and V connects the concepts to columns.
  • Queries Using the Singular-Value Decomposition: We can use the decomposition to relate new or hypothetical rows of the original matrix to the concepts represented by the decomposition. Multiply a row by the matrix V of the decomposition to get a vector indicating the extent to which that row matches each of the concepts (a small sketch of this mapping follows the summary).
  • Using SVD for Dimensionality Reduction: In a complete SVD for a matrix, U and V are typically as large as the original. To use fewer columns for U and V, delete the columns corresponding to the smallest singular values from U, V, and Σ. This choice minimizes the error in reconstructing the original matrix from the modified U, Σ, and V.
  • Decomposing Sparse Matrices: Even in the common case where the given matrix is sparse, the matrices constructed by SVD are dense. The CUR decomposition seeks to decompose a sparse matrix into sparse, smaller matrices whose product approximates the original matrix.
  • CUR Decomposition: This method chooses from a given sparse matrix a set of columns C and a set of rows R, which play the role of U and $V^T$ in SVD; the user can pick any number of rows and columns. The choice of rows and columns is made randomly, with a distribution that depends on the Frobenius norm, or the square root of the sum of the squares of the elements. Between C and R is a square matrix called U that is constructed by a pseudo-inverse of the intersection of the chosen rows and columns.
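A small sketch of the query mapping mentioned in the summary, assuming NumPy; the rating-style matrix and the query vector are hypothetical.

```python
import numpy as np

# Hypothetical viewer-by-movie rating matrix (rows: viewers, columns: movies).
M = np.array([[1.0, 1.0, 1.0, 0.0, 0.0],
              [3.0, 3.0, 3.0, 0.0, 0.0],
              [4.0, 4.0, 4.0, 0.0, 0.0],
              [5.0, 5.0, 5.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 4.0, 4.0],
              [0.0, 0.0, 0.0, 5.0, 5.0],
              [0.0, 0.0, 0.0, 2.0, 2.0]])

U, s, Vt = np.linalg.svd(M, full_matrices=False)
V = Vt.T[:, :2]                 # keep the two strongest concepts

# A new viewer who has rated only the first movie, giving it a 4.
q = np.array([4.0, 0.0, 0.0, 0.0, 0.0])
print(q @ V)                    # how strongly q aligns with each concept
```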
