These notes are based on the supervised-learning part of the lecture notes for Tsinghua University's Machine Learning course. They were written essentially as a cheat sheet a day or two before the exam: the coverage is broad but not detailed, intended mainly as material for review and memorization.
Linear Regression
Perceptron
- $f(x)=\mathrm{sign}(w^\top x+b)$
- Convergence: if the data are linearly separable, the perceptron converges after finitely many updates.
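A minimal sketch of the mistake-driven perceptron update (the toy data and epoch cap are illustrative assumptions, not from the notes):

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Train f(x) = sign(w^T x + b) with the classic mistake-driven update."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
                w += yi * xi             # pull the hyperplane toward xi
                b += yi
                mistakes += 1
        if mistakes == 0:                # no mistakes in a full pass: converged
            break
    return w, b

# toy linearly separable data
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron(X, y)
print(np.sign(X @ w + b))  # should match y
```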
Logistic Regression
- Outputs a probability instead of a label.
- Loss: cross entropy $XE(y,p)=-\sum_i y_i\log p_i$, where $y_i$ is the actual probability.
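A small sketch of binary logistic regression trained on the cross-entropy loss above (the sigmoid parameterization, step size, and toy data are my assumptions for concreteness):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_regression(X, y, lr=0.1, steps=1000):
    """y in {0,1}; minimize the mean cross entropy -[y log p + (1-y) log(1-p)]."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = sigmoid(X @ w)                 # predicted probabilities
        grad = X.T @ (p - y) / len(y)      # gradient of the mean cross entropy
        w -= lr * grad
    return w

X = np.array([[1.0, 2.0], [1.0, -1.0], [1.0, -2.0], [1.0, 1.5]])  # first column acts as bias
y = np.array([1, 0, 0, 1])
w = logistic_regression(X, y)
print(sigmoid(X @ w))  # probabilities, not hard labels
```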
Ridge Regression
- $l_2$ regularization $\frac{\lambda}{2}\|w\|^2_2$
- Shrink every coordinate: $w'=w\cdot (1-\eta \lambda)$
- weight decay
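A sketch of one gradient step with the $l_2$ regularizer, showing why it acts as weight decay (squared-error loss and the step size are assumptions for concreteness):

```python
import numpy as np

def ridge_sgd_step(w, x, y, eta=0.01, lam=0.1):
    """One SGD step on 0.5*(w@x - y)**2 + 0.5*lam*||w||^2.
    The regularizer contributes eta*lam*w, i.e. w is first shrunk by (1 - eta*lam)."""
    grad_loss = (w @ x - y) * x            # gradient of the data term
    return w * (1 - eta * lam) - eta * grad_loss

w = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])
print(ridge_sgd_step(w, x, y=0.0))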
LASSO Regression
- Find sparse features: want $\|w\|_0\le c$
- $l_1$ regularization $\lambda\|w\|_1$
- Gradient: add or subtract $\eta \lambda$ to pull $w$ toward $0$.
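The "add or subtract $\eta\lambda$" step is the subgradient of $\lambda\|w\|_1$; a closely related variant is the soft-thresholding (proximal gradient / ISTA) step sketched below, which is my framing rather than necessarily the course's exact algorithm:

```python
import numpy as np

def soft_threshold(w, t):
    """Proximal operator of t*||w||_1: shrink each coordinate toward 0, clipping at 0."""
    return np.sign(w) * np.maximum(np.abs(w) - t, 0.0)

def ista_step(w, X, y, eta=0.01, lam=0.1):
    """One proximal-gradient step for 0.5*||Xw - y||^2 + lam*||w||_1."""
    grad = X.T @ (X @ w - y)
    return soft_threshold(w - eta * grad, eta * lam)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])   # sparse ground truth
y = X @ w_true
w = np.zeros(5)
for _ in range(500):
    w = ista_step(w, X, y)
print(np.round(w, 2))   # near-sparse estimate close to w_true
```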
Compressed Sensing*
- Compared with LASSO: we get to pick $X_{train}$ ourselves, observe $Y_{train}$, and learn $w$, instead of being given $X_{train},Y_{train}$.
- RIP condition: a matrix $A$ is $(\epsilon,s)$-RIP if $\|Ax\|\approx \|x\|$ for all $s$-sparse $x$.
- Application: reconstructing $x$ from measurements through a RIP matrix is easy.
- For sparse $x$
- For sparse $x$ + some noise
- Lemma: RIP implies almost orthogonality
- Proof*
- How to find a RIP matrix: random matrices (e.g., i.i.d. Gaussian entries) satisfy RIP with high probability.
- Non-linear sparsity: $X=G(Z)$, where $Z$ is sparse and $G$ is non-linear.
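A toy sketch of the recovery setting: a random Gaussian measurement matrix (RIP with high probability) and LASSO-style recovery of a sparse signal. Using scikit-learn's `Lasso` as the solver, and the specific sizes and `alpha`, are my choices; the exact recovery guarantee depends on the constants:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, m, s = 100, 40, 3                 # signal dim, number of measurements, sparsity
x_true = np.zeros(n)
x_true[rng.choice(n, s, replace=False)] = rng.normal(size=s)

A = rng.normal(size=(m, n)) / np.sqrt(m)   # random Gaussian matrix: RIP w.h.p.
y = A @ x_true                             # observe y = A x (noiseless here)

lasso = Lasso(alpha=1e-3, max_iter=50_000, fit_intercept=False)
lasso.fit(A, y)
x_hat = lasso.coef_
print("recovery error:", np.linalg.norm(x_hat - x_true))
```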
SVM
- Minimize $\|w\|_2+\lambda \sum_{i}\xi_i$.
- Soft version: $y_iw^\top x_i\ge 1-\xi_i,\ \xi_i\ge 0$.
- Maximize the margin $1/\|w\|_2$.
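The soft-margin objective can equivalently be written with the hinge loss $\max(0,\,1-y_iw^\top x_i)$; below is a minimal subgradient-descent sketch. The $\frac{1}{2}\|w\|^2$ convention, the step size, and the toy data are my assumptions rather than the notes' exact formulation:

```python
import numpy as np

def svm_subgradient(X, y, lam=0.1, eta=0.01, steps=1000):
    """Minimize 0.5*||w||^2 + lam * sum_i max(0, 1 - y_i * w@x_i) by subgradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                                   # points with xi_i > 0
        grad = w - lam * (y[active][:, None] * X[active]).sum(axis=0)
        w -= eta * grad
    return w

X = np.array([[2.0, 2.0], [1.0, 3.0], [-2.0, -1.0], [-3.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = svm_subgradient(X, y)
print(np.sign(X @ w))   # should match y
```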
Dual derivation
- Primal problem: $\min_x x^2$ s.t. $x\ge b$
- Lagrangian: $L(x,\alpha)=x^2-\alpha(x-b)$, with $\alpha\ge 0$
- Any primal-feasible $x$ satisfies $L(x,\alpha)\le x^2$, so $\min_x L(x,\alpha)$ is a lower bound on the primal optimum.
- Dual: find the largest lower bound, $\max_\alpha d(\alpha)=\max_\alpha \min_x L(x,\alpha)$.
- Strong duality: $p^*=d^*$
- For SVM (hard version)
  - Primal: $\min\frac{\|w\|^2_2}{2}$ s.t. $y_iw^\top x_i\ge 1$.
  - Dual: maximize (solved by taking the derivative and substituting back)
$$L(w,\alpha)=\frac{\|w\|^2_2}{2}-\sum_i\alpha_i (y_iw^\top x_i-1)=\sum_{i}\alpha _i-\frac{1}{2}\sum_i\sum_jy_iy_j\alpha_i\alpha_j\langle x_i,x_j\rangle\\ \alpha_i\ge 0,\qquad w=\sum_{i}\alpha_iy_ix_i$$
Kernel Method
- Replace $\langle x_i,x_j\rangle$ with $\langle \phi(x_i),\phi(x_j)\rangle$ to embed $x$ into a high-dimensional space.
- Use $k(x_i,x_j)$ to compute $\langle \phi(x_i),\phi(x_j)\rangle$ without forming $\phi$ explicitly. Usually the Gaussian kernel $e^{-\frac{\|x_i-x_j\|^2}{2\sigma^2}}$.
- Mercer's theorem: a kernel whose kernel matrices are positive semidefinite has a corresponding embedding $\phi$.
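A sketch of the kernel trick in practice: build the Gaussian kernel matrix and check that it is positive semidefinite, which by Mercer's theorem is what guarantees an embedding $\phi$ exists (the data and $\sigma$ are arbitrary):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma=1.0):
    """K[i, j] = exp(-||x_i - x_j||^2 / (2 sigma^2)) = <phi(x_i), phi(x_j)>."""
    sq = np.sum(X**2, axis=1)
    sq_dists = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-sq_dists / (2 * sigma**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
K = gaussian_kernel_matrix(X)
eigvals = np.linalg.eigvalsh(K)
print("min eigenvalue:", eigvals.min())   # >= 0 up to numerical error: K is PSD
```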
Decision Tree
Boolean functional analysis
- $\mathcal{X}_S(x)=\prod_{i\in S}x_i$
- $f(x)=\sum_{S}\hat{f_S}\mathcal{X}_S(x)$
- $\hat{f_S}=\langle f,\mathcal{X}_S\rangle=\mathbb{E}_{x\sim D}[f(x)\mathcal{X}_S(x)]$
- $\mathbb{E}_{x\sim D}[f(x)^2]=\sum_{S}\hat{f_S}^2$
- A decision tree $T$ with $s$ leaf nodes can be converted into a $\log(\frac{s}{\epsilon})$-degree, sparsity-$\frac{s^2}{\epsilon}$ function that $2\epsilon$-approximates $T$.
- log depth
- Combination of AND terms
- Take the bases $S$ with large $\hat{f_S}$
- Estimation
  - LMN: $\hat{f_S}=\mathbb{E}_{x\sim D}[f(x)\mathcal{X}_S(x)]=\frac{1}{m}\sum_{i=1}^m f(x_i)\mathcal{X}_{S}(x_i)$
  - Does not work well in practice; has no guarantees in the noisy setting.
- Compressed Sensing
- $y=Ax+e$: $x$ contains the $\hat{f_S}$, $A$ and $y$ are from samples.
- LASSO finds $x^*$ with bounded error.
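A tiny sketch of the LMN estimator: estimate a Fourier coefficient of a boolean function over $\{-1,+1\}^n$ from uniform samples (the example function and sample size are mine):

```python
import numpy as np

def chi(S, X):
    """Parity basis chi_S(x) = prod_{i in S} x_i, evaluated row-wise on X in {-1,+1}^n."""
    return np.prod(X[:, S], axis=1) if S else np.ones(len(X))

def lmn_estimate(f, S, n, m=10_000, seed=0):
    """Estimate f_hat(S) = E[f(x) chi_S(x)] by the empirical mean over m uniform samples."""
    rng = np.random.default_rng(seed)
    X = rng.choice([-1, 1], size=(m, n))
    return np.mean(f(X) * chi(S, X))

# example: f(x) = x_0 * x_1 has f_hat({0,1}) = 1 and all other coefficients 0
f = lambda X: X[:, 0] * X[:, 1]
print(lmn_estimate(f, [0, 1], n=4))   # close to 1
print(lmn_estimate(f, [0], n=4))      # close to 0
```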
Splitting variables
- Greedy: Gini index*
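A sketch of the greedy split criterion: compute the Gini index of a label set and the impurity reduction of a candidate split (weighting the children by their sizes is the standard convention, assumed here):

```python
import numpy as np

def gini(labels):
    """Gini index 1 - sum_k p_k^2 of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p**2)

def split_gain(labels, mask):
    """Impurity reduction of splitting `labels` into labels[mask] / labels[~mask]."""
    n = len(labels)
    left, right = labels[mask], labels[~mask]
    return gini(labels) - (len(left) / n) * gini(left) - (len(right) / n) * gini(right)

y = np.array([1, 1, 1, 0, 0, 0])
x = np.array([0.1, 0.2, 0.3, 0.8, 0.9, 1.0])
print(split_gain(y, x < 0.5))   # perfect split: gain equals gini(y) = 0.5
```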
Random forest
- Addresses the overfitting problem of a single tree.
- Construct $B$ trees; each tree is trained on $n$ samples drawn from $\{(x_i,y_i)\}_{i=1}^n$ with replacement, so each element is missing from a given bootstrap sample with probability $(1-\frac{1}{n})^n\approx\frac{1}{e}$.
- Output the average of the $B$ trees.
- Can also bag the features
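A quick sketch of bootstrap sampling and the $(1-\frac{1}{n})^n\approx\frac{1}{e}$ miss probability (the simulation is purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, B = 1000, 2000
miss = 0
for _ in range(B):
    idx = rng.integers(0, n, size=n)   # bootstrap sample: n draws with replacement
    miss += 0 not in idx               # was element 0 left out of this tree's data?
print(miss / B, "vs", (1 - 1/n)**n, "vs", np.exp(-1))
```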
Adaboost*
- Combine weak learners. Make hard cases more likely.
- Sample distribution $D_t(i)$ in round $t$. $\epsilon_t=\Pr_{i\sim D_t}[h_t(x_i)\neq y_i]$.
- $D_1(i)=\frac{1}{m}$. Learn $h_t$ from $D_t(i)$, then update
$$D_{t+1}(i)=\begin{cases} \frac{1}{Z_t}D_t(i)e^{-\alpha_t} & y_i=h_t(x_i)\\ \frac{1}{Z_t}D_t(i)e^{\alpha_t} & y_i\neq h_t(x_i) \end{cases},\qquad \alpha_t=\frac{1}{2}\ln\left(\frac{1-\epsilon_t}{\epsilon_t}\right)$$
- $H_{final}=\mathrm{sign}(\sum_t \alpha_t h_t)$
- $\epsilon_t=\frac{1}{2}-\gamma_t$, where $\gamma_t$ measures how much better this weak learner is than a random classifier. Then
$$error(H_{final})\le \prod _{t}2\sqrt{\epsilon_t(1-\epsilon_t)}\le \prod_t\sqrt{1-4\gamma_t^2}\le \exp\left(-2\sum_t\gamma_t^2\right)$$
- Proof:
  - $D_T(i)=\frac{1}{m}\frac{\exp(-y_i\sum_{t}\alpha_t h_t(x_i))}{\prod_t Z_t}$
  - With $f=\sum_t\alpha_t h_t$, using $1_{z\le 0}\le e^{-z}$:
$$\begin{aligned} error(H_{final})&=\frac{1}{m}\sum_{i=1}^m 1_{y_if(x_i)\le 0}\\ &\le\frac{1}{m}\sum_{i=1}^m \exp(-y_if(x_i))\\ &=\sum_{i=1}^m D_T(i)\prod_t Z_t\\ &=\prod_t Z_t \end{aligned}$$
  - $Z_t=\sum_{i}D_t(i)\exp(-\alpha_t y_i h_t(x_i))=(1-\epsilon_t)e^{-\alpha_t}+\epsilon_te^{\alpha_t}$. To minimize $Z_t$, take $\alpha_t=\frac{1}{2}\ln(\frac{1-\epsilon_t}{\epsilon_t})$; then $Z_t=2\sqrt{\epsilon_t(1-\epsilon_t)}$.
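A minimal AdaBoost sketch following the reweighting and $\alpha_t$ formula above, using decision stumps as the weak learner (the stump learner and toy data are my choices, not from the notes):

```python
import numpy as np

def best_stump(X, y, D):
    """Weak learner: threshold on one feature, minimizing weighted error under D."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = sign * np.where(X[:, j] >= thr, 1, -1)
                err = D[pred != y].sum()
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost(X, y, T=20):
    m = len(y)
    D = np.full(m, 1 / m)                      # D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        err, j, thr, sign = best_stump(X, y, D)
        err = max(err, 1e-12)
        alpha = 0.5 * np.log((1 - err) / err)  # alpha_t = 0.5 ln((1 - eps_t)/eps_t)
        pred = sign * np.where(X[:, j] >= thr, 1, -1)
        D = D * np.exp(-alpha * y * pred)      # up-weight mistakes, down-weight correct points
        D /= D.sum()                           # normalize by Z_t
        hs.append((j, thr, sign))
        alphas.append(alpha)
    def H(Xq):                                 # H_final = sign(sum_t alpha_t h_t)
        scores = sum(a * s * np.where(Xq[:, j] >= t, 1, -1) for a, (j, t, s) in zip(alphas, hs))
        return np.sign(scores)
    return H

X = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y = np.array([-1, -1, 1, 1, -1, -1])           # not classifiable by a single stump
H = adaboost(X, y)
print(H(X))                                    # training predictions
```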
- Generalization
- Drawback: only binary classification.
- Extension: gradient boosting. For regression, minimize $\frac{1}{2}\sum_i(F(x_i)-y_i)^2$.
- AdaBoost uses coordinate descent: in round $t$, change $\alpha_t$ from $0$ to $\alpha_t$; only one coordinate is changed per round since the dimension is too large.
- Learn a new regression tree $h(x)$ to fit the residual $y_i-F(x_i)$ (the negative gradient $-\partial L/\partial F(x_i)$ for square loss), then update $F'(x)=F(x)+\eta h(x)$.
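A sketch of gradient boosting for square loss: each round fits a new regression tree to the residuals $y_i-F(x_i)$ and adds it with a small step $\eta$. Using scikit-learn's `DecisionTreeRegressor` as the base learner, and the toy sine data, are my choices:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, rounds=50, eta=0.1, max_depth=2):
    """Fit F(x) = F_0 + eta * sum_t h_t(x), fitting each h_t to the current residuals."""
    F = np.full(len(y), y.mean())              # F_0: constant prediction
    trees = []
    for _ in range(rounds):
        residual = y - F                        # negative gradient of 0.5*sum (F(x_i)-y_i)^2
        h = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        F += eta * h.predict(X)                 # F'(x) = F(x) + eta * h(x)
        trees.append(h)
    return y.mean(), trees

def predict(model, X, eta=0.1):
    f0, trees = model
    return f0 + eta * sum(t.predict(X) for t in trees)

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=200)
model = gradient_boost(X, y)
print(np.mean((predict(model, X) - y) ** 2))    # small training MSE
```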