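Logistic Regression
1. The Logistic Regression Model
Logistic regression, proposed by the statistician David Cox (1958), fits data to a logistic model in order to predict the probability that an event occurs. Because the dependent variable is binary (multi-category extensions also exist), the model describes the probability that a specified event does or does not occur.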
Let the dependent variable $y$ take values in $\{0,1\}$, and let $x_1,x_2,\dots,x_p$ be the explanatory variables of $y$. Logistic regression studies how $X=(x_1,x_2,\dots,x_p)$ affects $y$. Write
$$p = P(y=1\mid X),\qquad 1-p = P(y=0\mid X).$$
The ratio $p/(1-p)$ of these two probabilities is called the odds. The conditional expectation of the dependent variable is
$$E(y\mid X) = 1\cdot p+0\cdot(1-p)=p.$$
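Following the linear-model approach, one could write
$$y = \beta_0+\beta_1x_1+\dots+\beta_px_p+\varepsilon,$$
where $\varepsilon$ is the disturbance term. Estimating this equation by OLS gives the linear probability model (LPM). However, because $y$ takes only the values 0 and 1, for a given $X$ the disturbance can equal only $1-X'\beta$ or $-X'\beta$, where $X'\beta$ denotes the linear index $\beta_0+\beta_1x_1+\dots+\beta_px_p$. The disturbance is therefore necessarily related to $X$, which leads to problems such as endogeneity and heteroskedasticity (for example, $\mathrm{Var}(\varepsilon\mid X)=p(1-p)$ varies with $X$). In addition, when $X$ takes extreme values the linear model can produce predictions with $y<0$ or $y>1$, which cannot be interpreted as probabilities. We therefore use a link function such that
$$\left\{\begin{array}{l} P(y=1\mid X) =\Lambda(X,\beta)\\ P(y=0\mid X) =1-\Lambda(X,\beta) \end{array}\right.$$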
where $\Lambda(\cdot)$ denotes the link function and $\beta$ is the parameter vector. The link function can be the standard normal cumulative distribution function, which gives the Probit model, or the logistic distribution function, which gives the Logit model. Because the standard normal cumulative distribution function has no closed-form expression, the logistic distribution function is generally used, that is,
$$\begin{aligned} P(y=1\mid X) = p &= \frac{\exp(X'\beta)}{1+\exp(X'\beta)}\\ &=\frac{\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)}{1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)} \end{aligned}$$
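As a minimal sketch of the logistic link (the coefficients `beta0` and `beta1` below are hypothetical values chosen only for illustration, not estimates from any data), the following Python snippet shows that the linear index is unbounded while the implied probability always stays strictly between 0 and 1:

```python
import math

def logistic(z: float) -> float:
    """Logistic CDF exp(z) / (1 + exp(z)), written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Hypothetical coefficients for a model with one regressor x (illustration only).
beta0, beta1 = -1.0, 0.5

for x in (-20.0, 0.0, 2.0, 20.0):
    index = beta0 + beta1 * x      # linear index X'beta, can be any real number
    prob = logistic(index)         # P(y=1|x), bounded by 0 and 1
    print(f"x={x:6.1f}  index={index:7.2f}  P(y=1|x)={prob:.6f}")
```

A linear probability model would use `index` itself as the predicted probability, which is exactly where values below 0 or above 1 come from.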
The logistic density is symmetric about the origin, with mean 0 and variance $\pi^2/3$, and has heavier tails than the standard normal distribution. From the expression above, the log-odds are
$$\ln\left(\frac{p}{1-p}\right)=\beta_0+\beta_1x_1+\dots+\beta_px_p$$
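This model shows that, holding the other variables fixed, a one-unit change in $x_i$ changes the log-odds by $\beta_i$ units (equivalently, it multiplies the odds by $e^{\beta_i}$); it does not change the dependent variable itself by $\beta_i$ units.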
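The interpretation of $\beta_i$ can be checked numerically. The sketch below again uses hypothetical coefficients (`beta0`, `beta1` are assumptions, not estimates) and compares the change in the log-odds, the odds, and the probability when $x$ increases by one unit:

```python
import math

def logistic(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)); adequate for moderate |z|."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients (illustration only).
beta0, beta1 = -1.0, 0.5

def prob(x: float) -> float:
    """P(y=1|x) under the logit model."""
    return logistic(beta0 + beta1 * x)

def odds(x: float) -> float:
    """Odds p / (1 - p) at x."""
    p = prob(x)
    return p / (1.0 - p)

x = 2.0
log_odds_change = math.log(odds(x + 1.0)) - math.log(odds(x))  # equals beta1
odds_multiplier = odds(x + 1.0) / odds(x)                       # equals exp(beta1)
prob_change = prob(x + 1.0) - prob(x)                           # depends on x, not equal to beta1

print(log_odds_change, odds_multiplier, prob_change)
```

The change in probability depends on where $x$ starts, which is why marginal effects have to be evaluated at specific covariate values.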
2. Parameter Estimation
Because $y$ follows a Bernoulli (0-1) distribution, its probability function can be written as
$$P(y) = p^y(1-p)^{1-y},\qquad y=0,1$$
The likelihood function, with the product taken over the independent sample observations, is
$$L= \prod P(y) = \prod p^y(1-p)^{1-y}$$
Taking logarithms gives
$$\begin{aligned} \ln L &= \sum[y\ln p+(1-y)\ln (1-p)]\\ &=\sum\left[y\ln \frac{p}{1-p}+\ln(1-p)\right] \end{aligned}$$
Substituting the expression for $p$, and using $\ln(1-p) = -\ln[1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)]$, gives
$$\begin{aligned} \ln L = &\sum\{y(\beta_0+\beta_1x_1+\dots+\beta_px_p)\\ &-\ln[1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)]\} \end{aligned}$$
The first-order conditions are
$$\frac{\partial\ln L}{\partial\beta_j} =0,\qquad j=0,1,\dots,p$$
Solving these equations (numerically, since they have no closed-form solution) yields the maximum likelihood estimates $\hat{\beta}_j$ $(j=0,1,\dots,p)$. Substituting $\hat{\beta}_j$ $(j=0,1,\dots,p)$ back into $P(y=1\mid X)$ gives
$$P(y=1\mid X) =\frac{\exp(\hat{\beta}_0+\hat{\beta}_1x_1+\dots+\hat{\beta}_px_p)}{1+\exp(\hat{\beta}_0+\hat{\beta}_1x_1+\dots+\hat{\beta}_px_p)}$$
and, correspondingly,
$$P(y=0\mid X) =\frac{1}{1+\exp(\hat{\beta}_0+\hat{\beta}_1x_1+\dots+\hat{\beta}_px_p)}$$
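To make the estimation step concrete, note that differentiating the log-likelihood gives the score $\partial\ln L/\partial\beta_j=\sum (y-p)\,x_j$ (with $x_0=1$ for the intercept) and the Hessian $-\sum p(1-p)\,XX'$. The sketch below applies Newton-Raphson iterations to these two quantities for a model with an intercept and a single regressor. It is only an illustration under stated assumptions: the data are made up (not womenwk), the function name `fit_logit` is hypothetical, and Stata's own optimizer may differ in its details.

```python
import math

def logistic(z: float) -> float:
    """Logistic function 1 / (1 + exp(-z)); adequate for moderate |z|."""
    return 1.0 / (1.0 + math.exp(-z))

def fit_logit(x, y, tol=1e-10, max_iter=50):
    """Newton-Raphson MLE for P(y=1|x) = logistic(b0 + b1*x); returns (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0              # score: gradient of ln L
        h00 = h01 = h11 = 0.0      # information matrix: negative Hessian of ln L
        for xi, yi in zip(x, y):
            p = logistic(b0 + b1 * xi)
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system (information) * step = score.
        det = h00 * h11 - h01 * h01
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if max(abs(s0), abs(s1)) < tol:
            break
    return b0, b1

# Made-up toy data; the two classes overlap, so the MLE is finite.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
y = [0, 0, 1, 0, 0, 1, 1, 0, 1, 1]

b0_hat, b1_hat = fit_logit(x, y)

# At the MLE the first-order conditions hold: both score components are ~0.
score0 = sum(yi - logistic(b0_hat + b1_hat * xi) for xi, yi in zip(x, y))
score1 = sum((yi - logistic(b0_hat + b1_hat * xi)) * xi for xi, yi in zip(x, y))
print(b0_hat, b1_hat, score0, score1)
```

Checking that the score is numerically zero at the returned estimates is a direct check of the first-order conditions stated above.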
3. Software Implementation
Using the womenwk dataset as an example, we build the following model:
$$\text{work}_{i}=\beta_{0}+\beta_{1}\,\text{age}_{i}+\beta_{2}\,\text{married}_{i}+\beta_{3}\,\text{children}_{i}+\beta_{4}\,\text{education}_{i}+\varepsilon_{i}$$
where work indicates whether the woman is employed, age is age, married is marital status, children is the number of children, and education is years of schooling. Written this way, the equation is the LPM specification; in the logit model the same regressors enter the linear index inside $\Lambda(\cdot)$.
The Stata code is as follows:
*------------------------ Logistic regression --------------------
cd "D:\master\笔记\markdown笔记\计量经济学\二值选择模型"
use womenwk.dta,clear
* Variable definitions:
* dataset: womenwk
* work: employed or not
* age: age
* married: marital status
* children: number of children
* education: years of schooling
*---------------------------LPM estimation----------------------
reg work age married children education,r
/*
Linear regression Number of obs = 2,000
F(4, 1995) = 192.58
Prob > F = 0.0000
R-squared = 0.2026
Root MSE = .41992
------------------------------------------------------------------------------
| Robust
work | Coef. Std. Err. t P>|t| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0102552 .0012236 8.38 0.000 .0078556 .0126548
married | .1111116 .0226719 4.90 0.000 .0666485 .1555748
children | .1153084 .0056978 20.24 0.000 .1041342 .1264827
education | .0186011 .0033006 5.64 0.000 .0121282 .025074
_cons | -.2073227 .0534581 -3.88 0.000 -.3121622 -.1024832
------------------------------------------------------------------------------
*/
*-----------------------------logit regression-----------------------------------
logit work age married children education,nolog
/*
Logistic regression Number of obs = 2,000
LR chi2(4) = 476.62
Prob > chi2 = 0.0000
Log likelihood = -1027.9144 Pseudo R2 = 0.1882
------------------------------------------------------------------------------
work | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0579303 .007221 8.02 0.000 .0437773 .0720833
married | .7417775 .1264705 5.87 0.000 .4938998 .9896552
children | .7644882 .0515289 14.84 0.000 .6634935 .865483
education | .0982513 .0186522 5.27 0.000 .0616936 .134809
_cons | -4.159247 .3320401 -12.53 0.000 -4.810034 -3.508461
------------------------------------------------------------------------------
*/
* logit with robust standard errors
logit work age married children education,nolog r
/*
Logistic regression Number of obs = 2,000
Wald chi2(4) = 344.54
Prob > chi2 = 0.0000
Log pseudolikelihood = -1027.9144 Pseudo R2 = 0.1882
------------------------------------------------------------------------------
| Robust
work | Coef. Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0579303 .0072054 8.04 0.000 .0438079 .0720527
married | .7417775 .1272191 5.83 0.000 .4924326 .9911224
children | .7644882 .0497584 15.36 0.000 .6669635 .8620129
education | .0982513 .019011 5.17 0.000 .0609904 .1355121
_cons | -4.159247 .327398 -12.70 0.000 -4.800936 -3.517559
------------------------------------------------------------------------------
*/
* report odds ratios instead of coefficients
logit work age married children education,nolog or
/*
Logistic regression Number of obs = 2,000
LR chi2(4) = 476.62
Prob > chi2 = 0.0000
Log likelihood = -1027.9144 Pseudo R2 = 0.1882
------------------------------------------------------------------------------
work | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | 1.059641 .0076517 8.02 0.000 1.04475 1.074745
married | 2.099664 .2655457 5.87 0.000 1.638694 2.690307
children | 2.147895 .1106786 14.84 0.000 1.941563 2.376153
education | 1.10324 .0205779 5.27 0.000 1.063636 1.144318
_cons | .0156193 .0051862 -12.53 0.000 .0081476 .029943
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
*/
*---------------------Marginal effects-----------------
* Marginal effects evaluated at the sample means
margins,dydx(*) atmeans
/*
Conditional marginal effects Number of obs = 2,000
Model VCE : OIM
Expression : Pr(work), predict()
dy/dx w.r.t. : age married children education
at : age = 36.208 (mean)
married = .6705 (mean)
children = 1.6445 (mean)
education = 13.084 (mean)
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .0115031 .0014236 8.08 0.000 .0087129 .0142934
married | .1472934 .0248209 5.93 0.000 .0986453 .1959415
children | .151803 .0093768 16.19 0.000 .1334249 .1701812
education | .0195096 .0036991 5.27 0.000 .0122596 .0267596
------------------------------------------------------------------------------
*/
*---------------------Marginal effects at specified covariate values-------------------
margins,dydx(*) at(age =30)
/*
Average marginal effects Number of obs = 2,000
Model VCE : OIM
Expression : Pr(work), predict()
dy/dx w.r.t. : age married children education
at : age = 30
------------------------------------------------------------------------------
| Delta-method
| dy/dx Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
age | .011179 .0014719 7.59 0.000 .008294 .0140639
married | .1431427 .0232525 6.16 0.000 .0975687 .1887167
children | .1475253 .0074033 19.93 0.000 .1330151 .1620355
education | .0189598 .0034727 5.46 0.000 .0121534 .0257662
------------------------------------------------------------------------------
*/
*------------------Correct classification rate------------------
estat clas
/*
Logistic model for work
-------- True --------
Classified | D ~D | Total
-----------+--------------------------+-----------
+ | 1177 361 | 1538
- | 166 296 | 462
-----------+--------------------------+-----------
Total | 1343 657 | 2000
Classified + if predicted Pr(D) >= .5
True D defined as work != 0
--------------------------------------------------
Sensitivity Pr( +| D) 87.64%
Specificity Pr( -|~D) 45.05%
Positive predictive value Pr( D| +) 76.53%
Negative predictive value Pr(~D| -) 64.07%
--------------------------------------------------
False + rate for true ~D Pr( +|~D) 54.95%
False - rate for true D Pr( -| D) 12.36%
False + rate for classified + Pr(~D| +) 23.47%
False - rate for classified - Pr( D| -) 35.93%
--------------------------------------------------
Correctly classified 73.65%
--------------------------------------------------
*/
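Several of the reported numbers can be reproduced by hand from the formulas in Sections 1 and 2. The Python sketch below copies the coefficients, sample means, and classification counts from the Stata output above. It assumes that `margins` treats `married` as a continuous regressor (it was entered without the `i.` prefix), so that every marginal effect at the means is $\beta_j\,p(1-p)$; the variable names are illustrative only.

```python
import math

# Logit coefficients copied from the `logit work age married children education` output.
coef = {
    "age": 0.0579303,
    "married": 0.7417775,
    "children": 0.7644882,
    "education": 0.0982513,
    "_cons": -4.159247,
}
# Sample means copied from the `margins, dydx(*) atmeans` output.
means = {"age": 36.208, "married": 0.6705, "children": 1.6445, "education": 13.084}

# 1) The `or` option reports exp(coefficient).
odds_ratios = {name: math.exp(b) for name, b in coef.items()}

# 2) Marginal effect at the means: beta_j * p * (1 - p), with p evaluated at the means.
index_at_means = coef["_cons"] + sum(coef[name] * means[name] for name in means)
p_at_means = 1.0 / (1.0 + math.exp(-index_at_means))
mfx_at_means = {name: coef[name] * p_at_means * (1.0 - p_at_means) for name in means}

# 3) Classification measures from the `estat clas` table (cutoff 0.5).
true_pos, false_pos = 1177, 361    # classified +, with true D and true ~D
false_neg, true_neg = 166, 296     # classified -, with true D and true ~D
sensitivity = true_pos / (true_pos + false_neg)
specificity = true_neg / (true_neg + false_pos)
accuracy = (true_pos + true_neg) / (true_pos + false_pos + false_neg + true_neg)

print(odds_ratios["age"], mfx_at_means["age"], sensitivity, specificity, accuracy)
```

The results agree with the Stata tables up to rounding: the odds ratio for age is about 1.0596, its marginal effect at the means is about 0.0115, and the share of correctly classified observations is 73.65%.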
References:
Chen Qiang (2014). 高级计量经济学及Stata应用 (Advanced Econometrics and Stata Applications), 2nd edition.