Logistic Regression Model

Logistic Regression

1. The Logistic Regression Model

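Logistic regression was proposed by the statistician David Cox (1958). In essence, it fits the data to a logistic model in order to predict the probability that an event occurs. Because the dependent variable is binary (it can also be multi-category), the model can represent the probability that a given event does or does not occur.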

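Let the dependent variable $y$ take values in $\{0,1\}$, and let $x_1,x_2,\dots,x_p$ be the explanatory variables for $y$. Logistic regression studies how $X=(x_1,x_2,\dots,x_p)$ affects $y$. Write
$p = P(y = 1 \mid X); \quad 1 - p = P(y = 0 \mid X)$
Then the ratio $p / (1 - p)$ is called the odds. The expectation of the dependent variable is
$E(y \mid X) = 1 \cdot p + 0 \cdot (1 - p) = p$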
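Following the linear modeling approach, we have
$y=\beta_0+\beta_1x_1+\dots+\beta_px_p+\varepsilon$
where $\varepsilon$ is the disturbance term. Estimating this equation by OLS gives the linear probability model (LPM). However, because $y$ takes only the values 0 and 1, for a given $X$ the disturbance $\varepsilon$ can only equal $-X'\beta$ or $1-X'\beta$, so it depends on $X$, which leads to problems such as endogeneity and heteroskedasticity. In addition, when $X$ takes extreme values the linear model can produce fitted values with $y < 0$ or $y > 1$, which cannot be interpreted as probabilities. We therefore use a link function such that
$\left\{\begin{array}{l} P(y=1|X) =\Lambda(X,\beta)\\ \\ P(y=0|X) =1-\Lambda(X,\beta)\\ \end{array}\right.$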
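where $\Lambda(\cdot)$ denotes the link function and $\beta$ the parameters. The link function can be the standard normal cumulative distribution function or the logistic distribution function: the former gives the Probit model, the latter the Logit model. Because the standard normal cumulative distribution function has no closed-form expression, the logistic distribution function is generally used, that is,
$\begin{aligned} P(y=1|X)& =p = \frac{\exp(X'\beta)}{1+\exp(X'\beta)}\\ \\ &=\frac{\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)}{1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)} \end{aligned}$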
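The logistic density is symmetric about the origin, with mean 0 and variance $\pi^2/3$, and has heavy tails. From the expression for $p$ above, the log odds are
$\ln\left(\frac{p}{1-p}\right)=\beta_0+\beta_1x_1+\dots+\beta_px_p$
This model shows that, holding everything else constant, a one-unit change in $x_i$ changes the log odds by $\beta_i$ units; it does not change the dependent variable by $\beta_i$ units.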
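The interpretation of $\beta_i$ can be checked numerically. The following is a minimal Python sketch, assuming a single regressor and hypothetical coefficients `beta0 = -1.0` and `beta1 = 0.5` that are chosen only for illustration and do not come from any data set. It shows that a one-unit change in $x$ shifts the log odds by exactly `beta1` and multiplies the odds by `exp(beta1)`, while the change in the probability itself depends on the starting value of $x$.

```python
import math


def logistic(z):
    """Logistic distribution function exp(z) / (1 + exp(z)), in a stable form."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def log_odds(p):
    """Log odds ln(p / (1 - p)) for a probability 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients, chosen only for illustration.
beta0, beta1 = -1.0, 0.5

for x in (0.0, 1.0, 2.0, 3.0):
    p = logistic(beta0 + beta1 * x)
    print(f"x={x:.0f}  p={p:.4f}  odds={p / (1 - p):.4f}  log-odds={log_odds(p):.4f}")

# Compare the effect of a one-unit step in x at two different starting points.
p0, p1, p2 = (logistic(beta0 + beta1 * x) for x in (0.0, 1.0, 2.0))
delta_log_odds = log_odds(p1) - log_odds(p0)      # equals beta1
odds_ratio = (p1 / (1 - p1)) / (p0 / (1 - p0))    # equals exp(beta1)
print("change in log odds:", delta_log_odds)
print("odds ratio:", odds_ratio)
print("change in p from x=0 to 1:", p1 - p0, " from x=1 to 2:", p2 - p1)
```

Because the probability is a nonlinear function of $x$, the last line prints two different increments even though the log-odds increment is the same in both steps.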

2. Parameter Estimation

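Because $y$ follows a 0-1 (Bernoulli) distribution, its probability function can be written as
$P(y) = p^y(1-p)^{1-y}, \quad y=0,1$
The likelihood function is
$L = \prod {P(y)} = \prod { p^y(1-p)^{1-y}}$
Taking logarithms gives
$\begin{aligned} \ln L& = \sum[y\ln p+(1-y)\ln (1-p)]\\ \\ &=\sum\left[y\ln \frac{p}{1-p}+\ln(1-p)\right] \end{aligned}$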
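Substituting the expression for $p$, and noting that $\ln(1-p)=-\ln[1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)]$, gives
$\begin{aligned} \ln L = &\sum\{y(\beta_0+\beta_1x_1+\dots+\beta_px_p)\\ \\ &-\ln[1+\exp(\beta_0+\beta_1x_1+\dots+\beta_px_p)]\} \end{aligned}$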
The first-order conditions are
$\frac{\partial\ln L}{\partial\beta_j} =0 \quad (j=0,1,\dots,p)$
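Written out, each first-order condition takes a simple form. The following is a sketch of the derivation, assuming an observation index $i$ and the convention $x_{i0}=1$ for the intercept (neither appears in the notation above), and using $\ln(1-p_i)=-\ln[1+\exp(X_i'\beta)]$:
$\begin{aligned} \frac{\partial\ln L}{\partial\beta_j} &=\sum_i\left[y_ix_{ij}-\frac{\exp(X_i'\beta)}{1+\exp(X_i'\beta)}x_{ij}\right]\\ \\ &=\sum_i(y_i-p_i)x_{ij}=0 \end{aligned}$
In words, at the optimum the residuals $y_i-p_i$ are orthogonal to every regressor, including the constant.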
Solving these equations, which has to be done numerically because they are nonlinear in $\beta$, yields the maximum likelihood estimates $\hat{\beta}_j\ (j=0,1,\dots,p)$. Substituting $\hat{\beta}_j\ (j=0,1,\dots,p)$ back into $P(y = 1 \mid X)$ gives
$\begin{aligned} \hat{P}(y=1|X)& =\frac{\exp(\hat{\beta}_0+\hat{\beta}_1x_1+\dots+\hat{\beta}_px_p)}{1+\exp(\hat{\beta}_0+\hat{\beta}_1x_1+\dots+\hat{\beta}_px_p)} \end{aligned}$
and, likewise,
$\begin{aligned} \hat{P}(y=0|X)& =\frac{1}{1+\exp(\hat{\beta}_0+\hat{\beta}_1x_1+\dots+\hat{\beta}_px_p)} \end{aligned}$
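One common way to maximize this log-likelihood is Newton-Raphson iteration. The following is a minimal Python sketch under stated assumptions: a single regressor plus an intercept, and a tiny made-up data set (`xs`, `ys`) that exists only for illustration. It is not the algorithm of any particular statistics package. The function name `fit_logit` is hypothetical.

```python
import math


def sigmoid(z):
    """Logistic distribution function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(xs, ys, max_iter=25, tol=1e-10):
    """Maximize ln L = sum[y*(b0 + b1*x) - ln(1 + exp(b0 + b1*x))] by Newton-Raphson."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        g0 = g1 = 0.0                # score: sum (y - p) and sum (y - p) * x
        h00 = h01 = h11 = 0.0        # information matrix (negative Hessian)
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system (information) * step = score.
        det = h00 * h11 - h01 * h01
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0 += s0
        b1 += s1
        if abs(s0) + abs(s1) < tol:
            break
    return b0, b1


# Made-up illustrative data (not from womenwk); the classes overlap, so the MLE exists.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 0, 1, 1]

b0_hat, b1_hat = fit_logit(xs, ys)

# At the estimate, the first-order conditions should hold up to rounding error.
score0 = sum(y - sigmoid(b0_hat + b1_hat * x) for x, y in zip(xs, ys))
score1 = sum((y - sigmoid(b0_hat + b1_hat * x)) * x for x, y in zip(xs, ys))
print("b0_hat =", b0_hat, " b1_hat =", b1_hat)
print("score at estimate:", score0, score1)
print("fitted P(y=1 | x=5) =", sigmoid(b0_hat + b1_hat * 5))
```

The sketch has no safeguards such as step halving, so it is suitable only for small, well-behaved examples; with perfectly separated data the estimates would diverge.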

3. Software Implementation

Using the womenwk data set as an example, we build the following model:
$\text { work }_{i}=\beta_{0}+\beta_{1} \text { age }_{i}+\beta_{2} \text { married }_{i}+\beta_{3} \text { children }_{i}+\beta_{4} \text { education }_{i}+\varepsilon_{i}$
where work: whether employed; age: age; married: marital status; children: number of children; education: years of education.

The Stata code is as follows:

*------------------------ Logistic regression --------------------

cd "D:\master\笔记\markdown笔记\计量经济学\二值选择模型"

use womenwk.dta,clear
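* Variable definitions:
* data set: womenwk
* work: whether employed
* age: age
* married: marital status
* children: number of children
* education: years of education
*--------------------------- LPM estimation ----------------------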
reg work age married children education,r
/*
Linear regression                               Number of obs     =      2,000
                                                F(4, 1995)        =     192.58
                                                Prob > F          =     0.0000
                                                R-squared         =     0.2026
                                                Root MSE          =     .41992

------------------------------------------------------------------------------
             |               Robust
        work |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0102552   .0012236     8.38   0.000     .0078556    .0126548
     married |   .1111116   .0226719     4.90   0.000     .0666485    .1555748
    children |   .1153084   .0056978    20.24   0.000     .1041342    .1264827
   education |   .0186011   .0033006     5.64   0.000     .0121282     .025074
       _cons |  -.2073227   .0534581    -3.88   0.000    -.3121622   -.1024832
------------------------------------------------------------------------------
*/

*----------------------------- logit regression -----------------------------------
logit work age married children education,nolog

/*
Logistic regression                             Number of obs     =      2,000
                                                LR chi2(4)        =     476.62
                                                Prob > chi2       =     0.0000
Log likelihood = -1027.9144                     Pseudo R2         =     0.1882

------------------------------------------------------------------------------
        work |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0579303    .007221     8.02   0.000     .0437773    .0720833
     married |   .7417775   .1264705     5.87   0.000     .4938998    .9896552
    children |   .7644882   .0515289    14.84   0.000     .6634935     .865483
   education |   .0982513   .0186522     5.27   0.000     .0616936     .134809
       _cons |  -4.159247   .3320401   -12.53   0.000    -4.810034   -3.508461
------------------------------------------------------------------------------
*/

* logit with robust standard errors
logit work age married children education,nolog r
/*
Logistic regression                             Number of obs     =      2,000
                                                Wald chi2(4)      =     344.54
                                                Prob > chi2       =     0.0000
Log pseudolikelihood = -1027.9144               Pseudo R2         =     0.1882

------------------------------------------------------------------------------
             |               Robust
        work |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0579303   .0072054     8.04   0.000     .0438079    .0720527
     married |   .7417775   .1272191     5.83   0.000     .4924326    .9911224
    children |   .7644882   .0497584    15.36   0.000     .6669635    .8620129
   education |   .0982513    .019011     5.17   0.000     .0609904    .1355121
       _cons |  -4.159247    .327398   -12.70   0.000    -4.800936   -3.517559
------------------------------------------------------------------------------
*/

* report odds ratios

logit work age married children education,nolog or

/*
Logistic regression                             Number of obs     =      2,000
                                                LR chi2(4)        =     476.62
                                                Prob > chi2       =     0.0000
Log likelihood = -1027.9144                     Pseudo R2         =     0.1882

------------------------------------------------------------------------------
        work | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   1.059641   .0076517     8.02   0.000      1.04475    1.074745
     married |   2.099664   .2655457     5.87   0.000     1.638694    2.690307
    children |   2.147895   .1106786    14.84   0.000     1.941563    2.376153
   education |    1.10324   .0205779     5.27   0.000     1.063636    1.144318
       _cons |   .0156193   .0051862   -12.53   0.000     .0081476     .029943
------------------------------------------------------------------------------
Note: _cons estimates baseline odds.
*/
*--------------------- marginal effects -----------------
* marginal effects at the sample means

margins,dydx(*) atmeans
/*
Conditional marginal effects                    Number of obs     =      2,000
Model VCE    : OIM

Expression   : Pr(work), predict()
dy/dx w.r.t. : age married children education
at           : age             =      36.208 (mean)
               married         =       .6705 (mean)
               children        =      1.6445 (mean)
               education       =      13.084 (mean)

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0115031   .0014236     8.08   0.000     .0087129    .0142934
     married |   .1472934   .0248209     5.93   0.000     .0986453    .1959415
    children |    .151803   .0093768    16.19   0.000     .1334249    .1701812
   education |   .0195096   .0036991     5.27   0.000     .0122596    .0267596
------------------------------------------------------------------------------

*/

*--------------------- marginal effects at specified values -------------------
margins,dydx(*) at(age =30)
/*
Average marginal effects                        Number of obs     =      2,000
Model VCE    : OIM

Expression   : Pr(work), predict()
dy/dx w.r.t. : age married children education
at           : age             =          30

------------------------------------------------------------------------------
             |            Delta-method
             |      dy/dx   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |    .011179   .0014719     7.59   0.000      .008294    .0140639
     married |   .1431427   .0232525     6.16   0.000     .0975687    .1887167
    children |   .1475253   .0074033    19.93   0.000     .1330151    .1620355
   education |   .0189598   .0034727     5.46   0.000     .0121534    .0257662
------------------------------------------------------------------------------
*/
*------------------ correct classification rate ------------------
estat clas
/*
Logistic model for work

              -------- True --------
Classified |         D            ~D  |      Total
-----------+--------------------------+-----------
     +     |      1177           361  |       1538
     -     |       166           296  |        462
-----------+--------------------------+-----------
   Total   |      1343           657  |       2000

Classified + if predicted Pr(D) >= .5
True D defined as work != 0
--------------------------------------------------
Sensitivity                     Pr( +| D)   87.64%
Specificity                     Pr( -|~D)   45.05%
Positive predictive value       Pr( D| +)   76.53%
Negative predictive value       Pr(~D| -)   64.07%
--------------------------------------------------
False + rate for true ~D        Pr( +|~D)   54.95%
False - rate for true D         Pr( -| D)   12.36%
False + rate for classified +   Pr(~D| +)   23.47%
False - rate for classified -   Pr( D| -)   35.93%
--------------------------------------------------
Correctly classified                        73.65%
--------------------------------------------------
*/
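Several numbers in the Stata output above can be reproduced by hand from the reported logit coefficients, which helps clarify what each command reports. The following is a minimal Python sketch, assuming only the rounded figures printed in the output (coefficients, sample means, and the classification table), so small rounding differences are expected. It recomputes the odds ratios as $\exp(\hat{\beta}_j)$, the marginal effects at the sample means as $\hat{\beta}_j\,\hat{p}(1-\hat{p})$ with $\hat{p}$ evaluated at the means, and the summary rates of the classification table.

```python
import math

# Logit coefficients copied from the Stata output above.
coef = {
    "age": 0.0579303,
    "married": 0.7417775,
    "children": 0.7644882,
    "education": 0.0982513,
    "_cons": -4.159247,
}
# Sample means copied from the `margins, dydx(*) atmeans` output.
means = {"age": 36.208, "married": 0.6705, "children": 1.6445, "education": 13.084}

# 1) Odds ratios are the exponentiated coefficients.
odds_ratio = {name: math.exp(b) for name, b in coef.items()}

# 2) Marginal effects at the means: beta_j * p * (1 - p), with p evaluated at the means.
xb = coef["_cons"] + sum(coef[name] * means[name] for name in means)
p_bar = 1.0 / (1.0 + math.exp(-xb))
me_at_means = {name: coef[name] * p_bar * (1.0 - p_bar) for name in means}

# 3) Rates from the classification table (cutoff 0.5).
tp, fp = 1177, 361   # classified +: truly D, truly ~D
fn, tn = 166, 296    # classified -: truly D, truly ~D
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + fp + fn + tn)

for name in means:
    print(f"{name:10s} OR={odds_ratio[name]:.6f}  dy/dx at means={me_at_means[name]:.7f}")
print(f"sensitivity={sensitivity:.2%}  specificity={specificity:.2%}  accuracy={accuracy:.2%}")
```

The derivative formula also matches the reported value for married, which indicates that `margins` treated it as a continuous variable here, since the model was specified without factor-variable notation. The low specificity relative to sensitivity shows that the model classifies employed women much better than non-employed women at the 0.5 cutoff.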
