
Generalized Linear Models

In [1]: import numpy as np

In [2]: import statsmodels.api as sm

In [3]: from scipy import stats

In [4]: from matplotlib import pyplot as plt

Example of using GLM on binomial response data. The input response vector in this case is N by 2 (success, failure). This data is taken, with permission, from Jeff Gill (2000), Generalized Linear Models: A Unified Approach.

The response variable is (# of students above the national median for the math section, # of students below).

The explanatory variables are (in column order):

The proportion of low-income families, "LOWINC"
The proportions of minority students, "PERASIAN", "PERBLACK", "PERHISP"
The percentage of minority teachers, "PERMINTE"
The median teacher salary, including benefits, in thousands, "AVSALK"
The mean teacher experience in years, "AVYRSEXP"
The per-pupil expenditures in thousands, "PERSPENK"
The pupil-teacher ratio, "PTRATIO"
The percentage of students taking college credit courses, "PCTAF"
The percentage of charter schools in the district, "PCTCHRT"
The percentage of schools in the district operating year-round, "PCTYRRND"
The following are interaction terms: "PERMINTE_AVYRSEXP", "PERMINTE_AVSAL",
"AVYRSEXP_AVSAL", "PERSPEN_PTRATIO", "PERSPEN_PCTAF", "PTRATIO_PCTAF",
"PERMINTE_AVYRSEXP_AVSAL", "PERSPEN_PTRATIO_PCTAF"
In [5]: data = sm.datasets.star98.load()

In [6]: data.exog = sm.add_constant(data.exog)

The response variable is (success, failure). E.g., the first observation is:

In [7]: print data.endog[0]
[ 452.  355.]

This gives a total of 452 + 355 = 807 trials for this observation (data.endog[0].sum()).

In [8]: glm_binom = sm.GLM(data.endog, data.exog, family=sm.families.Binomial())

In [9]: binom_results = glm_binom.fit()

The estimated parameters are

In [10]: print binom_results.params
[-0.01681504  0.00992548 -0.01872421 -0.01423856  0.25448717  0.24069366
  0.08040867 -1.9521605  -0.33408647 -0.16902217  0.0049167  -0.00357996
 -0.01407656 -0.00400499 -0.0039064   0.0917143   0.04898984  0.00804074
  0.00022201 -0.00224925  2.95887793]

The corresponding t-values are

In [11]: print binom_results.tvalues
[-38.74908321  16.50473627 -25.1821894  -32.81791308   8.49827113
   4.21247925   5.7749976   -6.16191078  -5.45321673  -5.16865445
   3.92119964 -15.87825999  -7.39093058  -8.44963886  -4.05916246
   6.3210987    6.57434662   5.36229044   7.42806363  -6.44513698
   1.91301155]
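
These t-values are simply the coefficient estimates divided by their standard errors. As a quick check (not part of the original session):

    # tvalues should equal params / bse elementwise
    print(np.allclose(binom_results.tvalues, binom_results.params / binom_results.bse))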

It is common in GLMs with interactions to compare first differences. We are interested in how the predicted response changes as an explanatory variable moves across its range. This example uses the interquartile difference for the percentage of low-income households, holding the other explanatory variables constant at their means.

In [12]: means = data.exog.mean(axis=0)

In [13]: means25 = means.copy()

In [14]: means25[0] = stats.scoreatpercentile(data.exog[:,0], 25)

In [15]: means75 = means.copy()

In [16]: means75[0] = stats.scoreatpercentile(data.exog[:,0], 75)

In [17]: resp_25 = binom_results.predict(means25)

In [18]: resp_75 = binom_results.predict(means75)

In [19]: diff = resp_75 - resp_25

The interquartile first difference for the percentage of low-income households in a school district is

In [20]: print diff*100
-11.8752509918
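
The same computation can be wrapped in a small helper. A minimal sketch, not part of the original session; the function name first_difference is ours, and it assumes the binom_results and data objects from above:

    from scipy import stats

    def first_difference(results, exog, col, lo=25, hi=75):
        # Change in the predicted response as exog[:, col] moves from its
        # lo-th to its hi-th percentile, other columns held at their means.
        base = exog.mean(axis=0)
        x_lo = base.copy()
        x_hi = base.copy()
        x_lo[col] = stats.scoreatpercentile(exog[:, col], lo)
        x_hi[col] = stats.scoreatpercentile(exog[:, col], hi)
        return results.predict(x_hi) - results.predict(x_lo)

    print(first_difference(binom_results, data.exog, 0) * 100)  # ~ -11.88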

In [21]: means0 = means.copy()

In [22]: means100 = means.copy()

In [23]: means0[0] = data.exog[:,0].min()

In [24]: means100[0] = data.exog[:,0].max()

In [25]: resp_0 = binom_results.predict(means0)

In [26]: resp_100 = binom_results.predict(means100)

In [27]: diff_full = resp_100 - resp_0

In [28]: print """The full range difference is %2.4f %%""" % (diff_full*100)
The full range difference is -36.1636 %

In [29]: nobs = binom_results.nobs

In [30]: y = data.endog[:,0]/data.endog.sum(1)

In [31]: yhat = binom_results.mu
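
binom_results.mu holds the fitted mean response. As a quick consistency check (not part of the original session), it should match predictions at the observed design matrix:

    print(np.allclose(yhat, binom_results.predict(data.exog)))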

Plot of yhat vs. y

In [32]: plt.figure()
Out[32]: <matplotlib.figure.Figure at 0x6346450>

In [33]: plt.scatter(yhat, y)
Out[33]: <matplotlib.collections.PathCollection at 0x617dc10>

In [34]: line_fit = sm.OLS(y, sm.add_constant(yhat)).fit().params

In [35]: fit = lambda x: line_fit[1] + line_fit[0]*x  # intercept + slope*x; a scipy alternative is sketched below

In [36]: plt.plot(np.linspace(0,1,nobs), fit(np.linspace(0,1,nobs)))
Out[36]: [<matplotlib.lines.Line2D at 0x6180750>]

In [37]: plt.title('Model Fit Plot')
Out[37]: <matplotlib.text.Text at 0x65391d0>

In [38]: plt.ylabel('Observed values')
Out[38]: <matplotlib.text.Text at 0x66c9050>

In [39]: plt.xlabel('Fitted values')
Out[39]: <matplotlib.text.Text at 0x60c57d0>
[Figure: glm_fitted.png]
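
The line fit above can also be computed with scipy.stats.linregress (a sketch, not part of the original session):

    from scipy import stats

    # slope/intercept of the observed-vs-fitted regression line
    slope, intercept, rvalue, pvalue, stderr = stats.linregress(yhat, y)
    grid = np.linspace(0, 1, nobs)
    plt.plot(grid, intercept + slope * grid)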

Plot of yhat vs. Pearson residuals

In [40]: plt.figure()
Out[40]: <matplotlib.figure.Figure at 0x6441a10>

In [41]: plt.scatter(yhat, binom_results.resid_pearson)
Out[41]: <matplotlib.collections.PathCollection at 0x6d2d5d0>

In [42]: plt.plot([0.0, 1.0],[0.0, 0.0], 'k-')
Out[42]: [<matplotlib.lines.Line2D at 0x617d0d0>]

In [43]: plt.title('Residual Dependence Plot')
Out[43]: <matplotlib.text.Text at 0x6d42690>

In [44]: plt.ylabel('Pearson Residuals')
Out[44]: <matplotlib.text.Text at 0x6d31a10>

In [45]: plt.xlabel('Fitted values')
Out[45]: <matplotlib.text.Text at 0x6d2d090>
[Figure: glm_resids.png]
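
For reference, a Pearson residual scales the raw residual by the square root of the variance function. A minimal sketch of the definition (not part of the original session), assuming the per-district trial counts enter the variance as mu*(1 - mu)/n; the exact weighting used by statsmodels may differ:

    n_trials = data.endog.sum(1)
    # (observed proportion - fitted mean) / sqrt(variance of a proportion)
    pearson = (y - yhat) / np.sqrt(yhat * (1 - yhat) / n_trials)
    print(np.allclose(pearson, binom_results.resid_pearson))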

Histogram of standardized deviance residuals

In [46]: plt.figure()
Out[46]: <matplotlib.figure.Figure at 0x6d08990>

In [47]: res = binom_results.resid_deviance.copy()

In [48]: stdres = (res - res.mean())/res.std()

In [49]: plt.hist(stdres, bins=25)
Out[49]: 
(array([ 3,  2,  4, 11, 11, 20, 27, 31, 33, 41, 28, 26, 23, 17, 11,  4,  2,
        3,  1,  2,  1,  0,  1,  0,  1]),
 array([-2.54284768, -2.27049293, -1.99813819, -1.72578344, -1.45342869,
       -1.18107394, -0.9087192 , -0.63636445, -0.3640097 , -0.09165496,
        0.18069979,  0.45305454,  0.72540929,  0.99776403,  1.27011878,
        1.54247353,  1.81482827,  2.08718302,  2.35953777,  2.63189252,
        2.90424726,  3.17660201,  3.44895676,  3.7213115 ,  3.99366625,
        4.266021  ]),
 <a list of 25 Patch objects>)

In [50]: plt.title('Histogram of standardized deviance residuals')
Out[50]: <matplotlib.text.Text at 0x6251690>
[Figure: glm_hist_res.png]
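
For reference, the deviance residual for observation i is sign(y_i - mu_i) * sqrt(d_i), where d_i is that observation's contribution to the model deviance, so the squared deviance residuals sum to the deviance. A quick check (not part of the original session):

    print(np.allclose((binom_results.resid_deviance**2).sum(), binom_results.deviance))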

QQ Plot of Deviance Residuals

In [51]: plt.figure()
Out[51]: <matplotlib.figure.Figure at 0x62758d0>

In [52]: res.sort()

In [53]: p = np.linspace(0 + 1./(nobs-1), 1-1./(nobs-1), nobs)

In [54]: quants = np.zeros_like(res)

In [55]: for i in range(nobs):
   ....:      quants[i] = stats.scoreatpercentile(res, p[i]*100)
   ....: 
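
The percentile loop above can be collapsed into a single vectorized call, assuming a NumPy version whose np.percentile accepts a sequence of percentiles:

    # equivalent to the loop over stats.scoreatpercentile
    quants = np.percentile(res, p * 100)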

In [56]: mu = res.mean()

In [57]: sigma = res.std()

In [58]: y = stats.norm.ppf(p, loc=mu, scale=sigma)

In [59]: plt.scatter(y, quants)
Out[59]: <matplotlib.collections.PathCollection at 0x6da0b90>

In [60]: plt.plot([y.min(),y.max()],[y.min(),y.max()],'r--')
Out[60]: [<matplotlib.lines.Line2D at 0x6d68550>]

In [61]: plt.title('Normal - Quantile Plot')
Out[61]: <matplotlib.text.Text at 0x6d8b890>

In [62]: plt.ylabel('Deviance Residuals Quantiles')
Out[62]: <matplotlib.text.Text at 0x6d81e10>

In [63]: plt.xlabel('Quantiles of N(0,1)')
Out[63]: <matplotlib.text.Text at 0x62574d0>

In [64]: from statsmodels import graphics

In [65]: img = graphics.gofplots.qqplot(res, line='r')
[Figure: glm_qqplot.png]

Example of using a GLM with a Gamma family for a proportional count response. Brief description of the data and design:

In [66]: print sm.datasets.scotland.DESCRLONG

This data is based on the example in Gill and describes the proportion of
voters who voted Yes to grant the Scottish Parliament taxation powers.
The data are divided into 32 council districts.  This example's explanatory
variables include the amount of council tax collected in pounds sterling as
of April 1997 per two adults before adjustments, the female percentage of
total claims for unemployment benefits as of January, 1998, the standardized
mortality rate (UK is 100), the percentage of labor force participation,
regional GDP, the percentage of children aged 5 to 15, and an interaction term
between female unemployment and the council tax.

The original source files and variable information are included in
/scotland/src/


In [67]: data2 = sm.datasets.scotland.load()

In [68]: data2.exog = sm.add_constant(data2.exog)

In [69]: glm_gamma = sm.GLM(data2.endog, data2.exog, family=sm.families.Gamma())

In [70]: glm_results = glm_gamma.fit()
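
To inspect the Gamma fit, the parameter estimates and full summary can be printed just as for the binomial model (output omitted here):

    print(glm_results.params)
    print(glm_results.summary())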

Example of a Gaussian GLM with a noncanonical (log) link

In [71]: nobs2 = 100

In [72]: x = np.arange(nobs2)

In [73]: np.random.seed(54321)

In [74]: X = np.column_stack((x,x**2))

In [75]: X = sm.add_constant(X)

In [76]: lny = np.exp(-(.03*x + .0001*x**2 - 1.0)) + .001 * np.random.rand(nobs2)

In [77]: gauss_log = sm.GLM(lny, X, family=sm.families.Gaussian(sm.families.links.log))

In [78]: gauss_log_results = gauss_log.fit()
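
Since the data were generated with known coefficients, the fit can be sanity-checked (not part of the original session). With a log link the estimates should be close to the generating values, roughly -0.03 for x and -0.0001 for x**2, with an intercept near 1.0 (the constant is the last column here):

    print(gauss_log_results.params)  # expect roughly [-0.03, -0.0001, 1.0]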

Check the summary of the binomial model fit:

In [79]: print binom_results.summary()
                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:           ['y1', 'y2']   No. Observations:                  303
Model:                            GLM   Df Residuals:                      282
Model Family:                Binomial   Df Model:                           20
Link Function:                  logit   Scale:                             1.0
Method:                          IRLS   Log-Likelihood:                -2998.6
Date:                Wed, 11 Jul 2012   Deviance:                       4078.8
Time:                        19:52:11   Pearson chi2:                 4.05e+03
No. Iterations:                     6                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -0.0168      0.000    -38.749      0.000        -0.018    -0.016
x2             0.0099      0.001     16.505      0.000         0.009     0.011
x3            -0.0187      0.001    -25.182      0.000        -0.020    -0.017
x4            -0.0142      0.000    -32.818      0.000        -0.015    -0.013
x5             0.2545      0.030      8.498      0.000         0.196     0.313
x6             0.2407      0.057      4.212      0.000         0.129     0.353
x7             0.0804      0.014      5.775      0.000         0.053     0.108
x8            -1.9522      0.317     -6.162      0.000        -2.573    -1.331
x9            -0.3341      0.061     -5.453      0.000        -0.454    -0.214
x10           -0.1690      0.033     -5.169      0.000        -0.233    -0.105
x11            0.0049      0.001      3.921      0.000         0.002     0.007
x12           -0.0036      0.000    -15.878      0.000        -0.004    -0.003
x13           -0.0141      0.002     -7.391      0.000        -0.018    -0.010
x14           -0.0040      0.000     -8.450      0.000        -0.005    -0.003
x15           -0.0039      0.001     -4.059      0.000        -0.006    -0.002
x16            0.0917      0.015      6.321      0.000         0.063     0.120
x17            0.0490      0.007      6.574      0.000         0.034     0.064
x18            0.0080      0.001      5.362      0.000         0.005     0.011
x19            0.0002   2.99e-05      7.428      0.000         0.000     0.000
x20           -0.0022      0.000     -6.445      0.000        -0.003    -0.002
const          2.9589      1.547      1.913      0.057        -0.073     5.990
==============================================================================
