
Weighted Least Squares

In [1]: import numpy as np

In [2]: import statsmodels.api as sm

In [3]: import matplotlib.pyplot as plt

In [4]: data = sm.datasets.ccard.load()

In [5]: data.exog = sm.add_constant(data.exog, prepend=False)

In [6]: ols_fit = sm.OLS(data.endog, data.exog).fit()

perhaps the residuals from this fit depend on the square of income

In [7]: incomesq = data.exog[:,2]

In [8]: plt.scatter(incomesq, ols_fit.resid)
 Out[8]: <matplotlib.collections.PathCollection at 0x7c18d90>

In [9]: plt.grid()
../../_images/wls_resid_check.png

If we think that the variance is proportional to income**2, we would want to weight the regression by income; the weights argument in WLS weights the regression by its square root, so weights=1/incomesq means the data are divided by income. Since income enters the equation, income/income becomes the constant, so we would want to perform this type of regression without an explicit constant in the design.

In [10]: wls_fit = sm.WLS(data.endog, data.exog[:,:-1], weights=1/incomesq).fit()
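
As a quick sanity check on what the weights argument does, here is a minimal sketch (not part of the original session) verifying that WLS with weights w gives the same parameters as OLS on endog and exog multiplied by sqrt(w):

# WLS whitens the data by multiplying endog and exog with sqrt(weights)
sqrtw = np.sqrt(1/incomesq)
ols_scaled = sm.OLS(data.endog*sqrtw, data.exog[:,:-1]*sqrtw[:,None]).fit()
np.allclose(wls_fit.params, ols_scaled.params)   # should be True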

Dropping the constant, however, leads to difficulties in interpreting the post-estimation statistics. Statsmodels does not yet handle this elegantly, but the following may be more appropriate:

explained sum of squares

In [11]: ess = wls_fit.uncentered_tss - wls_fit.ssr

rsquared

In [12]: rsquared = ess/wls_fit.uncentered_tss

mean squared error of the model

In [13]: mse_model = ess/(wls_fit.df_model + 1) # add back the dof of the constant

f statistic

In [14]: fvalue = mse_model/wls_fit.mse_resid

adjusted r-squared

In [15]: rsquared_adj = 1 -(wls_fit.nobs)/(wls_fit.df_resid)*(1-rsquared)
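
These corrections can be bundled into a small helper; a sketch (the function name is made up here, not a statsmodels API):

def wls_corrected_stats(res):
    # post-estimation statistics based on the uncentered total sum of squares
    ess = res.uncentered_tss - res.ssr
    rsquared = ess/res.uncentered_tss
    mse_model = ess/(res.df_model + 1)   # add back the dof of the constant
    fvalue = mse_model/res.mse_resid
    rsquared_adj = 1 - res.nobs/res.df_resid*(1 - rsquared)
    return rsquared, rsquared_adj, fvalue

wls_corrected_stats(wls_fit)   # same numbers as the step-by-step calculation above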

Trying to figure out what’s going on in this example

JP: I need to look at this again. Even if I exclude the weight variable from the regressors and keep the constant in, the reported rsquared stays small. Below we also compare using the square or the square root of the weight variable. TODO: add a 45 degree line to the graphs.

In [16]: wls_fit3 = sm.WLS(data.endog, data.exog[:,(0,1,3,4)], weights=1/incomesq).fit()

In [17]: print wls_fit3.summary()
                            WLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.052
Model:                            WLS   Adj. R-squared:                  0.010
Method:                 Least Squares   F-statistic:                     1.232
Date:                Wed, 11 Jul 2012   Prob (F-statistic):              0.305
Time:                        19:52:23   Log-Likelihood:                -487.93
No. Observations:                  72   AIC:                             983.9
Df Residuals:                      68   BIC:                             993.0
Df Model:                           3                                         
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -2.1789      3.732     -0.584      0.561        -9.626     5.268
x2           104.7881     25.577      4.097      0.000        53.750   155.826
x3            58.2829     58.287      1.000      0.321       -58.028   174.594
const        -46.1074    105.445     -0.437      0.663      -256.520   164.305
==============================================================================
Omnibus:                       49.275   Durbin-Watson:                   1.625
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              165.137
Skew:                           2.190   Prob(JB):                     1.38e-36
Kurtosis:                       8.988   Cond. No.                         146.
==============================================================================

In [18]: print 'corrected rsquared',
corrected rsquared

In [19]: print (wls_fit3.uncentered_tss - wls_fit3.ssr)/wls_fit3.uncentered_tss
0.541521419331

In [20]: plt.figure();

In [21]: plt.title('WLS dropping heteroscedasticity variable from regressors');

In [22]: plt.plot(data.endog, wls_fit3.fittedvalues, 'o');

In [23]: plt.xlim([0,2000]);

In [24]: plt.ylim([0,2000]);

In [25]: print 'raw correlation of endog and fittedvalues'
raw correlation of endog and fittedvalues

In [26]: print np.corrcoef(data.endog, wls_fit.fittedvalues)
[[ 1.          0.45290477]
 [ 0.45290477  1.        ]]

In [27]: print 'raw correlation coefficient of endog and fittedvalues squared'
raw correlation coefficient of endog and fittedvalues squared

In [28]: print np.corrcoef(data.endog, wls_fit.fittedvalues)[0,1]**2
0.205122735117
../../_images/wls_drop_het.png

Compare with robust regression; the heteroscedasticity correction downweights the outliers.

In [29]: rlm_fit = sm.RLM(data.endog, data.exog).fit()

In [30]: plt.figure();

In [31]: plt.title('using robust for comparison');

In [32]: plt.plot(data.endog, rlm_fit.fittedvalues, 'o');

In [33]: plt.xlim([0,2000]);

In [34]: plt.ylim([0,2000]);
../../_images/wls_robust_compare.png

What is going on? A more systematic look at the data

two helper functions

In [35]: def getrsq(fitresult):
   ....:      '''calculates rsquared, residual, total and explained sums of squares
   ....:      Parameters
   ....:      ----------
   ....:      fitresult : instance of Regression Result class, or tuple of (resid, endog) arrays
   ....:          regression residuals and endogenous variable
   ....:      Returns
   ....:      -------
   ....:      rsquared
   ....:      residual sum of squares
   ....:      (centered) total sum of squares
   ....:      explained sum of squares (for centered)
   ....:      '''
   ....:      if hasattr(fitresult, 'resid') and hasattr(fitresult, 'model'):
   ....:          resid = fitresult.resid
   ....:          endog = fitresult.model.endog
   ....:          nobs = fitresult.nobs
   ....:      else:
   ....:          resid = fitresult[0]
   ....:          endog = fitresult[1]
   ....:          nobs = resid.shape[0]
   ....:      rss = np.dot(resid, resid)
   ....:      tss = np.var(endog)*nobs
   ....:      return 1-rss/tss, rss, tss, tss-rss
   ....: 

In [36]: def index_trim_outlier(resid, k):
   ....:      '''returns indices to residual array with k outliers removed
   ....:      Parameters
   ....:      ----------
   ....:      resid : array_like, 1d
   ....:          data vector, usually residuals of a regression
   ....:      k : int
   ....:          number of outliers to remove
   ....:      Returns
   ....:      -------
   ....:      trimmed_index : array, 1d
   ....:          index array with k outliers removed
   ....:      outlier_index : array, 1d
   ....:          index array of k outliers
   ....:      Notes
   ....:      -----
   ....:      Outliers are defined as the k observations with the largest
   ....:      absolute values.
   ....:      '''
   ....:      sort_index = np.argsort(np.abs(resid))
   ....:      # index of non-outlier
   ....:      trimmed_index = np.sort(sort_index[:-k])
   ....:      outlier_index = np.sort(sort_index[-k:])
   ....:      return trimmed_index, outlier_index
   ....: 

Comparing estimation results for ols, rlm and wls with and without outliers

ols_test_fit = sm.OLS(data.endog, data.exog).fit()

In [37]: olskeep, olsoutl = index_trim_outlier(ols_fit.resid, 2)

In [38]: print 'ols outliers', olsoutl, ols_fit.resid[olsoutl]
ols outliers [22 44] [ 1029.99405449  1460.61975104]

In [39]: ols_fit_rm2 = sm.OLS(data.endog[olskeep], data.exog[olskeep,:]).fit()

In [40]: rlm_fit_rm2 = sm.RLM(data.endog[olskeep], data.exog[olskeep,:]).fit()

weights = 1/incomesq

In [41]: results = [ols_fit, ols_fit_rm2, rlm_fit, rlm_fit_rm2]

Note: I think incomesq is already squared (i.e. income**2).
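
A quick check of that note; a sketch assuming the ccard exog columns are AGE, INCOME, INCOMESQ, OWNRENT plus the appended constant:

np.allclose(incomesq, data.exog[:,1]**2)   # True if INCOMESQ is literally INCOME**2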

In [42]: for weights in [1/incomesq, 1/incomesq**2, np.sqrt(incomesq)]:
   ....:      print '\nComparison OLS and WLS with and without outliers'
   ....:      wls_fit0 = sm.WLS(data.endog, data.exog, weights=weights).fit()
   ....:      wls_fit_rm2 = sm.WLS(data.endog[olskeep], data.exog[olskeep,:],
   ....:                           weights=weights[olskeep]).fit()
   ....:      wlskeep, wlsoutl = index_trim_outlier(ols_fit.resid, 2)
   ....:      print '2 outliers candidates and residuals'
   ....:      print wlsoutl, wls_fit.resid[olsoutl]
   ....:      # redundant because ols and wls outliers are the same:
   ....:      ##wls_fit_rm2_ = sm.WLS(data.endog[wlskeep], data.exog[wlskeep,:],
   ....:      ##                     weights=1/incomesq[wlskeep]).fit()
   ....:      print 'outliers ols, wls:', olsoutl, wlsoutl
   ....:      print 'rsquared'
   ....:      print 'ols vs ols rm2', ols_fit.rsquared, ols_fit_rm2.rsquared
   ....:      print 'wls vs wls rm2', wls_fit0.rsquared, wls_fit_rm2.rsquared #, wls_fit_rm2_.rsquared
   ....:      print 'compare R2_resid  versus  R2_wresid'
   ....:      print 'ols minus 2', getrsq(ols_fit_rm2)[0],
   ....:      print getrsq((ols_fit_rm2.wresid, ols_fit_rm2.model.wendog))[0]
   ....:      print 'wls        ', getrsq(wls_fit)[0],
   ....:      print getrsq((wls_fit.wresid, wls_fit.model.wendog))[0]
   ....:      print 'wls minus 2', getrsq(wls_fit_rm2)[0],
   ....:      # next is same as wls_fit_rm2.rsquared for cross checking
   ....:      print getrsq((wls_fit_rm2.wresid, wls_fit_rm2.model.wendog))[0]
   ....:      #print getrsq(wls_fit_rm2_)[0],
   ....:      #print getrsq((wls_fit_rm2_.wresid, wls_fit_rm2_.model.wendog))[0]
   ....:      results.extend([wls_fit0, wls_fit_rm2])
   ....: 

Comparison OLS and WLS with and without outliers
2 outliers candidates and residuals
[22 44] [ 1056.13211484  1533.40818784]
outliers ols, wls: [22 44] [22 44]
rsquared
ols vs ols rm2 0.243577916822 0.317459940305
wls vs wls rm2 0.0593688805068 0.0679070566139
compare R2_resid  versus  R2_wresid
ols minus 2 0.317459940305 0.317459940305
wls         0.198643181027 0.050000418999
wls minus 2 0.311233296924 0.0679070566139

Comparison OLS and WLS with and without outliers
2 outliers candidates and residuals
[22 44] [ 1056.13211484  1533.40818784]
outliers ols, wls: [22 44] [22 44]
rsquared
ols vs ols rm2 0.243577916822 0.317459940305
wls vs wls rm2 0.0666184777739 0.127696536358
compare R2_resid  versus  R2_wresid
ols minus 2 0.317459940305 0.317459940305
wls         0.198643181027 0.050000418999
wls minus 2 0.293669036814 0.127696536358

Comparison OLS and WLS with and without outliers
2 outliers candidates and residuals
[22 44] [ 1056.13211484  1533.40818784]
outliers ols, wls: [22 44] [22 44]
rsquared
ols vs ols rm2 0.243577916822 0.317459940305
wls vs wls rm2 0.353266862329 0.546812251767
compare R2_resid  versus  R2_wresid
ols minus 2 0.317459940305 0.317459940305
wls         0.198643181027 0.050000418999
wls minus 2 0.312186685643 0.546812251767

In [43]: print '     ols             ols_rm2       rlm           rlm_rm2     wls (lin)    wls_rm2 (lin)   wls (squ)   wls_rm2 (squ)  wls (sqrt)   wls_rm2 (sqrt)'
     ols             ols_rm2       rlm           rlm_rm2     wls (lin)    wls_rm2 (lin)   wls (squ)   wls_rm2 (squ)  wls (sqrt)   wls_rm2 (sqrt)

In [44]: print 'Parameter estimates'
Parameter estimates

In [45]: print np.column_stack([r.params for r in results])
[[  -3.08181404   -5.06103843   -4.98510966   -5.34410309   -2.69418516
    -3.1305703    -1.43815462   -1.58893054   -3.57074829   -6.80053364]
 [ 234.34702702  115.08753715  129.85391456  109.01433492  158.42697752
   128.38182357   60.95113284  100.25000841  254.82166855  103.75834726]
 [ -14.99684418   -5.77558429   -6.46204829   -4.77409191   -7.24928987
    -7.41228893    6.84943071   -3.34972494  -16.40524256   -4.5924465 ]
 [  27.94090839   85.46566835   89.91389709   95.85086459   60.44877369
    79.7759146    55.9884469    60.97199734   -3.8085159    84.69170048]
 [-237.1465136    39.51639838  -15.50014814   31.39771833 -114.10886935
   -40.04207242   -6.41976501  -38.83583228 -260.72084271  117.20540179]]

In [46]: print 'R2 original data, next line R2 weighted data'
R2 original data, next line R2 weighted data

In [47]: print np.column_stack([getattr(r, 'rsquared', None) for r in results])
[[0.243577916822 0.317459940305 None None 0.0593688805068 0.0679070566139
  0.0666184777739 0.127696536358 0.353266862329 0.546812251767]]

In [48]: print 'Standard errors'
Standard errors

In [49]: print np.column_stack([getattr(r, 'bse', None) for r in results])
[[   5.51471653    3.31028758    2.61580069    2.39537089    3.80730631
     2.90027255    2.71141739    2.46959477    6.37593755    3.39477842]
 [  80.36595035   49.35949263   38.12005692   35.71722666   76.39115431
    58.35231328   87.18452039   80.30086861   86.99568216   47.58202096]
 [   7.46933695    4.55366113    3.54293763    3.29509357    9.72433732
     7.41259156   15.15205888   14.10674821    7.18302629    3.91640711]
 [  82.92232357   50.54681754   39.33262384   36.57639175   58.55088753
    44.82218676   43.11017757   39.31097542   96.4077482    52.57314209]
 [ 199.35166485  122.1287718    94.55866295   88.3741058   139.68749646
   106.89445525  115.79258539  105.99258363  239.38105863  130.32619908]]

In [50]: print 'Heteroscedasticity robust standard errors (with ols)'
Heteroscedasticity robust standard errors (with ols)

In [51]: print 'with outliers'
with outliers

In [52]: print np.column_stack([getattr(ols_fit, se, None) for se in ['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se']])
[[   3.30166123    3.42264107    3.4477148     3.60462409]
 [  88.86635165   92.12260235   92.08368378   95.48159869]
 [   6.94456348    7.19902694    7.19953754    7.47634779]
 [  92.18777672   95.56573144   95.67211143   99.31427277]
 [ 212.9905298   220.79495237  221.08892661  229.57434782]]

a quick bootstrap analysis

(I didn’t check whether this is fully correct statistically)

With OLS on full sample

In [53]: nobs, nvar = data.exog.shape

In [54]: niter = 2000

In [55]: bootres = np.zeros((niter, nvar*2))

In [56]: for it in range(niter):
   ....:      rind = np.random.randint(nobs, size=nobs)
   ....:      endog = data.endog[rind]
   ....:      exog = data.exog[rind,:]
   ....:      res = sm.OLS(endog, exog).fit()
   ....:      bootres[it, :nvar] = res.params
   ....:      bootres[it, nvar:] = res.bse
   ....: 

In [57]: np.set_printoptions(linewidth=200)

In [58]: print 'Bootstrap Results of parameters and parameter standard deviation  OLS'
Bootstrap Results of parameters and parameter standard deviation  OLS

In [59]: print 'Parameter estimates'
Parameter estimates

In [60]: print 'median', np.median(bootres[:,:5], 0)
median [  -3.18762808  230.28190284  -14.91350869   31.31426536 -230.76507856]

In [61]: print 'mean  ', np.mean(bootres[:,:5], 0)
mean   [  -2.83204335  237.04728999  -15.16402208   26.36037718 -250.6443012 ]

In [62]: print 'std   ', np.std(bootres[:,:5], 0)
std    [   3.83381044   98.71047041    9.45405218   95.45545876  225.33289039]

In [63]: print 'Standard deviation of parameter estimates'
Standard deviation of parameter estimates

In [64]: print 'median', np.median(bootres[:,5:], 0)
median [   5.43542602   82.08961746    7.59405275   81.36978696  201.58548065]

In [65]: print 'mean  ', np.mean(bootres[:,5:], 0)
mean   [   5.46029027   86.34490341    8.57550315   80.77601548  203.356959  ]

In [66]: print 'std   ', np.std(bootres[:,5:], 0)
std    [  1.42063357  30.34539049   4.32578572  19.13318826  55.7301691 ]
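
Percentile-type confidence intervals could be read off the same bootstrap draws; a minimal sketch (not part of the original example and, as noted above, not checked for statistical correctness):

# 95% percentile intervals of the bootstrapped OLS parameter estimates
ci_low = np.percentile(bootres[:,:nvar], 2.5, axis=0)
ci_high = np.percentile(bootres[:,:nvar], 97.5, axis=0)
np.column_stack([ci_low, ci_high])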

In [67]: plt.figure()
Out[67]: <matplotlib.figure.Figure at 0x5f7f5d0>

In [68]: for i in range(4):
   ....:      plt.subplot(2,2,i+1)
   ....:      plt.hist(bootres[:,i],50)
   ....:      plt.title('var%d'%i)
   ....: 

In [69]: plt.figtext(0.5, 0.935,  'OLS Bootstrap',
   ....:                 ha='center', color='black', weight='bold', size='large')
   ....: 
Out[69]: <matplotlib.text.Text at 0x5f80590>
../../_images/wls_bootstrap.png

With WLS on sample with outliers removed

In [70]: data_endog = data.endog[olskeep]

In [71]: data_exog = data.exog[olskeep,:]

In [72]: incomesq_rm2 = incomesq[olskeep]

In [73]: nobs, nvar = data_exog.shape

In [74]: niter = 500  # a bit slow

In [75]: bootreswls = np.zeros((niter, nvar*2))

In [76]: for it in range(niter):
   ....:      rind = np.random.randint(nobs, size=nobs)
   ....:      endog = data_endog[rind]
   ....:      exog = data_exog[rind,:]
   ....:      res = sm.WLS(endog, exog, weights=1/incomesq_rm2[rind]).fit()
   ....:      bootreswls[it, :nvar] = res.params
   ....:      bootreswls[it, nvar:] = res.bse
   ....: 

In [77]: print 'Bootstrap Results of parameters and parameter standard deviation',
Bootstrap Results of parameters and parameter standard deviation

In [78]: print 'WLS removed 2 outliers from sample'
WLS removed 2 outliers from sample

In [79]: print 'Parameter estimates'
Parameter estimates

In [80]: print 'median', np.median(bootreswls[:,:5], 0)
median [  -3.85414013  134.81822248   -8.87959655   87.61078396  -34.89330649]

In [81]: print 'mean  ', np.mean(bootreswls[:,:5], 0)
mean   [  -3.6211722   136.14971578   -8.98259132   87.41910954  -48.11761555]

In [82]: print 'std   ', np.std(bootreswls[:,:5], 0)
std    [   2.85719978   62.33377642    7.75700445   48.78693317  110.93812134]

In [83]: print 'Standard deviation of parameter estimates'
Standard deviation of parameter estimates

In [84]: print 'median', np.median(bootreswls[:,5:], 0)
median [   2.93450772   59.94331313    6.80259478   44.86159298  120.94206658]

In [85]: print 'mean  ', np.mean(bootreswls[:,5:], 0)
mean   [   2.96668034   60.64580021    6.99451597   45.80030781  122.42460949]

In [86]: print 'std   ', np.std(bootreswls[:,5:], 0)
std    [  0.52108974  11.18651362   1.5955339    7.77500052  22.75644625]

In [87]: plt.figure()
Out[87]: <matplotlib.figure.Figure at 0x826b5d0>

In [88]: for i in range(4):
   ....:      plt.subplot(2,2,i+1)
   ....:      plt.hist(bootreswls[:,i],50)
   ....:      plt.title('var%d'%i)
   ....: 

In [89]: plt.figtext(0.5, 0.935,  'WLS rm2 Bootstrap',
   ....:                 ha='center', color='black', weight='bold', size='large')
   ....: 
Out[89]: <matplotlib.text.Text at 0x826b950>
../../_images/wls_bootstrap_rm2.png
The following are random variables, not fixed by a seed:

Bootstrap Results of parameters and parameter standard deviation
OLS

Parameter estimates
median [  -3.26216383  228.52546429  -14.57239967   34.27155426 -227.02816597]
mean   [  -2.89855173  234.37139359  -14.98726881   27.96375666 -243.18361746]
std    [   3.78704907   97.35797802    9.16316538   94.65031973  221.79444244]

Standard deviation of parameter estimates
median [   5.44701033   81.96921398    7.58642431   80.64906783  200.19167735]
mean   [   5.44840542   86.02554883    8.56750041   80.41864084  201.81196849]
std    [   1.43425083   29.74806562    4.22063268   19.14973277   55.34848348]

Bootstrap Results of parameters and parameter standard deviation
WLS removed 2 outliers from sample

Parameter estimates
median [  -3.95876112  137.10419042   -9.29131131   88.40265447  -44.21091869]
mean   [  -3.67485724  135.42681207   -8.7499235    89.74703443  -46.38622848]
std    [   2.96908679   56.36648967    7.03870751   48.51201918  106.92466097]

Standard deviation of parameter estimates
median [   2.89349748   59.19454402    6.70583332   45.40987953  119.05241283]
mean   [   2.97600894   60.14540249    6.92102065   45.66077486  121.35519673]
std    [   0.55378808   11.77831934    1.69289179    7.4911526    23.72821085]

Conclusion: problems with outliers and possibly heteroscedasticity

In the bootstrap results:

  • bse in OLS underestimates the standard deviation of the parameters, compared to the standard deviation in the bootstrap (a side-by-side comparison is sketched after this list)
  • the OLS heteroscedasticity-corrected standard errors for the original data (above) are close to the bootstrap std
  • WLS with 2 outliers removed gives a relatively good match between the mean or median bse and the std of the parameter estimates in the bootstrap
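
A side-by-side look at these quantities, as a sketch reusing ols_fit, bootres and nvar from above:

# columns: OLS bse, heteroscedasticity-robust HC1 se, bootstrap std of the parameter estimates
np.column_stack([ols_fit.bse, ols_fit.HC1_se, bootres[:,:nvar].std(0)])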

We could also include rsquared in the bootstrap, and do the same for RLM; a sketch follows below. The problems could also mean that the linearity assumption is violated, e.g. try a non-linear transformation of the exog variables that is still linear in the parameters.
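
A sketch of what that could look like (hypothetical; it reuses the resampling pattern from above, and since RLM results have no rsquared only params and bse are collected for RLM):

niter = 500
nobs, nvar = data.exog.shape
boot_r2 = np.zeros(niter)                  # rsquared of each OLS bootstrap fit
bootres_rlm = np.zeros((niter, nvar*2))    # RLM params and bse
for it in range(niter):
    rind = np.random.randint(nobs, size=nobs)
    endog, exog = data.endog[rind], data.exog[rind,:]
    boot_r2[it] = sm.OLS(endog, exog).fit().rsquared
    res_rlm = sm.RLM(endog, exog).fit()
    bootres_rlm[it, :nvar] = res_rlm.params
    bootres_rlm[it, nvar:] = res_rlm.bse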

for statsmodels

  • In this case rsquared for original data looks less random/arbitrary.

  • Don’t change the definition of rsquared from centered tss to uncentered tss when calculating rsquared in WLS if the original exog contains a constant. The increase in rsquared because of a change in definition will be very misleading.

  • Whether or not there is a constant in the transformed exog, wexog, might also affect the degrees of freedom calculation, but I haven’t checked this. I would guess that the df_model should stay the same, but this needs to be verified with a textbook.

  • df_model has to be adjusted if the original data does not have a constant, e.g. when regressing an endog on a single exog variable without a constant. This case might also require a redefinition of the rsquared and F statistic for the regression anova to use the uncentered tss. This can be done through a keyword parameter to model.__init__ or through autodetection with hasconst = (exog.var(0) < 1e-10).any() (a quick check is sketched after this list). I’m not sure about fixed effects with a full dummy set but without a constant; in that case autodetection wouldn’t work this way. Also, I’m not sure whether a ddof keyword parameter can also handle the hasconst case.
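
The autodetection mentioned in the last point, as a quick sketch on this dataset (the appended constant is the last column of data.exog):

# True if any exog column has (numerically) zero variance, i.e. a constant column
hasconst = (data.exog.var(0) < 1e-10).any()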