In [1]: import numpy as np
In [2]: import statsmodels.api as sm
In [3]: import matplotlib.pyplot as plt
In [4]: data = sm.datasets.ccard.load()
In [5]: data.exog = sm.add_constant(data.exog, prepend=False)
In [6]: ols_fit = sm.OLS(data.endog, data.exog).fit()
Perhaps the residuals from this fit depend on the square of income:
In [7]: incomesq = data.exog[:,2]
In [8]: plt.scatter(incomesq, ols_fit.resid)
Out[8]: <matplotlib.collections.PathCollection at 0xc4a4e2c>
In [9]: plt.grid()
If we think that the variance is proportional to income**2, we would want to weight the regression by income. The weights argument in WLS weights the regression by its square root, and since income enters the equation, income/income becomes the constant, so we would want to perform this type of regression without an explicit constant in the design.
In [10]: wls_fit = sm.WLS(data.endog, data.exog[:,:-1], weights=1/incomesq).fit()
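As a cross-check, the same parameter estimates can be obtained by running OLS on data scaled by the square root of the weights. This is only a sketch, assuming the usual WLS transformation; wls_manual is an illustrative name:

w = 1.0/incomesq
wls_manual = sm.OLS(np.sqrt(w) * data.endog,
                    np.sqrt(w)[:, None] * data.exog[:, :-1]).fit()
# wls_manual.params should agree with wls_fit.params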
This, however, leads to difficulties in interpreting the post-estimation statistics. Statsmodels does not yet handle this elegantly, but the following may be more appropriate:
explained sum of squares
In [11]: ess = wls_fit.uncentered_tss - wls_fit.ssr
rsquared
In [12]: rsquared = ess/wls_fit.uncentered_tss
mean squared error of the model
In [13]: mse_model = ess/(wls_fit.df_model + 1) # add back the dof of the constant
f statistic
In [14]: fvalue = mse_model/wls_fit.mse_resid
adjusted r-squared
In [15]: rsquared_adj = 1 -(wls_fit.nobs)/(wls_fit.df_resid)*(1-rsquared)
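The corrections above can be collected in a small helper; this is a sketch, and corrected_wls_stats is a hypothetical name, not a statsmodels function:

def corrected_wls_stats(res):
    # post-estimation statistics for a WLS fit in which the weight
    # variable takes over the role of the constant
    ess = res.uncentered_tss - res.ssr
    rsquared = ess / res.uncentered_tss
    mse_model = ess / (res.df_model + 1)      # add back the dof of the constant
    fvalue = mse_model / res.mse_resid
    rsquared_adj = 1 - float(res.nobs) / res.df_resid * (1 - rsquared)
    return ess, rsquared, mse_model, fvalue, rsquared_adj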
JP: I need to look at this again. Even if I exclude the weight variable from the regressors and keep the constant in, the reported rsquared stays small. Below we also compare using the square or the square root of the weight variable. TODO: add a 45 degree line to the graphs.
In [16]: wls_fit3 = sm.WLS(data.endog, data.exog[:,(0,1,3,4)], weights=1/incomesq).fit()
In [17]: print wls_fit3.summary()
                            WLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.052
Model:                            WLS   Adj. R-squared:                  0.010
Method:                 Least Squares   F-statistic:                     1.232
Date:                Wed, 11 Jul 2012   Prob (F-statistic):              0.305
Time:                        19:01:04   Log-Likelihood:                -487.93
No. Observations:                  72   AIC:                             983.9
Df Residuals:                      68   BIC:                             993.0
Df Model:                           3
==============================================================================
                 coef    std err          t      P>|t|     [95.0% Conf. Int.]
------------------------------------------------------------------------------
x1            -2.1789      3.732     -0.584      0.561        -9.626     5.268
x2           104.7881     25.577      4.097      0.000        53.750   155.826
x3            58.2829     58.287      1.000      0.321       -58.028   174.594
const        -46.1074    105.445     -0.437      0.663      -256.520   164.305
==============================================================================
Omnibus:                       49.275   Durbin-Watson:                   1.625
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              165.137
Skew:                           2.190   Prob(JB):                     1.38e-36
Kurtosis:                       8.988   Cond. No.                         146.
==============================================================================
In [18]: print 'corrected rsquared',
corrected rsquared
In [19]: print (wls_fit3.uncentered_tss - wls_fit3.ssr)/wls_fit3.uncentered_tss
0.541521419331
In [20]: plt.figure();
In [21]: plt.title('WLS dropping heteroscedasticity variable from regressors');
In [22]: plt.plot(data.endog, wls_fit3.fittedvalues, 'o');
In [23]: plt.xlim([0,2000]);
In [24]: plt.ylim([0,2000]);
In [25]: print 'raw correlation of endog and fittedvalues'
raw correlation of endog and fittedvalues
In [26]: print np.corrcoef(data.endog, wls_fit.fittedvalues)
[[ 1. 0.45290477]
[ 0.45290477 1. ]]
In [27]: print 'raw correlation coefficient of endog and fittedvalues squared'
raw correlation coefficient of endog and fittedvalues squared
In [28]: print np.corrcoef(data.endog, wls_fit.fittedvalues)[0,1]**2
0.205122735117
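For OLS with a constant, the squared correlation between endog and the fitted values reproduces the reported rsquared, which gives a reference point for the WLS numbers above (a sketch):

# both should be close to 0.24 for the OLS fit on the full design
print(np.corrcoef(data.endog, ols_fit.fittedvalues)[0, 1]**2)
print(ols_fit.rsquared)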
Compare with robust regression: the heteroscedasticity correction downweights the outliers.
In [29]: rlm_fit = sm.RLM(data.endog, data.exog).fit()
In [30]: plt.figure();
In [31]: plt.title('using robust for comparison');
In [32]: plt.plot(data.endog, rlm_fit.fittedvalues, 'o');
In [33]: plt.xlim([0,2000]);
In [34]: plt.ylim([0,2000]);
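To see which observations the robust fit downweights, we can inspect the final IRLS weights; this assumes the RLM results expose them as rlm_fit.weights, with the smallest weights marking the observations treated as outliers:

# indices of the five most heavily downweighted observations (sketch)
print(np.argsort(rlm_fit.weights)[:5])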
two helper functions
In [35]: def getrsq(fitresult):
....: '''calculates rsquared, residual, total and explained sums of squares
....: Parameters
....: ----------
....: fitresult : instance of Regression Result class, or tuple of (resid, endog) arrays
....: regression residuals and endogenous variable
....: Returns
....: -------
....: rsquared
....: residual sum of squares
....: (centered) total sum of squares
....: explained sum of squares (for centered)
....: '''
....: if hasattr(fitresult, 'resid') and hasattr(fitresult, 'model'):
....: resid = fitresult.resid
....: endog = fitresult.model.endog
....: nobs = fitresult.nobs
....: else:
....: resid = fitresult[0]
....: endog = fitresult[1]
....: nobs = resid.shape[0]
....: rss = np.dot(resid, resid)
....: tss = np.var(endog)*nobs
....: return 1-rss/tss, rss, tss, tss-rss
....:
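For example, applied to the OLS fit from above (a sketch; the first returned value should match ols_fit.rsquared):

# rsquared, rss, tss and ess computed from the OLS residuals
print(getrsq(ols_fit))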
In [36]: def index_trim_outlier(resid, k):
....: '''returns indices to residual array with k outliers removed
....: Parameters
....: ----------
....: resid : array_like, 1d
....: data vector, usually residuals of a regression
....: k : int
....: number of outliers to remove
....: Returns
....: -------
....: trimmed_index : array, 1d
....: index array with k outliers removed
....: outlier_index : array, 1d
....: index array of k outliers
....: Notes
....: -----
....: Outliers are defined as the k observations with the largest
....: absolute values.
....: '''
....: sort_index = np.argsort(np.abs(resid))
....: # index of non-outlier
....: trimmed_index = np.sort(sort_index[:-k])
....: outlier_index = np.sort(sort_index[-k:])
....: return trimmed_index, outlier_index
....:
ols_test_fit = sm.OLS(data.endog, data.exog).fit()
In [37]: olskeep, olsoutl = index_trim_outlier(ols_fit.resid, 2)
In [38]: print 'ols outliers', olsoutl, ols_fit.resid[olsoutl]
ols outliers [22 44] [ 1029.99405449 1460.61975104]
In [39]: ols_fit_rm2 = sm.OLS(data.endog[olskeep], data.exog[olskeep,:]).fit()
In [40]: rlm_fit_rm2 = sm.RLM(data.endog[olskeep], data.exog[olskeep,:]).fit()
weights = 1/incomesq
In [41]: results = [ols_fit, ols_fit_rm2, rlm_fit, rlm_fit_rm2]
Note: I think incomesq is already income squared.
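One quick check of that note (a sketch; it assumes column 1 of the ccard exog is income and column 2 its square):

# True if the column used as incomesq is literally income**2
print(np.allclose(data.exog[:, 2], data.exog[:, 1]**2))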
In [42]: for weights in [1/incomesq, 1/incomesq**2, np.sqrt(incomesq)]:
....: print '\nComparison OLS and WLS with and without outliers'
....: wls_fit0 = sm.WLS(data.endog, data.exog, weights=weights).fit()
....: wls_fit_rm2 = sm.WLS(data.endog[olskeep], data.exog[olskeep,:],
....: weights=weights[olskeep]).fit()
....: wlskeep, wlsoutl = index_trim_outlier(ols_fit.resid, 2)
....: print '2 outliers candidates and residuals'
....: print wlsoutl, wls_fit.resid[olsoutl]
....: # redundant because ols and wls outliers are the same:
....: ##wls_fit_rm2_ = sm.WLS(data.endog[wlskeep], data.exog[wlskeep,:],
....: ## weights=1/incomesq[wlskeep]).fit()
....: print 'outliers ols, wls:', olsoutl, wlsoutl
....: print 'rsquared'
....: print 'ols vs ols rm2', ols_fit.rsquared, ols_fit_rm2.rsquared
....: print 'wls vs wls rm2', wls_fit0.rsquared, wls_fit_rm2.rsquared #, wls_fit_rm2_.rsquared
....: print 'compare R2_resid versus R2_wresid'
....: print 'ols minus 2', getrsq(ols_fit_rm2)[0],
....: print getrsq((ols_fit_rm2.wresid, ols_fit_rm2.model.wendog))[0]
....: print 'wls ', getrsq(wls_fit)[0],
....: print getrsq((wls_fit.wresid, wls_fit.model.wendog))[0]
....: print 'wls minus 2', getrsq(wls_fit_rm2)[0],
....: # next is same as wls_fit_rm2.rsquared for cross checking
....: print getrsq((wls_fit_rm2.wresid, wls_fit_rm2.model.wendog))[0]
....: #print getrsq(wls_fit_rm2_)[0],
....: #print getrsq((wls_fit_rm2_.wresid, wls_fit_rm2_.model.wendog))[0]
....: results.extend([wls_fit0, wls_fit_rm2])
....:
Comparison OLS and WLS with and without outliers
2 outliers candidates and residuals
[22 44] [ 1056.13211484 1533.40818784]
outliers ols, wls: [22 44] [22 44]
rsquared
ols vs ols rm2 0.243577916822 0.317459940305
wls vs wls rm2 0.0593688805068 0.0679070566139
compare R2_resid versus R2_wresid
ols minus 2 0.317459940305 0.317459940305
wls 0.198643181027 0.050000418999
wls minus 2 0.311233296924 0.0679070566139
Comparison OLS and WLS with and without outliers
2 outliers candidates and residuals
[22 44] [ 1056.13211484 1533.40818784]
outliers ols, wls: [22 44] [22 44]
rsquared
ols vs ols rm2 0.243577916822 0.317459940305
wls vs wls rm2 0.0666184777739 0.127696536358
compare R2_resid versus R2_wresid
ols minus 2 0.317459940305 0.317459940305
wls 0.198643181027 0.050000418999
wls minus 2 0.293669036814 0.127696536358
Comparison OLS and WLS with and without outliers
2 outliers candidates and residuals
[22 44] [ 1056.13211484 1533.40818784]
outliers ols, wls: [22 44] [22 44]
rsquared
ols vs ols rm2 0.243577916822 0.317459940305
wls vs wls rm2 0.353266862329 0.546812251767
compare R2_resid versus R2_wresid
ols minus 2 0.317459940305 0.317459940305
wls 0.198643181027 0.050000418999
wls minus 2 0.312186685643 0.546812251767
In [43]: print ' ols ols_rm2 rlm rlm_rm2 wls (lin) wls_rm2 (lin) wls (squ) wls_rm2 (squ) wls (sqrt) wls_rm2 (sqrt)'
ols ols_rm2 rlm rlm_rm2 wls (lin) wls_rm2 (lin) wls (squ) wls_rm2 (squ) wls (sqrt) wls_rm2 (sqrt)
In [44]: print 'Parameter estimates'
Parameter estimates
In [45]: print np.column_stack([r.params for r in results])
[[ -3.08181404 -5.06103843 -4.98510966 -5.34410309 -2.69418516
-3.1305703 -1.43815462 -1.58893054 -3.57074829 -6.80053364]
[ 234.34702702 115.08753715 129.85391456 109.01433492 158.42697752
128.38182357 60.95113284 100.25000841 254.82166855 103.75834726]
[ -14.99684418 -5.77558429 -6.46204829 -4.77409191 -7.24928987
-7.41228893 6.84943071 -3.34972494 -16.40524256 -4.5924465 ]
[ 27.94090839 85.46566835 89.91389709 95.85086459 60.44877369
79.7759146 55.9884469 60.97199734 -3.8085159 84.69170048]
[-237.1465136 39.51639838 -15.50014814 31.39771833 -114.10886935
-40.04207242 -6.41976501 -38.83583228 -260.72084271 117.20540179]]
In [46]: print 'R2 original data, next line R2 weighted data'
R2 original data, next line R2 weighted data
In [47]: print np.column_stack([getattr(r, 'rsquared', None) for r in results])
[[0.243577916822 0.317459940305 None None 0.0593688805068 0.0679070566139
0.0666184777739 0.127696536358 0.353266862329 0.546812251767]]
In [48]: print 'Standard errors'
Standard errors
In [49]: print np.column_stack([getattr(r, 'bse', None) for r in results])
[[ 5.51471653 3.31028758 2.61580069 2.39537089 3.80730631
2.90027255 2.71141739 2.46959477 6.37593755 3.39477842]
[ 80.36595035 49.35949263 38.12005692 35.71722666 76.39115431
58.35231328 87.18452039 80.30086861 86.99568216 47.58202096]
[ 7.46933695 4.55366113 3.54293763 3.29509357 9.72433732
7.41259156 15.15205888 14.10674821 7.18302629 3.91640711]
[ 82.92232357 50.54681754 39.33262384 36.57639175 58.55088753
44.82218676 43.11017757 39.31097542 96.4077482 52.57314209]
[ 199.35166485 122.1287718 94.55866295 88.3741058 139.68749646
106.89445525 115.79258539 105.99258363 239.38105863 130.32619908]]
In [50]: print 'Heteroscedasticity robust standard errors (with ols)'
Heteroscedasticity robust standard errors (with ols)
In [51]: print 'with outliers'
with outliers
In [52]: print np.column_stack([getattr(ols_fit, se, None) for se in ['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se']])
[[ 3.30166123 3.42264107 3.4477148 3.60462409]
[ 88.86635165 92.12260235 92.08368378 95.48159869]
[ 6.94456348 7.19902694 7.19953754 7.47634779]
[ 92.18777672 95.56573144 95.67211143 99.31427277]
[ 212.9905298 220.79495237 221.08892661 229.57434782]]
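For comparison, the regular OLS standard errors can be placed next to one of the robust variants (a sketch):

# regular bse next to the HC3 robust standard errors
print(np.column_stack([ols_fit.bse, ols_fit.HC3_se]))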
(I didn’t check whether this is fully correct statistically)
With OLS on full sample
In [53]: nobs, nvar = data.exog.shape
In [54]: niter = 2000
In [55]: bootres = np.zeros((niter, nvar*2))
In [56]: for it in range(niter):
....: rind = np.random.randint(nobs, size=nobs)
....: endog = data.endog[rind]
....: exog = data.exog[rind,:]
....: res = sm.OLS(endog, exog).fit()
....: bootres[it, :nvar] = res.params
....: bootres[it, nvar:] = res.bse
....:
In [57]: np.set_printoptions(linewidth=200)
In [58]: print 'Bootstrap Results of parameters and parameter standard deviation OLS'
Bootstrap Results of parameters and parameter standard deviation OLS
In [59]: print 'Parameter estimates'
Parameter estimates
In [60]: print 'median', np.median(bootres[:,:5], 0)
median [ -3.18762808 230.28190284 -14.91350869 31.31426536 -230.76507856]
In [61]: print 'mean ', np.mean(bootres[:,:5], 0)
mean [ -2.83204335 237.04728999 -15.16402208 26.36037718 -250.6443012 ]
In [62]: print 'std ', np.std(bootres[:,:5], 0)
std [ 3.83381044 98.71047041 9.45405218 95.45545876 225.33289039]
In [63]: print 'Standard deviation of parameter estimates'
Standard deviation of parameter estimates
In [64]: print 'median', np.median(bootres[:,5:], 0)
median [ 5.43542602 82.08961746 7.59405275 81.36978696 201.58548065]
In [65]: print 'mean ', np.mean(bootres[:,5:], 0)
mean [ 5.46029027 86.34490341 8.57550315 80.77601548 203.356959 ]
In [66]: print 'std ', np.std(bootres[:,5:], 0)
std [ 1.42063357 30.34539049 4.32578572 19.13318826 55.7301691 ]
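The bootstrap draws can also be summarized with percentile confidence intervals instead of the mean and standard deviation (a sketch; 95% intervals for the parameters):

# 2.5% and 97.5% percentiles of the bootstrapped parameter estimates
print(np.percentile(bootres[:, :nvar], [2.5, 97.5], axis=0))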
In [67]: plt.figure()
Out[67]: <matplotlib.figure.Figure at 0xc9d70ec>
In [68]: for i in range(4):
....: plt.subplot(2,2,i+1)
....: plt.hist(bootres[:,i],50)
....: plt.title('var%d'%i)
....:
In [69]: plt.figtext(0.5, 0.935, 'OLS Bootstrap',
....: ha='center', color='black', weight='bold', size='large')
....:
Out[69]: <matplotlib.text.Text at 0xc9d774c>
With WLS on sample with outliers removed
In [70]: data_endog = data.endog[olskeep]
In [71]: data_exog = data.exog[olskeep,:]
In [72]: incomesq_rm2 = incomesq[olskeep]
In [73]: nobs, nvar = data_exog.shape
In [74]: niter = 500 # a bit slow
In [75]: bootreswls = np.zeros((niter, nvar*2))
In [76]: for it in range(niter):
....: rind = np.random.randint(nobs, size=nobs)
....: endog = data_endog[rind]
....: exog = data_exog[rind,:]
....: res = sm.WLS(endog, exog, weights=1/incomesq_rm2[rind]).fit()
....: bootreswls[it, :nvar] = res.params
....: bootreswls[it, nvar:] = res.bse
....:
In [77]: print 'Bootstrap Results of parameters and parameter standard deviation',
Bootstrap Results of parameters and parameter standard deviation
In [78]: print 'WLS removed 2 outliers from sample'
WLS removed 2 outliers from sample
In [79]: print 'Parameter estimates'
Parameter estimates
In [80]: print 'median', np.median(bootreswls[:,:5], 0)
median [ -3.85414013 134.81822248 -8.87959655 87.61078396 -34.89330649]
In [81]: print 'mean ', np.mean(bootreswls[:,:5], 0)
mean [ -3.6211722 136.14971578 -8.98259132 87.41910954 -48.11761555]
In [82]: print 'std ', np.std(bootreswls[:,:5], 0)
std [ 2.85719978 62.33377642 7.75700445 48.78693317 110.93812134]
In [83]: print 'Standard deviation of parameter estimates'
Standard deviation of parameter estimates
In [84]: print 'median', np.median(bootreswls[:,5:], 0)
median [ 2.93450772 59.94331313 6.80259478 44.86159298 120.94206658]
In [85]: print 'mean ', np.mean(bootreswls[:,5:], 0)
mean [ 2.96668034 60.64580021 6.99451597 45.80030781 122.42460949]
In [86]: print 'std ', np.std(bootreswls[:,5:], 0)
std [ 0.52108974 11.18651362 1.5955339 7.77500052 22.75644625]
In [87]: plt.figure()
Out[87]: <matplotlib.figure.Figure at 0xd0e82cc>
In [88]: for i in range(4):
....: plt.subplot(2,2,i+1)
....: plt.hist(bootreswls[:,i],50)
....: plt.title('var%d'%i)
....:
In [89]: plt.figtext(0.5, 0.935, 'WLS rm2 Bootstrap',
....: ha='center', color='black', weight='bold', size='large')
....:
Out[89]: <matplotlib.text.Text at 0xd0e8a0c>
The following results come from random draws that are not fixed by a seed:
Bootstrap Results of parameters and parameter standard deviation
OLS
Parameter estimates
median [ -3.26216383 228.52546429 -14.57239967 34.27155426 -227.02816597]
mean [ -2.89855173 234.37139359 -14.98726881 27.96375666 -243.18361746]
std [ 3.78704907 97.35797802 9.16316538 94.65031973 221.79444244]
Standard deviation of parameter estimates
median [ 5.44701033 81.96921398 7.58642431 80.64906783 200.19167735]
mean [ 5.44840542 86.02554883 8.56750041 80.41864084 201.81196849]
std [ 1.43425083 29.74806562 4.22063268 19.14973277 55.34848348]
Bootstrap Results of parameters and parameter standard deviation
WLS removed 2 outliers from sample
Parameter estimates
median [ -3.95876112 137.10419042 -9.29131131 88.40265447 -44.21091869]
mean [ -3.67485724 135.42681207 -8.7499235 89.74703443 -46.38622848]
std [ 2.96908679 56.36648967 7.03870751 48.51201918 106.92466097]
Standard deviation of parameter estimates
median [ 2.89349748 59.19454402 6.70583332 45.40987953 119.05241283]
mean [ 2.97600894 60.14540249 6.92102065 45.66077486 121.35519673]
std [ 0.55378808 11.77831934 1.69289179 7.4911526 23.72821085]
In the bootstrap results we could also include rsquared, and do it for RLM as well. The problems could also mean that the linearity assumption is violated, e.g. try a non-linear transformation of the exog variables that is still linear in the parameters.

For statsmodels:

In this case rsquared for the original data looks less random/arbitrary.

Don't change the definition of rsquared from the centered to the uncentered tss when calculating rsquared in WLS if the original exog contains a constant. The increase in rsquared because of a change in definition would be very misleading.

Whether or not the transformed exog (wexog) contains a constant might also affect the degrees of freedom calculation, but I haven't checked this. I would guess that df_model should stay the same, but this needs to be verified with a textbook.

Special handling is also needed when the original exog does not contain a constant, e.g. when regressing an endog on a single exog variable without a constant. This case might also require redefining the rsquared and F statistic for the regression anova to use the uncentered tss. This could be done through a keyword parameter to model.__init__ or through autodetection with hasconst = (exog.var(0) < 1e-10).any(). I'm not sure about fixed effects with a full dummy set but without a constant; in that case the autodetection wouldn't work this way. Also, I'm not sure whether a ddof keyword parameter could also handle the hasconst case.
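As an illustration of that autodetection, a minimal sketch (has_constant is an illustrative helper, not an existing statsmodels keyword):

def has_constant(exog):
    # a column with (numerically) zero variance acts as a constant
    return (exog.var(0) < 1e-10).any()

print(has_constant(data.exog))          # exog with an explicit constant column
print(has_constant(data.exog[:, :-1]))  # constant column dropped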