Prediction (out of sample)

In [1]:
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.981
Method:                 Least Squares   F-statistic:                     867.5
Date:                Fri, 05 Feb 2016   Prob (F-statistic):           1.78e-40
Time:                        17:38:33   Log-Likelihood:                 1.0243
No. Observations:                  50   AIC:                             5.951
Df Residuals:                      46   BIC:                             13.60
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          5.0830      0.084     60.339      0.000         4.913     5.253
x1             0.4953      0.013     38.125      0.000         0.469     0.521
x2             0.4219      0.051      8.260      0.000         0.319     0.525
x3            -0.0208      0.001    -18.221      0.000        -0.023    -0.018
==============================================================================
Omnibus:                        1.746   Durbin-Watson:                   1.889
Prob(Omnibus):                  0.418   Jarque-Bera (JB):                0.934
Skew:                          -0.255   Prob(JB):                        0.627
Kurtosis:                       3.435   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
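
The summary is formatted for display, but the same quantities are available programmatically on the results object (these attribute names are part of the statsmodels results API):

print(olsres.params)     # estimated coefficients
print(olsres.bse)        # standard errors of the coefficients
print(olsres.rsquared)   # R-squared of the regression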

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.5634   5.0144   5.431    5.7901   6.0771   6.288    6.4302   6.5212
   6.5858   6.6521   6.7465   6.8901   7.0943   7.3597   7.6756   8.0219
   8.3724   8.699    8.9764   9.1863   9.3203   9.3814   9.3833   9.3484
   9.3041   9.2779   9.2936   9.3666   9.5019   9.6933   9.9239  10.1695
  10.4022  10.595   10.7265  10.7841  10.7662  10.6825  10.5525  10.4018
  10.2588  10.149   10.0916  10.0959  10.1601  10.2712  10.4075  10.542
  10.6469  10.698 ]
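
In-sample predictions for the estimation data are also precomputed and stored on the results object, so the call above is equivalent to reading olsres.fittedvalues:

# the predictions above match the stored fitted values
assert np.allclose(ypred, olsres.fittedvalues)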

Create a new sample of explanatory variables, Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 10.6641  10.5168  10.2727   9.9695   9.6569   9.3841   9.1884   9.0854
   9.0655   9.0963]
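
If you also want uncertainty estimates for these predictions, newer statsmodels versions expose get_prediction, which returns both confidence intervals for the mean and prediction intervals for new observations; a minimal sketch:

pred = olsres.get_prediction(Xnew)
# mean, its confidence interval, and the prediction interval per row
print(pred.summary_frame(alpha=0.05))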

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any of patsy's formula expansion magic applied to **2.
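
To see exactly what I() prevents, compare the design matrices patsy builds with and without it. A minimal sketch using patsy directly; the variables a and b are made up for illustration:

import patsy

demo = {"a": np.array([1., 2., 3.]), "b": np.array([4., 5., 6.])}
# without I(), ** is a formula operator: it expands into main effects plus interaction
print(patsy.dmatrix("(a + b)**2", demo).design_info.column_names)
# ['Intercept', 'a', 'b', 'a:b']
# with I(), the expression is evaluated as ordinary Python arithmetic
print(patsy.dmatrix("I((a + b)**2)", demo).design_info.column_names)
# ['Intercept', 'I((a + b) ** 2)']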

In [8]:
res.params
Out[8]:
Intercept           5.083036
x1                  0.495320
np.sin(x1)          0.421859
I((x1 - 5) ** 2)   -0.020785
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 10.6641,  10.5168,  10.2727,   9.9695,   9.6569,   9.3841,
         9.1884,   9.0854,   9.0655,   9.0963])
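
The exog argument also accepts a pandas DataFrame, which is convenient when the new observations come from a file; a small sketch (newdata is a made-up name):

import pandas as pd

newdata = pd.DataFrame({"x1": x1n})
print(res.predict(newdata))  # same out-of-sample predictions as above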