# Regularization in Linear Models

Ridge and Lasso Regression | September 14, 2018

Least squares estimates are often not very satisfactory because of their poor out-of-sample performance, especially when the model is overly complex with a lot of features. We can attribute this to the low bias but large variance of least squares estimates. Additionally, when we have a lot of features in our model, it is harder to identify the features with the strongest effect, or what we call the **Big Picture**. Hence, we might want to shrink or drop some coefficients, trading a slightly worse in-sample fit (a little more bias) for lower variance and better out-of-sample prediction.

**Regularization** is a method to shrink or drop coefficients/parameters of a model by imposing a penalty on their size; for this reason it is also referred to as a **Shrinkage Method**. In this post, I will discuss two of the most common regularization techniques: Ridge and Lasso regression.

## Setup

For starters, we will use the `Prostate Cancer` dataset from the **Elements of Statistical Learning** book. If you want more information about the dataset, it's available in *Chapter 1* of the book or **here**, and the dataset itself is available **here**.

If you want to follow along with some code, I have put a `Jupyter` notebook on **Github**.

```
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
from logorama import logger
from bokeh.io import output_notebook, curdoc
from bokeh.plotting import figure, show
from bokeh.themes import Theme
from bokeh.embed import components
from bokeh.models import Range1d
from bokeh.palettes import Spectral10
import scipy.optimize as so
%matplotlib inline
plot_theme = Theme("./theme.yml")
output_notebook()
sns.set_style(style="dark")
```

Let’s load the data and plot the pair-wise scatter plot for it.

```
prostate_data = pd.read_csv("https://web.stanford.edu/~hastie/ElemStatLearn/datasets/prostate.data",
                            sep="\t", index_col=0)
train_idx = prostate_data["train"] == "T"
prostate_data.drop(columns="train", inplace=True)
fp = sns.pairplot(prostate_data, diag_kind="kde")
fp.fig.suptitle("Pair-wise Scatter Plot")
plt.show()
```

We will also need the **L2-norm** of the residuals; it serves as the objective function for the constrained fits later (minimizing it gives the same coefficients as minimizing the residual sum of squares), so let's implement that.

```
def norm(betas, x_train, y_train):
    # L2-norm of the residuals, with y centered so no intercept term is needed
    return np.linalg.norm(y_train - np.mean(y_train) - x_train.dot(betas))
```

#### About the dataset

The original data come from a study by Stamey et al. (1989), which examined the relationship between the level of prostate-specific antigen and a number of clinical measures in men who were about to receive a radical prostatectomy. The response is the log of prostate-specific antigen (`lpsa`), and the predictors are:

- log cancer volume (`lcavol`)
- log prostate weight (`lweight`)
- age (`age`)
- log of the amount of benign prostatic hyperplasia (`lbph`)
- seminal vesicle invasion (`svi`)
- log of capsular penetration (`lcp`)
- Gleason score (`gleason`)
- percent of Gleason scores 4 or 5 (`pgg45`)

## Simple Linear Fit

Before we jump to Ridge and Lasso regression, let's fit a least squares model to the data and get the corresponding `standard error` and `z-score` estimates for each coefficient.

We know that the linear regression fit is given by:

\begin{equation} \hat y = \mathbf{X} \hat \beta = \mathbf{X} (\mathbf{X^\intercal} \mathbf{X})^{-1} \mathbf{X^\intercal} y \end{equation}

The variance-covariance matrix for least squares parameters is given by

\begin{equation} Var(\hat \beta) = (\mathbf{X^\intercal} \mathbf{X})^{-1} \sigma^2 \end{equation}

where $\sigma$ is the population standard deviation of $y_i$. We can estimate $\sigma^{2}$ by:

\begin{equation} \hat \sigma^2 = \dfrac{1}{N - p -1} \sum\limits_{i=1}^N (y_i - \hat y_i)^2 \end{equation}

Finally, we can calculate the z-score as:

\begin{equation} Z_{\beta_i} = \dfrac{\hat \beta_i - 0}{\hat \sigma_{\beta_i}} \end{equation}

```
y_data = prostate_data["lpsa"]
x_data = prostate_data.drop(columns="lpsa")
x_data = (x_data - x_data.mean()) / x_data.std() # Standardize
```

```
x_train = x_data.loc[train_idx]
x_train = np.hstack([np.ones((len(x_train), 1)), x_train.values.copy()])
y_train = y_data.loc[train_idx]
betas = np.linalg.inv(x_train.T.dot(x_train)).dot(x_train.T).dot(y_train)
y_train_hat = x_train.dot(betas)
```

```
dof = len(x_train) - len(betas)
mse = np.sum((y_train_hat - y_train) ** 2) / dof
betas_cov = np.linalg.inv(x_train.T.dot(x_train)) * mse
betas_se = np.sqrt(betas_cov.diagonal())
betas_z = (betas - 0) / betas_se
betas_estimate_table = pd.DataFrame({"Beta": betas, "SE": betas_se, "Z-Score": betas_z},
                                    index=np.append(["intercept"], x_data.columns))
logger.info(f"Degrees of Freedom: {dof}")
logger.info(f"MSE: {np.round(mse, 4)}")
logger.info(f"Beta Errors:\n{betas_estimate_table}")
```

```
2018-09-08 16:09:32.860426 - [INFO] - {root:<module>:12} - Degrees of Freedom: 58
2018-09-08 16:09:32.865155 - [INFO] - {root:<module>:13} - MSE: 0.5074
2018-09-08 16:09:32.872027 - [INFO] - {root:<module>:14} - Beta Errors:
               Beta        SE    Z-Score
intercept  2.464933  0.089315  27.598203
lcavol     0.679528  0.126629   5.366290
lweight    0.263053  0.095628   2.750789
age       -0.141465  0.101342  -1.395909
lbph       0.210147  0.102219   2.055846
svi        0.305201  0.123600   2.469255
lcp       -0.288493  0.154529  -1.866913
gleason   -0.021305  0.145247  -0.146681
pgg45      0.266956  0.153614   1.737840
```

At a 95% confidence level, a `z-score` greater than $1.96$ in absolute value is significant.

Out of the 8 predictors, 4 (`age`, `lcp`, `gleason`, `pgg45`) are not significant in our current model.
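
As a quick illustration (a snippet of mine, not from the original notebook), we can flag which coefficients clear that threshold directly from the `betas_estimate_table` computed above:

```
# Flag coefficients whose |z-score| exceeds the 1.96 cutoff
significant = betas_estimate_table["Z-Score"].abs() > 1.96
logger.info(f"Significant at the 5% level:\n{significant}")
```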

Let's calculate the `MSE` for the out-of-sample/test data.

```
x_test = x_data.loc[~train_idx]
x_test = np.hstack([np.ones((len(x_test), 1)), x_test.values.copy()])
y_test = y_data.loc[~train_idx]
y_test_hat = x_test.dot(betas)
dof_test = len(x_test) - len(betas)
mse = np.sum((y_test_hat - y_test) ** 2) / dof_test
logger.info(f"MSE: {np.round(mse, 4)}")
```

```
2018-09-08 16:09:33.139871 - [INFO] - {root:<module>:11} - MSE: 0.7447
```

We can see that the MSE is noticeably higher for the out-of-sample data, as expected for an over-fit least squares model.

Next, let’s discuss how we can use regularization to help with this model.

## Ridge Regression

Ridge regression imposes an $L_2$ `norm` based penalty on the size of the coefficients, given by $\beta^{\intercal} \beta$. We also introduce a complexity parameter $\lambda \ge 0$ that controls the amount of shrinkage, so the total penalty is $\lambda \beta^{\intercal} \beta$. As the value of $\lambda$ gets larger, we get more shrinkage.

Our new regression coefficients are given by:

\begin{equation} \hat \beta^{ridge} = \mathbf{argmin}_{\beta} \Bigg \{ \sum _{i = 1}^N (y_i - \beta_0 - \sum _{j = 1}^p x _{ij} \beta_j )^2 + \lambda \sum _{j = 1}^p \beta _j^2 \Bigg \} \end{equation}

Inputs need to be standardized to bring the Ridge coefficients to an equivalent scale.

Notice that $\beta_0$ is left out of the penalty term, since penalizing it would make the procedure depend on the origin of $Y$. In other words, if we add a constant $c$ to all $y_i$, the predictions should simply shift by the same amount $c$; if $\beta_0$ were included in the penalty term, this would no longer hold.

Therefore, we re-parameterize the model: we estimate the intercept as $\bar y = \dfrac{\sum _i^N y_i}{N}$ and fit the remaining coefficients to the centered responses $y_i - \bar y$. After re-parameterization, $\mathbf{X}$ has $p$ columns instead of $p+1$ and no constant term. Now we can rewrite our Ridge coefficients in matrix form as:

\begin{equation} \hat \beta^{ridge} = (\mathbf{X}^\intercal \mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\intercal y \end{equation}

Next, let's analyze the effect of $\lambda$ on the shrinkage of the Ridge regression coefficients. It's easy to notice that $\lambda = 0$ should simply give back the least squares coefficients.

```
ridge_coeffs = np.append([0], np.logspace(0, 5.0, 21))
x_train = x_data.loc[train_idx].values.copy()
y_train = y_data.loc[train_idx].copy()
y_train = y_train - np.mean(y_train)
betas_ridge = {}
for lambdaa in ridge_coeffs:
    # Closed-form Ridge solution: (X'X + lambda * I)^-1 X'y
    betas = np.linalg.inv(x_train.T.dot(x_train) +
                          lambdaa * np.identity(len(x_train[0]))).dot(x_train.T).dot(y_train)
    betas_ridge[lambdaa] = pd.Series(betas, index=x_data.columns)
betas_ridge = pd.DataFrame(betas_ridge).T
```

```
p = figure(plot_width=800, plot_height=600)
for col, color in zip(x_data.columns, Spectral10):
    p.line(x=np.log10(betas_ridge.index.values), y=betas_ridge[col].values, legend=col, color=color)
    p.circle(x=np.log10(betas_ridge.index.values), y=betas_ridge[col].values, legend=col, color=color)
p.title.text = "Coefficient Shrinkage in Ridge Regression (\u03B2 vs log\u2081\u2080(\u03BB))"
p.xaxis.axis_label = "log\u2081\u2080(\u03BB)"
p.yaxis.axis_label = "Coefficient"
p.legend.location = "top_right"
curdoc().clear()
doc = curdoc()
doc.theme = plot_theme
doc.add_root(p)
show(p)
```

- As $\lambda$ increases, the coefficients' magnitudes get smaller
- We can see that at large values of $\lambda$, all coefficients approach $0$

One interesting thing to note is that the gradient of $|| \beta ||^2_2$ is $2 \beta$. This gradient gets smaller and smaller as $\beta$ gets closer to $\mathbf{0}$, so the penalty's pull weakens and $\beta$ only asymptotically approaches $\mathbf{0}$.

### Alternate Formulation

We can formulate the Ridge regression solution alternatively as:

\begin{equation} \hat \beta^{ridge} = \mathbf{argmin}_{\beta} \Bigg \{ \sum _{i = 1}^N (y_i - \beta_0 - \sum _{j = 1}^p x _{ij} \beta_j )^2 \Bigg \} \hspace{1cm} s.t. \hspace{0.2cm} \sum _{j = 1}^p \beta _j^2 \le t \end{equation}

where there is a one-to-one correspondence between $\lambda$ and $t$.

If we scale $t$ by dividing it by $\sqrt{ \sum_{j = 1}^p \beta_{ls(j)}^2}$, i.e. $|| \beta_{ls} ||_2$, we get a shrinkage factor that lies in the interval $[0, 1]$. **Notice that I haven't used the squared norm.** I will discuss later why I made that decision.

Let’s plot our Ridge coefficients against the shrinkage factor $\dfrac{t}{|| \beta ||_2}$:

```
ols_beta_norm = np.linalg.norm(betas_estimate_table.Beta.values[1:])
ridge_coeffs = np.linspace(0, ols_beta_norm, 21)
x_train = x_data.loc[train_idx].values.copy()
y_train = y_data.loc[train_idx].copy()
y_train = y_train - np.mean(y_train)
def fit_ridge(t, x_train, y_train):
    # Minimize the residual norm subject to ||beta||_2 <= t
    return so.minimize(
        fun=norm, x0=np.zeros(len(x_train[0])),
        constraints={
            "type": "ineq",
            "fun": lambda x: t - np.linalg.norm(x)
        },
        args=(x_train, y_train)
    )
betas_ridge = {}
for t in ridge_coeffs:
    result = fit_ridge(t, x_train=x_train, y_train=y_train)
    shrink_coeff = t / ols_beta_norm
    betas_ridge[shrink_coeff] = pd.Series(result.x, index=x_data.columns)
betas_ridge = pd.DataFrame(betas_ridge).T
betas_ridge.sort_index(inplace=True)
```

```
p = figure(plot_width=800, plot_height=600)
for col, color in zip(x_data.columns, Spectral10):
    p.line(x=betas_ridge.index, y=betas_ridge[col].values, legend=col, color=color)
    p.circle(x=betas_ridge.index, y=betas_ridge[col].values, legend=col, color=color)
p.title.text = "Coefficient Shrinkage in Ridge Regression"
p.xaxis.axis_label = "Shrinkage Factor: t / ||\u03B2||\u2082"
p.yaxis.axis_label = "Coefficient"
p.legend.location = "top_left"
curdoc().clear()
doc = curdoc()
doc.theme = plot_theme
doc.add_root(p)
show(p)
```

- As we can see, $t$ determines how large the coefficients can get
- If $t$ is close to $0$, none of the betas can get very large
- On the other hand, as the shrinkage factor $\dfrac{t}{|| \beta_{ls} ||_2} \rightarrow 1$, the coefficients approach the least squares coefficients

## Lasso Regression

Lasso regression imposes an $L_1$ penalty on the regression coefficients, compared to the $L_2$ penalty in the case of Ridge. Hence, the coefficient formulation changes slightly to:

\begin{equation} \hat \beta^{lasso} = \mathbf{argmin}_{\beta} \Bigg \{ \sum _{i = 1}^N (y_i - \beta_0 - \sum _{j = 1}^p x _{ij} \beta_j )^2 + \lambda \sum _{j = 1}^p |\beta _j| \Bigg \} \end{equation}

or alternatively:

\begin{equation} \hat \beta^{lasso} = \mathbf{argmin}_{\beta} \Bigg \{ \sum _{i = 1}^N (y_i - \beta_0 - \sum _{j = 1}^p x _{ij} \beta_j )^2 \Bigg \} \hspace{1cm} s.t. \hspace{0.2cm} \sum _{j = 1}^p |\beta _j| \le t \end{equation}

Let's look at the shrinkage of the coefficients over a range of $\lambda$ values:

```
lasso_coeffs = np.linspace(0, 10, 21)
x_train = x_data.loc[train_idx].values.copy()
y_train = y_data.loc[train_idx].copy()
y_train = y_train - np.mean(y_train)
def modified_norm(betas, x_train, y_train, coeff):
    # Residual norm plus an L1 penalty on the coefficients
    return norm(betas=betas, x_train=x_train,
                y_train=y_train) + coeff * np.sum(np.abs(betas))
def fit_lasso(lambdaa, x_train, y_train):
    return so.minimize(
        fun=modified_norm, x0=np.zeros(len(x_train[0])),
        args=(x_train, y_train, lambdaa)
    )
betas_lasso = {}
for lambdaa in lasso_coeffs:
    result = fit_lasso(lambdaa, x_train=x_train, y_train=y_train)
    betas_lasso[lambdaa] = pd.Series(result.x, index=x_data.columns)
betas_lasso = pd.DataFrame(betas_lasso).T
betas_lasso.sort_index(inplace=True)
```

```
p = figure(plot_width=800, plot_height=600)
for col, color in zip(x_data.columns, Spectral10):
    p.line(x=betas_lasso.index, y=betas_lasso[col].values, legend=col, color=color)
    p.circle(x=betas_lasso.index, y=betas_lasso[col].values, legend=col, color=color)
p.title.text = "Coefficient Shrinkage in Lasso Regression"
p.xaxis.axis_label = "\u03BB"
p.yaxis.axis_label = "Coefficient"
p.legend.location = "top_right"
curdoc().clear()
doc = curdoc()
doc.theme = plot_theme
doc.add_root(p)
show(p)
```

- Making $t$ sufficiently small will automatically force certain coefficients to be exactly $0$
- If $t$ is chosen larger than $\sum _{j = 1}^p |\beta_j|$ (the sum of the absolute values of the least squares coefficients), the Lasso coefficients are equal to the least squares coefficients

Again, similar to Ridge, if we scale $t$ by dividing it by $\sum_{j = 1}^p | \beta_{ls(j)} |$, i.e. $|| \beta_{ls} ||_1$, we get a shrinkage factor that lies in the interval $[0, 1]$.

I used the $L_2$ norm directly (instead of its square) earlier so that the Ridge and Lasso shrinkage factors are scaled in the same way.

Let's plot the Lasso coefficients against the shrinkage factor:
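
The fitting code mirrors the constrained Ridge fit above; below is a minimal sketch under that assumption, swapping the $L_2$ constraint for an $L_1$ constraint (the names `fit_lasso_constrained`, `lasso_ts`, and `betas_lasso_t` are mine):

```
# Sketch only: constrained Lasso fit, analogous to fit_ridge above.
# The L1 constraint is not smooth, so the solver only gives an approximate solution here.
ols_beta_l1 = np.sum(np.abs(betas_estimate_table.Beta.values[1:]))
lasso_ts = np.linspace(0, ols_beta_l1, 21)
def fit_lasso_constrained(t, x_train, y_train):
    return so.minimize(
        fun=norm, x0=np.zeros(len(x_train[0])),
        constraints={
            "type": "ineq",
            "fun": lambda x: t - np.sum(np.abs(x))  # sum_j |beta_j| <= t
        },
        args=(x_train, y_train)
    )
betas_lasso_t = {}
for t in lasso_ts:
    result = fit_lasso_constrained(t, x_train=x_train, y_train=y_train)
    betas_lasso_t[t / ols_beta_l1] = pd.Series(result.x, index=x_data.columns)
betas_lasso_t = pd.DataFrame(betas_lasso_t).T.sort_index()
```

Plotting `betas_lasso_t` against its index, exactly as in the Ridge shrinkage-factor plot, gives the behaviour described next.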

We can see that as $t$ gets sufficiently small, some coefficients drop out of the model entirely. Hence, Lasso regression can be used for feature selection.
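
To see why the $L_1$ penalty produces exact zeros while the $L_2$ penalty does not, it helps to look at the orthonormal-input case (Table 3.4 in *Elements of Statistical Learning*), where both estimators have closed forms in terms of the least squares estimate $\hat \beta_j$ (shown here up to the scaling convention for $\lambda$):

\begin{equation} \hat \beta_j^{ridge} = \dfrac{\hat \beta_j}{1 + \lambda} \hspace{1.5cm} \hat \beta_j^{lasso} = \mathrm{sign}(\hat \beta_j) \big( |\hat \beta_j| - \lambda \big)_+ \end{equation}

Ridge scales every coefficient by the same proportion, so small coefficients shrink but never reach zero, whereas Lasso translates each coefficient toward zero by a constant amount and truncates it at zero, which is exactly the drop-out behaviour seen above.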

## Final Words

In this post, I discussed how to use regularization with linear models, specifically exploring the Ridge and Lasso techniques. I also discussed how Ridge proportionally shrinks the least squares coefficients, whereas Lasso translates each coefficient by a constant amount (determined by $\lambda$) and truncates it at zero. Finally, we saw how Lasso can be used for feature selection/drop-out in a linear model.