Faisal Qureshi
http://www.vclab.ca
import numpy as np
import scipy as sp
import matplotlib.pyplot as plt
import matplotlib.lines as mlines
import matplotlib.cm as cm
Given data $D = \{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \cdots , (x^{(N)}, y^{(N)})\}$, learn a function $y=f(x)$.
For a line model $f(x) = \theta_0 + \theta_1 x$, least squares minimizes the cost $C(\theta_0, \theta_1) = \sum_{i=1}^N \left( y^{(i)} - \theta_0 - \theta_1 x^{(i)} \right)^2$. This is a convex function, so we can solve for $\theta_0$ and $\theta_1$ by setting $\frac{\partial C}{\partial \theta_0}=0$ and $\frac{\partial C}{\partial \theta_1}=0$, which gives
$\theta_0 = \langle y \rangle - \theta_1 \langle x \rangle$
$\theta_1 = \left(\sum_{i=1}^N (y^{(i)} - \langle y \rangle ) x^{(i)} \right) / \left(\sum_{i=1}^N (x^{(i)} - \langle x \rangle ) x^{(i)} \right)$
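As a quick sanity check, the two closed-form expressions above can be evaluated directly in NumPy and compared against np.polyfit. This is a minimal sketch on made-up data, not the dataset used later in this notebook.
# A minimal sketch: evaluate the closed-form estimates of theta_0 and theta_1
# and compare with np.polyfit (degree-1 fit). The data here is hypothetical.
np.random.seed(1)
x_check = np.linspace(0, 10, 20)
y_check = 0.75 * x_check + 1 + np.random.rand(len(x_check))
x_mean, y_mean = x_check.mean(), y_check.mean()
theta_1 = np.sum((y_check - y_mean) * x_check) / np.sum((x_check - x_mean) * x_check)
theta_0 = y_mean - theta_1 * x_mean
slope, intercept = np.polyfit(x_check, y_check, 1)
print(f'theta_1 = {theta_1}, slope from polyfit = {slope}')
print(f'theta_0 = {theta_0}, intercept from polyfit = {intercept}')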
# Generate some data
np.random.seed(0)
m_true = .75
c_true = 1
x = np.hstack([np.linspace(3, 5, 5), np.linspace(7, 8, 5)])
y = m_true * x + c_true + np.random.rand(len(x))
#x = np.linspace(1,9,10)
#y = np.ones(10)+5
#y = np.linspace(1,9,10)
#x = np.ones(10)+5
data = np.vstack([x, y]).T
n = data.shape[0]
plt.figure(figsize=(7,7))
plt.scatter(data[:,0], data[:,1], c='red', alpha=0.9)
plt.xlim([0,10])
plt.ylim([0,10])
plt.title('Generating some points from a line')
print(f'data =\n{data}')
n = data.shape[0]
print(f'Number of data items = {n}')
A = np.ones([n, 2])
A[:,0] = data[:,0]
print(f'A =\n{A}')
AtA = np.dot(A.T,A)
print(f'AtA =\n {AtA}')
AtY = np.dot(A.T, data[:,1].reshape(n,1))
print(f'AtY = \n{AtY}')
# m = slope
# c = intercept
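# Note: np.linalg.solve(AtA, AtY) is numerically preferable to the explicit
# inverse used below; the inverse is kept to mirror the normal-equation formula.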
m_and_c = np.dot(np.linalg.inv(AtA), AtY)
m = m_and_c[0]
c = m_and_c[1]
print(f'm = {m}')
print(f'c = {c}')
x_plotting = np.linspace(0,10,100)
y_plotting = m * x_plotting + c
plt.figure(figsize=(7,7))
plt.scatter(data[:,0], data[:,1], c='red', alpha=0.9)
plt.plot(x_plotting, y_plotting, 'b-')
plt.xlim([0,10])
plt.ylim([0,10])
plt.title('Line fitted to data using least squares.');
# Generate some data
np.random.seed(0)
m_true = .75
c_true = 1
x = np.hstack([np.linspace(3, 5, 5), np.linspace(7, 8, 5)])
y = m_true * x + c_true + np.random.rand(len(x))
# Outliers
y[9] = 1
y[0] = 9
data = np.vstack([x, y]).T
n = data.shape[0]
plt.figure(figsize=(7,7))
plt.scatter(data[:,0], data[:,1], c='red', alpha=0.9)
plt.xlim([0,10])
plt.ylim([0,10])
plt.title('Generating some points from a line');
x, y = data[:,0], data[:,1]
theta = np.polyfit(x, y, 1)
m = theta[0]
c = theta[1]
print(f'c = {c}')
print(f'm = {m}')
x_plotting = np.linspace(0,10,100)
y_plotting = m * x_plotting + c
plt.figure(figsize=(7,7))
plt.scatter(data[:,0], data[:,1], c='red', alpha=0.9)
plt.plot(x_plotting, y_plotting, 'b-')
plt.xlim([0,10])
plt.ylim([0,10])
plt.title('The effect of outliers on least squares');
In the discussion below, set $e = y_{\mathrm{prediction}} - y_{\mathrm{true}}$ and $z = e^2$. In other words, $z$ is the squared error.
The following expressions are taken from the SciPy implementation of least squares (scipy.optimize.least_squares). For an in-depth discussion, I suggest "Convex Optimization" by Boyd and Vandenberghe (2004) to the interested reader.
Huber loss
Generally,
$$ \rho(e) = \begin{cases} \frac{1}{2} e^2 & \text{for } |e| \le k \\ k|e| - \frac{1}{2}k^2 & \text{otherwise} \end{cases} $$
Here $k$ is referred to as the tuning constant. Smaller values of $k$ produce more resistance to outliers, but at the expense of lower efficiency when the errors are normally distributed. The tuning constant is generally picked to give reasonably high efficiency in the normal case. Typically, we set $k = 1.345 \sigma$ for the Huber loss (where $\sigma$ is the standard deviation of the errors) to produce 95-percent efficiency when the errors are normal, while still offering protection against outliers.
In scipy.optimize.least_squares, the Huber loss is defined in terms of $z = e^2$ as
$$ \rho(z) = \begin{cases} z & \text{for } z \le 1 \\ 2\sqrt{z} - 1 & \text{otherwise} \end{cases} $$
Tukey's biweight (bisquare) loss
Generally,
$$ \rho(e) = \begin{cases} \frac{k^2}{6} \left\{ 1 - \left[ 1 - \left( \frac{e}{k} \right)^2 \right]^3 \right\} & \text{for } |e| \le k \\ \frac{k^2}{6} & \text{otherwise} \end{cases} $$
Another robust loss, plotted as "?Loss" below, is
$$ \rho(z) = \frac{z}{z + \sigma^2} $$
Here $\sigma$ is often referred to as a scale parameter.
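As a small illustration of the tuning-constant discussion, here is a hedged sketch of the general Huber loss $\rho(e)$; the value of $\sigma$ is an assumption for illustration. (The SciPy-style code below instead uses the fixed convention applied to $z = e^2$ with $k = 1$.)
# A minimal sketch of the general Huber loss with tuning constant k (not scipy's convention).
def huber_rho(e, k):
    e = np.asarray(e, dtype=float)
    quadratic = 0.5 * e**2                  # used where |e| <= k
    linear = k * np.abs(e) - 0.5 * k**2     # used where |e| > k
    return np.where(np.abs(e) <= k, quadratic, linear)
sigma_assumed = 1.0            # assumed standard deviation of the errors
k = 1.345 * sigma_assumed      # ~95% efficiency when errors are normal
print(huber_rho(np.array([-3., -0.5, 0., 0.5, 3.]), k))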
e = np.linspace(-3,3,100)
z = np.square(e)
linear_loss = z
l1_loss = np.abs(e)
soft_l1_loss = 2*(np.power(1+z, 0.5) - 1) # Smooth approx. of l1 loss.
huber_loss = 2*np.power(z, 0.5) - 1
huber_loss[z <= 1] = z[z <= 1]
cauchy_loss = np.log(1 + z)
arctan_loss = np.arctan(z)
scale = 0.1
robust_loss = z / (z + scale**2)
plt.figure(figsize=(15,15))
plt.subplot(331)
plt.title('Linear loss')
plt.ylim(0,10)
plt.grid()
plt.plot(e, linear_loss, label='Linear loss')
plt.subplot(332)
plt.grid()
plt.ylim(0,10)
plt.title('L1 loss')
plt.plot(e, l1_loss, label='L1 loss')
plt.subplot(333)
plt.grid()
plt.ylim(0,10)
plt.title('Soft L1 loss')
plt.plot(e, soft_l1_loss, label='Soft L1 loss')
plt.subplot(334)
plt.grid()
plt.ylim(0,10)
plt.title('Huber loss')
plt.plot(e, huber_loss, label='Huber loss')
plt.subplot(335)
plt.grid()
plt.ylim(0,10)
plt.title('Cauchy loss')
plt.plot(e, cauchy_loss, label='Cauchy loss')
plt.subplot(336)
plt.grid()
plt.ylim(0,10)
plt.title('Arctan loss')
plt.plot(e, arctan_loss, label='Arctan loss')
plt.subplot(337)
plt.grid()
plt.ylim(0,10)
plt.title('?Loss')
plt.plot(e, robust_loss, label='Robust loss')
plt.subplot(338)
plt.grid()
plt.title('?Loss (Rescaled y-axis)')
plt.plot(e, z / (z + 2**2), label=r'$\sigma=2$')
plt.plot(e, z / (z + 1**2), label=r'$\sigma=1$')
plt.plot(e, z / (z + .5**2), label=r'$\sigma=.5$')
plt.plot(e, z / (z + .25**2), label=r'$\sigma=.25$')
plt.plot(e, z / (z + .1**2), label=r'$\sigma=.1$')
plt.legend();
# Generate some data
np.random.seed(0)
m_true = .75
c_true = 1
x = np.hstack([np.linspace(1, 5, 50), np.linspace(7, 10, 50)])
y = m_true * x + c_true + np.random.rand(len(x))*.2
data = np.vstack([x, y]).T
n = data.shape[0]
outlier_percentage = 50
ind = (np.random.rand(int(n*outlier_percentage/100))*n).astype('int')
data[ind, 1] = np.random.rand(len(ind))*9
plt.figure(figsize=(7,7))
plt.scatter(data[:,0], data[:,1], c='red', alpha=0.9)
plt.xlim([0,10])
plt.ylim([0,10])
plt.title('Noisy data');
theta_guess = [1, 1] # Need an initial guess
# Available losses
losses = [
'linear',
'soft_l1',
'huber',
'cauchy',
'arctan'
]
loss_id = 4 # Pick a loss
def line_fitting_errors(x, data):
m = x[0]
c = x[1]
y_predicted = m * data[:,0] + c
e = y_predicted - data[:,1]
return e
# Non-linear least square fitting
from scipy.optimize import least_squares
retval = least_squares(line_fitting_errors, theta_guess, loss=losses[loss_id], args=[data])
print(f'Reasons for stopping: {retval["message"]}')
print(f'Success: {retval["success"]}')
theta = retval['x']
print(f'#data = {n}')
print(f'theta = {theta}')
def draw_line(theta, ax, **kwargs):
m = theta[0]
c = theta[1]
xmin, xmax = ax.get_xbound()
ymin = m * xmin + c
ymax = m * xmax + c
l = mlines.Line2D([xmin, xmax], [ymin,ymax], **kwargs)
ax.add_line(l)
plt.figure(figsize=(7,7))
ax = plt.gca()
plt.scatter(data[:,0], data[:,1], c='red', alpha=0.9)
plt.xlim([0,10])
plt.ylim([0,10])
plt.title(f'Non-linear least squares. Loss = {losses[loss_id]}')
draw_line(theta, ax, c='black', linewidth=3);
$ \newcommand{\bx}{\mathbf{x}} \newcommand{\bX}{\mathbf{X}} \newcommand{\by}{\mathbf{y}} \newcommand{\bY}{\mathbf{Y}} $
Input feature: $\mathbf{x}^{(i)} = \left(1, x_{1}^{(i)}, x_{2}^{(i)}, \cdots, x_{M}^{(i)} \right)^T$.
For this discussion, we assume $x_{0}^{(i)}=1$ (just to simplify mathematical notation).
Target feature: $y^{(i)}$
Parameters: $\theta = \left( \theta_0, \theta_1, \cdots, \theta_M \right)^T \in \mathbb{R}^{(M+1)}$
Model: $f(\bx) = \bx^T \theta$
Linear regression viewed as a perceptron: each connection represents a model parameter, with a linear activation function.
Loss: $C(\theta) = (\bY - \bX \theta)^T (\bY - \bX \theta)$
Design matrix:
$$ \bX = \left[ \begin{array}{ccc} - & \bx_1^T & -\\ - & \bx_2^T & -\\ & \vdots & \\ - & \bx_N^T & - \end{array} \right] \in \mathbb{R}^{N \times (M+1)} $$
Output:
$$ \bY = \left[ \begin{array}{c} y_1 \\ y_2 \\ \vdots \\ y_N \end{array} \right] \in \mathbb{R}^{N \times 1} $$
Solution:
Solve for $\theta$ by setting $\frac{\partial C}{\partial \theta} = 0$. Since $\frac{\partial C}{\partial \theta} = -2 \bX^T (\bY - \bX \theta)$, this yields the normal equations $\bX^T \bX \, \theta = \bX^T \bY$.
The solution is $\hat \theta = (\bX^T \bX)^{-1} \bX^T \bY$
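A minimal sketch of this solution in NumPy on hypothetical line data; np.linalg.lstsq solves the same least-squares problem without explicitly forming the inverse and is preferable numerically.
# A minimal sketch: theta_hat = (X^T X)^{-1} X^T Y on hypothetical line data.
np.random.seed(0)
x_demo = np.linspace(0, 10, 50)
y_demo = 0.75 * x_demo + 1 + np.random.rand(len(x_demo))
X = np.column_stack([np.ones_like(x_demo), x_demo])    # design matrix: [1, x]
Y = y_demo.reshape(-1, 1)
theta_hat = np.linalg.inv(X.T @ X) @ (X.T @ Y)         # closed-form normal equations
theta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)    # numerically preferred
print(theta_hat.ravel())    # [intercept, slope]
print(theta_lstsq.ravel())  # should agree closely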
How can we construct more complex models?
There are many ways to make linear models more powerful while retaining their nice mathematical properties:
For a cubic polynomial (in 1D)
$\phi_0(x) = 1$
$\phi_1(x) = x$
$\phi_2(x) = x^2$
$\phi_3(x) = x^3$
Then
$$ f(x) = \left[ \begin{array}{cccc} \phi_0(x) & \phi_1(x) & \phi_2(x) & \phi_3(x) \end{array} \right] \left[ \begin{array}{c} \theta_0 \\ \theta_1 \\ \theta_2 \\ \theta_3 \end{array} \right] $$
Polynomials (as seen before)
Sigmoid basis functions
Gaussian basis functions
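A minimal sketch of fitting with basis functions, using the cubic polynomial basis defined above on hypothetical nonlinear data; sigmoid or Gaussian basis functions would change only how the columns of the design matrix are built, not the solve.
# A minimal sketch: build Phi with columns [phi_0(x), ..., phi_3(x)] and solve by least squares.
np.random.seed(0)
x_basis = np.linspace(-1, 1, 40)
y_basis = np.sin(2 * x_basis) + 0.1 * np.random.rand(len(x_basis))   # hypothetical data
Phi = np.column_stack([np.ones_like(x_basis), x_basis, x_basis**2, x_basis**3])
theta_basis, *_ = np.linalg.lstsq(Phi, y_basis, rcond=None)
print(f'theta = {theta_basis}')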
Model complexity and fit
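To make the complexity/fit trade-off concrete, the following hedged sketch fits polynomials of increasing degree to the same hypothetical data; the training error can only decrease as the degree grows, even once the extra flexibility is just fitting noise.
# A minimal sketch: training error versus model complexity (polynomial degree).
np.random.seed(0)
x_mc = np.linspace(0, 1, 15)
y_mc = np.sin(2 * np.pi * x_mc) + 0.2 * np.random.rand(len(x_mc))   # hypothetical data
for degree in [1, 3, 9]:
    coeffs = np.polyfit(x_mc, y_mc, degree)
    sse = np.sum((y_mc - np.polyval(coeffs, x_mc))**2)
    print(f'degree = {degree}, training SSE = {sse:.4f}')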
Q. What about neural networks? How do we deal with model complexity in neural networks?
Given a point $\mathbf{x}^{(*)}$, classify using the following rule:
$$ y^{(*)} = \begin{cases} 1 & \mathrm{if\ } \mathrm{Pr}(y=1 \mid \bx^{(*)},\theta) \ge 0.5 \\ 0 & \mathrm{otherwise} \end{cases} $$
The decision boundary is $\mathbf{x}^T \theta = 0$. Recall that this is where the sigmoid function equals $0.5$.
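Assuming the usual logistic-regression model $\mathrm{Pr}(y=1 \mid \bx, \theta) = \frac{1}{1 + e^{-\bx^T \theta}}$ (an assumption here, since the model itself is not restated above), the rule reduces to thresholding $\bx^T \theta$ at zero; a minimal sketch:
# A minimal sketch of the decision rule: predict 1 when sigmoid(x^T theta) >= 0.5,
# which is exactly when x^T theta >= 0.
def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))
def predict(X, theta):
    # X: (N, M+1) design matrix with a leading column of 1s; theta: (M+1,) parameters.
    scores = X @ theta
    return (sigmoid(scores) >= 0.5).astype(int)   # equivalent to (scores >= 0)
theta_example = np.array([-1.0, 0.5])             # hypothetical, hand-picked parameters
X_star = np.array([[1.0, 1.0], [1.0, 3.0]])
print(predict(X_star, theta_example))             # -> [0 1]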