**Introduction**

Linear regression is a powerful tool used in supervised machine learning to predict a continuous, real-valued output based on input data. This statistical method is commonly used for predictive analysis in a wide range of applications, such as stock market forecasting, sales projection, salary estimation, and product pricing.

In linear regression, the predicted output is modeled as a linear function of the input features, with a constant slope that remains the same across the entire range of input values. The goal of linear regression is to find the optimal values for the model’s parameters, which can be used to make accurate predictions on new, unseen data.

**There are two categories**

i. Simple regression

ii. Multivariable regression

**Simple regression example**

Let’s say we have a dataset that records how much each company spends on radio advertising per year and its annual sales in units sold. Using simple linear regression, we can fit an equation that predicts the number of units sold from the amount spent on radio advertising.

To do this, we use the traditional slope-intercept equation,

**y = mx+b**

where y represents our prediction, m represents the slope, x represents the input data, and b represents the y-intercept. By optimizing the values of m and b, we can make accurate predictions about how much the company will sell based on its radio advertising budget.

| Company | Radio ($) | Sales |
|---|---|---|
| Netflix | 37.8 | 22.1 |
| Microsoft | 39.3 | 10.4 |
| | 45.9 | 18.3 |
| | 41.3 | 18.5 |

**Making predictions**

The prediction function in linear regression estimates sales based on radio advertising spend and current weight and bias values.

**Sales=Weight⋅Radio+Bias**

**Weight**

In machine learning, the coefficients of the independent variables are called weights. Here, the weight on the radio variable represents the effect of radio advertising spend on sales.

**Radio**

In machine learning, independent variables are called features. For example, radio advertising spending is a feature that can impact sales.

**Bias**

The point where the line crosses the y-axis (the intercept) is called the bias. The bias offsets all predictions by a constant amount.

Our algorithm learns the correct values for weight and bias during training to approximate the line of best fit for accurate predictions.

**Code**

```python
def predict_sales(radio, weight, bias):
    return weight*radio + bias
```
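As a quick usage sketch, the prediction function can be applied to the radio figures from the table above. The `weight` and `bias` values here are arbitrary placeholders chosen for illustration, not learned values.

```python
def predict_sales(radio, weight, bias):
    # Linear prediction: Sales = Weight * Radio + Bias
    return weight * radio + bias

# Radio spend values taken from the example table
radio_spend = [37.8, 39.3, 45.9, 41.3]

# Assumption: placeholder parameters, not the result of training
weight, bias = 0.5, 1.0

predictions = [predict_sales(r, weight, bias) for r in radio_spend]
print(predictions)
```

With untrained parameters like these, the predictions will generally be far from the actual sales figures, which is what the cost function is meant to quantify.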

**Cost function**

The prediction function is simple; what drives training is the cost function, which tells us how far our predictions are from the actual values and is what we minimize to optimize the weights. The MSE (L2) cost function measures the average squared difference between actual and predicted values, and the goal is to minimize it to improve model accuracy.

**Math**

For a simple linear equation y = mx + b, we can calculate MSE (L2) as:

$$MSE = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (mx_i + b)\right)^2$$

**Note**

In this formula, $N$ is the total number of observations, and $\frac{1}{N}\sum_{i=1}^{N}$ takes the mean. $y_i$ is the actual value of an observation and $mx_i + b$ is our prediction; the difference between the two is what is squared and averaged.

**Code**

```python
def cost_function(radio, sales, weight, bias):
    companies = len(radio)
    total_error = 0.0
    for i in range(companies):
        # Squared difference between actual and predicted sales
        total_error += (sales[i] - (weight*radio[i] + bias))**2
    return total_error / companies
```
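To make the cost concrete, the sketch below evaluates MSE on the four rows of the example table for two parameter settings. Both settings are arbitrary guesses picked for illustration, not values from training.

```python
def cost_function(radio, sales, weight, bias):
    # Mean squared error between actual and predicted sales
    n = len(radio)
    total_error = 0.0
    for i in range(n):
        total_error += (sales[i] - (weight * radio[i] + bias)) ** 2
    return total_error / n

# Data from the example table
radio = [37.8, 39.3, 45.9, 41.3]
sales = [22.1, 10.4, 18.3, 18.5]

# Assumption: two hand-picked parameter guesses for comparison
cost_zero = cost_function(radio, sales, weight=0.0, bias=0.0)   # roughly 318.43
cost_guess = cost_function(radio, sales, weight=0.4, bias=0.0)  # roughly 20.24

print(cost_zero, cost_guess)
```

The second guess gives a much lower cost, which is the kind of improvement that training tries to find automatically.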

**Gradient descent**

To minimize MSE (L2), we use gradient descent, which relies on the gradient of the cost function. Taking the derivative of the cost function gives us the gradient, or slope, of the cost at our current weight. We then move the weight in the opposite direction of the gradient to decrease the error.
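The idea of stepping against the gradient can be seen on a toy one-parameter problem before applying it to regression. The function $(w - 3)^2$, the starting point, and the learning rate below are all assumptions made purely for illustration.

```python
def toy_cost(w):
    # Hypothetical cost with its minimum at w = 3
    return (w - 3) ** 2

def toy_gradient(w):
    # Derivative of (w - 3)^2 with respect to w
    return 2 * (w - 3)

w = 0.0              # assumed starting point
learning_rate = 0.1  # assumed step size

for _ in range(50):
    # Move in the opposite direction of the gradient
    w -= learning_rate * toy_gradient(w)

print(w, toy_cost(w))
```

Each step shrinks the distance to the minimum by a constant factor here, so `w` ends up very close to 3.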

**Math**

There are two parameters we can control in our cost function: the weight (m) and the bias (b). Since we need to consider the impact each one has on the final prediction, we use partial derivatives. The chain rule is used to find them, because $(y - (mx + b))^2$ is really two nested functions. The inner function is $y - (mx + b)$, and the outer function is $x^2$.

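Returning to our cost function:

$$f(m,b) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (mx_i + b)\right)^2$$

Using the following:

$$\left(y_i - (mx_i + b)\right)^2 = A(B(m,b))$$

We can split the derivative into

$$A(x) = x^2, \qquad \frac{df}{dx} = A'(x) = 2x$$

and

$$B(m,b) = y_i - (mx_i + b), \qquad \frac{dx}{dm} = B'(m) = -x_i, \qquad \frac{dx}{db} = B'(b) = -1$$

And then using the chain rule, which states:

$$\frac{df}{dm} = \frac{df}{dx}\frac{dx}{dm}, \qquad \frac{df}{db} = \frac{df}{dx}\frac{dx}{db}$$

We then plug in each of the parts to get the following derivatives:

$$\frac{df}{dm} = A'(B(m,b))\,B'(m) = 2\left(y_i - (mx_i + b)\right)\cdot(-x_i)$$

$$\frac{df}{db} = A'(B(m,b))\,B'(b) = 2\left(y_i - (mx_i + b)\right)\cdot(-1)$$

We can calculate the gradient of this cost function as:

$$f'(m,b) = \begin{bmatrix} \frac{df}{dm} \\ \frac{df}{db} \end{bmatrix} = \begin{bmatrix} \frac{1}{N}\sum_{i=1}^{N} -2x_i\left(y_i - (mx_i + b)\right) \\ \frac{1}{N}\sum_{i=1}^{N} -2\left(y_i - (mx_i + b)\right) \end{bmatrix}$$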

**Code**

To optimize our model, we need to solve for the gradient using the partial derivatives of our cost function with respect to the weights and bias. This is done by iterating through our data points and taking the average of the partial derivatives. The resulting gradient tells us the slope of our cost function and the direction we should update to reduce our cost. We move in the opposite direction of the gradient, and the size of our update is controlled by the learning rate. By repeating this process, we can improve the accuracy of our model and make better predictions.

```python
def update_weights(radio, sales, weight, bias, learning_rate):
    weight_deriv = 0
    bias_deriv = 0
    companies = len(radio)

    for i in range(companies):
        # Calculate partial derivatives
        # -2x(y - (mx + b))
        weight_deriv += -2*radio[i] * (sales[i] - (weight*radio[i] + bias))

        # -2(y - (mx + b))
        bias_deriv += -2*(sales[i] - (weight*radio[i] + bias))

    # We subtract because the derivatives point in direction of steepest ascent
    weight -= (weight_deriv / companies) * learning_rate
    bias -= (bias_deriv / companies) * learning_rate

    return weight, bias
```
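One way to sanity-check the partial derivatives used in `update_weights` is to compare them against a numerical approximation of the slope. The sketch below does this with central finite differences on the example data; the helper names, the evaluation point, and the step size `eps` are assumptions introduced for illustration.

```python
# Data from the example table
radio = [37.8, 39.3, 45.9, 41.3]
sales = [22.1, 10.4, 18.3, 18.5]

def cost_function(radio, sales, weight, bias):
    n = len(radio)
    return sum((sales[i] - (weight * radio[i] + bias)) ** 2 for i in range(n)) / n

def analytic_gradient(radio, sales, weight, bias):
    # Averages of -2x(y - (mx + b)) and -2(y - (mx + b))
    n = len(radio)
    d_weight = sum(-2 * radio[i] * (sales[i] - (weight * radio[i] + bias)) for i in range(n)) / n
    d_bias = sum(-2 * (sales[i] - (weight * radio[i] + bias)) for i in range(n)) / n
    return d_weight, d_bias

def numeric_gradient(radio, sales, weight, bias, eps=1e-5):
    # Central differences: (f(p + eps) - f(p - eps)) / (2 * eps)
    d_weight = (cost_function(radio, sales, weight + eps, bias)
                - cost_function(radio, sales, weight - eps, bias)) / (2 * eps)
    d_bias = (cost_function(radio, sales, weight, bias + eps)
              - cost_function(radio, sales, weight, bias - eps)) / (2 * eps)
    return d_weight, d_bias

# Assumption: an arbitrary point at which to compare the two gradients
analytic = analytic_gradient(radio, sales, 0.1, 0.5)
numeric = numeric_gradient(radio, sales, 0.1, 0.5)
print(analytic, numeric)
```

If the two results agree closely, the hand-derived formulas are very likely correct.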

**Training**

Training a model involves improving the prediction equation by updating the weight and bias values using the gradient of the cost function. We loop through the dataset multiple times, each time moving the weight and bias against the gradient, until we reach an acceptable error threshold. Before training, we need to set initial weight and bias values, choose hyperparameters such as the learning rate and number of iterations, and prepare to track our progress.

**Code**

```python
def train(radio, sales, weight, bias, learning_rate, iters):
    cost_history = []

    for i in range(iters):
        weight, bias = update_weights(radio, sales, weight, bias, learning_rate)

        # Calculate cost for auditing purposes
        cost = cost_function(radio, sales, weight, bias)
        cost_history.append(cost)

        # Log progress
        if i % 10 == 0:
            print("iter={:d} weight={:.2f} bias={:.4f} cost={:.2f}".format(i, weight, bias, cost))

    return weight, bias, cost_history
```
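Putting the pieces together, the following self-contained sketch runs the training loop on the four rows of the example table. The starting values, the learning rate of 0.0001, and the 50 iterations are assumptions chosen so that the updates stay stable on radio values of this size; they are not settings taken from the original experiment, and four rows are far too few for a meaningful fit.

```python
# Data from the example table
radio = [37.8, 39.3, 45.9, 41.3]
sales = [22.1, 10.4, 18.3, 18.5]

def cost_function(radio, sales, weight, bias):
    n = len(radio)
    return sum((sales[i] - (weight * radio[i] + bias)) ** 2 for i in range(n)) / n

def update_weights(radio, sales, weight, bias, learning_rate):
    n = len(radio)
    weight_deriv = 0.0
    bias_deriv = 0.0
    for i in range(n):
        error = sales[i] - (weight * radio[i] + bias)
        weight_deriv += -2 * radio[i] * error
        bias_deriv += -2 * error
    # Step against the averaged gradient
    weight -= (weight_deriv / n) * learning_rate
    bias -= (bias_deriv / n) * learning_rate
    return weight, bias

def train(radio, sales, weight, bias, learning_rate, iters):
    cost_history = []
    for _ in range(iters):
        weight, bias = update_weights(radio, sales, weight, bias, learning_rate)
        cost_history.append(cost_function(radio, sales, weight, bias))
    return weight, bias, cost_history

# Assumption: zero initial parameters and hand-picked hyperparameters
weight, bias, cost_history = train(radio, sales, 0.0, 0.0,
                                   learning_rate=0.0001, iters=50)
print(weight, bias, cost_history[0], cost_history[-1])
```

A noticeably larger learning rate can make the cost grow instead of shrink on unscaled inputs like these, so it is worth watching `cost_history` when changing it.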

**Model evaluation**

With each iteration, the goal is to reduce the cost function. If the model is working, we should observe a decrease in the cost after each iteration.

**Logging**

```
iter=1 weight=.03 bias=.0014 cost=197.25
iter=10 weight=.28 bias=.0116 cost=74.65
iter=20 weight=.39 bias=.0177 cost=49.48
iter=30 weight=.44 bias=.0219 cost=44.31
iter=40 weight=.46 bias=.0249 cost=43.28
```
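The expectation that the cost falls over time can also be checked programmatically. The helper below is a hypothetical name introduced for this sketch, and it is applied to the cost values from the log above.

```python
def is_strictly_decreasing(values):
    # True if every value is smaller than the one before it
    return all(later < earlier for earlier, later in zip(values, values[1:]))

# Cost values copied from the training log
logged_costs = [197.25, 74.65, 49.48, 44.31, 43.28]

print(is_strictly_decreasing(logged_costs))  # True
```

A result of `False` on a real cost history would suggest revisiting the learning rate or the gradient code.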

**Visualizing**

**Cost history**

**Summary**

By optimizing the weight and bias values through training, we can now use the learned equation to make predictions on new data. In this case, the equation is

Sales = 0.55 * Radio + 0.32

which allows us to estimate future sales based on our investment in radio advertising.
