In statistical modeling, particularly in linear regression, the least squares method is a technique used to estimate the parameters of a linear regression model. The objective is to find the line (or hyperplane in higher dimensions) that minimizes the sum of the squared differences between the observed values and the values predicted by the model. This line is known as the fitted model or regression line.
Let's break down the concept of least squares and how it leads to the fitted model.
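The least squares method is used to find the best-fitting line by minimizing the sum of squared residuals. Residuals are the differences between the observed values ($y_i$) and the predicted values ($\hat{y}_i$) from the regression line. In simple linear regression, the goal is to estimate the parameters (slope $\beta_1$ and intercept $\beta_0$) of the linear model:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
Where:
$y$ is the dependent (response) variable
$x$ is the independent (predictor) variable
$\beta_0$ is the intercept
$\beta_1$ is the slope
$\varepsilon$ is the random error term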
The residual for each data point is given by:
$$e_i = y_i - \hat{y}_i$$
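The objective of the least squares method is to minimize the sum of squared residuals, which is mathematically expressed as:
$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (y_i - \beta_0 - \beta_1 x_i)^2$$
Where:
$n$ is the number of data points
$y_i$ is the observed value for the $i$-th data point
$\hat{y}_i = \beta_0 + \beta_1 x_i$ is the value predicted for the $i$-th data point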
The least squares method minimizes the sum of squared residuals $S$ by adjusting the parameters $\beta_0$ (intercept) and $\beta_1$ (slope) of the linear equation.
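To make the objective concrete, the sketch below evaluates the sum of squared residuals for two candidate lines. The data and the helper name `sum_squared_residuals` are hypothetical and chosen only for illustration.

```python
def sum_squared_residuals(b0, b1, xs, ys):
    """Return the sum over all points of (y_i - (b0 + b1 * x_i))^2."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [55, 61, 64, 71, 74]

# Two candidate lines; the one with the smaller sum fits the data better.
s_a = sum_squared_residuals(50.0, 5.0, hours, scores)
s_b = sum_squared_residuals(40.0, 8.0, hours, scores)
print(s_a, s_b)
```

Least squares searches over all possible intercept and slope pairs for the one that makes this quantity as small as possible.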
Using calculus, we can find the values of $\beta_0$ and $\beta_1$ that minimize the sum of squared residuals. These values are computed as follows:
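The slope estimate $\hat{\beta}_1$ is given by the formula:
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}$$
Where:
$x_i$ and $y_i$ are the observed values for the $i$-th data point
$\bar{x}$ is the mean of the observed $x$ values
$\bar{y}$ is the mean of the observed $y$ values
$n$ is the number of data points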
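Once the slope is found, the intercept estimate $\hat{\beta}_0$ can be computed as:
$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$
Where:
$\bar{y}$ is the mean of the observed $y$ values
$\bar{x}$ is the mean of the observed $x$ values
$\hat{\beta}_1$ is the estimated slope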
Thus, the fitted model (regression line) is given by:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
This is the fitted regression line that minimizes the sum of squared residuals.
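The closed-form estimates can be computed in a few lines. This is a minimal sketch using only the Python standard library; the function name `fit_line` and the data are hypothetical, not part of the original text.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the slope formula.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    if s_xx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [55, 61, 64, 71, 74]
b0, b1 = fit_line(hours, scores)
print(f"fitted line: y_hat = {b0:.2f} + {b1:.2f} * x")
```

The slope is computed first because the intercept formula depends on it. The guard on `s_xx` covers the case where every $x$ value is the same and no unique line exists.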
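The fitted model refers to the regression equation obtained after applying the least squares method. The line of best fit is the line that minimizes the sum of squared differences between the observed values $y_i$ and the predicted values $\hat{y}_i$. The fitted model is represented by the equation:
$$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$$
Where:
$\hat{y}$ is the predicted value of the dependent variable
$\hat{\beta}_0$ is the estimated intercept
$\hat{\beta}_1$ is the estimated slope
$x$ is the value of the independent variable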
The fitted line is used to make predictions about the dependent variable $y$ for any given value of $x$. For example, if $x$ represents years of experience and $y$ represents salary, the fitted model can predict the expected salary for any given number of years of experience.
Let’s say we have a dataset that represents the relationship between the number of study hours ($x$) and the test scores ($y$) of a group of students. After performing linear regression, we obtain the fitted model:
$$\hat{y} = 50 + 5x$$
This means that for each additional hour of study ($x$), the predicted test score ($\hat{y}$) increases by 5 points. The intercept of 50 suggests that a student who does not study at all ($x = 0$) is expected to have a baseline test score of 50.
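As a small illustration, the fitted model from this example (intercept 50, slope 5) can be written as a function and evaluated for a few inputs. The function name `predict_score` is hypothetical.

```python
def predict_score(study_hours):
    """Predicted test score from the example model: 50 + 5 * hours."""
    return 50 + 5 * study_hours


baseline = predict_score(0)       # student who does not study
four_hours = predict_score(4)
five_hours = predict_score(5)
print(baseline, four_hours, five_hours)
```

The difference between the predictions for 5 and 4 hours equals the slope, which is the sense in which each extra hour adds 5 points to the predicted score.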
Once the least squares method is used to estimate the parameters, it’s important to assess how well the fitted model represents the data. This can be done using several metrics:
Residuals are the differences between the observed values and the predicted values:
$$e_i = y_i - \hat{y}_i$$
By examining the residuals, we can check the assumptions of the regression model, such as homoscedasticity (constant variance) and independence of errors.
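The sketch below computes the residuals for a small hypothetical dataset. It also checks two properties that hold for any least squares line with an intercept: the residuals sum to zero, and they are uncorrelated with $x$ in the sample. It does not test homoscedasticity or independence itself; those are usually judged by plotting the residuals against the fitted values or the observation order.

```python
# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [55, 61, 64, 71, 74]

# Fit the line with the closed-form least squares estimates.
n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(scores) / n
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(hours, scores))
      / sum((x - x_bar) ** 2 for x in hours))
b0 = y_bar - b1 * x_bar

# Residual = observed value minus predicted value.
residuals = [y - (b0 + b1 * x) for x, y in zip(hours, scores)]

residual_sum = sum(residuals)                              # ~0 up to rounding
cross_sum = sum(x * e for x, e in zip(hours, residuals))   # ~0 up to rounding
print([round(e, 2) for e in residuals])
```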
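$R^2$ (the coefficient of determination) is a key metric that tells us how well the fitted model explains the variability in the dependent variable. It is the proportion of the variance in the dependent variable that is explained by the independent variable.
$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$
Where:
$SS_{\text{res}} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the residual sum of squares
$SS_{\text{tot}} = \sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares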
An $R^2$ value close to 1 indicates that the model explains most of the variance in the data, while an $R^2$ value close to 0 suggests that the model does not explain much of the variance.
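A minimal sketch of the calculation, using the standard definition of the coefficient of determination as one minus the ratio of the residual sum of squares to the total sum of squares. The function name `r_squared` and the data are hypothetical.

```python
def r_squared(xs, ys):
    """Fit a least squares line to (xs, ys) and return its R^2."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    intercept = y_bar - slope * x_bar
    # Variation left unexplained by the line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation of y around its mean.
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [55, 61, 64, 71, 74]
r2 = r_squared(hours, scores)
print(f"R^2 = {r2:.4f}")
```

This sketch assumes the $y$ values are not all identical; otherwise the total sum of squares is zero and the ratio is undefined.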
In addition to the fitted model, hypothesis tests can be performed on the parameters $\beta_0$ and $\beta_1$ to assess whether they are statistically significantly different from zero. Typically, this is done using t-tests for the individual regression coefficients.
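The t-statistic for the slope can be sketched as the slope estimate divided by its standard error, under the usual simple linear regression assumptions. The function name `slope_t_statistic` and the data are hypothetical. The snippet stops at the t-statistic: turning it into a p-value requires the t distribution with $n - 2$ degrees of freedom, which is not computed here.

```python
import math


def slope_t_statistic(xs, ys):
    """Return (slope, standard_error, t_statistic) for testing slope = 0."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least three observations (n - 2 > 0)")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / s_xx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    # Residual variance estimate with n - 2 degrees of freedom.
    sigma2_hat = ss_res / (n - 2)
    std_err = math.sqrt(sigma2_hat / s_xx)
    return slope, std_err, slope / std_err


# Hypothetical data: hours studied and test scores.
hours = [1, 2, 3, 4, 5]
scores = [55, 61, 64, 71, 74]
slope, std_err, t_stat = slope_t_statistic(hours, scores)
print(f"slope = {slope:.2f}, SE = {std_err:.4f}, t = {t_stat:.2f}")
```

A large absolute t-statistic relative to the critical value for $n - 2$ degrees of freedom is evidence that the slope differs from zero.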
While the least squares method provides a useful tool for fitting a linear regression model, there are some important limitations: