    Probability and Statistics
    MS-251
    Topic 32 of 36

    Regression: Linear Regression and Correlation

    In statistics, regression and correlation are methods used to explore the relationship between two or more variables. While both techniques deal with relationships between variables, they serve different purposes and are used in different contexts.

    Linear Regression is a method used for predicting the value of a dependent variable based on the value of one or more independent variables. Correlation, on the other hand, measures the strength and direction of a linear relationship between two variables.

    1. Linear Regression

    Linear regression is a statistical technique used to model the relationship between a dependent variable and one or more independent variables. The goal of linear regression is to find the best-fitting straight line that describes this relationship.

    a. Simple Linear Regression

    Simple linear regression involves modeling the relationship between two variables: one dependent variable ($Y$) and one independent variable ($X$). The relationship is assumed to be linear, meaning it can be described by a straight line.

    The equation for a simple linear regression model is:

    $$Y = \beta_0 + \beta_1 X + \epsilon$$

    Where:

    • $Y$ is the dependent variable (the variable being predicted),
    • $X$ is the independent variable (the variable used to predict $Y$),
    • $\beta_0$ is the intercept of the regression line (the value of $Y$ when $X = 0$),
    • $\beta_1$ is the slope of the regression line (the change in $Y$ for a unit change in $X$),
    • $\epsilon$ is the error term (captures the unexplained variation).

    b. Estimating the Regression Parameters

    The parameters $\beta_0$ and $\beta_1$ are estimated using the least squares method, which minimizes the sum of squared differences between the observed values and the values predicted by the regression line. The formulas for estimating the slope ($\hat{\beta_1}$) and the intercept ($\hat{\beta_0}$) are:

    $$\hat{\beta_1} = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2}$$

    $$\hat{\beta_0} = \bar{Y} - \hat{\beta_1} \bar{X}$$

    Where:

    • $n$ is the number of data points,
    • $X_i$ and $Y_i$ are the individual data points,
    • $\bar{X}$ and $\bar{Y}$ are the sample means of $X$ and $Y$.
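
    The summation formulas above translate almost line for line into code. The following is a minimal sketch in plain Python, not a definitive implementation; the function name `fit_simple_linear_regression` and the small data set (hours studied and exam scores) are assumptions made up for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (beta0_hat, beta1_hat) using the summation formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")

    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)

    # Denominator of the slope formula; zero when every X value is the same.
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all X values are identical; the slope is undefined")

    beta1_hat = (n * sum_xy - sum_x * sum_y) / denom
    # Intercept: Y-bar minus slope times X-bar.
    beta0_hat = sum_y / n - beta1_hat * sum_x / n
    return beta0_hat, beta1_hat


# Hypothetical data, invented for this example.
hours = [1, 2, 3, 4, 5]
scores = [2, 4, 5, 4, 5]

beta0_hat, beta1_hat = fit_simple_linear_regression(hours, scores)
print(f"intercept = {beta0_hat:.3f}, slope = {beta1_hat:.3f}")
```

    For this data the sums are $\sum X_i = 15$, $\sum Y_i = 20$, $\sum X_i Y_i = 66$ and $\sum X_i^2 = 55$, so the slope is $(330 - 300)/(275 - 225) = 0.6$ and the intercept is $4 - 0.6 \times 3 = 2.2$, up to floating-point rounding.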

    c. Interpreting the Model

    Once the regression coefficients are estimated, the regression equation can be used to predict the dependent variable $Y$ for new values of $X$. The slope $\hat{\beta_1}$ is the estimated change in the mean of $Y$ for a one-unit increase in $X$. The intercept $\hat{\beta_0}$ is the predicted value of $Y$ when $X = 0$; it has a practical interpretation only when $X = 0$ lies within or near the range of the observed data.
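
    The interpretation of the two coefficients can be checked numerically. The sketch below assumes hypothetical estimates ($\hat{\beta_0} = 2.2$, $\hat{\beta_1} = 0.6$) chosen only for illustration; the helper name `predict` is likewise an assumption.

```python
def predict(beta0_hat, beta1_hat, x):
    """Point prediction from the fitted line: beta0_hat + beta1_hat * x."""
    return beta0_hat + beta1_hat * x


# Hypothetical estimates, chosen only for illustration.
beta0_hat, beta1_hat = 2.2, 0.6

y_at_3 = predict(beta0_hat, beta1_hat, 3)
y_at_4 = predict(beta0_hat, beta1_hat, 4)

# A one-unit increase in X changes the prediction by the slope.
print("change per unit of X:", round(y_at_4 - y_at_3, 6))

# At X = 0 the prediction is the intercept itself.
print("prediction at X = 0:", predict(beta0_hat, beta1_hat, 0))
```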

    d. Assumptions of Linear Regression

    For simple linear regression to provide valid results, several assumptions must be met:

    1. Linearity: The mean of $Y$ is a linear function of $X$.
    2. Independence: The errors $\epsilon$ are independent of each other.
    3. Homoscedasticity: The variance of the errors is constant across all values of $X$.
    4. Normality: The errors are normally distributed. This is needed for inference (e.g., hypothesis tests and confidence intervals), not for computing the least squares estimates themselves.
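
    The errors cannot be observed directly, so in practice these assumptions are usually examined through the residuals (observed minus predicted values). The sketch below only computes the residuals, which are the raw material for such checks (for example, plotting them against $X$ and looking for curvature or a changing spread). The data and the helper name `fit_line` are assumptions for illustration; the fit uses the mean-centered form of the slope, which is algebraically equivalent to the summation formula.

```python
def fit_line(xs, ys):
    """Least squares fit in mean-centered form; returns (beta0_hat, beta1_hat)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1_hat = s_xy / s_xx
    beta0_hat = y_bar - beta1_hat * x_bar
    return beta0_hat, beta1_hat


# Hypothetical data, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

beta0_hat, beta1_hat = fit_line(xs, ys)

# Residual = observed value minus value predicted by the fitted line.
residuals = [y - (beta0_hat + beta1_hat * x) for x, y in zip(xs, ys)]

for x, e in zip(xs, residuals):
    print(f"x = {x}  residual = {e:+.2f}")

# With an intercept in the model, least squares residuals sum to zero
# (up to floating-point rounding).
print("sum of residuals is ~0:", abs(sum(residuals)) < 1e-9)
```

    A residual table like this does not prove or disprove any assumption on its own; with only five points it is just a demonstration of the computation.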

    e. R-squared ($R^2$): The Coefficient of Determination

    $R^2$ is a key measure that indicates how well the regression model fits the data. It represents the proportion of the variance in the dependent variable $Y$ that is explained by the independent variable $X$.

    $$R^2 = 1 - \frac{\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$$

    Where:

    • $Y_i$ are the observed values,
    • $\hat{Y}_i$ are the predicted values from the regression equation,
    • $\bar{Y}$ is the mean of $Y$.

    An $R^2$ value close to 1 means that a large proportion of the variance in $Y$ is explained by $X$, indicating a good fit. An $R^2$ value close to 0 means the model does not explain much of the variance in $Y$.
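
    The $R^2$ formula is a ratio of two sums of squares, which the following sketch computes directly. The function name `r_squared`, the data, and the fitted line $\hat{Y} = 2.2 + 0.6X$ are assumptions used only for illustration.

```python
def r_squared(ys, y_hats):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("Y is constant; R^2 is undefined")
    return 1 - ss_res / ss_tot


# Hypothetical data and a hypothetical fitted line Y_hat = 2.2 + 0.6 * X.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
y_hats = [2.2 + 0.6 * x for x in xs]

r2 = r_squared(ys, y_hats)
print(f"R^2 = {r2:.3f}")
```

    Here the residual sum of squares is about 2.4 and the total sum of squares is 6, so $R^2 \approx 1 - 2.4/6 = 0.6$: the line accounts for roughly 60% of the variation in $Y$ for this small sample.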


    2. Correlation

    Correlation measures the strength and direction of the linear relationship between two variables. It is represented by the correlation coefficient, denoted by $r$, which ranges from $-1$ to $1$.

    a. Pearson’s Correlation Coefficient

    The most commonly used measure of correlation is Pearson's correlation coefficient $r$. It is computed from the same sums that appear in the least squares slope formula:

    $$r = \frac{n \sum_{i=1}^{n} X_i Y_i - \sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{\sqrt{ \left( n \sum_{i=1}^{n} X_i^2 - \left( \sum_{i=1}^{n} X_i \right)^2 \right) \left( n \sum_{i=1}^{n} Y_i^2 - \left( \sum_{i=1}^{n} Y_i \right)^2 \right) }}$$

    Where:

    • $X_i$ and $Y_i$ are the individual data points,
    • $n$ is the number of data points.
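
    As with the regression slope, Pearson's formula can be evaluated from a handful of running sums. The sketch below assumes a hypothetical helper `pearson_r` and made-up data; it is meant to mirror the formula, not to replace a statistics library.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient from the summation formula."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")

    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)

    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    if denominator == 0:
        raise ValueError("one variable is constant; r is undefined")
    return numerator / denominator


# Hypothetical data, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = pearson_r(xs, ys)
print(f"r = {r:.4f}, r^2 = {r * r:.4f}")
```

    For this data the numerator is 30 and the denominator is $\sqrt{50 \times 30} = \sqrt{1500}$, giving $r \approx 0.775$. In simple linear regression, $r^2$ equals the coefficient of determination $R^2$ of the fitted line, which is why squaring $r$ here gives about 0.6.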

    b. Interpreting the Correlation Coefficient

    • $r = 1$: Perfect positive linear relationship.
    • $r = -1$: Perfect negative linear relationship.
    • $r = 0$: No linear relationship (a nonlinear relationship may still exist).
    • $0 < r < 1$: Positive linear relationship; as one variable increases, the other tends to increase.
    • $-1 < r < 0$: Negative linear relationship; as one variable increases, the other tends to decrease.
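
    The boundary cases in this list can be reproduced with tiny constructed data sets. The sketch below is illustrative only: the helper `pearson_r` and all three data sets are assumptions. The third case shows that $r = 0$ rules out a linear relationship but not a relationship altogether, since $Y = X^2$ is fully determined by $X$.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient in mean-centered form."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [-2, -1, 0, 1, 2]

increasing_line = [2 * x + 1 for x in xs]   # exact line with positive slope
decreasing_line = [1 - 2 * x for x in xs]   # exact line with negative slope
parabola = [x * x for x in xs]              # exact, but nonlinear, relationship

r_up = pearson_r(xs, increasing_line)
r_down = pearson_r(xs, decreasing_line)
r_curve = pearson_r(xs, parabola)

print(f"increasing line: r = {r_up:.2f}")
print(f"decreasing line: r = {r_down:.2f}")
print(f"parabola:        r = {r_curve:.2f}")
```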

    c. Correlation vs. Causation

    It is important to note that correlation does not imply causation. A high correlation between two variables does not mean that one variable causes the other to change. It only means that there is a statistical relationship between the variables. Other factors, such as confounding variables, may be at play.


    3. Comparing Linear Regression and Correlation

    While both linear regression and correlation deal with relationships between variables, they serve different purposes:

    • Linear Regression:
      • Used to model and predict the value of one variable (dependent variable) based on another (independent variable).
      • Provides a predictive equation with parameters (slope and intercept) that can be used to make predictions.
      • Treats the variables asymmetrically: $Y$ is the response and $X$ is the predictor, so regressing $Y$ on $X$ generally gives a different line than regressing $X$ on $Y$. This asymmetry does not by itself establish that $X$ causes $Y$.
    • Correlation:
      • Measures the strength and direction of the linear relationship between two variables.
      • Is symmetric: the correlation of $X$ with $Y$ equals the correlation of $Y$ with $X$, and neither variable is designated as the response.
      • Provides a single number, the correlation coefficient, which quantifies the degree of linear relationship between two variables.
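
    One way to see the difference between the two methods is to swap the roles of the variables. In the sketch below (hypothetical data and helper names), the correlation coefficient is unchanged by the swap, while the regression slope is not. The mean-centered sums used here are algebraically equivalent to the summation formulas given earlier.

```python
import math


def centered_sums(xs, ys):
    """Return (S_xy, S_xx, S_yy), the sums of centered products and squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy, s_xx, s_yy


def slope(predictor, response):
    """Least squares slope when regressing `response` on `predictor`."""
    s_xy, s_xx, _ = centered_sums(predictor, response)
    return s_xy / s_xx


def pearson_r(xs, ys):
    """Pearson correlation coefficient."""
    s_xy, s_xx, s_yy = centered_sums(xs, ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical data, invented for this example.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope_y_on_x = slope(xs, ys)
slope_x_on_y = slope(ys, xs)
r_xy = pearson_r(xs, ys)
r_yx = pearson_r(ys, xs)

print(f"slope of Y on X: {slope_y_on_x:.3f}")
print(f"slope of X on Y: {slope_x_on_y:.3f}")
print(f"r(X, Y) = {r_xy:.4f}, r(Y, X) = {r_yx:.4f}")
# The two slopes multiply to r^2, which ties the two methods together.
print(f"product of slopes: {slope_y_on_x * slope_x_on_y:.3f}")
```

    For this data the two slopes come out near 0.6 and 1.0, while $r$ is about 0.775 in both directions.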

    Summary

    • Linear regression is used to model and predict relationships between variables, typically with one dependent variable and one or more independent variables. The key output is the regression equation, and the fit of the model is assessed using $R^2$.

    • Correlation is used to measure the strength and direction of a linear relationship between two variables, using the correlation coefficient $r$. It does not imply causality but simply quantifies the relationship.

    Understanding both regression and correlation is crucial for analyzing relationships between variables, predicting outcomes, and making data-driven decisions in fields like economics, medicine, and social sciences.
