8.5.2 The First Method for Finding $\beta_0$ and $\beta_1$
Given the observations $(x_1,y_1)$, $(x_2,y_2)$, $\cdots$, $(x_n,y_n)$, we can write the regression line as \begin{align} \hat{y} = \beta_0+\beta_1 x. \end{align} We can estimate $\beta_0$ and $\beta_1$ as \begin{align} &\hat{\beta_1}=\frac{s_{xy}}{s_{xx}},\\ &\hat{\beta_0}=\overline{y}-\hat{\beta_1} \overline{x}, \end{align} where \begin{align} &s_{xx}=\sum_{i=1}^n (x_i-\overline{x})^2,\\ &s_{xy}=\sum_{i=1}^{n} (x_i-\overline{x})(y_i-\overline{y}). \end{align} For each $x_i$, the fitted value $\hat{y}_i$ is obtained by \begin{align} \hat{y}_i = \hat{\beta_0}+\hat{\beta_1} x_i. \end{align} The quantities \begin{align} e_i=y_i-\hat{y}_i \end{align} are called the residuals.
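To see how these formulas translate into computation, here is a minimal MATLAB sketch (the variable names sxx, sxy, and so on are our own) that computes $\hat{\beta_1}$ and $\hat{\beta_0}$ directly from data vectors; it uses the data of the example below, so it reproduces the numbers worked out there.
x = [1; 2; 3; 4];                     % observed x values
y = [3; 4; 8; 9];                     % observed y values
xbar = mean(x);                       % sample mean of x
ybar = mean(y);                       % sample mean of y
sxx = sum((x - xbar).^2);             % s_xx
sxy = sum((x - xbar).*(y - ybar));    % s_xy
beta1 = sxy / sxx;                    % slope estimate, beta1-hat
beta0 = ybar - beta1 * xbar;          % intercept estimate, beta0-hat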
Example
Consider the following observed values of $(x_i,y_i)$: \begin{equation} (1,3) \quad (2,4) \quad (3,8) \quad (4,9) \end{equation}
- Find the estimated regression line \begin{align} \hat{y} = \hat{\beta_0}+\hat{\beta_1} x, \end{align} based on the observed data.
- For each $x_i$, compute the fitted value of $y_i$ using \begin{align} \hat{y}_i = \hat{\beta_0}+\hat{\beta_1} x_i. \end{align}
- Compute the residuals, $e_i=y_i-\hat{y}_i$ and note that \begin{align} \sum_{i=1}^{4} e_i =0. \end{align}
- Solution
- We have \begin{align} &\overline{x}=\frac{1+2+3+4}{4}=2.5,\\ &\overline{y}=\frac{3+4+8+9}{4}=6,\\ &s_{xx}=(1-2.5)^2+(2-2.5)^2+(3-2.5)^2+(4-2.5)^2=5,\\ &s_{xy}=(1-2.5)(3-6)+(2-2.5)(4-6)+(3-2.5)(8-6)+(4-2.5)(9-6)=11. \end{align} Therefore, we obtain \begin{align} &\hat{\beta_1}=\frac{s_{xy}}{s_{xx}}=\frac{11}{5}=2.2 \\ &\hat{\beta_0}=6-(2.2) (2.5)=0.5 \end{align}
- The fitted values are given by \begin{align} \hat{y}_i = 0.5+2.2 x_i, \end{align} so we obtain \begin{align} \hat{y}_1 =2.7, \quad \hat{y}_2 =4.9, \quad \hat{y}_3 =7.1, \quad \hat{y}_4 =9.3 \end{align}
- We have \begin{align} &e_1=y_1-\hat{y}_1=3-2.7=0.3,\\ &e_2=y_2-\hat{y}_2=4-4.9=-0.9,\\ &e_3=y_3-\hat{y}_3=8-7.1=0.9,\\ &e_4=y_4-\hat{y}_4=9-9.3=-0.3 \end{align} So, $e_1+e_2+e_3+e_4=0$.
We can use MATLAB or other software packages to perform regression analysis. For example, the following MATLAB code can be used to obtain the estimated regression line in Example 8.31 (the regress function is part of MATLAB's Statistics and Machine Learning Toolbox).
x=[1;2;3;4];              % observed x values
x0=ones(size(x));         % column of ones for the intercept term
y=[3;4;8;9];              % observed y values
beta = regress(y,[x0,x]); % beta(1) is beta0-hat, beta(2) is beta1-hat
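Running this code returns beta = [0.5; 2.2], that is, $\hat{\beta_0}=0.5$ and $\hat{\beta_1}=2.2$, in agreement with the hand computation above. The fitted values and residuals can then be checked with two more lines (a minimal continuation of the block above):
yhat = [x0, x] * beta;    % fitted values: 2.7, 4.9, 7.1, 9.3
e = y - yhat;             % residuals: 0.3, -0.9, 0.9, -0.3; sum(e) is 0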
Coefficient of Determination ($R$-Squared):
Let's look again at the above model for regression. We wrote \begin{align} Y = \beta_0+\beta_1 X +\epsilon, \end{align} where $\epsilon$ is a $N(0,\sigma^2)$ random variable independent of $X$. Note that, here, $X$ is the only variable that we observe, so we estimate $Y$ using $X$. That is, we can write \begin{align} \hat{Y} = \beta_0+\beta_1 X. \end{align} The error in our estimate is \begin{align} Y-\hat{Y}=\epsilon. \end{align} Note that the randomness in $Y$ comes from two sources: $X$ and $\epsilon$. More specifically, if we look at $\textrm{Var}(Y)$, we can write \begin{align} \textrm{Var}(Y) &= \beta_1^2 \textrm{Var}(X) +\textrm{Var}(\epsilon) \quad (\textrm{since $X$ and $\epsilon$ are assumed to be independent}). \end{align} The above equation can be interpreted as follows. The total variation in $Y$ can be divided into two parts. The first part, $\beta_1^2 \textrm{Var}(X)$, is due to variation in $X$. The second part, $\textrm{Var}(\epsilon)$, is the variance of the error. In other words, $\textrm{Var}(\epsilon)$ is the variance left in $Y$ after we know $X$. If the variance of the error, $\textrm{Var}(\epsilon)$, is small, then $Y$ is close to $\hat{Y}$, so our regression model will be successful in estimating $Y$. From the above discussion, we can define \begin{align} \rho^2=\frac{\beta_1^2 \textrm{Var}(X)}{\textrm{Var}(Y)} \end{align} as the portion of the variance of $Y$ that is explained by variation in $X$. It follows that $0 \leq \rho^2 \leq 1$. More specifically, if $\rho^2$ is close to $1$, $Y$ can be estimated very well as a linear function of $X$. On the other hand, if $\rho^2$ is small, then the variance of the error is large and $Y$ cannot be accurately estimated as a linear function of $X$.
Since $\textrm{Cov}(X,Y)=\textrm{Cov}(X,\beta_0+\beta_1 X+\epsilon)=\beta_1 \textrm{Var}(X)$ (again using the independence of $X$ and $\epsilon$), we have $\beta_1=\frac{\textrm{Cov}(X,Y)}{\textrm{Var}(X)}$, so we can write \begin{align} \label{eq:rho-reg} \rho^2=\frac{\beta_1^2 \textrm{Var}(X)}{\textrm{Var}(Y)}=\frac{\left[\textrm{Cov}(X,Y)\right]^2}{\textrm{Var}(X) \textrm{Var}(Y)} \hspace{30pt} (8.6) \end{align} The above equation should look familiar to you. Here, $\rho$ is the correlation coefficient that we have seen before. We are basically saying that if $X$ and $Y$ are highly correlated (i.e., $|\rho(X,Y)|$ is close to $1$), then $Y$ can be well approximated by a linear function of $X$, i.e., $Y \approx \hat{Y}=\beta_0+\beta_1 X$. We conclude that $\rho^2$ is an indicator showing the strength of our regression model in estimating (predicting) $Y$ from $X$. In practice, we often do not have $\rho$, but we do have the observed pairs $(x_1,y_1)$, $(x_2,y_2)$, $\cdots$, $(x_n,y_n)$. We can estimate $\rho^2$ from the observed data. We denote the estimate by $r^2$ and call it $R$-squared or the coefficient of determination.
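The identity $\rho^2=\frac{\beta_1^2 \textrm{Var}(X)}{\textrm{Var}(Y)}$ is easy to check numerically. The following MATLAB sketch simulates the model $Y=\beta_0+\beta_1 X+\epsilon$ for one hypothetical choice of parameters (ours, purely for illustration) and compares the theoretical ratio with the squared sample correlation:
rng(1);                                % fix the random seed for reproducibility
n = 1e5;                               % number of simulated pairs
beta0 = 0.5; beta1 = 2.2; sigma = 1;   % hypothetical parameter values
X = randn(n, 1);                       % X ~ N(0,1)
epsilon = sigma * randn(n, 1);         % epsilon ~ N(0, sigma^2), independent of X
Y = beta0 + beta1 * X + epsilon;       % the regression model
rho2_ratio = beta1^2 * var(X) / var(Y) % beta_1^2 Var(X) / Var(Y)
R = corrcoef(X, Y);
rho2_corr = R(1, 2)^2                  % squared sample correlation; close to rho2_ratio
Both printed values should be close to $\beta_1^2/(\beta_1^2+\sigma^2) \approx 0.83$ for this choice of parameters, since here $\textrm{Var}(X)=1$.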
For the observed data pairs, $(x_1,y_1)$, $(x_2,y_2)$, $\cdots$, $(x_n,y_n)$, we define the coefficient of determination, $r^2$, as \begin{align} r^2=\frac{s_{xy}^2}{s_{xx}s_{yy}}, \end{align} where \begin{align} &s_{xx}=\sum_{i=1}^n (x_i-\overline{x})^2, \quad s_{yy}=\sum_{i=1}^n (y_i-\overline{y})^2, \quad s_{xy}=\sum_{i=1}^{n} (x_i-\overline{x})(y_i-\overline{y}). \end{align} We have $0 \leq r^2 \leq 1$. Larger values of $r^2$ generally suggest that our linear model \begin{align} \hat{y}_i=\hat{\beta_0}+\hat{\beta_1}x_i \end{align} is a good fit for the data.
Example
For the data in Example 8.31, find the coefficient of determination.
- Solution
- In Example 8.31, we found \begin{align} &s_{xx}=5, \quad s_{xy}=11. \end{align} We also have \begin{align} &s_{yy}=(3-6)^2+(4-6)^2+(8-6)^2+(9-6)^2=26. \end{align} We conclude \begin{align} r^2=\frac{11^2}{5 \times 26} \approx 0.93 \end{align}
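As a quick cross-check, since $r$ is just the sample correlation coefficient of the data, the same value can be obtained in MATLAB with the built-in corrcoef function:
x = [1; 2; 3; 4];
y = [3; 4; 8; 9];
R = corrcoef(x, y);
r2 = R(1, 2)^2            % returns 0.9308, in agreement with 11^2/(5*26)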