Simple Regression

For this blog post, the Cochise College IPEDS Peer data frame will be used. That data frame was first seen in Introduction to IPEDS Peers (August 24, 2018). It includes 113 attributes for 29 colleges, and it is natural to wonder if any of those attributes are related to each other in such a way that they can be used for predictions. The relationships between selected attributes were explored in About Correlation, where two correlograms were generated to find highly correlated attributes. This post will use several attributes from the IPEDS data as factors in a regression analysis to see if a model can be developed that will predict the value of one attribute when given the value of another.

Here is a correlogram with the race/ethnicity attribute along with the tuition charged.

This correlogram has some interesting correlations, but I suspect that those are a product of nothing more than geography. Community colleges tend to attract local students, so the student demographics would tend to mirror the local population. Thus, there is a strong negative correlation between black and white students and between Hispanic and white students. I suspect that this does not indicate one race refusing to go to college with another but, rather, that the communities where colleges are located are somewhat polarized.

For this analysis, I wanted to determine if tuition has an influence on the ethnic makeup of the student body. At just a quick glance, I can see that there is a moderate positive correlation between tuition and white students and a moderate negative correlation between tuition and Hispanic students.

Two simple linear models will be developed with R and then those models will be used to predict the percent of Hispanic and White students when given a specific tuition.

The general regression formula is \(\hat{y}=\beta_1x+\beta_0\) where the output (\(\hat{y}\)) is determined by the two parameters of the model (\(\beta_1\) and \(\beta_0\)) and the input (\(x\)). As an example, if \(\beta_1\) is 2 and \(\beta_0\) is 1, then an input (\(x\)) of 1 would yield an output (\(\hat{y}\)) of 3.
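The worked example above can be checked with a few lines of R. This is only a minimal sketch; `predict_line` is a hypothetical helper name chosen for illustration and is not part of the original analysis.

```r
# Evaluate the simple regression formula y_hat = beta_1 * x + beta_0.
# `predict_line` is a hypothetical helper, not something from the post.
predict_line <- function(x, beta_1, beta_0) {
  beta_1 * x + beta_0
}

# The worked example from the text: beta_1 = 2, beta_0 = 1, x = 1.
y_hat <- predict_line(x = 1, beta_1 = 2, beta_0 = 1)
print(y_hat)  # 3
```

Because the function is vectorized, passing a vector of inputs such as `c(0, 1, 2)` returns one prediction per input.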

To create the regression model, I focused first on the relationship between the percentage of Hispanic students (the dependent variable) and tuition (the independent variable). R has a linear model function, lm(Hispanic ~ Tuition), which was used to determine the values of \(\beta_0\) and \(\beta_1\). Plugging those values into the regression formula yielded \(\hat{y}=-0.011x+58.13\). Now, to calculate the predicted percent of Hispanic students for a tuition of $3,000, plug that number in for \(x\) and solve the equation. It comes out to 25.21%; the rounded coefficients shown here give 25.13%, so the reported figure presumably reflects the unrounded values from the model.
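For readers who want to reproduce this step, here is a rough sketch of fitting the model and making the prediction in R. The `peers` data frame and its values are made up for illustration (the real IPEDS peer data is not reproduced here), so the coefficients will not match the ones reported above.

```r
# Hypothetical stand-in for the IPEDS peer data frame (values are made up).
peers <- data.frame(
  Tuition  = c(1200, 1800, 2400, 3000, 3600, 4200, 4800),
  Hispanic = c(48, 35, 33, 22, 21, 10, 8)
)

# Fit the simple linear model: Hispanic is the dependent variable.
fit <- lm(Hispanic ~ Tuition, data = peers)

# coef() returns a named vector with the intercept (beta_0) and slope (beta_1).
beta   <- coef(fit)
beta_0 <- beta[["(Intercept)"]]
beta_1 <- beta[["Tuition"]]

# Predict at a tuition of $3,000 two ways: by hand and with predict().
by_hand    <- beta_1 * 3000 + beta_0
by_predict <- predict(fit, newdata = data.frame(Tuition = 3000))
print(c(by_hand = by_hand, by_predict = unname(by_predict)))
```

Both approaches use the unrounded coefficients, which is why they agree with each other even when a hand calculation with rounded values would differ slightly.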

It may be easier to visualize this relationship with a scatter plot that shows the relationship between those two variables along with a line of best fit for the data.

In the above plot, the various colleges in the IPEDS peer group are indicated by the black dots and the blue line is the line of best fit. Because it is a negative correlation, the blue line angles downward. The gray zone indicates the 95% confidence interval for the true line of best fit. This is an interactive chart, and a prediction can be made by simply hovering the mouse over the line of best fit to see the values of Tuition and Percent of Students.
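The numbers behind a band like the gray zone can also be computed directly. The sketch below assumes the band is the usual 95% confidence interval for the fitted line and uses predict() with interval = "confidence"; the `peers` data is again hypothetical, and the plotting code used for the chart in this post is not shown here.

```r
# Hypothetical stand-in for the IPEDS peer data frame (values are made up).
peers <- data.frame(
  Tuition  = c(1200, 1800, 2400, 3000, 3600, 4200, 4800),
  Hispanic = c(48, 35, 33, 22, 21, 10, 8)
)
fit <- lm(Hispanic ~ Tuition, data = peers)

# Evaluate the fitted line and its 95% confidence interval on a tuition grid.
grid <- data.frame(Tuition = seq(1000, 5000, by = 500))
band <- predict(fit, newdata = grid, interval = "confidence", level = 0.95)

# Columns: fit (line of best fit), lwr and upr (edges of the band).
result <- cbind(grid, band)
print(result)
```

Reading off the row for a given tuition is the tabular equivalent of hovering over the line in the interactive chart.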

Next, I focused on the relationship between the percentage of White students (the dependent variable) and tuition (the independent variable). Here is a scatter plot that shows that relationship along with a line of best fit for the data.

For example, on the plot of white students, if the mouse is hovered over the line above the value of $3,000 for Tuition, then the percentage of white students is 47%.

While this is an interesting exercise, any sort of cause-and-effect discussion should be avoided. As the percentage of white students increases, is there pressure to increase tuition (perhaps due to services demanded by the student body)? Or does higher tuition tend to deter students of color? Of course, there are many other factors not considered in this simple analysis, like the geographic community where the college is located or the availability of financial aid.

Multiple Regression

It is possible to have more than one variable influence the output and in that case a multiple regression is used for predictions. For this part of the post I decided to use the three income streams to predict the core revenue available. The three streams are the percent of the core revenue provided by tuition, the percent of the core revenue provided by local funding, and the percent of the core revenue provided by state funding. This is a simplified view of revenue and ignores sources like grants, but is adequate for this analysis.

The first step was to construct a correlogram to get a sense of the relationship between these factors. According to this chart, local revenue has a moderate correlation with total revenue, but neither state funding nor tuition has much of a correlation.
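The numbers that a correlogram visualizes come from a correlation matrix, which base R can compute with cor(). The following is a sketch only: the `revenue` data frame, its column names, and its values are hypothetical stand-ins for the IPEDS revenue attributes.

```r
# Hypothetical revenue data (values are made up for illustration).
# Revenue is total core revenue; the other columns are percentages.
revenue <- data.frame(
  Revenue = c(42e6, 55e6, 38e6, 61e6, 47e6, 70e6, 52e6, 44e6),
  Tuition = c(22, 18, 25, 15, 20, 12, 19, 24),
  State   = c(30, 25, 35, 20, 28, 22, 31, 26),
  Local   = c(35, 45, 28, 52, 40, 58, 38, 33)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(revenue)
print(round(cor_matrix, 2))
```

The first row (or column) of the matrix shows how strongly each income stream correlates with total revenue.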

The general multiple regression formula, shown here with two inputs, is \(\hat{y}=\beta_1x_1+\beta_2x_2+\beta_0\) where the output (\(\hat{y}\)) is determined by three parameters of the model (\(\beta_0\), \(\beta_1\), and \(\beta_2\)) and the inputs (\(x_1\) and \(x_2\)). Each additional input adds one more term and one more parameter. For the revenue regression, there are three input variables (tuition, state, local) that are used to predict the core revenue. In R the formula is lm(Revenue ~ Tuition+State+Local)
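As a rough illustration of that lm() call, the sketch below fits a three-input model. The `revenue` data frame and every value in it are hypothetical, so the resulting coefficients are not the ones from the IPEDS peer data.

```r
# Hypothetical revenue data (values are made up for illustration).
revenue <- data.frame(
  Revenue = c(42e6, 55e6, 38e6, 61e6, 47e6, 70e6, 52e6, 44e6),
  Tuition = c(22, 18, 25, 15, 20, 12, 19, 24),
  State   = c(30, 25, 35, 20, 28, 22, 31, 26),
  Local   = c(35, 45, 28, 52, 40, 58, 38, 33)
)

# Multiple regression: three inputs, so four parameters (intercept + 3 slopes).
fit <- lm(Revenue ~ Tuition + State + Local, data = revenue)
print(coef(fit))

# Predict core revenue for a hypothetical college's funding mix.
new_college <- data.frame(Tuition = 20, State = 30, Local = 40)
predicted   <- predict(fit, newdata = new_college)
print(predicted)
```

The names returned by coef() match the terms in the formula, which makes it straightforward to copy them into the written regression equation.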

This is the regression formula that was generated from the IPEDS peer data: \(\hat{y}=962285x_1+956919x_2+1312015x_3+1953830\). That linear model was used to create the following scatter plot.
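To see how the formula behaves, it can be evaluated directly. This sketch makes two assumptions that the post does not state explicitly: that \(x_1\), \(x_2\), and \(x_3\) follow the order of the R formula (tuition, state, local), and that percentages are entered as whole numbers such as 20 rather than 0.20. The helper name `predict_revenue` and the input values are hypothetical.

```r
# Coefficients as reported in the post.
# Assumed mapping: x_1 = Tuition, x_2 = State, x_3 = Local.
beta <- c(Intercept = 1953830, Tuition = 962285, State = 956919, Local = 1312015)

# Hypothetical helper that evaluates the regression formula.
predict_revenue <- function(tuition, state, local) {
  beta[["Intercept"]] +
    beta[["Tuition"]] * tuition +
    beta[["State"]]   * state +
    beta[["Local"]]   * local
}

# Hypothetical funding mix (percentages assumed to be whole numbers).
base   <- predict_revenue(tuition = 20, state = 30, local = 40)
bumped <- predict_revenue(tuition = 20, state = 30, local = 41)

# Raising one input by one unit changes the prediction by its coefficient.
print(bumped - base)  # 1312015
```

Holding the other two inputs fixed while stepping one of them is also the basic calculation a slider control would perform.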

In the above plot, the tuition percentages are in blue, the state percentages are in gold, and the local percentages are in red. Notice that the local percentages seem to make the greatest difference since the slope of that line is greater than the other two. The state percentages seem to be the least important since the slope on that line is nearly flat.

There is little to be gained from hovering over each line, though the value of the X and Y variables will be displayed. Determining the impact of changing one of the three percentages would require a way to change the percentage (a “slider” control), but that is a project for another day.