Part 1

Exploring the Cost of College Tuition

All models are wrong, but some are useful. - George Box

Contributors: Shonda Kuiper, Joyce Gill, Khanh Do

Finding Variables to Determine the Cost of Tuition

As a high school student, one of your biggest decisions is what you’re going to do when you graduate. If you choose to further your education, how can you know which college or university is best for you? One possible approach is to identify the college with the best value. Instead of relying solely on brochures or rankings, we can look for data-based evidence.

To determine the college with the best value, we need to understand information about colleges and look for meaningful patterns:

Do more expensive schools produce better outcomes, provide more student aid, or have more selective students?

The Tuition Visualization App below allows us to look at data from all United States colleges and universities to find answers to these types of questions. Let’s assume your goal is to find what variables predict the cost of tuition. What model did you find? What can you conclude?

What Does Your Model Reveal?

It’s tempting to start exploring, trying multiple combinations of variables until you find something interesting. For example, “Schools with higher percentages of faculty with PhDs have significantly higher tuition (p < 0.05).” That sounds plausible. Prestigious schools hire more PhDs, right? You start to write up your research paper. But there is a problem: you didn’t just run one model; you ran dozens.

Even if none of the variables truly affect tuition in a meaningful causal way, random noise can produce “statistically significant” results if you try enough combinations. This is often called p-hacking (or a multiple comparisons problem).

Unfortunately, researchers often have large datasets in which none of the predictors actually matter. Even so, if they test 20 different hypotheses, about 1 in 20 tests will, on average, appear “significant” at the 0.05 level by chance alone.
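To see how often chance alone produces a "significant" result, here is a minimal simulation sketch. It does not use the tuition data: every response and predictor is made-up random noise, and the function names (pearson_r, approx_p_value, simulate_noise_studies) are hypothetical helpers written for this illustration. The p-value uses the Fisher z approximation for a correlation test, which is an assumption chosen to keep the code dependency-free; it is not necessarily how the app computes p-values.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def approx_p_value(x, y):
    """Two-sided p-value for 'no correlation' (Fisher z approximation)."""
    z = math.atanh(pearson_r(x, y)) * math.sqrt(len(x) - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_noise_studies(num_studies=500, tests_per_study=20, n=50,
                           alpha=0.05, seed=1):
    """Return (share of all tests with p < alpha,
               share of studies with at least one p < alpha)."""
    rng = random.Random(seed)
    total_hits = 0
    studies_with_hit = 0
    for _ in range(num_studies):
        # One made-up response and 20 made-up predictors, all pure noise.
        y = [rng.gauss(0, 1) for _ in range(n)]
        hits = 0
        for _ in range(tests_per_study):
            x = [rng.gauss(0, 1) for _ in range(n)]
            if approx_p_value(x, y) < alpha:
                hits += 1
        total_hits += hits
        studies_with_hit += hits > 0
    per_test = total_hits / (num_studies * tests_per_study)
    per_study = studies_with_hit / num_studies
    return per_test, per_study


per_test_rate, any_hit_rate = simulate_noise_studies()
print(f"share of individual tests with p < 0.05: {per_test_rate:.3f}")
print(f"share of studies with at least one p < 0.05: {any_hit_rate:.3f}")
print(f"theory for 20 independent tests, 1 - 0.95**20: {1 - 0.95 ** 20:.3f}")
```

The first number should land near 0.05, which matches the "1 in 20" idea. The second number is the one that matters to a researcher who tries 20 predictors and reports only the best: for independent tests, the chance of at least one false "discovery" is 1 - 0.95^20, or roughly 64%.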

If you try many models, select one that results in a small p-value and ignore the other models, you might easily convince yourself you’ve discovered something real. However, other analysts can take the same dataset, choose different variables, and find “statistically significant” results that contradict your conclusions.

In addition to choosing variables, other decisions can also influence statistical results, such as the type of analysis to use, how to treat missing data, or how to treat outliers. Each choice may be reasonable on its own, but they can result in very different conclusions.

Statistical research and p-values can be very useful, but a researcher needs to carefully consider how the data was collected, how the data was cleaned or modified, what type of assumptions were made, the type of analysis that was used, and how carefully the conclusions were drawn.

Don’t be too quick to accept a statistical result; instead, ask questions about the data, review the assumptions and methods, and think critically about the results. If analysis is done carefully and the results are repeatable (the same results are consistently found with multiple datasets and methods), we can then gain confidence that our conclusions are correct.

Calculating a Simple Regression Model

The Tuition Visualization App allows us to see the relationship between graduation rates and tuition. Do schools with higher graduation rates also have higher tuition costs? To answer this question, select the following:

Response variable: Private school tuition
Explanatory variable: Selectivity and Student Outcomes >> 6 yr graduation rate
Predictor for X axis: 6-year grad rate

Question 1) Move your cursor over the points and identify the two private schools with the highest 6-year graduation rates. List the schools and their tuition.

Question 2) Write the name Grinnell College in the box below the title. There will be a black circle around this school. Estimate the graduation rate and tuition for Grinnell College.

Question 3) Does there appear to be a strong relationship between 6-year graduation rates and tuition costs? What does the p-value tell us about this model? What does the R^2 tell us in this model?
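If you are curious about what a slope, y-intercept, and R^2 represent, the sketch below fits a least-squares line by hand. The numbers are invented for illustration and are not taken from the app, and simple_regression is a hypothetical helper name. We are assuming an ordinary least-squares fit here; the p-value is left out because it requires the t-distribution, which is not in Python's standard library.

```python
def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # With one predictor, R^2 is the squared correlation between x and y.
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared


# Made-up values for six imaginary schools (NOT real data from the app).
grad_rate = [40, 50, 60, 70, 80, 90]   # 6-year graduation rate, percent
tuition = [20, 26, 29, 37, 40, 46]     # tuition, thousands of dollars

intercept, slope, r_squared = simple_regression(grad_rate, tuition)
print(f"predicted tuition = {intercept:.2f} + {slope:.3f} * grad_rate")
print(f"R^2 = {r_squared:.3f}")
```

In this toy example the slope is the predicted change in tuition (in thousands of dollars) for each one-point increase in graduation rate, and R^2 is the share of the variation in tuition that the line accounts for.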

Calculating a Multiple Regression Model

Let’s repeat this process with a few more variables.

Response variable: Public out-of-state
Explanatory variables:
- Selectivity and student outcomes >> 6 Year Graduation Rate
- Costs, Aid & Debt >> log(Grants Aid total)
Predictor for X axis: 6 Year Graduation Rate

Question 4) What are the slope and y-intercept for this model? Give an interpretation of this model.

Question 5) What is the p-value for the slope? Interpret this p-value.

Question 6) Does the regression model still indicate that schools with higher 6-year graduation rates have higher tuition?

Question 7) Now build the following multivariate model.

Response variable: Public out-of-state tuition
Explanatory variables:
- Selectivity and student outcomes >> 6 Year Graduation Rate
- Selectivity and student outcomes >> Graduation Rate (Men)
- Costs and Aid >> Application Fee
- Costs and Aid >> log(Grants Aid total)
Predictor for X axis: 6 Year Graduation Rate

Write out the regression model. What are the slope coefficient and p-value for the 6 Year Graduation Rate variable?

Question 8) Do schools with higher 6-year graduation rates have higher or lower out-of-state tuition? How do the above regression models make both hypotheses appear correct?

Can You Trust p-values?

In the above questions, the first two models show significant p-values demonstrating an increasing relationship between 6-year graduation rates and tuition. However, adding additional terms to our model shows evidence of a decreasing relationship between 6-year graduation rates and tuition.

Simply getting a p-value less than 0.05 (or even less than 0.001) does not give “statistically significant” evidence that “public schools with higher 6-year graduation rates have higher out-of-state tuition costs.”

This is a common error made in regression analysis. When conducting hypothesis tests for slopes (coefficients) within a multiple regression model, the p-values are not reliable. There are two primary reasons why the p-values for coefficients are not reliable.

Section 5A

Correlated explanatory variables: Multicollinearity is a condition in regression analysis where two or more explanatory variables are highly correlated with one another, meaning they contain overlapping or redundant information about the response variable. When this happens, the model has difficulty distinguishing the individual contribution of each predictor, because changes in one variable tend to be associated with changes in another. As a result, the estimated regression coefficients can become unstable and highly sensitive to small changes in the data.
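The instability described above can be seen in a small simulation. This is a sketch on made-up data, not the tuition dataset: x1 truly drives the response with a coefficient of 2, and x2 is a near copy of x1 (a stand-in for a pair like overall graduation rate and graduation rate for men). All function names are hypothetical helpers written for this illustration.

```python
import random
from statistics import mean, stdev


def centered(values):
    """Subtract the mean from every value."""
    m = mean(values)
    return [v - m for v in values]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def slope_one_predictor(x1, y):
    """Coefficient of x1 when y is regressed on x1 alone."""
    d1, dy = centered(x1), centered(y)
    return dot(d1, dy) / dot(d1, d1)


def slope_x1_two_predictors(x1, x2, y):
    """Coefficient of x1 when y is regressed on x1 and x2 together
    (solves the two-predictor normal equations on centered data)."""
    d1, d2, dy = centered(x1), centered(x2), centered(y)
    s11, s22, s12 = dot(d1, d1), dot(d2, d2), dot(d1, d2)
    s1y, s2y = dot(d1, dy), dot(d2, dy)
    return (s22 * s1y - s12 * s2y) / (s11 * s22 - s12 ** 2)


def simulate_coefficients(num_datasets=300, n=100, seed=2):
    """Refit both models on many made-up datasets; return the x1 slopes."""
    rng = random.Random(seed)
    alone, with_twin = [], []
    for _ in range(num_datasets):
        x1 = [rng.gauss(0, 1) for _ in range(n)]
        # x2 is nearly a copy of x1 (correlation of about 0.995).
        x2 = [a + rng.gauss(0, 0.1) for a in x1]
        # Only x1 drives the response; its true coefficient is 2.
        y = [2 * a + rng.gauss(0, 1) for a in x1]
        alone.append(slope_one_predictor(x1, y))
        with_twin.append(slope_x1_two_predictors(x1, x2, y))
    return alone, with_twin


alone, with_twin = simulate_coefficients()
print(f"x1 alone:     mean slope {mean(alone):.2f}, spread (SD) {stdev(alone):.2f}")
print(f"x1 and x2:    mean slope {mean(with_twin):.2f}, spread (SD) {stdev(with_twin):.2f}")
print(f"smallest x1 slope with x2 in the model: {min(with_twin):.2f}")
```

Both models estimate the x1 coefficient as about 2 on average, but once the near-duplicate x2 is included, the estimate swings much more widely from one dataset to the next. That extra spread is what makes the coefficient, and its p-value, so sensitive to which correlated variables are in the model.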

Question 9) Rebuild the model in Question 7, with and without Graduation Rate (Men). List the coefficients and p-values for 6-year Graduation Rate in both models. Explain why including Graduation Rate (Men) can have such a strong influence on the p-value for another variable, 6-year graduation rate.

Question 10) Question 9 shows that multicollinearity makes the coefficients highly variable, making it harder to determine which predictors are statistically significant. In severe cases, coefficients may even have counterintuitive signs or magnitudes, not because of true relationships, but because the model cannot reliably separate the effects of correlated predictors. Even though the interpretation of individual variables may not be reliable, the overall model may still produce accurate predictions. How much does the R^2 value change in the model from Question 7, with and without the variable Graduation Rate (Men)?

Section 5B

Multiple Comparisons (often called p-hacking): Even a simple research question, such as “what variables have the most influence on college tuition?”, requires many choices that can shape the results.

Here’s the subtle problem with multivariate regression: we didn’t just discover a pattern in the data, we constructed one. In the above examples, we manipulated numerous variables to shape the model into getting the results we wanted. When creating multivariate models, researchers often try numerous models to find the best combination of explanatory variables that can explain or predict the response variable. Often, there’s no obviously correct way to proceed, which makes it tempting to try different things until you get the result you’re seeking. It isn’t wrong to explore data until we find a pattern, but it does dramatically influence how we can interpret the model.

What most researchers want to know is whether their hypothesis is true and, if so, how strong the finding is. “A p-value does not give you that; it can never give you that,” said Regina Nuzzo, a statistician and journalist in Washington, D.C., who wrote about the p-value problem in Nature.

Question 11) Use the definition of a p-value to explain Regina Nuzzo’s statement. What does a p-value actually tell us?

Question 12) Many researchers assume that if the Variance Inflation Factors (VIFs) are low, we can trust the p-values for each coefficient in their final multiple regression model. Explain why this is incorrect.

Question 13) Many researchers want to conduct multiple regression to “find the most important explanatory variables to predict their response”. Explain why multiple regression techniques cannot do this.

Question 14) Explain why we can trust the p-value when conducting an ESS test.

Question 15) What are the limitations of conducting an ESS test instead of just using results from variable selection techniques such as best subsets or stepwise regression?

Question 16) If we are trying to describe or predict a response (i.e., find the best R^2 value), when is best subsets better than stepwise? When is stepwise better than best subsets?

Questions?

If you have any questions or comments, please email us at DASIL@grinnell.edu

This page was last updated on March 19, 2025.