Are you Psychic?

Understanding Hypothesis Tests

Contributors: Shonda Kuiper DASIL

Part 1A: Introduction

Have you ever felt that you have a special ability to know something is true just through intuition? For example, have you ever played a casino game where you won because you correctly guessed the throw of the dice, the right card, or the color on a roulette wheel? If someone wins several times in a row, do they have special skills or are they just lucky? The Psychic Test below will evaluate your ability to predict the right card.

Part 1B: The Psychic Test

Use the Psychic game (below) to conduct your own test:

Provide a Player ID (see Figure 1). You may use any name you like: this name will be on the internet, so do not use a name that will easily identify you.

Provide a Group ID. If you are playing this game for a course, use the exact Group ID provided by your instructor, which is the same for every person in the class.

Click the Play button.

Click on one of the five shapes as shown in Figure 2. Your goal is to guess which shape will be displayed on the top card. You will have 10 attempts in this initial test. The number of correct guesses will be saved automatically.

Note: If you are unable to play the game, either try a different browser or go to the game website directly here. You can also test your skills on your phone here.

Part 1C: Exploring Example Data

How well did you do at guessing the right cards? Before we come to any conclusions about psychic abilities, we should carefully consider several questions. For example:

How many does the average person, with no special abilities, get right?
How many would you need to correctly guess before we started to suspect there was something more going on than random guessing?

Before we analyze your data, let's look at a sample dataset from a previous class of students, called sample1. Before answering any of the questions, make sure the settings in the app follow Settings A shown on the left.

Settings A

Group ID sample1

Number of Cards: 5

Attempts: 10

At least as extreme as: 0

Histograms: Sample Data

Check: Summary Statistics

Figure 3: Output when Settings A are selected

Instructor's Note: Go to faculty resources to access student data.

Part 1D: Hypothesis Testing and p-values

The above activity can be used to explain the core ideas behind statistical hypothesis tests. Hypothesis testing is a process used to determine whether an event can reasonably be attributed to chance or whether there is some other explanation.

For example, let’s assume your friend, Akilah, claims to be psychic and you decide to test this claim using a test similar to the one above. There are two possible conclusions we can make.

Claim 1: Akilah has no special ability to predict the right card. In that case we would expect Akilah to get about 1 out of 5 guesses correct, which is equivalent to saying the probability of success on each guess is 0.2. Statisticians write this as a null hypothesis: Ho: p = 0.2. A null hypothesis is typically what we assume before we collect any data. Here we use “p” to represent the assumed proportion of correct guesses.

Note: In every null hypothesis we are making several corresponding assumptions. It is important to identify these assumptions before any test can be trusted. In this test, Claim 1 is also assuming:

The above Psychic Test is truly random. We assume that this test is not manipulated in a way that makes people more likely to guess the right card.

Akilah is only playing one game. In other words, we are collecting one sample of data with 10 observations (10 attempts). Akilah did not play the game multiple times and then report only her best score.

Claim 2: At least one of the assumptions in the null hypothesis is incorrect, meaning Akilah is expected to do better than 1 out of 5 correct guesses. This would also be equivalent to saying the probability of success is greater than 0.2. Statisticians write this as an alternative hypothesis: Ha: p > 0.2.
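To get a feel for what Claim 1 predicts, we can simulate many games of pure guessing. The sketch below is a minimal illustration and not the code behind the Psychic game: the function names (play_game, simulate_scores), the number of simulated games, and the random seed are all assumptions chosen for this example.

```python
import random
from collections import Counter


def play_game(attempts=10, cards=5, rng=random):
    """Return the number of correct guesses in one game of pure guessing."""
    # Card 0 stands in for the correct shape, so each guess
    # succeeds with probability 1/cards (0.2 when cards=5).
    return sum(rng.randrange(cards) == 0 for _ in range(attempts))


def simulate_scores(games=10_000, attempts=10, cards=5, seed=1):
    """Simulate many games under the null hypothesis and tally the scores."""
    rng = random.Random(seed)  # fixed seed (an assumption) for repeatable runs
    return Counter(play_game(attempts, cards, rng) for _ in range(games))


scores = simulate_scores()
total = sum(scores.values())
mean_score = sum(score * count for score, count in scores.items()) / total

print(f"mean score: {mean_score:.2f}")
for score in range(11):
    # Proportion of simulated games that ended with this score.
    print(score, scores.get(score, 0) / total)
```

With this many simulated games, the mean score should land close to 2, and scores of 5 or more should be rare but not absent.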

Note: It is important to recognize that one hypothesis test, by itself, should never be enough to prove a theory, but it can be used to determine how much evidence there is to support a theory. (See the American Statistical Association statement for more details behind these ideas: https://www.amstat.org/asa/files/pdfs/p-valuestatement.pdf)

Using the data visualization in Part 1C, we can draw conclusions from Akilah’s score.

Case 1: Let’s assume Akilah played your game and got 4 out of 10 correct. She would have done better than the 2 out of 10 we expect from random guessing. However, the above app shows us that 12% of the time people get 4 or more correct just by chance. In terms of our statistical hypothesis test, a p-value of 0.12 does not give convincing evidence that Akilah does better than the average person with no abilities. Thus the sample data (Akilah’s score) provide little to no evidence to support Claim 2.

A p-value is a number between 0 and 1 that we use to quantify our decision. A p-value is the probability of observing an outcome at least as extreme as the one we actually observed, while assuming that the null hypothesis and all corresponding assumptions are true.
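For the psychic test, “at least as extreme” means getting at least as many correct guesses as were observed. If each guess is an independent success with probability p, the number of correct guesses follows a binomial distribution, and the p-value can be computed exactly. The following is a minimal sketch under that assumption; the function name binomial_upper_tail is ours and is not part of the app.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """Return P(X >= k) when X counts successes in n independent tries."""
    # Add up the binomial probabilities of every score from k through n.
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Case 1: 4 or more correct out of 10 attempts with 5 cards (p = 0.2).
p_value_case1 = binomial_upper_tail(4, 10, 0.2)
print(round(p_value_case1, 4))  # 0.1209
```

This agrees with the roughly 12% shown in the app for a score of 4 or more.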

In a hypothesis test, “no evidence” or “weak evidence” is typically associated with a large p-value (greater than 0.10), “moderate evidence” might fall in a range between 0.05 and 0.10, and “very strong evidence” would be a very small p-value (such as 0.01 or lower), signifying a highly significant result against the null hypothesis.

The p-value is not the same as the p we use in our null and alternative hypotheses.

Case 2: Let’s assume Akilah got 6 out of 10 guesses correct. If the null hypothesis is true (p = 0.2), the probability that Akilah correctly guesses 6 or more cards is 0.01. This may cause us to question the null hypothesis and conclude that something is helping Akilah correctly guess the cards (at least one of the assumptions in the null hypothesis is incorrect).

A p-value = 0.01 provides very strong evidence against the null hypothesis, meaning that it is unlikely, but not impossible, that someone will correctly guess 6 or more cards just by random chance. However, a p-value never proves that our alternative hypothesis is true; it simply tells us how unlikely a result this extreme would be if only chance were involved.

Even in Case 2, one hypothesis test does not prove Akilah has special abilities. Instead, we should say, “The observed data provides some evidence that leads us to question the null hypothesis.” The questions below raise additional issues to consider when conducting hypothesis tests.
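To see how quickly the evidence strengthens as the score rises, we can tabulate the chance of reaching each score or better under the null hypothesis. This is a sketch that assumes independent guesses with p = 0.2; the helper name upper_tail and the tails dictionary are illustrative choices, not part of the app.

```python
from math import comb


def upper_tail(k, n=10, p=0.2):
    """Return P(score >= k) for n independent guesses with success chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# One exact tail probability for every possible score from 0 to 10.
tails = {k: upper_tail(k) for k in range(11)}

for k, prob in tails.items():
    print(f"P(score >= {k:2d}) = {prob:.4f}")
```

The entry for a score of 6 or more is roughly 0.0064, which rounds to the 0.01 quoted for Case 2.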

Part 1E: Get Curious

1.7 Evaluate your class scores. Sometimes it can take up to 60 minutes for your class data to appear in the App here. After your instructor confirms enough class data is available, answer the following questions:
- In the App above, use the GroupID assigned to your class. Use the summary statistics button to identify the mean, median, and sample size of your class data.
- Compare the histogram of your class data to the histogram when we use “all” as the Group ID. Are the center, variability, and shape similar?
- What was the best score in your class? If there are 20 people in your class (20 games played), would you be surprised if one person got a “psychic” result by getting 5 or more correct? Explain your reasoning.

1.8 How does the number of cards influence our results? Go to the Psychic test above and enter the same PlayerID and GroupID, but this time click the Options button. Then select the following:
Number of Attempts: 10
Number of Cards: 2
Deck Style: Open Deck
- In the first game, the probability of success was 1/5 = 0.2. What is the probability of success when you are selecting 1 card from 2 choices?
- When you have 10 attempts, with 2 cards to choose from each, how many successes would you expect to get?
- Use the App above to estimate how many successes would be needed before the result might be considered unusual.

1.9 In 2019, USA Today published an article discussing two marine lab manatees (Buffett and Hugh) attempting to predict the Super Bowl winner. Watch the video here. Each manatee had a 50% chance of selecting the winner.
- Buffett had correctly picked 9 winners from 2008-2018, while Hugh had only picked 6 winners. Does this mean you should expect Buffett to be better at predicting the winner of the 2019 Super Bowl?
- Use the App above to estimate how likely it is to get 9 out of 11 attempts correct. Since the Psychic game doesn’t allow us to play with 11 attempts, click the binomial distribution option under Histograms for a more accurate estimate.
- The USA Today article also lists other animals predicting the Super Bowl, such as Fiona the Hippo, Kiki the Lioness, and April the Giraffe. When thousands of animals are used to predict the Super Bowl, explain why it is NOT surprising for a few to be correct often.

1.10 Your friend Eli claims he can tell the difference between Coke and Pepsi. However, you are skeptical. You design a test where Eli guesses between two cups, each containing one drink, repeated 10 times.
- State the null and alternative hypotheses.
- If the null hypothesis is true, would you be surprised if Eli got 7 out of 10 correct? Use the App to explain your reasoning.
- In this test, Eli got 9 out of 10 correct. Use the App to find the p-value for this test. Based on this p-value, explain whether or not you have strong evidence to support the idea that Eli can taste the difference between the two.

1.11 How does the number of attempts (or sample size) influence our conclusions? Use the App to find the following probabilities for the psychic game. For this question, we may not have enough psychic game data, so use the binomial distribution option to get a more reliable histogram and probability.
- With 5 cards and 10 attempts, how likely is it for a player to get 40% or more correct (4 or more out of 10 correct guesses)?
- With 5 cards and 20 attempts, how likely is it for a player to get 40% or more correct (8 or more out of 20 correct guesses)?
- With 5 cards and 50 attempts, how likely is it for a player to get 40% or more correct (20 or more out of 50 correct guesses)?
- Explain the pattern you see in the first three parts of Question 1.11. Why would you expect a larger number of attempts to influence the probability of getting 40% or more correct?
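The probabilities in Question 1.11 can be cross-checked against the binomial formula after you have estimated them in the App. The sketch below assumes independent guesses with p = 0.2; the names upper_tail, cases, and results are illustrative only.

```python
from math import comb


def upper_tail(k, n, p=0.2):
    """Return P(X >= k) for n independent guesses with success chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# (attempts, smallest score that counts as 40% or more correct)
cases = [(10, 4), (20, 8), (50, 20)]

results = {}
for attempts, needed in cases:
    results[attempts] = upper_tail(needed, attempts)
    print(f"{attempts} attempts, {needed} or more correct: {results[attempts]:.4f}")
```

Compare the printed values with the histograms in the App before explaining the pattern.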

1.12 In 2016, the American Statistical Association posted a statement to address the many misconceptions and misuses of the p-value in published research.
- The first principle described in this statement says, “P-values can indicate how incompatible the data are with a specified statistical model.” Use the Pepsi-Coke example (Question 1.10) to explain how a small p-value provides evidence against the null hypothesis (where our statistical model assumed p = 0.5).
- The second principle states, “P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.” Use the Pepsi-Coke example (Question 1.10) to explain how a p-value larger than 0.10 does not prove Eli has no ability to differentiate between the two colas.

1.13 In the App we see that over 10,000 games of the original psychic game have been played, using 5 cards and 10 attempts.
- Estimate the proportion of all the games played that have scores of 5 or more.
- If there are 100 students in your class, how many would you expect to get a score of 5 or more (while assuming the null hypothesis is true)?
- If there are 40 students in your class who played the original psychic game, explain why it would not be surprising for at least one student to get a score that leads to a p-value < 0.05, even when no one in the class has psychic abilities.
- Explain why it is not appropriate for someone to play the original psychic game 100 times and then use only their best score to “show evidence” that they are psychic.
- Terms such as p-hacking, data dredging, data fishing, or data snooping can all be used to describe cases where data is manipulated to produce hypothesis tests with small p-values, even when there are no meaningful conclusions. Tyler Vigen has created a website here that demonstrates one type of data manipulation focusing on the “multiple comparison problem.” In essence, this means a researcher conducts numerous hypothesis tests, then publishes only the tests with small p-values. Choose one graph from Tyler Vigen’s site and explain how this is an example of p-hacking.
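The multiple comparison problem in Question 1.13 can be illustrated with a short calculation. With 5 cards and 10 attempts, a score of 5 or more is the smallest score whose exact binomial p-value falls below 0.05. The sketch below assumes every game is an independent round of pure guessing; the function names upper_tail and prob_at_least_one are hypothetical helpers for this example.

```python
from math import comb


def upper_tail(k, n=10, p=0.2):
    """Return P(score >= k) for n independent guesses with success chance p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def prob_at_least_one(games, threshold=5):
    """Chance that at least one of `games` independent games reaches the threshold."""
    single = upper_tail(threshold)  # chance that one game of guessing reaches it
    return 1 - (1 - single) ** games


for games in (1, 20, 40, 100):
    print(games, round(prob_at_least_one(games), 3))
```

The same calculation applies whether the games come from many different players or from one player who plays many times and reports only the best score.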

References

Shannon, Joel. "Adorable Animals Across the Nation Are Making Super Bowl Predictions." USA Today, Feb. 2019, www.usatoday.com/story/news/nation/2019/02/03/animals-predict-super-bowl-outcome/2756507002/.

Wasserstein, R. L., and Nicole A. Lazar. "The ASA’s Statement on P-Values: Context, Process, and Purpose." The American Statistician, vol. 70, no. 2, 2016, amstat.tandfonline.com/doi/full/10.1080/00031305.2016.1154108#d1e167.
