After reading "The ASA's statement on p-values: context, process, and purpose" and some related references, here are some excerpts and notes I took on p-values and null hypothesis significance testing.

• The American Statistical Association (ASA) statement includes the following five principles about p-values and null hypothesis significance testing (each followed, where available, by an excerpt from its explanation):

1. "P-values can indicate how incompatible the data are with a specified statistical model."
2. "P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone." Explanation: "… It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself."
3. "Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold." Explanation: "… Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “p < 0.05”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision-making. …"
4. "Proper inference requires full reporting and transparency."
5. "A p-value, or statistical significance, does not measure the size of an effect or the importance of a result." Explanation: "… Smaller p-values do not necessarily imply the presence of larger or more important effects, and larger p-values do not imply a lack of importance or even lack of effect. Any effect, no matter how tiny, can produce a small p-value if the sample size or measurement precision is high enough, and large effects may produce unimpressive p-values if the sample size is small or measurements are imprecise. …"
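The point that a tiny effect can produce a small p-value while a large effect can produce an unimpressive one can be illustrated with a rough sketch. It assumes a two-sample z-test with known unit variance, equal group sizes, and an observed difference exactly equal to the stated effect size; the function names and the two scenarios are made up for illustration.

```python
import math

def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a standard normal test statistic z."""
    # P(|Z| > |z|) for a standard normal Z equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

def two_sample_z(effect_size: float, n_per_group: int) -> float:
    """z statistic for a mean difference of `effect_size` standard deviations,
    assuming known unit variance and n_per_group observations in each group."""
    return effect_size * math.sqrt(n_per_group / 2)

# Hypothetical scenario 1: tiny effect (0.01 SD), one million samples per group.
p_tiny_effect = two_sided_p_from_z(two_sample_z(0.01, 1_000_000))

# Hypothetical scenario 2: large effect (0.8 SD), only five samples per group.
p_large_effect = two_sided_p_from_z(two_sample_z(0.8, 5))

print(f"tiny effect, huge sample:   p = {p_tiny_effect:.3g}")
print(f"large effect, small sample: p = {p_large_effect:.3g}")
```

Under these assumptions the first p-value is far below 0.05 and the second is well above it, even though the second effect is eighty times larger.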
• The null hypothesis is usually the hypothesis that the observed data and their distribution result from random chance rather than from effects caused by some intrinsic mechanism. It is usually what one tries to disprove or reject in order to establish evidence for, or belief in, a real effect due to an underlying mechanism. In turn, the details of the statistical model used in this evaluation can be used to make quantitative estimates of properties of the underlying mechanism.

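• The p-value is the probability, computed assuming the null hypothesis is true, of obtaining data at least as extreme as those actually observed. It is not the probability that one has falsely rejected the null hypothesis (see ASA principle No. 2 above).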
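As a minimal sketch of a p-value as a tail probability under the null hypothesis, consider a coin-flip experiment. The setup is hypothetical: the null hypothesis is a fair coin, and "at least as extreme" is taken to mean "at least as far from the expected number of heads as the observed count", which is one common convention for a symmetric null.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def two_sided_binomial_p_value(k_observed: int, n: int, p_null: float = 0.5) -> float:
    """Probability, under the null, of a count at least as far from the
    expected count n * p_null as the observed count."""
    expected = n * p_null
    observed_distance = abs(k_observed - expected)
    return sum(
        binomial_pmf(k, n, p_null)
        for k in range(n + 1)
        if abs(k - expected) >= observed_distance
    )

# Hypothetical data: 60 heads in 100 flips of a coin assumed fair under the null.
p_value = two_sided_binomial_p_value(60, 100)
print(f"p-value for 60 heads in 100 flips: {p_value:.4f}")
```

The result says how surprising 60 or more (or 40 or fewer) heads would be if the coin were fair; it says nothing directly about the probability that the coin is fair.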

• The smaller the p-value, the more incompatible the data are with the null hypothesis. By itself, however, a small p-value does not give the chance that rejecting the null hypothesis is a mistake.
• Being able or unable to reject the null hypothesis may tell one whether the observed data suggest that there is an effect; however, it does not tell one how large the effect is or whether the effect is real. See effect size.
• "a p-value near 0.05 taken by itself offers only weak evidence against the null hypothesis".
• UK statistician and geneticist Sir Ronald Fisher introduced the p-value in the 1920s. "The p-value was never meant to be used the way it's used today."
• As ASA p-value principle No. 3 states, the decision to reject the null hypothesis should not be based solely on whether a p-value passes a "bright-line" threshold. Rather, to reject the null hypothesis, one must make a subjective judgment about the degree of risk one is willing to accept of being wrong. That risk may be specified in terms of confidence levels, which characterize the sampling variability.

• Other terms for cherry-picking data include data dredging, significance chasing, significance questing, selective inference, p-hacking, snooping, fishing, and double-dipping.

• "The difference between statistically significant and statistically insignificant is not, itself, statistically significant."
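A small numeric sketch can make this quote concrete. The two estimates and standard errors below are made-up illustrative numbers, and the calculation assumes independent, normally distributed estimates.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(estimate: float, std_error: float) -> float:
    """Two-sided p-value for the null hypothesis that the true value is zero,
    assuming a normally distributed estimate."""
    z = estimate / std_error
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical study A: estimate 25 with standard error 10.
est_a, se_a = 25.0, 10.0
# Hypothetical study B: estimate 10 with standard error 10.
est_b, se_b = 10.0, 10.0

p_a = two_sided_p(est_a, se_a)
p_b = two_sided_p(est_b, se_b)

# Difference between the two estimates; variances add for independent estimates.
diff = est_a - est_b
se_diff = sqrt(se_a**2 + se_b**2)
p_diff = two_sided_p(diff, se_diff)

print(f"study A:    p = {p_a:.3f}")
print(f"study B:    p = {p_b:.3f}")
print(f"difference: p = {p_diff:.3f}")
```

With these numbers, study A falls below the conventional 0.05 threshold and study B does not, yet the test of the difference between them is nowhere near that threshold.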

• "According to one widely used calculation [1], a p-value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect; a p-value of 0.05 raises that chance to at least 29%." See the following figure:

[Figure: p-value and probable cause]
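The 11% and 29% figures quoted above can be reproduced under one set of assumptions, sketched here. The sketch assumes that the calculation is the lower bound -e * p * ln(p) on the Bayes factor in favor of the null hypothesis and that the prior probability of a true effect is 50%; neither assumption is stated in the quote itself, and the function names are made up for illustration.

```python
import math

def min_bayes_factor(p_value: float) -> float:
    """Lower bound -e * p * ln(p) on the Bayes factor in favor of the null.
    The bound applies only for 0 < p < 1/e."""
    if not 0 < p_value < 1 / math.e:
        raise ValueError("bound is defined only for 0 < p < 1/e")
    return -math.e * p_value * math.log(p_value)

def min_false_alarm_probability(p_value: float, prior_prob_effect: float = 0.5) -> float:
    """Smallest posterior probability that the null is true, given the p-value
    and an assumed prior probability that there is a true effect."""
    prior_odds_null = (1 - prior_prob_effect) / prior_prob_effect
    posterior_odds_null = min_bayes_factor(p_value) * prior_odds_null
    return posterior_odds_null / (1 + posterior_odds_null)

for p in (0.05, 0.01):
    print(f"p = {p}: false-alarm probability of at least "
          f"{min_false_alarm_probability(p):.0%}")
```

Lowering `prior_prob_effect` (a less plausible effect to begin with) raises the false-alarm probability, which is the dependence on "the underlying probability that there is a true effect" that the quote mentions.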

## Some related concepts

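• The standard score, or z-score, is the deviation from the mean in units of standard deviation. For a normally distributed test statistic, a small p-value corresponds to a z-score that is large in magnitude (large and positive for a one-sided upper-tail test).

• 68-95-99.7 rule: for a normal distribution, about 68%, 95%, and 99.7% of values lie within 1, 2, and 3 standard deviations of the mean, respectively.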
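The relation between z-scores, the 68-95-99.7 rule, and p-values can be checked with a short sketch using Python's standard library. It assumes a normal distribution throughout; the helper names and the example numbers are illustrative only.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def z_score(x: float, mean: float, std_dev: float) -> float:
    """Deviation of x from the mean in units of standard deviation."""
    return (x - mean) / std_dev

def coverage_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return STANDARD_NORMAL.cdf(k) - STANDARD_NORMAL.cdf(-k)

def two_sided_p_value(z: float) -> float:
    """Probability of a z-score at least as large in magnitude as z."""
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))

# The 68-95-99.7 rule: coverage within 1, 2, and 3 standard deviations.
for k in (1, 2, 3):
    print(f"within {k} SD: {coverage_within(k):.4f}")

# A hypothetical observation of 130 from a population with mean 100 and SD 15.
z = z_score(130, 100, 15)
print(f"z = {z:.2f}, two-sided p = {two_sided_p_value(z):.4f}")
```

The larger the magnitude of the z-score, the smaller the two-sided p-value; a z-score of about 1.96 corresponds to the familiar two-sided 0.05 threshold.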

• Abelson's MAGIC criteria for judging how persuasive a statistical claim is:
• Magnitude - How big is the effect? Large effects are more compelling than small ones.
• Articulation - How specific is it? Precise statements are more compelling than imprecise ones.
• Generality - How generally does it apply?
• Interestingness - Interesting effects are those that "have the potential, through empirical analysis, to change what people believe about an important issue".
• Credibility - Credible claims are more compelling than incredible ones. The researcher must show that the claims made are credible.

## References

1. Goodman, S. N., "Of P-Values and Bayes: A Modest Proposal," Epidemiology 12, 295–297 (2001), http://journals.lww.com/epidem/fulltext/2001/05000/of_p_values_and_bayes__a_modest_proposal.6.aspx