Seven deadly sins of statistics and how to avoid them

今天统计部门的领导在PiCR课上给我们讲了如何规避统计分析中的7个陷阱。废话不多说，直接上干货。

How to avoid:

Be critical what plots you use: can the reader extract all necessary information?
Don’t use bars representing the mean with whiskers representing? SD? SEM?
Use Dotplots for small data sets.
Use Boxplots if you have enough data points.

How to avoid:

Be aware of multiple levels on experiments (like concentration gradients).
Technical replicates are generated to account for technical variability.
Technical variability is typically not the focus of interest.
Biological replicates are necessary to obtain information about biological variability.
Choose appropriate statistical analysis method for the evaluation of multi-level data:
- Mixed effects models (statistically challenging)
- Substitute in case of technical replicates of biological replicates:
  - average technical replicates for every biological replicate.

How to avoid:

Type I error: False positive

Type II error: False negative

Type I error probability is selected by investigator.
Type II error probability is unknown. It depends on unknown truth (What’s the real difference? How much variability?)
Keep statistical errors (type I, type II) in mind:
- if H0 can be rejected: has type I error occurred?
- if H0 can not be rejected: has type II error occurred?
Do you want to “prove the null hypothesis”?

Example: Does a generic drug have the same effect as a brand-name drug?

–> switch the roles of null and alternative hypothesis –> “equivalence” testing

How to avoid:

Not only perform statistical testing (and get a p-value) but always derive confidence intervals to get an estimate of effect size.
The 95%-confidence interval covers the true effect with probability 95%.

Something you should know:

95% CI indicates that we are 95% confident that the population mean will fall within the range described. So CI is similar to a measure of spread, like SD.
As sample size increase or variability in the measurement decrease, the CI will become more narrow. And the results will be more precise.
CI can be used similar to a p-value to determine significant differences, but conveys more useful information than p values
Always report 95% CI with p-value, not report solely p-value. If only one is to be reported, then it should be CI, because p value is less important and can be deduced from CI
never interpret p-value > 0.05 as an indication of no difference or no association, only the CI can provide this message. (absence of evidence does not equate to evidence of absence)

How to avoid:

Include variables in the model with regards to scientific content.

比如，研究肺癌的影响因素的时候，吸烟即使统计上不显著也得放入模型呀，小朋友们
Use statistical variable selection procedures (老师说，这些个方法一般适用于变量数小于30的):
- stepwise selection
- backward selection
- forward selection
- best subset selection
In case of high-dimensional covariables (e.g., omics data)
- complex variables selection methods available. (In my opinion, PCA/PLS-DA/lasso regression and other methods)

How to avoid:

Be aware of overfitting.
Don’t report accuracy on training set. It will most probably be good and it will be overoptimistic.
Challenge the new method (score, cut-off value for a score, prediction model …) by applying it to a separate test set.
If no separate test set is available, split your data set into training and test set.
Crossvalidation, bootstrapping, …

How to avoid:

从实验的设计开始就要有统计学家的参与，因为他们可以预见一些数据处理时候的问题。比如你要是没重复没对照，估计统计学家不会理你，毕竟巧妇难为无米之炊。

December 07, 2017