Seven deadly sins of statistics and how to avoid them


Sin 1: Using uninformative or misleading plots**

How to avoid:

  1. Be critical what plots you use: can the reader extract all necessary information?

  2. Don’t use bars representing the mean with whiskers representing? SD? SEM?

  3. Use Dotplots for small data sets.

  4. Use Boxplots if you have enough data points.


Sin 2: Ignoring the hierarchical structure of data**

How to avoid:

  1. Be aware of multiple levels on experiments (like concentration gradients).

  2. Technical replicates are generated to account for technical variability.

  3. Technical variability is typically not the focus of interest.

  4. Biological replicates are necessary to obtain information about biological variability.

  5. Choose appropriate statistical analysis method for the evaluation of multi-level data:

    • Mixed effects models (statistically challenging)

    • Substitute in case of technical replicates of biological replicates:

      • average technical replicates for every biological replicate.

Sin 3: State that you proved the null hypothesis

How to avoid:

Type I error: False positive

Type II error: False negative

  1. Type I error probability is selected by investigator.

  2. Type II error probability is unknown. It depends on unknown truth (What’s the real difference? How much variability?)

  3. Keep statistical errors (type I, type II) in mind:

    • if H0 can be rejected: has type I error occurred?
    • if H0 can not be rejected: has type II error occurred?
  4. Do you want to “prove the null hypothesis”?

Example: Does a generic drug have the same effect as a brand-name drug?

–> switch the roles of null and alternative hypothesis –> “equivalence” testing

Sin 4: Not distinguishing between statistical significance and practical relevance**

How to avoid:

  1. Not only perform statistical testing (and get a p-value) but always derive confidence intervals to get an estimate of effect size.

  2. The 95%-confidence interval covers the true effect with probability 95%.

    Relationship between statistical and clinical significance

Something you should know:

  • 95% CI indicates that we are 95% confident that the population mean will fall within the range described. So CI is similar to a measure of spread, like SD.
  • As sample size increase or variability in the measurement decrease, the CI will become more narrow. And the results will be more precise.
  • CI can be used similar to a p-value to determine significant differences, but conveys more useful information than p values
  • Always report 95% CI with p-value, not report solely p-value. If only one is to be reported, then it should be CI, because p value is less important and can be deduced from CI
  • never interpret p-value > 0.05 as an indication of no difference or no association, only the CI can provide this message. (absence of evidence does not equate to evidence of absence)

Sin 5: Select variables for multivariable model by significance in univariable models

How to avoid:

  1. Include variables in the model with regards to scientific content.


  2. Use statistical variable selection procedures (老师说,这些个方法一般适用于变量数小于30的):

    • stepwise selection
    • backward selection
    • forward selection
    • best subset selection
  3. In case of high-dimensional covariables (e.g., omics data)

    • complex variables selection methods available. (In my opinion, PCA/PLS-DA/lasso regression and other methods)

Sin 6: Report accuracy of newly developed method on training data

How to avoid:

  1. Be aware of overfitting.

  2. Don’t report accuracy on training set. It will most probably be good and it will be overoptimistic.

  3. Challenge the new method (score, cut-off value for a score, prediction model …) by applying it to a separate test set.

  4. If no separate test set is available, split your data set into training and test set.

  5. Crossvalidation, bootstrapping, …

Sin 7: Connect statisticians at the last step

How to avoid:



Pursuing the dream and the best future

CHENYUAN Blog Homepage

因为不想遗忘! 在这个信息大爆炸的年代,最重要的是对知识的消化-吸收-重铸。每天学了很多东西,但是理解的多少,以及能够运用多少是日后成功的关键。作为一个PhD,大脑中充斥了太多的东西,同时随着年龄的增长,难免会忘掉很多事情。所以只是为了在众多教程中写一个自己用到的,与自己...… Continue reading