Univariate and Multivariate Survival Analysis

Introduction

Techniques of survival analysis are needed once you have right-censored data. Such data is the result of clinical trials or retrospective studies that observe a defined endpoint such as progression free survival (PFS) or overall survival (OS).

Data source

The survivalAnalysis package assumes the data is a data frame with one row per subject and at least two columns: - A numeric column giving the survival time. - A logical or numeric (0/1) column with the censoring indicator. TRUE or 1 is interpreted such that the endpoint occurred for the subject at the given time, FALSE or 0 is interpreted such that the endpoint did not occur during the given time under observation.

suppressPackageStartupMessages(library(survival))
head(lung) #Survival data is contained in the "time" and "status" columns, with all other columns as presumed or potential covariates.
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0

The Analysis Result

suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tidytidbits))
## Warning: package 'tidytidbits' was built under R version 3.5.2
library(survivalAnalysis) 
#perform an analysis generating a result object, and then perform output such as printing or plotting using this object.
## Warning: package 'survivalAnalysis' was built under R version 3.5.2
# Overview of data and check the median survival
survival::lung %>%
  analyse_survival(vars(time, status))
# vars() is from dplyr, allows to give the plain variable names. Passing column names as strings is also possible.
## Overall Analysis:
##          log.rank.p   n median Lower.CI Upper.CI min    max
## overall:       <NA> 228  310.0    285.0    363.0 5.0 1022.0
## 
## <only one group in analysis>
survival::lung %>%
  analyse_survival(vars(time, status)) %>%
  print(timespan_unit="months")
## Overall Analysis:
##          log.rank.p   n median Lower.CI Upper.CI min  max
## overall:       <NA> 228   10.2      9.3     11.9 0.2 33.5
## 
## <only one group in analysis>

Univariate Survival Analysis

survival::lung %>% #The ECOG performance status is a classical prognostic indicator.
  count_by(ph.ecog) #count_by method from tidytidbits to check the distribution of "ph.ecog"
## # A tibble: 5 x 4
##   ph.ecog     n     rel percent
##     <dbl> <int>   <dbl> <chr>  
## 1       0    63 0.276   27.6%  
## 2       1   113 0.496   49.6%  
## 3       2    50 0.219   21.9%  
## 4       3     1 0.00439 " 0.4%"
## 5      NA     1 0.00439 " 0.4%"

survival analysis comparing the subgroups formed by the ECOG status

survival::lung %>%
  mutate(ecog=recode_factor(ph.ecog, `0`="0", `1`="1", `2`="2-3", `3`="2-3")) %>% #group 2 und 3 together
  analyse_survival(vars(time, status), by=ecog) -> result #store the result object for later use
print(result)
## Analysis by ecog:
## Overall Analysis:
##          log.rank.p   n median Lower.CI Upper.CI min    max
## overall:    p<0.001 227  310.0    285.0    363.0 5.0 1022.0
## 
## Descriptive Statistics per Subgroup:
##     records events median Lower.CI Upper.CI
## 0        63     37  394.0    348.0    574.0
## 1       113     82  306.0    268.0    429.0
## 2-3      51     45  183.0    153.0    288.0
## 
## Pair-wise p-values (log-rank, uncorrected):
##         0     1
## 1   0.063    NA
## 2-3   0.0 0.003
## 
## Pair-wise Hazard Ratios:
##           Lower.CI   HR Upper.CI      p
## 1 vs. 0       0.98 1.45     2.13  0.064
## 0 vs. 1       0.47 0.69     1.02  0.064
## 2-3 vs. 0     1.64 2.54     3.93 <0.001
## 0 vs. 2-3     0.25 0.39     0.61 <0.001

Instead of inserting a mutate call, we just pass the logical expression ph.ecog <= 1

survival::lung %>%
  analyse_survival(vars(time, status), 
                   by=ph.ecog <= 1,
                   by_label_map=c(`TRUE`="ECOG 0-1", #give labels to the values of `by`. These labels will also be used in plots
                                  `FALSE`="ECOG 2-3"))
## Analysis by ph.ecog <= 1:
## Overall Analysis:
##          log.rank.p   n median Lower.CI Upper.CI min    max
## overall:    p<0.001 227  310.0    285.0    363.0 5.0 1022.0
## 
## Descriptive Statistics per Subgroup:
##          records events median Lower.CI Upper.CI
## ECOG 0-1     176    119  353.0    306.0    433.0
## ECOG 2-3      51     45  183.0    153.0    288.0
## 
## Hazard Ratio:
##                       Lower.CI  HR Upper.CI      p
## ECOG 2-3 vs. ECOG 0-1     1.41 2.0     2.82 <0.001
## ECOG 0-1 vs. ECOG 2-3     0.35 0.5     0.71 <0.001

Kaplan-Meier Plots

The survminer package has revolutionized drawing of Kaplan-Meier plots in R. The survivalAnalysis package fully integrates plotting with survminer, providing sane defaults:

kaplan_meier_plot(result)

Refine parameters, refer to the parameters in function ggsurvplot. All parameters supported by ggsurvplot can simply be passed to kaplan_meier_plot. In addition, there are some shortcuts which we will now use.

kaplan_meier_plot(result,
                  break.time.by="breakByQuarterYear", #x axis breaks by quarter year 
                  xlab=".OS.months", #x axis label for OS
                  legend.title="ECOG Status", #legend label
                  hazard.ratio=T, #display the hazard ratios
                  risk.table=TRUE, #display risk table
                  table.layout="clean", #risk table "clean" only with minimal overhead
                  ggtheme=ggplot2::theme_bw(10)) #custom theme

Two KM plots side-by-side using kaplan_meier_grid. This piece of code also demonstrates the possible use of lists of default arguments, which can be overridden. If arguments differ per-plot, there is the mapped_plot_args argument.

Internally, gridExtra::marrangeGrob does the layout. For some reason, the returned value needs an explicit print. Normally you will want to save it to PDF. I recommend using tidytidbits’ save_pdf, into which you can simply pipe the kaplan_meier_grid output.

default_args <- list(break.time.by="breakByMonth",
                     xlab=".OS.months",
                     legend.title="ECOG Status",
                     hazard.ratio=T,
                     risk.table=TRUE,
                     table.layout="clean",
                     ggtheme=ggplot2::theme_bw(10))

list(result,
     survival::lung %>%
       analyse_survival(vars(time, status), 
                        by=sex,
                        by_label_map=c(`1`="Male", `2`="Female"))
     ) %>%
  kaplan_meier_grid(nrow=2,
                    default_args,
                    break.time.by="breakByQuarterYear",
                    mapped_plot_args=list(
                      legend.title=c("ECOG Status", "Sex"),
                      title=c("A", "B")
                    )) %>%
  print

Multivariate Survival Analysis

Multivariate analysis, using the technique of Cox regression, is applied when there are multiple, potentially interacting covariates. While the log-rank test and Kaplan-Meier plots require categorical variables, Cox regression works with continuous variables. (Of course, you can use it with categorical variables as well, but this has implications which are described below.)

In lung data, possibly interesting covariates are patient age, sex, ECOG status and the amount of weight loss. Sex is encoded as a numerical vector, but is in fact categorical. We need to make it a factor. ECOG status is at least ordinally scaled, so we leave it numerical for now. Following the two-step philosophy of survivalAnalysis, we first perform the analysis with analyse_multivariate and store the result object. We provide readable labels for the covariates to allow easy interpretation. There is a print() implementation which prints essential information for our result.

covariate_names <- c(age="Age at Dx", #provide readable labels for the covariates to allow easy interpretation.
                     sex="Sex",
                     ph.ecog="ECOG Status",
                     wt.loss="Weight Loss (6 mo.)",
                     `sex:female`="Female",
                     `ph.ecog:0`="ECOG 0",
                     `ph.ecog:1`="ECOG 1",
                     `ph.ecog:2`="ECOG 2",
                     `ph.ecog:3`="ECOG 3")

survival::lung %>%
  mutate(sex=rename_factor(sex, `1` = "male", `2` = "female")) %>% #make sex a factor.
  analyse_multivariate(vars(time, status),
                       covariates = vars(age, sex, ph.ecog, wt.loss),
                       covariate_name_dict = covariate_names) -> result #store the result object.
print(result)
## Overall:
##    n                 covariates Likelihood ratio test p Wald test p
##  213 age, sex, ph.ecog, wt.loss                  <0.001      <0.001
##  Score (logrank) test p
##                  <0.001
## 
## Hazard Ratios:
##   factor.id         factor.name factor.value   HR Lower_CI Upper_CI Inv_HR
##  sex:female                 Sex       female 0.55     0.39     0.78   1.81
##     wt.loss Weight Loss (6 mo.) <continuous> 0.99     0.98      1.0   1.01
##         age           Age at Dx <continuous> 1.01     0.99     1.03   0.99
##     ph.ecog         ECOG Status <continuous> 1.67     1.31     2.14    0.6
##  Inv_Lower_CI Inv_Upper_CI      p
##          1.28         2.55 <0.001
##           1.0         1.02  0.176
##          0.97         1.01  0.165
##          0.47         0.76 <0.001

A Note on Categorical Variables

Inv_HR: for binary variables, it means inverted HR, equal to 1/HR.
Things get more complicated with three or more categories. The rule is: You must choose one level of the factor as the reference level with defined hazard ratio 1.0. Then, for k levels, there will be k-1 pseudo variables created which represent the hazard ratio of this level compared to subjects in the reference level. (For the comparison of one level vs. all remaining subjects see the paragraph on one-hot analysis further down.)

As an example, we consider the two covariates which were significant above, sex and ECOG, and regard the ECOG status as a categorical variable with four levels. As reference level, we choose ECOG=0 with the parameter reference_level_dict.

survival::lung %>%
  mutate(sex=rename_factor(sex, `1` = "male", `2` = "female"),
         ph.ecog = as.factor(ph.ecog)) %>%
  analyse_multivariate(vars(time, status),    
                       covariates = vars(sex, ph.ecog),   
                       covariate_name_dict=covariate_names,
                       reference_level_dict=c(ph.ecog="0"))
## Overall:
##    n   covariates Likelihood ratio test p Wald test p
##  227 sex, ph.ecog                  <0.001      <0.001
##  Score (logrank) test p
##                  <0.001
## 
## Hazard Ratios:
##   factor.id factor.name factor.value   HR Lower_CI Upper_CI Inv_HR
##  sex:female         Sex       female 0.58     0.42     0.81   1.72
##   ph.ecog:1 ECOG Status            1 1.52     1.03     2.25   0.66
##   ph.ecog:2 ECOG Status            2 2.58     1.66     4.01   0.39
##   ph.ecog:3 ECOG Status            3 7.76     1.04    58.04   0.13
##  Inv_Lower_CI Inv_Upper_CI      p
##          1.24          2.4  0.001
##          0.45         0.97  0.036
##          0.25          0.6 <0.001
##          0.02         0.96  0.046

A Forest Plot

The usual method to display results of multivariate analyses is the forest plot. The survivalAnalysis package provides an implementation which generates ready-to-publish plots and allows extensive customization.

forest_plot(result)

Tune parameters

forest_plot(result,
            factor_labeller = covariate_names, #label the subgroups.
            endpoint_labeller = c(time="OS"), #label the endpoint.
            orderer = ~order(HR), #order by hazard ratio
            labels_displayed = c("endpoint", "factor", "n"),
            ggtheme = ggplot2::theme_bw(base_size = 10), #Adjust font size
            relative_widths = c(1, 1.5, 1), #more space for the plot, less space for the tables
            HR_x_breaks = c(0.25, 0.5, 0.75, 1, 1.5, 2)) #more breaks on the X axis

The forest_plot function actually accepts any number of results and will display them all in the same plot. For example, you may want to display both OS and PFS in the same plot, but of course ordered and with a line separating the two. To do that,

  • throw both results into forest_plot
  • order first by endpoint, then by something else: orderer = ~order(endpoint, factor.name)
  • use a categorizer function returning a logical vector which determines where the separating line shall be drawn. For a flexible solution, the usual idiom is something like categorizer = ~!sequential_duplicates(endpoint, ordering = order(endpoint, factor.name)), where the ordering clause is identical to your orderer code.
    Finally, if you want separate plots but display them in a grid, use forest_plot_grid to do the grid layout and forest_plot’s title argument to add the A, B, C. labels.

Multiple Univariate Analyses

It is common practice to perform univariate analyses of all covariates first and take only those into the multivariate analysis which were significant to some level in the univariate analysis. (I see some pros and strong cons with this procedure, but am open to learn more on this). The univariate part can easily be achieved using purrr’s map function. A forest plot will plot multiple results, even if they come as a list.

df <- survival::lung %>% mutate(sex=rename_factor(sex, `1` = "male", `2` = "female"))

map(vars(age, sex, ph.ecog, wt.loss), function(by)
{
  analyse_multivariate(df,
                       vars(time, status),
                       covariates = list(by), # covariates expects a list
                       covariate_name_dict = covariate_names)
}) %>%
  forest_plot(factor_labeller = covariate_names,
              endpoint_labeller = c(time="OS"),
              orderer = ~order(HR),
              labels_displayed = c("endpoint", "factor", "n"),
              ggtheme = ggplot2::theme_bw(base_size = 10))

One-Hot Encoding

The one-hot parameter triggers a mode where for each factor level, the hazard ratio vs. the remaining cohort is plotted. This means that no level is omitted. Please be aware of the statistical caveats. And please note that this has nothing to do any more with multivariate analysis. In fact, now you need the result of the univariate, categorically-minded analyse_survival.

survival::lung %>% 
  mutate(kras=sample(c("WT", "G12C", "G12V", "G12D", "G12A"), 
                     nrow(.), 
                     replace = T, 
                     prob = c(0.6, 0.24, 0.16, 0.06, 0.04))
          ) %>%
  analyse_survival(vars(time, status), by=kras) %>%
  forest_plot(use_one_hot=TRUE,
              endpoint_labeller = c(time="OS"),
              orderer = ~order(HR),
              labels_displayed = c("endpoint", "factor", "n"),
              ggtheme = ggplot2::theme_bw(base_size = 10))

CHENYUAN

CHENYUAN

CHENYUAN
Pursuing the dream and the best future

CHENYUAN Blog Homepage

因为不想遗忘! 在这个信息大爆炸的年代,最重要的是对知识的消化-吸收-重铸。每天学了很多东西,但是理解的多少,以及能够运用多少是日后成功的关键。作为一个PhD,大脑中充斥了太多的东西,同时随着年龄的增长,难免会忘掉很多事情。所以只是为了在众多教程中写一个自己用到的,与自己...… Continue reading