Abstract

Introduction: Modern genomic medicine, based on omic technologies, provides a novel approach to the study of diseases, since it can facilitate the identification of molecular biomarkers associated with risk, prognosis or outcome. To achieve this, we need accurate methods to analyze survival data from patient cohorts and correlate these data with the expression level of specific genes associated with the disease (i.e., in the case of cancer, associated with tumor prognosis). In addition, this type of analysis can also generate data for patient risk prediction and patient stratification, which are critical for a real progress in personalized medicine. The ASURI package presented here provides a set of integrated tools for survival analysis and patient risk prediction based on gene expression profiling, ensuring the discovery of novel gene survival markers with high robustness and reproducibility.

Key Features:

  1. Evaluate the association of gene expression with survival time (geneSurv).

  2. Discover gene markers that correlate with phenotypic factors (genePheno).

  3. Build robust patient risk predictors using genes associated with survival (patientRisk).

  4. Predict patient risk and classify into risk groups (predict_PatientRisk).

  5. Estimate survival for novel query patients (predict_SurvCurve).

Introduction

Here we present a new algorithm called ASURI that provides an integrated set of functions to analyse disease SURVIVAL and obtain patient RISK predictions based on gene activity signatures. Specifically, the algorithm is designed to: (1) Analyse the ability of a given gene to mark disease prognosis and survival based on gene expression or other gene-related activity signal; (2) Discover gene markers by identifying the significant association of gene expression (or other gene-related signal) with a clinical variable or phenotypic trait; (3) Constructing robust patient risk predictors based on selected gene signatures using univariate and multivariate approaches; (4) Predicting the risk of new query patients based on their gene expression profile and classifying the patients into a risk group (high/intermediate/low); (5) Estimating a survival curve for new query patients.

In accordance with the 5 utilities of the package described above, ASURI is built with five R functions specific to each of these uses (Figure @ref(fig:diagramaCrop)): (1) geneSurv(), that implements a robust method to evaluate the ability of a given gene to mark survival based on its expression level. It is based on a bootstrapped version of the Kaplan-Meier estimator and the long-rank statistics provided by (Therneau 2021). This function helps to overcome the instability and irreproducibility of other tools developed for this purpose (Aguirre-Gamboa et al. 2013). (2) genePheno(), which implements a robust method to select a subset of genes whose expression is correlated with a clinical variable or phenotypic parameter. This function is based on the Elastic Net algorithm proposed by (Friedman, Hastie, and Tibshirani 2010; Simon et al. 2011), which removes features that are not-relevant to the prediction the clinical variable. To improve the stability and reproducibility of the list of genes associated with a phenotype, the method implements a bootstrap strategy. The function provides several metrics that are useful to evaluate the stability and relevance of each gene. This function is particularly useful for clinicians who want to discover survival biomarkers that interact with clinical variables or parameters relevant to patient treatment.

Workflow diagram of the package ASURI showing the decision-making lines and the five main functions.

Workflow diagram of the package ASURI showing the decision-making lines and the five main functions.

Next function, (3) patientRisk(), which accurately predicts the patient risk given a set of informative genes, provided as an input list of selected genes. The function is based on the multivariate UNICOX model proposed by (R. Tibshirani 2009). The function provides as output the list of selected genes ranked according to their impact on patient risk along with a relevance index. It also provides a risk prediction for each one of the patients included in the training considering the gene signature obtained and an optimal stratification into two or three risk groups (i.e., poor, intermediate and good prognosis). (4) predict_PatientRisk(), which predicts the risk for new, novel patients based on the model and the gene signature produced by the patientRisk() function over the training set. This function also classifies the patients in two or three risk groups. (5) predict_SurvCurve(), which estimates the survival curve for any given query patient using the patient’s gene expression profile. These 5 functions included in ASURI also provide graphical output plots to help visualize and analyze survival and patient risk; and, as indicated, they are designed to work with one or more genes (i.e., features), that is, to analyze univariate and multivariate gene signatures.

Package download and installation. Package dependencies.

The ASURI package can be installed directly from Bioconductor typing the following commands in an R:

if (!requireNamespace(“BiocManager”, quietly = TRUE)) { install.packages(“BiocManager”) }

BiocManager::install(“ASURI”)

It can be installed also from a local directory if you have downloaded previously the package.

install.packages(“/path/to/ASURI.tar.gz”, dependencies = TRUE) library(ASURI)

If a problem with dependencies is reported during installation, users should first download the missing R packages directly from CRAN R repository or from Bioconductor repository, and then repeat previous instructions to install ASURI R package. In Windows based OS, Rtools might need to be installed. Some of the package dependencies of ASURI are the following, from CRAN: ‘ggplot2’, ‘glmnet’, ‘Rdpack’, ‘ROCR’, ‘scales’; and from Bioconductor: ‘Biobase’, ‘siggenes’, ‘survcomp’. The ASURI package also contains an experimental dataset to illustrate the performance of the various functions and methods implemented in it.

Dataset description and download.

The dataset included in ASURI as a study case is intended to facilitate the use of the package. The dataset was obtained from several GSE series from GEO database corresponding to breast cancer (BRCC) samples, where expression and survival time were reported (Bueno-Fortes et al. 2023). Each sample corresponds to genome-wide expression profiles of BRCC primary tumor samples hybridized on the transcriptomic platform: Affymetrix HGU133 Plus2.0; which is a high-density microarray expression platform containing 594,000 oligo-nucleotide probes (organized in probe-sets) that we mapped to ENSEMBL genes using the hgu133plus2hsensgcdf CDF package, obtained from BRAINARRAY. The expression signal of the samples was normalized using fRMA (McCall, Bolstad, and Irizarry 2010) or RMA (Gautier et al. 2004) and Combat (Kupfer et al. 2012), as described in (Bueno-Fortes et al. 2023). Clinical variables and information about the tumors were collected and included in the phenodata object. More specifically, the information about the expression level of 3 key BRCC biomarkers (i.e. ER / ESR1, PR / PGR, and HER2 / ERBB2 protein-coding genes) was included in the phenodata when available.

Once this dataset is loaded the seBRCA object is created. This SummarizedExperiment object contains the normalized gene expression data matrix and phenotypical information about each sample (i.e. clinical physio-pathological data and other phenotypic data).

library(asuri)
data(seBRCA)
# To extract the normalized expression
mExprs <- assay(seBRCA)
mPheno <- colData(seBRCA)
# To visualize the dimensions of the two extracted objects
dim(mExprs)
dim(mPheno)

Platforms supported by the library and preprocessing.

The library supports expression data matrices coming from high-density microarrays (such as the example data set described above), as well as expression data matrices coming from RNA-seq platforms. However, it can be applied to any omic technology that provides a gene-related activity signal.

For the transcriptomic / gene expression data, it is recommended to transform the gene signal using the log2log_2 function and to standardize the expression values, for example by calculating a z-score. For microarray data, it is recommended to perform a robust normalization of the raw signal before running COX regression methods. In addition, for the microarray datasets, the gene expression signals are calculated using the RMA method (Gautier et al. 2004; McCall, Bolstad, and Irizarry 2010) or the normalizeBetweenArrays method from (Ritchie et al. 2015) (both work well within the ASURI package). For RNA-seq datasets, the data matrix of counts per gene can be used, and we recommend applying a normalization method such as the Voom method, considered in (Scarfo et al. 2016).

Survival and risk assessment (case study on Breast Cancer).

In this section, we present and explain the use of the ASURI library, including the five main functions described above, by applying the package to the example breast cancer dataset (which contains 200 samples and 20,049 genes). The different parts of this main section show the different analyses that can be performed with the package.

Analysis of individual genes as survival markers: geneSurv()

The first goal of the ASURI algorithm is to identify human genes measured in biological samples from patients (e.g., measured in tumor samples from cancer patients) that are associated with the survival of the individuals (i.e., with the prognosis). To do this, the algorithm examines the expression (or activity) of the query genes in a cohort of samples from patients with the same disease, and divides such a cohort into samples with high expression of a query gene and samples with low expression of such query gene. In this way, the algorithm splits the cohort into two groups and then runs a Kaplan-Meier (K-M) analysis for each of these two groups of samples, determining whether such K-M show a significant difference based on the log-rank statistic. All of this is implemented in the geneSurv() function (Bueno-Fortes et al. 2023). This statistic evaluates the separability between the survival curves of the groups obtained by a gene expression (or gene activity) threshold, which will be: samples with high expression versus samples with low expression of the query gene. The naive approach introduced in other tools, such as (Aguirre-Gamboa et al. 2013), estimates the two groups of samples just splitting the cohort using the mean or median of the gene activity, and then trying to obtain the best p-value of the log-rank statistic for the KM of the two groups. This strategy often yields meaningless solutions, giving rise to highly unbalanced groups. To overcome these problems, the geneSurv() function implements an original method that uses bootstrapping to compute the optimal threshold and focuses the split only on the central 60% of the samples, avoiding very unbalanced groups.

As described above, to improve the stability and reproducibility of the survival analysis available in other R packages (Kassambara et al. 2024), a bootstrap strategy is implemented in ASURI. A membership probability is estimated for the classification of the each patient (each sample) into one of the two risk groups (Figure @ref(fig:pclassesr1)). These membership probabilities allow us to robustly reclassify patients around the gene expression threshold (Figure @ref(fig:boxesr1)). This approach is univariate and does not take into account interactions or correlations between the genes (i.e., it considers each feature gene as independent).

Once the patients are assigned to the two groups by the bootstrap strategy, the function plots robust K-M curves with confidence intervals (Figure @ref(fig:kmesr1)).

As an example of use, let’s now analyse using the geneSurv() function a well known gene survival marker in the dataset of breast cancer: the Estrogen Receptor (ER+, ESR1).

The input parameters for the function are the following:

  • seData SummarizedExperiment object with the normalized expression data and the phenotypic data in colData. Phenotypic colData must contain the samples name in the first column and two columns with the time and the status.
  • time SummarizedExperiment colData column name containing the survival time in years for each sample in numeric format.
  • status SummarizedExperiment colData column name containing the status (censored 0 and not censored 1) for each sample..
  • geneName A character string with the name of the gene being analyzed.
  • boxplot A logical value indicating whether to generate a boxplot of gene expression by survival group (by default = TRUE).
  • iter The number of iterations (bootstrap resampling) for calculating optimal group cutoffs (by default = 100).
  • type Specifies whether the two Kaplan-Meier curves are computed using the gene expression groups (by default “exprs”), or using the risk groups (“risk”) if the samples are already classified by risk level and this is used directly to divide the samples into two groups.
  • cut_time A numeric value specifying the cutoff time (in years) for survival analysis. All events beyond this time are treated as censored (by default = 10 years).
data(seBRCA)
time <- "time"
status <- "status"
geneName <- "ESR1"
# The TIME value must be transformed to YEARS The gene
# expression vector must be provided with the NAMES of each
# sample, that should match the time and status NAMES.
set.seed(5)
outputKM <- geneSurv(seBRCA, time, status, geneName, type = "exprs")

The function returns different outputs, i.e., a list with values and plots that is different depending on the specified type or mode how it is run. Run of the function geneSurv() as type = "exprs":

  • geneName: A character string with the name of the gene being analyzed.
  • patientExpr: The expression level of the query gene in each sample (i.e., in each patient).
  • patientClass: Vector of group classification according to the gene expression level: 2 = high expression (HighExpr) and 1 = low expression level (LowExpr).
  • patientClassProbality: Vector of membership probabilities for the classification.
  • wilcox.pvalue: The p-value from the Wilcoxon test comparing the two expression groups.
  • plot_values: A list containing the Kaplan-Meier curve results, the log-rank test p-value, and the hazard ratio.

Also three plots are provided:

  • Plot of the probability (from 0 to 1) of the samples to belong to one of the two groups or classes obtained based on the expression level of the query gene (low expression and high expression) (Figure @ref(fig:pclassesr1)).
  • Boxplots showing the expression distributions of the samples divided into the two defined groups (LowExpr versus HighExpr). The plot includes the p.value of the Wilcoxon test comparing these two expression groups (Figure @ref(fig:boxesr1)).
  • Kaplan-Meier plot showing the curves calculated based on the survival of the two groups of samples generated (i.e., samples with high expression of the query gene, in red, and samples with low expression of the query gene, in green). The plot includes the p.value of the log-rank statistical test obtained when comparing the K-M survival profile of the two groups (Figure @ref(fig:kmesr1)).

For type = "risk":

  • geneName: A character string with the name of the gene being analyzed.
  • patientExpr: The expression level of the query gene in each sample (i.e., in each patient).
  • risk_score_predicted: A numeric vector of predicted or calculated risk score (in %, in relative numbers) of each patient.
  • plot_values: A list containing the Kaplan-Meier curve results, the log-rank test p-value, and the hazard ratio.

Also a plot is provided:

  • Kaplan-Meier plot showing the curves calculated based on the survival of the two groups of samples generated (high risk and low risk). The plot includes the p.value of the log-rank statistical test obtained when comparing the K-M survival profile of the two groups (Figure @ref(fig:kmrisk)).
Patient Class Probability for low exprs (89) and high exprs (111) groups).

Patient Class Probability for low exprs (89) and high exprs (111) groups).

Figure @ref(fig:pclassesr1) shows the plot of the probability of the samples to be assigned to one of the expression groups: low expression in green or high expression in red. The samples in the example correspond to the tumors isolated from 200 breast cancer patients. The probability of belonging to a group is estimated by bootstrapping. Patients with a membership probability greater than 0.8 are classified with a high degree of confidence. The classification is uncertain (p < 0.8) for patients around the intermediate gene expression level.

Boxplot: high expression vs low expression for ESR1 gene.

Boxplot: high expression vs low expression for ESR1 gene.

Figure @ref(fig:boxesr1) boxplots showing the expression distribution of the samples divided into the two defined groups (LowExpr versus HighExpr). The plot includes the p.value of the Wilcoxon test comparing these two expression groups. Notice that there is a small overlapping between the high and low expression groups. This is due to the reassignment of patients with intermediate expression level accomplished by the bootstrap strategy.

Kaplan Meier curves obtained after the assignment of the patients by bootstrap to high or low survival using the selected feature ESR1.

Kaplan Meier curves obtained after the assignment of the patients by bootstrap to high or low survival using the selected feature ESR1.

Figure @ref(fig:kmesr1) shows the two Kaplan-Meier curves with the confidence intervals obtained by the function geneSurv() for the gene ESR1. The threshold for the gene expression level that splits the samples in two groups is estimated by optimization of the log-rank test statistic. The procedure is repeated 100 times using a bootstrap strategy to improve the classification of intermediate risk patients.

The result shown in Figure @ref(fig:kmesr1), with a p-value for the log-rank test (= 0.01695) and an Hazard Ratio (= 1.86: CI (0.31, 3.13) ) suggest that the gene ESR1 is a good survival marker for breast cancer.

Run of the function geneSurv() as type = "risk":

# If we instead consider to run the function as *type* = risk
geneName <- "BRCA1"
set.seed(5)
outputKM.TP53 <- geneSurv(seBRCA, time, status, geneName, type = "risk")
Kaplan Meier curves obtained after the assignment of the patients by bootstrap to high or low survival using the estimated risk based on feature BRCA1.

Kaplan Meier curves obtained after the assignment of the patients by bootstrap to high or low survival using the estimated risk based on feature BRCA1.

Figure @ref(fig:kmrisk) shows the two Kaplan-Meier curves with the confidence intervals obtained by the function geneSurv() for the gene BRCA1. Now the two groups are obtained by estimating the risk for each patient and determining the split threshold by optimizing the log-rank test statistic between the two curves. The procedure is repeated 100 times using a bootstrap strategy to improve the classification of intermediate risk patients.

Discovery of genes associated with phenotypic factors: genePheno()

It has been suggested in the literature that the clinical features and physiopathological parameters of a patient’s disease can be associated with specific biomolecular pathways or genes that can be identified as markers of that patient’s condition or status. The genePheno() function implements a robust algorithm to obtain a list of genes associated with a given clinical variable or phenotypic factor (Bueno-Fortes et al. 2023) measured in a cohort of patients. These genes may provide alternative and novel targets to standard known disease markers.

The genePheno() function performs a feature selection by fitting a predictor to the given set of samples based on the Elastic Net algorithm (proposed by (Friedman, Hastie, and Tibshirani 2010; Simon et al. 2011)). This method incorporates a regularization term that lies between the Lasso and the Ridge regression. Small coefficients are reduced to zero but the predictor can include more relevant variables than the Lasso to improve the prediction accuracy, even though some of them may be slightly correlated. In this way, we avoid the extreme behavior of Lasso, which randomly selects only one of them. Thus, the Elastic Net algorithm helps to improve the stability and to reduce the variance in the feature selection process (Robert Tibshirani 1996). The optimal value for the regularization parameter is determined automatically by a nested cross-validation strategy. The loss function for the predictor is the AUC (Area Under the ROC Curve), which considers a balance between false positive and false negative errors.

To improve the reproducibility and stability of the list of candidate genes found (i.e., the selected features) found to be significantly correlated with a clinical variable, a bootstrap strategy is implemented. Training data are resampled with replacement and an ensemble of gene lists is obtained. A stability metric based on the selection probability is defined. Thus, genes with a low selection probability are robustly filtered out. A final gene list is obtained with two metrics. The first one evaluates the stability and the second one the degree of correlation with the clinical variable. This approach is multivariate and takes into account additive interactions between candidate genes as well as possible correlations.

Before running genePheno(), the number of input features, i.e. the number of tested genes, should be reduced (to no more than a thousand) to improve the computational efficiency of the function. However, this preprocessing step should be done carefully to avoid removing relevant genes. The ASURI package provides the function prefilterSAM(), which implements a differential expression algorithm called SAM (proposed by (Tusher, Tibshirani, and Chu 2001; Li and Tibshirani 2013)) to obtain a set of significant genes to be selected. prefilterSAM() improves the stability and robustness of SAM by applying a bootstrapping strategy. For each preliminary gene list, the optimal lambda threshold is calculated by repeating the process with different FDR values and selecting the one that minimizes the p-value. The final list contains only those genes that are present in at least 10% of the iterations. This step can be skipped if you have already selected a set of informative genes.

The input parameters for the function are the following:

  • seData: SummarizedExperiment object with the normalized expression data and the phenotypic data in colData.
  • groupsVector: A binary vector indicating the group to which each sample is assigned.
  • FDRfilter: A numeric value indicating the FDR threshold for selecting significant genes. The default is FDR = 0.05.
  • iter: The number of iterations for the bootstrapping. Default is 100.
  • percentageFilter: A numeric value indicating the percentage of iterations a gene must appear in to be considered significant. Default is 80.
data(seBRCA)

# Bootstrapped differential expression based on SAM.
# Parameters: FDR = 0.05, iter = 100, percentage filter = 80 %
# CAUTION: if the data have a high number of genes this function
# will take several minutes to compute.

groupsVector <- colData(seBRCA)$ER.IHC

set.seed(5)
DE_list_genes <- prefilterSAM(seBRCA, groupsVector)

The output of the DE_list_genes function is a list with the genes obtained from the differential expression analysis ordered by the p-value (from lowest to highest p-value).

The following code illustrates the use of the genePheno() function to discover genes that are correlated with the Estrogen Receptor (ER+, ESR1) measured by the clinicians (by the pathologists) as a phenotypic variable of breast cancer tumors (i.e., to discover genes correlated with the level of ESR1 protein detected by immunohistochemistry, IHC, in the BRCC tumors).

The input parameters are the following:

  • seBRCA: SummarizedExperiment object with the normalized expression data and the phenotypic data in colData.
  • DEgenes: Vector containing the genes to be used. Expected to be in the same format as the rows of the assay(seBRCA). Usually this vector is the result of running prefilterSAM() function.
  • vectorGroups: Clinical or phenotypic variable tested. It must be provided as a binary vector. In this particular study vectorGroups is mPheno$ER.IHC.
  • vectorSampleID: Vector containing the sample names ordered as in the assay(seBRCA).
  • iter: Number of bootstrap iterations (default: 100). This can be changed if the function takes too long to run.
  • numberOfFolds: Number of folds to implement nested cross-validation (default: 10).
vectorSampleID <- as.character(rownames(colData(seBRCA)))
vectorGroups <- as.numeric(colData(seBRCA)$ER.IHC)
Pred_ER.IHC <- genePheno(seBRCA, DE_list_genes, vectorGroups, vectorSampleID)

# Pred_ER.IHC is an output object with the list of genes that
# show a significant correlation with the clinical variable.
# Since a bootstrap is performed, the results of how many times
# across iterations a gene is found significant are reported as
# *stability* (in relative numbers 0-1, 1=100%) and the *beta
# values* from the regression across iterations are also
# provided as *betaMedian* and *betaMean* :

Pred_ER.IHC$stability
Pred_ER.IHC$betasMedian
Pred_ER.IHC$betasMean
Pred_ER.IHC$betasTable
Gene stability betasMedian betasMean
NAT1 0.90 0.13152533 0.13359943
ESR1 0.84 0.16505283 0.16619666
USP6NL 0.73 -0.30479925 -0.35228853
SUSD3 0.68 0.07534488 0.08846058
PREX1 0.67 0.10376235 0.13750165
TPBG 0.60 0.22454524 0.25966395
CA12 0.56 0.08063623 0.08739032
AGR3 0.55 0.04043867 0.04232274
PROX1 0.54 -0.38767094 -0.43736144
CALML5 0.53 -0.03846191 -0.04940331

Table 1: List of genes by Stability.

The output of the function is an object with the following elements:

names(Pred_ER.IHC)
# [1] 'genes' 'listCoeff' 'stability' 'betasMedian' 'betasMean'
# 'betasTable'
  • genes: A list of genes ranked according to the degree of association with the clinical or phenotypic variable tested.
  • listCoeff: A list with the beta regression coefficients and the AUC score for each bootstrap iteration.
  • stability: Gene selection probability (0 to 1) estimated by bootstrap (the number of times discovered over “n” iterations).
  • betasMedian: Median of the beta coefficients over the B replicates.
  • betasMean: Mean of the beta coefficients over the B replicates.
  • betasTable: Output table with genes ordered by decreasing value of the stability, that includes the genes, the stability, the betasMedian and the betasMean.

The results obtained in Table 1 show that the expression of ESR1 gene is strongly associated with the clinical marker ER (i.e., with the detection by IHC of Estrogen Receptor active in the tumors). This will be expected, but it is a result that supports the value of the developed method presented here. In addition, the AGR3, SUSD3 or CA12 genes have recently been found in the breast cancer literature in relation to the ER marker (Barnett et al. 2008; Yu et al. 2015; Garczyk et al. 2015); and they are also found by our method with a good ranking in the association with ER (i.e., they present high stability). These genes are positively correlated, suggesting that they may play a role or function in a similar way to the clinical marker ER. Therefore, these genes can be proposed as alternative molecular targets in breast cancer, becoming excellent candidates to investigate their value in diagnosis, prognosis, and sensitivity to endocrine therapies.

Identification of gene survival markers & patient risk prediction: patientRisk()

When a preliminary set of candidate genes is proposed, the ASURI algorithm is designed to explore these genes and identify which ones are correlated with survival and risk, and therefore are adequate prognostic markers (i.e., good gene survival markers). In the case of the study presented above, using the genePheno() function, the algorithm explores the data to identify gene markers that correlate with the presence of ER+ as a clinical variable. However, in a more general use with the patientRisk() function, the algorithm evaluates a whole set of genes with respect to the survival profiles and builds a multivariate risk predictor, identifying which genes are able to accurately predict patient risk.

The analysis to find a significant correlation between the expression level of the genes and the survival of the patients is performed by the function patientRisk() (Bueno-Fortes et al. 2023). This function is based on the Univariate Shrinkage COX Model (UNICOX) proposed by (R. Tibshirani 2009). This method shrinks irrelevant coefficients to zero, taking into account a L1L_1 norm penalty term. Features are penalized individually, allowing all the genes correlated with risk to be retained at a given numerical strength. This behavior differs from Lasso models (Robert Tibshirani 1996), where only one representative gene is randomly chosen from the group of correlated genes. The patientRisk() function estimates the regularization parameter using a double nested cross-validation to provide robustness in the feature selection. This cross-validation includes an inner (first) loop to estimate the optimal regularization parameter and an outer (second) loop to estimate the risk.

Once the risk curve is obtained, patientRisk() implements an original method to stratify the patients into two main risk groups. This method is similar to the one used in (Figure @ref(fig:pclassesr1)), since it orders the patients by risk and computes the p-values of the log-rank test for each possible split of the patients in two groups with their corresponding survival. Then, it calculates the interval corresponding to the lower 10th percentile of the p-values. The extreme values for this interval provide the lower and upper thresholds that stratify the patients into low, intermediate and high risk groups (see Figure @ref(fig:plot1risk)).

The optimal threshold for dividing the patients into two groups (low/high) can be obtained by three methods: (1) The first approach estimates the threshold by minimizing the p-value of the log-rank statistic, i.e. by maximizing the separability between the K-M curves for the high and low risk groups. When multiple local minima occur, this may be sample dependent and unstable. (2) To improve the robustness a second method is implemented that computes the threshold as the median of the lower 10th percentile log-rank p-values. The lower 10 percentile selects the smallest values from the p-value distribution that correspond to intermediate risk patients who are on the border between the two groups. This interval is more robust than a single minimum and provides good experimental results for a wide variety of problems tested. The median threshold in this interval may change from one iteration to another because the distribution of p-values for patients with intermediate risk may change due to sample variation. Therefore, (3) a bootstrapping strategy is implemented for these intermediate risk patients, which helps to improve the robustness of the threshold obtained and provides a probability of membership for each risk group.

The following code illustrates how to apply the patientRisk() function to discover/identify significant gene survival markers and to accurately predict patient risk.

# Survival times should be provided in YEARS
time <- "time"
status <- "status"
# Pred_ER.IHC$genes is the subset of genes to be tested. In our
# case study, it is the list of genes related to the ER clinical
# variable that was obtained using the function **genePheno()**.
geneList <- names(Pred_ER.IHC$genes)
# Training of the multivariate COX model. Provide the
# SummarizedExperiment object with the normalized expression
# data and the phenotypic data in colData, the time and the
# status vectors, and the method to stratify the patients
# (select one of these methods: `min.pval`, `med.pval`,
# `class.probs`).
set.seed(5)
multivariate_risk_predictor <- patientRisk(seBRCA, geneList, time,
    status, method = "class.probs")

The input parameters of the patientRisk() function are the following:

  • mExpr A numeric matrix of gene expression data where each row represents a gene and each column represents a sample.
  • time A numeric vector representing the survival time for each sample, names(time) should be defined as the sample names.
  • status A numeric vector representing the event status (1 = event, 0 = censored) for each sample, names(status) should be defined as the sample names.
  • group.vector Optional numeric vector specifying predefined risk groups for the patients.
  • method A string that determines the method considered to stratified the patients in two risk groups according to the risk score for the lower 10th percentile of the p-values: (1) min.pval Defines risk groups based on the minimum p-value. (2) med.pval Defines risk groups based on the median p-value. This approach can be more robust when several local minima are found. Finally, (3) class.probs Defines risk groups based on the classification probabilities from the model. It provides also a membership probability to each risk group. class.probs is selected by default.
  • nboot An integer specifying the number of bootstrap iterations for risk score calculation. Default is 50.
  • cut_time A numeric value specifying the cutoff time (in years) for survival analysis. All events beyond this time are treated as censored (by default = 10 years).

The patientRisk() function returns a list containing the following objects:

  • cv_risk_score: Risk score prediction for the training set using a double nested crossvalidated strategy.
  • cv_normalized_risk: Normalized risk score in the interval (0, 100).
  • table_genes_selected: Data frame with the following columns: (i) the names for the genes selected by the COX regression; (ii) the beta coefficients for the optimal multivariate COX regression fitted to the training set; (iii) the Hazard Ratio for each gene; and (iv) the p-value for the univariate log-rank statistical test. Genes are shown by descending order of the HR index.
  • table_genes_selected_extended: Table with the same format as table_genes_selected. A search for local minima within a 5% range of the selected minimum is performed. The goal is expanding the list of significant genes to improve biological interpretability, since the Lasso penalty drastically reduces the number of significant genes.
  • model.optimalLambda: Object that contains the parameters for the optimal multivariate COX regression.
  • groups: Vector of classification of patients in two risk groups: HighRisk (2) or LowRisk (1).
  • riskThresholds: Thresholds that allows to stratify the test patients in three groups according to the predicted risk score: low, intermediate and high risk.
  • range.risk: Range of the unscaled risk score in the training set.
  • listmodels: List of models tested for different values of the regularization parameter λ\lambda.
  • evaluation.models: Dataframe that provides several metrics for each model evaluated. lambda is the regularization parameter for the multivariate COX regression adjusted; number_features gives the number of genes selected by this model; c.index and se.c.index are the concordance index and the standard deviation for the risk prediction; and p_value_c.index and logrank_p_value give the p-values for the the concordance index and the log-rank statistics respectively. The models are shown by ascending order of p-values of the log-rank test, and the best one is marked with two asterisks. All the metrics have been estimated using a double nested cross-validation strategy.
  • betasplot: Dataset used to create the plot of genes ranked according to the regression coefficients in the multivariate COX model (UNICOX).
  • plot_values: A list containing Kaplan-Meier fit results, logrank p-value, and hazard ratio.
  • membership_prob: If method class.probs is selected a table with two columns is returned. The first one is the probability of classification to the low risk group while the second one is the membership probability to the high risk group.

The function patientRisk() generates 5 plots:

Curve of log-rank p-values for each possible splitting of patients into 2 risk groups. The X axis represents the patient number ranked from low to high risk. The green line marks the selected cutpoint, i.e., the point that provides the threshold for stratifying patients into 2 risk groups: low risk, in blue; high risk, in red. The blue & red lines define thresholds for classifying patients into 3 risk groups: *low*, *intermediate* and *high*. These lines mark samples that fall in an *intermediate* region, where risk assignment is not very precise and samples may fluctuate in position.

Curve of log-rank p-values for each possible splitting of patients into 2 risk groups. The X axis represents the patient number ranked from low to high risk. The green line marks the selected cutpoint, i.e., the point that provides the threshold for stratifying patients into 2 risk groups: low risk, in blue; high risk, in red. The blue & red lines define thresholds for classifying patients into 3 risk groups: low, intermediate and high. These lines mark samples that fall in an intermediate region, where risk assignment is not very precise and samples may fluctuate in position.

Risk Score, between 0 and 100, calculated for each patient, i.e., for each of the 200 patient tumor samples included in this study, obtained with a multivariate UNICOX regression model. The model includes as features the subset of selected genes included in the *geneList*, which are tested to find out their correlation with risk. The green and blue & red lines provide the thresholds estimated by the **patientRisk()** function to split the patients respectively in 2 or 3 groups, as described in the legend of the previous figure. Note that the implemented algorithm is capable of detecting changes in the slope of the risk curve, giving rise to risk groups of differenciated prognosis.

Risk Score, between 0 and 100, calculated for each patient, i.e., for each of the 200 patient tumor samples included in this study, obtained with a multivariate UNICOX regression model. The model includes as features the subset of selected genes included in the geneList, which are tested to find out their correlation with risk. The green and blue & red lines provide the thresholds estimated by the patientRisk() function to split the patients respectively in 2 or 3 groups, as described in the legend of the previous figure. Note that the implemented algorithm is capable of detecting changes in the slope of the risk curve, giving rise to risk groups of differenciated prognosis.

Log-rank p-values of the minimum optimal versus the number of genes selected by the model. As the regularization parameter in UNICOX increases, the regression model becomes more especific in the selection of features and, therefore the number of genes that have non-zero (non-null) coefficients is reduced.

Log-rank p-values of the minimum optimal versus the number of genes selected by the model. As the regularization parameter in UNICOX increases, the regression model becomes more especific in the selection of features and, therefore the number of genes that have non-zero (non-null) coefficients is reduced.

The patientRisk() function offers multiple values for the regularization parameter, the user can choose the value that minimizes the log-rank p-value, and then re-evaluate everything with that new gene list. For our case study the optimal value is achieved with 12 genes. Biotechnologists, physicians, and researchers in general often prefer to have a smaller set of selected genes (selected features) as risk markers, at the expense of a slight increase in p-value. Taken together, this representation allows us to choose the optimal number of genes that achieve a balance between the lowest (most significant) log-rank p-value and the fewest number of selected genes. Once the subset of genes is selected, the corresponding COX model can be retrieved from the list of objects returned by the patientRisk() function. Finally, the risk score for new patients (placed as an independent query) can be estimated considering this constructed model as a predictor (this is further explained in the next section). Note that the patientRisk() function can be applied to any subset of genes provided by the user (related or unrelated to a clinical variable). Thus, this function can be applied to study the association of each gene with risk and to select the best gene risk markers (i.e., providing a gene signature of risk markers). In this study, considering the selected gene signature, the patientRisk() function provides an accurate risk prediction for each patient tumor sample and also the stratification of patients in low, intermediate and high risk.

Figure @ref(fig:plot4risk) presents the genes selected by the multivariate COX model, ranked by the value of the regression coefficients in the model. Genes with “positive” values (in red) increase the hazard ratio (HR) and, therefore, increase the risk in the samples (so they can be associated with an oncogenic effect). While genes with “negative” values (in blue) reduce the HR and the risk in the samples (so they can be associated with a tumor suppression or a protective effect). The absolute value of the coefficients determines the relevance of each gene with respect to the patient’s risk.

Plot of the genes selected by the multivariate COX model, ranked by the value of the regression coefficients in the model. Genes with *positive* values (red) increase the risk in the samples. While genes with *negative* values (blue) reduce the risk in the samples. Genes with significant p-values are marked with asterisks (** p-value<0.01, * p-value<0.05). Similarly, genes with slightly significance level are marked with a dot (· between 0.1 and 0.05). All the analyses carried out in this part are multivariate and take into account additive interactions among the genes.

Plot of the genes selected by the multivariate COX model, ranked by the value of the regression coefficients in the model. Genes with positive values (red) increase the risk in the samples. While genes with negative values (blue) reduce the risk in the samples. Genes with significant p-values are marked with asterisks (** p-value<0.01, * p-value<0.05). Similarly, genes with slightly significance level are marked with a dot (· between 0.1 and 0.05). All the analyses carried out in this part are multivariate and take into account additive interactions among the genes.

Gene Beta Value Beta Group P-Value Signif
AKR1B10 0.32736318 type1 0.004322987 **
EVL -0.32328558 type2 0.010718153 *
SUSD3 -0.26224518 type2 0.003719841 **
SLC39A6 -0.21327395 type2 0.016501657 *
ADIRF -0.20193189 type2 0.022008641 *
GFRA1 -0.14895182 type2 0.036662667 *
SOX11 0.14360541 type1 0.053214622 ·
CLIC6 -0.12819465 type2 0.057410691 ·
CPB1 -0.11722549 type2 0.002609143 **
PGR -0.10142232 type2 0.199169763
NPY1R -0.09605778 type2 0.027526344 *
CALML5 0.07877817 type1 0.147320701
SERPINA5 -0.07173386 type2 0.280007397
NAT1 -0.06950291 type2 0.253075375
SCGB1D2 -0.06748915 type2 0.114389857
AGR3 -0.02526576 type2 0.548559368

Table 2: List of genes by relevance for risk assignement.

Kaplan Meier plot: Blue (low risk) and red (high risk) of the two groups of samples with different survival obtained from the stratification of the multivariate risk curve (Figure 6).

Kaplan Meier plot: Blue (low risk) and red (high risk) of the two groups of samples with different survival obtained from the stratification of the multivariate risk curve (Figure 6).

Risk Prediction for an independent set of patients: predict_PatientRisk()

A relevant goal for medical doctors is to predict the risk of cancer patients considering the level of certain molecular markers. This prediction is implemented by function predict_PatientRisk() using a multivariate model that takes into account the cooperative work between the genes to carry out a biological function. Once the multivariate risk predictor is trained by the patientRisk() function (see previous section), the predict_PatientRisk() function may be applied to predict the risk of new patients considering the gene expression of a small subset of genes associated to risk. Moreover, the patients can be classified into low, intermediate or high risk groups considering the value of the risk score.

The following lines show how to predict the risk score for an independent set of patients. The function plots the risk score and stratifies the patients into three groups.

# Generate the validation set, mExprs_testData if necessary.
# Vector of genes (same ones used in Cox model training)

# Simulate expression data
num_samples <- 20
set.seed(5)
mExprs_testData <- matrix(rnorm(length(geneList) * num_samples, mean = 10,
    sd = 3), nrow = length(geneList), ncol = num_samples)

# Assign row names (genes) and column names (samples)
rownames(mExprs_testData) <- geneList
colnames(mExprs_testData) <- paste0("Sample", 1:num_samples)

set.seed(5)
risk_prediction_validation_set <- predict_PatientRisk(multivariate_risk_predictor,
    mExprs_testData)

The input parameters of the function predict_PatientRisk() are:

  • model.fit A list containing the pre-fitted model and necessary parameters for risk prediction, including the optimal lambda value, risk thresholds, and plot values given by patientRisk() function.
  • mExpr.testData A data frame for an independent set of patients representing the gene expression data, where each row is a gene and each column is a sample. The genes must be the same subset of genes (as rows) considered in the training of the COX model using patientRisk function. colnames(mExpr) corresponds to the sample names and rownames(mExpr) to the gene names.

The output of the function predict_PatientRisk() is a risk plot and a list with two objects:

  • risk_score: The unscaled risk score provided by the multivariate COX regression.
  • scaled_risk_score: The scaled risk score (0-100) for the independent set of patients.
This function provides a plot with the predicted risk score using the multivariate COX model trained by function patientRisk().
Risk score prediction for an independent set of patients using the optimal multivariate COX model trained by function patientRisk(). Green and blue/red lines correspond to the optimal thresholds that split the patients in two and three groups respectively. These thresholds were learned in training phase.

Risk score prediction for an independent set of patients using the optimal multivariate COX model trained by function patientRisk(). Green and blue/red lines correspond to the optimal thresholds that split the patients in two and three groups respectively. These thresholds were learned in training phase.

This function can be applied also to predict the risk of a single patient given the gene expression profile. The mExprs_testData will contain now a column with the expression level for the list of genes selected.

# Example for single patient prediction: Patient fourth is
# selected.
mExprs_testSingleData <- data.frame(mExprs_testData[, 4])
colnames(mExprs_testSingleData) <- colnames(mExprs_testData)[4]
# Risk prediction for the optimal subset of genes selected by
# patientRisk function
set.seed(5)
risk_prediction_one_patient <- predict_PatientRisk(multivariate_risk_predictor,
    mExprs_testSingleData)
# Normalized patient Risk (0 100): 27.9017675117161 The patient
# is classified as Low Risk Low Risk interval: (0,
# 37.0600635839135)

When a single patient is considered, the function will print the normalized risk for the test patient and the classification into three possible groups: Low, Intermediate or High. Treatment for the intermediate group is uncertain and should be studied with care.

Survival estimation from the COX regression for a test patient: predict_SurvCurve()

The function predict_SurvCurve() will obtain the expected survival curve for a test patient estimated from the risk score predicted by a multivariate COX regression model and considering the Breslow estimator for the baseline Hazard. The function provides a plot for the expected survival curve.

The input parameters are:

  • cox_pred_training A numeric vector representing the predicted risk scores (or log-risk scores) from the COX model for the training set. It can be obtained from predict_PatientRisk() function.
  • mSurv A data frame with two columns: “time” representing survival times, and “status” representing the event status (1 for event, 0 for censored).
  • cox_pred_test A numeric vector of length 1 representing the predicted risk score (or log-risk score) for the test patient. It can be obtained from predict_PatientRisk() function.
  • eval_surv_times A numeric vector of times at which the survival curve should be evaluated. If NULL (default), the times will be taken from the training data up to the maximum event time.
data(seBRCA)
# COX prediction for the training set
mPheno <- colData(seBRCA)
mExprs <- assay(seBRCA)

set.seed(5)
mExprSelectedGenes <- mExprs[rownames(mExprs) %in% geneList, ]

cox_pred_training <- predict_PatientRisk(multivariate_risk_predictor,
    mExprSelectedGenes)
cox_pred_training$risk_score
# COX prediction for the test patient
set.seed(5)
cox_pred_test <- predict_PatientRisk(multivariate_risk_predictor,
    mExprs_testSingleData)
cox_pred_test$risk_score
# Survival curve estimation
eval_surv_times <- seq(0, max(mPheno$time), by = 0.1)
set.seed(5)
surv_curv_cox <- predict_SurvCurve(cox_pred_training$risk_score, mPheno[,
    c(2, 3)], cox_pred_test$risk_score, eval_surv_times)

The output for this function is a list with three objects:

  • eval_times: Vector with the times at which the survival curve has been evaluated.
  • S_0_t: Vector with the baseline survival probability estimated at each eval_times.
  • S_test_patient: Vector with the survival probability at eval_times for the test patient.
It provides also a plot with the expected survival curve estimation for the test patient.
Survival curve estimation for a single patient based on the risk score predicted by a multivariate COX model. The blue line shows the median survival time at which the survival probability drops to 0.5.

Survival curve estimation for a single patient based on the risk score predicted by a multivariate COX model. The blue line shows the median survival time at which the survival probability drops to 0.5.

ASURI is related to several packages available in CRAN or Bioconductor: Biospear (Ternès et al., 2018), Biomarker Selection in Penalized Regression Models, CRAN; which provides a framework for biomarker selection based on multivariate penalized Cox regression models; CPSM (Kaur H. et al., 2024), Cancer Patient Survival Model, Bioconductor; which enables us to identify a set of biomarkers using univariate or multivariate Lasso Cox regression and to predict the survival probability of individual patients; and RTNsurvival (Groeneveld C. S. et al., 2019), Regulatory Network Survival Analysis, Bioconductor; that focuses on multivariate survival analysis of regulons obtained from transcriptional regulatory networks. However, ASURI has several advantages over these and other packages: First, it incorporates a pipeline to obtain survival genes that can be related to the standard clinical features and pathophysiological markers considered by oncologists. Second, the stability and robustness of the markers selected as well as the risk prediction is significantly improved by a boostrapping and resampling strategy. Finally, ASURI implements advanced methods to stratify the patients into risk groups and to perform a fuzzy classification of intermediate-risk patients.

Session Info

devtools::session_info()

References

Aguirre-Gamboa, R, H Gomez-Rueda, E Martínez-Ledesma, A Martínez-Torteya, R Chacolla-Huaringa, A Rodriguez-Barrientos, JG Tamez-Peña, and V. Treviño. 2013. “SurvExpress: An Online Biomarker Validation Tool and Database for Cancer Gene Expression Data Using Survival Analysis.” PLoS One 8 (9): e74250. https://doi.org/10.1371/journal.pone.0074250.
Barnett, DH, S Sheng, TH Charn, A Waheed, WS Sly, CY Lin, ET Liu, and BS Katzenellenbogen. 2008. “Estrogen Receptor Regulation of Carbonic Anhydrase XII Through a Distal Enhancer in Breast Cancer.” Cancer Research 68 (9): 3505–15. https://doi.org/10.1158/0008-5472.CAN-07-6151.
Bueno-Fortes, S, A Berral-Gonzalez, JM Sanchez-Santos, M Martin-Merino, and J De Las Rivas. 2023. “Identification of a Gene Expression Signature Associated with Breast Cancer Survival and Risk That Improves Clinical Genomic Platforms.” Bioinformatics Advances 3 (1): vbad037. https://doi.org/10.1093/bioadv/vbad037.
Friedman, JH, T Hastie, and R Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. http://www.jstatsoft.org/v33/i01.
Garczyk, S, S VonStillfried, W Antonopoulos, A Hartmann, MG Schrauder, PA Fasching, T Anzeneder, et al. 2015. “AGR3 in Breast Cancer: Prognostic Impact and Suitable Serum-Based Biomarker for Early Cancer Detection.” PloS One 10 (4): e0122106. https://doi.org/10.1371/journal.pone.0122106.
Gautier, L, L Cope, BM Bolstad, and RA Irizarry. 2004. “Affy—Analysis of Affymetrix GeneChip Data at the Probe Level.” Bioinformatics 20 (3): 307–15. https://doi.org/10.1093/bioinformatics/btg405.
Kassambara, A, M Kosinski, P Biecek, and S Fabian. 2024. “Survminer: Drawing Survival Curves Using Ggplot2 (r Package Version 0.5.0).” CRAN-R. https://doi.org/10.32614/CRAN.package.survminer.
Kupfer, P, R Guthke, D Pohlers, R Huber, D Koczan, and RW Kinne. 2012. Batch correction of microarray data substantially improves the identification of genes differentially expressed in rheumatoid arthritis and osteoarthritis.” BMC Medical Genomics 5: 23. https://doi.org/10.1186/1755-8794-5-23.
Li, J, and R Tibshirani. 2013. “Finding Consistent Patterns: A Nonparametric Approach for Identifying Differential Expression in RNA-Seq Data.” Statistical Methods in Medical Research 22 (5): 519–36. https://doi.org/10.1177/0962280211428386.
McCall, MN, BM Bolstad, and RA Irizarry. 2010. “Frozen Robust Multiarray Analysis (fRMA).” Biostatistics 11 (2): 242–53. https://doi.org/10.1093/biostatistics/kxp059.
Ritchie, ME, B Phipson, D Wu, Y Hu, CW Law, W Shi, and GK Smyth. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (7): e47–47. https://doi.org/10.1093/nar/gkv007.
Scarfo, I, E Pellegrino, E Mereu, I Kwee, L Agnelli, E Bergaggio, G Garaffo, et al. 2016. “Identification of a New Subclass of ALK-Negative ALCL Expressing Aberrant Levels of ERBB4 Transcripts.” Blood 127 (2): 221–32. https://doi.org/10.1182/blood-2014-12-614503.
Simon, N, J Friedman, T Hastie, and R Tibshirani. 2011. “Regularization Paths for Cox’s Proportional Hazards Model via Coordinate Descent.” Journal of Statistical Software 39 (5): 1. https://doi.org/10.18637/jss.v039.i05.
Therneau, TM. 2021. “A Package for Survival Analysis in r.” CRAN-R. https://CRAN.R-project.org/package=survival.
Tibshirani, R. 2009. “Univariate Shrinkage in the Cox Model for High Dimensional Data.” Statistical Applications in Genetics and Molecular Biology 8 (1). https://doi.org/10.2202/1544-6115.1438.
Tibshirani, Robert. 1996. “Regression Shrinkage and Selection via the Lasso.” Journal of the Royal Statistical Society: Series B (Methodological) 58 (1): 267–88. http://www.jstor.org/stable/2346178.
Tusher, VG, R Tibshirani, and G Chu. 2001. “Significance Analysis of Microarrays Applied to the Ionizing Radiation Response.” Proceedings of the National Academy of Sciences 98 (9): 5116–21. https://doi.org/10.1073/pnas.091062498.
Yu, Z, E Jiang, X Wang, Y Shi, AJ Shangguan, L Zhang, and J Li. 2015. “Sushi Domain-Containing Protein 3: A Potential Target for Breast Cancer.” Cell Biochemistry and Biophysics 72 (2): 321–24. https://doi.org/10.1007/s12013-014-0480-9.