PathwayForte

Python package for pathway database benchmarking.

A Python package for benchmarking pathway databases on functional enrichment and prediction tasks.

Command Line Interface

PathwayForte commands.

pathway_forte

Run PathwayForte.

pathway_forte [OPTIONS] COMMAND [ARGS]...

datasets

List the available cancer datasets.

pathway_forte datasets [OPTIONS]

export

Generate gene set files using ComPath.

pathway_forte export [OPTIONS]

fcs

List of FCS Analyses.

pathway_forte fcs [OPTIONS] COMMAND [ARGS]...

gsea

Run GSEA on TCGA data.

pathway_forte fcs gsea [OPTIONS]

Options

-d, --data <data>

Required Name of the cancer dataset from TCGA

Options

prad | ov | kirc | brca | lihc

-p, --permutations <permutations>

Number of permutations

Default

100

gsea-msig

Run GSEA on TCGA data using MSigDB gene sets.

pathway_forte fcs gsea-msig [OPTIONS]

Options

-d, --data <data>

Required Name of the cancer dataset from TCGA

Options

prad | ov | kirc | brca | lihc

ssgsea

Run ssGSEA on TCGA data.

pathway_forte fcs ssgsea [OPTIONS]

Options

-d, --data <data>

Required Name of the cancer dataset from TCGA

Options

prad | ov | kirc | brca | lihc

ora

Perform ORA analysis.

pathway_forte ora [OPTIONS] COMMAND [ARGS]...

hypergeometric

Perform one-tailed hypergeometric test enrichment.

pathway_forte ora hypergeometric [OPTIONS]

Options

-d, --genesets <genesets>

Required Path to GMT file

-s, --fold-changes <fold_changes>

Required Path to fold changes file

--no-threshold

Do not apply threshold

-o, --output <output>

Optional path for output JSON file

prediction

List of Prediction Methods.

pathway_forte prediction [OPTIONS] COMMAND [ARGS]...

binary

Train elastic net for binary prediction.

pathway_forte prediction binary [OPTIONS]

Options

-d, --data <data>

Required Name of the cancer dataset from TCGA

Options

prad | ov | kirc | brca | lihc

--outer-cv <outer_cv>

Number of splits in outer cross-validation

Default

10

--inner-cv <inner_cv>

Number of splits in inner cross-validation

Default

10

-i, --max_iterations <max_iterations>

Maximum number of iterations to converge

Default

1000

--turn-off-warnings

Turns off warnings

subtype

Train subtype analysis.

pathway_forte prediction subtype [OPTIONS]

Options

-d, --ssgsea <ssgsea>

Required Path to ssGSEA file

-s, --subtypes <subtypes>

Required Path to the subtypes file

--outer-cv <outer_cv>

Number of splits in outer cross-validation

Default

10

--inner-cv <inner_cv>

Number of splits in inner cross-validation

Default

10

--chain-pca

--explained-variance <explained_variance>

Explained variance

Default

0.95

--turn-off-warnings

Turns off warnings

survival

Train survival model.

pathway_forte prediction survival [OPTIONS]

Options

-d, --data <data>

Required Name of dataset

--outer-cv <outer_cv>

Number of splits in outer cross-validation

Default

10

--inner-cv <inner_cv>

Number of splits in inner cross-validation

Default

10

--turn-off-warnings

Turns off warnings

test-stability-prediction

Test stability of prediction.

pathway_forte prediction test-stability-prediction [OPTIONS]

Options

-s, --ssgsea-scores-path <ssgsea_scores_path>

Required ssGSEA scores file

-p, --phenotypes-path <phenotypes_path>

Required Path to the phenotypes file

--outer-cv <outer_cv>

Number of splits in outer cross-validation

Default

10

--inner-cv <inner_cv>

Number of splits in inner cross-validation

Default

10

-i, --max_iterations <max_iterations>

Maximum number of iterations to converge

Default

1000

--turn-off-warnings

Turns off warnings
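As a minimal sketch of how the commands and options listed above combine, the snippet below assembles three command lines as argument lists and prints them in a shell-safe form. It only builds strings and does not run PathwayForte; the chosen dataset (brca) and option values are illustrative picks from the documented choices and defaults.

```python
import shlex

# Each command is an argv list built only from options documented above.
commands = [
    ["pathway_forte", "datasets"],
    ["pathway_forte", "fcs", "gsea", "--data", "brca", "--permutations", "100"],
    ["pathway_forte", "prediction", "binary", "--data", "brca",
     "--outer-cv", "10", "--inner-cv", "10"],
]

# shlex.join quotes every argument so the line can be pasted into a shell.
command_lines = [shlex.join(argv) for argv in commands]

for line in command_lines:
    print(line)
```

Keeping each command as a list of arguments also makes it straightforward to hand the same list to subprocess.run if the commands should be launched from a script.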

Pipeline

Pipelines from Pathway Forte.

Constants

This module contains all the constants used in the PathwayForte repo.

pathway_forte.constants.BIO2BEL_DATA_DIR = '/home/docs/.bio2bel/pathwayforte'

Cancer Data Sets

pathway_forte.constants.make_classifier_results_directory()[source]

Ensure that the result folder exists.

pathway_forte.constants.MSIG_GSEA = '/home/docs/checkouts/readthedocs.org/user_builds/pathwayforte/checkouts/latest/data/results/gsea/msig'

Output files with results for GSEA

pathway_forte.constants.make_gsea_export_directories()[source]

Ensure that GSEA export directories exist.
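The "ensure that directories exist" helpers can be pictured with the following sketch. It is not the package's implementation: the function name ensure_directories and the subdirectory names are assumptions made for illustration, and a temporary directory stands in for the real results folder.

```python
from pathlib import Path
import tempfile


def ensure_directories(base_dir, names):
    """Create base_dir/<name> for every name and return the resulting paths."""
    paths = [Path(base_dir) / name for name in names]
    for path in paths:
        # parents=True creates missing ancestors; exist_ok=True makes reruns safe.
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Hypothetical layout: one results folder per gene set source.
subdirectories = ["kegg", "reactome", "wikipathways", "msig"]

with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp) / "results" / "gsea"
    created = ensure_directories(base, subdirectories)
    # Calling the helper a second time must not raise.
    ensure_directories(base, subdirectories)
    all_exist = all(path.is_dir() for path in created)

print(all_exist)
```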

pathway_forte.constants.MSIG_SSGSEA = '/home/docs/checkouts/readthedocs.org/user_builds/pathwayforte/checkouts/latest/data/results/ssgsea/msig'

Pickles with results for ssGSEA

pathway_forte.constants.make_ssgsea_export_directories()[source]

Ensure that ssGSEA export directories exist.

pathway_forte.constants.check_gmt_files()[source]

Check that the GMT files exist and return them as constant variables.

pathway_forte.constants.GENESET_COLUMN_NAMES = {'kegg': 'KEGG Geneset', 'reactome': 'Reactome Geneset', 'wikipathways': 'WikiPathways Geneset'}

Columns to read to perform ORA analysis.

Over Representation Methods

This module contains the functions to run Over Representation Analysis (ORA).

pathway_forte.pathway_enrichment.over_representation.read_fold_change_df(path)[source]

Read csv with gene names, fold changes and their p-values.

Return type

DataFrame

pathway_forte.pathway_enrichment.over_representation.filter_p_value(df, p_value=True, cutoff=0.05)[source]

Return significantly differentially expressed genes in fold change df.
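The reading and filtering steps can be sketched without pandas as follows. The column names (gene, fold_change, p_value), the strict "below cutoff" comparison, and the toy values are assumptions for illustration only; the actual file layout expected by read_fold_change_df is not documented here.

```python
import csv
import io


def read_fold_changes(handle):
    """Parse a CSV with hypothetical columns gene, fold_change and p_value."""
    return [
        {
            "gene": row["gene"],
            "fold_change": float(row["fold_change"]),
            "p_value": float(row["p_value"]),
        }
        for row in csv.DictReader(handle)
    ]


def filter_significant(rows, cutoff=0.05):
    """Keep rows whose p-value is below the cutoff (assumed comparison)."""
    return [row for row in rows if row["p_value"] < cutoff]


# Toy input with made-up values, used only to exercise the functions.
toy_csv = io.StringIO(
    "gene,fold_change,p_value\n"
    "GENE_A,2.1,0.001\n"
    "GENE_B,-1.4,0.2\n"
    "GENE_C,1.8,0.04\n"
)

rows = read_fold_changes(toy_csv)
significant_genes = [row["gene"] for row in filter_significant(rows)]
print(significant_genes)
```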

pathway_forte.pathway_enrichment.over_representation.perform_hypergeometric_test(genes_to_test, pathway_dict, gene_universe=41714, apply_threshold=False, threshold=0.05)[source]

Perform hypergeometric tests.

Parameters
  • genes_to_test (Set[str]) – gene set to test against pathway

  • pathway_dict (Mapping[str, Set[str]]) – pathway name to gene set

  • gene_universe (int) – number of HGNC symbols

  • apply_threshold (bool) – return only significant pathways

  • threshold (float) – significance threshold (by default 0.05)

Return type

DataFrame
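A one-tailed hypergeometric test asks how likely it is to see at least the observed overlap between the query genes and a pathway if the query were drawn at random from the gene universe. The sketch below computes that tail probability with the standard library only. The function names are hypothetical, it returns a plain dictionary rather than a DataFrame, and it assumes every gene belongs to the universe.

```python
from math import comb


def hypergeometric_p_value(overlap, pathway_size, query_size, universe_size):
    """Return P(X >= overlap) for X ~ Hypergeometric(universe, pathway, query)."""
    upper = min(pathway_size, query_size)
    total = comb(universe_size, query_size)
    tail = sum(
        comb(pathway_size, i) * comb(universe_size - pathway_size, query_size - i)
        for i in range(overlap, upper + 1)
    )
    # Integer true division stays accurate even when both counts are huge.
    return tail / total


def score_pathways(genes_to_test, pathway_dict, gene_universe=41714):
    """Map each pathway name to its one-tailed enrichment p-value."""
    return {
        name: hypergeometric_p_value(
            len(genes_to_test & genes), len(genes), len(genes_to_test), gene_universe
        )
        for name, genes in pathway_dict.items()
    }


# Toy example with a universe of 10 genes.
query = {"A", "B", "C"}
pathways = {"pathway_1": {"A", "B", "X", "Y"}, "pathway_2": {"X", "Y"}}
p_values = score_pathways(query, pathways, gene_universe=10)
print(p_values)
```

In the toy example, pathway_1 shares two of its four genes with the three query genes, giving (6*6 + 4*1) / 120 = 1/3, while pathway_2 has no overlap and therefore a p-value of 1.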

Functional Class Scoring

This module contains the functional class scoring methods implemented in PathwayForte.

Currently this includes GSEA and ssGSEA.

pathway_forte.pathway_enrichment.functional_class.create_cls_file(gene_expression_file, normal_sample_file, tumor_sample_file, data)[source]

Create a categorical (e.g., tumor vs. normal) class file (i.e., .cls) for input into GSEA.

Parameters
  • gene_expression_file (str) – Text file containing expression values for each gene from each sample.

  • normal_sample_file (str) –

  • tumor_sample_file (str) –

  • data
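For orientation, a categorical .cls file has three lines: the number of samples, the number of classes and the constant 1; a line starting with # that names the classes; and one label per sample in the same order as the expression columns. The sketch below builds that text from a list of labels. The function name build_cls is hypothetical and this is not how create_cls_file is implemented.

```python
def build_cls(labels):
    """Return the contents of a categorical .cls file for the given sample labels."""
    # dict.fromkeys keeps the first-appearance order of the class names.
    classes = list(dict.fromkeys(labels))
    header = f"{len(labels)} {len(classes)} 1"
    class_line = "# " + " ".join(classes)
    label_line = " ".join(labels)
    return "\n".join([header, class_line, label_line]) + "\n"


cls_text = build_cls(["Normal", "Normal", "Tumor", "Tumor", "Tumor"])
print(cls_text)
```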

pathway_forte.pathway_enrichment.functional_class.run_gsea(gene_exp, gene_set, phenotype_class, permutations=500, output_dir='/home/docs/checkouts/readthedocs.org/user_builds/pathwayforte/checkouts/latest/data/results/gsea')[source]

Run GSEA on a given dataset with a given gene set.

Parameters
  • gene_exp (str) – file with gene expression data

  • gene_set (str) – gmt files containing pathway gene sets

  • phenotype_class (str) – cls file containing information on class labels

  • permutations (int) – number of permutations

  • output_dir (str) – output directory

Returns

pathway_forte.pathway_enrichment.functional_class.filter_gsea_results(gsea_results_path, source, kegg_manager=None, reactome_manager=None, wikipathways_manager=None, p_value=None, absolute_nes_filter=None, geneset_set_filter_minimum_size=None, geneset_set_filter_maximum_size=None)[source]

Get top and bottom rankings from GSEA results.

Parameters
  • gsea_results_path (str) – path to GSEA results in .tsv file format

  • source

  • kegg_manager (Optional[Manager]) – KEGG manager

  • reactome_manager (Optional[Manager]) – Reactome manager

  • wikipathways_manager (Optional[Manager]) – WikiPathways manager

  • p_value (Optional[float]) – maximum p value allowed

  • absolute_nes_filter (Optional[float]) – filter by magnitude of normalized enrichment scores

  • geneset_set_filter_minimum_size (Optional[int]) – filter to include a minimum number of genes in a gene set

  • geneset_set_filter_maximum_size (Optional[int]) – filter to include a maximum number of genes in a gene set

Return type

DataFrame

Returns

list of pathways ranked as having the highest and lowest significant enrichment scores

pathway_forte.pathway_enrichment.functional_class.merge_statistics(merged_pathways_df, dataset)[source]

Get statistics for pathways included in the merged gene sets DataFrame.

These include the proportion of pathways from each of the other databases and the proportion of pathways deriving from 2 or more primary resources.

Parameters

merged_pathways_df (DataFrame) – DataFrame containing pathways from multiple databases

Returns

statistics of contents in merged dataset

pathway_forte.pathway_enrichment.functional_class.rearrange_df_columns(df)[source]

Rearrange order of columns.

Return type

DataFrame

pathway_forte.pathway_enrichment.functional_class.get_pathway_names(database, pathway_df, kegg_manager=None, reactome_manager=None, wikipathways_manager=None)[source]

Get pathway names from database specific pathway IDs.

Parameters
  • database (str) –

  • pathway_df (DataFrame) –

  • kegg_manager (Optional[Manager]) –

  • reactome_manager (Optional[Manager]) –

  • wikipathways_manager (Optional[Manager]) –

Returns

pathway_forte.pathway_enrichment.functional_class.pathway_names_to_df(filtered_gsea_results_df, all_pathway_ids, source, kegg_manager=None, reactome_manager=None, wikipathways_manager=None)[source]

Get pathway names.

Parameters
  • filtered_gsea_results_df

  • all_pathway_ids – list of pathway IDs

  • source – pathway source (i.e., database name or ‘MPath’)

  • kegg_manager (Optional[Manager]) – KEGG manager

  • reactome_manager (Optional[Manager]) – Reactome manager

  • wikipathways_manager (Optional[Manager]) – WikiPathways manager

Return type

DataFrame

pathway_forte.pathway_enrichment.functional_class.gsea_results_to_filtered_df(dataset, kegg_manager=None, reactome_manager=None, wikipathways_manager=None, p_value=None, absolute_nes_filter=None, geneset_set_filter_minimum_size=None, geneset_set_filter_maximum_size=None)[source]

Get filtered GSEA results dataFrames.

pathway_forte.pathway_enrichment.functional_class.get_pathways_by_resource(pathways, resource)[source]

Return pathways by resource.

Return type

list

pathway_forte.pathway_enrichment.functional_class.get_analogs_comparison_numbers(kegg_reactome_pathway_df, reactome_wikipathways_pathway_df, wikipathways_kegg_pathway_df, *, pathway_column='pathway_id')[source]

Get number of existing versus expected pairwise mappings.

pathway_forte.pathway_enrichment.functional_class.get_pairwise_mapping_numbers(kegg_pathway_df, reactome_pathway_df, wikipathways_pathway_df)[source]

Get number of existing versus expected pairwise mappings.

pathway_forte.pathway_enrichment.functional_class.get_pairwise_mappings(kegg_pathway_df, reactome_pathway_df, wikipathways_pathway_df)[source]

Get pairwise mappings.

pathway_forte.pathway_enrichment.functional_class.compare_database_results(df_1, resource_1, df_2, resource_2, mapping_dict, check_contradiction=False)[source]

Compare pathways in the dataframe from enrichment results to evaluate the concordance in similar pathways.

pathway_forte.pathway_enrichment.functional_class.get_matching_pairs(df_1, resource_1, df_2, resource_2, equivalent_mappings_dict)[source]

Get equivalent pathways and their direction of change.
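One way to picture the pairing of equivalent pathways is the sketch below. It is an illustration under assumptions, not the package's code: enrichment results are reduced to dictionaries from pathway ID to NES, the direction of change is taken to be the sign of the NES, and all identifiers are placeholders.

```python
def direction(nes):
    """Return 1 for positive, -1 for negative and 0 for zero enrichment scores."""
    return (nes > 0) - (nes < 0)


def matching_pairs(nes_1, resource_1, nes_2, resource_2, equivalent_mappings):
    """Pair equivalent pathways present in both results with their directions."""
    pairs = []
    for (source_resource, source_id), targets in equivalent_mappings.items():
        if source_resource != resource_1 or source_id not in nes_1:
            continue
        for target_resource, target_id in targets:
            if target_resource == resource_2 and target_id in nes_2:
                pairs.append((
                    (source_id, direction(nes_1[source_id])),
                    (target_id, direction(nes_2[target_id])),
                ))
    return pairs


# Placeholder identifiers and made-up scores.
kegg_nes = {"k1": 1.8, "k2": -2.0}
reactome_nes = {"r1": 2.1, "r2": 1.2}
mappings = {("kegg", "k1"): [("reactome", "r1")], ("kegg", "k2"): [("reactome", "r2")]}

pairs = matching_pairs(kegg_nes, "kegg", reactome_nes, "reactome", mappings)
contradictions = [pair for pair in pairs if pair[0][1] != pair[1][1]]
print(pairs)
print(contradictions)
```

Pairs whose two directions differ are the candidates that a contradiction check would flag.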

pathway_forte.pathway_enrichment.functional_class.run_ssgsea(filtered_expression_data, gene_set, output_dir='/home/docs/checkouts/readthedocs.org/user_builds/pathwayforte/checkouts/latest/data/results/ssgsea', processes=1, max_size=3000, min_size=15)[source]

Run single sample GSEA (ssGSEA) on filtered gene expression data set.

Parameters
  • filtered_expression_data (DataFrame) – filtered gene expression values for samples

  • gene_set (str) – .gmt file containing gene sets

  • output_dir (str) – output directory

Return type

SingleSampleGSEA

Returns

ssGSEA results in respective directory

pathway_forte.pathway_enrichment.functional_class.filter_gene_exp_data(expression_data, gmt_file)[source]

Filter gene expression data file to include only gene names which are found in the gene set files.

Parameters
  • expression_data (DataFrame) – gene expression values for samples

  • gmt_file (str) – .gmt file containing gene sets

Returns

Filtered gene expression data with genes with no correspondences in gene sets removed

Return type

pandas.core.frame.DataFrame
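The filtering step can be sketched with the standard library as follows. A GMT file is tab-separated, with the gene set name in the first column, a description in the second and the member genes in the remaining columns. The function names are hypothetical, and a dictionary from gene name to expression values stands in for the expression DataFrame.

```python
import io


def read_gmt_genes(handle):
    """Collect every gene that appears in at least one gene set of a GMT file."""
    genes = set()
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        # Columns 0 and 1 hold the set name and description; genes start at column 2.
        genes.update(gene for gene in fields[2:] if gene)
    return genes


def filter_expression(expression, gmt_genes):
    """Keep only the genes that are present in the gene sets."""
    return {gene: values for gene, values in expression.items() if gene in gmt_genes}


# Toy GMT content and expression values, made up for illustration.
toy_gmt = io.StringIO("set_1\tdescription\tA\tB\nset_2\tdescription\tB\tC\n")
toy_expression = {"A": [1.0, 2.0], "C": [0.5, 0.1], "Z": [3.0, 3.5]}

gmt_genes = read_gmt_genes(toy_gmt)
filtered_expression = filter_expression(toy_expression, gmt_genes)
print(sorted(filtered_expression))
```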

Pathway Topology Methods

The topology-based methods implemented in PathwayForte use R wrappers and are located outside the main Python package, in the corresponding R folder at https://github.com/pathwayforte/results/tree/master/R.

Binary Prediction

Prediction of binary classes such as tumor vs. normal patients.

Elastic Net regression with nested cross validation module.

This workflow trains an elastic net model for a binary classification task (e.g., tumor vs. normal patients). The training is conducted using a nested cross-validation approach (the number of cross-validation splits in both loops can be selected). The model can be changed easily, since most of the models in scikit-learn (the machine learning library used by this package) require the same input.

pathway_forte.prediction.binary.ssgsea_nes_to_df(ssgsea_scores_csv, classes_file, removed_random=None)[source]

Create dataFrame of Normalized Enrichment Scores (NES) from ssGSEA of TCGA expression data.

Parameters
  • ssgsea_scores_csv – Text file containing normalized ES for pathways from each sample

  • test_size – Default test size is 0.25

  • removed_random (Optional[int]) – Remove percentage of df

pathway_forte.prediction.binary.get_l1_ratios()[source]

Return a list of values that are used by the elastic net as hyperparameters.

pathway_forte.prediction.binary.train_elastic_net_model(x, y, outer_cv_splits, inner_cv_splits, l1_ratio, model_name, max_iter=None, export=True)[source]

Train elastic net model via a nested cross validation given expression data.

Uses a defined hyperparameter space for l1_ratio.

Parameters
  • x (numpy.array) – 2D matrix of pathway scores and samples

  • y (list) – class labels of samples

  • outer_cv_splits (int) – number of folds for cross validation split in outer loop

  • inner_cv_splits (int) – number of folds for cross validation split in inner loop

  • l1_ratio (List[float]) – list of hyper-parameters for l1 and l2 priors

  • model_name (str) – name of the model

  • max_iter (Optional[int]) – default to 1000 to ensure convergence

  • export (bool) – Export the models using joblib

Return type

Tuple[List[float], List[float]]

Returns

A list of AUC-ROC scores
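The nested cross-validation structure itself can be sketched independently of the model. In the snippet below, each outer fold holds out a test set, and the remaining samples are split again into inner folds that would be used for choosing hyperparameters such as l1_ratio. This is only the splitting logic under simplifying assumptions (contiguous folds, no shuffling or stratification); the helper names are hypothetical and no elastic net is trained.

```python
def k_fold_indices(indices, k):
    """Split indices into k contiguous folds whose sizes differ by at most one."""
    indices = list(indices)
    base, extra = divmod(len(indices), k)
    folds, start = [], 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        folds.append(indices[start:start + size])
        start += size
    return folds


def nested_cv_splits(n_samples, outer_cv_splits, inner_cv_splits):
    """Yield (outer_train, outer_test, inner_splits) for every outer fold."""
    for outer_test in k_fold_indices(range(n_samples), outer_cv_splits):
        held_out = set(outer_test)
        outer_train = [i for i in range(n_samples) if i not in held_out]
        inner_splits = []
        for inner_validation in k_fold_indices(outer_train, inner_cv_splits):
            validation = set(inner_validation)
            inner_train = [i for i in outer_train if i not in validation]
            inner_splits.append((inner_train, inner_validation))
        yield outer_train, outer_test, inner_splits


splits = list(nested_cv_splits(n_samples=12, outer_cv_splits=3, inner_cv_splits=2))
for outer_train, outer_test, inner_splits in splits:
    print(outer_test, [validation for _, validation in inner_splits])
```

The key property is that the outer test samples never appear in any inner split, so the score on the outer fold is not influenced by the hyperparameter search.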

Multi-Class Prediction

Prediction of multi-class labels such as tumor subtypes.

pathway_forte.prediction.multiclass

Survival Prediction

Prediction of survival based on clinical and pathway patient data.

pathway_forte.prediction.survival

Utils

Complementary methods for prediction analysis.

Utilities for prediction.

pathway_forte.prediction.utils.pca_chaining(train, test, n_components)[source]

Chain PCA with logistic regression.

Parameters
  • train (pandas.core.frame.DataFrame) – Training set to apply dimensionality reduction to

  • test (pandas.core.series.Series) – Test set to apply dimensionality reduction to

  • n_components – Amount of variance retained

Return type

Tuple

Returns

array-like, shape (n_samples, n_components)
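When n_components is given as a fraction of variance to retain (0.95 by default on the command line), the number of principal components follows from the cumulative explained variance. The sketch below shows only that selection step for a list of per-component variances; the function name is hypothetical, the "at least" comparison is an assumption, and the PCA itself is not computed here.

```python
def components_for_variance(component_variances, explained_variance=0.95):
    """Return the smallest number of components reaching the requested variance ratio."""
    ordered = sorted(component_variances, reverse=True)
    total = sum(ordered)
    if total <= 0:
        raise ValueError("component variances must sum to a positive value")
    cumulative = 0.0
    for count, variance in enumerate(ordered, start=1):
        cumulative += variance
        if cumulative / total >= explained_variance:
            return count
    # Rounding can leave the ratio marginally below 1.0; fall back to all components.
    return len(ordered)


# Made-up variances: the first three components explain 90% of the total.
print(components_for_variance([4.0, 3.0, 2.0, 1.0], explained_variance=0.8))
print(components_for_variance([4.0, 3.0, 2.0, 1.0], explained_variance=0.95))
```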

Mappings Methods

Methods related to ComPath mappings.

Functions to deal with ComPath mappings.

pathway_forte.mappings.get_mapping_dict(df, mapping_type)[source]

Create a dictionary with ComPath mappings for each pathway.

Return type

Mapping[Tuple[str, str], List[Tuple[str, str]]]

pathway_forte.mappings.get_equivalent_pairs(df)[source]

Get equivalent pairs of pathways from 2 databases.

Parameters

df (DataFrame) – pairwise mappings dataframe

Returns

equivalent pathway pairs dictionary {(SOURCE_RESOURCE,SOURCE_ID):[(TARGET_RESOURCE,TARGET_ID)]}

Return type

dict[list]
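The shape of the returned dictionary can be illustrated with the sketch below, which groups mapping rows by their source pathway. The row layout (source resource, source ID, mapping type, target resource, target ID) and the mapping type label "equivalentTo" are assumptions about the mappings table, the identifiers are placeholders, and only the direction stated in each row is recorded.

```python
from collections import defaultdict


def equivalent_pairs(rows):
    """Build {(SOURCE_RESOURCE, SOURCE_ID): [(TARGET_RESOURCE, TARGET_ID)]}."""
    pairs = defaultdict(list)
    for source_resource, source_id, mapping_type, target_resource, target_id in rows:
        # Only equivalence mappings are kept; hierarchical ones are skipped.
        if mapping_type == "equivalentTo":
            pairs[(source_resource, source_id)].append((target_resource, target_id))
    return dict(pairs)


# Placeholder rows, made up for illustration.
toy_rows = [
    ("kegg", "k1", "equivalentTo", "reactome", "r1"),
    ("kegg", "k1", "equivalentTo", "wikipathways", "w1"),
    ("kegg", "k2", "isPartOf", "reactome", "r2"),
]

mapping = equivalent_pairs(toy_rows)
print(mapping)
```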

pathway_forte.mappings.load_compath_mapping_dfs()[source]

Load ComPath mappings data frames.

Return type

Tuple[DataFrame, DataFrame, DataFrame, DataFrame]

pathway_forte.mappings.get_equivalent_mappings_dict()[source]

Get mapping dictionary of all equivalent pairs of pathways.

Special mappings are not included in the overall mappings as some of the WP pathways possess identical IDs.

Return type

Mapping[Tuple[str, str], List[Tuple[str, str]]]

Installation

pathway_forte can be installed from PyPI with the following command in your terminal:

$ python3 -m pip install pathway_forte

The latest code can be installed from GitHub with:

$ python3 -m pip install git+https://github.com/pathwayforte/pathway-forte.git

For developers, the code can be installed with:

$ git clone https://github.com/pathwayforte/pathway-forte.git
$ cd pathway-forte
$ python3 -m pip install -e .

Main Commands

The table below lists the main commands of PathwayForte.

Command      Action
-----------  ------------------------------
datasets     Lists of Cancer Datasets
export       Export Gene Sets using ComPath
ora          List of ORA Analyses
fcs          List of FCS Analyses
prediction   List of Prediction Methods

Functional Enrichment Methods

  • ora. Lists Over-Representation Analyses (e.g., one-tailed hypergeometric test).

  • fcs. Lists Functional Class Score Analyses such as GSEA and ssGSEA using GSEAPy.

Prediction Methods

pathway_forte provides three prediction methods (binary classification, SVM-based multi-class classification, and survival analysis) that use individualized pathway activity scores. The scores can be calculated with a variety of tools (see [1]) using any pathway database that allows its gene sets to be exported.

  • binary. Trains an elastic net model for a binary classification task (e.g., tumor vs. normal patients). The training is conducted using a nested cross-validation approach (the number of cross-validation splits in both loops can be selected). The model can be changed easily, since most of the models in scikit-learn (the machine learning library used by this package) require the same input.

  • subtype. Trains an SVM model for a multi-class classification task (e.g., predicting tumor subtypes). The training is conducted using a nested cross-validation approach (the number of cross-validation splits in both loops can be selected). As with the previous classification task, other models can quickly be implemented.

  • survival. Trains a Cox proportional hazards model with an elastic net penalty. The training is conducted using a nested cross-validation approach with a grid search in the inner loop. This analysis requires pathway activity scores, patient classes, and patient lifetime information.

Other

  • export. Exports GMT files with current gene sets for the pathway databases included in ComPath [2].

  • datasets. Lists the TCGA data sets [3] that are ready to run in pathway_forte.

References

[1] Lim, S., et al. (2018). Comprehensive and critical evaluation of individualized pathway activity measurement tools on pan-cancer data. Briefings in Bioinformatics, bby125.

[2] Domingo-Fernández, D., et al. (2018). ComPath: An ecosystem for exploring, analyzing, and curating mappings across pathway databases. npj Systems Biology and Applications, 4(1), 43.

[3] Weinstein, J. N., et al. (2013). The Cancer Genome Atlas Pan-Cancer analysis project. Nature Genetics, 45(10), 1113.
