Extended Topological Regression

Introduction

This Python package implements an extended version of Topological Regression (TR), a similarity-based regression framework that is statistically grounded, computationally fast, and interpretable. The package offers flexible options for descriptor calculation, distance calculation, anchor point selection, and model configuration, so that users can tune the method to their specific research needs. For more information, please see:

Zhang, Ruibo, et al. "Topological regression as an interpretable and efficient tool for quantitative structure-activity relationship modeling." Nature Communications 15.1 (2024): 5072.

Installation

Step 1: Create a conda environment (optional)

conda create --name extendtr python=3.8
conda activate extendtr

Step 2: Install the extendtr package

Clone the GitHub repository:

git clone git@github.com:yixmao/extend_tr.git
cd extend_tr

Install the package:

python setup.py sdist bdist_wheel
pip install .

If an error reports that there is no module named setuptools or wheel, install them with pip install setuptools wheel and rerun the commands above.

Step 3: Verify installation

python tr_examples.py

Expected output:

----------------- Running different descriptors -----------------
Performance of ensemble predictions:
Spearman: 0.8048832322935336
R2: 0.8187727046698461
RMSE: 0.7894710288568683
NRMSE: 0.4257079930306147
Performance of stacking predictions:
Spearman: 0.7736080001537748
R2: 0.8180014941571891
RMSE: 0.7911490376625655
NRMSE: 0.4266128289712007
----------------- Running different anchor selections -----------------
Performance of ensemble predictions:
Spearman: 0.8280874367843225
R2: 0.869484364577384
RMSE: 0.6699704746442829
NRMSE: 0.3612694775684988
Performance of stacking predictions:
Spearman: 0.7875305228482481
R2: 0.8355063934636877
RMSE: 0.7521403782978013
NRMSE: 0.405578113975979
----------------- Running different distance calculations -----------------
Performance of ensemble predictions:
Spearman: 0.825262577107183
R2: 0.8559124650619488
RMSE: 0.7039431684702607
NRMSE: 0.3795886391056128
Performance of stacking predictions:
Spearman: 0.8621875286012207
R2: 0.8960827625479393
RMSE: 0.5978169015645128
NRMSE: 0.32236196650979265
----------------- Running different models ----------------------
Performance of ensemble predictions:
Spearman: 0.825262577107183
R2: 0.8559124650619488
RMSE: 0.7039431684702607
NRMSE: 0.3795886391056128
Performance of stacking predictions:
Spearman: 0.8621875286012207
R2: 0.8960827625479393
RMSE: 0.5978169015645128
NRMSE: 0.32236196650979265
----------------- All configurations run successfully! -----------------
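If the script fails before printing anything, a quick import check can help separate installation problems from runtime ones. The sketch below is not part of the package: is_importable is a hypothetical helper that uses only the standard library, and it assumes the top-level package name extendtr that appears in the imports later in this README.

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if Python can locate the top-level module `module_name`."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # "extendtr" is the package name used by the imports in this README.
    status = "found" if is_importable("extendtr") else "NOT found"
    print(f"extendtr: {status}")
```

If the package is reported as not found, rerunning pip install . inside the activated conda environment is the first thing to try.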

Example usage

Detailed example usage of different TR configurations can be found in tr_examples.ipynb. Here, we go through a simple example that runs ensemble TR on the CHEMBL dataset 278.

Step 1: Prepare the data

To use the TR functions, we first need to prepare the data: the descriptors, the targets, the train and test indices, and (optionally) the validation indices. Note that the descriptors and targets need to be pd.DataFrame, and the indices need to be list.

import pandas as pd
# load the descriptors - the indices of desc will be used later
desc = pd.read_parquet('./SampleDatasets/CHEMBL278/data_ECFP4.parquet', engine='fastparquet').astype('bool')
# load the targets
data = pd.read_csv('./SampleDatasets/CHEMBL278/data_cp.csv', index_col=0)
target = data["pChEMBL Value"]

As a sanity check, make sure that desc and target have the same indices.

# make sure that the indices of desc and target match
desc = desc.loc[target.index]
target = target.loc[desc.index]
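The two .loc calls above raise a KeyError in recent pandas versions when target contains a label that is missing from desc. As a hedged alternative, the sketch below keeps only the labels shared by both objects. The helper align_on_shared_index and the toy data are illustrative assumptions, not part of extendtr.

```python
import pandas as pd


def align_on_shared_index(desc: pd.DataFrame, target: pd.Series):
    """Keep only the samples present in both objects, in the same order."""
    shared = desc.index.intersection(target.index)
    return desc.loc[shared], target.loc[shared]


# Toy data: sample "c" has no target and sample "d" has no descriptor.
desc_toy = pd.DataFrame({"bit0": [True, False, True]}, index=["a", "b", "c"])
target_toy = pd.Series([6.1, 7.3, 5.0], index=["b", "a", "d"], name="pChEMBL Value")

desc_toy, target_toy = align_on_shared_index(desc_toy, target_toy)

# After alignment both objects carry exactly the same labels in the same order.
assert desc_toy.index.equals(target_toy.index)
print(f"{len(desc_toy)} shared samples kept")
```

Dropping unmatched samples silently may hide a data problem, so it can be worth printing how many samples were removed.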

Then, we define the indices for the training, test, and (optional) validation samples.

import json
from sklearn.model_selection import train_test_split
# load indices for scaffold split
with open('./SampleDatasets/CHEMBL278/scaffold_split_index.json', 'r') as f:
    index = json.load(f)
train_idx = index['train_idx']
test_idx = index['test_idx']
# make sure that train and test indices are included in target.index
train_idx = [idx for idx in train_idx if idx in target.index]
test_idx = [idx for idx in test_idx if idx in target.index]

# alternatively, you can split the train and test indices randomly
# dataset_idx = target.index.tolist()
# train_idx, test_idx = train_test_split(dataset_idx, test_size=0.2, random_state=2021)

# set the validation indices if necessary
val_set = 0.2 # fraction of the training samples held out as the validation set, or None for no validation
if val_set is not None: # hold out part of the training set for validation
    train_idx, val_idx = train_test_split(train_idx, test_size=val_set, random_state=2021)
else: # no validation
    val_idx = None
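Because the index lists are built in several steps, a quick check that they do not overlap can catch data leakage early. The sketch below is a minimal illustration; check_splits is a hypothetical helper written in plain Python, not a function provided by extendtr.

```python
def check_splits(train_idx, test_idx, val_idx=None):
    """Raise ValueError if any two index lists share a sample; return the split sizes."""
    splits = {"train": set(train_idx), "test": set(test_idx)}
    if val_idx is not None:
        splits["val"] = set(val_idx)

    names = list(splits)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            overlap = splits[first] & splits[second]
            if overlap:
                raise ValueError(f"{first} and {second} share {len(overlap)} sample(s)")
    return {name: len(members) for name, members in splits.items()}


# Toy indices standing in for train_idx, test_idx and val_idx.
sizes = check_splits(train_idx=[0, 1, 2, 3], test_idx=[4, 5], val_idx=[6])
print(sizes)  # {'train': 4, 'test': 2, 'val': 1}
```

Passing val_idx=None skips the validation checks, which matches the no-validation branch above.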

Step 2: Model and predict

Set the arguments and random seed. You can use args = TopoRegArgs() to receive command-line input.

from extendtr.utils.args import TopoRegArgs
from extendtr.utils.utils import set_seed
# get the args
args = TopoRegArgs('-ensemble 1') # ensemble TR
# set random seed
set_seed(args.seed)

Train the TR model(s) and get the predictions using TopoReg. mdl will be a list of models if ensemble is enabled. Note that pred_val will be None if val_idx is None.

from extendtr.TR.topoReg import TopoReg
# train and get the predictions
mdl, pred_test, pred_val, train_time, test_time = TopoReg(desc, target, train_idx, test_idx, val_idx, args)

Step 3: Evaluate the predictions

We provide a simple function that calculates Spearman's correlation, R2, root mean square error (RMSE), and normalized RMSE (NRMSE).

from extendtr.utils.utils import metric_calc
# evaluate the results
scorr, r2, rmse, nrmse = metric_calc(pred_test, target.loc[test_idx], True)
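For readers who want to see what these four metrics measure, the sketch below computes them with NumPy and pandas. It is not the implementation of metric_calc, which is not shown here. In particular, it assumes that NRMSE is the RMSE divided by the population standard deviation of the true values. That assumption is consistent with the expected output listed earlier, where R2 is approximately 1 - NRMSE^2 (for example, 1 - 0.4257^2 ≈ 0.8188). The function name regression_metrics is hypothetical.

```python
import numpy as np
import pandas as pd


def regression_metrics(y_pred, y_true):
    """Return (spearman, r2, rmse, nrmse) for two 1-D sequences of equal length."""
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    y_true = np.asarray(y_true, dtype=float).ravel()

    # Spearman's correlation is Pearson's correlation of the (average) ranks.
    rank_pred = pd.Series(y_pred).rank().to_numpy()
    rank_true = pd.Series(y_true).rank().to_numpy()
    spearman = float(np.corrcoef(rank_pred, rank_true)[0, 1])

    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    # Assumption: normalize by the population standard deviation of y_true.
    nrmse = rmse / float(np.std(y_true))
    # With that normalization, the usual R2 = 1 - SS_res / SS_tot equals 1 - NRMSE**2.
    r2 = 1.0 - nrmse ** 2
    return spearman, r2, rmse, nrmse


# Toy predictions: only the last sample is off, by exactly 1.
scorr, r2, rmse, nrmse = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 5.0])
print(f"Spearman: {scorr:.3f}  R2: {r2:.3f}  RMSE: {rmse:.3f}  NRMSE: {nrmse:.3f}")
```

Lower RMSE and NRMSE are better, while Spearman and R2 are better when closer to 1.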

Parameters

The table below lists the parameters with their possible values, default values, and descriptions.

| Parameter Name | Possible Values/Range | Default Value | Description |
| --- | --- | --- | --- |
| `-anchorselection` | `'random'`, `'maximin'`, `'maximin_density'` | `'random'` | Anchor selection strategy. |
| `-desc_norm` | `True`, `False` | `False` | Normalize the descriptors or not. |
| `-distance` | `'jaccard'`, `'tversky'`, `'euclidean'`, `'cosine'` | `'jaccard'` | Distance metric for calculations. |
| `-model` | `'LR'`, `'LR_L1'`, `'RF'`, `'ANN'` | `'LR'` | Model type for training. |
| `-anchor_percentage` | `[0, 1]` (float) | `0.5` | Percentage of anchors for training. |
| `-max_num_anchors` | Integer | `2000` | Maximum number of anchor points. |
| `-ensemble` | `True`, `False` | `False` | Enable ensemble TR. |
| `-mean_anchor_percentage` | `[0, 1]` (float) | `0.6` | Mean anchor percentage for ensemble TR. |
| `-std_anchor_percentage` | `[0, 1]` (float) | `0.2` | Standard deviation of anchor percentage for ensemble TR. |
| `-min_anchor_percentage` | `[0, 1]` (float) | `0.3` | Minimum anchor percentage for ensemble TR. |
| `-max_anchor_percentage` | `[0, 1]` (float) | `0.9` | Maximum anchor percentage for ensemble TR. |
| `-num_TR_models` | Integer | `15` | Number of TR models included in the ensemble TR. |
| `-seed` | Integer | `2021` | Random seed for reproducibility. |
| `-rbf_gamma` | Float | `0.5` | Gamma parameter for the RBF function. |
| `-verbose` | `True`, `False` | `False` | Report the metrics or not. |
| `-check_duplicates` | `True`, `False` | `False` | Check for duplicate samples in the dataset. |
| `-tversky_alpha` | `[0, 1]` (float) | `0.5` | Alpha parameter for Tversky distance. |
| `-tversky_beta` | `[0, 1]` (float) | `0.5` | Beta parameter for Tversky distance. |
| `-refine_anchors_lasso` | `True`, `False` | `False` | Enable L1-norm regularization to refine the anchors. |
| `-lasso_alpha` | Float | `0.05` | Alpha parameter for Lasso regularization. |
| `-lasso_thres` | Float | `1e-6` | Threshold for Lasso coefficient filtering. |
| `-weight_density` | `[0, 1]` (float) | `0.5` | Weight for the `'maximin_density'` anchor selection approach. |
| `-ann_cp_dir` | String (directory path) | `'./results/ann_cp/'` | Directory to save the ANN checkpoint. |
| `-ann_act` | `'tanh'`, `'relu'`, `'sigmoid'`, `'linear'` | `'tanh'` | Activation function for ANN. |
| `-ann_lr` | Float | `0.001` | Learning rate for ANN training. |
| `-ann_num_layers` | Integer | `1` | Number of hidden layers in the ANN. |
| `-ann_epochs` | Integer | `50` | Number of training epochs for ANN. |
| `-ann_batch_size` | Integer | `256` | Batch size for ANN training. |
| `-ann_batch_norm` | `True`, `False` | `True` | Enable batch normalization in ANN. |
| `-ann_init_wts` | `True`, `False` | `True` | Enable weight initialization in ANN. |
| `-ann_early_stop` | `True`, `False` | `True` | Enable early stopping for ANN. |
| `-ann_patience` | Integer | `3` | Number of steps to wait before stopping ANN training. |
| `-ann_min_delta` | Float | `1e-3` | Minimum change in NRMSE to qualify as an improvement for early stopping. |
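When several parameters are changed at once, it can be convenient to keep them in a dictionary and generate the argument string from it. The sketch below is a hypothetical helper, not part of extendtr. It assumes that flags are written as -name value and that booleans are passed as 1 or 0, following the '-ensemble 1' example earlier in this README.

```python
def build_arg_string(options: dict) -> str:
    """Turn {"ensemble": True, "distance": "tversky"} into "-ensemble 1 -distance tversky"."""
    parts = []
    for name, value in options.items():
        if isinstance(value, bool):
            # Assumption: booleans are passed as 1/0, as in '-ensemble 1'.
            value = int(value)
        parts.append(f"-{name} {value}")
    return " ".join(parts)


arg_string = build_arg_string({"ensemble": True, "distance": "tversky", "tversky_alpha": 0.7})
print(arg_string)  # -ensemble 1 -distance tversky -tversky_alpha 0.7
```

The resulting string could then be passed as TopoRegArgs(arg_string), assuming the parser accepts several flags in one string in the same way it accepts '-ensemble 1'.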

Contact

If you have any questions or suggestions, please feel free to contact: Yixiang Mao ([email protected]) and Dr. Ranadip Pal ([email protected]).
