This note describes the scripts used to process the raw data.
- The raw data files are in the Google Drive.
- Download `data_train_raw.tar.gz` and extract it to the root directory of the project.
- Download the raw protein/molecule/peptide files for each database (except for the Uni-Mol data) from the `raw_files` directory and extract them to the corresponding directories.
- The raw Uni-Mol data is very large (114.76 GB). You can download the molecular pretraining data directly from the original Uni-Mol repository, then extract the `ligands.tar.gz` file to the `data_train/unmi/files` directory.
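The extraction steps above can be sketched in a few lines. This is a minimal helper, not part of the repository: `extract_archive` is a hypothetical name, and the archive names and destinations are taken from this note.

```python
import os
import tarfile

def extract_archive(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir, creating it if needed."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        tar.extractall(path=dest_dir)

# data_train_raw.tar.gz unpacks at the project root:
# extract_archive("data_train_raw.tar.gz", ".")
# The Uni-Mol ligands go under data_train/unmi/files:
# extract_archive("ligands.tar.gz", "data_train/unmi/files")
```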
You will then have a `data_train` directory with the following structure:
```
data_train
├── geom
│   ├── dfs
│   │   └── meta_uni.csv
│   └── mols        # extracted from raw_files/geom.tar.gz
├── qm9
│   ├── dfs
│   │   └── meta_uni.csv
│   └── mols        # extracted from raw_files/qm9.tar.gz
├── unmi
│   ├── dfs
│   │   └── meta_uni.csv
│   └── files       # extract the downloaded ligands.tar.gz here
├── csd
│   ├── dfs
│   │   └── meta_filter_w_pocket.csv
│   └── files       # extracted from raw_files/csd.tar.gz
│       ├── proteins
│       └── mols
├── pbdock
│   ├── dfs
│   │   └── meta_filter_w_pocket.csv
│   └── files       # extracted from raw_files/pbdock.tar.gz
│       ├── proteins
│       └── mols
├── moad
│   ├── dfs
│   │   └── meta_uni.csv
│   └── files       # extracted from raw_files/moad.tar.gz
│       ├── proteins
│       └── mols
├── cremp
│   ├── dfs
│   │   └── meta_uni.csv
│   └── mols        # extracted from raw_files/cremp.tar.gz
├── apep
│   ├── dfs
│   │   └── meta_uni.csv
│   └── files       # extracted from raw_files/apep.tar.gz
│       ├── proteins
│       ├── mols
│       └── peptides
├── pepbdb
│   ├── dfs
│   │   └── meta_filter.csv
│   └── files       # extracted from raw_files/pepbdb.tar.gz
│       ├── proteins
│       ├── mols
│       └── peptide
└── assemblies      # train/val split for training
    └── split_train_val.csv
```
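Before processing, it can help to verify the layout. The following is a sanity-check sketch, not part of the repository: `check_layout` is a hypothetical helper, and the expected paths are taken from the tree above.

```python
import os

# Expected entries per database, relative to data_train/<db>/ (from the tree above).
EXPECTED = {
    "geom": ["dfs/meta_uni.csv", "mols"],
    "qm9": ["dfs/meta_uni.csv", "mols"],
    "unmi": ["dfs/meta_uni.csv", "files"],
    "csd": ["dfs/meta_filter_w_pocket.csv", "files/proteins", "files/mols"],
    "pbdock": ["dfs/meta_filter_w_pocket.csv", "files/proteins", "files/mols"],
    "moad": ["dfs/meta_uni.csv", "files/proteins", "files/mols"],
    "cremp": ["dfs/meta_uni.csv", "mols"],
    "apep": ["dfs/meta_uni.csv", "files/proteins", "files/mols", "files/peptides"],
    "pepbdb": ["dfs/meta_filter.csv", "files/proteins", "files/mols", "files/peptide"],
    "assemblies": ["split_train_val.csv"],
}

def check_layout(root="data_train"):
    """Return a list of expected paths that are missing under `root`."""
    missing = []
    for db, entries in EXPECTED.items():
        for entry in entries:
            path = os.path.join(root, db, entry)
            if not os.path.exists(path):
                missing.append(path)
    return missing
```

Running `check_layout()` from the project root and fixing any reported paths before starting the processing scripts avoids partial failures later.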
The following steps process the raw data files. Run the Python commands for each database in order.
Run

```shell
python process/geom/process_mols.py
python process/process_torsional_info.py --db_name geom
python process/process_decompose_info.py --db_name geom
```

and you will get the processed data (LMDB) in the `data_train/geom/lmdb` directory.
Run

```shell
python process/qm9/process_mols.py
python process/process_torsional_info.py --db_name qm9
python process/process_decompose_info.py --db_name qm9
```

and you will get the processed data (LMDB) in the `data_train/qm9/lmdb` directory.
Run

```shell
python process/unmi/process_mols.py
python process/process_torsional_info.py --db_name unmi
python process/process_decompose_info.py --db_name unmi
```

and you will get the processed data (LMDB) in the `data_train/unmi/lmdb` directory.
Run

```shell
python process/csd/extract_pockets.py
python process/csd/process_pocmol.py
python process/process_torsional_info.py --db_name csd
python process/process_decompose_info.py --db_name csd
```

and you will get the processed data (LMDB) in the `data_train/csd/lmdb` directory and the pocket data in the `data_train/csd/files/pockets10` directory.
Run

```shell
python process/pbdock/extract_pockets.py
python process/pbdock/process_pocmol.py
python process/process_torsional_info.py --db_name pbdock
python process/process_decompose_info.py --db_name pbdock
```

and you will get the processed data (LMDB) in the `data_train/pbdock/lmdb` directory and the pocket data in the `data_train/pbdock/files/pockets10` directory.
Run

```shell
python process/moad/extract_pockets.py
python process/moad/process_pocmol.py
python process/process_torsional_info.py --db_name moad
python process/process_decompose_info.py --db_name moad
```

and you will get the processed data (LMDB) in the `data_train/moad/lmdb` directory and the pocket data in the `data_train/moad/files/pockets10` directory.
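The three protein-ligand databases (csd, pbdock, moad) share the same four-step pipeline, so the commands can be driven by a small loop. This is a sketch, not part of the repository: `pipeline_commands` and `run_pipeline` are hypothetical helpers built from the commands listed in this note.

```python
import subprocess

def pipeline_commands(db):
    """Return the ordered processing commands for a protein-ligand database."""
    return [
        ["python", f"process/{db}/extract_pockets.py"],
        ["python", f"process/{db}/process_pocmol.py"],
        ["python", "process/process_torsional_info.py", "--db_name", db],
        ["python", "process/process_decompose_info.py", "--db_name", db],
    ]

def run_pipeline(db):
    """Run each step in order, stopping on the first failure."""
    for cmd in pipeline_commands(db):
        subprocess.run(cmd, check=True)

# for db in ("csd", "pbdock", "moad"):
#     run_pipeline(db)
```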
Run

```shell
python process/process_mols.py --db_name cremp
```

and you will get the processed data (LMDB) in the `data_train/cremp/lmdb` directory.
Run

```shell
python process/extract_pockets.py --db_name apep
python process/process_pocmol.py --db_name apep
python process/process_peptide_allinone.py --db_name apep
python process/process_torsional_info.py --db_name apep
python process/process_decompose_info.py --db_name apep
```

and you will get the processed data (LMDB) in the `data_train/apep/lmdb` directory and the pocket data in the `data_train/apep/files/pockets10` directory.
Run

```shell
python process/process_pocmol_allinone.py --db_name pepbdb
python process/process_peptide_allinone.py --db_name pepbdb
```

and you will get the processed data (LMDB) in the `data_train/pepbdb/lmdb` directory and the pocket data in the `data_train/pepbdb/files/pockets10` directory.
Finally, run

```shell
python process/make_assembly_lmdb.py
```

to generate the training/validation split data (LMDB) in the `data_train/assemblies` directory for model training.