This is the code repository for the manuscript "Predictive Modeling in Urgent Care: A Comparative Study of Machine Learning Approaches" by Fengyi Tang, Cao Xiao, Fei Wang, Jiayu Zhou.
Objective: The growing availability of rich clinical data such as patients' electronic health records (EHR) provides great opportunities to address a broad range of real-world questions in medicine. At the same time, artificial intelligence and machine learning based approaches have shown great promise in extracting insights from those data and helping with various clinical problems. The goal of this study is to conduct a systematic comparative study of different machine learning algorithms for several predictive modeling problems in urgent care.
Design: We assess the performance of four benchmark prediction tasks (e.g., mortality and length of stay prediction, differential diagnostics and disease marker discovery) using medical histories, physiological time-series and demographics data from the Medical Information Mart for Intensive Care (MIMIC-III) database.
Measurements: For each task, performance was estimated using standard measures including the area under the receiver operating characteristic curve (AUC), F-1 score, sensitivity and specificity. Micro-averaged AUC was used for multi-class classification models.
Results and Discussion: Our results suggest that recurrent neural networks show the most promise in mortality prediction where temporal patterns in physiologic features alone can capture in-hospital mortality risk (AUC > 0.90). Temporal models did not provide additional benefit compared to deep models in differential diagnostics. When comparing the training-testing behaviors of readmission and mortality models, we illustrate that readmission risk may be independent of patient stability at discharge. We also introduce a multi-class prediction scheme for length of stay which preserves sensitivity and AUC with outliers of increasing duration despite decrease in sample size.
- Python 3.4+
- Keras 2.0
- Scikit-Learn
- Gensim
- NumPy
- Pandas
- Tensorflow 1.11+
- Progressbar2
- Postgres (or equivalent for building local MIMIC-III)
Please apply for access to the publicly available MIMIC-III database via https://www.physionet.org/
Workflow: MIMIC-III Access -> Obtain Views and Tables -> Preprocessing -> Pipeline
- Obtain access to MIMIC-III and clone this repo to a local folder. Create a local MIMIC-III folder to store a few files:
.../local_mimic
.../local_mimic/views
.../local_mimic/tables
.../local_mimic/save
These paths will be important for storing views and pivot tables, which will be used for preprocessing.
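The folder layout above can be created by hand, or with a few lines of Python. The following is a minimal sketch, not part of this repo: the helper name `make_local_mimic` is hypothetical, and the root folder is whatever location you choose for `.../local_mimic`.

```python
from pathlib import Path


def make_local_mimic(root):
    """Create the views, tables and save folders under root and return their paths."""
    root = Path(root)
    paths = {name: root / name for name in ("views", "tables", "save")}
    for path in paths.values():
        # parents=True also creates root itself; exist_ok=True makes reruns harmless
        path.mkdir(parents=True, exist_ok=True)
    return paths


# Example (the argument is a placeholder for your own location):
# paths = make_local_mimic("local_mimic")
# paths["views"] is then the folder to pass later as path_views
```

The returned dictionary keeps the three paths together, so they can be reused when calling the preprocessing step.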
- Build the MIMIC-III database using postgres, following the instructions outlined in the MIMIC-III repository:
https://github.com/MIT-LCP/mimic-code/tree/master/buildmimic/postgres
- Go to the pivot folder in the MIMIC-III repository:
https://github.com/MIT-LCP/mimic-code/tree/master/concepts/pivot
Use the .sql scripts to build a local set of .csv files of the pivot tables:
- pivoted-bg.sql
- pivoted_vital.sql
- pivoted_lab.sql
- pivoted_gcs.sql (optional)
- pivoted_uo.sql (optional)
When running the .sql scripts, change the delimiter of the materialized views to ',' when saving them as .csv files.
For example,
mimic=> \copy (select * FROM mimiciii.icustay_detail) to 'icustay_detail.csv' delimiter ',' csv header;
After running these scripts, you should have local .csv files of the pivot tables.
Create a local folder to place them in, e.g. .../local_mimic/views/pivoted-bg.csv
Remember this .../local_mimic/views folder, as it will be the path_views input for preprocessing.
- Go to the demographics folder in the MIMIC-III repository:
https://github.com/MIT-LCP/mimic-code/tree/master/concepts/demographics
Run icustay-detail.sql and obtain a local .csv file of the icustay_detail view.
Place this .csv file in the local views folder as well, e.g. .../local_mimic/views/icustay_details.csv
One minor manual change is needed in icustay_details.csv: rename the column header 'admission_age' to 'age'.
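If you prefer not to edit the header by hand, the rename can be scripted. This is a minimal sketch using only the standard library; `rename_csv_header` is a hypothetical helper, not something shipped with this repo, and it reads the whole file into memory, which should be acceptable for a view of this size.

```python
import csv


def rename_csv_header(path, old, new):
    """Rename header field `old` to `new` in the CSV at `path`.

    Returns True if a field was renamed, False if `old` was not in the header.
    """
    with open(path, newline="") as f:
        rows = list(csv.reader(f))
    if not rows or old not in rows[0]:
        return False
    # Only the first row (the header) is modified; data rows are written back as read
    rows[0] = [new if name == old else name for name in rows[0]]
    with open(path, "w", newline="") as f:
        csv.writer(f).writerows(rows)
    return True


# Example (path is a placeholder for your own views folder):
# rename_csv_header("local_mimic/views/icustay_details.csv", "admission_age", "age")
```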
- Obtain a local copy of the following tables from MIMIC-III:
- admissions.csv
- diagnoses_icd.csv
- d_icd_diagnoses.csv
These can be obtained directly from PhysioNet as compressed files.
While tables such as chartevents are large, the above tables are quite small and easy to query directly if a local copy is available.
Save these tables under the .../local_mimic/tables folder.
Make the following changes:
- In ~/local_mimic/tables/diagnoses_icd.csv, change the column titles "ROW_ID","SUBJECT_ID","HADM_ID","SEQ_NUM","ICD9_CODE" to "row_id","subject_id","hadm_id","seq_num","icd9_code" (i.e., make them lower case).
- In ~/local_mimic/tables/d_icd_diagnoses.csv, change the column titles "ROW_ID","ICD9_CODE","SHORT_TITLE","LONG_TITLE" to "row_id","icd9_code","short_title","long_title" (i.e., again, make them lower case).
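The lower-casing of the two headers can also be scripted. The sketch below is one possible way to do it and is not part of this repo: `lowercase_csv_header` is a hypothetical helper that rewrites only the first line and streams the remaining rows unchanged, so ICD-9 codes in the data keep their original case.

```python
import os
import shutil
import tempfile


def lowercase_csv_header(path):
    """Lower-case the first line (the header) of the CSV at `path` in place."""
    directory = os.path.dirname(os.path.abspath(path))
    with open(path, newline="") as src, tempfile.NamedTemporaryFile(
        "w", newline="", dir=directory, delete=False
    ) as dst:
        dst.write(src.readline().lower())  # header line only
        shutil.copyfileobj(src, dst)       # copy all data rows untouched
    # Swap the rewritten file into place once both files are closed
    os.replace(dst.name, path)


# Example (paths are placeholders for your own tables folder):
# for name in ("diagnoses_icd.csv", "d_icd_diagnoses.csv"):
#     lowercase_csv_header(os.path.join("local_mimic", "tables", name))
```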
- Run preprocessing.py with inputs: --path_tables <path_tables> --path_views <path_views> --path_save <path_save>
<path_tables> and <path_views> should correspond to the folders under which the local tables and views (pivots and icustay details) are saved.
<path_save> corresponds to the folder in which to save your variables for training and beyond.
preprocessing.py will generate the following files:
- X19.npy: main feature tensor, consisting of time-series data generated from a combination of 19 lab values and vital signs over a 48-hour period from the start of admission.
- X48.npy: summary feature matrix of the time-series data, with min, mean, max, and standard deviation (std) of each feature as extended features instead of time-series.
- y: main label matrix, with (mortality_flag, readmission_status, LOS_bin, diagnoses_labels) for each patient. The labels are coupled here, but when running main.py the user can define which task to pick (i.e. which column of y).
- onehot: one-hot vector of the diagnostic history of each patient. This is different from the top 25 differential diagnosis task, which is the last column of y. Diagnostic history uses ICD-9 Group Codes instead of ICD-9 codes (i.e. more general). Used only for mortality, LOS and readmission predictions.
- w2v: Skip-Gram embeddings for diagnosis histories (auxiliary input).
- h2v: Skip-Gram embeddings of both diagnostic histories and demographics info (auxiliary input).
- demo: one-hot vector representation of demographics info (auxiliary input).
- sentences: Skip-Gram embeddings of mixed diagnostic histories and abnormal laboratory flags (main feature input).
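To make the relationship between a time-series tensor and a summary feature matrix concrete, here is a small NumPy sketch. It is an illustration under assumptions, not the code in preprocessing.py: the axis order (patients, time steps, features), the order in which the four statistics are concatenated, and the function name `summarize_time_series` are all assumed, and the actual files may be laid out differently.

```python
import numpy as np


def summarize_time_series(x):
    """Collapse (patients, time steps, features) to (patients, 4 * features)
    by concatenating per-feature min, mean, max and std over the time axis."""
    x = np.asarray(x, dtype=float)
    stats = [x.min(axis=1), x.mean(axis=1), x.max(axis=1), x.std(axis=1)]
    return np.concatenate(stats, axis=1)


# Toy tensor: 2 patients, 3 time steps, 2 features (made-up numbers, not MIMIC-III data)
toy = np.array([[[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]],
                [[4.0, 40.0], [4.0, 40.0], [4.0, 40.0]]])
summary = summarize_time_series(toy)
print(summary.shape)  # (2, 8)
```

Each patient ends up with one fixed-length row, which is the kind of input the non-temporal models can consume.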
- Run main.py with a selection of features, auxiliary features, task, model, and training conditions:
- --features_dir: path to the saved feature file to use as X. Selections include X19, X48, sentences, or onehot.
- --auxiliary_dir: path to auxiliary features to be used for certain models. Selections include w2v, h2v, or demo.
- --y_dir: path to y.
- --model: type of model to use for train / test. User can choose among ['lstm', 'cnn', 'mlp', 'svm', 'rf', 'lr', 'gbc']. LSTM, CNN-LSTM and MLP are deep models, while SVM, random forest (rf), logistic regression (lr) and gradient boosting (gbc) are classical models. Note that LSTM and CNN-LSTM need to use X19 as the input feature because they are temporal models. Non-temporal models such as MLP, SVM, rf, lr and gbc should not use X19.
- --task: specifies the learning task. User can choose among ['readmit', 'mort', 'los', 'dx'].
- --checkpoint_dir: specifies the path to save best models and testing results.
- --hidden_size: specifies the number of hidden units for deep models (default = 256).
- --learning_rate: specifies the initial learning rate (default = 0.005).
- --nepochs: number of training epochs (default = 100).
- --batch_size: batch size during training (default = 32).
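The options above, together with the rule that only temporal models should use X19, can be mirrored in a short argparse sketch. This is a hypothetical stand-in for illustration and is not the parser in main.py: which options are required, the helper names `build_parser` and `check_feature_choice`, the two-dash spelling of `--checkpoint_dir`, and the way the feature file is recognised from its file name are all assumptions.

```python
import argparse
from pathlib import Path

MODELS = ["lstm", "cnn", "mlp", "svm", "rf", "lr", "gbc"]
TASKS = ["readmit", "mort", "los", "dx"]
TEMPORAL_MODELS = {"lstm", "cnn"}  # these consume the X19 time-series tensor


def build_parser():
    """Hypothetical mirror of the documented options and their defaults."""
    p = argparse.ArgumentParser(description="Sketch of the main.py options")
    p.add_argument("--features_dir", required=True)  # required-ness is assumed
    p.add_argument("--auxiliary_dir", default=None)
    p.add_argument("--y_dir", required=True)
    p.add_argument("--model", choices=MODELS, required=True)
    p.add_argument("--task", choices=TASKS, required=True)
    p.add_argument("--checkpoint_dir", default=None)
    p.add_argument("--hidden_size", type=int, default=256)
    p.add_argument("--learning_rate", type=float, default=0.005)
    p.add_argument("--nepochs", type=int, default=100)
    p.add_argument("--batch_size", type=int, default=32)
    return p


def check_feature_choice(model, features_dir):
    """Return an error message if model and feature file are mismatched, else None."""
    uses_x19 = Path(features_dir).stem == "X19"  # assumes the file is named X19[.npy]
    if model in TEMPORAL_MODELS and not uses_x19:
        return f"{model} is a temporal model and needs X19 as input"
    if model not in TEMPORAL_MODELS and uses_x19:
        return f"{model} is not a temporal model and should not use X19"
    return None


# Example argument list (paths are placeholders)
args = build_parser().parse_args(
    ["--features_dir", "save/X19.npy", "--y_dir", "save/y.npy",
     "--model", "lstm", "--task", "mort"]
)
print(check_feature_choice(args.model, args.features_dir))  # None
```

Checking the model and feature pairing before training starts gives an early, readable error instead of a shape mismatch deep inside the model code.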
If you find any errors or issues, please do not hesitate to report them.