From b2d34a6fd32e213d735be3019ef0432a45faad33 Mon Sep 17 00:00:00 2001
From: Jiyang
Date: Mon, 4 Dec 2023 13:44:49 -0600
Subject: [PATCH] Update README

---
 README.md | 170 ++++--------------------------------------------
 1 file changed, 12 insertions(+), 158 deletions(-)

diff --git a/README.md b/README.md
index 2e6b2fa..bccb3cb 100644
--- a/README.md
+++ b/README.md
@@ -15,140 +15,51 @@ Authors: [Jiyang Zhang](https://jiyangzhang.github.io/), [Pengyu Nie](https://pe
 }
 ```
-
 ## Introduction
-This repo contains the code and artifacts for reproducing the experiments in [CoditT5: Pretraining for Source Code and Natural Language Editing](https://arxiv.org/abs/2208.05446).
-In this work, we introduce CoditT5 for software **edit** tasks. CoditT5 is a large Language Model pretrained with a novel objective to explicitly model edits. CoditT5 sets the state-of-the-art for downstream tasks including comment updating, bug fixing and automated code review.
+This repo contains the code and artifacts for reproducing the experiments in [Multilingual Code Co-Evolution Using Large Language Models](https://arxiv.org/abs/2307.14991).
+In this work, we introduce Codeditor for co-evolving software implemented in multiple programming languages.
 
 The code includes:
 
-- scripts for collecting and processing dataset
+- scripts for processing the dataset
 - scripts for training and evaluating codeditor models
 
 The artifacts include:
 
-- Java to C# translation dataset
-- checkpoints for the Codeditor models fine-tuned for Java to C# translation and C# to Java translation
-
-## Table of Contents
-
-1. [How to Use][sec-howto]
-2. [Dependency][sec-dependency]
-3. [Data Downloads][sec-downloads]
-4. [Code for Pretraining][sec-pretrain]
-5. [Code for Processing Fine-tuning Data][sec-process]
-6. [Code for Training and Evaluating Models][sec-traineval]
-7. [Code for Combining CodeT5 and CoditT5][sec-rerank]
-
-## Dependency
-
-[sec-dependency]: #dependency
-
-Our code require the following hardware and software environments.
-
-- Operating system: Linux (tested on Ubuntu 20.04)
-- Minimum disk space: 4 GB
-- Python: 3.8
-- Anaconda/Miniconda: appropriate versions for Python 3.8 or higher
-
-Additional requirements for training and evaluating ML models:
-
-- GPU: NVIDIA GTX 1080 Ti or better (with >= 11GB memory)
-- CUDA: 10.2 or 11.3
-- Disk space: 2 GB per trained model
-
-[Anaconda](https://www.anaconda.com/products/individual#Downloads) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html) is required for installing the other Python library dependencies. Once Anaconda/Miniconda is installed, you can use the following command to setup a virtual environment, named `deltr`, with the Python library dependencies installed:
-
-```
-cd python/
-./prepare_conda_env.sh
-```
-
-And then use `conda activate cdt` to activate the created virtual environment.
+- Java to C# raw paired changes
+- Java to C# translation dataset processed for codeditor models
 
 ## Data Downloads
 
 [sec-downloads]: #data-downloads
 
-All our data is hosted on UTBox via [a zip file](https://utexas.box.com/s/9rkqnlp6wjhwyfmxce97pgb4ersfc1f9).
+All our data is hosted on UTBox via [a shared folder](https://utexas.box.com/s/iwcvwgx23g9xvowu9joa661rz74k9eea).
 
-Data should be downloaded to this directory with the same directory structure (e.g., `data/` from the shared folder should be downloaded as `data/` under current directory).
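+
+As a sketch of the expected setup (assuming the shared folder is downloaded as an archive named `data.zip`; names here are illustrative), unpack it so the top-level directories land under the repository root:
+
+```
+# Unpack the UTBox download; raw_data/ and data/ should end up under the repo root,
+# matching the paths expected by the processing and training steps below
+unzip data.zip -d .
+ls raw_data/ data/
+```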
-
-## Code for Pretraining
-
-[sec-pretrain]: #code-for-pretraining
-
-### Synthesize Pretraining Data
-
-We provide sample scripts to synthesize the pretraining dataset (by corrupting programming language code snippets and natural language comments) for CoditT5.
-
-First, prepare the programming language and natural language data for pretraining; Then specify the following variables in the function `corrupt_pretrain_data()` in `python/run.sh`:
-
-- `source_pl_file`: the path of data file where each line is a programming language function;
-- `tokenized_pl_file`: the path of tokenized version of `source_pl_file`;
-- `corrupt_pl_file`: corrupted version of `tokenized_pl_file` which is the input of pretrained model.
-- `source_nl_file`: the path of data file where each line is a natural language sequence;
-- `tokenized_nl_file`: the path of tokenized version of `source_nl_file`;
-- `corrupt_nl_file`: corrupted version of `tokenized_nl_file` which is the input of pretrained model.
-
-```
-cd python/
-./run.sh corrupt_pretrain_data
-```
-
-### Pretrain CoditT5
-
-Requires the pretrain dataset at `data/CoditT5/pretrain/`
-
-```
-cd python/
-./run.sh pretrain_CoditT5
-```
 
 ## Code for Processing Fine-tuning Data
 
 [sec-process]: #code-for-processing-fine-tuning-data
 
-We provide the sample script to process the downstream datasets for CoditT5. Requires the raw data files at `raw_data/`.
+We provide a sample script to process the datasets for edit-translation. Requires the raw data files at `raw_data/`.
 
 ```
 cd python/
-./run.sh process_coditT5_dataset --dataset ${dataset}
+python -m deltr.collector.DataProcessor edit_translation_data_process --exp cs2java --src_lang cs --tgt_lang java
+
-# Example: ./run.sh process_coditT5_dataset --dataset comment-update
 ```
 
-Where `${dataset}` is the name of the dataset (comment-update, code-review, bf-small, bf-medium). The data files are generated to `data/CoditT5/${dataset}/`.
-
-Notes:
-
-- CoditT5's input data file name ends with `.buggy`; CoditT5's target output (edit plan + generation) file name ends with `.fixed`; target generation file name ends with `.seq`.
-- CoditT5's input is in the form of `source_sequence context_sequence`; and CoditT5's output is in the form of `edit_plan target_sequence`
-- Raw data files are stored in `raw_data/` (we provide some examples for demo), processed data files are generated to `data/CoditT5/${dataset}`
-- Note that for the comment-update dataset, the processed `edit_plan` is the edits applied to the comment w/o parameter (@return, @param)
-
 ## Code for Training and Evaluating Models
 
 [sec-traineval]: #code-for-training-and-evaluating-models
 
 ### Train ML models
 
-Requires the dataset at `data/${model}/${dataset}/`, where `${model}` is the name of the model (CodeT5, CoditT5); `${dataset}` is the name of the dataset.
-
 ```
 cd python/
-./run.sh ${model}_train ${dataset}
+python -m deltr.coditT5.CodeT5 fit --exp_dir ${MODELS_DIR}/${model_name}/${dataset} --data.dataset ${dataset} --data.model ${model_name} --config configs/coditT5.yaml
 
-# Example: ./run.sh CoditT5_train comment-update
+# Example: python -m deltr.coditT5.CodeT5 fit --exp_dir models/edit-translation/java2cs --data.dataset java2cs --data.model edit-translation --config configs/coditT5.yaml
 ```
 
 Results are generated to `models/${model}/${dataset}/`, where:
@@ -157,73 +68,16 @@ Results are generated to `models/${model}/${dataset}/`, where:
 
 - `logs/`: stores logs during training.
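+
+The C# to Java direction should follow the same pattern; a minimal sketch, assuming a `cs2java` dataset produced by the processing step above:
+
+```
+cd python/
+# Train the edit-translation model for the C# to Java direction
+python -m deltr.coditT5.CodeT5 fit --exp_dir models/edit-translation/cs2java --data.dataset cs2java --data.model edit-translation --config configs/coditT5.yaml
+```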
-### Evaluate ML models
+### Run ML models for inference
 
 Requires the dataset at `data/${model}/${dataset}/`, and the trained model at `models/${model}/${dataset}/model/`.
 
 ```
 cd python/
-./run.sh ${model}_generate ${dataset}
+python -m deltr.coditT5.CodeT5 predict --exp_dir ${MODELS_DIR}/${model_name}/${dataset} --data.dataset ${dataset} --data.model ${model_name} --config configs/coditT5.yaml
 
-# Example: ./run.sh CoditT5_generate comment-update
 ```
 
 Results are generated to `models/${model}/${dataset}/`, where:
 
 - `output.hyp`: the predictions.
-
-### Compute automatic metrics
-
-Requires the model's predictions at `models/${model}/${dataset}/`. Note that the provided script assumes the names for the data files conform the what described in [Code for Processing Fine-tuning Data][sec-process]
-
-```
-./run.sh ${model}_eval ${dataset}
-
-# Example: ./run.sh CoditT5_eval comment-update
-```
-
-Results are generated to `results/`:
-
-- `results-${dataset}-${model}.json`: the average of automatic metrics.
-
-- `scores-${dataset}-${model}.json`: the list of automatic metrics per sample.
-
-## Code for Combining CodeT5 and CoditT5
-
-[sec-rerank]: #code-for-combining-codet5-and-coditt5
-
-Requires the dataset at `data/${model}/${dataset}/`, the trained models at `models/${model}/${dataset}/model/`.
-
-### Rerank Models' outputs
-
-```
-cd python/
-# Rerank CodeT5's outputs with CoditT5
-./run.sh CodeT5_rerank ${dataset}
-# Rerank CoditT5's outputs with CodeT5
-./run.sh CodeT5_rerank ${dataset}
-
-# Example: ./run.sh CoditT5_rerank comment-update
-```
-
-Main results are generated to `results/reranks/`:
-
-- `test-${dataset}-${model}-top-20-rerank-${reranker}-results.json`: `${model}`'s top 20 beam outputs and `${reranker}`'s likelihood score for each beam output.
-
-### Compute automatic metrics
-
-Requires the model's reranking results file
-`results/reranks/test-${dataset}-${model}-top-20-rerank-${reranker}-results.json`.
-
-```
-./run.sh eval_rerank_${model}_${reranker} ${dataset}
-
-# Example: compute metrics for top 1 CoditT5 prediction reranked by CodeT5
-./run.sh eval_rerank_CoditT5_CodeT5 comment-update
-```
-
-Results are generated to `results/`:
-
-- `results-${dataset}-${model}-rerank-${reranker}.json`: the average of automatic metrics.
-
-- `scores-${dataset}-${model}-rerank-${reranker}.json`: the list of automatic metrics per sample.
\ No newline at end of file
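+
+For instance, to run inference with the Java to C# model trained in the earlier example and inspect the first few predictions (paths assume that example's `--exp_dir`):
+
+```
+cd python/
+python -m deltr.coditT5.CodeT5 predict --exp_dir models/edit-translation/java2cs --data.dataset java2cs --data.model edit-translation --config configs/coditT5.yaml
+# Show the first few generated translations
+head models/edit-translation/java2cs/output.hyp
+```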