From b2d34a6fd32e213d735be3019ef0432a45faad33 Mon Sep 17 00:00:00 2001
From: Jiyang
Date: Mon, 4 Dec 2023 13:44:49 -0600
Subject: [PATCH] Update README

---
 README.md | 170 ++++--------------------------------------------
 1 file changed, 12 insertions(+), 158 deletions(-)

diff --git a/README.md b/README.md
index 2e6b2fa..bccb3cb 100644
--- a/README.md
+++ b/README.md
@@ -15,140 +15,51 @@ Authors: [Jiyang Zhang](https://jiyangzhang.github.io/), [Pengyu Nie](https://pe
 }
 ```
-
 ## Introduction
-This repo contains the code and artifacts for reproducing the experiments in [CoditT5: Pretraining for Source Code and Natural Language Editing](https://arxiv.org/abs/2208.05446).
-In this work, we introduce CoditT5 for software **edit** tasks. CoditT5 is a large Language Model pretrained with a novel objective to explicitly model edits. CoditT5 sets the state-of-the-art for downstream tasks including comment updating, bug fixing and automated code review.
+This repo contains the code and artifacts for reproducing the experiments in [Multilingual Code Co-Evolution Using Large Language Models](https://arxiv.org/abs/2307.14991).
+In this work, we introduce Codeditor for co-evolving software implemented in multiple programming languages.
 
 The code includes:
 
-- scripts for collecting and processing dataset
+- scripts for processing the dataset
 - scripts for training and evaluating codeditor models
 
 The artifacts include:
 
-- Java to C# translation dataset
-- checkpoints for the Codeditor models fine-tuned for Java to C# translation and C# to Java translation
-
-## Table of Contents
-
-1. [How to Use][sec-howto]
-2. [Dependency][sec-dependency]
-3. [Data Downloads][sec-downloads]
-4. [Code for Pretraining][sec-pretrain]
-5. [Code for Processing Fine-tuning Data][sec-process]
-6. [Code for Training and Evaluating Models][sec-traineval]
-7. [Code for Combining CodeT5 and CoditT5][sec-rerank]
-
-## Dependency
-
-[sec-dependency]: #dependency
-
-Our code require the following hardware and software environments.
-
-- Operating system: Linux (tested on Ubuntu 20.04)
-- Minimum disk space: 4 GB
-- Python: 3.8
-- Anaconda/Miniconda: appropriate versions for Python 3.8 or higher
-
-Additional requirements for training and evaluating ML models:
-
-- GPU: NVIDIA GTX 1080 Ti or better (with >= 11GB memory)
-- CUDA: 10.2 or 11.3
-- Disk space: 2 GB per trained model
-
-[Anaconda](https://www.anaconda.com/products/individual#Downloads) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html) is required for installing the other Python library dependencies. Once Anaconda/Miniconda is installed, you can use the following command to setup a virtual environment, named `deltr`, with the Python library dependencies installed:
-
-```
-cd python/
-./prepare_conda_env.sh
-```
-
-And then use `conda activate cdt` to activate the created virtual environment.
+- Java to C# raw paired changes
+- Java to C# translation dataset processed for codeditor models
 
 ## Data Downloads
 
 [sec-downloads]: #data-downloads
 
-All our data is hosted on UTBox via [a zip file](https://utexas.box.com/s/9rkqnlp6wjhwyfmxce97pgb4ersfc1f9).
+All our data is hosted on UTBox via [a shared folder](https://utexas.box.com/s/iwcvwgx23g9xvowu9joa661rz74k9eea).
 
-Data should be downloaded to this directory with the same directory structure (e.g., `data/` from the shared folder should be downloaded as `data/` under current directory).
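+
+As a sketch of the expected setup (assuming the shared folder is downloaded as an archive named `data.zip`; names here are illustrative), unpack it so the top-level directories land under the repository root:
+
+```
+# Unpack the UTBox download; raw_data/ and data/ should end up under the repo root,
+# matching the paths expected by the processing and training steps below
+unzip data.zip -d .
+ls raw_data/ data/
+```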
-
-## Code for Pretraining
-
-[sec-pretrain]: #code-for-pretraining
-
-### Synthesize Pretraining Data
-
-We provide sample scripts to synthesize the pretraining dataset (by corrupting programming language code snippets and natural language comments) for CoditT5.
-
-First, prepare the programming language and natural language data for pretraining; Then specify the following variables in the function `corrupt_pretrain_data()` in `python/run.sh`:
-
-- `source_pl_file`: the path of data file where each line is a programming language function;
-- `tokenized_pl_file`: the path of tokenized version of `source_pl_file`;
-- `corrupt_pl_file`: corrupted version of `tokenized_pl_file` which is the input of pretrained model.
-- `source_nl_file`: the path of data file where each line is a natural language sequence;
-- `tokenized_nl_file`: the path of tokenized version of `source_nl_file`;
-- `corrupt_nl_file`: corrupted version of `tokenized_nl_file` which is the input of pretrained model.
-
-```
-cd python/
-./run.sh corrupt_pretrain_data
-```
-
-### Pretrain CoditT5
-
-Requires the pretrain dataset at `data/CoditT5/pretrain/`
-
-```
-cd python/
-./run.sh pretrain_CoditT5
-```
 
 ## Code for Processing Fine-tuning Data
 
 [sec-process]: #code-for-processing-fine-tuning-data
 
-We provide the sample script to process the downstream datasets for CoditT5. Requires the raw data files at `raw_data/`.
+We provide a sample script to process the datasets for edit-translation. Requires the raw data files at `raw_data/`.
 
 ```
 cd python/
-./run.sh process_coditT5_dataset --dataset ${dataset}
+python -m deltr.collector.DataProcessor edit_translation_data_process --exp cs2java --src_lang cs --tgt_lang java
+
-# Example: ./run.sh process_coditT5_dataset --dataset comment-update
 ```
 
-Where `${dataset}` is the name of the dataset (comment-update, code-review, bf-small, bf-medium). The data files are generated to `data/CoditT5/${dataset}/`.
-
-Notes:
-
-- CoditT5's input data file name ends with `.buggy`; CoditT5's target output (edit plan + generation) file name ends with `.fixed`; target generation file name ends with `.seq`.
-- CoditT5's input is in the form of `source_sequence context_sequence`; and CoditT5's output is in the form of `edit_plan target_sequence`
-- Raw data files are stored in `raw_data/` (we provide some examples for demo), processed data files are generated to `data/CoditT5/${dataset}`
-- Note that for the comment-update dataset, the processed `edit_plan` is the edits applied to the comment w/o parameter (@return, @param)
-
 ## Code for Training and Evaluating Models
 
 [sec-traineval]: #code-for-training-and-evaluating-models
 
 ### Train ML models
 
-Requires the dataset at `data/${model}/${dataset}/`, where `${model}` is the name of the model (CodeT5, CoditT5); `${dataset}` is the name of the dataset.
-
 ```
 cd python/
-./run.sh ${model}_train ${dataset}
+python -m deltr.coditT5.CodeT5 fit --exp_dir ${MODELS_DIR}/${model_name}/${dataset} --data.dataset ${dataset} --data.model ${model_name} --config configs/coditT5.yaml
 
-# Example: ./run.sh CoditT5_train comment-update
+# Example: python -m deltr.coditT5.CodeT5 fit --exp_dir models/edit-translation/java2cs --data.dataset java2cs --data.model edit-translation --config configs/coditT5.yaml
 ```
 
 Results are generated to `models/${model}/${dataset}/`, where:
@@ -157,73 +68,16 @@ Results are generated to `models/${model}/${dataset}/`, where:
 
 - `logs/`: stores logs during training.
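+
+The C# to Java direction should follow the same pattern; a minimal sketch, assuming a `cs2java` dataset produced by the processing step above:
+
+```
+cd python/
+# Train the edit-translation model for the C# to Java direction
+python -m deltr.coditT5.CodeT5 fit --exp_dir models/edit-translation/cs2java --data.dataset cs2java --data.model edit-translation --config configs/coditT5.yaml
+```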
-### Evaluate ML models
+### Run ML models for inference
 
 Requires the dataset at `data/${model}/${dataset}/`, and the trained model at `models/${model}/${dataset}/model/`.
 
 ```
 cd python/
-./run.sh ${model}_generate ${dataset}
+python -m deltr.coditT5.CodeT5 predict --exp_dir ${MODELS_DIR}/${model_name}/${dataset} --data.dataset ${dataset} --data.model ${model_name} --config configs/coditT5.yaml
 
-# Example: ./run.sh CoditT5_generate comment-update
 ```
 
 Results are generated to `models/${model}/${dataset}/`, where:
 
 - `output.hyp`: the predictions.
-
-### Compute automatic metrics
-
-Requires the model's predictions at `models/${model}/${dataset}/`. Note that the provided script assumes the names for the data files conform the what described in [Code for Processing Fine-tuning Data][sec-process]
-
-```
-./run.sh ${model}_eval ${dataset}
-
-# Example: ./run.sh CoditT5_eval comment-update
-```
-
-Results are generated to `results/`:
-
-- `results-${dataset}-${model}.json`: the average of automatic metrics.
-
-- `scores-${dataset}-${model}.json`: the list of automatic metrics per sample.
-
-## Code for Combining CodeT5 and CoditT5
-
-[sec-rerank]: #code-for-combining-codet5-and-coditt5
-
-Requires the dataset at `data/${model}/${dataset}/`, the trained models at `models/${model}/${dataset}/model/`.
-
-### Rerank Models' outputs
-
-```
-cd python/
-# Rerank CodeT5's outputs with CoditT5
-./run.sh CodeT5_rerank ${dataset}
-# Rerank CoditT5's outputs with CodeT5
-./run.sh CodeT5_rerank ${dataset}
-
-# Example: ./run.sh CoditT5_rerank comment-update
-```
-
-Main results are generated to `results/reranks/`:
-
-- `test-${dataset}-${model}-top-20-rerank-${reranker}-results.json`: `${model}`'s top 20 beam outputs and `${reranker}`'s likelihood score for each beam output.
-
-### Compute automatic metrics
-
-Requires the model's reranking results file
-`results/reranks/test-${dataset}-${model}-top-20-rerank-${reranker}-results.json`.
-
-```
-./run.sh eval_rerank_${model}_${reranker} ${dataset}
-
-# Example: compute metrics for top 1 CoditT5 prediction reranked by CodeT5
-./run.sh eval_rerank_CoditT5_CodeT5 comment-update
-```
-
-Results are generated to `results/`:
-
-- `results-${dataset}-${model}-rerank-${reranker}.json`: the average of automatic metrics.
-
-- `scores-${dataset}-${model}-rerank-${reranker}.json`: the list of automatic metrics per sample.
\ No newline at end of file
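+
+For instance, to run inference with the Java to C# model trained in the earlier example and inspect the first few predictions (paths assume that example's `--exp_dir`):
+
+```
+cd python/
+python -m deltr.coditT5.CodeT5 predict --exp_dir models/edit-translation/java2cs --data.dataset java2cs --data.model edit-translation --config configs/coditT5.yaml
+# Show the first few generated translations
+head models/edit-translation/java2cs/output.hyp
+```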