Commit b2d34a6: Update README

JiyangZhang committed Dec 4, 2023 (1 parent: 2092a68)

1 changed file (README.md) with 12 additions and 158 deletions

<!-- ## News
**Aug 2023**
**Pretrained CoditT5** model is released on 🤗 ! 🔥
[link](https://huggingface.co/JiyangZhang/CoditT5)
\
Note: It is recommended to fine-tune it before applying it to downstream tasks. -->
## Introduction

This repo contains the code and artifacts for reproducing the experiments in [CoditT5: Pretraining for Source Code and Natural Language Editing](https://arxiv.org/abs/2208.05446) and [Multilingual Code Co-Evolution Using Large Language Models](https://arxiv.org/abs/2307.14991).
In these works, we introduce CoditT5, a large language model pretrained with a novel objective to explicitly model edits, which sets the state of the art for downstream tasks including comment updating, bug fixing, and automated code review; and Codeditor, for co-evolving software implemented in multiple programming languages.

The code includes:

- scripts for collecting and processing the dataset
- scripts for training and evaluating Codeditor models

The artifacts include:

- Java to C# raw paired changes
- Java to C# translation dataset processed for Codeditor models
- checkpoints for the Codeditor models fine-tuned for Java to C# translation and C# to Java translation

## Table of Contents


1. [How to Use][sec-howto]
2. [Dependency][sec-dependency]
3. [Data Downloads][sec-downloads]
4. [Code for Pretraining][sec-pretrain]
5. [Code for Processing Fine-tuning Data][sec-process]
6. [Code for Training and Evaluating Models][sec-traineval]
7. [Code for Combining CodeT5 and CoditT5][sec-rerank]


## Dependency

[sec-dependency]: #dependency

Our code requires the following hardware and software environment.

- Operating system: Linux (tested on Ubuntu 20.04)
- Minimum disk space: 4 GB
- Python: 3.8
- Anaconda/Miniconda: appropriate versions for Python 3.8 or higher

Additional requirements for training and evaluating ML models:

- GPU: NVIDIA GTX 1080 Ti or better (with >= 11GB memory)
- CUDA: 10.2 or 11.3
- Disk space: 2 GB per trained model
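To sanity-check the environment before training (illustrative commands; `nvidia-smi` and `nvcc` are available once the NVIDIA driver and CUDA toolkit are installed):

```
# Check GPU model, driver, and CUDA versions
nvidia-smi
nvcc --version
# Check the Python version of the active environment
python --version
```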

[Anaconda](https://www.anaconda.com/products/individual#Downloads) or [Miniconda](https://docs.conda.io/en/latest/miniconda.html) is required for installing the other Python library dependencies. Once Anaconda/Miniconda is installed, you can use the following commands to set up a virtual environment, named `deltr`, with the Python library dependencies installed:

```
cd python/
./prepare_conda_env.sh
```

Then use `conda activate deltr` to activate the created virtual environment.
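As a rough sketch of what such a setup script typically does (the exact steps and the requirements file name are assumptions; `prepare_conda_env.sh` itself is authoritative):

```
# Hypothetical outline of prepare_conda_env.sh
conda create -n deltr python=3.8 -y
conda activate deltr
pip install -r requirements.txt   # assumed file name; see the script for the real dependency list
```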

## Data Downloads

[sec-downloads]: #data-downloads

All our data is hosted on UTBox via [a shared folder](https://utexas.box.com/s/iwcvwgx23g9xvowu9joa661rz74k9eea) (also available as [a zip file](https://utexas.box.com/s/9rkqnlp6wjhwyfmxce97pgb4ersfc1f9)).

Data should be downloaded to this directory with the same directory structure (e.g., `data/` from the shared folder should be downloaded as `data/` under the current directory).
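For orientation, the paths referenced throughout this README suggest a layout along these lines (a sketch; download whatever the shared folder actually contains):

```
.
├── data/       # processed datasets, e.g. data/CoditT5/${dataset}/
├── raw_data/   # raw data files used by the processing scripts
├── models/     # trained models and predictions, e.g. models/${model}/${dataset}/
└── results/    # automatic metrics and reranking results
```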

## Code for Pretraining

[sec-pretrain]: #code-for-pretraining

### Synthesize Pretraining Data

We provide sample scripts to synthesize the pretraining dataset (by corrupting programming language code snippets and natural language comments) for CoditT5.

First, prepare the programming language and natural language data for pretraining; then specify the following variables in the function `corrupt_pretrain_data()` in `python/run.sh` (illustrative values are sketched after the list):

- `source_pl_file`: the path of the data file where each line is a programming language function;
- `tokenized_pl_file`: the path of the tokenized version of `source_pl_file`;
- `corrupt_pl_file`: the corrupted version of `tokenized_pl_file`, which is the input to the pretrained model;
- `source_nl_file`: the path of the data file where each line is a natural language sequence;
- `tokenized_nl_file`: the path of the tokenized version of `source_nl_file`;
- `corrupt_nl_file`: the corrupted version of `tokenized_nl_file`, which is the input to the pretrained model.
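For illustration, hypothetical values for these variables might look like the following (the file names and locations are made up; point them at your own data):

```
# Inside corrupt_pretrain_data() in python/run.sh (hypothetical values)
source_pl_file=raw_data/pretrain/functions.java        # one function per line
tokenized_pl_file=data/CoditT5/pretrain/functions.tok
corrupt_pl_file=data/CoditT5/pretrain/functions.corrupt
source_nl_file=raw_data/pretrain/comments.txt          # one comment per line
tokenized_nl_file=data/CoditT5/pretrain/comments.tok
corrupt_nl_file=data/CoditT5/pretrain/comments.corrupt
```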

```
cd python/
./run.sh corrupt_pretrain_data
```

### Pretrain CoditT5

Requires the pretraining dataset at `data/CoditT5/pretrain/`.

```
cd python/
./run.sh pretrain_CoditT5
```

## Code for Processing Fine-tuning Data

[sec-process]: #code-for-processing-fine-tuning-data

We provide sample scripts to process the downstream datasets for CoditT5 as well as the edit-translation datasets for Codeditor. Requires the raw data files at `raw_data/`.

```
cd python/
# Process a CoditT5 downstream dataset:
./run.sh process_coditT5_dataset --dataset ${dataset}
# Example: ./run.sh process_coditT5_dataset --dataset comment-update

# Process the edit-translation (Codeditor) dataset, e.g., C# -> Java:
python -m deltr.collector.DataProcessor edit_translation_data_process --exp cs2java --src_lang cs --tgt_lang java
```

Where `${dataset}` is the name of the dataset (comment-update, code-review, bf-small, bf-medium). The data files are generated to `data/CoditT5/${dataset}/`.

Notes:

- CoditT5's input data file name ends with `.buggy`; CoditT5's target output (edit plan + generation) file name ends with `.fixed`; the target generation file name ends with `.seq`.
- CoditT5's input is in the form of `source_sequence </s> context_sequence`, and CoditT5's output is in the form of `edit_plan <s> target_sequence`.
- Raw data files are stored in `raw_data/` (we provide some examples for demo); processed data files are generated to `data/CoditT5/${dataset}/`.
- Note that for the comment-update dataset, the processed `edit_plan` consists of the edits applied to the comment without the parameter tags (@return, @param).
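Putting the notes above together, the processed directory for a dataset is expected to look roughly like this (split names such as `train`/`valid`/`test` are assumed for illustration; only the `.buggy`/`.fixed`/`.seq` suffixes are prescribed above):

```
data/CoditT5/comment-update/
├── train.buggy    # model input:  source_sequence </s> context_sequence
├── train.fixed    # model target: edit_plan <s> target_sequence
├── train.seq      # target generation only
└── ...            # same naming assumed for the other splits
```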

## Code for Training and Evaluating Models

[sec-traineval]: #code-for-training-and-evaluating-models

### Train ML models

Requires the dataset at `data/${model}/${dataset}/`, where `${model}` is the name of the model (CodeT5, CoditT5) and `${dataset}` is the name of the dataset.

```
cd python/
# For CodeT5 / CoditT5 models:
./run.sh ${model}_train ${dataset}
# Example: ./run.sh CoditT5_train comment-update

# For Codeditor (edit-translation) models:
python -m deltr.coditT5.CodeT5 fit --exp_dir ${MODELS_DIR}/${model_name}/${dataset} --data.dataset ${dataset} --data.model ${model_name} --config configs/coditT5.yaml
# Example: python -m deltr.coditT5.CodeT5 fit --exp_dir models/edit-translation/java2cs --data.dataset java2cs --data.model edit-translation --config configs/coditT5.yaml
```

Results are generated to `models/${model}/${dataset}/`, where:

- `logs/`: stores logs during training.

### Run ML models to do inference

Requires the dataset at `data/${model}/${dataset}/` and the trained model at `models/${model}/${dataset}/model/`.

```
cd python/
# For CodeT5 / CoditT5 models:
./run.sh ${model}_generate ${dataset}
# Example: ./run.sh CoditT5_generate comment-update

# For Codeditor (edit-translation) models:
python -m deltr.coditT5.CodeT5 predict --exp_dir ${MODELS_DIR}/${model_name}/${dataset} --data.dataset ${dataset} --data.model ${model_name} --config configs/coditT5.yaml
```

Results are generated to `models/${model}/${dataset}/`, where:

- `output.hyp`: the predictions.
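To peek at the first few predictions (assuming one prediction per line, with `${model}` and `${dataset}` as in the examples above):

```
head -n 3 models/CoditT5/comment-update/output.hyp
```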

### Compute automatic metrics

Requires the model's predictions at `models/${model}/${dataset}/`. Note that the provided script assumes the names of the data files conform to what is described in [Code for Processing Fine-tuning Data][sec-process].

```
./run.sh ${model}_eval ${dataset}
# Example: ./run.sh CoditT5_eval comment-update
```

Results are generated to `results/`:

- `results-${dataset}-${model}.json`: the average of automatic metrics.

- `scores-${dataset}-${model}.json`: the list of automatic metrics per sample.
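To inspect the averaged metrics, pretty-print the JSON file (file name assembled from the pattern above; the metrics reported depend on the dataset):

```
python -m json.tool results/results-comment-update-CoditT5.json
```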

## Code for Combining CodeT5 and CoditT5

[sec-rerank]: #code-for-combining-codet5-and-coditt5

Requires the dataset at `data/${model}/${dataset}/` and the trained models at `models/${model}/${dataset}/model/`.

### Rerank models' outputs

```
cd python/
# Rerank CodeT5's outputs with CoditT5
./run.sh CodeT5_rerank ${dataset}
# Rerank CoditT5's outputs with CodeT5
./run.sh CoditT5_rerank ${dataset}
# Example: ./run.sh CoditT5_rerank comment-update
```

Main results are generated to `results/reranks/`:

- `test-${dataset}-${model}-top-20-rerank-${reranker}-results.json`: `${model}`'s top 20 beam outputs and `${reranker}`'s likelihood score for each beam output.

### Compute automatic metrics

Requires the model's reranking results file
`results/reranks/test-${dataset}-${model}-top-20-rerank-${reranker}-results.json`.

```
./run.sh eval_rerank_${model}_${reranker} ${dataset}
# Example: compute metrics for top 1 CoditT5 prediction reranked by CodeT5
./run.sh eval_rerank_CoditT5_CodeT5 comment-update
```

Results are generated to `results/`:

- `results-${dataset}-${model}-rerank-${reranker}.json`: the average of automatic metrics.

- `scores-${dataset}-${model}-rerank-${reranker}.json`: the list of automatic metrics per sample.
