Code for DeMuX: Active-learning-based data-selection strategies for efficient annotation of multilingual data. This project is licensed under the terms of the MIT license.
- Environment
- Reproducing Paper Results
- Baselines
- Running for multiple budgets
- Notes on experimentation based on our observations
- Running with custom models and datasets
- Additional Support/Issues?
- Citation
## Environment

We use Miniconda to manage the environment. Our Python version is 3.10.11. To create the environment, run the following command:

```bash
conda env create -f environment.yml -n demux-env
```

To activate the environment, run the following command:

```bash
conda activate demux-env
```
## Reproducing Paper Results

We currently support the following datasets:

- Token-level:
  - `udpos`: Universal Dependencies v2.5
  - `PAN-X`: WikiAnn
- Sequence-level:
  - `xnli`: XNLI
- Question Answering:
  - `tydiqa`: TyDiQA-GoldP
- Machine Translation (the English fine-tuning step below is skipped for MT):
  - `opus`: OPUS-100
  - `nllb`: NLLB
  - `flores200`: Flores-200
The models supported in the scripts are `xlm-roberta-large`, `infoxlm-large` and `rembert`. First, we fine-tune multilingual models on labelled data in English for the task. To do this, run the following command:

```bash
bash scripts/train/run_ft_en.sh [MODEL] [DATASET]
```

For example, to fine-tune `xlm-roberta-large` on `udpos`, run the following command:

```bash
bash scripts/train/run_ft_en.sh xlm-roberta-large udpos
```

One can also fine-tune the model on other languages if desired, by specifying the language code in the `run_ft_en.sh` script.
Note that for TyDiQA, we first fine-tune the model on SQuAD v2, given the small size of TyDiQA. Clone the HF `transformers` repo in your home directory. To ensure that you're working with the same version of the codebase, you can check out this commit:

```bash
git checkout dadc9fb4275f4b7c2984d16d6d9a7880ec76d872
```

You will also need to change the environment, since the `transformers` version required to run this step is different. The minimum `transformers` version required is `4.31.0.dev0`. The rest of the requirements can be found here. Once the dependencies are installed, you can fine-tune a model as follows:

```bash
bash scripts/train/run_ft_squad_v2.sh [MODEL]
```
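For reference, a minimal sketch of the `transformers` setup described above might look like the following (the clone location and the editable `pip install -e .` are assumptions; adapt them to your environment and the requirements linked above):

```bash
# Clone the HF transformers repo into the home directory (assumed location)
cd ~
git clone https://github.com/huggingface/transformers.git
cd transformers

# Pin to the commit referenced above
git checkout dadc9fb4275f4b7c2984d16d6d9a7880ec76d872

# Install this checkout (editable install is an assumption; use your preferred method)
pip install -e .
```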
We describe five source and target language configurations in our paper: `target_hp`, `target_mp`, `target_lp`, `target_geo` and `target_lp-pool`. Additionally, we have six supported strategies: `random`, `egalitarian`, `average_dist`, `knn_uncertainty`, `uncertainty` and `gold`. Please refer to the paper for details of the configurations and strategies. To fine-tune the above models for a configuration and strategy, run the following command:

```bash
bash scripts/train/run_ft_al.sh [FT-MODEL_PATH] [DATASET] [CONFIG] [STRATEGY]
```
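As an illustration, a concrete call might look like the following (the checkpoint path is hypothetical; pass whatever output directory your English fine-tuning run produced):

```bash
# Hypothetical checkpoint path from the English fine-tuning step
bash scripts/train/run_ft_al.sh outputs/xlm-roberta-large_udpos_en udpos target_lp average_dist
```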
Note that the configs for all datasets are given in `scripts/train/dataset-configs.yaml`. This file can be updated with custom dataset details.
To fine-tune a model on all strategies, we have wandb sweep configs defined in `scripts/train/wandb/configs`. These configs can be created by running the following command:

```bash
bash scripts/train/wandb/create_configs.sh
```

To create a sweep job for each config, run the following command:

```bash
bash scripts/train/wandb/create_config_jobs.sh
```

This will create a sweep job for each config in `scripts/train/wandb/jobs`.
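If you prefer to drive a sweep by hand rather than through the generated job files, the standard wandb CLI flow is roughly the following (the config filename here is hypothetical; `wandb sweep` prints the sweep ID that `wandb agent` expects):

```bash
# Register a sweep from one of the generated configs (filename is hypothetical)
wandb sweep scripts/train/wandb/configs/<your_sweep_config>.yaml

# Start an agent with the entity/project/sweep ID printed by the command above
wandb agent <entity>/<project>/<sweep_id>
```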
We also support multi-GPU training using HF `accelerate`. A sample script can be found here: `scripts/train/sample_multi-gpu.sh`.

Support for MT was added after the other supported tasks. Please find sample run scripts for MT under `scripts/train/wandb/jobs/mt`.
To collect results across the strategies run so far, run the following command. This will aggregate results over all strategies for each configuration, and also collect the deltas in performance between strategies for all configs.

```bash
bash scripts/evaluation/run_collect_results_all.sh
```

To collect results for a specific model and target configuration, run the following command. This will aggregate results over all strategies for the specified configuration.

```bash
bash scripts/evaluation/run_collect_results_single_model.sh [MODEL_PATH]
```
To visualize the embeddings of the unlabelled source data, target data and the selected data, for all models and configs, run the following command. By default, this will generate a plot for the data selected in the last round, which includes data selected in all previous rounds. However, this can be controlled by setting the `al_round_to_show` variable in the script.

```bash
bash scripts/evaluation/run_visualize_embeddings_all.sh
```

To visualize the embeddings of the unlabelled source data, target data and the selected data, for a specific model and config, run the following command:

```bash
bash scripts/evaluation/run_visualize_embeddings_single_model.sh [EMBEDDING_PATH]
```
- Find uncertainty and loss correlations between target points and their neighborhoods: `bash scripts/train/run_find_correlation.sh`
- Get lang2vec distances between all language pairs for a given dataset: `bash scripts/helper/l2v/run_l2v_distances.sh`
## Baselines

`RANDOM` and `EGALITARIAN` are already included in our fine-tuning scripts above.

`LITMUS`: Run the following bash scripts to get augmentation suggestions for the configs in our paper:

- `udpos`: `bash scripts/baselines/litmus/run_litmus_udpos.sh`
- `PAN-X`: `bash scripts/baselines/litmus/run_litmus_PAN-X.sh`
- `xnli`: `bash scripts/baselines/litmus/run_litmus_xnli.sh`
- `tydiqa`: `bash scripts/baselines/litmus/run_litmus_tydiqa.sh`
## Running for multiple budgets

We evaluate the applicability of DeMuX for data selection under very limited budgets, and without multiple rounds of fine-tuning. We find our strategies to be most effective (over random selection) under low-budget scenarios (<100 samples), with diminishing returns as the budget increases. To run experiments for multiple budgets and across all strategies, modify and run the following script:

```bash
bash scripts/train/run_mul-bud_all.sh
```
## Notes on experimentation based on our observations

- The larger the source dataset, the better the performance (in general). This is because the strategies have a lot of data to choose from, making a random selection sub-optimal.
- With increasing budgets come diminishing returns. Since we are dealing with fine-tuned models, we observe that we don't need multiple rounds of fine-tuning (1-2 rounds suffice).
- For distance-based strategies, the data selected is a reflection of the quantity and quality of your unlabelled target dataset. With better unlabelled target data, the efficacy of the strategies improves.
- Token-level, sequence-level and MT tasks benefit from distance-based strategies. For QA, our uncertainty strategy works best, but there is potential to improve our KNN-UNCERTAINTY strategy such that we can give more importance to uncertainty.
- With competent model representations come better estimates of distance and uncertainty, and hence more effective data selection.
- If the source data contains multiple languages that are similar to your target in different ways (syntactic, phonological or geographic distance, language family, etc.), the strategies will be more effective across a variety of tasks and datasets.
## Running with custom models and datasets

With regard to models, the scripts should work with any model supported by HuggingFace. Simply specify the model identifier in the scripts. The variables one would need to specify/change according to the model they fine-tune are as follows: `LR` (learning rate), `epochs` and `max_seq_length`. This can be easily done in `run_ft_en.sh`.
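As a rough sketch, assuming these are plain shell variables near the top of `run_ft_en.sh` (the actual script layout may differ), the edit amounts to something like:

```bash
# Illustrative values only -- tune per model; variable names follow the README
LR=2e-5              # learning rate
epochs=10
max_seq_length=256
```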
With regard to datasets, the scripts should work with any dataset supported by HuggingFace OR custom dataset files (saved in arrow format). If using an HF dataset, one must specify the following: `source_languages`, `target_languages` and `dataset`. If using custom dataset files, one must specify `source_dataset_path` and `target_dataset_path`. The `source_dataset_path` and `target_dataset_path` must be directories containing the following files:

```
source_dataset_path:
- train
- validation

target_dataset_path:
- target
- test
```
In the above, `train` and `validation` contain the train and validation splits for the source data, `target` is the unlabelled target data, and `test` is the test data we want to eventually improve performance on. Each directory can have the following files: `data-*of*.arrow`, `dataset_info.json` and `state.json`. A sample script to convert a CSV file to this format is given in `convert_csv_to_arrow.py`. Please include a language column. If the language is unknown, add "unknown" as the language column value. This is needed to calculate statistics of the selected data.
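The repo's `convert_csv_to_arrow.py` is the reference for this; purely as an illustration of the on-disk format, a quick conversion with the HF `datasets` library could be run from the shell like this (the file names, column handling and output split directory are assumptions):

```bash
python - <<'EOF'
import pandas as pd
from datasets import Dataset

# Hypothetical input CSV for the source train split
df = pd.read_csv("my_source_train.csv")

# A language column is required; fall back to "unknown" if it is not known
if "language" not in df.columns:
    df["language"] = "unknown"

# save_to_disk writes data-*of*.arrow, dataset_info.json and state.json,
# which is the layout the training scripts expect
Dataset.from_pandas(df).save_to_disk("my_source_data/train")
EOF
```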
One can even specify a custom dataset for source/target and use an HF dataset for the other. Note that if a custom dataset is provided, it will always override HF for the source or target.
All dataset configs can be found in `scripts/train/dataset-configs.yaml`. Make sure to add your custom dataset config to this file. If `remove_columns` is not specified, the script removes all columns except those obtained after tokenization and pre-processing (for example, `input_ids`, `attention_mask` and `labels`).
To fine-tune a model on a custom dataset, run the following command:

```bash
bash scripts/train/run_ft_en.sh [MODEL] [DATASET_NAME] [SOURCE_LANGUAGES] [TARGET_LANGUAGES] [MODEL_PATH] [SOURCE_DATASET_PATH] [TARGET_DATASET_PATH]
```

Note that many of the above arguments are optional. For example, if you want to fine-tune a model on a custom dataset, with default source and target languages and no custom model path, you can run the following command:

```bash
bash scripts/train/run_ft_en.sh xlm-roberta-large custom "" "" "" <path_to_source_data> <path_to_target_data>
```
The scripts under `scripts/train/wandb/jobs/mt`, for example, use a custom dataset as the target and an HF dataset as the source.