This is the repository for the ECIR 2024 Reproducibility Track paper "A Reproducibility Study of Goldilocks: Just-Right Tuning of BERT for TAR" [arXiv].

Xinyu Mao, Bevan Koopman, and Guido Zuccon. A Reproducibility Study of Goldilocks: Just-Right Tuning of BERT for TAR. In Advances in Information Retrieval: 46th European Conference on Information Retrieval (ECIR'24). Glasgow, UK. March 2024. 10.1007/978-3-031-56066-8_13

Datasets & Data Processing

How to get the datasets?

Large files under ./data-processing/rcv1-v2 and ./data-processing/jeb-bush can be downloaded from rcv1_path.txt, id.txt, and athome1.md5.

We also provide the category information tables for the three datasets used in this reproducibility paper.
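
After downloading, it can help to check each large file against a published checksum. The sketch below assumes that athome1.md5 holds MD5 digests, as its extension suggests; the helper names `md5_of_file` and `matches_md5` and the chunk size are illustrative and not part of this repo.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops as soon as read() returns empty bytes.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected_hex):
    """Compare a file's digest with an expected hex string, ignoring case and whitespace."""
    return md5_of_file(path) == expected_hex.strip().lower()
```

A call such as `matches_md5(downloaded_path, expected_digest)` returns `True` only when the file on disk is byte-identical to the one the digest was computed from.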


The optimizer and loss function are the same as in the original paper.

  • For BERT, further pre-training with masked language modelling uses the Adam optimizer with no weight decay, no warm-up period, and a learning rate of 5 × 10^-5. The amount of language model pre-training ranges from none at all on the target collection to ten iterations (epochs) over the collection. For classification fine-tuning, we use the Adam optimizer with a linear weight decay of 0.01, 50 warm-up steps, and an initial learning rate of 0.001.
  • For logistic regression, we use L2 regularization on the losses and fit to convergence with default settings in the active learning pipeline.
  • To better fit our GPUs, we increased the training and evaluation batch sizes for BERT to 100 and 1000, respectively; the original authors' code indicates that they used batch sizes of 26 and 600. We also found that mixed precision training (fp16) substantially reduces training time.
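
The settings above can be collected in one place for reference. This is a minimal sketch: the class, constant, and field names are hypothetical and do not come from this repo, and the warm-up helper assumes a linear ramp from zero to the initial learning rate, which the description above does not spell out. What happens to the rate after warm-up is not stated here, so the helper simply holds it constant.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Hyperparameters for one BERT training stage (field names are illustrative)."""
    learning_rate: float
    weight_decay: float
    warmup_steps: int


# Values as stated above; both stages use the Adam optimizer.
MLM_PRETRAINING = StageConfig(learning_rate=5e-5, weight_decay=0.0, warmup_steps=0)
CLASSIFICATION = StageConfig(learning_rate=1e-3, weight_decay=0.01, warmup_steps=50)

# Batch sizes as (train, eval): ours versus those found in the original authors' code.
BATCH_SIZES_OURS = (100, 1000)
BATCH_SIZES_ORIGINAL = (26, 600)


def warmup_learning_rate(config, step):
    """Learning rate at a 0-based step, assuming a linear warm-up (an assumption)."""
    if step < config.warmup_steps:
        return config.learning_rate * step / config.warmup_steps
    # Post-warm-up behaviour is not specified above; held constant in this sketch.
    return config.learning_rate
```

Under these assumptions, `warmup_learning_rate(CLASSIFICATION, 25)` is half the initial rate, while the pre-training stage has no warm-up and always returns its base rate.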


We adapted the code from the original authors to our own experimental environment; please check the comments inside each file where necessary.

  • Environment

    For our setting (3 × A100 GPUs), we use python=3.8 and cuda=11.7. Run

    conda env create -f env.yml

    and install the dev version of libact for active learning as below.

    To install the env provided by the original authors, run

    conda create -n huggingface python=3.7
    conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch
    pip install transformers
    conda install numpy scipy pandas scikit-learn nltk tqdm Cython joblib
    LIBACT_BUILD_HINTSVM=0 LIBACT_BUILD_VARIANCE_REDUCTION=0 pip install -e ~/repositories/libact

    N.B. The libact active learning library used here is the dev version from the original authors.

  • Tokenization

    • For reproducibility on RCV1-v2 and Jeb-Bush, run

      python3 rcv1

    and change the option for both rcv1 and jb.

    • For generalizability on CLEF collections, run

      python3 2017/test
      • With BioLinkBert-base, run
      python3 2017/test

      and the option for CLEF collections ranges from 2017/test to 2019/intervention/train

  • Further pre-training with mlm-finetuning

    • For reproducibility on RCV1-v2 and Jeb-Bush, run


    and check the comments inside for both rcv1 and jb.

    • For generalizability on CLEF collections, run

      python3 2017/test
      • With BioLinkBert-base, check the comments inside

      and the option for CLEF collections ranges from 2017/test to 2019/intervention/train.

  • Reproduce goldilocks-tar

    • For reproducibility on RCV1-v2 and Jeb-Bush, run

      python3 --category 434 \
          --cached_dataset ./cache_new/jb50sub_org_bert-base.512.pkl.gz \
          --dataset_path  ./jb_info/ \
          --model_path ./mlm-finetune/bert-base_jb50sub_ep10 \
          --output_path  ./results/jb/ep10/

      and change the options for rcv1 and jb with the corresponding categories. Here, ep refers to the number of further pre-training epochs from the previous stage.

    • For generalizability on CLEF collections, run

      python3 --topic CD012019 \
          --cached_dataset ./cache_new/clef/clef2017_test_CD012019_org_bert-base.512.pkl.gz \
          --dataset_path  ./clef_info/2017/test/CD012019 \
          --output_path  ./results/clef/ep2/clef17_test/ \
          --batch_size 25 \
          --model_path ./mlm-finetune/clef/2017/bert-base_clef_2017_test_CD012019_ep2

      and change the options for the other CLEF collections and their corresponding topics.

      • With BioLinkBert-base, run the corresponding biolink version, such as
      python3 --topic CD011984 \
        --cached_dataset ./cache_new/clef_biolink/clef2017_train_CD011984_biolink_bert-base.512.pkl.gz \
        --dataset_path  ./clef_info/2017/train/CD011984 \
        --output_path  ./results/biolink/ep0/clef17_train/ \
        --batch_size 25 \
        --model_path  michiyasunaga/BioLinkBERT-base

      and change the options for the other CLEF collections and their corresponding topics.
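
Because the CLEF runs differ only in collection, topic, and number of pre-training epochs, the argument list can be generated rather than edited by hand. The sketch below extrapolates the path patterns from the single 2017/test example above, so they are assumptions that may not hold for every collection (for instance the 2019 ones). The function name is hypothetical, and the entry-point script is passed in as `script` because it is not named in the commands above.

```python
import shlex
from typing import List


def clef_goldilocks_command(script: str, year: str, split: str, topic: str,
                            epochs: int, batch_size: int = 25) -> List[str]:
    """Build the command for one CLEF topic (path patterns are extrapolated)."""
    tag = f"clef{year}_{split}"            # e.g. clef2017_test
    short_tag = f"clef{year[2:]}_{split}"  # e.g. clef17_test
    return [
        "python3", script,
        "--topic", topic,
        "--cached_dataset", f"./cache_new/clef/{tag}_{topic}_org_bert-base.512.pkl.gz",
        "--dataset_path", f"./clef_info/{year}/{split}/{topic}",
        "--output_path", f"./results/clef/ep{epochs}/{short_tag}/",
        "--batch_size", str(batch_size),
        "--model_path",
        f"./mlm-finetune/clef/{year}/bert-base_clef_{year}_{split}_{topic}_ep{epochs}",
    ]


def as_shell(command: List[str]) -> str:
    """Render the argument list as a single shell-quoted string for inspection."""
    return shlex.join(command)
```

Printing `as_shell(...)` before running anything makes it easy to compare the generated paths with the files actually present on disk; the list itself can be handed to `subprocess.run` once the paths are confirmed.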


  • Feature engineering

    • For reproducibility on RCV1-v2 and Jeb-Bush, run

      python3 rcv1

      and change the option for both rcv1 and jb.

    • For generalizability on CLEF collections, run

      python3 2019/intervention/train

      and the option for CLEF collections ranges from 2017/test to 2019/intervention/train.

  • Reproduce goldilocks-tar baseline with logistic regression

    • For reproducibility on RCV1-v2 and Jeb-Bush, run

      python3 --category 434 \
          --cached_dataset ./jb_info/jb_sampled_features.pkl \
          --dataset_path  ./jb_info \
          --output_path  ./results/baseline/jb/

      and change the options for rcv1 and jb with corresponding categories.

    • For generalizability on CLEF collections, run

      python3 --dataset clef2017_train \
          --topic CD011984 \
          --batch_size 25 \
          --cached_dataset ./clef_info/lr_features/clef2017_train_CD011984_features.pkl \
          --dataset_path  ./clef_info/2017/train/CD011984 \
          --output_path  ./results/baseline/clef/clef17_train/

      and change the options for the CLEF collection features, which range from clef2017_train to clef2019_intervention_train, with the corresponding topics.


  • R-Precision

    Please refer to the corresponding script under ./utils.

  • Review Cost

    Please refer to the corresponding scripts under ./utils.

  • Statistical significance test

    Please refer to the corresponding scripts under ./utils.
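
For orientation, R-Precision is the precision at rank R, where R is the number of relevant documents for a topic. The sketch below follows that standard definition; the function name and input format are assumptions, and the code under ./utils remains the reference for how the reported numbers were computed.

```python
def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents.

    ranking: document ids ordered from most to least likely relevant.
    relevant: iterable of ids judged relevant for the topic.
    """
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        raise ValueError("R-Precision is undefined without relevant documents")
    # Count how many of the top-R ranked documents are actually relevant.
    hits = sum(1 for doc_id in ranking[:r] if doc_id in relevant)
    return hits / r
```

With two relevant documents, only the top two ranked positions are inspected, so a ranking that places one of them third scores 0.5.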


If you find this repo useful for your research, please kindly cite the following paper:

  author       = {Xinyu Mao and Bevan Koopman and Guido Zuccon},
  title        = {A Reproducibility Study of Goldilocks: Just-Right Tuning of BERT for TAR},
  booktitle    = {European Conference on Information Retrieval},
  series       = {ECIR '24},
  pages        = {132--146},
  publisher    = {Springer},
  year         = {2024},
  doi          = {10.1007/978-3-031-56066-8\_13},


If you have any questions, feel free to contact xinyu.mao [AT] (with [AT] replaced by @).