Commit

README updates
smsharma committed Mar 14, 2024
1 parent a3f46df commit a237563
Showing 1 changed file with 11 additions and 2 deletions.
13 changes: 11 additions & 2 deletions README.md
@@ -13,9 +13,9 @@
- [Paper draft](#paper-draft)
- [Requirements](#requirements)
- [Code overview](#code-overview)
- [Fine-tuned CLIP model and _Hubble_ data](#fine-tuned-clip-model-and-hubble-data)
- [Citation](#citation)


## Abstract

We present PAPERCLIP (Proposal Abstracts Provide an Effective Representation for Contrastive Language-Image Pre-training), a method which associates astronomical observations imaged by telescopes with natural language using a neural network model. The model is fine-tuned from a pre-trained Contrastive Language-Image Pre-training (CLIP) model using successful observing proposal abstracts and corresponding downstream observations, with the abstracts optionally summarized via guided generation using large language models (LLMs). Using observations from the Hubble Space Telescope (HST) as an example, we show that the fine-tuned model embodies a meaningful joint representation between observations and natural language through tests targeting image retrieval (i.e., finding the most relevant observations using natural language queries) and description retrieval (i.e., querying for astrophysical object classes and use cases most relevant to a given observation). Our study demonstrates the potential for using generalist foundation models rather than task-specific models for interacting with astronomical data by leveraging text as an interface.
@@ -26,14 +26,23 @@ We present PAPERCLIP (Proposal Abstracts Provide an Effective Representation for

## Requirements

The Python environment is defined in `environment.yml`. To create the environment run e.g.,
Since PyTorch and JAX can be [tricky to have under the same roof](https://github.com/google/jax/issues/18032), two Python environments are used: the one for downloading data and guided LLM summarization using `Outlines` is defined in `environment_outlines.yml`, and the one for training and evaluating the CLIP model is defined in `environment.yml`. To create an environment, run e.g.,
``` sh
mamba env create --file environment.yml
```

## Code overview


- The script for downloading the data is [download_data.py](download_data.py), the summarization script is [summarize.py](summarize.py), and the training script is [train.py](train.py).
- [notebooks/01_create_dataset.ipynb](notebooks/01_create_dataset.ipynb) is used to create the `tfrecords` data used for training.
- [notebooks/03_eval.ipynb](notebooks/03_eval.ipynb) creates the qualitative and quantitative evaluation plots.
- [notebooks/09_dot_product_eval.ipynb](notebooks/09_dot_product_eval.ipynb) generates additional quantitative evaluation results.
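
For readers unfamiliar with how a CLIP-style model supports the image retrieval described in the abstract, the following is a minimal sketch of ranking images against a text query by the dot product of normalized embeddings. It is not taken from the notebooks: the function names `normalize` and `rank_by_dot_product` are hypothetical, and the three-dimensional embeddings are toy values made up for illustration, standing in for the outputs of the text and image encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def rank_by_dot_product(query_emb, image_embs, top_k=3):
    """Return (index, score) pairs for the top_k images, best match first."""
    q = normalize(query_emb)
    scores = []
    for idx, emb in enumerate(image_embs):
        e = normalize(emb)
        scores.append((idx, sum(a * b for a, b in zip(q, e))))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy embeddings (hypothetical values, not real model outputs).
text_query = [1.0, 0.0, 0.0]
images = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [2.0, 0.0, 0.0],  # same direction as the query
    [1.0, 1.0, 0.0],  # 45 degrees from the query
]

ranking = rank_by_dot_product(text_query, images, top_k=2)
print([idx for idx, _ in ranking])  # prints [1, 2]
```

Normalizing both sides first means the ranking depends only on direction in the embedding space, not on vector magnitude, which is the usual convention for contrastively trained models.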

## Fine-tuned CLIP model and _Hubble_ data

_Coming soon._

## Citation

If you use this code, please cite our paper: