Scripts

Tiago Jesu edited this page Oct 25, 2018 · 3 revisions

Brief description of the scripts

The pATLAS repository contains several Python scripts that compute the distance matrix using mash and fetch the taxonomic annotation for each sequence. Another script dumps abricate outputs to the psql database that is generated while running the pATLAS-db-creation workflow (see detailed instructions here). The remaining scripts are legacy: they are no longer used to deploy the website, but they are still maintained in case they are needed, for example for debugging.

MASHix.py

This is the main script, used to generate the pATLAS matrix of pairwise distances. It collects the fasta files given as input, merges them into a single concatenated fasta, and then feeds it to a function that splits every entry into its own fasta file. These single-entry fastas are given to mash dist so that the pairwise distances can be computed in parallel. All the distances are then written to a JSON file called import_to_vivagraph.json, which is used to render the vivagraph nodes and links.
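The split-and-collect steps described above can be sketched as follows. This is a minimal illustration, not the code in MASHix.py: the function names, the distance cutoff of 0.1, and the layout of the resulting dictionary are assumptions, and the sample text only mimics the five tab-separated columns that mash dist prints (reference, query, distance, p-value, shared hashes).

```python
import json


def split_fasta(text):
    """Split concatenated FASTA text into {identifier: sequence} entries."""
    chunks = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the identifier.
            current = line[1:].split()[0]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
    return {name: "".join(parts) for name, parts in chunks.items()}


def links_from_mash_dist(tsv_text, max_distance=0.1):
    """Collect links from mash dist tabular output.

    max_distance is a hypothetical cutoff chosen for this example.
    """
    links = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        reference, query, distance, _pvalue, _shared = line.split("\t")
        if reference == query:
            continue  # skip self comparisons
        if float(distance) <= max_distance:
            links.setdefault(reference, []).append(
                {"target": query, "distance": float(distance)}
            )
    return links


# Made-up identifiers and values, for illustration only.
fasta_text = ">plasmid_A first entry\nACGT\nAC\n>plasmid_B\nGGTT\n"
mash_text = (
    "plasmid_A\tplasmid_B\t0.05\t0.0\t500/1000\n"
    "plasmid_A\tplasmid_C\t0.3\t0.01\t10/1000\n"
    "plasmid_A\tplasmid_A\t0\t0\t1000/1000\n"
)

records = split_fasta(fasta_text)
links = links_from_mash_dist(mash_text)
print(json.dumps(links, indent=2))
```

Only the pair below the cutoff survives in this example, so the printed JSON holds a single link from plasmid_A to plasmid_B.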

Other outputs:

  • A mash sketch of the entire plasmid database available in pATLAS, which should then be updated in the docker images used for mash-based analyses whose results can be imported into pATLAS. For now, these are the FlowCraft docker image and pATLASflow.
  • The lengths of all plasmids in the database, exported to a JSON file. This file should also be added to the mapping docker images so that the scripts used for the import can calculate the coverage of a given plasmid in the sequencing results. For now, it is available in the FlowCraft docker image and pATLASflow.
  • The bowtie2 and samtools indexes used in mapping approaches, which are available in the same docker images as the ones described above (FlowCraft docker image and pATLASflow).

These other outputs are made available in each pATLASflow release.
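To illustrate why the plasmid lengths are needed downstream, the sketch below computes a coverage fraction from per-position read depths. It is an assumption-laden example rather than the code shipped in the docker images: the function names, the JSON layout (identifier mapped to length), and the definition of coverage as the fraction of positions with at least one read are all hypothetical.

```python
import json


def sequence_lengths(records):
    """Map each sequence identifier to its length."""
    return {name: len(seq) for name, seq in records.items()}


def coverage_fraction(depth_per_position, plasmid_length):
    """Fraction of plasmid positions covered by at least one read."""
    if plasmid_length <= 0:
        raise ValueError("plasmid_length must be positive")
    covered = sum(1 for depth in depth_per_position if depth > 0)
    return covered / plasmid_length


# Made-up plasmid and depths, for illustration only.
lengths = sequence_lengths({"plasmid_A": "ACGTACGTAC"})
lengths_json = json.dumps(lengths)  # what would be written to the JSON file

depths = [3, 2, 0, 0, 1, 1, 1, 0, 4, 5]
coverage = coverage_fraction(depths, json.loads(lengths_json)["plasmid_A"])
print(coverage)
```

Seven of the ten positions have a non-zero depth here, so the plasmid is 70% covered.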

Additional outputs:

  • An sql file with the current database is made available in each pATLAS release, so that the service can be easily launched elsewhere using patlas-compose.
  • A text file listing all the entries removed from the database relative to the NCBI refseq database, with the reason why each one was removed.

utils/taxa_fetch.py

This script can run as a standalone command-line tool (using argparse) or as a module; by default it is executed inside MASHix.py. It crawls the NCBI taxonomy given a list of species as input. In pATLAS it is used to dump the list of species to the database and to generate a file called taxa_tree.json, which is then served in the following view: /taxa. Therefore, every time a new database is created this view needs to be updated as well. To do so, just copy the new taxa_tree.json into this directory. This file is then used to populate the dropdowns available in pATLAS for taxonomic classifications.
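As a rough illustration of how a species list can be turned into a nested structure for taxonomic dropdowns, consider the sketch below. The real layout of taxa_tree.json is not described here, so the order, family, genus nesting, the function name, and the hard-coded lineages (which stand in for what would be fetched from the NCBI taxonomy) are assumptions.

```python
import json


def build_taxa_tree(lineages):
    """Nest species under order > family > genus.

    lineages maps a species name to a (genus, family, order) tuple.
    """
    tree = {}
    for species, (genus, family, order) in sorted(lineages.items()):
        genera = tree.setdefault(order, {}).setdefault(family, {})
        genera.setdefault(genus, []).append(species)
    return tree


# Hard-coded lineages stand in for the crawled taxonomy.
lineages = {
    "Klebsiella pneumoniae": ("Klebsiella", "Enterobacteriaceae", "Enterobacterales"),
    "Escherichia coli": ("Escherichia", "Enterobacteriaceae", "Enterobacterales"),
}

taxa_tree = build_taxa_tree(lineages)
print(json.dumps(taxa_tree, indent=2))
```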

weirdos (-w parameter)

This parameter is used to remove entries from the queries that may refer to conflicting species, that is, cases in which a bacterium has exactly the same scientific name as an organism from a different kingdom.

utils/crowd_curation.py

This script is imported by MASHix.py and is used as a blacklist of accession numbers that should not be added to the database because they aren't plasmids. It is a simple dictionary that stores the accession numbers of the blacklisted entries as keys and the reason why each should be excluded as values.
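A blacklist of this shape could be applied as in the sketch below. The dictionary contents, the placeholder accession strings, and the function name are hypothetical; only the key and value roles (accession number and exclusion reason) come from the description above.

```python
# Hypothetical blacklist: accession number -> reason for exclusion.
BLACKLIST = {
    "ACCESSION_0001": "complete chromosome, not a plasmid",
    "ACCESSION_0002": "partial gene sequence",
}


def filter_blacklisted(records, blacklist):
    """Split records into kept entries and removed entries with reasons."""
    kept = {}
    removed = {}
    for accession, sequence in records.items():
        if accession in blacklist:
            removed[accession] = blacklist[accession]
        else:
            kept[accession] = sequence
    return kept, removed


# Made-up records, for illustration only.
records = {"ACCESSION_0001": "ACGT", "ACCESSION_0003": "GGCC"}
kept, removed = filter_blacklisted(records, BLACKLIST)

for accession, reason in removed.items():
    print(f"{accession}\t{reason}")
```

Keeping the removed entries together with their reasons makes it straightforward to write a report of what was excluded and why.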

abricate2db.py

This script writes the results of a given abricate .tsv file to the desired database.

  • database correspondence with the -db parameter:
    • resistance - used for "card" and "resfinder" and will output to the model Card (check models here).
    • plasmidfinder - used for the "plasmidfinder" database and will output to the model Database.
    • virulence - used for the "vfdb" database and will output to the model Positive.

Other outputs:

  • resistance.json, plasmidfinder.json, virulence.json: These files are used to populate the dropdowns for resfinder, card, plasmidfinder and vfdb. They are served in different views, as documented here.

diamond2db.py

Similarly to abricate2db.py, this script writes results in diamond's tabular output format to the pATLAS metal resistance database table.

  • database correspondence with the -db parameter:
    • metal - used for the "bacmet" database and will output to the model MetalDatabase. Note that this is the only database that uses a protein query, so the search is expected to take longer.

Other outputs:

  • metal.json: This file is used to populate the dropdowns for bacmet. It is served in different views, as documented here.