Make sure that you compile MetaCache with the right data type sizes so that you
- don't waste memory
- have support for enough reference sequences per database
- and/or support for long enough reference sequences.
See the installation instructions for more details.
MetaCache's build mode is used for creating databases.
You need
- taxonomic hierarchy files
- reference genome/scaffold/contig files
- target-to-taxon mappings that specify each reference genome's taxon
If your machine doesn't have enough RAM to fit an entire database you can use database partitioning to split up the reference genomes into several partitions.
You can download the NCBI's taxonomy with an included helper script:
download-ncbi-taxonomy ncbi_taxonomy
This downloads the taxonomy and puts it in a folder called ncbi_taxonomy
Must be in (uncompressed) FASTA or FASTQ format.
You can either specify all input genome files separately:
metacache build mydatabase genome1.fna genome2.fnq genome3.fa -taxonomy ncbi_taxonomy
or put them all in one folder and then use that folder as input source:
metacache build mydatabase genomes_folder -taxonomy ncbi_taxonomy
There are three ways for MetaCache to obtain the taxon of a reference genome.
MetaCache looks for a file named assembly_summary.txt
(as used by the NCBI) in each refenrece genome input folder.
Such files map reference genome filenames to taxon ids, so that all sequences in a reference genome file get the same taxid.
A proper assembly_summary.txt
file must contain at least:
- first line that starts with
#
- second line that starts with
#
and all column names separated by tabs - the first column must contain the file names
- some other column must be named
taxid
(and contain the taxon ids)
Example:
# first comment line needs to be there
# assembly_accession taxid
filename1 438753
filename2 198804
...
>NC_017430.1 | taxid|813
GCGGCCGCCCGGGAAATTGCTAAAAGATGGGAGCAAAGAGTTAGAGATCTACAAGATAAAGGTGCTGCACGAAAATTATT
...
MetaCache looks for such annotations by default, so this does just work.
The NCBI's accession to taxon mapping files can be downloaded with the included helper script download-ncbi-taxmaps
.
You should put them into the same folder as the taxonomy:
download-ncbi-taxmaps taxonomy_folder
MetaCache looks in the taxonomy folder (determined by build option -taxonomy
) for them. They are used at the end of the database build phase for all reference sequences that don't have a taxid assigned yet (by any of the other methods).
accession2taxid
files contain 4 tab-separated columns:
accession accession_version taxid gid
A00002 A00002.1 9913 0
A00003 A00003.1 9913 0
X52700 X52700.1 9771 0
...
The mappings file name(s) go after the parameter -taxpostmap
:
metacache build mydb mygenomes_folder -taxonomy ncbi_taxonomy -taxpostmap mymap.accession2taxid
In order for this to work the accession
or accession.version
must be contained as sequence id in the
reference genome FASTA or FASTQ files:
>NZ_AAGY02000218.1 Streptococcus pneumoniae TIGR4 ctg00649, whole genome shotgun sequence
TTTGAGCCACTTCGTCTTTAACGGCTTTATTCATAAGCTCTTGTAATTTTTCTTTACTATCAATTACTTCTGATTTTCCG
...
If your genomes are already annotated with taxon ids (see section 2) and/or you also have assembly_summary.txt
with taxon ids in the genomes folder, than these taxon ids are used with higher priority.
In case you want the taxon ids from accession2taxid files to supersede those from other sources you can supply the option -reset-taxa
.
metacache build mydb mygenomes_folder -taxonomy ncbi_taxonomy -reset-taxa -taxpostmap mymap.accession2taxid