All notable changes to this project will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
- `medaka smolecule` was broken by change from `medaka consensus` to `medaka inference`.
- Improved error message when model is not found.
Switched from tensorflow to pytorch.
Existing models for recent basecallers have been converted to the new format.
Pytorch format models contain a `_pt` suffix in the filename.
- Inference is now performed using PyTorch instead of TensorFlow.
- The `medaka consensus` command has been renamed to `medaka inference` to reflect its function in running an arbitrary model and avoid confusion with `medaka_consensus`.
- The `medaka stitch` command has been renamed to `medaka sequence` to reflect its function in creating a consensus sequence.
- The `medaka variant` command has been renamed to `medaka vcf` to reflect its function in consolidating variants and avoid confusion with `medaka_variant`.
- Order of arguments to `medaka vcf` has been changed to be more consistent with `medaka sequence`.
- The helper script `medaka_haploid_variant` has been renamed `medaka_variant` to save typing.
- Make `--ignore_read_groups` option available to more medaka subcommands including `inference`.
- The `medaka snp` command has been removed. This was long defunct as diploid SNP calling had been deprecated, and `medaka variant` is used to create VCFs for current models.
- Loading models in hdf format has been deprecated.
- Deleted minimap2 and racon wrappers in `medaka/wrapper.py`.
- Release conda packages for Linux (x86 and aarch64) and macOS (arm64).
- Option `--lr_schedule` allows using cosine learning rate schedule in training.
- Option `--max_valid_samples` to set number of samples in a training validation batch.
- Training models with DiploidLabelScheme uses categorical cross-entropy loss instead of binary cross-entropy.
- Minor edits to README around model selection and package installation.
- Release conda packages for Linux (x86 and aarch64) and macOS (arm64).
Switched from tensorflow to pytorch.
Existing models for recent basecallers have been converted to the new format.
Pytorch format models contain a `_pt` suffix in the filename.
- Inference is now performed using PyTorch instead of TensorFlow.
- The `medaka consensus` command has been renamed to `medaka inference` to reflect its function in running an arbitrary model and avoid confusion with `medaka_consensus`.
- The `medaka stitch` command has been renamed to `medaka sequence` to reflect its function in creating a consensus sequence.
- The `medaka variant` command has been renamed to `medaka vcf` to reflect its function in consolidating variants and avoid confusion with `medaka_variant`.
- Order of arguments to `medaka vcf` has been changed to be more consistent with `medaka sequence`.
- The helper script `medaka_haploid_variant` has been renamed `medaka_variant` to save typing.
- The `medaka snp` command has been removed. This was long defunct as diploid SNP calling had been deprecated, and `medaka variant` is used to create VCFs for current models.
- Loading models in hdf format has been deprecated.
- Deleted minimap2 and racon wrappers in `medaka/wrapper.py`.
- Option `--lr_schedule` allows using cosine learning rate schedule in training.
- Option `--max_valid_samples` to set number of samples in a training validation batch.
- Training models with DiploidLabelScheme uses categorical cross-entropy loss instead of binary cross-entropy.
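Scripts written against the older command names could have the renames listed above applied mechanically. The following is a minimal sketch and not part of medaka: the helper name `update_command` and the two mapping tables are illustrative assumptions built only from the renames recorded in this changelog. It does not handle the changed order of arguments to `medaka vcf`.

```python
import shlex

# Renamed subcommands of the `medaka` program, as listed in this changelog.
SUBCOMMAND_RENAMES = {
    "consensus": "inference",
    "stitch": "sequence",
    "variant": "vcf",
}

# Renamed helper scripts, as listed in this changelog.
SCRIPT_RENAMES = {
    "medaka_haploid_variant": "medaka_variant",
}


def update_command(command: str) -> str:
    """Rewrite an old-style medaka command line to use the new names.

    Hypothetical helper: only the program name or subcommand is changed;
    all remaining arguments are passed through untouched.
    """
    tokens = shlex.split(command)
    if not tokens:
        return command
    if tokens[0] == "medaka" and len(tokens) > 1:
        tokens[1] = SUBCOMMAND_RENAMES.get(tokens[1], tokens[1])
    else:
        tokens[0] = SCRIPT_RENAMES.get(tokens[0], tokens[0])
    return shlex.join(tokens)
```

Because only the first one or two tokens are inspected, file names that happen to contain words such as `consensus` are left alone.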
(Probably) final version of medaka using tensorflow. Future versions will use pytorch instead.
- medaka_consensus: only keep bam tags if input file matches joint polishing pipeline.
- Pin numpy to <2.0.0.
- Consensus and variant models lookup for v3.5.1 Dorado models.
- tandem: Use haplotag 0 in unphased mode.
- tandem: Don't run consensus if regions set is empty.
- Models for version 5 basecaller models.
- Expose `sym_indels` option for training.
- Expose `--min_mapq` minimum mapping quality alignment filtering option for medaka consensus.
- tandem: Option `--ignore_read_groups` to ignore read groups present in input file.
- Wrapper script `medaka_consensus_joint` and convenience tools (`prepare_tagged_bam`, `get_model_dtypes`) to facilitate joint polishing with multiple datatypes.
- Consensus and variant models for v4.3.0 dorado models.
- Parsing model information from fastq headers output by Guppy and MinKNOW.
- Additional explanatory information in VCF INFO fields concerning depth calculations.
- Do not exit if model cannot be interpreted, use the default instead.
- An issue with co-ordinate handling in computing variants from alignments.
- Ability to use basecaller model name as --model argument.
- Better handling of errors when running abpoa.
- Correct suffix of consensus file when `medaka_consensus` outputs a fastq.
- Choice of model file can be introspected from input files. For BAM files the read group (RG) headers are searched according to the dorado specification, whilst for .fastq files the comment sections of a number of reads are checked for corresponding read group information. In the latter case see README for information on correctly converting basecaller output to .fastq whilst maintaining the relevant meta information.
- `medaka tools resolve_model` can display the model that would automatically be used for a given input file.
- If no model is provided on command-line interface (medaka consensus, medaka_consensus, and medaka_haploid_variant) automatic attempts will be made to choose the appropriate model.
- Tensorflow logging level no longer set from Python.
- spoa and parasail are now strict requirements.
- Sort VCF before annotating in `medaka_haploid_variant`.
- Ignore errors when deleting temporary files.
- The output of the first POA run not being used in the second iteration in smolecule command.
- Support for Python 3.11.
- `--spoa_min_coverage` option to smolecule command.
- Support for Python 3.7.
- A long-standing bug in pileup_counts that manifests for single-position pileups on ARM64.
- Added `medaka tandem` targeted tandem repeat variant calling.
- Updated features related to fetching of trimmed reads.
- Refactored smolecule module.
- Faster inference and stitching of many short contigs.
- Tensorflow version 2.10 (allows for aarch64 wheels).
- Expose qualities parameter in medaka_consensus script with `-q` parameter.
- Consensus and variant models for v4.1 and v4.2 basecallers.
- Changed default models to be r1041_e8.2_400bps_v4.2 models
- Clip probabilities in `_phred()` rather than adding smallest float.
- Consensus polishing models for Version 4 basecallers.
- Wheel builds for newer Python versions.
- Deprecated numpy.unicode use.
- Set minimum Python version to 3.7.
- Updated tensorflow requirement to 2.8.
- Put lower bound on numpy requirement.
- Dropped support for Python 3.6. Security support for Python 3.6 was ended on 23 Dec 2021; as such we have removed support for Python 3.6 and suggest users update their Python version.
- New models for R10.4.1 E8.2 260bps based sequencing chemistries.
- Updated Hac and Fast models for R10.4.1 E8.2 400bps based sequencing chemistries.
- Removed models for Fast basecallers from pypi package
- medaka variant IndexError on long insertion
- capability to fill gaps in consensus sequence with a designated character (e.g. 'N') instead of content from a reference sequence.
- option `-r` in `medaka_consensus` to set the designated fill character.
- option `--fill_char` in `medaka stitch` to set the designated fill character.
- CUDA initialization errors during `medaka smolecule`'s stitch phase.
- New models for R10.4.1 E8.2 400bps based sequencing chemistries.
- DiploidZygosityLabelScheme renaming.
- Updated to tensorflow~=2.7.0.
- Do not always force recreation of minimap2 index in helper scripts.
- PyPI wheel releases now built with libdeflate for faster BAM reading.
- Inclusion of inserted bases immediately after deletion in pileup counts.
- Makefile can now build environment for macOS M1.
- Publish ARMv8 wheels compatible with NVIDIA's Jetpack 4.6.1 binary.
- `--qualities` option for `smolecule` and `stitch` to output consensus fastq.
- Updated tensorflow requirement to ~=2.5.2.
- Spruced-up documentation.
- Light testing of Docker build.
- Remove `medaka_variant` in deference to clair3.
- Updated tensorflow requirement to ~=2.4.4.
- Light testing of Docker build.
- tensorflow requirement to ~=2.2.2
- R10.4 E8.1 consensus models for Guppy version 5.0.15.
- `medaka tools` now displays its help rather than an error.
- `medaka tools download_models` can download specific models.
- Missing sites in gVCF output.
- Rewritten algorithm for determining VCF records from RNN outputs for clarity and speed.
- Inclusion of select models in distributions.
- Models for Guppy version 5.0.7.
- Issue whereby tensorflow would spawn many threads that do not exit.
- Added missing default option to argparse instance in smolecule command.
- Copy across contigs with no aligned reads during `medaka stitch`.
- Quote strings in bash scripts to allow filenames with spaces.
- Typo in `medaka_consensus` causing a syntax error.
- Option to output VCF record for all reference positions from `medaka variant`.
- Haploid variant calling reverted to old-style methodology.
- Early exits on error in `medaka_consensus` and `medaka_variant`.
- `INFO` field of VCFs is now correctly `.` when empty.
- Rewrote inference data loading code for clarity.
- Removed BioPython version pin.
- Formally update htslib program requirements to 1.11.
- Support for Python 3.5.
- Corner case in consensus stitching.
- Variant annotation when more than one CHROM record in VCF.
- Variant annotation when counts matrix does not span variants.
- Updated Tensorflow requirement to 2.2.2
- Off-by-one error during stitching of consensus chunks.
Minor release
- Fixed incorrect read depth annotations in VCFs.
- Fixed missing files in PyPI source distribution.
- Fix `StopIteration` issues in newer Pythons.
- Added `-n` option to `medaka_variant` to add a sample field to outputs.
- Set `HDF5_USE_FILE_LOCKING=FALSE`, which some users report as useful.
- Set `OMP_NUM_THREADS=1` required to make Tensorflow anaconda use CPU resource sensibly.
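A caller that launches medaka from Python can apply the same two environment settings itself. A minimal sketch, assuming a hypothetical helper `medaka_environment` that is not part of medaka and simply returns a copy of an environment mapping with both values filled in:

```python
import os


def medaka_environment(base=None):
    """Return a copy of an environment mapping with the two settings applied.

    Hypothetical helper, not part of medaka. Values already present in the
    mapping are kept, so a user can still override either setting.
    """
    env = dict(os.environ if base is None else base)
    env.setdefault("HDF5_USE_FILE_LOCKING", "FALSE")
    env.setdefault("OMP_NUM_THREADS", "1")
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run`.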
Minor release
- Fix issue whereby variant ALTs were created equal to REF.
- Build a medaka-cpu package depending on tensorflow-cpu.
Performance release
- Fix long-standing issue where genome regions could be unprocessed.
- Improve inference performance by 30%.
- Add efficient multiprocessing to `medaka stitch`.
Bug fix and feature release
- Fix iteration error in retrieving trimmed reads.
- Work around tensorflow threading issue.
- Add ability to run the `haploid2diploid` tool on VCFs generated by `medaka_haploid_variant`.
Bug fix and feature release
- Fix issues in command-line argument parsing.
- Add true ploidy-1 variant caller.
- Do not break contigs at unpolished regions (fill with input instead).
- Add multi-nucleotide variant decomposition to be compatible with DeepVariant.
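Multi-nucleotide variant decomposition splits a record whose REF and ALT alleles have equal length into one single-nucleotide record per differing base. A minimal sketch of that idea, assuming a hypothetical function `decompose_mnp` and plain tuples in place of VCF records; it is not medaka's implementation:

```python
def decompose_mnp(pos: int, ref: str, alt: str):
    """Split an equal-length REF/ALT pair into per-base SNP records.

    Returns a list of (position, ref_base, alt_base) tuples, one for each
    position at which the two alleles differ. Positions are offsets from
    `pos`, the coordinate of the first base of the original record.
    """
    if len(ref) != len(alt):
        raise ValueError("REF and ALT must have equal length for an MNP")
    return [
        (pos + offset, ref_base, alt_base)
        for offset, (ref_base, alt_base) in enumerate(zip(ref, alt))
        if ref_base != alt_base
    ]
```

Bases that are identical in both alleles produce no record, so an MNP with a matching middle base yields two SNPs rather than three.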
Bug fix release.
- Remove python version check preventing Python >3.6 builds from running.
Update with new models and features.
- Fix a few bugs in variant annotation program.
- Add ARM builds to PyPI release.
- Add Python 3.7 and 3.8 builds for x86-64.
- Add PromethION model for Guppy 4.0.11.
- Upgrade to Tensorflow 2.2.
- Option to split MNPs to independent SNPs (for compatibility with DeepVariant).
- Single molecule consensus program now uses `pyspoa`.
- Remove methylation aggregation functionality.
Minor fixes release.
- Fix occasional mangled sam output in guppy2sam.
- Update htslib ecosystem to 1.10 to fix conda installation issue.
Minor fixes and models release.
- VCF GQ is now an integer in line with VCF spec.
- Fixed issue requiring a previous model for training.
- Fixed issue causing -p option of medaka_variant to crash.
- Fixed issue preventing installation in a virtualenv with python <3.6.
- R9.4.1 variant calling models for Guppy 3.6.0 and updated benchmarks.
- Made r941_min_high_g360 the default consensus model.
Minor fixes release, resolving issues introduced in v1.0.0.
- Fix default model for SNP calling.
- Fix issue causing medaka_consensus to crash.
Models, features and fixes release
- Fix to methylation aggregation.
- Consensus models for Guppy 3.6.0.
- Add functionality for auto-download of older models.
- VCF annotation tool.
Minor release.
- Harmonised versions of htslib/samtools dependencies.
Models, features and fixes release
- Minor speed improvement.
- Fix bug where force overwrite of output was always enabled.
- Fix bug where variant calling of a region crashed if the region began with a deletion.
- Variant calling models for R10.3 and R9.4.1 and updated benchmarks.
- Consensus models for Guppy 3.5.1.
- Add read group (RG) tag filtering.
- Add option to create consensus sequence via intermediate .vcf file.
- Update to methylation calling documentation.
- Addition of all-context modified-base aggregation.
R10.3 model and small fixes
- Fix index/compression issue with RLE workflow
- Fix a rare memory error during feature generation caused by very long indels.
- Add model for R10.3 on MinION.
- Write an empty vcf when no variants are found in medaka_variant.
Bugfix release
- Fix invalid specification of variant calling model.
Model release
- Models for guppy 3.4.4.
Minor fix release
- Fix a memory error in pileup calculation.
- Update variant calling models and benchmarks.
Minor fix release
- Detect NaNs during training and halt early.
- Workaround pysam interface changes (for conda package).
- Preliminary hard-RLE model for R9.4.1
- --regions argument can now be a .bed file.
- Support soft-RLE network training.
This release includes an experimental consensus mode using run-length encoded alignments. Use of this algorithm can be specified using the new "rle" model:
medaka_consensus -m r941_min_high_g340_rle -i basecalls.fasta -d draft.fa
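Run-length encoding collapses each run of identical bases (a homopolymer) into the base and the length of the run. The sketch below illustrates that general idea only; the function names `run_length_encode` and `run_length_decode` are assumptions, and medaka's own handling of run-length encoded alignments is not shown here.

```python
from itertools import groupby


def run_length_encode(sequence: str):
    """Collapse runs of identical characters into (base, run_length) pairs."""
    return [(base, sum(1 for _ in run)) for base, run in groupby(sequence)]


def run_length_decode(pairs) -> str:
    """Expand (base, run_length) pairs back into the original sequence."""
    return "".join(base * length for base, length in pairs)
```

Encoding followed by decoding returns the input unchanged, which makes the pair of functions easy to check on short sequences.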
Feature release
- Consensus models for guppy 3.3 and 3.4.
- Aggregation of Guppy modified base probability tables.
- Multi-thread stitching of inference chunks in `medaka_consensus`.
- Optionally run whatshap phase at the end of `medaka_variant`.
Minor fix release
- Fix bug where feature matrix was misaligned with coordinate system.
- Fixed issue with `medaka_variant` failing on zero-coverage regions.
- Rename incorrectly named diploid SNP calling model.
- Made variant calling faster by resolving trivial bottleneck in variant classification.
- Add missing arguments from `smolecule` command.
- Output contig names are no longer written as samtools-style regions.
Feature release
- Corrected parsing of region strings with multiple `:` characters.
- Fixed bug causing larger than requested overlap in inference chunks.
- Fixed rare consensus stitching error.
- Added a `-f` force overwrite option to `medaka_consensus`.
- Added C. elegans assembly benchmarks to documentation.
- Switched variant calling to an explicitly diploid calling model.
- Refreshed E. coli benchmark to include effect of `racon`.
- Refreshed variant calling benchmarks.
Minor fix release.
- Additional fix to handling lowercase reference sequences.
- Fix bug in creation of RLE alignments.
- Update `update_model.py` script.
- Unify how LabelSchemes store training data.
- Remove option to select labelling scheme during training.
Minor fix release
- Fix regression in medaka stitch and medaka snp speed.
- Remove dill and yaml requirements.
- Handle lowercase letters in reference sequences.
Bugfix and training refactor release
- Fix readlink issue on MacOS
- Fix bug where medaka_variant did not call indels by default
- Fix bug in determining when to split contigs
- Make network feature generation 2x faster
- Add smolecule command
- Log use of GPU and cuDNN, noting workaround for RTX cards
- Store models in git-lfs
- Simplify medaka_variant workflow for speed
- Refactor labelling of training data and storing of models
- Reimplement RLE feature generation
- Drop support for older basecaller models (guppy<3.0.3)
Documentation release
- Clarify suggested workflows in documentation.
Patch release
- Patch import of loading of older models
Model release and development release
- Add support for R10 basecaller
- Add diploid multi-labelling
- Upgrade to tensorflow 1.14.0
Bug fix release
- Fix regression in consensus stitching when chunks do not overlap.
Feature release
- Indel calling for `medaka_variant`.
- New models for MinION/GridION and PromethION paired to high accuracy and fast guppy basecallers.
- Overhaul of chunk handling and overlapping.
Bug fix release
- Fix Makefile for parallel build.
- Ensure medaka consensus is given absolute path to model.
- Tidy up some parsing and sorting of regions from strings.
- Disable by default validation of output HDF during consensus.
- Refactor variant handling code.
Bug fix release
- Fix for models not specifying data types.
Bug fix release
- Split pileup when reads do not span rather than silently deleting region.
- Fix error in stitching occurring with a single region.
- Refactor handling of short and remainder regions.
- Drop 3.4 support.
Bug fix release
- Enhanced verification of training feature samples
- Pin pyyaml version
Bug fix release
- Fixed bug in `medaka_consensus` incorrectly calling python
SNP calling, model, and bugfix release
- Workaround short-contig/no-coverage corner case during pileup.
- Prototype SNP calling and phasing, benchmarks
- Add model for improved Flip-flop model in Guppy 2.3.5
- Rename models to be more logical
- Update to htslib version 1.9 for long cigars
Bug fix release
- Fix bug leading to dropping of pileup chunks during loading
Development and performance release
- Refactor batch queuing in preparation to using keras Sequence
- Asynchronous feature loading during inference
- Pin version of h5py to work around intermittent errors in saving models
Training and bug-fix release.
- Resolve issue with contained chunks during stitching
- Resolve hanging at the end of training
- Improved storage and retrieval of features for better IO
- Training speed improved >10X
- Switch to CuDNN for GRU layers
- Check presence of `minimap2` and `samtools`
- Provide more feedback on error
Model release.
- Add support for R9.4.1 flip-flop basecaller
Development release.
- Adds build infrastructure for source distributions and manylinux wheels.
Performance and bugfix release.
- Large refactoring of feature and sample generation. Fixes many small bugs and edge cases
- Resize models for small contigs
- Faster Generation of inference features
- Model updates
- Ability to handle multiple read types
- Remove redundant samtools tview code
- Limit CPU usage when running without a GPU
Model and usability release.
- Many small bug fixes
- New non-RLE model
- Updated documentation and benchmarks
- Dockerfile to build a medaka Docker image
- `medaka_consensus` no longer needs a pomoxis installation to run