Merge pull request #1741 from milaboratory/4.7.0-RC

4.7.0 last checks
milaboratory · Aug 7, 2024 · 976ba14 · 976ba14
2 parents c9a1bb8 + 56acb42
commit 976ba14
Show file tree

Hide file tree

Showing 2 changed files with 43 additions and 47 deletions.
diff --git a/build.gradle.kts b/build.gradle.kts
@@ -134,7 +134,7 @@ val toObfuscate: Configuration by configurations.creating {
 val obfuscationLibs: Configuration by configurations.creating
 
 
-val mixcrAlgoVersion = "4.6.0-206-4-7-0-RC-plus-mitool-integration"
+val mixcrAlgoVersion = "4.7.0"
 // may be blank (will be inherited from mixcr-algo)
 val milibVersion = ""
 // may be blank (will be inherited from mixcr-algo or milib)
@@ -146,7 +146,7 @@ val repseqioVersion = ""
 
 val picocliVersion = "4.6.3"
 val jacksonBomVersion = "2.16.0"
-val milmVersion = "4.1.0"
+val milmVersion = "4.2.0"
 
 val cliktVersion = "3.5.0"
 val jcommanderVersion = "1.72"

diff --git a/changelogs/v4.7.0.md b/changelogs/v4.7.0.md
@@ -1,77 +1,73 @@
 ## ❗ Breaking changes
 
-- Starting from version 4.7.0 of MiXCR, users are required to specify the assembling feature for all presets in cases
-  where it's not defined by the protocol. This can be achieved using either the option `--assemble-clones-by Feature`
-  or `--assemble-contigs-by Feature` for fragmented data (such as RNA-seq or 10x VDJ data). This ensures consistency in
-  assembling features when integrating various samples or types of samples, such as 10x single-cell VDJ and AIRR
-  sequencing data, for downstream analyses like inferring alleles or building SHM trees. The previous behavior for
-  fragmented data, which aimed to assemble as long sequences as possible, can still be achieved with either the
-  option `--assemble-contigs-by-cell` for single-cell data or `--assemble-longest-contigs` for RNA-seq/Exom-seq data.
+- Starting from version 4.7.0 of MiXCR, users are required to specify the assembling feature for all presets in cases where it's not defined by the protocol. This can be achieved using either the option `--assemble-clones-by [feature]`or `--assemble-contigs-by [feature]` for fragmented data (such as RNA-seq or 10x VDJ data). This ensures consistency in assembling features when integrating various samples or types of samples, such as 10x single-cell VDJ and AIRR sequencing data, for downstream analyses like inferring alleles or building SHM trees. The previous behavior for fragmented data, which aimed to assemble as long sequences as possible, can still be achieved with either the option `--assemble-contigs-by-cell` for single-cell data or `--assemble-longest-contigs` for RNA-seq/Exom-seq data.
 
 ## 🚀 Major fixes and upgrades
 
-- Ability to trigger realignments of left or right reads boundaries with global alignment algorythm using
-  parameters `rightForceRealignmentTrigger Feature` or `leftForceRealignmentTrigger Feature` in case the reads do
-  not span the CDR3 regions (rescue alignments in case of fragmented single cell data).
-- Fixed `assemble` behavior in presets for single-cell data (in some cases consensuses were assembled from reads coming
-  from different cells)
-- Ability to override the `relativeMeanScore` and `maxHits` parameters in `assemble` and `assembleContigs` steps
-  (improve the V genes assignments)
-- Consensus assembly in `assemble` now is performed separately for each chain. This allows to prevent effects from
-  different expression levels on the consensus assembly algorithm. This change is specifically important for single-cell
-  presets with cell-level assembly (most of the MiXCR presets for single-cell data).
-- Options `--dont-correct-tag-with-name <tag_name>` or `--dont-correct-tag-type (Molecule|Cell|Sample)` could be
-  specified to skip tag correction. It will degrade the overall quality of analysis, but will decrease memory
-  consumption
-- MiTool pipeline integrated into `10x-sc-xcr-vdj` preset which improved overall quality of `analyze`
+- Fixed `assemble` behavior for single-cell data, before the fix, in rare cases consensuses were assembled from reads coming from different cells. Now reads from different cells are strictly isolated.
+- Significant improvement of V genes assignment precision. To facilitate this improvement `assemble` and `assembleContigs` steps now have individual `relativeMeanScore` and `maxHits` parameters.
+- Improved robustness against expression level differences between TCR/IG chains. Consensus assembly in `assemble` now is performed separately for each chain. This change is specifically important for single-cell presets with cell-level assembly (most of the MiXCR presets for single-cell data).
+- Now options `--dont-correct-tag-with-name <tag_name>` or `--dont-correct-tag-type (Molecule|Cell|Sample)` can be specified to skip tag correction. It will trade off some analysis quality and error correction performance, for significantly lower memory and analysis time requirements, in deeply sequenced datasets with many Cell and Molecular barcodes.
+- Ability to trigger realignments of left or right reads boundaries with global alignment algorithm using parameters `rightForceRealignmentTrigger` or `leftForceRealignmentTrigger` in cases where reads do not cover the CDR3 regions (rescue alignments in case of fragmented data, like single-cell).
+- MiTool-based contig pre-assembly step integrated into `10x-sc-xcr-vdj` preset, significantly improving overall analysis performance. 
 
-## 🛠️ Minor improvements & fixes
+## 🛠️ Other improvements & fixes
 
-- Default input quality filter in `assemble`  (`badQualityThreshold`) stage was decreased to 10.
+- Default input quality filter in `assemble` (`badQualityThreshold`) stage was decreased to 10, improving total analysis yield
 - Added validation for `assembleCells` that input files should be assembled by fixed feature
 - Export of trees and tree nodes now support imputed features
-- Fixed parsing of optional arguments
-  for `exportShmTreesWithNodes`: `-nMutationsRelative`, `-aaMutations`, `-nMutations`, `-aaMutationsRelative`, `-allNMutations`, `-allAAMutations`, `-allNMutationsCount`, `-allAAMutationsCount`.
-- Fixed parsing of optional arguments for `exportClones`
-  and `exportAlignments`: `-allNMutations`, `-allAAMutations`, `-allNMutationsCount`, `-allAAMutationsCount`.
+- Fixed parsing of optional arguments for `exportShmTreesWithNodes`: `-nMutationsRelative`, `-aaMutations`, `-nMutations`, `-aaMutationsRelative`, `-allNMutations`, `-allAAMutations`, `-allNMutationsCount`, `-allAAMutationsCount`.
+- Fixed parsing of optional arguments for `exportClones` and `exportAlignments`: `-allNMutations`, `-allAAMutations`, `-allNMutationsCount`, `-allAAMutationsCount`.
 - Fixed possible errors on exporting amino acid mutations in `exportShmTreesWithNodes`
 - Fixed list of required options in `listPresets` command
-- Fixed error on building trees in case of JBeginTrimmed started before CDR3Begin
+- Fixed error on building trees in case of `JBeginTrimmed` started before `CDR3Begin`
 - Fixed usage `--remove-step qc`
 - Added `--remove-qc-check` option
 - Remove `-topChains` field from `exportShmTreesWithNodes` command. Use `-chains` instead
 - Removed default splitting clones by V and J for presets where clones are assembled by full-length.
-- Fixed NullPointerException in some cases of building trees by SC+bulk data
+- Fixed `NullPointerException` in some cases of building trees by SC+bulk data
 - Fixed `java.lang.IllegalArgumentException: While adding VEndTrimmed` in `exportClones`
 - Fixed combination trees step in `findShmTrees`: in some cases trees weren't combined even if it could be
-- Fixed `java.util.NoSuchElementException` in some cases of SC combining of trees
+- Fixed `NoSuchElementException` in some cases of SC combining of trees
 - Fixed export of `-jBestIdentityPercent` in `exportShmTreesWithNodes`
 - Added validation on export `-aaFeature` for features containing UTR
 - Fixed usage of command `exportPlots shmTrees`
-- Fixed topology of trees: before common V and J mutations were included in the root node, now in root included only
-  reconstructed NDN. Previous behavior lead to smaller numbers of distance from germline, sequence for germline exported
-  with common mutations. For fix this you should recalculate `findShmTrees`
+- Fixed topology of trees: before common V and J mutations were included in the root node, now root includes only reconstructed NDN. Previous behavior lead to underestimated distance from the germline. Now sequence for the germline exports with common mutations. To fully apply the fix to previously analyzed data, rerun the pipeline starting from `findShmTrees`
 - Fixed `IllegalStateException` on removing unnecessary genes on `findAlleles`
 - Added `--dont-remove-unused-genes` option to `findAlleles` command
 - Adjustment consensus assembly (in `assemble`) parameters for single cell presets
-- Command `groupClones` was renamed to `assembleCells`. Old name is working, but it's hidden from help. Also report and
-  output file names in `analyze` step were renamed accordingly.
+- Command `groupClones` was renamed to `assembleCells`. Old name is working, but it's hidden from help. Also report and output file names in `analyze` step were renamed accordingly.
 - Fixed calculation of germline for `VCDR3Part` and `JCDR3Part` in case of indels inside CDR3
 - Fixed export of trees if data assembled by a feature with reference point having offset
 - Export of `VJJunction gemline` in `shmTrees` exports now export `mrca` as most plausible content
-- Fixed parsing and alignment of reads with length more than 30Kbase
+- Fixed parsing and alignment of reads longer than 30 Kbase
 - `downsample` now supports `molecule` variant in `--downsampling` option
 - Fixed naming of output files of `downsample` command
-- `--output-not-used-reads` of `analyze` command now works with bam input files too, alongside `--not-aligned-(R1|R2)`
-  and `--not-parsed-(R1|R2)` of `align` command
-- Fix `replaceWildcards` behaviour on parsing BAM. It led before to discarding of quality on `align`
-- `v_call`, `d_call`, `j_call` and `c_call` columns in airr now output only bets hit, not the whole list
-- Stable behavior of `replaceWildcards`. Before it depended on the position of read in a file, now it depends on read
-  content
-- If sample sheet supplied by `--sample-sheet[-strict]` option has `*` symbol after tag name, then it will be preserved
+- `--output-not-used-reads` of `analyze` command now works with bam input files too, alongside `--not-aligned-(R1|R2)` and `--not-parsed-(R1|R2)` of `align` command
+- Fix `replaceWildcards` behaviour on parsing BAM. Previous behaviour resulted in discarding of the quality scores on `align`
+- `v_call`, `d_call`, `j_call` and `c_call` columns in AIRR now output only best hit, not the whole list
+- Stable behavior of `replaceWildcards`. Before it depended on the position of read in a file, now it depends on read content only
+- If sample sheet supplied by `--sample-sheet[-strict]` option has `*` symbol after tag name, it will be preserved
 
-## New Presets
+## 🧬 Reference gene library changes
+
+- IG reference for new species:
+  - Rabbit (IGH, IGK, IGL)
+  - Sheep (IGH, IGK, IGL)
+- Human reference corrections:
+  - Duplicated entries removed: `IGHV1-69*00`, `IGHV1-69*01`, `IGHV3-23*00`, `IGHV3-23*01`
+  - Fix for `CDR3Begin` position in `IGHV4-30-4`
+  - Fix for `FR1Begin` position in `TRBV21-1`
+  - Names of the following human `TRAV` genes were changed:
+    - `TRAV14DV4` -> `TRAV14/DV4`
+    - `TRAV23DV6` -> `TRAV23/DV6`
+    - `TRAV29DV5` -> `TRAV29/DV5`
+    - `TRAV36DV7` -> `TRAV36/DV7`
+    - `TRAV38-1DV8` -> `TRAV38-1/DV8`
+- Correct mapping of V-gene UTRs in Alpaca reference
+
+## 📚 New Presets
 
 - Added preset `takara-mouse-rna-bcr-umi-smarseq` for new Takara SMART-Seq Mouse BCR (with UMIs) kit
 - Added preset `idt-human-rna-bcr-umi-archer` and `idt-human-rna-tcr-umi-archer` for IDT Archer kits
-- Presets for Cellecta kits that include TCR/BCR Spike-in mix QC metrics: `cellecta-human-rna-xcr-umi-drivermap-air-bcr-spikein-1-1-1`, `cellecta-human-rna-xcr-umi-drivermap-air-bcr-spikein-16-4-1`,`cellecta-human-rna-xcr-umi-drivermap-air-tcr-spikein-1-1-1`,`cellecta-human-rna-xcr-umi-drivermap-air-tcr-spikein-16-4-1`
+- Presets for Cellecta kits that include TCR/BCR Spike-in mix QC metrics: `cellecta-human-rna-xcr-umi-drivermap-air-bcr-spikein-1-1-1`, `cellecta-human-rna-xcr-umi-drivermap-air-bcr-spikein-16-4-1`, `cellecta-human-rna-xcr-umi-drivermap-air-tcr-spikein-1-1-1`,`cellecta-human-rna-xcr-umi-drivermap-air-tcr-spikein-16-4-1`