Merge pull request #436 from jeromekelleher/release-0.2-updates

Update changelog with 0.2 updates.
tskit-dev · Dec 18, 2020 · df74494
2 parents 70e2430 + 6f5e7d1
commit df74494

Showing 3 changed files with 56 additions and 31 deletions.
diff --git a/CHANGELOG.rst b/CHANGELOG.rst
@@ -1,67 +1,73 @@
 ********************
-[0.2.0] - XXXX-XX-XX
+[0.2.0] - 2020-12-18
 ********************
 
+Major feature release, including some incompatible file format and API updates.
+
 **New features**:
 
+- Mismatch and recombination parameters can now be specified via the
+  recombination_rate and mismatch_ratio arguments in the Python API.
+
+- Missing data can be accommodated in SampleData using the tskit.MISSING_DATA
+  value in input genotypes. Missing data will be imputed in the output
+  tree sequence.
+
+- Metadata schemas for population, individual, site and tree sequence metadata
+  can now be specified in the SampleData format. These will be included
+  in the final tree sequence and allow for automatic decoding of JSON metadata.
+
+- Map non-inference sites onto the tree by using the tskit ``map_mutations``
+  parsimony method. This allows us to support sites with > 2 alleles.
+
+- Historical (non-contemporaneous) samples can now be accommodated in inference,
+  assuming that the true dates of ancestors have been set, by using the concept
+  of "proxy samples". This is done via the new function
+  ``AncestorData.insert_proxy_samples()``, then setting the new
+  parameter ``force_sample_times=True`` when matching samples.
+
+- The default tree sequence returned after inference when ``simplify=True`` retains
+  unary nodes (i.e. simplify is done with ``keep_unary=True``).
+
+
 **Breaking changes**:
 
 - The ancestors tree sequence now contains the real alleles and not
   0/1 values as before.
+
 - Times for undated sites now use frequencies (0..1), not as counts (1..num_samples),
   and are now stored as -inf, then calculated on the fly in the variants() iterator.
+
 - The SampleData file no longer accepts the ``inference`` argument to add_site.
   This functionality has been replaced by the ``exclude_positions`` argument
   to the ``infer`` and ``generate_ancestors`` functions.
+
 - The SampleData format is now at version 5, and older versions cannot be read.
   Users should rerun their data ingest pipelines.
 
-**Bugfixes**:
-
-- Individuals and populations in the SampleData file are kept in the returned tree
-  sequence, even if they are not referenced by any sample. The individual and population
-  ids are therefore guaranteed to stay the same between the sample data file and the
-  inferred tree sequence. (:pr:`348`)
-
-********************
-[0.1.5] - 2019-09-25
-********************
-
-**Breaking changes**:
-
-- Bumped SampleData file format version to 2.0, then to 3.0 as a result of the additions
-  below. Older SampleData files will not be readable and must be regenerated.
-
 - Users can specify variant ages, via ``sample_data.add_sites(... , time=user_time)``.
   If not ``None``, this overrides the default time position of an ancestor, otherwise
   ancestors are ordered in time by using the frequency of the derived variant (#143).
-  This addition bumped the file format to 2.0
 
 - Change "age" to "time" to match tskit/msprime notation, and to avoid confusion
   with the age since birth of an individual (#149). Together with the 2 changes below,
   this addition bumped the file format to 3.0.
 
 - Add the ability to record user-specified times for individuals, and therefore
   the samples contained in them (currently ignored during inference). Times are
-  added using ``sample_data.add_individual(... , time=user_time)`` (#190). Together
-  with the changes above and below, this addition bumped the file format to 3.0.
+  added using ``sample_data.add_individual(... , time=user_time)`` (#190).
 
 - Change ``tsinfer.UNKNOWN_ALLELE`` to ``tskit.MISSING_DATA`` for marking unknown regions
   of ancestral haplotypes (#188) . This also involves changing the allele storage to a
   signed int from ``np.uint8`` which matches the tskit v0.2 format for allele storage
-  (see https://github.com/tskit-dev/tskit/issues/144). Together with the 2 changes above,
-  this addition bumped the file format to 3.0.
+  (see https://github.com/tskit-dev/tskit/issues/144).
 
-**New features**:
-
-- Map non-inference sites onto the tree by using the built-in tskit
-  ``map_mutatations`` method. With further work, this should allow triallelic sites
-  to be mapped (#185)
-
-- The default tree sequence returned after inference when ``simplify=True`` retains
-  unary nodes (i.e. simplify is done with ``keep_unary=True``. This tends to result
-  in better compression.
+**Bugfixes**:
 
+- Individuals and populations in the SampleData file are kept in the returned tree
+  sequence, even if they are not referenced by any sample. The individual and population
+  ids are therefore guaranteed to stay the same between the sample data file and the
+  inferred tree sequence. (:pr:`348`)
 
 ********************
 [0.1.4] - 2018-12-12

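The changelog entry on undated sites says times moved from counts (1..num_samples) to frequencies (0..1). A minimal sketch of that change of units follows, assuming the frequency is simply the derived-allele count divided by the number of samples. The helper name `count_to_frequency` is hypothetical, and the actual tsinfer computation may differ (for example in how it treats missing data).

```python
def count_to_frequency(derived_count, num_samples):
    """Convert a derived-allele count (1..num_samples) to a frequency (0..1].

    Hypothetical helper for illustration only; not part of the tsinfer API.
    """
    if not 1 <= derived_count <= num_samples:
        raise ValueError("derived_count must lie in 1..num_samples")
    return derived_count / num_samples


# A site whose derived allele is carried by 3 of 12 samples.
old_style_time = 3                          # count-based value
new_style_time = count_to_frequency(3, 12)  # frequency-based value
```

Under this assumption, frequency-based times stay comparable across datasets with different sample sizes, which count-based times do not.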
diff --git a/tests/test_inference.py b/tests/test_inference.py
@@ -492,6 +492,23 @@ def verify_data_round_trip(
                 ts, genotypes, positions, alleles, sample_data.sequence_length
             )
 
+    # Skipping these tests as the HMM is currently not working properly
+    # for > 2 alleles, and we have a guard on this just to make
+    # sure that no user-data uses the faulty engine. Re-enable these
+    # when the HMM is fixed.
+
+    @pytest.mark.skip("Not currently working for > 2 alleles; #415")
+    def test_triallelic(self):
+        pass
+
+    @pytest.mark.skip("Not currently working for > 2 alleles; #415")
+    def test_n_allelic(self):
+        pass
+
+    @pytest.mark.skip("Not currently working for > 2 alleles; #415")
+    def test_not_all_alleles_in_genotypes(self):
+        pass
+
 
 class TestMissingDataRoundTrip(TestRoundTrip):
     """

diff --git a/tsinfer/inference.py b/tsinfer/inference.py
@@ -1113,6 +1113,8 @@ def __init__(
         # quickly be big enough even for very large instances.
         max_edges = 64 * 1024
         max_nodes = 64 * 1024
+        if np.any(num_alleles > 2):
+            raise ValueError("Cannot currently match with > 2 alleles.")
         self.tree_sequence_builder = self.tree_sequence_builder_class(
             num_alleles=num_alleles, max_nodes=max_nodes, max_edges=max_edges
         )
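
The guard added in `tsinfer/inference.py` rejects any input in which a site has more than two alleles. The sketch below reproduces that behaviour in isolation, with the builtin `any` standing in for the `np.any` call on an array so that it runs without NumPy. The function name `check_biallelic` is hypothetical; only the error message is taken from the diff.

```python
def check_biallelic(num_alleles):
    """Raise ValueError if any site has more than two alleles.

    Plain-Python stand-in for the `np.any(num_alleles > 2)` check in the diff.
    """
    if any(n > 2 for n in num_alleles):
        raise ValueError("Cannot currently match with > 2 alleles.")


# All sites biallelic: the guard passes silently.
check_biallelic([2, 2, 2])

# One triallelic site: the guard raises before any matching is attempted.
try:
    check_biallelic([2, 3, 2])
except ValueError as err:
    message = str(err)
```

Failing early in this way matches the intent described in the test comments: no user data should reach the matching engine while its handling of more than two alleles is known to be faulty (#415).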