Merge pull request #52 from gwm17/dev
Interpolation solver fixes, addresses #44 and #21
gwm17 authored Feb 28, 2024
2 parents a2fcd64 + 092b66a commit 17dd1b7
Showing 28 changed files with 1,365 additions and 729 deletions.
10 changes: 5 additions & 5 deletions config.json
@@ -47,16 +47,16 @@
"Cluster": {
"min_cloud_size": 50,
"smoothing_neighbor_distance(mm)": 15.0,
"minimum_points": 3,
"minimum_size_scale_factor": 0.06,
"minimum_points": 5,
"minimum_size_scale_factor": 0.05,
"minimum_size_lower_cutoff": 10,
"cluster_selection_epsilon": 0.3,
"circle_overlap_ratio": 0.5,
"fractional_charge_threshold": 0.75,
"n_neighbors_outlier_test": 5
"outlier_scale_factor": 5
},
"Estimate": {
"mininum_total_trajectory_points": 50,
"maximum_distance_from_beam_axis": 30.0
"mininum_total_trajectory_points": 30
},
"Solver": {
"gas_data_path": "/path/to/some/gas.json",
3 changes: 2 additions & 1 deletion docs/api/core/index.md
@@ -13,4 +13,5 @@ The `core` module contains a lot of code that is used throughout Spyral (as well
- [point_cloud](point_cloud.md)
- [spy_log](spy_log.md)
- [track_generator](track_generator.md)
- [workspace](workspace.md)
- [workspace](workspace.md)
- [legacy_beam_pads](legacy_beam_pads.md)
5 changes: 5 additions & 0 deletions docs/api/core/legacy_beam_pads.md
@@ -0,0 +1,5 @@
# legacy_beam_pads Module

The legacy_beam_pads module contains an array of the pad numbers in the beam region for legacy data.

::: spyral.core.legacy_beam_pads
23 changes: 14 additions & 9 deletions docs/user_guide/config/cluster.md
@@ -6,13 +6,14 @@ The Cluster parameters which control the clustering, joining, and outlier detect
"Cluster":
{
"min_cloud_size": 50,
"smoothing_neighbor_distance(mm)": 10.0,
"minimum_points": 3,
"minimum_size_scale_factor": 0.06,
"smoothing_neighbor_distance(mm)": 15.0,
"minimum_points": 5,
"minimum_size_scale_factor": 0.05,
"minimum_size_lower_cutoff": 10,
"circle_overlap_ratio": 0.75,
"fractional_charge_threshold": 0.75,
"n_neighbors_outlier_test": 5
"cluster_selection_epsilon": 0.3,
"circle_overlap_ratio": 0.50,
"fractional_charge_threshold": 0.7,
"outlier_scale_factor": 0.05
},
```

@@ -28,7 +29,7 @@ The neighborhood search radius in millimeters for the smoothing algorithm. Befor

## minimum_points

The minimum number of samples (points) in a neighborhood for a point to be a core point. This is a re-exposure of the `min_samples` parameter of [scikit-learn's HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN). See their documentation for more details. This parameter needs more testing with more datasets! Please report any interesting behavior!
The minimum number of samples (points) in a neighborhood for a point to be a core point. This is a re-exposure of the `min_samples` parameter of [scikit-learn's HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN). See their documentation for more details. Larger values will make the algorithm more likely to identify points as noise. See the original [HDBSCAN docs](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#) for details on why this parameter is important and how it can impact the data.

## minimum_size_scale_factor

@@ -38,6 +39,10 @@ HDBSCAN requires a minimum size (the hyper parameter `min_cluster_size` in [scik

As discussed in the above `minimum_size_scale_factor`, we need to scale the `min_cluster_size` parameter to the size of the point cloud. However, there must be a lower limit (i.e. you can't have a minimum cluster size of 0). This parameter sets the lower limit; that is, any `min_cluster_size` calculated using the scale factor that is smaller than this cutoff is replaced with the cutoff value. As an example, if the cutoff is set to 10 and the calculated value is 50, the calculated value would be used. However, if the calculated value is 5, the cutoff would be used instead.
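
As a rough sketch of how the scale factor and lower cutoff interact (the function and variable names here are illustrative, not Spyral's internals):

```python
# Illustrative sketch: scale min_cluster_size to the point cloud size,
# then clamp it at the configured lower cutoff.
def effective_min_cluster_size(
    cloud_size: int, scale_factor: float = 0.05, lower_cutoff: int = 10
) -> int:
    return max(int(scale_factor * cloud_size), lower_cutoff)


effective_min_cluster_size(1000)  # -> 50, the scaled value is used
effective_min_cluster_size(100)   # -> 10, the cutoff is used instead
```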

## cluster_selection_epsilon

A re-exposure of the `cluster_selection_epsilon` parameter of [scikit-learn's HDBSCAN](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.HDBSCAN.html#sklearn.cluster.HDBSCAN). This parameter will merge clusters that are less than epsilon apart. Note that this epsilon must be on the scale of the scaled data (i.e. it is not in normal units). The impact of this parameter is large, and small changes to this value can produce dramatically different results. Larger values will bias the clustering to assume the point cloud is one single cluster (or all noise), while smaller values will cause the algorithm to revert to the default result of HDBSCAN. See the original [HDBSCAN docs](https://hdbscan.readthedocs.io/en/latest/parameter_selection.html#) for details on why this parameter is important and how it can impact the data.
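
For orientation, here is a hedged sketch of how these Cluster parameters might map onto scikit-learn's HDBSCAN interface (an illustration of the parameter names only, using made-up input data; this is not Spyral's actual clustering code):

```python
import numpy as np
from sklearn.cluster import HDBSCAN

# Illustrative input: an (N, 3) array of scaled point-cloud coordinates.
points = np.random.default_rng(0).normal(size=(500, 3))

clusterer = HDBSCAN(
    min_samples=5,  # "minimum_points"
    min_cluster_size=max(int(0.05 * len(points)), 10),  # scale factor + lower cutoff
    cluster_selection_epsilon=0.3,  # in scaled units, as noted above
)
labels = clusterer.fit_predict(points)  # -1 marks noise
```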

## circle_overlap_ratio

The minimum amount of overlap between circles fit to two clusters for the clusters to be joined together into a single cluster. HDBSCAN often fractures trajectories into multiple clusters as the point density changes due to the pad size, gaps, etc. These fragments are grouped together based on how much circles fit on their 2-D projections (X-Y) overlap.
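
One way to picture the overlap test (a geometric sketch only; the exact ratio definition and the function name are assumptions, not Spyral's implementation):

```python
import numpy as np

def circle_overlap_ratio(c1, r1, c2, r2) -> float:
    """Lens area shared by two circles, relative to the area of the smaller circle."""
    d = np.hypot(c2[0] - c1[0], c2[1] - c1[1])
    if d >= r1 + r2:
        return 0.0  # circles are disjoint
    if d <= abs(r1 - r2):
        return 1.0  # one circle lies entirely inside the other
    a1 = r1**2 * np.arccos((d**2 + r1**2 - r2**2) / (2.0 * d * r1))
    a2 = r2**2 * np.arccos((d**2 + r2**2 - r1**2) / (2.0 * d * r2))
    a3 = 0.5 * np.sqrt((-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2))
    return (a1 + a2 - a3) / (np.pi * min(r1, r2) ** 2)

# Two clusters would be joined if this ratio exceeds circle_overlap_ratio (e.g. 0.5).
```
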
@@ -46,6 +51,6 @@ The minimum amount of overlap between circles fit to two clusters for the cluste

The maximum allowed difference between the mean charge of two clusters (relative to the larger of the two means) for the clusters to be joined together into a single cluster. HDBSCAN often fractures trajectories into multiple clusters as the point density changes due to the pad size, gaps, etc. These fragments are grouped together based on how similar their mean charges are.
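
In rough pseudocode (a hedged sketch mirroring the description above, not the exact logic in `spyral/core/clusterize.py`):

```python
def charges_compatible(mean_q1: float, mean_q2: float, threshold: float) -> bool:
    # Relative difference of the two mean charges, measured against the larger mean.
    return abs(mean_q1 - mean_q2) / max(mean_q1, mean_q2) < threshold
```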

## n_neighbors_outlier_test
## outlier_scale_factor

Number of neighbors to sample for the Local Outlier Test, used to remove outliers from clusters. This is a re-exposure of the `n_neighbors` parameter of [scikit-learn's LocalOutlierFactor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html). See their documentation for more details. In general smaller numbers results in more aggressive data cleaning.
We use [scikit-learn's LocalOutlierFactor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html) as a last round of noise elimination on a cluster-by-cluster basis. This algorithm requires a number of neighbors to search over (the `n_neighbors` parameter). As with the `min_cluster_size` in HDBSCAN, we need to scale this value off the size of the cluster. This factor multiplied by the size of the cluster gives the number of neighbors to search over (`n_neighbors = outlier_scale_factor * cluster_size`). This value tends to have a "sweet spot" where it is most effective. If it is too large, every point has essentially the same outlier factor because the entire cluster is included for every point. If it is too small, the variance between neighbors can be too large and the results will be unpredictable. Note that if the value of `outlier_scale_factor * cluster_size` is less than 2, `n_neighbors` will be set to 2 as this is the minimum allowed value.
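
A minimal sketch of this scaling (illustrative only; it mirrors the description above rather than Spyral's exact code):

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def remove_outliers(cluster_points: np.ndarray, outlier_scale_factor: float) -> np.ndarray:
    # n_neighbors scales with the cluster size but never drops below 2.
    n_neighbors = max(int(outlier_scale_factor * len(cluster_points)), 2)
    labels = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(cluster_points)
    return cluster_points[labels == 1]  # -1 marks outliers, 1 marks inliers
```
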
5 changes: 0 additions & 5 deletions docs/user_guide/config/estimate.md
@@ -6,7 +6,6 @@ The Estimate parameters control the estimation phase. The default estimate param
"Estimate":
{
"mininum_total_trajectory_points": 30,
"maximum_distance_from_beam_axis": 30.0
},
```

@@ -15,7 +14,3 @@ A break down of each parameter:
## minimum_total_trajectory_points

Minimum number of points in a cluster to be considered a valid trajectory. Any clusters smaller will be ignored.

## maximum_distance_from_beam_axis

Maximum distance in millimeters from the z-axis for an estimated reaction vertex. Any trajectory which has a vertex further than this from the z-axis is ignored.
2 changes: 1 addition & 1 deletion docs/user_guide/phases/cluster.md
@@ -32,7 +32,7 @@ The code for this is found in `spyral/core/clusterize.py` in the functions `join

## Cleanup of the Final Clusters

As a last stage, a final noise removal is applied to each cluster. The [Local Outlier Factor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html#sklearn.neighbors.LocalOutlierFactor) is an outlier test which compares the distance to nearest neighbors (neighbor density) to determine outlier points. The number of neighbors to use is an exposed [configuration](../config/cluster.md) parameter.
As a last stage, a final noise removal is applied to each cluster. The [Local Outlier Factor](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.LocalOutlierFactor.html#sklearn.neighbors.LocalOutlierFactor) is an outlier test which compares the distance to nearest neighbors (neighbor density) to determine outlier points. The number of neighbors is controlled using the [`outlier_scale_factor`](../config/cluster.md) parameter.

The code for this is found in `spyral/core/cluster.py`.

31 changes: 30 additions & 1 deletion docs/user_guide/phases/estimate.md
@@ -35,10 +35,39 @@
$$
B\rho = \frac{B_{det} r_{circle}}{\sin(\theta_{polar})}
$$

where we divide by the sin of the polar angle as only the field perpendicular to the motion contributes to the cyclotron motion.
where we divide by the sine of the polar angle as only the field component perpendicular to the motion contributes to the cyclotron motion. Dividing by the sine of the polar angle gives the "total" Bρ.

Finally, we extract the stopping power (dE/dx) by summing the integrated charge along the trajectory and dividing by the total path length. In this way we have extracted all of the relevant physics parameters.
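
Putting the two estimates together in one place (a hedged numerical sketch; the names and units follow the text above, not Spyral's estimation code):

```python
import numpy as np

def estimate_brho_and_dedx(
    b_det: float,          # detector field (T)
    r_circle: float,       # fitted circle radius (m)
    polar: float,          # polar angle (rad)
    charges: np.ndarray,   # integrated charge per point (arb.)
    path_length: float,    # total path length (mm)
) -> tuple[float, float]:
    brho = b_det * r_circle / np.sin(polar)  # total Brho (Tm)
    dedx = charges.sum() / path_length       # average energy loss (arb.)
    return brho, dedx
```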

## Output

The output of the estimation phase is a set of dataframes saved to parquet files on a run-by-run basis in the `estimates` directory of the workspace. The following fields are available in the dataframes:

- event: the event number associated with this data
- cluster_index: the cluster index associated with this data
- cluster_label: the cluster label associated with this data
- ic_amplitude: the ion chamber charge amplitude for this event
- ic_centroid: the ion chamber centroid (time) for this event
- ic_integral: the ion chamber integrated charge for this event
- ic_multiplicity: the ion chamber peak multiplicity for this event (only relevant for legacy data)
- vertex_x: the estimated vertex x-coordinate (mm)
- vertex_y: the estimated vertex y-coordinate (mm)
- vertex_z: the estimated vertex z-coordinate (mm)
- center_x: the estimated best-fit circle center x-coordinate (mm)
- center_y: the estimated best-fit circle center y-coordinate (mm)
- center_z: the estimated best-fit circle center z-coordinate (mm)
- polar: the estimated trajectory polar angle (radians)
- azimuthal: the estimated trajectory azimuthal angle (radians)
- brho: the estimated trajectory total Bρ (Tm)
- dEdx: the estimated trajectory stopping power (average energy loss) (arb.)
- dE: the total integrated charge sum over the length of the first arc (arb.)
- arclength: the length of the first arc (mm)
- direction: indicates the detected direction of the trajectory (forward or backward)
- eloss: Summed integrated charge up to some cutoff (arb.) (Experimental, not used)
- cutoff_index: Index in the cluster up to which the eloss was summed. (Experimental, not used)

Units are given where relevant. Some values are experimental and not meant for use in final analysis.
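
As a quick orientation, the estimation output can be inspected with any parquet reader; the path and file name below are placeholders for your workspace and run:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder path and file name; substitute your workspace and run.
df = pd.read_parquet("/path/to/workspace/estimates/run_0001.parquet")

# Brho vs. dE/dx is the usual particle-ID view (see the next section).
plt.hist2d(df["dEdx"], df["brho"], bins=256)
plt.xlabel("dE/dx (arb.)")
plt.ylabel(r"B$\rho$ (Tm)")
plt.show()
```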

## Plotting and Particle ID

Once estimation is done, you can check the progress of the work and see how well Spyral is doing from a physics perspective. This involves plotting the Bρ vs. dE/dx relationship to examine particle groups as well as kinematics in the relationship Bρ vs. θ. To help with this Spyral ships with a notebook `particle_id.ipynb` which will allow you to plot and gate on particle groups. To launch the notebook use the following command from the Spyral repo:
37 changes: 30 additions & 7 deletions docs/user_guide/phases/solve.md
@@ -11,34 +11,57 @@ Below we'll break down each of these steps

Before Spyral does anything, it will check to see if you're planning to run the solve phase. If you are, it will check whether the interpolation mesh you requested exists (see the [configuration](../config/solver.md)); if the mesh does not exist, it will generate one.

In the [configuration](../config/solver.md) you can specify the coarseness of the mesh; the particle species is taken from the given particle ID gate. Spyral walks through the kinetic energy, polar angle range at the steps needed for the number of bins requested, making a complete trajectory for each energy, angle. The time window range for the ODE solver is 1 μs. The number of time steps is a configuration parameter and can be used to optimize resolution. Each step saves 3 double precision floats (8 bytes) for 24 bytes per point. Multiply this by the total number of bins (energy bins x angle bins x timesteps) to get the total size of the mesh in bytes. Even trajectories which stop before the 1.0 μs time window have the same total data size. This will give you an idea of what the size of your mesh will be.
In the [configuration](../config/solver.md) you can specify the coarseness of the mesh; the particle species is taken from the given particle ID gate. Spyral walks through the kinetic energy and polar angle ranges at the steps needed for the requested number of bins, making a complete trajectory for each energy, angle pair. Care is taken to handle the time range of the solving. First, Spyral walks through the phase space of the configuration and finds the time range of the longest trajectory (in time) using an initial coarse time window of 1 μs. Then Spyral generates the actual mesh using this refined time window. The number of time steps is a configuration parameter and can be used to optimize resolution. Each step saves 3 double-precision floats (8 bytes each) for 24 bytes per point. Multiply this by the total number of bins (energy bins x angle bins x time steps) to get the total size of the mesh in bytes. Even trajectories which stop before the end of the time window have the same total data size. This will give you an idea of what the size of your mesh will be.
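
The size estimate described above reduces to a one-line calculation (the bin counts here are illustrative only):

```python
def mesh_size_bytes(energy_bins: int, angle_bins: int, time_steps: int) -> int:
    # Each time step stores 3 double-precision floats (x, y, z) = 24 bytes.
    return 24 * energy_bins * angle_bins * time_steps

# e.g. 800 energy bins x 180 angle bins x 1000 time steps is about 3.5 GB
mesh_size_bytes(800, 180, 1000) / 1.0e9
```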

The mesh size is important for several reasons. First, a finer mesh (more bins) obviously gives better results, as it reduces the degree of interpolation. But since a finer mesh is bigger, the computational strain is larger. The interpolation is a bilinear interpolation; it requires the *entire* mesh to be loaded into memory. This limits the size of the mesh to the amount of memory in your machine divided by the number of parallel processes.

Finally, when a trajectory is requested, Spyral interpolates on energy and angle to get a trajectory of 500 points or fewer (points where the particle stopped are trimmed). Spyral then interpolates the trajectory in z, returning an interpolated (x,y) for a given z value.
Finally, when a trajectory is requested, Spyral interpolates on energy and angle to get a trajectory (points where the particle stopped are trimmed).

When a trajectory is requested, the full energy, polar angle, azimuthal angle, and reaction vertex must be given. The azimuthal angle and vertex are used to rotate and translate the trajectory to the proper orientation (this reduces the phase space of the interpolation mesh).
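
Conceptually, the rotation and translation look like the following (a hedged sketch; it assumes the interpolated trajectory is an (N, 3) array generated at zero azimuthal angle with its vertex at the origin):

```python
import numpy as np

def orient_trajectory(traj: np.ndarray, azimuthal: float, vertex: np.ndarray) -> np.ndarray:
    # Rotate about the z-axis by the azimuthal angle, then translate to the vertex.
    rotation = np.array(
        [
            [np.cos(azimuthal), -np.sin(azimuthal), 0.0],
            [np.sin(azimuthal), np.cos(azimuthal), 0.0],
            [0.0, 0.0, 1.0],
        ]
    )
    return traj @ rotation.T + vertex
```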

In general, it has been found that 0.5 degree steps in polar angle are best, while 500 keV steps in energy are acceptable. However, this may need tuning from experiment to experiment. The interpolation code can be found in `spyral/core/track_generator.py` as well as `spyral/interpolate/`.

## Fitting to Data

Spyral performs a least squares fit to data using the [lmfit](https://lmfit.github.io/lmfit-py/) library. Based on the estimated parameters, Spyral requests a trajectory from the interpolator. For each data point, the interpolator calculates (x,y) from the data z. The difference between data (x,y) and interpolator (x,y) is minimized to find the best parameters.

Special care is taken to handle cases where a bad Bρ guess leads to an interpolated trajectory which is too short to accommodate the data. At the start of solving the interpolated trajectory is checked to see if it is too short; if it is, Bρ is slowly "wiggled" until the trajectory is long enough. Once fitting begins, if the minimizer walks to energies which result in too-small trajectories, the points for which there is no interpolation value are given a large error (1.0e6). This biases the minimizer to pick trajectories which are long enough to cover all of the data.
Spyral performs a least squares fit to data using the [lmfit](https://lmfit.github.io/lmfit-py/) library. Spyral uses the L-BFGS-B quasi-Newton minimization scheme to minimize the average error between the data and the trajectory. The average error is computed by looping over the data and the trajectory and calculating the smallest distance between the given data point and a point on the trajectory. These minimum distances are then averaged.
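
A minimal sketch of that objective (assuming the data and the interpolated trajectory are (N, 3) and (M, 3) position arrays; this mirrors the description above, not the Numba-compiled implementation):

```python
import numpy as np

def average_minimum_distance(data: np.ndarray, trajectory: np.ndarray) -> float:
    # For every data point, find the closest trajectory point, then average.
    diffs = data[:, None, :] - trajectory[None, :, :]  # (N, M, 3)
    distances = np.linalg.norm(diffs, axis=2)          # (N, M)
    return distances.min(axis=1).mean()
```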

The best fit parameters and their uncertainties are written to disk in an Apache parquet file. The &chi;<sup>2</sup> and reduced &chi;<sup>2</sup> are also written.

Solver code can be found in `spyral/solvers/solver_interp.py`.

## Output

The output of the solver phase is a dataframe written to a parquet file in the `physics` directory of the workspace. Files are named by run and the particle species from the particle ID gate. The available values in the data frame are as follows:

- event: The event number associated with this data
- cluster_index: the cluster index associated with this data
- cluster_label: the cluster label associated with this data
- vertex_x: the fitted vertex x-coordinate in meters
- sigma_vx: the error on the fitted vertex x-coordinate (unused) in meters
- vertex_y: the fitted vertex y-coordinate in meters
- sigma_vy: the error on the fitted vertex y-coordinate (unused) in meters
- vertex_z: the fitted vertex z-coordinate in meters
- sigma_vz: the error on the fitted vertex z-coordinate (unused) in meters
- brho: the fitted (total) B&rho; in Tm
- sigma_brho: the error on the fitted total B&rho; in Tm (unused)
- polar: the fitted polar angle of the trajectory in radians
- sigma_polar: the error on the fitted polar angle of the trajectory in radians (unused)
- azimuthal: the resulting azimuthal angle (not fitted) in radians. This is calculated from the best fit vertex position.
- sigma_azimuthal: error on the resulting azimuthal angle (not fitted) in radians (unused)
- redchisq: the reduced &chi;<sup>2</sup> of the fit (the average error)

The error values are unused because the L-BFGS-B optimization scheme does not generally report parameter uncertainties.
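
As with the estimation output, the solver results can be loaded with any parquet reader; the path, file name, and quality cut below are placeholders:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder path and file name; substitute your workspace, run, and particle.
df = pd.read_parquet("/path/to/workspace/physics/run_0001_1H.parquet")
good = df[df["redchisq"] < 1.0e-3]  # example quality cut; tune to your data

# Kinematics view: Brho vs. polar angle of the fitted trajectories.
plt.hist2d(good["polar"], good["brho"], bins=256)
plt.xlabel("Polar angle (rad)")
plt.ylabel(r"B$\rho$ (Tm)")
plt.show()
```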

## Final Thoughts

Generating a good interpolation mesh is key. If the mesh is too coarse the results will become equally coarse.

Through testing, it was found that the reaction vertex is *highly* correlated to the azimuthal and polar angles. As such, only the vertex z position is actually fit; x and y are fixed. This greatly reduces the correlation. More testing should be done to look for alternatives.
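
In lmfit terms, fixing a parameter is just a matter of setting `vary=False`; a hedged sketch (the parameter names, starting values, and bounds here are illustrative, not Spyral's exact setup):

```python
import lmfit
import numpy as np

params = lmfit.Parameters()
params.add("vertex_x", value=0.001, vary=False)       # fixed from the estimate (m)
params.add("vertex_y", value=-0.002, vary=False)      # fixed from the estimate (m)
params.add("vertex_z", value=0.35, min=0.0, max=1.0)  # fitted (m)
params.add("brho", value=0.8, min=0.0)                # fitted (Tm)
params.add("polar", value=1.1, min=0.0, max=np.pi)    # fitted (rad)

# result = lmfit.minimize(objective, params, method="lbfgsb", args=(data,))
```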

To make performance reasonable [Numba](../numba.md) is used to just-in-time compile the interpolation scheme. Checkout that section of the docs for more details.
To make performance reasonable [Numba](../numba.md) is used to just-in-time compile the interpolation scheme and objective function. Check out that section of the docs for more details.

Other solvers have been used/tried. Of particular interest is using the Unscented Kalman Filter, which feels like a really great solution to our problem, but has some critical issues (uneven step size, gaps in trajectories, etc.). See `spyral/solvers/solver_kalman.py` if you're curious. Note that that code is out-of-date with the rest of Spyral. In the future it will most likely be relegated to an isolated branch of the repository.

Solving is very fast, typically the fastest phase in Spyral. Thanks Numba!
An additional method of interest is least-squares fitting using a secondary interpolation on z from the trajectory. However, in practice this was found to give generally worse results than the current method; more concrete testing is needed to verify this behavior.

Solving is typically fast relative to the longer phases like the point cloud phase; however, if the previous phases are not well optimized, the solver can take significantly longer due to noisy data.
