Commit

Merge pull request #56 from LewisResearchGroup/develop
Minor bug fixes.
sorenwacker authored Jul 30, 2024
2 parents 65f5758 + 822b8c7 commit 8765133
Showing 14 changed files with 126 additions and 41 deletions.
13 changes: 10 additions & 3 deletions README.md
@@ -18,18 +18,25 @@ More information on how to install and run the program can be found in the [Docu

![Screenshot](docs/gallery/MINT-interface-1.png)

## News
Starting with version 1.0.0, we have updated the installation setup to use pyproject.toml. Additionally, the main script to start Mint has been changed from Mint.py to Mint. Furthermore, each release of the repository will now be assigned a DOI to facilitate citation of the software.

## Publications that used Mint
1. Brown K, Thomson CA, Wacker S, Drikic M, Groves R, Fan V, et al. [Microbiota alters the metabolome in an age- and sex- dependent manner in mice.](https://pubmed.ncbi.nlm.nih.gov/36906623/) Nat Commun. 2023;14: 1348.

2. Ponce LF, Bishop SL, Wacker S, Groves RA, Lewis IA. [SCALiR: A Web Application for Automating Absolute Quantification of Mass Spectrometry-Based Metabolomics Data. Anal Chem.](https://pubs.acs.org/doi/10.1021/acs.analchem.3c04988) 2024;96: 6566–6574.

## Installation
You can find installation instructions [here](https://lewisresearchgroup.github.io/ms-mint-app/install/)

## Contributions
All contributions, bug reports, code reviews, bug fixes, documentation improvements, enhancements, and ideas are welcome.
Before you modify the code please reach out to us using the [issues](https://github.com/LewisResearchGroup/ms-mint/issues) page.
All contributions, bug reports, code reviews, bug fixes, documentation improvements, enhancements, and ideas are welcome. This includes recommendations for software architecture, code design, and efficiency improvements.

## Code standards
The project follows the PEP8 standard and uses Black and Flake8 to ensure a consistent code format throughout the project.

## Get in touch
Open an [issue](https://github.com/LewisResearchGroup/ms-mint-app/issues) or join the [slack](https://ms-mint.slack.com/) channel.
To get in touch, please open a GitHub [issue](https://github.com/LewisResearchGroup/ms-mint-app/issues).

## Acknowledgements
This project would not be possible without the help of the open-source community.
Binary file added docs/gallery/analysis-pca.png
120 changes: 97 additions & 23 deletions docs/gui.md
@@ -63,6 +63,8 @@ with the optimization tools.

## Add Metabolites

> Since version 1.0.0 this functionality has been removed and will be provided as an optional plugin.
- Search for metabolites from ChEBI three stars database
- Add selected metabolites to peaklist (without RT estimation)

@@ -94,7 +96,6 @@ identifies the closest peak with respect to the expected RT which is displayed as
- Remove peaks from peaklist
- Set expected retention time


![Manual peak optimization](image/manual-peak-optimization.png "Manual peak optimization")

When a peak is selected in the drop down box the chromatograms for the particular mass windows
@@ -110,46 +111,107 @@ to the current view and updated the peaklist accordingly.

![Processing](image/processing.png "Processing")


When all peaks look good the data can be processed using `RUN MINT`. This will apply
the current peaklist to the MS-files in the workspace and extract additional properties.
When the results tables are present the results can be explored with the following tabs.
The generated results can be downloaded with the `DOWNLOAD` button.

- `RUN MINT`: Will process all files in the workspace using the current target list. The progress is displayed in the progress bar on the top.
- `DOWNLOAD ALL RESULTS`: The generated results can be downloaded in tidy format.
- `DOWNLOAD DENSE MATRIX`: This will download a dense data table with targets as rows and files as columns. The observable used for the cells can be selected in the drop down menu. Optionally, you can transpose the table by checking the `Transposed` checkbox.
- `DELETE RESULTS`: Delete results file if present, and start from scratch.
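
The relation between the tidy results and the dense matrix can be pictured as a pivot. The following is a minimal sketch in plain Python, not MINT's own implementation; the column name `ms_file` and the helper `to_dense` are assumptions made for illustration.

```python
def to_dense(rows, value="peak_area_top3", transposed=False):
    """Pivot tidy result rows into a nested dict: dense[row_key][col_key].

    `rows` is a list of dicts with the (assumed) keys `ms_file`,
    `peak_label` and the chosen value column. Targets become rows and
    files become columns unless `transposed` is True.
    """
    if transposed:
        row_key, col_key = "ms_file", "peak_label"
    else:
        row_key, col_key = "peak_label", "ms_file"
    dense = {}
    for r in rows:
        dense.setdefault(r[row_key], {})[r[col_key]] = r[value]
    return dense


# Hypothetical tidy results: one row per (file, target) pair.
tidy = [
    {"ms_file": "A.mzML", "peak_label": "Glucose", "peak_area_top3": 10.0},
    {"ms_file": "B.mzML", "peak_label": "Glucose", "peak_area_top3": 12.0},
    {"ms_file": "A.mzML", "peak_label": "Lactate", "peak_area_top3": 3.0},
]
dense = to_dense(tidy)
```

Combinations that are missing from the tidy table (here `Lactate` in `B.mzML`) are simply absent from the dense structure.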

## Analysis
## Quality Control
Analytical visualizations to display a few quality metrics and comparisons. The `m/z drift` compares the observed m/z values with the ones set in the target list. This value will always be lower than the `mz_width` set in the target list for each target. It is one way of evaluating how well the machine is calibrated. Generally speaking, values between [-5, 5] are acceptable, but it depends on the specific assay and experiment.
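
As a rough illustration of what such a drift value means, the sketch below computes the relative deviation of an observed m/z from its target value. It assumes the drift is expressed in ppm (which the comparison with `mz_width` suggests); the function name `mz_drift_ppm` and the observed value are made up for this example and are not taken from MINT.

```python
def mz_drift_ppm(mz_observed, mz_expected):
    """Relative deviation of an observed m/z from its target value, in ppm.

    Assumption: drift is measured in parts per million, like `mz_width`.
    """
    return (mz_observed - mz_expected) / mz_expected * 1e6


# Target m/z of 151.02585 with a hypothetical observed value.
drift = mz_drift_ppm(151.02630, 151.02585)  # about 3 ppm
```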

The graphs are categorized by `sample_type` set in the Metadata tab. You should have some quality control, or calibration samples with known metabolite composition, to be able to make judgements about the quality.

The second plot breaks down the `m/z drift` by target, to see how the calibration varies between targets.

The PCA (Principal Components Analysis) plot shows a PCA using `peak_area_top3`. You can compare different groups of samples as set in the `sample_types` column in the Metadata tab.

The final plot displays peak shapes of a random sample of files for all targets. To change the sample, you can refresh this page.

## Analysis
After running MINT the results can be downloaded or analysed using the provided tools.
For quality control purposes histograms and boxplots can be generated in the
quality control tab. The interactive heatmap tool can be used to explore the results data after `RUN MINT`
has been executed. The tool lets you explore the generated data in the form of heatmaps.


## General selection elements

## Selections and transformations
- Include/exclude file types (based on `Type` column in metadata)
- Include/exclude peak labels for analysis
- Set file sorting (e.g. by name, by batch etc.)
- Select group-by column for coloring and statistics

![Selections](image/general-selection-elements.png "Selections")

- `Types of files to include`: Uses the `sample_types` column in the Metadata tab to select files. If nothing is selected, all files are included.
- `Include peak_labels`: Targets to include. If nothing is selected all targets are included.
- `Exclude peak_labels`: Targets to exclude. If nothing is selected no target is excluded.
- `Variable to plot`: This determines which column to run the analysis on. For example, you can set this to `peak_mass_diff_50pc` to analyse the instrument calibration. The default is `peak_area_top3`.
- `MS-file sorting`: Sets the order of the MS-files in the underlying dataframe before plotting. This will change the order of files in some plots.
- `Color by`: The PCA and `Plotting` tools can use a categorical or numeric column to color code samples. Some plots (e.g. the Hierarchical clustering tool) are unaffected.
- `Transformation`: The values can be log transformed before being subjected to normalization. If nothing is selected, the raw values are used.
- `Scaling group(s)`: Column or selection of columns used to group the data; the normalization function chosen in the dropdown menu is applied to each group separately. If you want z-scores for each target, select `peak_label` here and `Standard scaling` in the dropdown menu.
- `Scaling technique`: You can choose between standard scaling, min-max scaling, robust scaling, or no scaling (if nothing is selected).

## Heatmap
### Scaling Techniques

#### 1. Standard Scaling

**Standard scaling** (also known as z-score normalization) transforms the data such that the mean of each feature becomes 0 and the standard deviation becomes 1. This is useful when the features have different units or magnitudes, as it ensures they are on the same scale.

The formula for standard scaling is:

z = (x - mean) / standard_deviation

Where:
- `x` is the original value.
- `mean` is the mean of the feature.
- `standard_deviation` is the standard deviation of the feature.

#### 2. Robust Scaling

**Robust scaling** is used to scale features using statistics that are robust to outliers. This scaling technique uses the median and the interquartile range (IQR) instead of the mean and standard deviation, making it more suitable for datasets with outliers.

The formula for robust scaling is:

x_scaled = (x - median) / IQR

Where:
- `x` is the original value.
- `median` is the median of the feature.
- `IQR` is the interquartile range of the feature (IQR = Q3 - Q1).

#### 3. Min-Max Scaling

**Min-max scaling** (also known as normalization) transforms the data to fit within the range [0, 1]. This scaling technique is useful when you want to preserve the relationships within the data, but want to adjust the scale.

The formula for min-max scaling is:

x_scaled = (x - x_min) / (x_max - x_min)

Where:
- `x` is the original value.
- `x_min` is the minimum value of the feature.
- `x_max` is the maximum value of the feature.
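
The three formulas above can be sketched with the Python standard library. This is an illustration only, not the code MINT runs; it assumes the population standard deviation for standard scaling and linearly interpolated quartiles for the IQR, and the function names are hypothetical.

```python
from statistics import mean, median, pstdev, quantiles


def standard_scale(values):
    """z = (x - mean) / standard_deviation (population standard deviation)."""
    m, s = mean(values), pstdev(values)
    return [(x - m) / s for x in values]


def robust_scale(values):
    """x_scaled = (x - median) / IQR, with IQR = Q3 - Q1."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    med = median(values)
    return [(x - med) / (q3 - q1) for x in values]


def min_max_scale(values):
    """x_scaled = (x - x_min) / (x_max - x_min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]


# Example values for one feature, including a single outlier.
peak_areas = [1.0, 2.0, 3.0, 4.0, 100.0]
```

With the outlier in `peak_areas`, robust scaling keeps the four regular values evenly spaced around 0, whereas min-max scaling squeezes them close to 0. All three functions divide by a spread measure, so a constant feature (zero spread) would need separate handling.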


## Heatmap
![Heatmap](image/heatmap.png "Heatmap")

The first dropdown menu lets you include certain file types, e.g. biological samples rather than quality control samples.
The second dropdown menu determines how the heatmap is generated.

- Normalized by biomarker: divide values by column maximum.
- Cluster: Cluster rows with hierarchical clustering.
- Dendrogram: Plots a dendrogram instead of row labels.
- Transpose: Switch columns and rows.
- Correlation: Calculate Pearson correlation between columns.
- Show in new tab: The figure will be generated in a new independent tab. That way multiple heatmaps can be generated at the same time.
- `Cluster`: Cluster rows with hierarchical clustering.
- `Dendrogram`: Plots a dendrogram instead of row labels (only in combination with `Cluster`).
- `Transpose`: Switch columns and rows.
- `Correlation`: Calculate Pearson correlation between columns.
- `Show in new tab`: The figure will be generated in a new independent tab. That way multiple heatmaps can be generated at the same time. This may only work when you serve MINT locally, since the plot is served on a different port. If the app becomes unresponsive to changes, reload the tab.
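
To make the `Correlation` option more concrete, here is a minimal sketch of a pairwise Pearson correlation between columns in plain Python. It is not MINT's implementation; the helper names and the example values are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_matrix(columns):
    """`columns` maps a column name to its list of values (one per file)."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical scaled values for three metabolites across four files.
columns = {
    "Glucose": [1.0, 2.0, 3.0, 4.0],
    "Lactate": [2.0, 4.0, 6.0, 8.0],
    "Pyruvate": [4.0, 3.0, 2.0, 1.0],
}
corr = correlation_matrix(columns)
```

The resulting matrix is symmetric with ones on the diagonal, which is what the correlation heatmap displays.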

### Correlation of (scaled) peak_max
### Example: Plot correlation between metabolites using scaled peak_area_top3 values

![Heatmap](image/heatmap-correlation.png "Correlation")

@@ -160,6 +222,8 @@ The second dropdown menu determines how the heatmap is generated.
- Density distributions
- Boxplots

### Example: Box-plot of scaled peak_area_top3 values by metabolite

![Quality Control](image/distributions.png "Quality Control")

The MS-files can be grouped based on the values in the metadata table. If nothing
@@ -169,9 +233,9 @@ to generate. The third dropdown menu allows to include certain file types.
For example, the analysis can be limited to only the biological samples if such a
type has been defined in the type column of the metadata table.

The checkbox can be used to create a dense view. If the box is unchecked the output will be
visually grouped into an individual section for each metabolite.
The checkbox can be used to create a dense view. If the box is unchecked the output will be visually grouped into an individual section for each metabolite.

The plots are interactive. You can switch off labels, zoom in on particular areas of interest, or hover the mouse cursor over a datapoint to get more information about the underlying sample and/or target.

## PCA

@@ -183,20 +247,30 @@ visually grouped into an individual section for each metabolite.


## Hierarchical clustering
Hierarchical clustering is a technique for cluster analysis that seeks to build a hierarchy of clusters. It can be divided into two main types: **agglomerative** and **divisive**. MINT uses agglomerative hierarchical clustering. This approach, also known as bottom-up clustering, starts with each data point as a separate cluster and iteratively merges the closest clusters until all points are in a single cluster or a stopping criterion is met.

![Hierarchical clustering](image/hierarchical_clustering.png "Hierarchical clustering")
### Steps for Agglomerative Clustering
1. **Initialization**: Start with each data point as its own cluster.
2. **Distance Calculation**: Compute the pairwise distance between all clusters.
3. **Merge Closest Clusters**: Find the two closest clusters and merge them into a single cluster.
4. **Update Distances**: Recalculate the distances between the new cluster and all other clusters.
5. **Repeat**: Repeat steps 3 and 4 until all data points are in a single cluster or the desired number of clusters is achieved.
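
The steps above can be sketched as a naive loop. The example uses single linkage (the distance between two clusters is the distance between their closest members) and Euclidean distance, both chosen here only for illustration; it is not the routine MINT uses, and the function name `agglomerative` is hypothetical.

```python
from itertools import combinations
from math import dist


def agglomerative(points, n_clusters=1):
    """Naive single-linkage agglomerative clustering (illustration only)."""
    # Step 1: every point starts as its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]
    merges = []  # (members_a, members_b, distance) for each merge, in order

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(dist(points[i], points[j]) for i in a for j in b)

    # Step 5: repeat until the desired number of clusters remains.
    while len(clusters) > n_clusters:
        # Steps 2-3: compute pairwise distances and pick the closest pair.
        a, b = min(combinations(clusters, 2), key=lambda pair: linkage(*pair))
        merges.append((a, b, linkage(a, b)))
        # Step 4: replace the pair with the merged cluster; distances to it
        # are recomputed from scratch on the next pass.
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(a + b)
    return clusters, merges


points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
clusters, merges = agglomerative(points, n_clusters=2)
```

The recorded merge distances are the values a dendrogram would draw as branch heights. Recomputing all distances in every pass keeps the sketch short but is slow for large inputs.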

### Dendrogram

## Plotting
The output of hierarchical clustering is often visualized using a dendrogram, which is a tree-like diagram that shows the arrangement of clusters and their hierarchical relationships. Each branch of the dendrogram represents a merge or split, and the height of the branches indicates the distance or dissimilarity between clusters.

MINT comes with a flexible and powerful plotting interface that is based on the powerful [Seaborn](http://seaborn.pydata.org/) library.
### Example: Hierarchical clustering with different metrics using z-scores (for each metabolite)
![Hierarchical clustering](image/hierarchical-clustering.png "Hierarchical clustering")

- Bar plots
- Violin plots
- Boxen plot
- Scatter plots
- and more...
## Plotting
With great power comes great responsibility. The plotting tool can generate impressive and very complex plots, but it can be a bit overwhelming in the beginning. It uses the [Seaborn](http://seaborn.pydata.org/) library under the hood. Familiarity with this library can help in understanding what the different settings do. We recommend starting with a basic plot and then increasing its complexity step by step.

- Bar plots
- Violin plots
- Boxen plot
- Scatter plots
- and more...

![Plot setting](image/plotting_settings.png "Plot settings")

Binary file modified docs/image/distributions.png
Binary file modified docs/image/general-selection-elements.png
Binary file modified docs/image/heatmap-correlation.png
Binary file modified docs/image/heatmap.png
Binary file added docs/image/hierarchical-clustering.png
Binary file removed docs/image/hierarchical_clustering.png
Binary file not shown.
Binary file modified docs/image/pca.png
Binary file modified docs/image/processing.png
14 changes: 9 additions & 5 deletions docs/targets.md
@@ -10,12 +10,13 @@ The target list is the determining protocol for the data processing step. You ca
The target list should contain the following column headers:

- **peak_label** : A __unique__ identifier such as the biomarker name or ID. Even if multiple peaklist files are used, the labels have to be unique across all the files.
- **peak_label** : A __unique__ identifier such as the biomarker name or ID.
- **mz_mean** : The target mass (m/z-value) in [Da].
- **mz_width** : The width of the peak in the m/z-dimension in units of ppm. The window will be *mz_mean* +/- (mz_width * mz_mean * 1e-6). Usually, values between 5 and 10 are used.
- **rt** : Estimated retention time in [min] (optional, see above).
- **rt_min** : The start of the retention time for each peak in [min].
- **rt_max** : The end of the retention time for each peak in [min].
- **rt** : Estimated retention time (optional, see above), for reference and used in automated peak optimization.
- **rt_min** : The start of the retention time for each peak.
- **rt_max** : The end of the retention time for each peak.
- **rt_unit** : Time unit can be `min` (minutes) or `s` (seconds). Mint will always convert the values to seconds.
- **intensity_threshold** : A threshold that is applied to filter noise for each window individually. Can be set to 0 or any positive value.
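
The m/z window formula and the retention time unit conversion described above can be illustrated with a short sketch. The helper names `mz_window` and `rt_to_seconds` are hypothetical and only mirror the column descriptions; they are not part of the Mint API.

```python
def mz_window(mz_mean, mz_width_ppm):
    """Return (lower, upper) m/z bounds: mz_mean +/- mz_width * mz_mean * 1e-6."""
    delta = mz_width_ppm * mz_mean * 1e-6
    return mz_mean - delta, mz_mean + delta


def rt_to_seconds(rt, rt_unit):
    """Convert a retention time given in `min` or `s` to seconds."""
    factors = {"min": 60.0, "s": 1.0}
    return rt * factors[rt_unit]


# A target at m/z 151.02585 with a width of 10 ppm.
low, high = mz_window(151.02585, 10)
rt_min_seconds = rt_to_seconds(4.18, "min")
```

For this target the window extends roughly 0.0015 Da to either side of `mz_mean`, which shows how narrow a 10 ppm window is at this mass.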

#### Example file
@@ -27,4 +28,7 @@ Biomarker-B,151.02585,10,4.18,4.53,0
```


A template can be created using the [GUI](gui.md).
A template can be created using the [GUI](gui.md):

1. Go to the targets tab.
2. Click on `EXPORT` to download a `target.csv` file with all necessary columns.
14 changes: 7 additions & 7 deletions src/ms_mint_app/plugins/analysis.py
@@ -49,9 +49,9 @@ def outputs(self):
var_name_options = T.list_to_options(RESULTS_COLUMNS)

scaler_options = [
{"value": "standard", "label": "Standard Scaling (z-scores)"},
#{"value": "minmax", "label": "MinMax Scaling"},
{"value": "robust", "label": "Robust Scaling"}
{"value": "standard", "label": "Standard scaling (z-scores)"},
{"value": "minmax", "label": "Min-Max scaling"},
{"value": "robust", "label": "Robust scaling"}
]


@@ -92,7 +92,7 @@ def outputs(self):
dcc.Dropdown(
id="ana-var-name",
options=var_name_options,
value='peak_max',
value='peak_area_top3',
placeholder="Variable to plot",
)
]),
@@ -117,14 +117,14 @@ def outputs(self):
id="ana-groupby",
options=[],
value=None,
placeholder="Normalize by",
placeholder="Scaling group(s)",
multi=True,
),
dcc.Dropdown(
id="ana-scaler",
options=scaler_options,
value=[],
placeholder="Scaler",
value=None,
placeholder="Scaling method",
multi=False,
),
]),
6 changes: 3 additions & 3 deletions src/ms_mint_app/plugins/analysis_tools/plotting.py
@@ -469,7 +469,7 @@ def create_figure(
sharex="share-x" in options,
sharey="share-y" in options,
dodge="no-dodge" not in options,
facet_kws=dict(legend_out=True),
#facet_kws=dict(legend_out=True),
)

try:
@@ -488,8 +488,8 @@ def create_figure(
**kwargs
)
except Exception as e:
logging.error(e)
return dbc.Alert(str(e), color="danger")
logging.error(f"Failed to generate plot: {e}\nwith arguments:\n{kwargs}")
return "" #dbc.Alert(str(e), color="danger")

g.fig.subplots_adjust(top=0.9)
g.set_titles(col_template="{col_name}", row_template="{row_name}", y=1.05)