diff --git a/scripts/test_levenshtein.ipynb b/scripts/test_levenshtein.ipynb index fc8f9bf6..4718c386 100644 --- a/scripts/test_levenshtein.ipynb +++ b/scripts/test_levenshtein.ipynb @@ -1,29 +1,52 @@ { "cells": [ { - "cell_type": "code", - "execution_count": 25, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exploring the Impact of Evaluation Order on Edit Distance Algorithms\n", + "\n", + "Removing data-dependencies in the Wagner-Fisher, Needleman-Wunsch, Smith-Waterman, and Gotoh Dynamic Programming algorithms to explain the hardware-accelerated variants in StringZilla." ] }, + { + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "import numpy as np\n", - "import random" + "## Levenshtein Distance\n", + "\n", + "Levenshtein edit distance is one of the most broadly studied string similarity metrics.\n", + "It is defined as the minimum number of single-character insertions, deletions, and substitutions required to change one string into another.\n", + "The Levenshtein distance between two strings is calculated using dynamic programming algorithms, such as the Wagner-Fisher algorithm, and its variations for Bioinformatics: \n", + "\n", + "- Needleman-Wunsch for global alignment with substitution matrices, \n", + "- Smith-Waterman for local alignment with substitution matrices, \n", + "- Gotoh for different penalties for gap opening and extensions.\n", + "\n", + "Given the shared nature of these algorithms, the same tricks can be applied to all of them to improve their performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Exploring the Impact of Evaluation Order on the Wagner Fisher Algorithm for Levenshtein Edit Distance" + "## Wagner-Fisher Algorithm\n", + "\n", + "The Wagner-Fisher algorithm, in its most naive form, has a time and space complexity of $O(NM)$, where $N$ and $M$ are the lengths of the two strings being compared.\n", + "A rectangular matrix of size $(N+1) \times (M+1)$ is created to store the edit distances between all prefixes of the two strings.\n", + "The first column and row are, naturally, initialized with ${0, 1, 2, ..., N}$ and ${0, 1, 2, ..., M}$ respectively." ] }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ - "def algo_v0(s1, s2) -> int:\n", + "from typing import Tuple\n", + "import numpy as np # NumPy for matrices\n", + "\n", + "def wagner_fisher(s1: str, s2: str) -> Tuple[int, np.ndarray]:\n", " # Create a matrix of size (len(s1)+1) x (len(s2)+1)\n", " matrix = np.zeros((len(s1) + 1, len(s2) + 1), dtype=int)\n", "\n", @@ -38,12 +61,12 @@ " for j in range(1, len(s2) + 1):\n", " substitution_cost = s1[i - 1] != s2[j - 1]\n", " matrix[i, j] = min(\n", - " matrix[i - 1, j] + 1, # Deletion\n", - " matrix[i, j - 1] + 1, # Insertion\n", - " matrix[i - 1, j - 1] + substitution_cost, # Substitution\n", + " matrix[i - 1, j] + 1, #? Deletion cost\n", + " matrix[i, j - 1] + 1, #? Insertion cost\n", + " matrix[i - 1, j - 1] + substitution_cost, #?
Substitution cost\n", " )\n", "\n", - " # Return the Levenshtein distance\n", + " # The distance will be placed in the bottom right corner of the matrix\n", " return matrix[len(s1), len(s2)], matrix" ] }, @@ -51,25 +74,32 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Accelerating this exact algorithm isn't trivial, is the `matrix[i, j]` value has a dependency on the `matrix[i, j-1]` value.\n", - "So we can't brute-force accelerate the inner loop.\n", - "Instead, we can show that we can evaluate the matrix in a different order, and still get the same result." + "This algorithm is almost never recommended for practical use, as it has a quadratic space complexity.\n", + "It's trivial to see that the space complexity can be reduced to $O(\min(N, M))$ by only storing the last two rows of the matrix, but we want to keep the entire matrix as a reference to allow debugging and visualization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "![](https://mathworld.wolfram.com/images/eps-svg/SkewDiagonal_1000.svg)" + "## Diagonal Evaluation Order\n", + "\n", + "Accelerating this exact algorithm with SIMD instructions isn't trivial, as the `matrix[i, j]` value has a dependency on the `matrix[i, j - 1]` value.\n", + "So we can't brute-force accelerate the inner loop.\n", + "Instead, we can show that we can evaluate the matrix in a different order, and still get the same result.\n", + "\n", + "![Skewed Diagonals Evaluation Order](https://mathworld.wolfram.com/images/eps-svg/SkewDiagonal_1000.svg)\n", + "\n", + "But before complicating things too much, let's start with a simple case: when both strings have identical lengths and the DP matrix has a square shape." ] }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ - "def algo_v1(s1, s2, verbose: bool = False) -> int:\n", + "def square_skewed_diagonals(s1: str, s2: str, verbose: bool = False) -> Tuple[int, np.ndarray]:\n", " assert len(s1) == len(s2), \"First define an algo for square matrices!\"\n", " # Create a matrix of size (len(s1)+1) x (len(s2)+1)\n", " matrix = np.zeros((len(s1) + 1, len(s2) + 1), dtype=int)\n", @@ -83,35 +113,45 @@ "\n", " # Number of rows and columns in the square matrix.\n", " n = len(s1) + 1\n", + " \n", + " # Number of diagonals and skewed diagonals in the square matrix of size (n x n).\n", " skew_diagonals_count = 2 * n - 1\n", - " # Compute Levenshtein distance\n", - " for skew_diagonal_idx in range(2, skew_diagonals_count):\n", - " skew_diagonal_length = (skew_diagonal_idx + 1) if skew_diagonal_idx < n else (2*n - skew_diagonal_idx - 1)\n", + " \n", + " # Populate the matrix in 2 separate loops: for the top left triangle and for the bottom right triangle.\n", + " for skew_diagonal_idx in range(2, n):\n", + " skew_diagonal_length = skew_diagonal_idx + 1\n", + " for offset_within_skew_diagonal in range(1, skew_diagonal_length - 1):\n", + " # If we haven't passed the main skew diagonal yet, \n", + " # then we have to skip the first and the last operation,\n", + " # as those are already pre-populated and form the first column \n", + " # and the first row of the Levenshtein matrix respectively.\n", + " i = skew_diagonal_idx - offset_within_skew_diagonal\n", + " j = offset_within_skew_diagonal\n", + " if verbose:\n", + " print(f\"top left triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", + " substitution_cost = s1[i - 1] != s2[j - 1]\n", + " matrix[i, j] = min(\n", + " matrix[i - 1, j] + 1, #?
Deletion cost\n", + " matrix[i, j - 1] + 1, #? Insertion cost\n", + " matrix[i - 1, j - 1] + substitution_cost, #? Substitution cost\n", + " )\n", + " \n", + " # Now the bottom right triangle of the matrix.\n", + " for skew_diagonal_idx in range(n, skew_diagonals_count):\n", + " skew_diagonal_length = 2*n - skew_diagonal_idx - 1\n", " for offset_within_skew_diagonal in range(skew_diagonal_length):\n", - " if skew_diagonal_idx < n:\n", - " # If we passed the main skew diagonal yet, \n", - " # Then we have to skip the first and the last operation,\n", - " # as those are already pre-populated and form the first column \n", - " # and the first row of the Levenshtein matrix respectively.\n", - " if offset_within_skew_diagonal == 0 or offset_within_skew_diagonal + 1 == skew_diagonal_length:\n", - " continue \n", - " i = skew_diagonal_idx - offset_within_skew_diagonal\n", - " j = offset_within_skew_diagonal\n", - " if verbose:\n", - " print(f\"top left triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", - " else:\n", - " i = n - offset_within_skew_diagonal - 1\n", - " j = skew_diagonal_idx - n + offset_within_skew_diagonal + 1\n", - " if verbose:\n", - " print(f\"bottom right triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", + " i = n - offset_within_skew_diagonal - 1\n", + " j = skew_diagonal_idx - n + offset_within_skew_diagonal + 1\n", + " if verbose:\n", + " print(f\"bottom right triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", " substitution_cost = s1[i - 1] != s2[j - 1]\n", " matrix[i, j] = min(\n", - " matrix[i - 1, j] + 1, # Deletion\n", - " matrix[i, j - 1] + 1, # Insertion\n", - " matrix[i - 1, j - 1] + substitution_cost, # Substitution\n", + " matrix[i - 1, j] + 1, #? Deletion cost\n", + " matrix[i, j - 1] + 1, #? Insertion cost\n", + " matrix[i - 1, j - 1] + substitution_cost, #? 
Substitution cost\n", " )\n", "\n", - " # Return the Levenshtein distance\n", + " # Similarly, the distance will be placed in the bottom right corner of the matrix\n", " return matrix[len(s1), len(s2)], matrix" ] }, @@ -124,16 +164,17 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ + "import random\n", "for _ in range(10):\n", " s1 = ''.join(random.choices(\"ab\", k=50))\n", " s2 = ''.join(random.choices(\"ab\", k=50))\n", - " d0, _ = algo_v0(s1, s2)\n", - " d1, _ = algo_v1(s1, s2)\n", - " assert d0 == d1 " + " d0, _ = wagner_fisher(s1, s2)\n", + " d1, _ = square_skewed_diagonals(s1, s2)\n", + " assert d0 == d1" ] }, { @@ -146,7 +187,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -154,7 +195,7 @@ "text/plain": [ "('listen',\n", " 'silent',\n", - " 'distance = 4',\n", + " 'distance = np.int64(4)',\n", " array([[0, 1, 2, 3, 4, 5, 6],\n", " [1, 1, 2, 2, 3, 4, 5],\n", " [2, 2, 1, 2, 3, 4, 5],\n", @@ -164,7 +205,7 @@ " [6, 5, 5, 5, 4, 3, 4]]))" ] }, - "execution_count": 29, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -174,13 +215,13 @@ "s2 = \"silent\"\n", "# s1 = ''.join(random.choices(\"abcd\", k=100))\n", "# s2 = ''.join(random.choices(\"abcd\", k=100))\n", - "distance, baseline = algo_v0(s1, s2)\n", + "distance, baseline = wagner_fisher(s1, s2)\n", "s1, s2, f\"{distance = }\", baseline" ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -191,7 +232,7 @@ " array([0, 0, 0, 0, 0, 0, 0], dtype=uint64))" ] }, - "execution_count": 30, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -233,7 +274,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -244,7 +285,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -258,7 +299,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -269,7 +310,7 @@ " array([6, 4, 3, 2, 3, 4, 6], dtype=uint64))" ] }, - "execution_count": 33, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -306,7 +347,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -317,7 +358,7 @@ " array([4, 5, 4, 5, 5, 5, 6], dtype=uint64))" ] }, - "execution_count": 34, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -342,12 +383,195 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "assert distance == following[0], f\"{distance = } != {following[0] = }\"" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generalizing to Non-Square Matrices" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "def skewed_diagonals(s1, s2, verbose: bool = False) -> int:\n", + " shorter, longer = (s1, s2) if len(s1) < len(s2) else (s2, s1) \n", + " shorter_dim = len(shorter) + 1\n", + " longer_dim = len(longer) + 1\n", + " # Create a matrix of size (len(s1)+1) x (len(s2)+1)\n", + " matrix = np.zeros((len(shorter) + 1, len(longer) + 1), dtype=int)\n", + " matrix[:, :] = 99\n", + "\n", + " # Initialize the first column and first row of the matrix\n", + " for i in range(shorter_dim):\n", 
+ " matrix[i, 0] = i\n", + " for j in range(longer_dim):\n", + " matrix[0, j] = j\n", + "\n", + " # Let's say we are dealing with 6 and 9 letter words.\n", + " # The matrix will have size 7 x 10, parameterized as (shorter_dim x longer_dim).\n", + " # It will have:\n", + " # - 8 diagonals of increasing length, at positions: 0, 1, 2, 3, 4, 5, 6, 7.\n", + " # - 2 diagonals of fixed length, at positions: 8, 9.\n", + " # - 8 diagonals of decreasing length, at positions: 10, 11, 12, 13, 14, 15, 16, 17.\n", + " skew_diagonals_count = 2 * longer_dim - 1\n", + "\n", + " # Same as with square matrices, the 0th diagonal contains - just one element - zero - skipping it.\n", + " # Same as with square matrices, the 1st diagonal contains the values 1 and 1 - skipping it.\n", + " # Now let's handle the rest of the upper triangle.\n", + " for skew_diagonal_idx in range(2, shorter_dim + 1):\n", + " skew_diagonal_length = (skew_diagonal_idx + 1)\n", + " for offset_within_skew_diagonal in range(1, skew_diagonal_length-1): #! Skip the first column & row\n", + " # If we haven't passed the main skew diagonal yet, \n", + " # then we have to skip the first and the last operation,\n", + " # as those are already pre-populated and form the first column \n", + " # and the first row of the Levenshtein matrix respectively.\n", + " i = skew_diagonal_idx - offset_within_skew_diagonal\n", + " j = offset_within_skew_diagonal\n", + " if verbose:\n", + " print(f\"top left triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", + " shorter_char = shorter[i - 1]\n", + " longer_char = longer[j - 1]\n", + " substitution_cost = shorter_char != longer_char\n", + " matrix[i, j] = min(\n", + " matrix[i - 1, j] + 1, # Deletion\n", + " matrix[i, j - 1] + 1, # Insertion\n", + " matrix[i - 1, j - 1] + substitution_cost, # Substitution\n", + " )\n", + " \n", + " # Now let's handle the anti-diagonal band of the matrix, between the top and bottom triangles. 
\n", + " for skew_diagonal_idx in range(shorter_dim + 1, longer_dim + 1):\n", + " skew_diagonal_length = shorter_dim\n", + " for offset_within_skew_diagonal in range(skew_diagonal_length):\n", + " i = shorter_dim - offset_within_skew_diagonal - 1\n", + " j = offset_within_skew_diagonal + 1\n", + " if verbose:\n", + " print(f\"anti-band: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", + " shorter_char = shorter[i - 1]\n", + " longer_char = longer[j - 1]\n", + " substitution_cost = shorter_char != longer_char\n", + " matrix[i, j] = min(\n", + " matrix[i - 1, j] + 1, # Deletion\n", + " matrix[i, j - 1] + 1, # Insertion\n", + " matrix[i - 1, j - 1] + substitution_cost, # Substitution\n", + " )\n", + " \n", + " # Now let's handle the bottom right triangle.\n", + " for skew_diagonal_idx in range(longer_dim + 1, skew_diagonals_count):\n", + " skew_diagonal_length = 2 * longer_dim - skew_diagonal_idx - 1\n", + " for offset_within_skew_diagonal in range(skew_diagonal_length):\n", + " i = shorter_dim - offset_within_skew_diagonal - 1\n", + " j = skew_diagonal_idx - longer_dim + offset_within_skew_diagonal + 1\n", + " if verbose:\n", + " print(f\"bottom right triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", + " shorter_char = shorter[i - 1]\n", + " longer_char = longer[j - 1]\n", + " substitution_cost = shorter_char != longer_char\n", + " matrix[i, j] = min(\n", + " matrix[i - 1, j] + 1, # Deletion\n", + " matrix[i, j - 1] + 1, # Insertion\n", + " matrix[i - 1, j - 1] + substitution_cost, # Substitution\n", + " )\n", + "\n", + " # Return the Levenshtein distance\n", + " return matrix[len(shorter), len(longer)], matrix" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('listeners',\n", + " 'silents',\n", + " 'distance = np.int64(5)',\n", + " array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],\n", + " [1, 1, 2, 2, 3, 4, 5, 6, 7, 8],\n", + " [2, 2, 1, 2, 3, 4, 5, 6, 7, 8],\n", + " [3, 2, 2, 2, 3, 4, 5, 6, 7, 8],\n", + " [4, 3, 3, 3, 3, 3, 4, 5, 6, 7],\n", + " [5, 4, 4, 4, 4, 4, 3, 4, 5, 6],\n", + " [6, 5, 5, 5, 4, 5, 4, 4, 5, 6],\n", + " [7, 6, 6, 5, 5, 5, 5, 5, 5, 5]]))" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s1 = \"listeners\"\n", + "s2 = \"silents\"\n", + "distance, baseline = skewed_diagonals(s1, s2)\n", + "s1, s2, f\"{distance = }\", baseline" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('listeners',\n", + " 'silents',\n", + " 'distance = np.int64(5)',\n", + " array([[0, 1, 2, 3, 4, 5, 6, 7],\n", + " [1, 1, 2, 2, 3, 4, 5, 6],\n", + " [2, 2, 1, 2, 3, 4, 5, 6],\n", + " [3, 2, 2, 2, 3, 4, 5, 5],\n", + " [4, 3, 3, 3, 3, 4, 4, 5],\n", + " [5, 4, 4, 4, 3, 4, 5, 5],\n", + " [6, 5, 5, 5, 4, 3, 4, 5],\n", + " [7, 6, 6, 6, 5, 4, 4, 5],\n", + " [8, 7, 7, 7, 6, 5, 5, 5],\n", + " [9, 8, 8, 8, 7, 6, 6, 5]]))" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "distance, baseline = wagner_fisher(s1, s2)\n", + "s1, s2, f\"{distance = }\", baseline" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "s1 = ''.join(random.choices(\"abcd\", k=5))\n", + "s2 = ''.join(random.choices(\"abcd\", k=6))\n", + "distance_v0, baseline_v0 = wagner_fisher(s1, s2)\n", + "distance_v2, baseline_v2 = skewed_diagonals(s1, s2, verbose=False)\n", + 
"assert distance_v0 == distance_v2, f\"{distance_v0 = } != {distance_v2 = }\"\n", + "assert np.all(baseline_v0 == baseline_v2), f\"{baseline_v0 = }\\n{baseline_v2 = }\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { @@ -366,7 +590,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.5" + "version": "3.12.2" } }, "nbformat": 4,