diff --git a/scripts/test_levenshtein.ipynb b/scripts/test_levenshtein.ipynb index fc8f9bf6..4718c386 100644 --- a/scripts/test_levenshtein.ipynb +++ b/scripts/test_levenshtein.ipynb @@ -1,29 +1,52 @@ { "cells": [ { - "cell_type": "code", - "execution_count": 25, + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Exploring the Impact of Evaluation Order on Edit Distance Algorithms\n", + "\n", + "Removing data-dependencies in the Wagner-Fisher, Needleman-Wunsch, Smith-Waterman, and Gotoh Dynamic Programming algorithms to explain the hardware-accelerated variants in StringZilla." ] }, + { + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "import numpy as np\n", - "import random" + "## Levenshtein Distance\n", + "\n", + "Levenshtein edit distance is one of the most broadly studied string similarity metrics.\n", + "It is defined as the minimum number of single-character insertions, deletions, and substitutions required to change one string into another.\n", + "The Levenshtein distance between two strings is calculated using dynamic programming algorithms, such as the Wagner-Fisher algorithm, and its variations for Bioinformatics: \n", + "\n", + "- Needleman-Wunsch for global alignment with substitution matrices, \n", + "- Smith-Waterman for local alignment with substitution matrices, \n", + "- Gotoh for different penalties for gap opening and extensions.\n", + "\n", + "Given the shared nature of these algorithms, the same tricks can be applied to all of them to improve their performance." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "# Exploring the Impact of Evaluation Order on the Wagner Fisher Algorithm for Levenshtein Edit Distance" + "## Wagner-Fisher Algorithm\n", + "\n", + "The Wagner-Fisher algorithm, in its most naive form, has a time and space complexity of $O(NM)$, where $N$ and $M$ are the lengths of the two strings being compared.\n", + "A rectangular matrix of size $(N+1) \times (M+1)$ is created to store the edit distances between all prefixes of the two strings.\n", + "The first column and row are, naturally, initialized with ${0, 1, 2, ..., N}$ and ${0, 1, 2, ..., M}$ respectively." ] }, { "cell_type": "code", - "execution_count": 26, + "execution_count": 1, "metadata": {}, "outputs": [], "source": [ - "def algo_v0(s1, s2) -> int:\n", + "from typing import Tuple\n", + "import numpy as np # NumPy for matrices\n", + "\n", + "def wagner_fisher(s1: str, s2: str) -> Tuple[int, np.ndarray]:\n", " # Create a matrix of size (len(s1)+1) x (len(s2)+1)\n", " matrix = np.zeros((len(s1) + 1, len(s2) + 1), dtype=int)\n", "\n", @@ -38,12 +61,12 @@ " for j in range(1, len(s2) + 1):\n", " substitution_cost = s1[i - 1] != s2[j - 1]\n", " matrix[i, j] = min(\n", - " matrix[i - 1, j] + 1, # Deletion\n", - " matrix[i, j - 1] + 1, # Insertion\n", - " matrix[i - 1, j - 1] + substitution_cost, # Substitution\n", + " matrix[i - 1, j] + 1, #? Deletion cost\n", + " matrix[i, j - 1] + 1, #? Insertion cost\n", + " matrix[i - 1, j - 1] + substitution_cost, #?
Substitution cost\n", " )\n", "\n", - " # Return the Levenshtein distance\n", + " # The distance will be placed in the bottom right corner of the matrix\n", " return matrix[len(s1), len(s2)], matrix" ] }, @@ -51,25 +74,32 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Accelerating this exact algorithm isn't trivial, is the `matrix[i, j]` value has a dependency on the `matrix[i, j-1]` value.\n", - "So we can't brute-force accelerate the inner loop.\n", - "Instead, we can show that we can evaluate the matrix in a different order, and still get the same result." + "This algorithm is almost never recommended for practical use, as it has a quadratic space complexity.\n", + "It's trivial to see that the space complexity can be reduced to $O(\min(N, M))$ by only storing the last two rows of the matrix, but we want to keep the entire matrix as a reference to allow debugging and visualization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "![](https://mathworld.wolfram.com/images/eps-svg/SkewDiagonal_1000.svg)" + "## Diagonal Evaluation Order\n", + "\n", + "Accelerating this exact algorithm with SIMD instructions isn't trivial, as the `matrix[i, j]` value has a dependency on the `matrix[i, j - 1]` value.\n", + "So we can't brute-force accelerate the inner loop.\n", + "Instead, we can show that we can evaluate the matrix in a different order, and still get the same result.\n", + "\n", + "![Skewed Diagonals Evaluation Order](https://mathworld.wolfram.com/images/eps-svg/SkewDiagonal_1000.svg)\n", + "\n", + "But before complicating things too much, let's start with a simple case: when both strings have identical lengths and the DP matrix has a square shape." ] }, { "cell_type": "code", - "execution_count": 27, + "execution_count": 2, "metadata": {}, "outputs": [], "source": [ - "def algo_v1(s1, s2, verbose: bool = False) -> int:\n", + "def square_skewed_diagonals(s1: str, s2: str, verbose: bool = False) -> Tuple[int, np.ndarray]:\n", " assert len(s1) == len(s2), \"First define an algo for square matrices!\"\n", " # Create a matrix of size (len(s1)+1) x (len(s2)+1)\n", " matrix = np.zeros((len(s1) + 1, len(s2) + 1), dtype=int)\n", @@ -83,35 +113,45 @@ "\n", " # Number of rows and columns in the square matrix.\n", " n = len(s1) + 1\n", + " \n", + " # Number of diagonals and skewed diagonals in the square matrix of size (n x n).\n", " skew_diagonals_count = 2 * n - 1\n", - " # Compute Levenshtein distance\n", - " for skew_diagonal_idx in range(2, skew_diagonals_count):\n", - " skew_diagonal_length = (skew_diagonal_idx + 1) if skew_diagonal_idx < n else (2*n - skew_diagonal_idx - 1)\n", + " \n", + " # Populate the matrix in 2 separate loops: for the top left triangle and for the bottom right triangle.\n", + " for skew_diagonal_idx in range(2, n):\n", + " skew_diagonal_length = skew_diagonal_idx + 1\n", + " for offset_within_skew_diagonal in range(1, skew_diagonal_length - 1):\n", + " # If we haven't passed the main skew diagonal yet, \n", + " # then we have to skip the first and the last operation,\n", + " # as those are already pre-populated and form the first column \n", + " # and the first row of the Levenshtein matrix respectively.\n", + " i = skew_diagonal_idx - offset_within_skew_diagonal\n", + " j = offset_within_skew_diagonal\n", + " if verbose:\n", + " print(f\"top left triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", + " substitution_cost = s1[i - 1] != s2[j - 1]\n", + " matrix[i, j] = min(\n", + " matrix[i - 1, j] + 1, #?
Deletion cost\n", + " matrix[i, j - 1] + 1, #? Insertion cost\n", + " matrix[i - 1, j - 1] + substitution_cost, #? Substitution cost\n", + " )\n", + " \n", + " # Now the bottom right triangle of the matrix.\n", + " for skew_diagonal_idx in range(n, skew_diagonals_count):\n", + " skew_diagonal_length = 2*n - skew_diagonal_idx - 1\n", " for offset_within_skew_diagonal in range(skew_diagonal_length):\n", - " if skew_diagonal_idx < n:\n", - " # If we passed the main skew diagonal yet, \n", - " # Then we have to skip the first and the last operation,\n", - " # as those are already pre-populated and form the first column \n", - " # and the first row of the Levenshtein matrix respectively.\n", - " if offset_within_skew_diagonal == 0 or offset_within_skew_diagonal + 1 == skew_diagonal_length:\n", - " continue \n", - " i = skew_diagonal_idx - offset_within_skew_diagonal\n", - " j = offset_within_skew_diagonal\n", - " if verbose:\n", - " print(f\"top left triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", - " else:\n", - " i = n - offset_within_skew_diagonal - 1\n", - " j = skew_diagonal_idx - n + offset_within_skew_diagonal + 1\n", - " if verbose:\n", - " print(f\"bottom right triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", + " i = n - offset_within_skew_diagonal - 1\n", + " j = skew_diagonal_idx - n + offset_within_skew_diagonal + 1\n", + " if verbose:\n", + " print(f\"bottom right triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", " substitution_cost = s1[i - 1] != s2[j - 1]\n", " matrix[i, j] = min(\n", - " matrix[i - 1, j] + 1, # Deletion\n", - " matrix[i, j - 1] + 1, # Insertion\n", - " matrix[i - 1, j - 1] + substitution_cost, # Substitution\n", + " matrix[i - 1, j] + 1, #? Deletion cost\n", + " matrix[i, j - 1] + 1, #? Insertion cost\n", + " matrix[i - 1, j - 1] + substitution_cost, #? 
Substitution cost\n", " )\n", "\n", - " # Return the Levenshtein distance\n", + " # Similarly, the distance will be placed in the bottom right corner of the matrix\n", " return matrix[len(s1), len(s2)], matrix" ] }, @@ -124,16 +164,17 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": 3, "metadata": {}, "outputs": [], "source": [ + "import random\n", "for _ in range(10):\n", " s1 = ''.join(random.choices(\"ab\", k=50))\n", " s2 = ''.join(random.choices(\"ab\", k=50))\n", - " d0, _ = algo_v0(s1, s2)\n", - " d1, _ = algo_v1(s1, s2)\n", - " assert d0 == d1 " + " d0, _ = wagner_fisher(s1, s2)\n", + " d1, _ = square_skewed_diagonals(s1, s2)\n", + " assert d0 == d1" ] }, { @@ -146,7 +187,7 @@ }, { "cell_type": "code", - "execution_count": 29, + "execution_count": 4, "metadata": {}, "outputs": [ { @@ -154,7 +195,7 @@ "text/plain": [ "('listen',\n", " 'silent',\n", - " 'distance = 4',\n", + " 'distance = np.int64(4)',\n", " array([[0, 1, 2, 3, 4, 5, 6],\n", " [1, 1, 2, 2, 3, 4, 5],\n", " [2, 2, 1, 2, 3, 4, 5],\n", @@ -164,7 +205,7 @@ " [6, 5, 5, 5, 4, 3, 4]]))" ] }, - "execution_count": 29, + "execution_count": 4, "metadata": {}, "output_type": "execute_result" } @@ -174,13 +215,13 @@ "s2 = \"silent\"\n", "# s1 = ''.join(random.choices(\"abcd\", k=100))\n", "# s2 = ''.join(random.choices(\"abcd\", k=100))\n", - "distance, baseline = algo_v0(s1, s2)\n", + "distance, baseline = wagner_fisher(s1, s2)\n", "s1, s2, f\"{distance = }\", baseline" ] }, { "cell_type": "code", - "execution_count": 30, + "execution_count": 5, "metadata": {}, "outputs": [ { @@ -191,7 +232,7 @@ " array([0, 0, 0, 0, 0, 0, 0], dtype=uint64))" ] }, - "execution_count": 30, + "execution_count": 5, "metadata": {}, "output_type": "execute_result" } @@ -233,7 +274,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": 6, "metadata": {}, "outputs": [], "source": [ @@ -244,7 +285,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ @@ -258,7 +299,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": 8, "metadata": {}, "outputs": [ { @@ -269,7 +310,7 @@ " array([6, 4, 3, 2, 3, 4, 6], dtype=uint64))" ] }, - "execution_count": 33, + "execution_count": 8, "metadata": {}, "output_type": "execute_result" } @@ -306,7 +347,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": 9, "metadata": {}, "outputs": [ { @@ -317,7 +358,7 @@ " array([4, 5, 4, 5, 5, 5, 6], dtype=uint64))" ] }, - "execution_count": 34, + "execution_count": 9, "metadata": {}, "output_type": "execute_result" } @@ -342,12 +383,195 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "assert distance == following[0], f\"{distance = } != {following[0] = }\"" ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Generalizing to Non-Square Matrices" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [], + "source": [ + "def skewed_diagonals(s1, s2, verbose: bool = False) -> int:\n", + " shorter, longer = (s1, s2) if len(s1) < len(s2) else (s2, s1) \n", + " shorter_dim = len(shorter) + 1\n", + " longer_dim = len(longer) + 1\n", + " # Create a matrix of size (len(s1)+1) x (len(s2)+1)\n", + " matrix = np.zeros((len(shorter) + 1, len(longer) + 1), dtype=int)\n", + " matrix[:, :] = 99\n", + "\n", + " # Initialize the first column and first row of the matrix\n", + " for i in range(shorter_dim):\n", 
+ " matrix[i, 0] = i\n", + " for j in range(longer_dim):\n", + " matrix[0, j] = j\n", + "\n", + " # Let's say we are dealing with 6 and 9 letter words.\n", + " # The matrix will have size 7 x 10, parameterized as (shorter_dim x longer_dim).\n", + " # It will have:\n", + " # - 8 diagonals of increasing length, at positions: 0, 1, 2, 3, 4, 5, 6, 7.\n", + " # - 2 diagonals of fixed length, at positions: 8, 9.\n", + " # - 8 diagonals of decreasing length, at positions: 10, 11, 12, 13, 14, 15, 16, 17.\n", + " skew_diagonals_count = 2 * longer_dim - 1\n", + "\n", + " # Same as with square matrices, the 0th diagonal contains - just one element - zero - skipping it.\n", + " # Same as with square matrices, the 1st diagonal contains the values 1 and 1 - skipping it.\n", + " # Now let's handle the rest of the upper triangle.\n", + " for skew_diagonal_idx in range(2, shorter_dim + 1):\n", + " skew_diagonal_length = (skew_diagonal_idx + 1)\n", + " for offset_within_skew_diagonal in range(1, skew_diagonal_length-1): #! Skip the first column & row\n", + " # If we haven't passed the main skew diagonal yet, \n", + " # then we have to skip the first and the last operation,\n", + " # as those are already pre-populated and form the first column \n", + " # and the first row of the Levenshtein matrix respectively.\n", + " i = skew_diagonal_idx - offset_within_skew_diagonal\n", + " j = offset_within_skew_diagonal\n", + " if verbose:\n", + " print(f\"top left triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", + " shorter_char = shorter[i - 1]\n", + " longer_char = longer[j - 1]\n", + " substitution_cost = shorter_char != longer_char\n", + " matrix[i, j] = min(\n", + " matrix[i - 1, j] + 1, # Deletion\n", + " matrix[i, j - 1] + 1, # Insertion\n", + " matrix[i - 1, j - 1] + substitution_cost, # Substitution\n", + " )\n", + " \n", + " # Now let's handle the anti-diagonal band of the matrix, between the top and bottom triangles. 
\n", + " for skew_diagonal_idx in range(shorter_dim + 1, longer_dim + 1):\n", + " skew_diagonal_length = shorter_dim\n", + " for offset_within_skew_diagonal in range(skew_diagonal_length):\n", + " i = shorter_dim - offset_within_skew_diagonal - 1\n", + " j = offset_within_skew_diagonal + 1\n", + " if verbose:\n", + " print(f\"anti-band: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", + " shorter_char = shorter[i - 1]\n", + " longer_char = longer[j - 1]\n", + " substitution_cost = shorter_char != longer_char\n", + " matrix[i, j] = min(\n", + " matrix[i - 1, j] + 1, # Deletion\n", + " matrix[i, j - 1] + 1, # Insertion\n", + " matrix[i - 1, j - 1] + substitution_cost, # Substitution\n", + " )\n", + " \n", + " # Now let's handle the bottom right triangle.\n", + " for skew_diagonal_idx in range(longer_dim + 1, skew_diagonals_count):\n", + " skew_diagonal_length = 2 * longer_dim - skew_diagonal_idx - 1\n", + " for offset_within_skew_diagonal in range(skew_diagonal_length):\n", + " i = shorter_dim - offset_within_skew_diagonal - 1\n", + " j = skew_diagonal_idx - longer_dim + offset_within_skew_diagonal + 1\n", + " if verbose:\n", + " print(f\"bottom right triangle: {skew_diagonal_idx=}, {skew_diagonal_length=}, {i=}, {j=}\")\n", + " shorter_char = shorter[i - 1]\n", + " longer_char = longer[j - 1]\n", + " substitution_cost = shorter_char != longer_char\n", + " matrix[i, j] = min(\n", + " matrix[i - 1, j] + 1, # Deletion\n", + " matrix[i, j - 1] + 1, # Insertion\n", + " matrix[i - 1, j - 1] + substitution_cost, # Substitution\n", + " )\n", + "\n", + " # Return the Levenshtein distance\n", + " return matrix[len(shorter), len(longer)], matrix" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('listeners',\n", + " 'silents',\n", + " 'distance = np.int64(5)',\n", + " array([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9],\n", + " [1, 1, 2, 2, 3, 4, 5, 6, 7, 8],\n", + " [2, 2, 1, 2, 3, 4, 5, 6, 7, 8],\n", + " [3, 2, 2, 2, 3, 4, 5, 6, 7, 8],\n", + " [4, 3, 3, 3, 3, 3, 4, 5, 6, 7],\n", + " [5, 4, 4, 4, 4, 4, 3, 4, 5, 6],\n", + " [6, 5, 5, 5, 4, 5, 4, 4, 5, 6],\n", + " [7, 6, 6, 5, 5, 5, 5, 5, 5, 5]]))" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "s1 = \"listeners\"\n", + "s2 = \"silents\"\n", + "distance, baseline = skewed_diagonals(s1, s2)\n", + "s1, s2, f\"{distance = }\", baseline" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "('listeners',\n", + " 'silents',\n", + " 'distance = np.int64(5)',\n", + " array([[0, 1, 2, 3, 4, 5, 6, 7],\n", + " [1, 1, 2, 2, 3, 4, 5, 6],\n", + " [2, 2, 1, 2, 3, 4, 5, 6],\n", + " [3, 2, 2, 2, 3, 4, 5, 5],\n", + " [4, 3, 3, 3, 3, 4, 4, 5],\n", + " [5, 4, 4, 4, 3, 4, 5, 5],\n", + " [6, 5, 5, 5, 4, 3, 4, 5],\n", + " [7, 6, 6, 6, 5, 4, 4, 5],\n", + " [8, 7, 7, 7, 6, 5, 5, 5],\n", + " [9, 8, 8, 8, 7, 6, 6, 5]]))" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "distance, baseline = wagner_fisher(s1, s2)\n", + "s1, s2, f\"{distance = }\", baseline" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [], + "source": [ + "s1 = ''.join(random.choices(\"abcd\", k=5))\n", + "s2 = ''.join(random.choices(\"abcd\", k=6))\n", + "distance_v0, baseline_v0 = wagner_fisher(s1, s2)\n", + "distance_v2, baseline_v2 = skewed_diagonals(s1, s2, verbose=False)\n", + 
"assert distance_v0 == distance_v2, f\"{distance_v0 = } != {distance_v2 = }\"\n", + "assert np.all(baseline_v0 == baseline_v2), f\"{baseline_v0 = }\\n{baseline_v2 = }\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] } ], "metadata": { @@ -366,7 +590,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.11.5" + "version": "3.12.2" } }, "nbformat": 4,