
Example


This is a full-scale example of how to use vitrivr-engine to index a collection of images and videos, and it serves as a starting point for advanced users of vitrivr-engine. Previous knowledge of multimedia retrieval and vitrivr-engine is beneficial; however, the aim is that even novices can use vitrivr-engine with no more than this tutorial.

Goals

This is a tutorial / example on how to use vitrivr-engine, aimed at users such as people who own a multimedia collection and want to index it. The tutorial has three goals:

  1. A quick reference for vitrivr-engine ingestion and retrieval
  2. Thoughts and design choices for schema, ingestion and retrieval
  3. A real-world example, in contrast to the other, more abstract documentation in this wiki

Why vitrivr-engine

Having a multimedia collection (videos and images, for the sake of this tutorial) is great; however, the means to explore and search within (large) collections are still rather limited. vitrivr-engine, a general-purpose content-based multimedia retrieval engine, provides ingestion (i.e. analysing the content and storing this information for efficient use) and retrieval (i.e. using the previously gathered information to find items in the collection), which can improve the understanding and usability of the collection.

Prerequisites

Reading and following the Getting Started guide is not a requirement, but it is beneficial. Additionally, reading the introduction of the Documentation wiki page is helpful.

Technical requirements are as follows:

  • JDK 21 or higher, e.g. OpenJDK
  • CottontailDB at least v0.16.5
  • The example collection, consisting of CC0 videos and images. This is arguably a small collection; a real-world multimedia collection would be significantly larger.

Setup

In case no release exists, building vitrivr-engine from source is required.

  1. Start CottontailDB on the default port 1865
  2. Build vitrivr-engine (from the root of the repository):

Unix:

./gradlew distZip

Windows:

.\gradlew.bat distZip

  3. Unzip the distribution, e.g. unzip -d ../instance/ vitrivr-engine-module-server/build/distribution/vitrivr-engine-server-0.0.1-SNAPSHOT.zip
  4. Place the media data in a folder called example/media

By now, you should have the following folder structure:

+ vitrivr-engine/
|
+ instance/
  |
  + vitrivr-engine-server-0.0.1-SNAPSHOT/
    |
    + bin/
    |
    + lib/
+ example/
  |
  + media/
    |
    + images/
    |
    + videos/
    |
    - README.md
|
+ cottontaildb/

The cottontaildb folder is optional and might contain either the DBMS executable or the repository. We will not delve deeper into the CottontailDB setup here.
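To recap, on Unix the whole setup boils down to the following commands (a sketch, run from the vitrivr-engine repository root and assuming the sibling folder layout shown above; the mkdir call merely creates the empty media folders):

./gradlew distZip
unzip -d ../instance/ vitrivr-engine-module-server/build/distribution/vitrivr-engine-server-0.0.1-SNAPSHOT.zip
mkdir -p ../example/media/images ../example/media/videos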

The Schema

Since we have images and videos with a rather diverse set of styles, we aim to extract as much content-based information as possible. Therefore, we set up the schema accordingly.

The schema fields in detail:

| Field | Type | Description | Module |
| --- | --- | --- | --- |
| averagecolor | Vector, length: 3 | The most basic feature, for completeness' sake | core |
| clip | Vector, length: 512 | CLIP-based dense embedding; enables textual / concept search | fes |
| file | Structural | Metadata for the file | core |
| whisper | Textual | ASR: OpenAI Whisper, deep-learning-based subtitle analysis | fes |
| ocr | Textual | OCR: text recognition for both images and videos; for videos only on key frames | fes |
| dino | Vector, length: 384 | DINO-based dense embedding, predominantly for query-by-example | fes |
| time | Structural | Temporal metadata for time-based media (e.g. video, audio) | core |
| video | Structural | Metadata for videos, e.g. resolution, FPS, ... | core |

The fes module depends on the feature extraction server (FES), a microservice for extraction and queries using pre-trained deep learning models. There is a list of available tasks, and the README explains the setup.

For the sake of this tutorial, we assume that a FES instance is running on the same machine, available at http://127.0.0.1:8888 (which should be the default port when following the FES instructions).
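Before running an extraction, it can be worth verifying that FES is actually reachable. A minimal check: any HTTP response, even an error page, indicates the service is up, whereas a refused connection indicates it is not:

curl -i http://127.0.0.1:8888/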

Schema Configuration

This is the schema we use. Store it as example/schema.json, since this is the path the commands later in this tutorial refer to:

{
  "schemas": {
    "example": {
      "connection": {
        "database": "CottontailConnectionProvider",
        "parameters": {
          "Host": "127.0.0.1",
          "port": "1865"
        }
      },
      "fields": {
        "averagecolor": {
          "factory": "AverageColor"
        },
        "file": {
          "factory": "FileSourceMetadata"
        },
        "clip": {
          "factory": "DenseEmbedding",
          "parameters": {
            "host": "http://127.0.0.1:8888",
            "model": "open-clip-vit-b32",
            "length":"512"
          }
        },
        "dino": {
          "factory": "DenseEmbedding",
          "parameters": {
            "host": "http://127.0.0.1:8888/",
            "model": "dino-v2-vits14",
            "length":"384"
          }
        },
        "whisper": {
          "factory": "ASR",
          "parameters": {
            "host": "http://127.0.0.1:8888/",
            "model": "whisper"
          }
        },
        "ocr": {
          "factory": "OCR",
          "parameters": {
            "host": "http://127.0.0.1:8888/",
            "model": "tesseract"
          }
        },
        "time": {
          "factory": "TemporalMetadata"
        },
        "video": {
          "factory": "VideoSourceMetadata"
        }
      },
      "resolvers": {
        "disk": {
          "factory": "DiskResolver",
          "parameters": {
            "location": "./example/thumbs"
          }
        }
      },
      "exporters": {
        "thumbnail": {
          "factory": "ThumbnailExporter",
          "resolverName": "disk",
          "parameters": {
            "maxSideResolution": "300",
            "mimeType": "JPG"
          }
        }
      },
      "extractionPipelines": {}
    }
  }
}
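Hand-written JSON easily picks up small mistakes such as trailing commas. It can therefore be worth validating the file before starting the engine; a minimal check, assuming a Python installation is available:

python -m json.tool ./example/schema.json

If the file is valid, the pretty-printed schema is echoed back; otherwise, the location of the syntax error is reported.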

The Ingestion

To simplify the pipelines, it is beneficial to separate them by media type. Since this tutorial's collection contains images and videos, we define two separate pipelines. Even with a shared schema, not every media type can be analysed for every field we have defined: images, for instance, have no audio track, so we won't extract ASR from them.

Image Pipeline

The basic idea behind the image pipeline is that the FES microservice handles the CLIP, OCR, and DINO requests, which can take some time and may be processed concurrently. In the meantime, vitrivr-engine extracts the metadata information itself.

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#2D373C',
      'primaryTextColor': '#D2EBE9',
      'primaryBorderColor': '#A5D7D2',
      'lineColor': '#D20537',
      'secondaryColor': '#2D373C',
      'edgeLabelBackground': '#000'
    }
  }
}%%
flowchart LR

direction LR
 s[ ] --> e[enumerator] --> d[decoder]
 d --> a[averagecolor]
 d --> c[clip]
 d --> i[dino]
 d --> o[ocr]
 d --> t[thumbnail]
 t --> f[filter]
 a --> f[filter]
 f -->|combine| m[file]
 m --> p[persistence]
 c --> p
 i --> p
 o --> p
 p -->|combine| q[ ]

 style q fill:#0000,stroke:#0000,stroke-width:0px
 style s fill:#0000,stroke:#0000,stroke-width:0px


Image Pipeline Configuration

Store the pipeline as a configuration JSON file under example/image-pipeline.json.

{
  "schema": "example",
  "context": {
    "contentFactory": "InMemoryContentFactory",
    "resolverName":"disk",
    "local": {
      "enumerator": {
        "path": "./example/media/",
        "depth": "3"
      },
      "filter": {
        "type": "SOURCE:IMAGE"
      }
    }
  },
  "operators": {
    "enumerator": {
      "type": "ENUMERATOR",
      "factory": "FileSystemEnumerator",
      "mediaTypes": ["IMAGE"]
    },
    "decoder": {
      "type": "DECODER",
      "factory": "ImageDecoder"
    },
    "averagecolor": {
      "type": "EXTRACTOR",
      "fieldName": "averagecolor"
    },
    "clip": {
      "type": "EXTRACTOR",
      "fieldName": "clip"
    },
    "dino": {
      "type": "EXTRACTOR",
      "fieldName": "dino"
    },
    "ocr": {
      "type": "EXTRACTOR",
      "fieldName": "ocr"
    },
    "meta-file": {
      "type": "EXTRACTOR",
      "fieldName": "file"
    },
    "meta-video": {
      "type": "EXTRACTOR",
      "fieldName": "video"
    },
    "meta-time": {
      "type": "EXTRACTOR",
      "fieldName": "time"
    },
    "thumbnail": {
      "type": "EXPORTER",
      "exporterName": "thumbnail"
    },
    "filter": {
      "type": "TRANSFORMER",
      "factory": "TypeFilterTransformer"
    }
  },
  "operations": {
    "stage-0-0": {"operator": "enumerator"},
    "stage-1-0": {"operator": "decoder","inputs": ["stage-0-0"]},
    "stage-2-0": {"operator": "clip","inputs": ["stage-1-0"]},
    "stage-2-1": {"operator": "dino","inputs": ["stage-1-0"]},
    "stage-2-2": {"operator": "ocr","inputs": ["stage-1-0"]},
    "stage-2-3": {"operator": "averagecolor","inputs": ["stage-1-0"]},
    "stage-2-4": {"operator": "thumbnail","inputs": ["stage-1-0"]},
    "stage-3-0": {"operator": "filter","inputs": ["stage-2-3","stage-2-4"], "merge": "COMBINE"},
    "stage-4-0": {"operator": "meta-file", "inputs": ["stage-3-0"]}
  },
  "output": [
    "stage-2-0",
    "stage-2-1",
    "stage-2-2",
    "stage-2-3",
    "stage-4-0"
  ],
  "mergeType": "COMBINE"
}
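To double-check how the stages of a pipeline are wired before running it, one option (assuming jq is installed) is to print the operations graph:

jq '.operations' ./example/image-pipeline.json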

Video Pipeline

Similar to the image pipeline, we again leverage the fact that the microservice (feature extraction server) can run multiple analysers simultaneously. The video pipeline is slightly longer, as we want more metadata and ASR.

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#2D373C',
      'primaryTextColor': '#D2EBE9',
      'primaryBorderColor': '#A5D7D2',
      'lineColor': '#D20537',
      'secondaryColor': '#2D373C',
      'edgeLabelBackground': '#0000'
    }
  }
}%%
flowchart LR

direction LR
 s[ ] --> e[enumerator] --> d[decoder] --> l[selector]
 l --> a[averagecolor]
 l --> c[clip]
 l --> i[dino]
 l --> o[ocr]
 l --> t[thumbnail]
 d --> w[whisper]
 t --> f[filter]
 a --> f[filter]
 f -->|combine| m[file] --> v[video] --> y[time]
 y --> p[persistence]
 c --> p
 i --> p
 o --> p
 w --> p
 p -->|combine| q[ ]

 style q fill:#0000,stroke:#0000,stroke-width:0px
 style s fill:#0000,stroke:#0000,stroke-width:0px


Video Pipeline Configuration

Store the pipeline as a configuration JSON file under example/video-pipeline.json.

{
  "schema": "example",
  "context": {
    "contentFactory": "InMemoryContentFactory",
    "resolverName":"disk",
    "local": {
      "enumerator": {
        "path": "./example/media/",
        "depth": "3"
      },
      "decoder": {
        "timeWindowMs": "30_000"
      },
      "filter": {
        "type": "SOURCE:VIDEO"
      }
    }
  },
  "operators": {
    "enumerator": {
      "type": "ENUMERATOR",
      "factory": "FileSystemEnumerator",
      "mediaTypes": ["VIDEO"]
    },
    "decoder": {
      "type": "DECODER",
      "factory": "VideoDecoder"
    },
    "selector": {
      "type": "TRANSFORMER",
      "factory": "LastContentAggregator"
    },
    "averagecolor": {
      "type": "EXTRACTOR",
      "fieldName": "averagecolor"
    },
    "clip": {
      "type": "EXTRACTOR",
      "fieldName": "clip"
    },
    "dino": {
      "type": "EXTRACTOR",
      "fieldName": "dino"
    },
    "whisper": {
      "type": "EXTRACTOR",
      "fieldName": "whisper"
    },
    "ocr": {
      "type": "EXTRACTOR",
      "fieldName": "ocr"
    },
    "meta-file": {
      "type": "EXTRACTOR",
      "fieldName": "file"
    },
    "meta-video": {
      "type": "EXTRACTOR",
      "fieldName": "video"
    },
    "meta-time": {
      "type": "EXTRACTOR",
      "fieldName": "time"
    },
    "thumbnail": {
      "type": "EXPORTER",
      "exporterName": "thumbnail"
    },
    "filter": {
      "type": "TRANSFORMER",
      "factory": "TypeFilterTransformer"
    }
  },
  "operations": {
    "stage-0-0": {"operator": "enumerator"},
    "stage-1-0": {"operator": "decoder","inputs": ["stage-0-0"]},
    "stage-2-0": {"operator": "selector","inputs": ["stage-1-0"]},
    "stage-3-0": {"operator": "clip","inputs": ["stage-2-0"]},
    "stage-3-1": {"operator": "dino","inputs": ["stage-2-0"]},
    "stage-3-2": {"operator": "whisper","inputs": ["stage-1-0"]},
    "stage-3-3": {"operator": "ocr","inputs": ["stage-2-0"]},
    "stage-3-4": {"operator": "averagecolor","inputs": ["stage-2-0"]},
    "stage-3-5": {"operator": "thumbnail","inputs": ["stage-2-0"]},
    "stage-4-0": {"operator": "filter","inputs": ["stage-3-5","stage-3-4"], "merge": "COMBINE"},
    "stage-5-0": {"operator": "meta-file", "inputs": ["stage-4-0"]},
    "stage-6-0": {"operator": "meta-video", "inputs": ["stage-5-0"]},
    "stage-7-0": {"operator": "meta-time", "inputs": ["stage-6-0"]}
  },
  "output": [
    "stage-3-0",
    "stage-3-1",
    "stage-3-2",
    "stage-3-3",
    "stage-7-0"
  ],
  "mergeType": "COMBINE"
}

Note the timeWindowMs decoder context property, which is set to 30000 ms (30 s, half a minute) for this particular setup: the video decoder emits one segment per 30-second window, so a 90-second video, for example, results in three segments that are analysed individually.
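To confirm which window size a pipeline file actually specifies, the value can be read back, again assuming jq is installed:

jq '.context.local.decoder.timeWindowMs' ./example/video-pipeline.json

For the configuration above, this prints "30000".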

Dense Retrieval

The following, more advanced ingestion pipeline is geared towards dense retrieval. It uses its own schema named dense and, besides CLIP, derives dense embeddings from OCR and ASR results as well as from automatically generated captions; the sparse text fields serve as intermediate content for their dense counterparts:

{
  "schema": "dense",
  "context": {
    "contentFactory": "CachedContentFactory",
    "resolverName": "disk",
    "local": {
      "content": {
        "path": "../cache"
      },
      "enumerator": {
        "path": "../objects_dense",,
        "depth": "5"
      },
      "selector":{
        "contentSources": "videoDecoder"
      },
      "imageSourceFilter": {
        "type": "SOURCE:IMAGE"
      },
      "videoSourceFilter": {
        "type": "SOURCE:VIDEO"
      },
      "videoFilePathContent": {
        "field": "file"
      },
      "imageFilePathContent": {
        "field": "file"
      },
      "ocrContent": {
        "field": "ocrSparse"
      },
      "asrContent": {
        "field": "asrSparse"
      },
      "captionContent": {
        "field": "captionSparse"
      },
      "documentFilter": {
        "label": "text document",
        "value": "true"
      },
      "photographFilter": {
        "label": "photograph",
        "value": "true"
      },
      "videoDecoder": {
        "timeWindowMs": "10000"
      },
      "clip": {
        "contentSources": "selector,imageDecoder"
      },
      "ocrSparse": {
        "contentSources": "imageDecoder,selector"
      },
      "captionSparse": {
        "contentSources": "imageDecoder,selector,videoPrompt,documentPrompt,photographPrompt"
      },
      "asrSparse": {
        "contentSources": "videoDecoder"
      },
      "ocrDense": {
        "contentSources": "ocrContent"
      },
      "captionDense": {
        "contentSources": "captionContent"
      },
      "asrDense": {
        "contentSources": "asrContent"
      },
      "documentType": {
        "contentSources": "imageDecoder"
      },
      "videoPrompt": {
        "template": "Describe the contents of this shot from a video segment (file path: ${videoFilePathContent}) to aid archivists in documenting and searching for the video segment. The automatically extracted speech transcript for the video segment is '${asrContent}' (may contain errors). Use information from the internet to enhance the description, for instance by searching for proper nouns. If web sources turn out to be irrelevant, do not include them. The video segment is part of the PTT Archive which preserves the history (1848-1997) of Swiss Post, Telegraphy and Telephony (PTT). The description should include all of the speech transcript in the video segment, if it is relevant. Instead of including the speech transcript verbatim, correct the errors first. If it is impossible to understand what the speech transcript means, simply ignore it. Never include any transcripts that contain errors and do not mention correcting errors. Do not include general information about the PTT. Do not structure the description, put everything in one paragraph. Do not mention words such as 'archive', 'documentation', 'archivist', 'search' or 'internet'. Include sources at the end of the description if applicable and otherwise do not mention any sources.",
        "defaultValue": "no content provided"
      },
      "documentPrompt": {
        "template": "Describe the contents of this document (file path: ${imageFilePathContent}) to aid archivists in documenting and searching for the document. Use information from the internet to enhance the description, for instance by searching for proper nouns. If web sources turn out to be irrelevant, do not include them. The document is part of the PTT Archive which preserves the history (1848-1997) of Swiss Post, Telegraphy and Telephony (PTT). The description should include all of the text in the document. Do not include general information about the PTT. Do not structure the description, put everything in one paragraph. Do not mention words such as 'archive', 'documentation', 'archivist', 'search' or 'internet'. Include sources at the end of the description if applicable and otherwise do not mention any sources.",
        "defaultValue": "no content provided"
      },
      "photographPrompt": {
        "template": "Describe the contents of this photograph (file path: ${imageFilePathContent}) to aid archivists in documenting and searching for the image. Use information from the internet to enhance the description, for instance by searching for proper nouns. If web sources turn out to be irrelevant, do not include them. The image is part of the PTT Archive which preserves the history (1848-1997) of Swiss Post, Telegraphy and Telephony (PTT). Do not include general information about the PTT. Do not structure the description, put everything in one paragraph. Do not mention words such as 'archive', 'documentation', 'archivist', 'search' or 'internet'. Include sources at the end of the description if applicable and otherwise do not mention any sources.",
        "defaultValue": "no content provided"
      }
    }
  },
  "operators": {
    "enumerator": {
      "type": "ENUMERATOR",
      "factory": "FileSystemEnumerator",
      "mediaTypes": ["IMAGE", "VIDEO"]
    },
    "imageDecoder": {
      "type": "DECODER",
      "factory": "ImageDecoder"
    },
    "videoDecoder": {
      "type": "DECODER",
      "factory": "VideoDecoder"
    },
    "fileMetadata":{
      "type": "EXTRACTOR",
      "fieldName": "file"
    },
    "videoFilePathContent": {
      "type": "TRANSFORMER",
      "factory":"DescriptorAsContentTransformer"
    },
    "imageFilePathContent": {
      "type": "TRANSFORMER",
      "factory":"DescriptorAsContentTransformer"
    },
    "clip": {
      "type": "EXTRACTOR",
      "fieldName": "clip"
    },
    "ocrSparse": {
      "type": "EXTRACTOR",
      "fieldName": "ocrSparse"
    },
    "captionSparse": {
      "type": "EXTRACTOR",
      "fieldName": "captionSparse"
    },
    "asrSparse": {
      "type": "EXTRACTOR",
      "fieldName": "asrSparse"
    },
    "ocrDense": {
      "type": "EXTRACTOR",
      "fieldName": "ocrDense"
    },
    "captionDense": {
      "type": "EXTRACTOR",
      "fieldName": "captionDense"
    },
    "asrDense": {
      "type": "EXTRACTOR",
      "fieldName": "asrDense"
    },
    "documentType": {
      "type": "EXTRACTOR",
      "fieldName": "documentType"
    },
    "imageSourceFilter": {
      "type": "TRANSFORMER",
      "factory": "TypeFilterTransformer"
    },
    "videoSourceFilter": {
      "type": "TRANSFORMER",
      "factory": "TypeFilterTransformer"
    },
    "ocrContent": {
      "type": "TRANSFORMER",
      "factory": "DescriptorAsContentTransformer"
    },
    "asrContent": {
      "type": "TRANSFORMER",
      "factory": "DescriptorAsContentTransformer"
    },
    "captionContent": {
      "type": "TRANSFORMER",
      "factory": "DescriptorAsContentTransformer"
    },
    "documentFilter": {
      "type": "TRANSFORMER",
      "factory": "LabelFilterTransformer"
    },
    "photographFilter": {
      "type": "TRANSFORMER",
      "factory": "LabelFilterTransformer"
    },
    "selector": {
      "type": "TRANSFORMER",
      "factory": "LastContentAggregator"
    },
    "time":{
      "type": "EXTRACTOR",
      "fieldName": "time"
    },
    "videoPrompt": {
      "type": "TRANSFORMER",
      "factory": "TemplateTextTransformer"
    },
    "documentPrompt": {
      "type": "TRANSFORMER",
      "factory": "TemplateTextTransformer"
    },
    "photographPrompt": {
      "type": "TRANSFORMER",
      "factory": "TemplateTextTransformer"
    },
    "thumbnail": {
      "type": "EXPORTER",
      "exporterName": "thumbnail"
    }
  },
  "operations": {
    "enumerator-stage": {"operator": "enumerator"},
    "video-decoder-stage": {"operator": "videoDecoder", "inputs": ["enumerator-stage"]},
    "video-file-metadata-stage": {"operator": "fileMetadata", "inputs": ["video-decoder-stage"], "merge": "COMBINE"},
    "video-file-path-content-stage": {"operator": "videoFilePathContent", "inputs": ["video-file-metadata-stage"]},
    "time-stage": {"operator": "time","inputs": ["video-file-path-content-stage"]},
    "image-decoder-stage": {"operator": "imageDecoder", "inputs": ["enumerator-stage"]},
    "image-file-metadata-stage": {"operator": "fileMetadata", "inputs": ["image-decoder-stage"]},
    "image-file-path-content-stage": {"operator": "imageFilePathContent", "inputs": ["image-file-metadata-stage"]},
    "selector-stage": {"operator": "selector", "inputs": ["time-stage"]},

    "video-clip-stage": {"operator": "clip", "inputs": ["selector-stage"]},
    "video-ocr-sparse-stage": {"operator": "ocrSparse", "inputs": ["selector-stage"]},
    "video-ocr-content-stage": {"operator": "ocrContent", "inputs": ["video-ocr-sparse-stage"]},
    "video-ocr-stage": {"operator": "ocrDense", "inputs": ["video-ocr-content-stage"]},
    "asr-sparse-stage": {"operator": "asrSparse", "inputs": ["time-stage"]},
    "asr-content-stage": {"operator": "asrContent", "inputs": ["asr-sparse-stage"]},
    "asr-stage": {"operator": "asrDense", "inputs": ["asr-content-stage"]},

    "image-classification-stage": {"operator": "documentType", "inputs": ["image-file-path-content-stage"]},
    "photograph-stage": {"operator": "photographFilter", "inputs": ["image-classification-stage"]},
    "document-stage": {"operator": "documentFilter", "inputs": ["image-classification-stage"]},
    "photograph-clip-stage": {"operator": "clip", "inputs": ["photograph-stage"]},
    "photograph-ocr-sparse-stage": {"operator": "ocrSparse", "inputs": ["photograph-stage"]},
    "photograph-ocr-content-stage": {"operator": "ocrContent", "inputs": ["photograph-ocr-sparse-stage"]},
    "photograph-ocr-stage": {"operator": "ocrDense", "inputs": ["photograph-ocr-content-stage"]},
    "document-ocr-sparse-stage": {"operator": "ocrSparse", "inputs": ["document-stage"]},
    "document-ocr-content-stage": {"operator": "ocrContent", "inputs": ["document-ocr-sparse-stage"]},
    "document-ocr-stage": {"operator": "ocrDense", "inputs": ["document-ocr-content-stage"]},

    "video-prompt-stage": {"operator": "videoPrompt", "inputs": ["asr-stage"]},
    "video-caption-sparse-stage": {"operator": "captionSparse", "inputs": ["video-prompt-stage"]},
    "video-caption-content-stage": {"operator": "captionContent", "inputs": ["video-caption-sparse-stage"]},
    "video-caption-stage": {"operator": "captionDense", "inputs": ["video-caption-content-stage"]},
    "document-prompt-stage": {"operator": "documentPrompt", "inputs": ["document-stage"]},
    "document-caption-sparse-stage": {"operator": "captionSparse", "inputs": ["document-prompt-stage"]},
    "document-caption-content-stage": {"operator": "captionContent", "inputs": ["document-caption-sparse-stage"]},
    "document-caption-stage": {"operator": "captionDense", "inputs": ["document-caption-content-stage"]},
    "photograph-prompt-stage": {"operator": "photographPrompt", "inputs": ["photograph-stage"]},
    "photograph-caption-sparse-stage": {"operator": "captionSparse", "inputs": ["photograph-prompt-stage"]},
    "photograph-caption-content-stage": {"operator": "captionContent", "inputs": ["photograph-caption-sparse-stage"]},
    "photograph-caption-stage": {"operator": "captionDense", "inputs": ["photograph-caption-content-stage"]},

    "photograph-final-stage": {"operator": "thumbnail", "inputs": ["photograph-clip-stage", "photograph-caption-stage"], "merge": "COMBINE"},
    "document-final-stage": {"operator": "thumbnail", "inputs": ["document-caption-stage"]},
    "video-final-stage": {"operator": "thumbnail", "inputs": ["video-clip-stage", "video-caption-stage"], "merge": "COMBINE"},

    "video-filter-stage": {"operator": "videoSourceFilter", "inputs": ["video-final-stage"]},
    "image-filter-stage": {"operator": "imageSourceFilter", "inputs": ["document-final-stage", "photograph-final-stage"], "merge": "MERGE"}
  },
  "output": [
    "image-filter-stage"
  ],
  "mergeType": "MERGE"
}
The same pipeline, visualised:

%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#2D373C',
      'primaryTextColor': '#D2EBE9',
      'primaryBorderColor': '#A5D7D2',
      'lineColor': '#D20537',
      'secondaryColor': '#2D373C',
      'edgeLabelBackground': '#0000'
    }
  }
}%%
flowchart LR

direction LR
 s[ ] --> e[enumerator]
 e --> vd[videoDecoder]
 vd --> vfm[fileMetadata]
 vfm --> vfp[videoFilePathContent]
 vfp --> t[time]
 t --> sel[selector]
 e --> id[imageDecoder]
 id --> ifm[fileMetadata]
 ifm --> ifp[imageFilePathContent]

 sel --> ocrs[ocrSparse]
 ocrs --> ocrc[ocrContent]
 ocrc --> ocrd[ocrDense]

 sel --> vclip[clip]

 t --> asrs[asrSparse]
 asrs --> asrc[asrContent]
 asrc --> asrd[asrDense]

 ifp --> dt[documentType]
 dt --> pf[photographFilter]
 dt --> df[documentFilter]

 pf --> iclip[clip]

 pf --> iocrs[ocrSparse]
 iocrs --> iocrc[ocrContent]
 iocrc --> iocrd[ocrDense]

 df --> docrs[ocrSparse]
 docrs --> docrc[ocrContent]
 docrc --> docrd[ocrDense]

 asrd --> vp[videoPrompt]
 vp --> cs[captionSparse]
 cs --> cc[captionContent]
 cc --> cd[captionDense]

 df --> dp[documentPrompt]
 dp --> dcs[captionSparse]
 dcs --> dcc[captionContent]
 dcc --> dcd[captionDense]

 pf --> pp[photographPrompt]
 pp --> pcs[captionSparse]
 pcs --> pcc[captionContent]
 pcc --> pcd[captionDense]

 iclip --> ptn[thumbnail]
 pcd --> ptn[thumbnail]
 iocrd --> ptn[thumbnail]

 dcd --> dtn[thumbnail]
 docrd --> dtn[thumbnail]

 vclip --> vtn[thumbnail]
 cd --> vtn[thumbnail]
 ocrd --> vtn[thumbnail]

 vtn -->|COMBINE| vsf[videoSourceFilter]
 dtn -->|COMBINE| isf[imageSourceFilter]
 ptn -->|COMBINE| isf[imageSourceFilter]

 vsf --> p[persistence]
 isf -->|COMBINE| p[persistence]

 p -->|MERGE| q[ ]

 style q fill:#0000,stroke:#0000,stroke-width:0px
 style s fill:#0000,stroke:#0000,stroke-width:0px

3D Model Pipeline

When ingesting 3D models into vitrivr-engine, you can create visual previews to better understand and manage the models. To achieve this, include the ModelPreviewExporter in your schema configuration.

Preview as JPG

To generate a static preview of the 3D model in JPG format, use the following configuration:

"exporters": [
  {
    "name": "preview",
    "factory": "ModelPreviewExporter",
    "resolverName": "disk",
    "parameters": {
      "maxSideResolution": "400",
      "mimeType": "GLTF",
      "distance": "1",
      "format": "jpg",
      "views": "4"
    }
  }
]

In this example, the preview is a JPG image with four views of the model. The original 3D model format is GLTF. The resulting preview for a sample model, such as a bunny, will look like this:

[JPG preview: four rendered views of the bunny model]

Preview as GIF

Alternatively, you can create a GIF to showcase the 3D model. To configure this, specify the number of views and set the format to GIF:

"exporters": [
  {
    "name": "preview",
    "factory": "ModelPreviewExporter",
    "resolverName": "disk",
    "parameters": {
      "maxSideResolution": "400",
      "mimeType": "GLTF",
      "distance": "1",
      "format": "gif",
      "views": "30"
    }
  }
]

In this configuration, the preview will be a GIF featuring 30 views of the model. The preview for the same bunny model will look like this:

[GIF preview: rotating views of the bunny model]

Running the Ingestion

We run the ingestion using the shipped CLI.

Start vitrivr-engine

Let's start the CLI using the previously built executable. This also works from an IDE (be careful to select the Main from the vitrivr-engine-server module!) or directly with the JAR.

./instance/vitrivr-engine-server-0.0.1-SNAPSHOT/bin/vitrivr-engine-server ./example/schema.json

Initialise Storage Layer

Before starting the ingestion, it is essential to prepare the database, essentially materialising the schema.

Using the CLI, we call the schema's init command.

v> example init

Since our schema is named example, the command is as above. In case you renamed the schema, use the pattern <schema> init.

Start the Ingestion

Ingestion jobs are schema-dependent; therefore, the command follows the same pattern as init:

v> example extract -c ./example/image-pipeline.json

It is good practice to wait until a job has finished. With the default settings, a lot of log statements are printed continuously to the console. As a rule of thumb, once logs have stopped appearing every now and then, the ingestion has finished (whether successfully or not should be evident from the log).

v> example extract -c ./example/video-pipeline.json

With the -c option, we provide the path to the pipeline definition created earlier. It is important to note that these relative paths work because of our setup: by default, any path in any configuration file is resolved relative to the working directory. If you followed this tutorial, this shouldn't be a problem; alternatively, always use absolute paths to avoid such issues.

Retrieval

The following query description performs a temporal query: each of the two text inputs is searched on the clip field, the results are expanded to their related retrievables (RelationExpander) and enriched with temporal metadata (FieldLookup on time), aggregated into temporal sequences, scored, and finally resolved to file paths:

{
      "inputs": {
            "mytext1": {"type": "TEXT", "data": "orange starfish on the seafloor"},
            "mytext2": {"type": "TEXT", "data": "a seasnake on the seafloor"}
      },
      "operations": {
         "clip1" : {"type": "RETRIEVER", "field": "clip", "input": "mytext1"},
         "relations1" : {"type": "TRANSFORMER", "transformerName": "RelationExpander", "input": "clip1"},
         "lookup1" : {"type": "TRANSFORMER", "transformerName": "FieldLookup", "input": "relations1"},
         "clip2" : {"type": "RETRIEVER", "field": "clip", "input": "mytext2"},
         "relations2" : {"type": "TRANSFORMER", "transformerName": "RelationExpander", "input": "clip2"},
         "lookup2" : {"type": "TRANSFORMER", "transformerName": "FieldLookup", "input": "relations2"},
         "temporal" : {"type": "AGGREGATOR", "aggregatorName": "TemporalSequenceAggregator", "inputs": ["lookup1", "lookup2"]},
      
         "aggregator" : {"type": "TRANSFORMER", "transformerName": "ScoreAggregator",  "input": "temporal"},
         
         "filelookup" : {"type": "TRANSFORMER", "transformerName": "FieldLookup", "input": "aggregator"}
      },
      "context": {
         "global": {
            "limit": "1000"
         },
         "local" : {
            "lookup1":{"field": "time", "keys": "start, end"},
            "relations1" :{"outgoing": "partOf"},            
            "lookup2":{"field": "time", "keys": "start, end"},
            "relations2" :{"outgoing": "partOf"},            
            "filelookup": {"field": "file", "keys": "path"}
         }
      },
      "output": "filelookup"
}
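Save the query description, e.g. as example/query.json. It can then be submitted to the engine's RESTful query API; the following curl call is a sketch that assumes the server runs on its default port and exposes a per-schema query route (consult the server's OpenAPI documentation for the exact port and path):

curl -X POST http://localhost:7070/api/example/query -H "Content-Type: application/json" -d @./example/query.json

The query flow, generalised to n text inputs, is visualised below: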
%%{
  init: {
    'theme': 'base',
    'themeVariables': {
      'primaryColor': '#2D373C',
      'primaryTextColor': '#D2EBE9',
      'primaryBorderColor': '#A5D7D2',
      'lineColor': '#D20537',
      'secondaryColor': '#2D373C',
      'edgeLabelBackground': '#0000'
    }
  }
}%%
flowchart LR

direction LR
 ti1[Textinput t=1] 
 ti2[Textinput t=...] 
 tin[Textinput t=n] 
 ti1--> c1[clip]
 ti2--> c2[clip]
 tin--> cn[clip]
 c1 --> r1[RelationExpander]
 c2 --> r2[RelationExpander]
 cn --> rn[RelationExpander]
 r1 --> f1[TimeLookup]
 r2 --> f2[TimeLookup]
 rn --> fn[TimeLookup]
 f1 --> tsa[TemporalAggregator]
 f2 --> tsa
 fn --> tsa
 tsa --> fl[FileLookup]
 fl --> p[output]
 p -->|combine| q[ ]

 style q fill:#0000,stroke:#0000,stroke-width:0px

Migration

In order to maintain interoperability between the supported database systems, vitrivr-engine provides a migration mechanism.

<schema-source> migrate-to -n <schema-target>

Preconditions

  • <schema-source> and <schema-target> must be initialized.
  • All schemas involved must be in the configuration file (see below).
  • The <schema-source> data must be consistent.

As an example (with the configuration below), we migrate from an existing Cottontail source vitrivr-ct to the JSONL schema vitrivr-json, and from there to the Postgres schema vitrivr-pg:

vitrivr-json init
vitrivr-ct migrate-to -n vitrivr-json

Check the JSON files in the ./vitrivr-json folder.

vitrivr-pg init
vitrivr-json migrate-to -n vitrivr-pg

The schema configuration used for this migration example:

{
  "schemas": {
    "vitrivr-ct": {
      "connection": {
        "database": "CottontailConnectionProvider",
        "parameters": {
          "Host": "127.0.0.1",
          "port": "1865"
        }
      },
      "fields": {
        "averagecolor": {
          "factory": "AverageColor"
        },
        "file": {
          "factory": "FileSourceMetadata"
        },
        "clip": {
          "factory": "DenseEmbedding",
          "parameters": {
            "host": "http://10.34.64.84:8888/",
            "model": "open-clip-vit-b32",
            "length": "512",
            "timeoutSeconds": "100",
            "retries": "1000"
          }
        },
        "time": {
          "factory": "TemporalMetadata"
        },
        "video": {
          "factory": "VideoSourceMetadata"
        }
      },
      "resolvers": {
        "disk": {
          "factory": "DiskResolver",
          "parameters": {
            "location": "./example/thumbs"
          }
        }
      },
      "exporters": {
        "thumbnail": {
          "factory": "ThumbnailExporter",
          "resolverName": "disk",
          "parameters": {
            "maxSideResolution": "300",
            "mimeType": "JPG"
          }
        }
      },
      "extractionPipelines": {
        "video": {
          "path": "./example-configs/ingestion/migration/video-ct.json"
        }
      }
    },

    "vitrivr-pg": {
      "connection": {
        "database": "PgVectorConnectionProvider",
        "parameters": {
          "Host": "127.0.0.1",
          "port": "5432",
          "username": "postgres",
          "password": "vitrivr"
        }
      },
      "fields": {
        "averagecolor": {
          "factory": "AverageColor"
        },
        "file": {
          "factory": "FileSourceMetadata"
        },
        "clip": {
          "factory": "DenseEmbedding",
          "parameters": {
            "host": "http://10.34.64.84:8888/",
            "model": "open-clip-vit-b32",
            "length": "512",
            "timeoutSeconds": "100",
            "retries": "1000"
          }
        },
        "time": {
          "factory": "TemporalMetadata"
        },
        "video": {
          "factory": "VideoSourceMetadata"
        }
      },
      "resolvers": {
        "disk": {
          "factory": "DiskResolver",
          "parameters": {
            "location": "./example/thumbs"
          }
        }
      },
      "exporters": {
        "thumbnail": {
          "factory": "ThumbnailExporter",
          "resolverName": "disk",
          "parameters": {
            "maxSideResolution": "300",
            "mimeType": "JPG"
          }
        }
      },
      "extractionPipelines": {
        "video": {
          "path": "./example-configs/ingestion/migration/video-pg.json"
        }
      }
    },


    "vitrivr-json": {
      "connection": {
        "database": "JsonlConnectionProvider",
        "parameters": {
          "root": "."
        }
      },
      "fields": {
        "averagecolor": {
          "factory": "AverageColor"
        },
        "file": {
          "factory": "FileSourceMetadata"
        },
        "clip": {
          "factory": "DenseEmbedding",
          "parameters": {
            "host": "http://10.34.64.84:8888/",
            "model": "open-clip-vit-b32",
            "length": "512",
            "timeoutSeconds": "100",
            "retries": "1000"
          }
        },
        "time": {
          "factory": "TemporalMetadata"
        },
        "video": {
          "factory": "VideoSourceMetadata"
        }
      },
      "resolvers": {
        "disk": {
          "factory": "DiskResolver",
          "parameters": {
            "location": "./example/thumbs"
          }
        }
      },
      "exporters": {
        "thumbnail": {
          "factory": "ThumbnailExporter",
          "resolverName": "disk",
          "parameters": {
            "maxSideResolution": "300",
            "mimeType": "JPG"
          }
        }
      },
      "extractionPipelines": {
        "video": {
          "path": "./example-configs/ingestion/migration/video-json.json"
        }
      }
    }
  }
}