watsonx Discovery Document Ingestion Utilities

This repo contains code and links to utilities that help with file ingestion into watsonx Discovery. (Note that watsonx Discovery is synonymous with Elasticsearch.)

RAG-LLM utility ingestDocs API

RAG-LLM is an application that can run locally, in Code Engine, or on an OpenShift cluster. Terraform automation code is available to help deploy it to Code Engine on IBM Cloud. The application includes three APIs:

  • ingestDocs: ingests documents of different types from a Cloud Object Storage bucket.
  • queryLLM: retrieves documents from an Elasticsearch index and sends them to an LLM for a natural-language response.
  • queryWDLLM: retrieves documents from a Watson Discovery collection and sends them to an LLM for a natural-language response.
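The retrieve-then-generate flow behind the two query APIs can be pictured with a minimal sketch. This is not the RAG-LLM application's code: the function name `build_prompt`, the `max_chars` budget, and the prompt wording are all assumptions made for illustration. It only shows how retrieved passages might be combined with a question before being sent to an LLM.

```python
def build_prompt(question, passages, max_chars=2000):
    """Combine retrieved passages and a question into one prompt string.

    Hypothetical helper: names and prompt wording are illustrative only.
    """
    context_parts = []
    used = 0
    for i, passage in enumerate(passages, start=1):
        entry = f"[{i}] {passage.strip()}"
        # Stop adding passages once the character budget would be exceeded.
        if used + len(entry) > max_chars:
            break
        context_parts.append(entry)
        used += len(entry)
    context = "\n".join(context_parts)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

In a real deployment the passages would come from the Elasticsearch index or the Watson Discovery collection, and the returned string would be passed to whichever LLM endpoint is configured.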

LlamaIndex document ingestion script

watsonx Discovery setup and ingestion is a Python application that uses the LlamaIndex framework to chunk documents (stored locally or in Cloud Object Storage) and ingest them into an Elasticsearch instance. It can ingest multiple .pdf, .txt, .docx, and .pptx documents in a single run.
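To make the idea of chunking concrete, here is a small standard-library sketch. It does not use LlamaIndex and is not how the linked application is implemented; LlamaIndex's own splitters are sentence- and token-aware, while this version splits on raw characters. The names `is_supported` and `chunk_text`, and the default sizes, are assumptions for illustration.

```python
from pathlib import Path

# File types the ingestion utility is described as supporting.
SUPPORTED_EXTENSIONS = {".pdf", ".txt", ".docx", ".pptx"}


def is_supported(filename):
    """Return True if the file extension is one of the supported types."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into character chunks that overlap by `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks
```

Overlap keeps some shared context between neighbouring chunks, so a sentence that straddles a boundary is still retrievable from at least one of them.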

Elasticsearch helper function scripts

Elasticsearch provides helper functions that use its _bulk API to ingest documents that have been saved in JSON format.

Python

Read the Python README for details on running a Python script to ingest a JSON file.
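As a rough illustration of what bulk ingestion of JSON documents involves, the sketch below prepares documents in two forms: action dictionaries of the kind the official Python client's `elasticsearch.helpers.bulk` accepts, and the newline-delimited body that the `_bulk` REST endpoint expects. It is not the script from the Python README. The function names and the index name `my-index` are placeholders.

```python
import json


def generate_actions(docs, index_name):
    """Yield one bulk action per document, in the shape used by the client helpers."""
    for doc in docs:
        yield {"_index": index_name, "_source": doc}


def to_bulk_body(docs, index_name):
    """Build the newline-delimited JSON body for a raw _bulk request."""
    lines = []
    for doc in docs:
        # Each document is preceded by an action line naming the target index.
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n"


# Example input, standing in for the contents of a JSON file.
docs = json.loads('[{"title": "a"}, {"title": "b"}]')
actions = list(generate_actions(docs, "my-index"))
body = to_bulk_body(docs, "my-index")
```

With the `elasticsearch` package installed and a connected client object (assumed here to be named `client`), the actions could be sent with `helpers.bulk(client, generate_actions(docs, "my-index"))`.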

JavaScript

Read the JavaScript README for details on running a JavaScript script to ingest a JSON file.

Use Elasticsearch APIs

See the steps in Working with PDF and Office Documents in Elasticsearch.
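One common way to index binary documents through the Elasticsearch APIs is the ingest attachment processor, which extracts text from a base64-encoded field. The sketch below assumes that approach; the linked steps may differ in detail. The field name `data`, the `filename` field, and both function names are assumptions for illustration.

```python
import base64


def build_attachment_pipeline(field="data"):
    """Return an ingest pipeline definition that uses the attachment processor."""
    return {
        "description": "Extract text from binary documents",
        "processors": [{"attachment": {"field": field}}],
    }


def build_attachment_doc(file_bytes, filename, field="data"):
    """Return a document whose `field` holds the file content as base64 text."""
    return {
        "filename": filename,
        field: base64.b64encode(file_bytes).decode("ascii"),
    }
```

The pipeline definition would be registered once under a name of your choosing, and each document would then be indexed with that pipeline selected so the extracted text is stored alongside the original fields.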

Web crawling

Additional utilities

A few other methods and utilities:

Elastic documentation on how to ingest data

Set up a web crawler with NeuralSeek

File ingestion from Cloud Object Storage, using Python notebooks in a watsonx.ai environment