watsonx Discovery Document Ingestion Utilities

This repo contains code and links to utilities that help with file ingestion into watsonx Discovery. (Note that watsonx Discovery is synonymous with Elasticsearch.)

RAG-LLM utility ingestDocs API

RAG-LLM is an application that can be run locally, in Code Engine, or on an OpenShift cluster. Terraform automation code is available to help deploy it to Code Engine on IBM Cloud. The application includes three APIs:

ingestDocs: ingests documents of different types from a Cloud Object Storage bucket.
queryLLM: retrieves documents from an Elasticsearch index and sends them to an LLM for a natural-language response.
queryWDLLM: retrieves documents from a Watson Discovery collection and sends them to an LLM for a natural-language response.

LlamaIndex document ingestion script

watsonx Discovery setup and ingestion is a Python application that uses the LlamaIndex framework to chunk documents (stored locally or in Cloud Object Storage) and ingest them into an Elasticsearch instance. This utility can ingest multiple documents of type .pdf, .txt, .docx, and .pptx in a single run.
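To illustrate what "chunking" means before ingestion, here is a minimal sketch in plain Python. It is not the LlamaIndex implementation and not the code used by the utility; the function name `chunk_text` and the `chunk_size` and `overlap` parameters are assumptions made for this example. It splits text into word-based chunks, with consecutive chunks sharing a few words so that context is not lost at chunk boundaries.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into chunks of up to `chunk_size` words.

    Consecutive chunks share `overlap` words. This is a hypothetical helper
    for illustration only, not part of LlamaIndex or the ingestion utility.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current window has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

For example, `chunk_text("a b c d e f g h i j", chunk_size=4, overlap=1)` produces three chunks, each starting with the last word of the previous one. Real chunkers usually count tokens or sentences rather than whitespace-separated words.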

Elasticsearch helper function scripts

Elasticsearch provides helper functions that use its _bulk API to ingest documents that have been saved in JSON format.
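For orientation, the _bulk API accepts a newline-delimited JSON body in which each document is preceded by an action line. The sketch below builds such a body with the standard library only. The function name `build_bulk_body` and the index name `my-index` are assumptions for this example, and this is not the code from the helper scripts in this repo, which may rely on the client library's own bulk helpers instead.

```python
import json
from typing import Dict, Iterable


def build_bulk_body(index_name: str, docs: Iterable[Dict]) -> str:
    """Build a newline-delimited JSON body for the Elasticsearch _bulk API.

    Each document becomes two lines: an "index" action line naming the
    target index, then the document source. Hypothetical helper for
    illustration only.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    # The _bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


if __name__ == "__main__":
    # "my-index" is a placeholder index name.
    print(build_bulk_body("my-index", [{"title": "first"}, {"title": "second"}]), end="")
```

The resulting string would be sent in a POST request to the `_bulk` endpoint with the `application/x-ndjson` content type. In practice the helper functions handle batching and error reporting, which this sketch leaves out.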

Python

Read the Python README for details on running a Python script to ingest a JSON file.

JavaScript

Read the JavaScript README for details on running a JavaScript script to ingest a JSON file.

Use Elasticsearch APIs

See the steps in Working with PDF and Office Documents in Elasticsearch.

Web crawling

Using the simple web crawler in Elasticsearch
Using the web crawler with chunking preprocessors in Elasticsearch
Using external tools to scrape websites (turn websites into documents in JSON format, then use the Python or JavaScript script above to ingest them)
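As a rough illustration of the last option, the sketch below turns one HTML page into a JSON-ready document using only the Python standard library. The class `TextExtractor`, the function `html_to_doc`, and the field names `url`, `title`, and `text` are assumptions for this example. They are not a schema required by Elasticsearch or by the scripts in this repo, and a dedicated scraping tool would handle far more cases.

```python
import json
from html.parser import HTMLParser
from typing import Dict, List


class TextExtractor(HTMLParser):
    """Collect the page title and visible text, skipping script and style content."""

    def __init__(self) -> None:
        super().__init__()
        self.title = ""
        self.parts: List[str] = []
        self._in_title = False
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag == "title":
            self._in_title = False

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return
        if self._in_title:
            self.title += text
        else:
            self.parts.append(text)


def html_to_doc(url: str, html: str) -> Dict[str, str]:
    """Convert raw HTML into a flat document (hypothetical field names)."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return {"url": url, "title": parser.title, "text": " ".join(parser.parts)}


if __name__ == "__main__":
    page = "<html><head><title>Hi</title></head><body><p>Hello</p><p>world</p></body></html>"
    print(json.dumps(html_to_doc("https://example.com", page)))
```

A list of such documents can be saved as a JSON file and then ingested with either of the scripts described above.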

Additional utilities

A few other methods and utilities:

Elastic documentation on how to ingest data

Set up a web crawler with NeuralSeek

File ingestion from Cloud Object Storage, using Python notebooks in a watsonx.ai environment
