This repo contains code and links to utilities that help with file ingestion into watsonx Discovery. (Note that watsonx Discovery is synonymous with Elasticsearch.)
RAG-LLM is an application that can run locally, in Code Engine, or on an OpenShift cluster. Terraform automation code is provided to help deploy it into Code Engine on IBM Cloud. The application exposes three different APIs:
- `ingestDocs`: ingests documents of different types from a Cloud Object Storage bucket.
- `queryLLM`: retrieves documents from an Elasticsearch index and sends them to an LLM for a natural-language response.
- `queryWDLLM`: retrieves documents from a Watson Discovery collection and sends them to an LLM for a natural-language response.
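As a rough sketch, a client call to one of these APIs might look like the following. The endpoint path and the request field names here are illustrative assumptions, not the application's documented schema:

```python
import json
from urllib import request

# Hypothetical request body for the queryLLM API; the field names are
# illustrative assumptions, not the application's documented schema.
def build_query_payload(question: str, index_name: str, num_results: int = 5) -> dict:
    """Assemble the JSON body for a retrieval-augmented query."""
    return {
        "question": question,
        "es_index_name": index_name,
        "num_results": num_results,
    }

def post_query(base_url: str, payload: dict) -> bytes:
    """POST the payload to the (assumed) queryLLM endpoint."""
    req = request.Request(
        f"{base_url}/queryLLM",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires a running RAG-LLM instance
        return resp.read()

if __name__ == "__main__":
    payload = build_query_payload("What is watsonx Discovery?", "my-docs")
    print(json.dumps(payload))
```

Check the application's own docs for the real endpoint names and parameters before wiring this up.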
watsonx Discovery setup and ingestion is a Python application that uses the LlamaIndex framework to ingest and chunk documents (located locally or in Cloud Object Storage) into an Elasticsearch instance. This utility can ingest multiple documents of type `.pdf`, `.txt`, `.docx`, and `.pptx` at one time.
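Under the hood, chunking splits each document's text into overlapping windows before indexing, so that retrieval can return focused passages. A simplified, dependency-free illustration of what a LlamaIndex node parser does (the sizes are arbitrary defaults, not the utility's actual settings):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so that
    content cut at a chunk boundary still appears whole in one chunk.

    A simplified stand-in for a real node parser, which would also
    respect sentence and token boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window reached the end of the text
        start += chunk_size - overlap
    return chunks
```

Each chunk is then embedded and written to the Elasticsearch index as its own searchable unit.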
Elasticsearch provides helper functions that use its `_bulk` API to ingest documents that have been saved in JSON format.
Read the Python README for details on running a Python script to ingest a JSON file.
Read the JavaScript README for details on running a JavaScript script to ingest a JSON file.
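The `_bulk` API expects a newline-delimited JSON body: one action line followed by the document itself, per document. A minimal sketch of building that body (the index name and document fields are placeholders):

```python
import json

def to_bulk_body(docs: list[dict], index_name: str) -> str:
    """Build the NDJSON body for Elasticsearch's _bulk API: one action
    line and one source line per document, terminated by a newline
    (which _bulk requires)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Example: two placeholder documents destined for a placeholder index.
body = to_bulk_body(
    [{"title": "doc one", "text": "hello"}, {"title": "doc two", "text": "world"}],
    "my-index",
)
```

The resulting string can be POSTed to `<es-url>/_bulk` with `Content-Type: application/x-ndjson`; the scripts linked above handle this (plus error handling) for you.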
See the steps in Working with PDF and Office Documents in Elasticsearch
- Using the simple web crawler in Elasticsearch
- Using the web crawler with chunking preprocessors in Elasticsearch
- Using external tools to scrape websites (turn websites into documents in JSON format, then use the Python or JavaScript scripts above to ingest them)
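For the external-scraper route, each scraped page just needs to be serialized into whatever JSON document shape your ingestion script expects. A hedged sketch, with illustrative field names (the real fields depend on your index mapping):

```python
import json

# Illustrative document shape for a scraped page; the real field names
# depend on the Elasticsearch index mapping you ingest into.
def page_to_document(url: str, title: str, body: str) -> dict:
    return {"url": url, "title": title, "text": body}

def dump_documents(pages: list[dict], path: str) -> None:
    """Write scraped pages to a JSON file ready for the ingestion scripts."""
    docs = [page_to_document(p["url"], p["title"], p["body"]) for p in pages]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docs, f, ensure_ascii=False, indent=2)
```

Once written, the file can be passed to the Python or JavaScript ingestion script described earlier.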
A couple of other methods and utilities:
- Elastic documentation on how to ingest data
- Set up a web crawler with NeuralSeek
- File ingestion from Cloud Object Storage, using Python notebooks in a watsonx.ai environment