This repo contains code and links to utilities that help with file ingestion into watsonx Discovery. (Note that watsonx Discovery is synonymous with Elasticsearch.)
RAG-LLM is an application that can run locally, in Code Engine, or on an OpenShift cluster. Terraform automation code is provided to help deploy it into Code Engine on IBM Cloud. The application exposes three different APIs:
- `ingestDocs`: ingests documents of different types from a Cloud Object Storage bucket.
- `queryLLM`: retrieves documents from an Elasticsearch index and sends them to an LLM for a natural-language response.
- `queryWDLLM`: retrieves documents from a Watson Discovery collection and sends them to an LLM for a natural-language response.
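As a rough sketch, a client call to one of these APIs might look like the following. The endpoint path and the request field names here are illustrative assumptions, not the application's documented schema:

```python
import json
from urllib import request

# Hypothetical request body for the queryLLM API; the field names are
# illustrative assumptions, not the application's documented schema.
def build_query_payload(question: str, index_name: str, num_results: int = 5) -> dict:
    """Assemble the JSON body for a retrieval-augmented query."""
    return {
        "question": question,
        "es_index_name": index_name,
        "num_results": num_results,
    }

def post_query(base_url: str, payload: dict) -> bytes:
    """POST the payload to the (assumed) queryLLM endpoint."""
    req = request.Request(
        f"{base_url}/queryLLM",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:  # requires a running RAG-LLM instance
        return resp.read()

if __name__ == "__main__":
    payload = build_query_payload("What is watsonx Discovery?", "my-docs")
    print(json.dumps(payload))
```

Check the application's own docs for the real endpoint names and parameters before wiring this up.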
watsonx Discovery setup and ingestion is a Python application that uses the LlamaIndex framework to ingest and chunk documents (located locally or in Cloud Object Storage) into an Elasticsearch instance. This utility can ingest multiple documents of type `.pdf`, `.txt`, `.docx`, and `.pptx` at one time.
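Under the hood, chunking splits each document's text into overlapping windows before indexing, so that retrieval can return focused passages. A simplified, dependency-free illustration of what a LlamaIndex node parser does (the sizes are arbitrary defaults, not the utility's actual settings):

```python
def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size character chunks with overlap, so that
    content cut at a chunk boundary still appears whole in one chunk.

    A simplified stand-in for a real node parser, which would also
    respect sentence and token boundaries.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # last window reached the end of the text
        start += chunk_size - overlap
    return chunks
```

Each chunk is then embedded and written to the Elasticsearch index as its own searchable unit.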
Elasticsearch provides helper functions that use its `_bulk` API to ingest documents that have been saved in JSON format.
Read the Python README for details on running a Python script to ingest a JSON file.
Read the JavaScript README for details on running a JavaScript script to ingest a JSON file.
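The `_bulk` API expects a newline-delimited JSON body: one action line followed by the document itself, per document. A minimal sketch of building that body (the index name and document fields are placeholders):

```python
import json

def to_bulk_body(docs: list[dict], index_name: str) -> str:
    """Build the NDJSON body for Elasticsearch's _bulk API: one action
    line and one source line per document, terminated by a newline
    (which _bulk requires)."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index_name}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"

# Example: two placeholder documents destined for a placeholder index.
body = to_bulk_body(
    [{"title": "doc one", "text": "hello"}, {"title": "doc two", "text": "world"}],
    "my-index",
)
```

The resulting string can be POSTed to `<es-url>/_bulk` with `Content-Type: application/x-ndjson`; the scripts linked above handle this (plus error handling) for you.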
See the steps in Working with PDF and Office Documents in Elasticsearch
- Using the simple web crawler in Elasticsearch
- Using the web crawler with chunking preprocessors in Elasticsearch
- Using external tools to scrape websites (turn websites into documents in JSON format, then use the Python or JavaScript scripts above to ingest them)
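For the external-scraper route, each scraped page just needs to be serialized into whatever JSON document shape your ingestion script expects. A hedged sketch, with illustrative field names (the real fields depend on your index mapping):

```python
import json

# Illustrative document shape for a scraped page; the real field names
# depend on the Elasticsearch index mapping you ingest into.
def page_to_document(url: str, title: str, body: str) -> dict:
    return {"url": url, "title": title, "text": body}

def dump_documents(pages: list[dict], path: str) -> None:
    """Write scraped pages to a JSON file ready for the ingestion scripts."""
    docs = [page_to_document(p["url"], p["title"], p["body"]) for p in pages]
    with open(path, "w", encoding="utf-8") as f:
        json.dump(docs, f, ensure_ascii=False, indent=2)
```

Once written, the file can be passed to the Python or JavaScript ingestion script described earlier.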
A couple of other methods and utilities:
- Elastic documentation on how to ingest data
- Set up a web crawler with NeuralSeek
- File ingestion from Cloud Object Storage, using Python notebooks in a watsonx.ai environment