This repo contains code and links to utilities to help with file ingestion into watsonx Discovery (Elasticsearch)

ibm-build-lab/wxd-file-ingestion-utilities

watsonx Discovery Document Ingestion Utilities

This repo contains code and links to utilities that help with file ingestion into watsonx Discovery. Note that watsonx Discovery is synonymous with Elasticsearch.

RAG-LLM utility ingestDocs API

RAG-LLM is an application that can be started locally, in Code Engine, or on an OpenShift cluster. Terraform automation code is available to help deploy it into Code Engine on IBM Cloud. The application includes three APIs:

  • ingestDocs: ingest documents of different types from a Cloud Object Storage bucket.
  • queryLLM: retrieve documents from an Elasticsearch index and send them to an LLM for a natural language response.
  • queryWDLLM: retrieve documents from a Watson Discovery collection and send them to an LLM for a natural language response.

LlamaIndex document ingestion script

watsonx Discovery setup and ingestion is a Python application that uses the LlamaIndex framework to chunk documents (stored locally or in Cloud Object Storage) and ingest them into an Elasticsearch instance. This utility can ingest multiple documents of type .pdf, .txt, .docx, and .pptx at one time.
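
To illustrate what chunking means here, the following is a minimal sketch in plain Python. It does not use LlamaIndex and is not the utility's own code. The function names `is_supported` and `chunk_text`, the chunk size, and the overlap are assumptions chosen for illustration; only the list of file extensions comes from the description above.

```python
from pathlib import Path

# File types the utility is described as supporting.
SUPPORTED_SUFFIXES = {".pdf", ".txt", ".docx", ".pptx"}


def is_supported(path: str) -> bool:
    """Return True if the file extension is one of the supported types."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: real ingestion frameworks usually split on
    sentences or tokens rather than raw characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

The overlap keeps a little shared context between neighboring chunks, which can help retrieval when a relevant passage straddles a chunk boundary.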

Elasticsearch helper function scripts

Elasticsearch provides helper functions that use its _bulk API to ingest documents that have been saved in JSON format.

Python

Read the Python README for details on running a Python script to ingest a JSON file.
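
As a rough sketch of the shape such a script might take, the example below reads documents from a JSON file and turns them into bulk actions. It assumes the file holds a JSON array of objects. The names `load_docs` and `to_actions`, the file name, and the index name are hypothetical and not taken from the repo's script. The commented lines show how the actions could be handed to `elasticsearch.helpers.bulk`, with a placeholder URL and API key.

```python
import json
from typing import Iterable, Iterator


def load_docs(path: str) -> list[dict]:
    """Load a JSON file assumed to contain an array of document objects."""
    with open(path, encoding="utf-8") as f:
        docs = json.load(f)
    if not isinstance(docs, list):
        raise ValueError("expected a JSON array of documents")
    return docs


def to_actions(docs: Iterable[dict], index: str) -> Iterator[dict]:
    """Yield one bulk action per document, targeting the given index."""
    for doc in docs:
        yield {"_index": index, "_source": doc}


# Sending the actions requires the `elasticsearch` package and a reachable
# cluster; the URL, API key, file name, and index name are placeholders.
#
# from elasticsearch import Elasticsearch
# from elasticsearch.helpers import bulk
#
# client = Elasticsearch("https://localhost:9200", api_key="...")
# success_count, errors = bulk(client, to_actions(load_docs("docs.json"), "my-index"))
```

Yielding actions from a generator lets the helper batch requests without holding every action in memory at once.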

JavaScript

Read the JavaScript README for details on running a JavaScript script to ingest a JSON file.

Use Elasticsearch APIs

See the steps in Working with PDF and Office Documents in Elasticsearch.
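
One common way to index PDF and Office files directly is the Elasticsearch ingest attachment processor, which extracts text from a base64-encoded field. Assuming the linked steps follow that approach (the text here does not confirm it), the request bodies could be built as in the sketch below. The helper names and the `data` field name are illustrative choices, not part of this repo.

```python
import base64
import json


def attachment_pipeline_body(field: str = "data") -> dict:
    """Body for PUT _ingest/pipeline/<name> using the attachment processor."""
    return {
        "description": "Extract text from base64-encoded files",
        "processors": [{"attachment": {"field": field}}],
    }


def attachment_doc_body(raw: bytes, field: str = "data") -> dict:
    """Body for indexing one file through the pipeline (?pipeline=<name>)."""
    return {field: base64.b64encode(raw).decode("ascii")}


if __name__ == "__main__":
    # Print both request bodies so they can be inspected or sent with any client.
    print(json.dumps(attachment_pipeline_body(), indent=2))
    print(json.dumps(attachment_doc_body(b"hello"), indent=2))
```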

Web crawling

Additional utilities

Other methods and utilities:

  • Elastic documentation on how to ingest data
  • Set up a web crawler with NeuralSeek
  • File ingestion from Cloud Object Storage, using Python notebooks in a watsonx.ai environment
