This guide explain how to build a Python RAG chatbot using GPT-4o and Bright Data’s SERP API for more accurate, context-rich AI responses.
- Introduction
- What Is RAG?
- Why Feed AI Models With SERP Data
- RAG With SERP Data With GPT Models Using Python: Step-By-Step Tutorial
- Step #1: Initialize a Python Project
- Step #2: Install the Required Libraries
- Step #3: Prepare Your Project
- Step #4: Configure SERP API
- Step #5: Implement the SERP Scraping Logic
- Step #6: Extract Text from the SERP URLs
- Step #7: Generate the RAG Prompt
- Step #8: Perform the GPT Request
- Step #9: Create the Application UI
- Step #10: Put It All Together
- Step #11: Test the Application
- Conclusion
RAG, short for Retrieval-Augmented Generation, is an AI approach that combines information retrieval with text generation. In a RAG workflow, the application first retrieves relevant data from external sources—such as documents, web pages, or databases. Then, it passes data to the AI models so that it can generate more contextually relevant responses.
RAG enhances large language models (LLMs) like GPT by enabling them to access and reference up-to-date information beyond their original training data. This approach is key in scenarios where precise and context-specific information is needed, as it improves both the quality and accuracy of AI-generated responses.
The knowledge cutoff date for GPT-4o is October 2023, meaning it lacks access to events or information that came out after that time. However, GPT-4o models can pull in data from the Internet in real-time using Bing search integration. That helps them offer more up-to-date information and responses that are detailed, precise, and contextually rich.
This tutorial guides through building a RAG chatbot using OpenAI’s GPT models. The idea is to gather text from the top-performing pages on Google for a specific search query and use it as the context for a GPT request.
The biggest challenge is scraping SERP data. Most search engines come with advanced anti-bot solutions to prevent automated access to their pages. For detailed guidance, refer to our guide on how to scrape Google in Python.
To simplify the scraping process, we will use Bright Data’s SERP API.
This SERP scraper allows you to easily retrieve SERPs from Google, DuckDuckGo, Bing, Yandex, Baidu, and other search engines using simple HTTP requests.
We will then extract text data from the returned URLs using a headless browser. Then, we will use that information as the context for the GPT model in a RAG workflow. If you instead want to retrieve online data directly using AI, read our article on web scraping with ChatGPT.
All the code in this guide is also available in a GitHub repository:
git clone https://github.com/Tonel/rag_gpt_serp_scraping
Follow the instructions in the README.md file to install the project’s dependencies and launch the project.
Keep in mind that the approach presented in this blog post can easily be adapted to any other search engine or LLM.
Note:
This guide refers to Unix and macOS. If you are a Windows user, you can still follow the tutorial by using the Windows Subsystem for Linux (WSL).
Make sure you have Python 3 installed on your machine. Otherwise, download and install it.
Create a folder for your project and switch to it in the terminal:
mkdir rag_gpt_serp_scraping
cd rag_gpt_serp_scraping
The rag_gpt_serp_scraping
folder will contain your Python RAG project.
Then, load the project directory in your favorite Python IDE. PyCharm Community Edition or Visual Studio Code with the Python extension will do.
Inside rag_gpt_serp_scraping, add an empty app.py file. This will contain your scraping and RAG logic.
Next, initialize a Python virtual environment in the project directory:
python3 -m venv env
Activate the virtual environment with the command below:
source ./env/bin/activate
This Python RAG project will be using the following dependencies:
python-dotenv
: It will be used to securely manage sensitive credentials, such as Bright Data credentials and OpenAI API keys.requests
: To perform HTTP requests to Bright Data’s SERP API.langchain-community
: It will be used for retrieving text from the Google SERP pages and cleaning it to generate relevant content for RAG.openai
: It will be employed to interface with GPT models to generate natural language responses based on the given inputs and RAG context.streamlit
: It will come in handy for creating a UI where users can input their Google search queries and AI prompt, and view the results dynamically.
Install all the dependencies:
pip install python-dotenv requests langchain-community openai streamlit
We will use AsyncChromiumLoader from langchain-community, which requires the following dependencies:
pip install --upgrade --quiet playwright beautifulsoup4 html2text
To function properly, Playwright also requires you to install the browsers:
playwright install
In app.py
, add the following imports:
from dotenv import load_dotenv
import os
import requests
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer
from openai import OpenAI
import streamlit as st
Then, create a .env
file in your project folder to store all your credentials. Your project structure will now look like as below:
Use the function below in app.py
to instruct python-dotenv
to load the environment variables from .env
:
load_dotenv()
You can now import environment variables from .env
or the system with:
os.environ.get("<ENV_NAME>")
We will use Bright Data’s SERP API to retrieve content from search engine results pages and use that in our Python RAG workflow. Specifically, we will extract text from the URLs of the web pages returned by the SERP API.
To set up SERP API, refer to the official documentation. Alternatively, follow the instructions below.
If you have not already created an account, sign up for Bright Data. Once logged in, navigate to your account dashboard:
There, click the “Get proxy products” button.
That will bring you to the page below, where you have to click on the “SERP API” row:
On the SERP API product page, toggle “Activate zone” to enable the product:
Now, copy the SERP API host, port, username, and password in the “Access parameters” section and add them to your .env
file:
BRIGHT_DATA_SERP_API_HOST="<YOUR_HOST>"
BRIGHT_DATA_SERP_API_PORT=<YOUR_PORT>
BRIGHT_DATA_SERP_API_USERNAME="<YOUR_USERNAME>"
BRIGHT_DATA_SERP_API_PASSWORD="<YOUR_PASSWORD>"
Replace the <YOUR_XXXX>
placeholders with the values provided by Bright Data on the SERP API page.
Note that the host in “Access parameters” has a format like this:
brd.superproxy.io:33335
Split it as below:
BRIGHT_DATA_SERP_API_HOST="brd.superproxy.io"
BRIGHT_DATA_SERP_API_PORT=33335
In app.py
, add the following function to retrieve the first number_of_urls
URLs from a Google SERP page:
def get_google_serp_urls(query, number_of_urls=5):
# perform a Bright Data's SERP API request
# with JSON autoparsing
host = os.environ.get("BRIGHT_DATA_SERP_API_HOST")
port = os.environ.get("BRIGHT_DATA_SERP_API_PORT")
username = os.environ.get("BRIGHT_DATA_SERP_API_USERNAME")
password = os.environ.get("BRIGHT_DATA_SERP_API_PASSWORD")
proxy_url = f"http://{username}:{password}@{host}:{port}"
proxies = {"http": proxy_url, "https": proxy_url}
url = f"https://www.google.com/search?q={query}&brd_json=1"
response = requests.get(url, proxies=proxies, verify=False)
# retrieve the parsed JSON response
response_data = response.json()
# extract a "number_of_urls" number of
# Google SERP URLs from the response
google_serp_urls = []
if "organic" in response_data:
for item in response_data["organic"]:
if "link" in item:
google_serp_urls.append(item["link"])
return google_serp_urls[:number_of_urls]
This makes an HTTP GET request to SERP API with the search query specified in the query argument. The brd_json=1
query parameter ensures that SERP API parses the results into JSON for you, in the format below:
{
"general": {
"search_engine": "google",
"results_cnt": 1980000000,
"search_time": 0.57,
"language": "en",
"mobile": false,
"basic_view": false,
"search_type": "text",
"page_title": "pizza - Google Search",
"code_version": "1.90",
"timestamp": "2023-06-30T08:58:41.786Z"
},
"input": {
"original_url": "https://www.google.com/search?q=pizza&brd_json=1",
"user_agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/608.2.11 (KHTML, like Gecko) Version/13.0.3 Safari/608.2.11",
"request_id": "hl_1a1be908_i00lwqqxt1"
},
"organic": [
{
"link": "https://www.pizzahut.com/",
"display_link": "https://www.pizzahut.com",
"title": "Pizza Hut | Delivery & Carryout - No One OutPizzas The Hut!",
"image": "omitted for brevity...",
"image_alt": "pizza from www.pizzahut.com",
"image_base64": "omitted for brevity...",
"rank": 1,
"global_rank": 1
},
{
"link": "https://www.dominos.com/en/",
"display_link": "https://www.dominos.com › ...",
"title": "Domino's: Pizza Delivery & Carryout, Pasta, Chicken & More",
"description": "Order pizza, pasta, sandwiches & more online for carryout or delivery from Domino's. View menu, find locations, track orders. Sign up for Domino's email ...",
"image": "omitted for brevity...",
"image_alt": "pizza from www.dominos.com",
"image_base64": "omitted for brevity...",
"rank": 2,
"global_rank": 3
},
// omitted for brevity...
],
// omitted for brevity...
}
The last few lines of the function retrieve each SERP URL from the resulting JSON data, select only the first number_of_urls
URLs, and return them in a list.
Define a function that extracts text from each of the SERP URLs:
def extract_text_from_urls(urls, number_of_words=600):
# instruct a headless Chrome instance to visit the provided URLs
# with the specified user-agent
loader = AsyncChromiumLoader(
urls,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36",
)
html_documents = loader.load()
# process the extracted HTML documents to extract text from them
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(
html_documents,
tags_to_extract=["p", "em", "li", "strong", "h1", "h2"],
unwanted_tags=["a"],
remove_comments=True,
)
# make sure each HTML text document contains only a number
# number_of_words words
extracted_text_list = []
for doc_transformed in docs_transformed:
# split the text into words and join the first number_of_words
words = doc_transformed.page_content.split()[:number_of_words]
extracted_text = " ".join(words)
# ignore empty text documents
if len(extracted_text) != 0:
extracted_text_list.append(extracted_text)
return extracted_text_list
This function:
- Loads web pages from the URLs passed as an argument using a headless Chrome browser instance.
- Utilizes BeautifulSoupTransformer to process the HTML of each page and extract text from specific tags (like
<p>
,<h1>
,<strong>
, etc.), omitting unwanted tags (like<a>
) and comments. - Limits the extracted text for each webpage to a number of words specified by the
number_of_words
argument. - Returns a list of the extracted text from each URL.
While the ["p", "em", "li", "strong", "h1", "h2"]
tags are enough to extract text from most web pages, in some specific scenarios, you may need to customize this list of HTML tags. Also, you might have to increase or decrease the target number of words for each text item.
For example, consider the web page below:
Applying that function to that page will result in this text array:
["Lisa Johnson Mandell’s Transformers One review reveals the heretofore inconceivable: It’s one of the best animated films of the year! I never thought I’d see myself write this about a Transformers movie, but Transformers One is actually an exceptional film! ..."]
The list of text items returned by extract_text_from_urls()
represents the RAG context to feed to the OpenAI model.
Define a function that transforms the AI prompt request and text context into the final RAG prompt:
def get_openai_prompt(request, text_context=[]):
# default prompt
prompt = request
# add the context to the prompt, if present
if len(text_context) != 0:
context_string = "\n\n--------\n\n".join(text_context)
prompt = f"Answer the request using only the context below.\n\nContext:\n{context_string}\n\nRequest: {request}"
return prompt
Prompts returned by the previous function when a RAG context is specified have this format:
Answer the request using only the context below.
Context:
Bla bla bla...
--------
Bla bla bla...
--------
Bla bla bla...
Request: <YOUR_REQUEST>
First, initialize the OpenAI client at the top of the app.py
file:
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
This relies on the OPENAI_API_KEY
environment variable, which you can define directly in your system’s environments or in the .env
file:
OPENAI_API_KEY="<YOUR_API_KEY>"
Replace <YOUR_API_KEY>
with the value of your OpenAI API key. If you do not know how to get one, follow the official guide.
Next, write a function that uses the OpenAI official client to perform a request to the GPT-4o mini AI model:
def interrogate_openai(prompt, max_tokens=800):
# interrogate the OpenAI model with the given prompt
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
)
return response.choices[0].message.content
Note:
You can configure any other GPT model supported by the OpenAI API.
If called with a prompt returned by get_openai_prompt()
that includes a specified text context, interrogate_openai()
will successfully perform retrieval-augmented generation as intended.
Use Streamlit to define a simple form UI where users can specify:
- The Google search query to pass to the SERP API
- The AI prompt to send to GPT-4o mini
To do that, use this code:
with st.form("prompt_form"):
# initialize the output results
result = ""
final_prompt = ""
# textarea for user to input their Google search query
google_search_query = st.text_area("Google Search:", None)
# textarea for user to input their AI prompt
request = st.text_area("AI Prompt:", None)
# button to submit the form
submitted = st.form_submit_button("Send")
# if the form is submitted
if submitted:
# retrieve the Google SERP URLs from the given search query
google_serp_urls = get_google_serp_urls(google_search_query)
# extract the text from the respective HTML pages
extracted_text_list = extract_text_from_urls(google_serp_urls)
# generate the AI prompt using the extracted text as context
final_prompt = get_openai_prompt(request, extracted_text_list)
# interrogate an OpenAI model with the generated prompt
result = interrogate_openai(final_prompt)
# dropdown containing the generated prompt
final_prompt_expander = st.expander("AI Final Prompt:")
final_prompt_expander.write(final_prompt)
# write the result from the OpenAI model
st.write(result)
The Python RAG script is ready.
Your app.py
file should contain the following code:
from dotenv import load_dotenv
import os
import requests
from langchain_community.document_loaders import AsyncChromiumLoader
from langchain_community.document_transformers import BeautifulSoupTransformer
from openai import OpenAI
import streamlit as st
# load the environment variables from the .env file
load_dotenv()
# initialize the OpenAI API client with your API key
openai_client = OpenAI(api_key=os.environ.get("OPENAI_API_KEY"))
def get_google_serp_urls(query, number_of_urls=5):
# perform a Bright Data's SERP API request
# with JSON autoparsing
host = os.environ.get("BRIGHT_DATA_SERP_API_HOST")
port = os.environ.get("BRIGHT_DATA_SERP_API_PORT")
username = os.environ.get("BRIGHT_DATA_SERP_API_USERNAME")
password = os.environ.get("BRIGHT_DATA_SERP_API_PASSWORD")
proxy_url = f"http://{username}:{password}@{host}:{port}"
proxies = {"http": proxy_url, "https": proxy_url}
url = f"https://www.google.com/search?q={query}&brd_json=1"
response = requests.get(url, proxies=proxies, verify=False)
# retrieve the parsed JSON response
response_data = response.json()
# extract a "number_of_urls" number of
# Google SERP URLs from the response
google_serp_urls = []
if "organic" in response_data:
for item in response_data["organic"]:
if "link" in item:
google_serp_urls.append(item["link"])
return google_serp_urls[:number_of_urls]
def extract_text_from_urls(urls, number_of_words=600):
# instruct a headless Chrome instance to visit the provided URLs
# with the specified user-agent
loader = AsyncChromiumLoader(
urls,
user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/129.0.0.0 Safari/537.36",
)
html_documents = loader.load()
# process the extracted HTML documents to extract text from them
bs_transformer = BeautifulSoupTransformer()
docs_transformed = bs_transformer.transform_documents(
html_documents,
tags_to_extract=["p", "em", "li", "strong", "h1", "h2"],
unwanted_tags=["a"],
remove_comments=True,
)
# make sure each HTML text document contains only a number
# number_of_words words
extracted_text_list = []
for doc_transformed in docs_transformed:
# split the text into words and join the first number_of_words
words = doc_transformed.page_content.split()[:number_of_words]
extracted_text = " ".join(words)
# ignore empty text documents
if len(extracted_text) != 0:
extracted_text_list.append(extracted_text)
return extracted_text_list
def get_openai_prompt(request, text_context=[]):
# default prompt
prompt = request
# add the context to the prompt, if present
if len(text_context) != 0:
context_string = "\n\n--------\n\n".join(text_context)
prompt = f"Answer the request using only the context below.\n\nContext:\n{context_string}\n\nRequest: {request}"
return prompt
def interrogate_openai(prompt, max_tokens=800):
# interrogate the OpenAI model with the given prompt
response = openai_client.chat.completions.create(
model="gpt-4o-mini",
messages=[{"role": "user", "content": prompt}],
max_tokens=max_tokens,
)
return response.choices[0].message.content
# create a form in the Streamlit app for user input
with st.form("prompt_form"):
# initialize the output results
result = ""
final_prompt = ""
# textarea for user to input their Google search query
google_search_query = st.text_area("Google Search:", None)
# textarea for user to input their AI prompt
request = st.text_area("AI Prompt:", None)
# button to submit the form
submitted = st.form_submit_button("Send")
# if the form is submitted
if submitted:
# retrieve the Google SERP URLs from the given search query
google_serp_urls = get_google_serp_urls(google_search_query)
# extract the text from the respective HTML pages
extracted_text_list = extract_text_from_urls(google_serp_urls)
# generate the AI prompt using the extracted text as context
final_prompt = get_openai_prompt(request, extracted_text_list)
# interrogate an OpenAI model with the generated prompt
result = interrogate_openai(final_prompt)
# dropdown containing the generated prompt
final_prompt_expander = st.expander("AI Final Prompt")
final_prompt_expander.write(final_prompt)
# write the result from the OpenAI model
st.write(result)
Launch your Python RAG application with:
streamlit run app.py
In the terminal, you should see the following output:
You can now view your Streamlit app in your browser.
Local URL: http://localhost:8501
Network URL: http://172.27.134.248:8501
Follow the instructions, and visit http://localhost:8501
in the browser. Below is what you should be seeing:
Test the application by using a Google search query as below:
Transformers One review
And an AI prompt as follows:
Write a review for the movie Transformers One
Click “Send” and wait while your application processes the request. After a few seconds, you should get a result like this:
If you expand the “AI Final Prompt” dropdown, you will see the complete prompt used by the application for RAG.
The major challenge with using a Python RAG chatbot is scraping search engines like Google:
- They frequently alter the structure of their SERP pages.
- They are protected by some of the most sophisticated anti-bot measures available.
- Retrieving large volumes of SERP data concurrently is complex and can be expensive.
Bright Data’s SERP API helps you retrieve real-time SERP data from all major search engines with no effort. It also supports RAG and many other applications. Get your free trial now!