Building an Intelligent Response System on Custom Data with GPT-3.5 and Haystack

Derrick Ofori
14 min read · Apr 11, 2023

1. Introduction

OpenAI’s Generative Pre-trained Transformer 3.5 (GPT-3.5) has demonstrated its potential to revolutionize the way we interact with and access information, owing to the model’s capacity to understand natural language and generate coherent, relevant responses.

A shortfall of this model is its tendency to hallucinate, that is, to generate incorrect responses when it lacks contextual information about a query. Querying information that is unknown, too niche, or generated after the model’s training cut-off (2021) may lead to unreliable responses. This problem may be mitigated by providing contextual information to the model before queries are processed. However, this exposes another limitation.

The GPT-3.5 text models have a maximum token limit of 4,096, or approximately 3,000 words. Even GPT-4, OpenAI’s most recent model (released March 14th, 2023), is capped at 32,000 tokens, or approximately 25,000 words. These token limits may make the models unable to process documents containing massive amounts of text. However, if chunks of contextual information relating solely to a user’s query could be extracted from a large corpus of text and fed into the GPT model, then current token caps may be sufficient to generate a non-hallucinatory, intelligent response. A notable and efficient way to achieve this is the integration of GPT-3.5 with Haystack.
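As a rough illustration (tiktoken is not used elsewhere in this article), a text’s token count can be checked against a model’s cap before it is sent; a minimal sketch using OpenAI’s tiktoken library:

# Minimal sketch, assuming the tiktoken package is installed (pip install tiktoken)
import tiktoken

# cl100k_base is the tokenizer used by the gpt-3.5-turbo family of models
encoding = tiktoken.get_encoding("cl100k_base")

def fits_in_context(text: str, max_tokens: int = 4096) -> bool:
    # Encode the text and compare its token count against the model's cap
    return len(encoding.encode(text)) <= max_tokens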

1.1 Haystack

Haystack is an open-source framework for building search systems over large-scale text data, leveraging state-of-the-art models and algorithms to provide accurate and efficient search results for user queries. It offers several customizable pipelines for building an end-to-end Natural Language Processing (NLP) system. A user may start with an indexing pipeline, where files are converted (from .txt, .pdf, .doc, etc.) into a Haystack data type called Document. These Documents are then preprocessed and indexed into a DocumentStore (SQL, FAISS, Elasticsearch, etc.) for fast retrieval. A search pipeline consisting of a retriever may then be used to fetch the Documents most relevant to a user’s query, which are fed into the GPT-3.5 model for response generation. Other applications of Haystack include document summarization, document search and question generation.

This article will demonstrate how the integration of Haystack and GPT-3.5 may be used to construct an intelligent generative response system based on custom text data. The article will cover two use cases:

  • Market Research
  • Enquiry System

2. Use Cases

2.1 Market Research


For the first use case, reviews of a specific product from Amazon’s fine food reviews will be examined. The product in question is a brand of oatmeal raisin cookies, and the goal of this section is to analyse customer reviews and investigate how the product may be improved.

2.1.1 Data Importation and Description

The dataset contains 913 observations of six variables: Id, ProductId, UserId, Score, Summary and Text.

# Mount your gdrive for colab
from google.colab import drive
drive.mount('/content/drive')

# import the pandas library
import pandas as pd
# read the Excel file 'Reviews_oatmeal_cookies.xlsx' from Google Drive
dat = pd.read_excel('drive/MyDrive/oat_review/Reviews_oatmeal_cookies.xlsx')

# display the first three rows of the data
dat.head(3)
# display information about the dataset
dat.info()

2.1.2 Exploratory Data Analysis

Of particular interest to most businesses are the ratings customers give their product, as they are a strong indication of its marketability. A mean rating of 4.58 is not bad at all; however, it may be beneficial to understand why some customers gave the product a low rating. A word cloud was used to unearth the most frequently occurring words in the Summary field for reviews with score ratings of 2 or less.

# find the mean rating
dat.Score.mean()
# Import the WordCloud and Matplotlib libraries
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join the 'Summary' column of 'dat' where the 'Score' is less than or equal to 2
text = ' '.join(dat[dat.Score <=2].Summary)

# Generate a WordCloud object with specified parameters
wordcloud = WordCloud(width=800, height=800, background_color='white', min_font_size=10).generate(text)

# Create a figure to display the WordCloud object
plt.figure(figsize=(8, 8), facecolor=None)

# Display the WordCloud object
plt.imshow(wordcloud)

# Remove the axis ticks and labels
plt.axis('off')

# Adjust the layout to remove any extra whitespace
plt.tight_layout(pad=0)

# Display the figure
plt.show()

The reasons why some customers dislike the product are evident in the visualization below. Words like “Bad”, “dry”, “Bland” and “Blech” express customers’ dissatisfaction with the product and may be the reason for their low ratings. However, these words lack context and do not provide sufficient insight for a business to take concrete action. This article will re-approach the problem with the generative response system.

2.1.3 Generative Response System for Market Research

First, all reviews will be joined into a single text file and written to the drive. As illustrated below, the text file contains 47,496 words. This far exceeds the current word caps of GPT-3.5 and GPT-4 (approximately 3,000 and 25,000 words respectively).

# Join all reviews with a line-break separator
text1 = '\n'.join(dat.Text)

# Open a text file in write mode and save the joined reviews
with open('/content/drive/MyDrive/text_review.txt', 'w') as file:
    file.write(text1)

# Open the file in read mode and read its contents
with open('/content/drive/MyDrive/text_review.txt', 'r') as file:
    contents = file.read()

# Split the contents into words
words = contents.split()

# Count the number of words
num_words = len(words)

# Print the number of words
print("The file contains", num_words, "words.")

To convert the text file to Haystack Documents, a converter object is initialized.

# Upgrade pip
!pip install --upgrade pip

# Install farm-haystack along with its dependencies for the Google Colab environment
!pip install 'farm-haystack[colab]'

# Import the TextConverter to convert text files to Documents
from haystack.nodes import TextConverter

# Set the path to the input text file
DOC_DIR = '/content/drive/MyDrive/text_review.txt'

# Initialize a TextConverter object with the specified parameters
converter = TextConverter(
    remove_numeric_tables=True,  # Remove numeric tables from the text
    valid_languages=["en"]       # Specify that the text is in English
)

# Convert the text file to Haystack Documents
# convert() returns a list of Documents; the single text file yields one Document, extracted with [0]
docs = converter.convert(file_path=DOC_DIR, meta=None)[0]

Word embedding models are generally limited in the number of words they can process at a time, so the text in the file may need to be grouped into smaller chunks. To achieve this, a PreProcessor instance is used to define the split length, 100 words in this case. An overlap of three words ensures no information is lost at chunk boundaries, and some standard cleaning steps, including the removal of empty lines and whitespace, were applied.

# Import the PreProcessor class from the haystack.nodes module
from haystack.nodes import PreProcessor

# Create an instance of the PreProcessor class with various options
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=100,
    split_overlap=3,
    split_respect_sentence_boundary=False,
)

# Use the process method of the preprocessor instance to process the docs
processed_docs = preprocessor.process(docs)

The document store can now be initialised. For this use case, the FAISS database was chosen. The text-embedding-ada-002 model from OpenAI will be used to represent the text data as 1,536-dimensional vectors; it will be defined in the embedding retriever later on. The faiss_index_factory_str argument specifies the type of FAISS index to use (in this case, a Flat index). Note that the current runtime may need to be restarted after the faiss dependency is installed.

Once the document_store is finalised, a faiss_document_store.db file will be created in the current working directory.

# Install farm-haystack along with its dependencies for FAISS
!pip install 'farm-haystack[faiss]'

# Import the FAISSDocumentStore class from the haystack.document_stores module
from haystack.document_stores import FAISSDocumentStore

# Create an instance of the FAISSDocumentStore class with specified options
document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",  # Flat index: exact (exhaustive) similarity search
    embedding_dim=1536               # Matches the text-embedding-ada-002 vector size
)
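As an aside (not part of the original walkthrough), the FAISS index can be persisted once documents and embeddings have been written later in this section, so they do not have to be recomputed in a new session. A minimal sketch, assuming farm-haystack 1.x and hypothetical file names:

# Sketch: save the index after document_store.update_embeddings(...) has run
document_store.save(index_path="my_faiss_index.faiss")

# Reload it in a later session instead of rebuilding and re-embedding
# reloaded_store = FAISSDocumentStore.load(index_path="my_faiss_index.faiss")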

Next, any existing documents in the database were deleted and the preprocessed documents were added.

# delete documents in database
document_store.delete_documents()

# add preprocessed document
document_store.write_documents(processed_docs)

The retriever module is responsible for matching a query to the documents stored in the database. In this use case, the retriever performs a semantic similarity search using embeddings from the “text-embedding-ada-002” model, for which an API key from OpenAI is required. These embeddings will also be used to index the documents in the database.

Once the retriever is set, it can be connected to the document_store to convert the raw text into high-dimensional vectors.

# Import the EmbeddingRetriever class from the haystack.nodes module
from haystack.nodes import EmbeddingRetriever

# Set the OpenAI API key
MY_API_KEY = "enter api key from open ai"

# Initialize an EmbeddingRetriever object with the specified parameters
retriever = EmbeddingRetriever(
    document_store=document_store,             # Document database
    embedding_model="text-embedding-ada-002",  # Pre-trained text embedding model
    batch_size=32,                             # Batch size for processing data
    api_key=MY_API_KEY,                        # OpenAI API key for authentication
    max_seq_len=1024                           # Maximum length of input sequences
)

# Update the embeddings of documents in the document store using the retriever object
document_store.update_embeddings(retriever)

The generator can now be initialised using the OpenAIAnswerGenerator node. In this case, the most powerful GPT-3.5 model, “text-davinci-003”, was utilised. The temperature (0.5 in this use case) can be adjusted to control how varied the responses are for a given query: higher values produce more varied, less deterministic output. A max_tokens value of 100 gives the model ample room to generate longer responses.

# Import the OpenAIAnswerGenerator class from the haystack.nodes module
from haystack.nodes import OpenAIAnswerGenerator

# Initialize an OpenAIAnswerGenerator object with the specified parameters
generator = OpenAIAnswerGenerator(
    api_key=MY_API_KEY,        # OpenAI API key for authentication
    model="text-davinci-003",  # OpenAI text model
    temperature=.5,            # Randomness of responses: 0 to 1, with higher values giving more varied output
    max_tokens=100             # Maximum number of tokens in the generated responses
)

Finally, the generative pipeline was defined with the GenerativeQAPipeline function, containing the generator and retriever.

# Import the GenerativeQAPipeline class from the haystack.pipelines module
from haystack.pipelines import GenerativeQAPipeline

# Initialize a GenerativeQAPipeline object with the specified parameters
gpt_search_engine = GenerativeQAPipeline(
    generator=generator,  # Answer generator object
    retriever=retriever   # Retriever object
)

A simple market analysis can now be conducted. For brevity, three queries will be examined. Note that because a temperature of 0.5 introduces randomness into GPT’s output, the responses may differ from the ones below when the same queries are processed again.
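If fully reproducible answers are preferred, a temperature of 0 can be used instead; a minimal sketch, reusing the parameters defined above:

# Sketch: temperature=0 removes sampling randomness, so reruns of a query give (near-)identical answers
deterministic_generator = OpenAIAnswerGenerator(
    api_key=MY_API_KEY,
    model="text-davinci-003",
    temperature=0,   # greedy decoding, no sampling randomness
    max_tokens=100
)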

Query 1: What are some complaints about the cookies?

To print the answers generated by the model, the print_answers function was used. To keep the output simple, the details argument was set to “minimum”. The generative pipeline (gpt_search_engine) accepts the query along with a params dictionary specifying the number of relevant documents to retrieve and the number of likely answers to generate.

The response below was not quoted verbatim from the reviews but was generated from the documents retrieved for the query. It is consistent with the word cloud generated earlier.

The next question narrows the focus to the cookies’ dryness.

# Import the print_answers function from the haystack.utils module
from haystack.utils import print_answers

# Set the input query string
query = "What are some complaints about the cookies?"

# Define the search parameters as a dictionary with retriever and generator parameters
params = {
    "Retriever": {"top_k": 15},  # Retrieve the top 15 most relevant documents
    "Generator": {"top_k": 1}    # Generate the single most likely answer
}

# Use the GenerativeQAPipeline object to answer the query with the specified search parameters
answer = gpt_search_engine.run(query=query, params=params)

# Print the predicted answers with minimal details
print_answers(answer, details="minimum")

Query 2: What else are people saying about the cookie's dryness?

Interesting. This query has allowed greater insight into what customers believe is the reason for the dryness, including packaging and weather conditions.

Query 3: What are some customer recommendations to improve the product?

The system is also capable of generating recommendations based on customer feedback in the reviews.
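The code for these follow-up queries mirrors Query 1, with only the query string changing; a minimal sketch that reuses the pipeline and params defined above:

# Run the remaining market-research queries through the same pipeline
for query in [
    "What else are people saying about the cookie's dryness?",
    "What are some customer recommendations to improve the product?",
]:
    answer = gpt_search_engine.run(query=query, params=params)
    print_answers(answer, details="minimum")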

This section has briefly demonstrated the power of the generative response system in market research. The next section entails a more exhaustive demonstration of the system’s capacity.

2.2 Enquiry System

In this section, an enquiry system will be constructed based on policy papers released by the UK government on reaching net zero emissions. Four PDF documents will be utilized: the energy white paper (December 2020), the industrial decarbonisation strategy (March 2021), the UK hydrogen strategy (August 2021) and the heat and buildings strategy (October 2021). These documents run to 170, 170, 121 and 244 pages respectively.

The procedure for the enquiry system mostly remains the same as in the previous section; however, for this use case, multiple PDF files will be converted into Haystack Documents instead of a single text file. To accommodate this, the PDF dependency was installed and the PDF converter imported. Irrespective of the file format, the convert_files_to_docs function may be used to read multiple files in a specified directory and convert them all into Haystack Documents. This function will be used to convert the net zero PDFs into Documents.

# Install the 'farm-haystack' package with PDF support
!pip install 'farm-haystack[pdf]'

# Import the PDFToTextConverter class from the haystack.nodes module
from haystack.nodes import PDFToTextConverter

# Import the convert_files_to_docs function from the haystack.utils module
from haystack.utils import convert_files_to_docs

# Set the directory path where the documents to be converted are located
DOC_DIR = '/content/drive/MyDrive/net_zero'

# Convert the PDF documents in the directory to Haystack Documents
# The split_paragraphs parameter specifies whether to split the text into paragraphs
docs = convert_files_to_docs(dir_path=DOC_DIR, split_paragraphs=True)

The rest of the code follows the same sequence as in the previous section.

# Import the PreProcessor class from the haystack.nodes module
from haystack.nodes import PreProcessor

# Create an instance of the PreProcessor class with various options
preprocessor = PreProcessor(
    clean_empty_lines=True,
    clean_whitespace=True,
    clean_header_footer=True,
    split_by="word",
    split_length=100,
    split_overlap=3,
    split_respect_sentence_boundary=False,
)

# Use the process method of the preprocessor instance to process the docs
processed_docs = preprocessor.process(docs)

# Install farm-haystack along with its dependencies for FAISS
# !pip install 'farm-haystack[faiss]'

# Import the FAISSDocumentStore class from the haystack.document_stores module
from haystack.document_stores import FAISSDocumentStore

# Create an instance of the FAISSDocumentStore class with specified options
document_store = FAISSDocumentStore(
    faiss_index_factory_str="Flat",
    embedding_dim=1536
)

# Delete any existing documents in the database
document_store.delete_documents()

# Add the preprocessed documents
document_store.write_documents(processed_docs)

# Import the EmbeddingRetriever class from the haystack.nodes module
from haystack.nodes import EmbeddingRetriever

# Set the OpenAI API key
MY_API_KEY = "enter api key from open ai"

# Initialize an EmbeddingRetriever object with the specified parameters
retriever = EmbeddingRetriever(
    document_store=document_store,             # Document database
    embedding_model="text-embedding-ada-002",  # Pre-trained text embedding model
    batch_size=32,                             # Batch size for processing data
    api_key=MY_API_KEY,                        # OpenAI API key for authentication
    max_seq_len=1024                           # Maximum length of input sequences
)

# Update the embeddings of documents in the document store using the retriever object
document_store.update_embeddings(retriever)

# Import the OpenAIAnswerGenerator class from the haystack.nodes module
from haystack.nodes import OpenAIAnswerGenerator

# Initialize an OpenAIAnswerGenerator object with the specified parameters
generator = OpenAIAnswerGenerator(
    api_key=MY_API_KEY,        # OpenAI API key for authentication
    model="text-davinci-003",  # OpenAI text model
    temperature=.5,            # Controls the randomness of the generated responses
    max_tokens=100             # Maximum number of tokens in the generated responses
)

# Import the GenerativeQAPipeline class from the haystack.pipelines module
from haystack.pipelines import GenerativeQAPipeline

# Initialize a GenerativeQAPipeline object with the specified parameters
gpt_search_engine = GenerativeQAPipeline(
    generator=generator,  # Answer generator object
    retriever=retriever   # Retriever object
)

Queries based on the PDF documents can now be submitted.

Query 1: What does net zero even mean?

Within two seconds, all remaining steps in the pipeline were executed and a valid response was produced.

# Import the print_answers function from the haystack.utils module
from haystack.utils import print_answers

# Set the input query string
query = "What does net zero even mean?"

# Define the search parameters as a dictionary with retriever and generator parameters
params = {
    "Retriever": {"top_k": 15},  # Retrieve the top 15 most relevant documents
    "Generator": {"top_k": 1}    # Generate the single most likely answer
}

# Use the GenerativeQAPipeline object to answer the query with the specified search parameters
answer = gpt_search_engine.run(query=query, params=params)

# Print the predicted answers with minimal details
print_answers(answer, details="minimum")

Query 2: When does the UK intend to reach net zero?

Again, a valid response was provided quickly. The next question is a bit lengthier and more complex.

Query 3: I am saving to buy a new petrol car in 2038, is there any information about how feasible this may be in that year?

The system was able to realise that the sale of new petrol cars would be banned in 2030, making purchasing a new petrol car in 2038 infeasible. The subsequent queries will test how well the system handles hallucinations.

Query 4: The government will take away my diesel car in 2030. This has been my only car for 10 years and I will hate it if the government takes it away

In this query, I deliberately prefaced the question with a mistruth to try to induce a hallucination. However, the system understood the query, recognised its inconsistency with the information it held, and corrected me with relevant details in mere seconds.

Query 5: What information is there about the large fans that would cool the atmosphere and stop climate change?

Finally, I asked a question that was both based on an untruth and impossible to answer, because no context for it existed in the document store. The system accurately identified that there was no information relevant to the query.
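As with the first use case, each of these queries was run through the same pipeline; a minimal sketch reusing the params defined above:

# Run the remaining enquiry queries through the same pipeline
for query in [
    "When does the UK intend to reach net zero?",
    "I am saving to buy a new petrol car in 2038, is there any information about how feasible this may be in that year?",
    "The government will take away my diesel car in 2030. This has been my only car for 10 years and I will hate it if the government takes it away",
    "What information is there about the large fans that would cool the atmosphere and stop climate change?",
]:
    answer = gpt_search_engine.run(query=query, params=params)
    print_answers(answer, details="minimum")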

3. Conclusion

The intelligent response system discussed in this article may reduce hallucinations because it is directly connected to relevant contextual information. However, hallucinations are still possible, so it may be useful to fact-check when uncertain about a response. Notwithstanding, the response system demonstrated impressive performance. With Haystack and GPT-3.5, millions of documents can be processed quickly, and relevant responses can be provided with a lowered risk of hallucination. Similar systems can be built to address a variety of use cases and business needs. The code for this system was written in Google Colab and can be found on my GitHub. Thank you.

Reference: Build a Search Engine with GPT-3 (deepset.ai)
