GPT-3 vs Other Text Embeddings Techniques for Text Classification: A Performance Evaluation

Derrick Ofori
Feb 10, 2023


With recent advancements in NLP (Natural Language Processing), GPT-3 (Generative Pre-trained Transformer 3) from OpenAI has emerged as one of the most powerful language models on the market. On January 25, 2022, OpenAI unveiled an embeddings endpoint (Neelakantan et al., 2022). This endpoint uses neural network models to convert text and code into vector representations, embedding them in a high-dimensional space. Of particular interest to this article are the text similarity embedding models created with this endpoint. These models can capture the semantic similarity of text and have seemingly achieved state-of-the-art performance in certain use cases (Conneau et al., 2018).

This article seeks to evaluate the performance of one of these text similarity models, ‘text-embedding-ada-002’. This model was selected due to its affordability and simplicity of use. The embeddings it generates will be compared against those produced by three conventional text embedding techniques: GloVe (Pennington, Socher and Manning, 2014), Word2vec (Mikolov et al., 2013), and MPNet (Song et al., 2020). The embeddings will be used to train several machine learning models to classify food review scores from the Amazon fine-food reviews dataset, and each embedding technique will be evaluated by comparing the resulting classification accuracy.

Data Importation and Preparation

The dataset utilised in this article is a subset of 1,000 reviews from the Amazon fine-food reviews dataset. This subset already contains embeddings generated with the ‘text-embedding-ada-002’ model from GPT-3. The embeddings were generated from a combination of the review title (summary) and the review text. As seen in Figure 1, each review also has a ProductId, UserId, Score and the number of tokens generated from the combined text.

# Libraries
from sentence_transformers import SentenceTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import RobustScaler
from sklearn.pipeline import Pipeline
import gensim.downloader as api
from sklearn.svm import SVC
import pandas as pd
import numpy as np
import openai
import re

# import data
df1 = pd.read_csv('https://raw.githubusercontent.com/openai/openai-cookbook/main/examples/data/fine_food_reviews_with_embeddings_1k.csv',
index_col=0)

# view first three rows
df1.head(3)
Figure 1 — First three rows of the dataset

When importing data containing embeddings from a csv file, it is common to find undesirable characters like line breaks and white spaces. These characters may prevent the embeddings from being represented as arrays. To remedy this, a function will be created to remove unnecessary characters and convert the embeddings into a proper array format. The GPT-3 embedding column will also be renamed to ‘gpt_3’ to distinguish it from the other embeddings generated later in the article.

# clean openai embeddings
def clean_emb(text):

    # remove line break
    text = re.sub(r'\n', '', text)

    # remove square brackets
    text = re.sub(r'\[|\]', "", text)

    # remove leading and trailing white spaces
    text = text.strip()

    # convert string into array
    text = np.fromstring(text, dtype=float, sep=',')

    return text


# Rename column to gpt_3
df1.rename(columns={'embedding': 'gpt_3'}, inplace=True)

# Apply clean_emb function
df1['gpt_3'] = df1['gpt_3'].apply(lambda x: clean_emb(x))
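
As a quick sanity check (an optional step, not part of the original workflow), the first cleaned embedding can be inspected to confirm that it is now a numeric array of the expected length, which is 1536 for ‘text-embedding-ada-002’.

# Inspect the first cleaned embedding
first_emb = df1['gpt_3'][0]
print(type(first_emb))   # <class 'numpy.ndarray'>
print(first_emb.dtype)   # float64
print(len(first_emb))    # 1536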

GPT-3 Embeddings

The dataset being utilized contains pre-generated GPT-3-based embeddings. However, to generate new embeddings, an API key is required to access the model. This key can be obtained by signing up for the OpenAI API. A function can then be created specifying the model to be used (text-embedding-ada-002 in this case) and applied to any text input.

api_key = 'Enter api key here'

# set api key as default api key for openai
openai.api_key = api_key

def get_embedding(text, model="text-embedding-ada-002"):

    # replace new lines with spaces
    text = text.replace("\n", " ")

    # openai.Embedding.create to convert text into embedding array
    return openai.Embedding.create(input=[text], model=model)['data'][0]['embedding']
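
The function is not needed in this article since the embeddings already ship with the dataset, but the sketch below shows how it could be called on a single piece of text. The sample sentence is made up for illustration, and each call hits the OpenAI API and is billed per token.

# Example usage (requires a valid API key and incurs a small cost)
sample_embedding = get_embedding("The coffee tasted great and arrived quickly.")
print(len(sample_embedding))   # 1536 dimensions for text-embedding-ada-002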

GloVe Embeddings

GloVe (Global Vectors for Word Representation) is a text embedding technique that builds vector representations of words based on the statistics of their co-occurrence in a sizable corpus of text. The idea behind GloVe is that words that occur in similar contexts are semantically related, and statistics on their co-occurrence, collected in a co-occurrence matrix, can be used to infer the relationships between these words.

GloVe-based embeddings can be generated with the spaCy library. This library provides access to pipelines and models designed using the GloVe algorithm. For the purposes of this article, the ‘en_core_web_lg’ English pipeline will be considered. This pipeline performs a series of steps on a given text input, such as tokenization, tagging and lemmatization, to convert it into a suitable format. It ships with 514,000 unique word vectors, which may be expansive enough for the current use case.

# Run this in terminal first: python -m spacy download en_core_web_lg
# ! pip install spacy
import spacy

# load pipeline
nlp = spacy.load("en_core_web_lg")
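
As an optional check (not part of the original workflow), the size of the word-vector table bundled with the pipeline can be inspected before any documents are processed; the exact figures in the comment below are indicative.

# Inspect the static word-vector table of en_core_web_lg
print(nlp.vocab.vectors.shape)   # (number of vectors, vector dimension), roughly (514157, 300)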

Before applying the pipeline to the input, text cleaning may be required. As seen in Figure 2, some full stops occur consecutively within the first text input. This pattern is consistent within the data and must be corrected.

# first text input
df1.combined[0]
Figure 2 — Sample of text input

A function will be created to replace consecutive full stops with a single one as well as remove white spaces from the ends of the sentence.

def replace_multiple_fullstops(text):

    # replace 2 or more consecutive full stops with 1
    text = re.sub(r'\.{2,}', '.', text)

    # strip white spaces from ends of sentence
    text = text.strip()

    return text

# Apply function
df1['clean_text'] = df1['combined'].apply(lambda x: replace_multiple_fullstops(x))

The GloVe-based embeddings can be generated once the text has been cleaned.

# Generate embedding vectors in a variable called glove
df1['glove'] = df1['clean_text'].apply(lambda text: nlp(text).vector)

Word2vec Embeddings

The word2vec technique is based on a neural network model trained on large amounts of text to predict a target word from its surrounding context words. Word2vec represents each word in a vocabulary by a continuous vector that captures the meaning and the contexts in which that word is used. These vectors are learned through an unsupervised process in which the network repeatedly tries to predict a word from its contextual terms.

The Gensim library can be used to load a model trained on the word2vec technique. The ‘word2vec-google-news-300’ model in the Gensim library was trained on the Google News dataset of about 100 billion words and may be able to represent most, if not all, of the words within the dataset.

import gensim.downloader as api

# Load word2vec-google-news-300 model
wv = api.load("word2vec-google-news-300")
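
A quick, optional way to confirm that the loaded vectors capture semantic relationships is to query the model for the nearest neighbours of a word that appears often in food reviews; the query word and its output here are only illustrative.

# Nearest neighbours in the word2vec vector space (illustrative)
print(wv.most_similar('coffee', topn=3))

# Dimensionality of the word vectors
print(wv.vector_size)   # 300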

Since the Gensim library provides a model and not a pipeline, the spaCy library may be employed to tokenize, clean and lemmatize the text inputs before vector representations are generated with the word2vec model.


def wv_preprocess_and_vectorize(text):
    # Process the input text using a natural language processing library
    doc = nlp(text)

    # Initialize a list to store the filtered tokens
    filtered_tokens = []

    # Loop through each token in the doc
    for token in doc:
        # If the token is a stop word or punctuation, skip it
        if token.is_stop or token.is_punct:
            continue
        # Otherwise, add the lemma of the token to the filtered_tokens list
        filtered_tokens.append(token.lemma_)

    # If there are no filtered tokens, return np.nan
    if not filtered_tokens:
        return np.nan
    else:
        # Otherwise, return the mean vector representation of the filtered tokens
        return wv.get_mean_vector(filtered_tokens)

# Apply function
df1['word2vec'] = df1['clean_text'].apply(lambda text: wv_preprocess_and_vectorize(text))

MPNet Embeddings

MPNet (Masked and Permuted Pre-training for Language Understanding) is a technique for pre-training transformer-based language models for NLP. MPNet offers a variation of the BERT (Bidirectional Encoder Representations from Transformers) model (Devlin et al., 2019) and is designed to address some of the limitations of pre-training transformer models on large amounts of text data. The technique entails masking a portion of the input tokens during pre-training and training the model to predict the masked tokens given the context of the unmasked tokens. This process is known as masked language modelling and it is effective for capturing the meaning and context of words in a text corpus. In addition to masked language modelling, MPNet also employs a permutation mechanism that randomly permutes the order of the input tokens. This permutation helps the model learn the global context and relationships between the words in the input sequence.

To obtain MPNet-based embeddings, the sentence-transformer model ‘all-mpnet-base-v2’ from Hugging Face will be utilised. This model builds on the base MPNet model by fine-tuning it on a dataset of over 1 billion sentence pairs.

# Load all-mpnet-base-v2 model
model_sent = SentenceTransformer('all-mpnet-base-v2')

# Apply model
df1['mpnet'] = df1['clean_text'].apply(lambda text: model_sent.encode(text))
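
As an optional sanity check (not part of the original workflow), the MPNet sentence embeddings can be compared with cosine similarity to confirm that semantically related sentences score higher than unrelated ones. The sentences below are made up for illustration.

from sentence_transformers import util

# Illustrative sentences (hypothetical, not drawn from the dataset)
emb_a = model_sent.encode("The dog food arrived quickly and my labrador loves it.")
emb_b = model_sent.encode("Great quality pet food, my dog can't get enough of it.")
emb_c = model_sent.encode("The coffee was stale and tasted burnt.")

print(util.cos_sim(emb_a, emb_b))   # expected to be relatively high
print(util.cos_sim(emb_a, emb_c))   # expected to be lower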

Dimensionality Comparison

Figure 3 highlights the different dimensions of each kind of embedding. GPT-3 has the largest dimension at 1536, followed by MPNet, Word2vec and GloVe with 768, 300 and 300 dimensions respectively. It may be interesting to see how dimensionality relates to model performance in the machine learning section of this article.

# assign data of lists
data = {'Name': ['gpt_3', 'mpnet', 'word2vec', 'glove'],
        'Dimension': [len(df1.gpt_3[0]), len(df1.mpnet[0]),
                      len(df1.word2vec[0]), len(df1.glove[0])]}

# Create DataFrame
df_emb_len = pd.DataFrame(data)

# Set background style
df_emb_len.style.background_gradient()
Figure 3 — Dimension of embeddings

Machine Learning

To evaluate the performance of the text embeddings, four classifiers (random forest, support vector machine, logistic regression and decision tree) will be used to predict the Score variable. The dataset will be split into a 75:25 training-to-testing ratio to evaluate accuracy. Since each embedding is stored as a separate one-dimensional array within a pandas column, the numpy stack function will be used to combine them into a single two-dimensional array before training.

# Define a list of embedding methods to evaluate
embedding_var = ['gpt_3', 'mpnet', 'word2vec', 'glove']

# Define a list of classifier models to use
classifiers = [('rf', RandomForestClassifier(random_state=76)),
               ('svm', SVC(random_state=76)),
               ('lr', LogisticRegression(random_state=76, max_iter=400)),
               ('dt', DecisionTreeClassifier(random_state=76))]

# Define a dictionary to store accuracy results for each classifier
accuracy_lists = {
    'rf': [],
    'svm': [],
    'lr': [],
    'dt': []
}

# Loop through each embedding method
for emb in embedding_var:

    # Split the data into training and testing sets using the 'train_test_split' function
    X_train, X_test, y_train, y_test = train_test_split(
        df1[emb].values,
        df1.Score,
        test_size=0.25,
        random_state=76
    )

    # Stack the training and testing sets into 2D arrays
    X_train_stacked = np.stack(X_train)
    X_test_stacked = np.stack(X_test)

    # Loop through each classifier model
    for classifier_name, classifier in classifiers:

        # Create a pipeline that scales the data and fits the classifier
        pipe = Pipeline([('scaler', RobustScaler()), (classifier_name, classifier)])
        pipe.fit(X_train_stacked, y_train)

        # Use the pipeline to make predictions on the test data
        y_pred = pipe.predict(X_test_stacked)

        # Evaluate the accuracy of the predictions
        report = classification_report(y_test, y_pred, output_dict=True)
        acc = report['accuracy']

        # Store the accuracy results for each classifier
        accuracy_lists[classifier_name].append(acc)

Results and Conclusion

As seen in Figure 4, the machine learning models presented some interesting results. The GPT-3 embedding attained the highest accuracy across all machine learning models. The MPNet embedding delivered the next best performance with logistic regression and the support vector machine, but was surpassed by the word2vec embedding with the random forest algorithm and was the worst performer with the decision tree algorithm. A clear conclusion cannot be reached regarding the effect of dimensionality on model performance. However, it is evident from the results that the GPT-3 embeddings consistently outperformed all other embeddings, demonstrating their superiority in text classification.

# Add a new key 'embeddings' to the dictionary 'accuracy_lists' and assign the list 'embedding_var' to it
accuracy_lists['embeddings'] = embedding_var

# Create a list of tuples using the values from the dictionaries
df_zip = list(zip(accuracy_lists['embeddings'], accuracy_lists['lr'], accuracy_lists['svm'], accuracy_lists['rf'], accuracy_lists['dt']))

# Create a DataFrame 'df_accuracy' from the list 'df_zip' and specify the column names
df_accuracy = pd.DataFrame(df_zip, columns = ['Embedding','Logistic_Regression','Support_Vector_Machine', 'Random_Forest','Decision_Tree'])

# Add a background gradient to the DataFrame for visual representation
df_accuracy.style.background_gradient()
Figure 4 — Comparison of embedding accuracy

A link to the Jupyter notebook file can be found on my GitHub.
