
Empowering Scalability in LLM App Development with a High-Performance Vector Database

# Introduction

Powerful language models like GPT-3.5-turbo and GPT-4 have revolutionized app development, spurring a surge of domain-specific applications. Standout apps like PandaGPT and GiftWarp exemplify the capabilities of these models, and what sets them apart is exceptional data handling: PandaGPT, for example, seamlessly retrieves information from hundreds of PDF documents, which positions it for success in the competitive app market.

To ensure longevity, entrepreneurs must prioritize scaling up data processing. As apps grow in popularity, efficient data handling becomes crucial. Robust infrastructure and scalable systems are essential to manage increased data loads. By addressing bottlenecks and planning for smooth expansion, entrepreneurs position their apps for growth and user satisfaction.

Embracing a data-driven approach offers unprecedented opportunities. With language models like GPT-3.5-turbo and GPT-4, developers can unlock groundbreaking innovations and exceptional user experiences, taking app development to new heights. The future lies in data-driven solutions and in leveraging advanced language models for transformative experiences.

# Planning for Scalability from the Start

With the help of OpenAI's API, we can effortlessly build a customer service chatbot with GPT and a small amount of product data. By using GPT to analyze prompts, we can efficiently search for items in a given list and achieve impressive results. Here's an example:

import os
import openai
from dotenv import load_dotenv, find_dotenv
_ = load_dotenv(find_dotenv()) # read local .env file

openai.api_key = os.environ['OPENAI_API_KEY']

def get_completion_from_messages(messages, model="gpt-3.5-turbo", temperature=0):
    # Helper used below: wraps the Chat Completions call and returns the reply text.
    response = openai.ChatCompletion.create(
        model=model,
        messages=messages,
        temperature=temperature,
    )
    return response.choices[0].message["content"]

delimiter = "####"
system_message = f"""
Follow these steps to answer the customer queries.
The customer query will be delimited with four hashtags,
i.e. {delimiter}. 

Step 1:{delimiter} First decide whether the user is 
asking a question about a specific product or products. 
Product category doesn't count. 

Step 2:{delimiter} If the user is asking about 
specific products, identify whether 
the products are in the following list.
All available products: 
1. Product: TechPro Ultrabook
   Category: Computers and Laptops
   Brand: TechPro
   Model Number: TP-UB100
   Warranty: 1 year
   Rating: 4.5
   Features: 13.3-inch display, 8GB RAM, 256GB SSD, Intel Core i5 processor
   Description: A sleek and lightweight ultrabook for everyday use.
   Price: $799.99

2. Product: BlueWave Gaming Laptop
   Category: Computers and Laptops
   Brand: BlueWave
   Model Number: BW-GL200
   Warranty: 2 years
   Rating: 4.7
   Features: 15.6-inch display, 16GB RAM, 512GB SSD, NVIDIA GeForce RTX 3060
   Description: A high-performance gaming laptop for an immersive experience.
   Price: $1199.99

3. Product: PowerLite Convertible
   Category: Computers and Laptops
   Brand: PowerLite
   Model Number: PL-CV300
   Warranty: 1 year
   Rating: 4.3
   Features: 14-inch touchscreen, 8GB RAM, 256GB SSD, 360-degree hinge
   Description: A versatile convertible laptop with a responsive touchscreen.
   Price: $699.99
 ......
 
Step 3:{delimiter} If the message contains products 
in the list above, list any assumptions that the 
user is making in their 
message e.g. that Laptop X is bigger than 
Laptop Y, or that Laptop Z has a 2 year warranty.

Step 4:{delimiter} If the user made any assumptions, 
figure out whether the assumption is true based on your 
product information. 

Step 5:{delimiter} First, politely correct the 
customer's incorrect assumptions if applicable. 
Only mention or reference products in the list of 
5 available products, as these are the only 5 
products that the store sells. 
Answer the customer in a friendly tone.

Use the following format:
Step 1:{delimiter} <step 1 reasoning>
Step 2:{delimiter} <step 2 reasoning>
Step 3:{delimiter} <step 3 reasoning>
Step 4:{delimiter} <step 4 reasoning>
Response to user:{delimiter} <response to customer>

Make sure to include {delimiter} to separate every step.
"""

user_message = f"""
how much is the BlueWave Chromebook more expensive 
than the TechPro Desktop"""

messages = [
    {'role': 'system', 'content': system_message},
    {'role': 'user', 'content': f"{delimiter}{user_message}{delimiter}"},
]

response = get_completion_from_messages(messages)
print(response)

If a startup aims to develop a customer service chatbot that incorporates multimodality and a much larger dataset, a vector database becomes necessary. Implementing one at this stage is crucial for effective data storage and retrieval.

To achieve this goal, we can leverage a vector database specifically designed to handle high-dimensional data, including multimodal information. With a vector database, we can store and index vectors representing different aspects of customer service data, such as text, images, or even audio.

For instance, when a customer submits a query to the chatbot, the system can use natural language processing techniques to convert the text into a vector representation. This vector representation can then be used to search the vector database for relevant responses. Additionally, if the chatbot is capable of handling image or audio inputs, those inputs can also be converted into vector representations and stored in the database.

The vector database efficiently indexes the vectors, enabling fast retrieval of relevant information. By utilizing advanced search algorithms like nearest neighbor search, the chatbot can identify the most appropriate responses based on similarity metrics between the user's query and the stored vectors in the database.

As the dataset expands, the vector database ensures scalability and efficient storage of multimodal data. It simplifies the process of updating and adding to the dataset, allowing the chatbot to continuously improve its performance and provide accurate and relevant responses to customer queries.
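
To make this concrete, here is a minimal sketch of that flow. The embed function, the chatbot_vectors table, and its columns are hypothetical placeholders; the concrete MyScale version appears in the case study below:

# Hypothetical flow: embed the user's query, then run a nearest-neighbor
# search over the stored vectors; `client` is a vector database connection.
query_vector = embed("How do I reset my password?")  # text -> list of floats

results = client.query(f"""
SELECT id, answer, distance(vector, {query_vector}) AS dist
FROM chatbot_vectors
ORDER BY dist LIMIT 5
""")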

# Balancing Scalability with Cost Efficiency

Startups also need to consider cost efficiency while achieving scalability. Structuring and preprocessing data to extract relevant features and attributes can help reduce storage and processing requirements, minimizing costs. Leveraging existing tools and frameworks that offer multimodal capabilities can also save valuable time and resources. These resources often come with optimized data structures and algorithms, eliminating the need to build everything from scratch.

When it comes to choosing a database, startups should consider MyScale, a cost-efficient vector database that provides high performance at a lower cost than alternative options. By structuring and preprocessing data, leveraging existing tools and frameworks, and choosing cost-effective solutions like MyScale, startups can strike a balance between scalability and cost efficiency. These approaches optimize performance while making the most of available resources, enabling startups to grow and succeed in a cost-effective manner.

# Case Studies and Best Practices

Here, we provide a brief introduction to using MyScale to quickly scale up a multimodal customer service chatbot. For this purpose, we use a simplified dataset derived from Taobao Live.

# Installing Prerequisites

  • transformers: runs the CLIP model
  • tqdm: progress bars for humans
  • clickhouse-connect: the MyScale database client
python3 -m pip install transformers tqdm clickhouse-connect streamlit pandas lmdb torch

# Getting Into the Data

First, let's look at the structure of the dataset. We have split the data into two tables. The first table stores the product images and consists of three columns: a unique product ID, the URL of the product image, and the product's label.

| id | product_url | label |
| --- | --- | --- |
| 102946 | url_to_store_the_image | Men's Long Sleeve Shirt |

The second table has the same structure, except that it stores the product's text description instead of an image URL:

| id | product_text | label |
| --- | --- | --- |
| 102946 | POOF C(1's)I MOCK NECK POCKET TEE | Men's Long Sleeve Shirt |

# Creating a MyScale Database Table

# Working with the Database

You need a connection to a database backend to create a table in MyScale. You can check out the detailed guide for the Python client on this page.
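
As a minimal sketch (the host and credentials below are placeholders for your own cluster's endpoint), establishing the connection with clickhouse-connect looks roughly like this:

import clickhouse_connect

# Connect to the MyScale backend; endpoint and credentials are placeholders.
client = clickhouse_connect.get_client(
    host='your-cluster.myscale.com',
    port=8443,
    username='your-username',
    password='your-password',
)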

If you are familiar with SQL (Structured Query Language), working with MyScale will feel natural: it combines structured queries with vector search, so creating a vector table is almost the same as creating a conventional one. Here is how we create the two vector tables in SQL; note that each table also stores the product URL or text alongside the vector, since the queries below return those columns:

CREATE TABLE IF NOT EXISTS TaoBaoData_image(
        id String,
        vector Array(Float32),
        product_url String,
        CONSTRAINT vec_len CHECK length(vector) = 512
        ) ENGINE = MergeTree ORDER BY id;

CREATE TABLE IF NOT EXISTS TaoBaoData_text(
        id String,
        vector Array(Float32),
        product_text String,
        CONSTRAINT vec_len CHECK length(vector) = 512
        ) ENGINE = MergeTree ORDER BY id;

# Extracting Features and Filling the Database

CLIP is a popular model that maps data of different forms (we adopt the academic term "modalities") into a unified embedding space, enabling high-performance cross-modal retrieval. The model can encode both images and text. Here is an example:

import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load the CLIP model and its processor (via transformers, as installed above)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device)
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Load and preprocess the image
image_path = "path_to_your_image.jpg"
image = Image.open(image_path).convert("RGB")
image_inputs = processor(images=image, return_tensors="pt").to(device)

# Encode the image
with torch.no_grad():
    image_features = model.get_image_features(**image_inputs)

# Encode the text
text = "Your text here"
text_inputs = processor(text=[text], return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    text_features = model.get_text_features(**text_inputs)

# Print the image and text feature shapes
print("Image features shape:", image_features.shape)
print("Text features shape:", text_features.shape)

# Upload Data to MyScale

Once the data has been processed into embeddings, we proceed to upload the data to MyScale.
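
Here, data_image and data_text are assumed to be pandas DataFrames whose columns match the tables created above; a hypothetical construction could look like this, where ids, urls, texts, and the vector lists come from the extraction step:

import pandas as pd

# Hypothetical assembly of the upload frames; each vector is a plain
# Python list of 512 floats produced by the CLIP encoders above.
data_image = pd.DataFrame({'id': ids, 'vector': image_vectors, 'product_url': urls})
data_text = pd.DataFrame({'id': ids, 'vector': text_vectors, 'product_text': texts})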

# upload data from datasets
client.insert("TaoBaoData_image", 
              data_image.to_records(index=False).tolist(), 
              column_names=data_image.columns.tolist())
client.insert("TaoBaoData_text", 
              data_text.to_records(index=False).tolist(), 
              column_names=data_text.columns.tolist())

# create vector index with cosine
client.command("""
ALTER TABLE TaoBaoData_image
ADD VECTOR INDEX image_feature_index vector
TYPE MSTG
('metric_type=Cosine')
""")

client.command("""
ALTER TABLE TaoBaoData_text
ADD VECTOR INDEX text_feature_index vector
TYPE MSTG
('metric_type=Cosine')
""")

# Search with MyScale

When a user inputs a question, we convert their question into a vector and perform a retrieval from the database. This retrieval process helps us obtain the top K product images and their corresponding product descriptions. We then pass these product descriptions to the GPT model, which further refines the recommendations and provides more detailed product introductions. Additionally, in the final conversation result, we also include the display of product images to the user.

Encode the question with the same CLIP text encoder:

question = 'Do you have any black dress for women?'
text_inputs = processor(text=[question], return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    emb_query = model.get_text_features(**text_inputs)[0].tolist()

Search the TaoBaoData_text table and return the top 2 products' information:

top_k = 2
results = client.query(f"""
SELECT id, product_text, distance(vector, {emb_query}) as dist
FROM TaoBaoData_text
ORDER BY dist LIMIT {top_k}
""")

summaries = {'id': [], 'product_text': []}
for res in results.named_results():
    summaries['product_text'].append(res["product_text"])
    summaries['id'].append(res["id"])

The summaries dictionary now looks like this:

{'id': ['065906', '104588'],
 'product_text': ['Plus Size Womens Autumn New Arrival Elegant Temperament 2019 Concealing Belly Fashionable Mid-length Lace Dress.',
                  '2019 Summer New Arrival High-end Asymmetrical Shoulder Strap Chic Slimming Daily V-neck Dress for Women, Trendy.']}

After that, we can feed this list back to GPT-4 through OpenAI's API, as mentioned at the beginning. Here is an example system message:

system_message = f"""
    Based on the user's question, we have retrieved two items with the
    following information. Provide recommendations for these two items based on
    the product text.
    {summaries}
    If the user requests to see the style, please return the corresponding 
    product IDs.
"""

Once we have the product IDs, we can search the TaoBaoData_image table to get the corresponding images:

results = client.query(f"""
SELECT id, product_url
FROM TaoBaoData_image
WHERE id IN {tuple(summaries['id'])}
""")

| id | product_url |
| --- | --- |
| 065906 | Image: 2.jpg |
| 104588 | Image: 4.jpg |

Now we can return this result to the user to assist them in making further choices and interactions.

A similar pipeline can also be used for image retrieval: for example, if a user wants to find clothing similar to what is shown in a photo, we can use image embeddings for the retrieval.
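
As a sketch of that variant (reusing the CLIP model and the database client from above, with a placeholder file name), the image-side query could look like this:

# Encode a reference image and retrieve visually similar products.
reference = Image.open("reference_look.jpg").convert("RGB")
inputs = processor(images=reference, return_tensors="pt").to(device)
with torch.no_grad():
    emb_image = model.get_image_features(**inputs)[0].tolist()

results = client.query(f"""
SELECT id, product_url, distance(vector, {emb_image}) AS dist
FROM TaoBaoData_image
ORDER BY dist LIMIT 5
""")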

# Conclusion

MyScale efficiently handles multimodal data, providing startups with a cost-effective solution. By integrating different modalities and optimizing resource usage, it enhances customer service capabilities without significant cost. This allows startups to allocate resources efficiently and focus on the critical aspects of their business. Scalability and cost efficiency are vital for startup success, ensuring sustainable growth and maximizing ROI. MyScale's strengths in multimodal data processing let startups scale up while remaining cost-effective, manage resources wisely, and thrive in a competitive market.