Large language models (LLMs) have turned many amazing ideas into reality: bots for almost anything, knowledge experts, research assistants, and many others. Most of these apps combine an LLM with a specific domain of knowledge, which is where vector databases come in. For example, when you have a question about a domain, the best practice is to retrieve relevant domain knowledge from the database and dynamically construct prompts from it.
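As a rough illustration of "retrieve, then construct the prompt", here is a minimal Python sketch. The names `build_prompt`, `answer_with_retrieval`, and `fake_retrieve` are hypothetical and exist only for this example; `fake_retrieve` stands in for a real vector database query and is not part of any product API.

```python
from typing import Callable, List


def build_prompt(question: str, passages: List[str]) -> str:
    """Assemble a prompt that grounds the question in retrieved passages."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


def answer_with_retrieval(
    question: str,
    retrieve: Callable[[str, int], List[str]],
    top_k: int = 3,
) -> str:
    """Fetch the top_k passages for the question and build the prompt."""
    passages = retrieve(question, top_k)
    return build_prompt(question, passages)


def fake_retrieve(question: str, top_k: int) -> List[str]:
    """Hypothetical stand-in for a vector database lookup."""
    corpus = [
        "H3 is a hexagonal geospatial indexing system.",
        "ClickHouse is an open-source OLAP database.",
        "A vector index speeds up nearest-neighbor search.",
    ]
    return corpus[:top_k]


prompt = answer_with_retrieval("What is H3?", fake_retrieve, top_k=2)
print(prompt)
```

The resulting string is what you would send to the LLM; in a real app, the retrieval step would be a vector search against your database.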
Selecting the right vector database for your app can dramatically impact its efficiency and effectiveness. Many vector database products are available today, and they generally fall into two categories: specialized vector databases and integrated vector databases. While specialized vector databases like Pinecone have gained popularity due to their ease of use, they often fall short when it comes to scaling or supporting various data types. That's why we need integrated vector databases.
# What Is an Integrated Vector Database
An integrated vector database is a type of database that combines vector search capabilities with a traditional structured database. Unlike specialized vector databases, which are designed solely around vector indexes, integrated vector databases store both vectors and structured data in the same database and combine vector search algorithms with structured query processing. This integration offers several advantages, including efficient communication, flexible metadata filtering, joint SQL and vector queries, and access to the mature tools and integrations commonly associated with general-purpose databases.
MyScale is a cloud-based database optimized for AI applications and solutions, built on the open-source OLAP database ClickHouse. It has managed to boost vector search performance in an integrated vector database. It has all the benefits that other integrated vector databases can give you and offers some extra perks, such as good performance from its proprietary vector index algorithm, MSTG.
In this article, we'll spotlight MyScale, one of the top-tier integrated vector databases, and discuss how integrated vector databases can enhance your LLM apps.
# Communication Matters
Communication really matters to your app's performance. DBaaS (database as a service) and SaaS (software as a service) are being widely adopted due to their lower cost and improved scalability, and communication overhead matters when you work with these services.
A specialized vector index may not be able to hold all the data you have, which means you may need to store some of it somewhere else. In this setup, a single query takes two sequential requests (one to the vector index and one to the other data store) and four data transmissions. With an integrated database solution, you need only one request and two transmissions.
Fewer transmissions mean lower latency, and latency directly affects the user experience. If communication latency is a serious concern for you, or you want to run massive queries against the database, consider MyScale as one of the options among integrated solutions.
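To make the arithmetic concrete, here is a small back-of-the-envelope model in Python. It assumes each request costs two transmissions (one to send the query, one to receive the result) and that the total server-side work is the same in both setups. The 20 ms one-way delay and 10 ms of server work are made-up illustrative numbers, not measurements.

```python
def query_latency_ms(requests: int, one_way_ms: float, total_server_ms: float) -> float:
    """Estimate end-to-end latency for `requests` sequential round trips.

    Each request is modeled as two transmissions (send + receive); the
    server-side work is assumed to be identical regardless of how many
    requests it is split across.
    """
    transmissions = 2 * requests
    return transmissions * one_way_ms + total_server_ms


# Assumed numbers for illustration only.
ONE_WAY_MS = 20.0
SERVER_MS = 10.0

# Specialized vector index + separate data store: 2 requests, 4 transmissions.
specialized = query_latency_ms(2, ONE_WAY_MS, SERVER_MS)
# Integrated vector database: 1 request, 2 transmissions.
integrated = query_latency_ms(1, ONE_WAY_MS, SERVER_MS)

print(specialized)  # 90.0
print(integrated)   # 50.0
```

Under these assumptions the network share of the latency is halved, and the gap widens as the one-way delay grows relative to the server-side work.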
# Filter on Anything without Constraints
LLM apps are augmented by tools, and the vector database is the most important one. When you have a large number of vectors to search, representing articles, web pages, or prompts, narrowing down the results with keywords is usually a better choice. This is where metadata-filtered search comes in.
Filtered search is quite common in LLM apps. You can use it to prune irrelevant data and improve accuracy. Most vector index services provide metadata filters for this kind of pruning, but some implementations limit the data you can filter on, either in size or in the filter functions available. For example, Pinecone's implementation has a 40 KB metadata limit, which constrains what metadata filters can do. That is a huge barrier if you want to match a regex pattern against a very large paragraph or filter out data that is geographically remote from the queried location.
Database solutions such as MyScale can perform metadata filtering on data of almost any size and type. You can use almost anything as a metadata filter: geolocations (such as H3 and S2 indexes), regular expression matching, thresholds on math expressions, and even SQL subqueries.
If your prompt needs to calculate geographical distances, here is an example of how you can filter on them:

```sql
WHERE h3Distance(<data column in h3 index type>, <h3 index>) > 10
```
Suppose you are searching a group of articles that match some keywords. You can use string pattern matching to constrain your vector search:

```sql
WHERE column_1 LIKE '%value%'
```
You can also use regular expressions to narrow down your search:

```sql
WHERE match(column_1, '(?i)(value\s)')
```
You can also filter with a math expression. For example, you can compute a prediction for a few-shot learner and use a threshold to filter the results:

```sql
WHERE 1/(1+exp(column_1)) > 0.9
```
And you can also perform metadata filtering with a SQL subquery:

```sql
WHERE column_1 IN (SELECT ... FROM another_table WHERE ...)
```
Furthermore, the data you filter on is stored as ordinary columns, so there are no extra constraints on size or data type. You can also JOIN columns from other tables, which allows you to design complex query pipelines with good performance.
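To show what metadata-filtered search does conceptually, here is a minimal brute-force sketch in Python: apply the filter first, then rank the surviving rows by distance to the query vector. It mirrors the regular expression and math threshold filters above, but it is only an in-memory illustration, not how MyScale or any other database implements filtered search. The rows, column names, and the `filtered_search` helper are all hypothetical.

```python
import math
import re
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def filtered_search(
    rows: List[Row],
    query: Sequence[float],
    predicate: Callable[[Row], bool],
    top_k: int = 2,
) -> List[Row]:
    """Keep rows that pass the metadata filter, then return the top_k nearest."""
    candidates = [row for row in rows if predicate(row)]
    candidates.sort(key=lambda row: l2_distance(row["vector"], query))
    return candidates[:top_k]


# Hypothetical toy data: text and score are metadata, vector is the embedding.
rows = [
    {"id": 1, "text": "Value investing basics", "score": -3.0, "vector": [0.0, 0.1]},
    {"id": 2, "text": "Cooking with herbs", "score": -4.0, "vector": [0.0, 0.0]},
    {"id": 3, "text": "The value of sleep", "score": 1.0, "vector": [0.1, 0.1]},
    {"id": 4, "text": "Market value explained", "score": -2.5, "vector": [0.9, 0.9]},
]


def predicate(row: Row) -> bool:
    """Combine a case-insensitive regex match with a math threshold."""
    matches_text = re.search(r"(?i)value\s", row["text"]) is not None
    passes_threshold = 1 / (1 + math.exp(row["score"])) > 0.9
    return matches_text and passes_threshold


result_ids = [row["id"] for row in filtered_search(rows, [0.0, 0.0], predicate)]
print(result_ids)  # [1, 4]
```

Row 2 is the closest vector to the query but never reaches the ranking step because it fails the text filter, and row 3 is dropped by the threshold. That is the pruning effect filtered search is meant to provide.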
# Multiple Vector Indices, Single Instance
Some LLM apps have multiple columns of vectors. If your app needs to search on more than one of them, for example, searching articles before searching among their paragraphs, or deciding on prompts before retrieving relevant information, you may need multiple vector indices.
Most specialized vector databases only support one vector index per instance, which means you need a new instance for every vector column you have. That becomes a problem when your app has to work with several vector database instances: inconsistent latency and computation across instances can hurt performance and create maintenance issues in the long run.
Integrated vector databases, particularly MyScale, treat vector indices as a type of data index, which allows you to create a vector index for every table. If you have multiple apps that all need a vector database, you can create tables and vector indices for each of them and squash them into one instance. You then need only one running instance for an LLM app that uses multiple vector indices.
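The "articles first, then paragraphs" pattern can be sketched in Python with a toy in-memory store. `ToyInstance` is a hypothetical stand-in for a single database instance that holds several tables, each with its own vector column. It uses brute-force search and does not reflect MyScale's API or index implementation.

```python
from typing import Any, Callable, Dict, List, Sequence

Row = Dict[str, Any]


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance (enough for ranking)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


class ToyInstance:
    """Hypothetical single instance holding multiple searchable tables."""

    def __init__(self) -> None:
        self.tables: Dict[str, List[Row]] = {}

    def insert(self, table: str, row: Row) -> None:
        self.tables.setdefault(table, []).append(row)

    def search(
        self,
        table: str,
        query: Sequence[float],
        top_k: int,
        where: Callable[[Row], bool] = lambda row: True,
    ) -> List[Row]:
        """Brute-force nearest-neighbor search over one table with a filter."""
        rows = [row for row in self.tables[table] if where(row)]
        return sorted(rows, key=lambda row: squared_l2(row["vector"], query))[:top_k]


db = ToyInstance()
db.insert("articles", {"id": 1, "vector": [1.0, 0.0]})
db.insert("articles", {"id": 2, "vector": [0.0, 1.0]})
db.insert("paragraphs", {"id": 10, "article_id": 1, "vector": [0.9, 0.1]})
db.insert("paragraphs", {"id": 11, "article_id": 1, "vector": [0.5, 0.5]})
db.insert("paragraphs", {"id": 20, "article_id": 2, "vector": [1.0, 0.0]})

query = [1.0, 0.0]

# Stage 1: find the most relevant article.
top_articles = db.search("articles", query, top_k=1)
article_ids = {row["id"] for row in top_articles}

# Stage 2: search only the paragraphs that belong to those articles.
top_paragraphs = db.search(
    "paragraphs", query, top_k=1, where=lambda row: row["article_id"] in article_ids
)
paragraph_ids = [row["id"] for row in top_paragraphs]
print(paragraph_ids)  # [10]
```

Both stages run against the same instance, so the second search can reuse the first stage's result directly as a filter instead of shuttling IDs between separate services.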
# Conclusion
Despite the steeper learning curve due to SQL interfaces, integrated vector databases have undeniable advantages. By offering flexible metadata filtering, multiple indices support, and improved communication efficiency, they can take your LLM apps to new heights. At MyScale, we firmly believe that integrated vector databases are the future. MyScale, a high-performance integrated vector database solution, backed by an advanced vector index algorithm, provides both high data density and cost efficiency.