MyScale is a fully managed SQL vector database built on ClickHouse, offering advanced vector search capabilities. In version 1.5.0 of MyScale, we introduced an upgraded full-text search feature powered by Tantivy.
Using the BM25 algorithm to calculate relevance scores for search results significantly improves full-text search in MyScale. BM25 ranks search results by their relevance to the original query, taking into account term frequency, inverse document frequency, and document length. It assigns a score to each document so that the most relevant results are displayed first.
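To make the three factors concrete, here is a minimal sketch of the textbook Okapi BM25 formula in plain Python. It is an illustration only: the function name, the toy documents, and the parameter values k1=1.2 and b=0.75 are assumptions chosen for the example, and the exact formula and defaults used inside MyScale and Tantivy may differ.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each tokenized document in `docs` against `query_terms`.

    k1 and b are common textbook defaults, assumed here for illustration.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)  # term frequency within this document
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Longer-than-average documents are penalized through b.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus, made up for this example.
docs = [
    "drug strategies is a non-profit research institute in washington".split(),
    "the institute of physics is a scientific society in london".split(),
    "a list of rivers in washington state".split(),
]
scores = bm25_scores(["non-profit", "institute", "washington"], docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

The first document matches all three query terms, so it receives the highest score. The other two each match a single term, and the shorter of them scores slightly higher because of length normalization.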
However, full-text search in MyScale and ClickHouse currently faces two main challenges:
- Low Performance: Searching and ranking large tables is slow, especially as tables increase to millions of rows.
- Lack of Functionality: ClickHouse lacks support for fuzzy search, relevance tuning, and BM25 relevance scoring commonly found in modern search engines.
Relevance tuning is another essential feature in text search. It allows you to fine-tune the search algorithm to prioritize specific aspects of the search, such as giving higher weight to matches in the title versus the body of a document, significantly improving the accuracy and usefulness of search results.
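As a rough illustration of the title-versus-body idea, the sketch below combines per-field relevance scores with field weights. The function name, the weights, and the scores are all hypothetical values chosen for the example; this is not MyScale's API.

```python
def weighted_score(field_scores, weights):
    """Combine per-field relevance scores into one score.

    Fields without an explicit weight default to 1.0.
    """
    return sum(weights.get(field, 1.0) * score
               for field, score in field_scores.items())

# Hypothetical weights: a title match counts twice as much as a body match.
weights = {"title": 2.0, "body": 1.0}

doc_a = {"title": 3.0, "body": 1.0}  # strong match in the title
doc_b = {"title": 0.0, "body": 4.5}  # match only in the body

tuned = (weighted_score(doc_a, weights), weighted_score(doc_b, weights))
untuned = (weighted_score(doc_a, {}), weighted_score(doc_b, {}))
```

Without weights, doc_b ranks first (4.5 versus 4.0). With the title boosted, doc_a moves ahead (7.0 versus 4.5).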
Fuzzy search is also valuable, especially for handling typos or misspellings in search queries. With it, the search engine can find documents similar to the search query, even if they aren’t an exact match. It significantly enhances the user experience and ensures that relevant results are not missed due to small spelling mistakes.
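Fuzzy matching is commonly defined in terms of edit distance, the number of single-character insertions, deletions, or substitutions needed to turn one string into another. The sketch below shows the idea with a plain Levenshtein distance over a small made-up vocabulary. The helper names and the distance threshold are assumptions for illustration, and a production engine would use a far more efficient index-based method.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]

def fuzzy_match(query, vocabulary, max_distance=1):
    """Return indexed terms within max_distance edits of the query term."""
    return [t for t in vocabulary if edit_distance(query, t) <= max_distance]

# Made-up vocabulary of indexed terms.
vocabulary = ["washington", "wellington", "institute", "institution"]
matches = fuzzy_match("washingtin", vocabulary)
```

Here the misspelled query "washingtin" is one substitution away from "washington", so that term is still found, while "wellington" is too far away to match.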
MyScale's full-text search index aims to bridge this gap between MyScale and specialized engines like Elasticsearch, eliminating the need for additional services.
Key features of MyScale's full-text search index include:
- Fully native to MyScale with no external dependencies;
- Built on Tantivy, a fast and resource-efficient alternative to Apache Lucene;
- Query times over 5M rows are 300x faster than ClickHouse's built-in inverted index;
- Supports fuzzy and wildcard searches along with rich tokenizers;
- Utilizes BM25 for relevance scoring similar to Elasticsearch; and
- Real-time searching without manual reindexing.
For example:
First, we create a table to store Wikipedia data to test the full-text search functionality.
CREATE TABLE default.en_wiki_abstract
(
    `id` UInt64,
    `body` String,
    `title` String,
    `url` String
)
ENGINE = MergeTree
ORDER BY id;
When creating a full-text search (FTS) index on the body column, note that the tokenizer can be configured in the index's arguments. Here, we have selected a tokenizer with English stemming and stop words.
ALTER TABLE default.en_wiki_abstract
ADD INDEX body_idx (body)
TYPE fts('{"body":{"tokenizer":{"type":"stem", "stop_word_filters":["english"]}}}');
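To give a feel for what stemming and stop-word filtering do to text before it is indexed, here is a deliberately simplified Python sketch. The stop-word list and the suffix-stripping rules are toy assumptions made up for this example. They are not the stemming algorithm or the stop-word list that the index above actually uses.

```python
# Tiny illustrative stop-word list (assumption, not the real English list).
STOP_WORDS = {"a", "an", "the", "is", "in", "of", "and", "to"}
# Crude suffix rules standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Lowercase, split on whitespace, drop stop words, then stem."""
    tokens = [t.strip(".,;:!?()\"'") for t in text.lower().split()]
    return [toy_stem(t) for t in tokens if t and t not in STOP_WORDS]

indexed = analyze("The institutes located in Washington")
queried = analyze("institute locating")
```

Both the indexed text and the query reduce to the same stems for "institute" and "locat", which is why a search for one word form can match documents that contain another.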
Next, we load the data from S3:
INSERT INTO default.en_wiki_abstract
SELECT * FROM s3('https://myscale-datasets.s3.ap-southeast-1.amazonaws.com/wiki_abstract_5m.parquet','Parquet');
We can now search the body column using the TextSearch() function, which returns a BM25 score.
SELECT
    id,
    title,
    body,
    TextSearch(body, 'non-profit institute in Washington') AS score
FROM default.en_wiki_abstract
ORDER BY score DESC
LIMIT 5;
Output:
id | title | body | score |
---|---|---|---|
3400768 | Drug Strategies | Drug Strategies is a non-profit research institute located in Washington D.C. | 24.457561 |
872513 | Earth Policy Institute | Earth Policy Institute was an independent non-profit environmental organization based in Washington, D.C. | 22.730673 |
895248 | Arab American Institute | Founded in 1985, the Arab American Institute is a non-profit membership organization based in Washington D.C. | 21.955559 |
1950599 | Environmental Law Institute | The Environmental Law Institute (ELI) is a non-profit, non-partisan organization, headquartered in Washington, D.C. | 21.231567 |
2351478 | Public Knowledge | Public Knowledge is a non-profit Washington, D.C. | 20.742344 |
Moreover, MyScale's full-text search can be combined with vector search to run hybrid searches in RAG pipelines. Users typically run separate vector and full-text queries, then merge the two result lists by applying a fusion algorithm from a Python library such as ranx.
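As a sketch of what such a fusion step does, the plain-Python function below implements Reciprocal Rank Fusion (RRF), one widely used fusion method. It is written by hand for illustration rather than using ranx, and the document ids, the two result lists, and the constant k=60 are assumptions made up for the example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one list, best first.

    Each document earns 1 / (k + rank) from every list it appears in.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score, breaking ties by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results of two separate queries over the same table.
vector_hits = ["a", "b", "c"]  # ordered by vector distance
text_hits = ["b", "d", "a"]    # ordered by BM25 score

fused = reciprocal_rank_fusion([vector_hits, text_hits])
```

Documents that rank well in both lists ("b" and "a") rise to the top of the fused list, ahead of documents that appear in only one. Because RRF relies on ranks only, it avoids having to compare BM25 scores directly against vector distances.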
For detailed guides on these topics, please visit:
Lastly, the full-text search index with Tantivy will soon be available in our open-source project MyScaleDB, so stay tuned!