BM25 refers to Okapi Best Matching 25. See doi:10.1561/1500000019 for more information.

Usage

ragnar_retrieve_bm25(store, text, top_k = 3L)

Arguments

store

A RagnarStore object or a dplyr::tbl() derived from it. When you pass a tbl, you can use the usual dplyr verbs (e.g. filter(), slice()) to restrict the rows examined before similarity scoring. Avoid dropping essential columns such as text, embedding, origin, and hash.

text

A string to find the nearest match to.

top_k

Integer, the maximum number of document chunks to retrieve.

Value

A data frame of retrieved chunks. Each row corresponds to an individual chunk in the store. It always contains a column named text that holds the chunk contents.

Details

The supported methods are:

  • cosine_distance: Measures the dissimilarity between two vectors based on the cosine of the angle between them. Defined as \(1 - \cos(\theta)\), where \(\cos(\theta)\) is the cosine similarity.

  • cosine_similarity: Measures the similarity between two vectors based on the cosine of the angle between them. Ranges from -1 (opposite) to 1 (identical), with 0 indicating orthogonality.

  • euclidean_distance: Computes the straight-line (L2) distance between two points in a multidimensional space. Defined as \(\sqrt{\sum(x_i - y_i)^2}\).

  • dot_product: Computes the sum of the element-wise products of two vectors.

  • negative_dot_product: The negation of the dot product.
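
The definitions above can be checked by hand on a pair of small vectors. The following is a minimal base R sketch of the formulas only; it is not how the store computes these quantities, and the vectors x and y are made-up values chosen for illustration.

```r
# Two small example vectors (illustrative values, not taken from any store)
x <- c(1, 2, 3)
y <- c(4, 5, 6)

# dot_product: sum of the element-wise products
dot_product <- sum(x * y)

# negative_dot_product: the negation of the dot product
negative_dot_product <- -dot_product

# cosine_similarity: dot product divided by the product of the L2 norms
cosine_similarity <- dot_product / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

# cosine_distance: 1 - cos(theta)
cosine_distance <- 1 - cosine_similarity

# euclidean_distance: straight-line (L2) distance between the two points
euclidean_distance <- sqrt(sum((x - y)^2))

print(c(
  dot_product = dot_product,
  negative_dot_product = negative_dot_product,
  cosine_similarity = cosine_similarity,
  cosine_distance = cosine_distance,
  euclidean_distance = euclidean_distance
))
```

For distance-style measures (cosine_distance, euclidean_distance, negative_dot_product), smaller values indicate a closer match. For similarity-style measures (cosine_similarity, dot_product), larger values do.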

Pre-filtering with dplyr

The store behaves like a lazy table backed by DuckDB, so row‑wise filtering is executed directly in the database. This lets you narrow the search space efficiently without pulling data into R.

Examples

if (FALSE) { # (rlang::is_installed("dbplyr") && nzchar(Sys.getenv("OPENAI_API_KEY")))
# Basic usage
store <- ragnar_store_create(
  embed = \(x) ragnar::embed_openai(x, model = "text-embedding-3-small")
)
ragnar_store_insert(store, data.frame(text = c("foo", "bar")))
ragnar_store_build_index(store)
ragnar_retrieve(store, "foo")

# More Advanced: store metadata, retrieve with pre-filtering
store <- ragnar_store_create(
  embed = \(x) ragnar::embed_openai(x, model = "text-embedding-3-small"),
  extra_cols = data.frame(category = character())
)

ragnar_store_insert(
  store,
  data.frame(
    category = "dessert",
    text = c("ice cream", "cake", "cookies")
  )
)

ragnar_store_insert(
  store,
  data.frame(
    category = "meal",
    text = c("steak", "potatoes", "salad")
  )
)

ragnar_store_build_index(store)

# simple retrieve
ragnar_retrieve(store, "carbs")

# retrieve with pre-filtering
dplyr::tbl(store) |>
  dplyr::filter(category == "meal") |>
  ragnar_retrieve("carbs")
}