Semantic search
Context
The Pixano app supports semantic search, which allows a user to search for a view based on text content.
To do so, Pixano requires the view embeddings to be pre-computed with the model used for semantic search, using the embedding functions from LanceDB.
In this tutorial, we will go through the process of semantic search using the OpenCLIP model.
Pre-compute the embeddings
First, we need to pre-compute the embeddings using LanceDB and Pixano. We will use the Health Images dataset defined in the Build and query a dataset tutorial.
- Install OpenCLIP
For this tutorial, we will use the OpenCLIP embedding function, so OpenCLIP needs to be installed.
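OpenCLIP is distributed on PyPI as the open_clip_torch package; one way to install it (exact package requirements may vary with your LanceDB version):
pip install open_clip_torch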
- Create the Image View Embedding table:
from pathlib import Path

from pixano.datasets import Dataset
from pixano.datasets.dataset_schema import SchemaRelation
from pixano.features import ViewEmbedding

lancedb_embedding_fn = "open-clip"
table_name = "emb_open-clip"

dataset = Dataset(  # Load the dataset
    Path("./pixano_library/health_dataset"), media_dir=Path("./assets/")
)

embedding_schema = ViewEmbedding.create_schema(  # Create the ViewEmbedding schema
    embedding_fn=lancedb_embedding_fn,  # For LanceDB
    table_name=table_name,
    dataset=dataset,
)

dataset.create_table(  # Create the table
    name=table_name,
    schema=embedding_schema,
    relation_item=SchemaRelation.ONE_TO_ONE,  # Only one view per item
    mode="create",
)
>>> LanceTable(connection=LanceDBConnection(.../pixano/docs/tutorials/pixano_library/health_dataset/db), name="emb_open-clip")
The pixano.features.ViewEmbedding's create_schema method creates a schema that contains a LanceDB embedding function compatible with Pixano.
- Compute the embeddings
To compute the embeddings, Pixano needs access to the references of the views. Based on this information, it can then apply the LanceDB embedding function to the views.
import shortuuid

views = dataset.get_data(table_name="image")  # Get all views from the dataset's table "image".

data = []  # List of dictionaries of ViewEmbedding's model dump without the vector field.
for view in views:
    data.append(
        {
            "id": shortuuid.uuid(),
            "item_ref": {
                "id": view.item_ref.id,
                "name": view.item_ref.name,
            },
            "view_ref": {  # Reference to the table "image"
                "id": view.id,
                "name": "image",
            },
        }
    )

dataset.compute_view_embeddings(table_name=table_name, data=data)
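As a quick sanity check, you can read the new table back the same way as any other table (a minimal sketch, assuming get_data works on the embedding table just as it does on "image"):
embeddings = dataset.get_data(table_name=table_name)  # Read back the embedding table
print(len(embeddings), "embeddings computed")  # Should match the number of views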
Use the semantic search
With the app
Now that you are all set to use semantic search, follow the using the app guide!
The semantic search bar will appear on the dataset page.
With the API
Python API
To perform semantic search, simply call the dataset's semantic_search
method with the text query. It returns the items whose views are semantically closest to the query, along with the distances.
query = "microscope"
dataset.semantic_search(query, table_name, limit=5, skip=0)
>>>(
[
Item(id='bFPEPGYaSmJGPakiaPctfF', ..., split='all', image_type='microscope'),
Item(id='VkBcn4bhgt2RRJNWWjvDN5', ..., split='all', image_type='microscope'),
Item(id='5XFPk5qwFL5xWhLHzGXLeV', ..., split='all', image_type='microscope'),
Item(id='A7BXLwmXWRAypbqxd5KqzK', ..., split='all', image_type='peau'),
Item(id='iyA4tHmGeHPP4N6diSuUXi', ..., split='all', image_type='microcope')
],
[1.5589371919631958, 1.5611850023269653, 1.5707159042358398, 1.5987659692764282, 1.605470895767212]
)
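Since semantic_search returns a tuple of matching items and their distances, you can unpack and iterate over it directly (a minimal usage sketch based on the call above):
items, distances = dataset.semantic_search(query, table_name, limit=5, skip=0)
for item, distance in zip(items, distances):
    print(item.id, item.image_type, distance)  # e.g. bFPEPGYaSmJGPakiaPctfF microscope 1.5589...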
REST API
You first need to launch the Pixano app to interact with the REST API.
Then you can curl the REST API to navigate through your items:
curl -X 'GET' \
'http://localhost:8000/browser/health_images?limit=5&skip=0&query=microscope&embedding_table=emb_open-clip' \
-H 'accept: application/json'
>>> {
"id": "health_images",
"name": "Health Images",
"table_data": {
"columns": [
{
"name": "image",
"type": "image"
},
{
"name": "id",
"type": "str"
},
{
"name": "created_at",
"type": "datetime"
},
{
"name": "updated_at",
"type": "datetime"
},
{
"name": "split",
"type": "str"
},
{
"name": "image_type",
"type": "str"
},
{
"name": "distance",
"type": "float"
}
],
"rows": [
{
"image": "",
"id": "bFPEPGYaSmJGPakiaPctfF",
"created_at": "2024-11-07T10:14:05.396010",
"updated_at": "2024-11-07T10:14:05.396010",
"split": "all",
"image_type": "microscope",
"distance": 1.5589371919631958
},
{
"image": "",
"id": "VkBcn4bhgt2RRJNWWjvDN5",
"created_at": "2024-11-07T10:14:05.367159",
"updated_at": "2024-11-07T10:14:05.367159",
"split": "all",
"image_type": "microscope",
"distance": 1.5611850023269653
},
{
"image": "",
"id": "5XFPk5qwFL5xWhLHzGXLeV",
"created_at": "2024-11-07T10:14:05.402816",
"updated_at": "2024-11-07T10:14:05.402816",
"split": "all",
"image_type": "microscope",
"distance": 1.5707159042358398
},
{
"image": "",
"id": "A7BXLwmXWRAypbqxd5KqzK",
"created_at": "2024-11-07T10:14:05.409434",
"updated_at": "2024-11-07T10:14:05.409434",
"split": "all",
"image_type": "peau",
"distance": 1.5987659692764282
},
{
"image": "",
"id": "iyA4tHmGeHPP4N6diSuUXi",
"created_at": "2024-11-07T10:14:05.363831",
"updated_at": "2024-11-07T10:14:05.363831",
"split": "all",
"image_type": "microcope",
"distance": 1.605470895767212
}
]
},
"pagination": {
"current_page": 0,
"page_size": 5,
"total_size": 11
},
"semantic_search": [
"emb_open-clip"
]
}
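Equivalently, the same endpoint can be queried from Python, for example with the requests library (a sketch assuming the app is running on localhost:8000 as above):
import requests

response = requests.get(
    "http://localhost:8000/browser/health_images",  # Same endpoint as the curl call above
    params={
        "limit": 5,
        "skip": 0,
        "query": "microscope",
        "embedding_table": "emb_open-clip",
    },
    headers={"accept": "application/json"},
)
print(response.json()["pagination"])  # e.g. {'current_page': 0, 'page_size': 5, 'total_size': 11}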