pixano.datasets.dataset
Dataset(path, media_dir=None)
The Pixano Dataset.
It is a collection of tables that can be queried and manipulated with LanceDB.
The tables are defined by the DatasetSchema, which allows the dataset to return data as LanceModel instances.
Attributes:
Name | Type | Description |
---|---|---|
path | Path | Path to the dataset. |
info | DatasetInfo | Dataset info. |
schema | DatasetSchema | Dataset schema. |
features_values | DatasetFeaturesValues | Dataset features values. |
stats | list[DatasetStatistic] | Dataset statistics. |
thumbnail | Path | Dataset thumbnail base 64 URL. |
media_dir | Path | Path to the media directory. |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path | Path to the dataset. | required |
media_dir | Path \| None | Path to the media directory. | None |
Source code in pixano/datasets/dataset.py
id
property
Return the dataset ID.
num_rows
property
add_data(table_name, data, ignore_integrity_checks=None, raise_or_warn='raise')
Add data to a table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. | required |
data | list[BaseSchema] | Data to add. | required |
ignore_integrity_checks | list[IntegrityCheck] \| None | List of integrity checks to ignore. | None |
raise_or_warn | Literal['raise', 'warn', 'none'] | Whether to raise or warn on integrity errors. Can be 'raise', 'warn' or 'none'. | 'raise' |
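The three values of `raise_or_warn` can be read as a reporting policy. The sketch below is a self-contained illustration of such a policy, not Pixano's implementation: the helper `report_integrity_errors`, the use of `ValueError`, and the plain-string error messages are all assumptions made for the example.

```python
import warnings
from typing import Literal


def report_integrity_errors(
    errors: list[str],
    raise_or_warn: Literal["raise", "warn", "none"] = "raise",
) -> None:
    """Hypothetical helper: react to integrity errors according to a policy."""
    if not errors or raise_or_warn == "none":
        return  # nothing to report, or reporting is disabled
    message = "; ".join(errors)
    if raise_or_warn == "raise":
        # Assumption: the real library may use a different exception type.
        raise ValueError(message)
    if raise_or_warn == "warn":
        warnings.warn(message, stacklevel=2)
        return
    raise ValueError(f"unknown policy: {raise_or_warn!r}")


# 'warn' reports the problem but lets the caller continue.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    report_integrity_errors(["duplicate id 'a'"], raise_or_warn="warn")
```

With `'none'` the same call returns silently, and with `'raise'` it stops at the first reported batch of errors.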
add_dataset_items(dataset_items)
Add dataset items to the dataset.
Warn
Does not test for integrity of the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_items | list[DatasetItem] \| DatasetItem | Dataset items to add. | required |
compute_view_embeddings(table_name, data)
Compute the view embeddings via the Embedding Function stored in the table metadata.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name containing the view embeddings. | required |
data | list[dict] | Data to compute. Dictionary representing a view embedding without the vector field. | required |
create_table(name, schema, relation_item, data=None, mode='create', exist_ok=False, on_bad_vectors='error', fill_value=0.0)
Add a table to the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Table name. | required |
schema | type[BaseSchema] | Table schema. | required |
relation_item | SchemaRelation | Relation with the | required |
data | DATA \| None | Table data. | None |
mode | str | Table mode ('create' or 'overwrite'). | 'create' |
exist_ok | bool | If True, do not raise an error if the table already exists. | False |
on_bad_vectors | str | Raise an error, drop or fill bad vectors ("error", "drop", "fill"). | 'error' |
fill_value | float | Value to fill bad vectors. | 0.0 |
Returns:
Type | Description |
---|---|
LanceTable | The table created. |
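The `on_bad_vectors` and `fill_value` options interact: `fill_value` only matters when the policy is `"fill"`. The following pure-Python sketch illustrates the three policies on a list of rows. It is a stand-in, not the LanceDB or Pixano code path, and its definition of a "bad" vector (missing, wrong length, or containing NaN) is an assumption made for the example, as is the helper name `handle_bad_vectors`.

```python
import math


def handle_bad_vectors(rows, dim, on_bad_vectors="error", fill_value=0.0):
    """Hypothetical stand-in for the "error" / "drop" / "fill" policies."""

    def is_bad(vec):
        # Assumption: a vector is bad if missing, mis-sized, or containing NaN.
        return vec is None or len(vec) != dim or any(math.isnan(x) for x in vec)

    cleaned = []
    for row in rows:
        if not is_bad(row["vector"]):
            cleaned.append(row)
        elif on_bad_vectors == "error":
            raise ValueError(f"bad vector in row {row['id']!r}")
        elif on_bad_vectors == "drop":
            continue  # skip the whole row
        elif on_bad_vectors == "fill":
            # Replace the bad vector with a constant vector of the right size.
            cleaned.append({**row, "vector": [fill_value] * dim})
        else:
            raise ValueError(f"unknown policy: {on_bad_vectors!r}")
    return cleaned


rows = [
    {"id": "a", "vector": [1.0, 2.0]},
    {"id": "b", "vector": [float("nan"), 0.0]},
]
dropped = handle_bad_vectors(rows, dim=2, on_bad_vectors="drop")
filled = handle_bad_vectors(rows, dim=2, on_bad_vectors="fill", fill_value=0.0)
```

Here `dropped` keeps only row `"a"`, while `filled` keeps both rows and replaces the vector of row `"b"` with `[0.0, 0.0]`.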
delete_data(table_name, ids)
Delete data from a table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. | required |
ids | list[str] | Ids to delete. | required |
Returns:
Type | Description |
---|---|
list[str] | The list of ids not found. |
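Note that the return value lists the ids that could not be found, not the ids that were deleted. A minimal sketch of that contract, using a plain dictionary as a hypothetical stand-in for a table (the helper `delete_ids` is not part of Pixano):

```python
def delete_ids(table: dict[str, dict], ids: list[str]) -> list[str]:
    """Hypothetical stand-in: delete rows by id and report the missing ids."""
    not_found = []
    for id_ in ids:
        if id_ in table:
            del table[id_]
        else:
            not_found.append(id_)
    return not_found


table = {"a": {"id": "a"}, "b": {"id": "b"}}
missing = delete_ids(table, ["a", "z"])
```

After this call, `missing` contains only `"z"` and the table still holds row `"b"`. An empty return list therefore means every requested id was deleted.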
delete_dataset_items(ids)
Delete dataset items.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ids | list[str] | Ids to delete. | required |
Returns:
Type | Description |
---|---|
list[str] | The list of ids not found. |
find(id, directory, media_dir=None)
staticmethod
Find a Dataset in a directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | str | Dataset ID to find. | required |
directory | Path | Directory to search in. | required |
media_dir | Path \| None | Media directory. | None |
Returns:
Type | Description |
---|---|
'Dataset' | The found dataset. |
find_ids_in_table(table_name, ids)
Search for ids in a table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. | required |
ids | set[str] | Ids to find. | required |
Returns:
Type | Description |
---|---|
dict[str, bool] | Dictionary of ids found. Keys are the ids and values are |
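The shape of the `dict[str, bool]` result can be illustrated with a small sketch. The description of the values is cut off in this page, so the meaning used here (`True` when the id exists in the table, `False` otherwise) is an assumption, and `find_ids` is a hypothetical stand-in operating on a plain set of ids.

```python
def find_ids(table_ids: set[str], ids: set[str]) -> dict[str, bool]:
    """Hypothetical stand-in: map each requested id to whether it exists.

    Assumption: True means the id was found in the table.
    """
    return {id_: id_ in table_ids for id_ in ids}


found = find_ids({"a", "b", "c"}, {"a", "z"})
unknown = [id_ for id_, present in found.items() if not present]
```

Filtering the dictionary, as in the last line, is one way to collect the ids that still need to be created.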
get_all_ids(table_name=SchemaGroup.ITEM.value)
get_data(table_name, ids=None, limit=None, skip=0, where=None, item_ids=None)
Read data from a table.
Data can be filtered by ids, item ids, where clause, or limit and skip.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. | required |
where | str \| None | Where clause. | None |
ids | list[str] \| str \| None | Ids to read. | None |
limit | int \| None | Number of items to read. | None |
skip | int | Number of items to skip. | 0 |
item_ids | list[str] \| None | Item ids to read. | None |
Returns:
Type | Description |
---|---|
list[BaseSchema] \| BaseSchema \| None | List of values. |
get_dataset_items(ids=None, limit=None, skip=0)
Read dataset items.
Filter dataset items by ids, or limit and skip.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ids | list[str] \| str \| None | Item ids to read. | None |
limit | int \| None | Number of items to read. | None |
skip | int | Number of items to skip. | 0 |
Returns:
Type | Description |
---|---|
list[DatasetItem] \| DatasetItem \| None | List of dataset items. |
list(directory)
staticmethod
List information about the datasets in a directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
directory | Path | Directory to search in. | required |
Returns:
Type | Description |
---|---|
list[DatasetInfo] | List of dataset infos. |
open_table(name)
Open a dataset table with LanceDB.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the table to open. | required |
Returns:
Type | Description |
---|---|
LanceTable | Dataset table. |
open_tables(names=None, exclude_embeddings=True)
Open the dataset tables with LanceDB.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
names | list[str] \| None | Table names to open. If None, open all tables. | None |
exclude_embeddings | bool | Whether to exclude embedding tables from the list. | True |
Returns:
Type | Description |
---|---|
dict[str, LanceTable] | Dataset tables. |
resolve_ref(ref)
Resolve a SchemaRef.
It fetches the data from the table referenced.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ref | SchemaRef \| ItemRef \| ViewRef \| EmbeddingRef \| EntityRef \| AnnotationRef \| SourceRef | Reference to resolve. | required |
Returns:
Type | Description |
---|---|
BaseSchema \| Item \| View \| Embedding \| Entity \| Annotation \| Source | The resolved reference. |
semantic_search(query, table_name, limit, skip=0)
Perform a semantic search.
It searches for the closest items to the query in the table embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | Text query for semantic search. | required |
table_name | str | Table name for embeddings. | required |
limit | int | Maximum number of items to return. | required |
skip | int | Number of items to skip. | 0 |
Returns:
Type | Description |
---|---|
tuple[list[BaseSchema], list[float]] | Tuple of items and distances. |
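The returned pair (items, distances) can be pictured with a toy nearest-neighbour search. This sketch is only an illustration of the result shape: the real method takes a text query and relies on the table's embeddings, whereas here the query is already a vector, the embeddings live in a dictionary, and cosine distance is an assumed metric. The names `cosine_distance` and `nearest` are hypothetical.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """Cosine distance between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def nearest(query_vec, embeddings, limit, skip=0):
    """Hypothetical stand-in: rank keys by distance, then apply skip/limit."""
    ranked = sorted(
        (cosine_distance(query_vec, vec), key) for key, vec in embeddings.items()
    )
    page = ranked[skip : skip + limit]
    items = [key for _, key in page]
    distances = [dist for dist, _ in page]
    return items, distances


embeddings = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "car": [0.0, 1.0]}
items, distances = nearest([1.0, 0.0], embeddings, limit=2)
```

The two lists are aligned by position, so `distances[i]` belongs to `items[i]`, and smaller distances mean closer matches.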
update_data(table_name, data, return_separately=False, ignore_integrity_checks=None, raise_or_warn='raise')
Update data in a table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. | required |
data | list[BaseSchema] | Data to update. | required |
return_separately | bool | Whether to return separately added and updated data. | False |
ignore_integrity_checks | list[IntegrityCheck] \| None | List of integrity checks to ignore. | None |
raise_or_warn | Literal['raise', 'warn', 'none'] | Whether to raise or warn on integrity errors. Can be 'raise', 'warn' or 'none'. | 'raise' |
Returns:
Type | Description |
---|---|
list[BaseSchema] \| tuple[list[BaseSchema], list[BaseSchema]] | If |
list[BaseSchema] \| tuple[list[BaseSchema], list[BaseSchema]] | data. |
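The `return_separately` flag suggests upsert behaviour: rows whose id already exists are updated, and the others are added. The sketch below shows one way such a flag can work on a dictionary-backed table. The return description is cut off in this page, so both the tuple order (updated first, added second) and the combined list returned when the flag is False are assumptions, and `upsert` is a hypothetical helper.

```python
def upsert(table: dict[str, dict], rows: list[dict], return_separately: bool = False):
    """Hypothetical stand-in: update existing rows and add the new ones."""
    updated, added = [], []
    for row in rows:
        # Classify before writing, so the check sees the table's prior state.
        (updated if row["id"] in table else added).append(row)
        table[row["id"]] = row
    if return_separately:
        return updated, added  # assumed order: (updated, added)
    return updated + added


table = {"a": {"id": "a", "value": 1}}
rows = [{"id": "a", "value": 2}, {"id": "b", "value": 3}]
updated, added = upsert(table, rows, return_separately=True)
```

Returning the two groups separately lets a caller tell which ids were new without a second lookup.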
update_dataset_items(dataset_items, return_separately=False)
Update dataset items.
Warn
Does not test for integrity of the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_items | list[DatasetItem] | Dataset items to update. | required |
return_separately | bool | Whether to return separately added and updated dataset items. | False |
Returns:
Type | Description |
---|---|
list[DatasetItem] \| tuple[list[DatasetItem], list[DatasetItem]] | If |
list[DatasetItem] \| tuple[list[DatasetItem], list[DatasetItem]] | the updated dataset items. |