pixano.datasets.dataset
Dataset(path, media_dir=None)
The Pixano Dataset.
It is a collection of tables that can be queried and manipulated with LanceDB.
The tables are defined by the DatasetSchema, which lets the dataset return data as LanceModel instances.
Attributes:
Name | Type | Description |
---|---|---|
path | Path | Path to the dataset. |
info | DatasetInfo | Dataset info. |
schema | DatasetSchema | Dataset schema. |
features_values | DatasetFeaturesValues | Dataset features values. |
stats | list[DatasetStatistic] | Dataset statistics. |
thumbnail | Path | Dataset thumbnail base 64 URL. |
media_dir | Path | Path to the media directory. |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
path | Path | Path to the dataset. | required |
media_dir | Path \| None | Path to the media directory. | None |
Source code in pixano/datasets/dataset.py
id
property
Return the dataset ID.
num_rows
property
add_constraint(table, field_name, values, restricted=True)
Add or replace a constraint.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table | TableName | Table name (as in DatasetItem schema) | required |
field_name | str | Name of the field to constrain. | required |
values | List[Union[int, float, str, bool]] | List of allowed values. | required |
restricted | bool | True if no other values are allowed. | True |
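To make the interaction between `values` and `restricted` concrete, here is a minimal in-memory sketch. It is not Pixano's implementation: `Constraint` and `is_allowed` are hypothetical names, and the behavior for `restricted=False` (values outside the list are still accepted) is an assumption inferred from the parameter description.

```python
from dataclasses import dataclass
from typing import Union

Value = Union[int, float, str, bool]


@dataclass
class Constraint:
    """Hypothetical stand-in for a field constraint (not a Pixano class)."""

    field_name: str
    values: list[Value]
    restricted: bool = True


def is_allowed(constraint: Constraint, value: Value) -> bool:
    # Listed values are always accepted.
    if value in constraint.values:
        return True
    # Assumption: with restricted=False the list is only a set of suggestions.
    return not constraint.restricted


strict = Constraint("category", ["car", "person"])
loose = Constraint("category", ["car", "person"], restricted=False)
```

Under these assumptions, `strict` rejects any value outside the list, while `loose` accepts it.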
add_data(table_name, data, ignore_integrity_checks=None, raise_or_warn='raise')
Add data to a table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. | required |
data | list[BaseSchema] | Data to add. | required |
ignore_integrity_checks | list[IntegrityCheck] \| None | List of integrity checks to ignore. | None |
raise_or_warn | Literal['raise', 'warn', 'none'] | Whether to raise or warn on integrity errors. Can be 'raise', 'warn' or 'none'. | 'raise' |
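The three `raise_or_warn` modes can be pictured with the following sketch. It only illustrates the pattern: `report_integrity_errors` is a hypothetical helper, and the use of `ValueError` and `warnings.warn` is an assumption, not a statement about what Pixano raises or emits.

```python
import warnings
from typing import Literal


def report_integrity_errors(
    errors: list[str],
    raise_or_warn: Literal["raise", "warn", "none"] = "raise",
) -> None:
    """Hypothetical helper: react to integrity errors according to the mode."""
    # Nothing to report, or reporting is disabled.
    if not errors or raise_or_warn == "none":
        return
    message = "; ".join(errors)
    if raise_or_warn == "raise":
        # Assumption: the real exception type may differ.
        raise ValueError(message)
    if raise_or_warn == "warn":
        warnings.warn(message)
        return
    raise ValueError(f"Unknown mode: {raise_or_warn!r}")
```

With `'warn'` or `'none'` the caller proceeds despite the errors, so those modes are better suited to bulk imports that are validated afterwards.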
add_dataset_items(dataset_items)
Add dataset items to the dataset.
Warn
Does not test for integrity of the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_items | list[DatasetItem] \| DatasetItem | Dataset items to add. | required |
compute_view_embeddings(table_name, data)
Compute the view embeddings via the Embedding Function stored in the table metadata.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name containing the view embeddings. | required |
data | list[dict] | Data to compute. Dictionary representing a view embedding without the vector field. | required |
create_table(name, schema, relation_item, data=None, mode='create', exist_ok=False, on_bad_vectors='error', fill_value=0.0)
Add a table to the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Table name. | required |
schema | type[BaseSchema] | Table schema. | required |
relation_item | SchemaRelation | Relation with the | required |
data | DATA \| None | Table data. | None |
mode | str | Table mode ('create', 'overwrite'). | 'create' |
exist_ok | bool | If True, do not raise an error if the table already exists. | False |
on_bad_vectors | str | Raise an error, drop or fill bad vectors ("error", "drop", "fill"). | 'error' |
fill_value | float | Value to fill bad vectors. | 0.0 |
Returns:
Type | Description |
---|---|
LanceTable | The table created. |
delete_data(table_name, ids)
Delete data from a table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. | required |
ids | list[str] | Ids to delete. | required |
Returns:
Type | Description |
---|---|
list[str] | The list of ids not found. |
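The return contract (the ids that could not be found, rather than the ids that were deleted) is sketched below on a plain dictionary. `delete_rows` and the in-memory `table` are hypothetical stand-ins used for illustration only; they do not touch LanceDB or Pixano.

```python
def delete_rows(table: dict[str, dict], ids: list[str]) -> list[str]:
    """Delete rows by id from an in-memory table and return the ids not found."""
    not_found = []
    for id_ in ids:
        if id_ in table:
            del table[id_]
        else:
            # Unknown ids are reported back instead of raising an error.
            not_found.append(id_)
    return not_found


table = {"a": {"label": "car"}, "b": {"label": "person"}}
missing = delete_rows(table, ["a", "z"])
```

An empty result therefore means every requested id was present and has been deleted.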
delete_dataset_items(ids)
Delete dataset items.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ids | list[str] | Ids to delete. | required |
Returns:
Type | Description |
---|---|
list[str] | The list of ids not found. |
find(id, directory, media_dir=None)
staticmethod
Find a Dataset in a directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
id | str | Dataset ID to find. | required |
directory | Path | Directory to search in. | required |
media_dir | Path \| None | Media directory. | None |
Returns:
Type | Description |
---|---|
'Dataset' | The found dataset. |
find_ids_in_table(table_name, ids)
Search ids in a table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. | required |
ids | set[str] | Ids to find. | required |
Returns:
Type | Description |
---|---|
dict[str, bool] | Dictionary of ids found. Keys are the ids and values are |
get_all_ids(table_name=SchemaGroup.ITEM.value, sortcol=None, order=None)
Get all the ids from a table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table to look for ids. | SchemaGroup.ITEM.value |
sortcol | str \| None | Column to order by. | None |
order | str \| None | Sort order (asc or desc). | None |
Returns:
Type | Description |
---|---|
list[str] | List of the ids. |
get_data(table_name, ids=None, limit=None, skip=0, where=None, item_ids=None, sortcol=None, order=None)
Read data from a table.
Data can be filtered by ids, item ids, where clause, or limit and skip.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. | required |
where | str \| None | Where clause. | None |
ids | list[str] \| str \| None | Ids to read. | None |
limit | int \| None | Number of items to read. If not set, defaults to the table size. | None |
skip | int | Number of items to skip. | 0 |
item_ids | list[str] \| None | Item ids to read. | None |
sortcol | str \| None | Column to order by. | None |
order | str \| None | Sort order (asc or desc). | None |
Returns:
Type | Description |
---|---|
list[BaseSchema] \| BaseSchema \| None | List of values. |
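The `limit` and `skip` parameters describe offset-based pagination. The sketch below shows that arithmetic on a plain list, including the documented fallback of `limit` to the table size. `paginate` and `rows` are hypothetical names, and the slicing is an illustration rather than the query Pixano actually issues.

```python
from typing import Optional


def paginate(rows: list[dict], limit: Optional[int] = None, skip: int = 0) -> list[dict]:
    """Return at most `limit` rows, starting after the first `skip` rows."""
    # If limit is not set, fall back to the table size.
    if limit is None:
        limit = len(rows)
    return rows[skip : skip + limit]


rows = [{"id": f"item_{i}"} for i in range(5)]
page = paginate(rows, limit=2, skip=2)  # third and fourth rows
```

Page `n` (zero-based) of size `k` corresponds to `skip=n * k, limit=k`.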
get_dataset_items(ids=None, limit=None, skip=0)
Read dataset items.
Filter dataset items by ids, or limit and skip.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ids | list[str] \| str \| None | Item ids to read. | None |
limit | int \| None | Number of items to read. | None |
skip | int | Number of items to skip. | 0 |
Returns:
Type | Description |
---|---|
list[DatasetItem] \| DatasetItem \| None | List of dataset items. |
list(directory)
staticmethod
List information about the datasets in a directory.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
directory | Path | Directory to search in. | required |
Returns:
Type | Description |
---|---|
list[DatasetInfo] | List of dataset infos. |
open_table(name)
Open a dataset table with LanceDB.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | Name of the table to open. | required |
Returns:
Type | Description |
---|---|
LanceTable | Dataset table. |
open_tables(names=None, exclude_embeddings=True)
Open the dataset tables with LanceDB.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
names | list[str] \| None | Table names to open. If None, open all tables. | None |
exclude_embeddings | bool | Whether to exclude embedding tables from the list. | True |
Returns:
Type | Description |
---|---|
dict[str, LanceTable] | Dataset tables. |
resolve_ref(ref)
Resolve a SchemaRef.
It fetches the data from the table referenced.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
ref | SchemaRef \| ItemRef \| ViewRef \| EmbeddingRef \| EntityRef \| AnnotationRef \| SourceRef | Reference to resolve. | required |
Returns:
Type | Description |
---|---|
BaseSchema \| Item \| View \| Embedding \| Entity \| Annotation \| Source | The resolved reference. |
semantic_search(query, table_name, limit, skip=0)
Perform a semantic search.
It searches for the closest items to the query in the table embeddings.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
query | str | Text query for semantic search. | required |
table_name | str | Table name for embeddings. | required |
limit | int | Maximum number of items to return. | required |
skip | int | Number of items to skip. | 0 |
|
Returns:
Type | Description |
---|---|
tuple[list[BaseSchema], list[float], list[str]] | Tuple of items, distances, and full sorted list of ids. |
update_data(table_name, data, return_separately=False, ignore_integrity_checks=None, raise_or_warn='raise')
Update data in a table.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
table_name | str | Table name. | required |
data | list[BaseSchema] | Data to update. | required |
return_separately | bool | Whether to return separately added and updated data. | False |
ignore_integrity_checks | list[IntegrityCheck] \| None | List of integrity checks to ignore. | None |
raise_or_warn | Literal['raise', 'warn', 'none'] | Whether to raise or warn on integrity errors. Can be 'raise', 'warn' or 'none'. | 'raise' |
Returns:
Type | Description |
---|---|
list[BaseSchema] \| tuple[list[BaseSchema], list[BaseSchema]] | If |
list[BaseSchema] \| tuple[list[BaseSchema], list[BaseSchema]] | data. |
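Since `return_separately` distinguishes added from updated data, the method behaves like an upsert: rows with an existing id are updated and the others are added. The sketch below models this on a dictionary. `upsert` is a hypothetical function, and both the tuple order `(updated, added)` and the combined return value when `return_separately=False` are assumptions made for the example, not confirmed by the text above.

```python
def upsert(table: dict[str, dict], rows: list[dict], return_separately: bool = False):
    """Insert or update rows keyed by their "id" field in an in-memory table."""
    updated, added = [], []
    for row in rows:
        # A row whose id already exists counts as an update, otherwise as an addition.
        (updated if row["id"] in table else added).append(row)
        table[row["id"]] = row
    if return_separately:
        # Assumed order: updated rows first, then added rows.
        return updated, added
    return updated + added


table = {"a": {"id": "a", "label": "car"}}
rows = [{"id": "a", "label": "truck"}, {"id": "b", "label": "person"}]
updated, added = upsert(table, rows, return_separately=True)
```

Here `updated` holds the row for `"a"` and `added` holds the row for `"b"`.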
update_dataset_items(dataset_items, return_separately=False)
Update dataset items.
Warn
Does not test for integrity of the data.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_items | list[DatasetItem] | Dataset items to update. | required |
return_separately | bool | Whether to return separately added and updated dataset items. | False |
Returns:
Type | Description |
---|---|
list[DatasetItem] \| tuple[list[DatasetItem], list[DatasetItem]] | If |
list[DatasetItem] \| tuple[list[DatasetItem], list[DatasetItem]] | the updated dataset items. |