pixano.datasets.dataset
Dataset(path, media_dir=None)
The Pixano Dataset.
It is a collection of tables that can be queried and manipulated with LanceDB.
The tables are defined by the DatasetSchema which allows the dataset to return the data in the form of LanceModel instances.
Attributes:
| Name | Type | Description |
|---|---|---|
| `path` | `Path` | Path to the dataset. |
| `info` | `DatasetInfo` | Dataset info. |
| `schema` | `DatasetSchema` | Dataset schema. |
| `features_values` | `DatasetFeaturesValues` | Dataset features values. |
| `stats` | `list[DatasetStatistic]` | Dataset statistics. |
| `thumbnail` | `Path` | Dataset thumbnail base 64 URL. |
| `media_dir` | `Path` | Path to the media directory. |

Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `path` | `Path` | Path to the dataset. | required |
| `media_dir` | `Path \| None` | Path to the media directory. | `None` |
Source code in pixano/datasets/dataset.py
id (property)
Return the dataset ID.
num_rows (property)
add_constraint(table, field_name, values, restricted=True)
Add or replace a constraint.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table` | `TableName` | Table name (as in DatasetItem schema). | required |
| `field_name` | `str` | Name of the field to constrain. | required |
| `values` | `List[Union[int, float, str, bool]]` | List of allowed values. | required |
| `restricted` | `bool` | True if no other values are allowed. | `True` |
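As a reading aid for the `values` and `restricted` parameters, the sketch below shows one plausible interpretation of a constraint in plain Python. The helper `is_allowed` is hypothetical and is not part of Pixano; it only mirrors the statement that `restricted=True` means no other values are allowed.

```python
def is_allowed(value, values, restricted=True):
    """Return True if value is acceptable under the constraint (hypothetical helper)."""
    if restricted:
        # Restricted: only the listed values are accepted.
        return value in values
    # Unrestricted: the listed values are not exhaustive, so anything passes.
    return True


allowed = ["car", "bus"]
print(is_allowed("car", allowed))                     # True
print(is_allowed("bike", allowed))                    # False
print(is_allowed("bike", allowed, restricted=False))  # True
```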
add_data(table_name, data, ignore_integrity_checks=None, raise_or_warn='raise')
Add data to a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table_name` | `str` | Table name. | required |
| `data` | `list[BaseSchema]` | Data to add. | required |
| `ignore_integrity_checks` | `list[IntegrityCheck] \| None` | List of integrity checks to ignore. | `None` |
| `raise_or_warn` | `Literal['raise', 'warn', 'none']` | Whether to raise or warn on integrity errors. Can be 'raise', 'warn' or 'none'. | `'raise'` |
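The three `raise_or_warn` modes can be pictured with a small stand-alone sketch. This is not Pixano's implementation: the function `report_integrity_errors`, the use of `ValueError`, and the error strings are all assumptions chosen for illustration.

```python
import warnings


def report_integrity_errors(errors, raise_or_warn="raise"):
    """Dispatch a list of error messages according to the chosen mode (hypothetical)."""
    if not errors or raise_or_warn == "none":
        return  # nothing to report, or reporting is disabled
    message = "; ".join(errors)
    if raise_or_warn == "raise":
        raise ValueError(message)  # assumed exception type
    if raise_or_warn == "warn":
        warnings.warn(message)
        return
    raise ValueError(f"unknown mode: {raise_or_warn!r}")


report_integrity_errors(["duplicate id"], raise_or_warn="none")  # silently ignored
report_integrity_errors(["duplicate id"], raise_or_warn="warn")  # emits a UserWarning
```

With the default `'raise'`, the same call would raise instead of continuing, which is the safer choice when loading data whose integrity is unknown.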
add_dataset_items(dataset_items)
Add dataset items to the dataset.
Warning: Does not check the integrity of the data.

Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_items` | `list[DatasetItem] \| DatasetItem` | Dataset items to add. | required |
compute_view_embeddings(table_name, data)
Compute the view embeddings via the Embedding Function stored in the table metadata.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table_name` | `str` | Table name containing the view embeddings. | required |
| `data` | `list[dict]` | Data to compute. Dictionary representing a view embedding without the vector field. | required |
count_rows_where(table_name=SchemaGroup.ITEM.value, where=None)
Count rows in a table, optionally filtered by a WHERE clause.
Uses LanceDB's native count_rows(), which avoids materializing the full table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table_name` | `str` | Table name. | `SchemaGroup.ITEM.value` |
| `where` | `str \| None` | Optional WHERE clause to filter rows. | `None` |

Returns:
| Type | Description |
|---|---|
| `int` | Number of matching rows. |
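Since `where` is a plain string, it can help to build it from Python values rather than by hand. The sketch below assumes the clause follows standard SQL filter syntax; the helpers `sql_literal` and `build_where` and the column names `split` and `verified` are hypothetical.

```python
def sql_literal(value):
    """Render a Python value as a SQL literal for a filter string."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return "TRUE" if value else "FALSE"
    if isinstance(value, (int, float)):
        return repr(value)
    # Escape single quotes by doubling them, as in standard SQL.
    return "'" + str(value).replace("'", "''") + "'"


def build_where(conditions):
    """Join column/value pairs into an AND-ed equality filter."""
    return " AND ".join(f"{col} = {sql_literal(val)}" for col, val in conditions.items())


where = build_where({"split": "train", "verified": True})
print(where)  # split = 'train' AND verified = TRUE
```

The resulting string could then be passed as the `where` argument, for example `dataset.count_rows_where(where=where)`.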
create_table(name, schema, relation_item, data=None, mode='create', exist_ok=False, on_bad_vectors='error', fill_value=0.0)
Add a table to the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Table name. | required |
| `schema` | `type[BaseSchema]` | Table schema. | required |
| `relation_item` | `SchemaRelation` | Relation with the | required |
| `data` | `DATA \| None` | Table data. | `None` |
| `mode` | `str` | Table mode ('create', 'overwrite'). | `'create'` |
| `exist_ok` | `bool` | If True, do not raise an error if the table already exists. | `False` |
| `on_bad_vectors` | `str` | Raise an error, drop or fill bad vectors ("error", "drop", "fill"). | `'error'` |
| `fill_value` | `float` | Value to fill bad vectors. | `0.0` |

Returns:
| Type | Description |
|---|---|
| `LanceTable` | The table created. |
delete_data(table_name, ids)
Delete data from a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table_name` | `str` | Table name. | required |
| `ids` | `list[str]` | Ids to delete. | required |

Returns:
| Type | Description |
|---|---|
| `list[str]` | The list of ids not found. |
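Because the return value lists the ids that were not found, a caller can derive which ids were actually deleted. The sketch below uses a fake delete function in place of `dataset.delete_data` so it runs on its own; `delete_or_report`, `fake_delete`, and the table name `entities` are hypothetical.

```python
def delete_or_report(delete_fn, table_name, ids):
    """Call delete_fn and split the requested ids into (deleted, missing)."""
    not_found = set(delete_fn(table_name, ids))
    deleted = [i for i in ids if i not in not_found]
    missing = [i for i in ids if i in not_found]
    return deleted, missing


# Fake store standing in for a table; it returns the ids it could not find.
stored = {"a", "b"}


def fake_delete(table_name, ids):
    missing = [i for i in ids if i not in stored]
    stored.difference_update(ids)
    return missing


deleted, missing = delete_or_report(fake_delete, "entities", ["a", "x"])
print(deleted)  # ['a']
print(missing)  # ['x']
```

With a real dataset, `dataset.delete_data` would be passed as `delete_fn`.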
delete_dataset_items(ids)
Delete dataset items.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ids` | `list[str]` | Ids to delete. | required |

Returns:
| Type | Description |
|---|---|
| `list[str]` | The list of ids not found. |
find(id, directory, media_dir=None) (staticmethod)
Find a Dataset in a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `id` | `str` | Dataset ID to find. | required |
| `directory` | `Path` | Directory to search in. | required |
| `media_dir` | `Path \| None` | Media directory. | `None` |

Returns:
| Type | Description |
|---|---|
| `Dataset` | The found dataset. |
find_ids_in_table(table_name, ids)
Search ids in a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table_name` | `str` | Table name. | required |
| `ids` | `set[str]` | Ids to find. | required |

Returns:
| Type | Description |
|---|---|
| `dict[str, bool]` | Dictionary of ids found. Keys are the ids and values are booleans indicating whether each id was found. |
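A `dict[str, bool]` result is often easier to act on once split into two lists. The following sketch assumes a `True` value marks an id that was found; `partition_ids` and the sample ids are hypothetical.

```python
def partition_ids(found_map):
    """Split an id -> bool mapping into (found, missing) sorted lists."""
    found = sorted(i for i, ok in found_map.items() if ok)
    missing = sorted(i for i, ok in found_map.items() if not ok)
    return found, missing


# Stand-in for the value returned by find_ids_in_table.
result = {"item_1": True, "item_2": False, "item_3": True}
found, missing = partition_ids(result)
print(found)    # ['item_1', 'item_3']
print(missing)  # ['item_2']
```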
generate_preview()
Generate a preview for the dataset.
It samples images from the dataset, creates a mosaic and saves it to the previews directory.
Returns:
| Type | Description |
|---|---|
| `str` | The preview base64 string. |
get_all_ids(table_name=SchemaGroup.ITEM.value, sortcol=None, order=None, where=None)
Get all the ids from a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table_name` | `str` | Table to look for ids in. | `SchemaGroup.ITEM.value` |
| `sortcol` | `str \| None` | Column to order by. | `None` |
| `order` | `str \| None` | Sort order (asc or desc). | `None` |
| `where` | `str \| None` | Where clause to filter ids. | `None` |

Returns:
| Type | Description |
|---|---|
| `list[str]` | List of the ids. |
get_data(table_name, ids=None, limit=None, skip=0, where=None, item_ids=None, sortcol=None, order=None)
Read data from a table.
Data can be filtered by ids, item ids, where clause, or limit and skip.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table_name` | `str` | Table name. | required |
| `where` | `str \| None` | Where clause. | `None` |
| `ids` | `list[str] \| str \| None` | Ids to read. | `None` |
| `limit` | `int \| None` | Number of items to read. If not set, defaults to the table size. | `None` |
| `skip` | `int` | Number of rows to skip. | `0` |
| `item_ids` | `list[str] \| None` | Item ids to read. | `None` |
| `sortcol` | `str \| None` | Column to order by. | `None` |
| `order` | `str \| None` | Sort order (asc or desc). | `None` |

Returns:
| Type | Description |
|---|---|
| `list[BaseSchema] \| BaseSchema \| None` | List of values. |
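The `limit` and `skip` parameters lend themselves to page-by-page reading of a large table. The sketch below is one way to do that, assuming the fetch function accepts `limit` and `skip` keyword arguments and returns a list (or an empty or `None` value when nothing is left). `iter_pages`, `fake_fetch`, and the table name `image` mentioned in the comment are hypothetical.

```python
from typing import Callable, Iterator


def iter_pages(fetch: Callable[..., list], page_size: int) -> Iterator[list]:
    """Yield successive pages by calling fetch(limit=..., skip=...)."""
    skip = 0
    while True:
        page = fetch(limit=page_size, skip=skip)
        if not page:  # empty list or None: nothing left to read
            return
        yield page
        if len(page) < page_size:  # short page: this was the last one
            return
        skip += page_size


# Stand-in for a bound call such as functools.partial(dataset.get_data, "image").
rows = [f"row-{i}" for i in range(5)]


def fake_fetch(limit, skip=0):
    return rows[skip:skip + limit]


pages = list(iter_pages(fake_fetch, page_size=2))
print(pages)  # [['row-0', 'row-1'], ['row-2', 'row-3'], ['row-4']]
```

Reading in pages keeps memory use bounded, at the cost of one call per page.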
get_dataset_items(ids=None, limit=None, skip=0)
Read dataset items.
Filter dataset items by ids, or limit and skip.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ids` | `list[str] \| str \| None` | Item ids to read. | `None` |
| `limit` | `int \| None` | Number of items to read. | `None` |
| `skip` | `int` | Number of items to skip. | `0` |

Returns:
| Type | Description |
|---|---|
| `list[DatasetItem] \| DatasetItem \| None` | List of dataset items. |
list(directory) (staticmethod)
List the information of the datasets found in a directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `directory` | `Path` | Directory to search in. | required |

Returns:
| Type | Description |
|---|---|
| `list[DatasetInfo]` | List of dataset infos. |
open_table(name)
Open a dataset table with LanceDB.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `name` | `str` | Name of the table to open. | required |

Returns:
| Type | Description |
|---|---|
| `LanceTable` | Dataset table. |
open_tables(names=None, exclude_embeddings=True)
Open the dataset tables with LanceDB.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `names` | `list[str] \| None` | Table names to open. If None, open all tables. | `None` |
| `exclude_embeddings` | `bool` | Whether to exclude embedding tables from the list. | `True` |

Returns:
| Type | Description |
|---|---|
| `dict[str, LanceTable]` | Dataset tables. |
resolve_ref(ref)
Resolve a SchemaRef.
It fetches the data from the table referenced.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `ref` | `SchemaRef \| ItemRef \| ViewRef \| EmbeddingRef \| EntityRef \| AnnotationRef \| SourceRef` | Reference to resolve. | required |

Returns:
| Type | Description |
|---|---|
| `BaseSchema \| Item \| View \| Embedding \| Entity \| Annotation \| Source` | The resolved reference. |
semantic_search(query, table_name, limit, skip=0)
Perform a semantic search.
It searches for the closest items to the query in the table embeddings.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `str` | Text query for semantic search. | required |
| `table_name` | `str` | Table name for embeddings. | required |
| `limit` | `int` | Maximum number of items to return. | required |
| `skip` | `int` | Number of items to skip. | `0` |

Returns:
| Type | Description |
|---|---|
| `tuple[list[BaseSchema], list[float], list[str]]` | Tuple of items, distances, and full sorted list of ids. |
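The returned items and distances can be post-filtered on the caller's side. The sketch below assumes the two lists are aligned, so that the distance at index `i` belongs to the item at index `i`; `within_distance`, the sample values, and the threshold are hypothetical.

```python
def within_distance(items, distances, max_distance):
    """Keep (item, distance) pairs whose distance is at most max_distance."""
    return [(item, dist) for item, dist in zip(items, distances) if dist <= max_distance]


# Stand-in values shaped like the first two elements of the returned tuple.
items = ["img_a", "img_b", "img_c"]
distances = [0.12, 0.48, 0.91]
close = within_distance(items, distances, max_distance=0.5)
print(close)  # [('img_a', 0.12), ('img_b', 0.48)]
```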
update_data(table_name, data, return_separately=False, ignore_integrity_checks=None, raise_or_warn='raise')
Update data in a table.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `table_name` | `str` | Table name. | required |
| `data` | `list[BaseSchema]` | Data to update. | required |
| `return_separately` | `bool` | Whether to return separately added and updated data. | `False` |
| `ignore_integrity_checks` | `list[IntegrityCheck] \| None` | List of integrity checks to ignore. | `None` |
| `raise_or_warn` | `Literal['raise', 'warn', 'none']` | Whether to raise or warn on integrity errors. Can be 'raise', 'warn' or 'none'. | `'raise'` |

Returns:
| Type | Description |
|---|---|
| `list[BaseSchema] \| tuple[list[BaseSchema], list[BaseSchema]]` | If `return_separately` is True, a tuple of two lists holding the added and the updated data separately; otherwise a single list of data. |
update_dataset_items(dataset_items, return_separately=False)
Update dataset items.
Warning: Does not check the integrity of the data.

Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `dataset_items` | `list[DatasetItem]` | Dataset items to update. | required |
| `return_separately` | `bool` | Whether to return separately added and updated dataset items. | `False` |

Returns:
| Type | Description |
|---|---|
| `list[DatasetItem] \| tuple[list[DatasetItem], list[DatasetItem]]` | If `return_separately` is True, a tuple of two lists holding the added and the updated dataset items separately; otherwise the updated dataset items. |