lamindb.Artifact¶
- class lamindb.Artifact(data: UPathStr, key: str | None = None, description: str | None = None, is_new_version_of: Artifact | None = None, run: Run | None = None)¶
Bases: Registry, HasFeatures, IsVersioned, TracksRun, TracksUpdates
Artifacts: datasets & models stored as files, folders, or arrays.
Artifacts manage data in local or remote storage.
An artifact stores a dataset or model as either a file or a folder.
Some artifacts are array-like, e.g., when stored as .parquet, .h5ad, .zarr, or .tiledb.
For more info, see tutorial: Tutorial: Artifacts.
- Parameters:
data (UPathStr) – A path to a local or remote folder or file.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.fcs".
description (str | None, default: None) – A description.
version (str | None, default: None) – A version string.
is_new_version_of (Artifact | None, default: None) – A previous version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.
Typical storage formats & their API accessors
Arrays:
Table: .csv, .tsv, .parquet, .ipc ⟷ DataFrame, pyarrow.Table
Annotated matrix: .h5ad, .h5mu, .zrad ⟷ AnnData, MuData
Generic array: HDF5 group, zarr group, TileDB store ⟷ HDF5, zarr, TileDB loaders
Non-arrays:
Image: .jpg, .png ⟷ np.ndarray, …
Fastq: .fastq ⟷ /
VCF: .vcf ⟷ /
QC: .html ⟷ /
You’ll find these values in the suffix & accessor fields.
LaminDB makes some default choices (e.g., serialize a DataFrame as a .parquet file).
See also
Storage
Storage locations for artifacts.
Collection
Collections of artifacts.
from_df()
Create an artifact from a DataFrame.
from_anndata()
Create an artifact from an AnnData.
from_dir()
Bulk create file-like artifacts from a directory.
Examples
Create an artifact from a file in the cloud:
>>> artifact = ln.Artifact("s3://my-bucket/my-folder/my-file.csv", description="My file")
>>> artifact.save()  # only metadata is saved
Create an artifact from a local filepath:
>>> artifact = ln.Artifact("./my_file.jpg", description="My image")
>>> artifact.save()
Why does the API look this way?
It’s inspired by APIs building on AWS S3.
Both boto3 and quilt select a bucket (akin to default storage in LaminDB) and define a target path through a key argument.
In boto3:
# signature: S3.Bucket.upload_file(filepath, key)
import boto3
s3 = boto3.resource('s3')
bucket = s3.Bucket('mybucket')
bucket.upload_file('/tmp/hello.txt', 'hello.txt')
In quilt3:
# signature: quilt3.Bucket.put_file(key, filepath)
import quilt3
bucket = quilt3.Bucket('mybucket')
bucket.put_file('hello.txt', '/tmp/hello.txt')
Make a new version of an artifact:
>>> # a non-versioned artifact
>>> artifact = ln.Artifact(df1, description="My dataframe")
>>> artifact.save()
>>> # version an artifact
>>> new_artifact = ln.Artifact(df2, is_new_version_of=artifact)
>>> assert new_artifact.stem_uid == artifact.stem_uid
>>> assert artifact.version == "1"
>>> assert new_artifact.version == "2"
Attributes¶
- features: FeatureManager¶
Feature manager.
- labels: LabelManager¶
Label manager.
- objects Manager¶
- stem_uid: str¶
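The versioning example above asserts that both versions share one stem_uid and carry the versions "1" and "2". A minimal sketch of that bookkeeping follows. It uses a hypothetical `VersionedRecord` class rather than the lamindb `Artifact`, and it assumes integer-like version strings, which lamindb does not require.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VersionedRecord:
    """Hypothetical stand-in for a versioned record; not the lamindb class."""
    stem_uid: str
    version: Optional[str] = None


def new_version_of(previous: VersionedRecord) -> VersionedRecord:
    """Create the next record in the same version family."""
    # Assumption taken from the example above: versioning a so-far
    # unversioned record labels that record "1".
    if previous.version is None:
        previous.version = "1"
    # Assumption: versions are integer-like strings that increment by one.
    next_version = str(int(previous.version) + 1)
    # The family is identified by the shared stem_uid.
    return VersionedRecord(stem_uid=previous.stem_uid, version=next_version)


first = VersionedRecord(stem_uid="a1B2c3D4e5F6g7H8")  # made-up identifier
second = new_version_of(first)
```

In this sketch, `first.version` becomes "1" only once a successor is created, mirroring the assertions in the example.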
Fields¶
- version CharField¶
Version (default None).
Defines version of a family of records characterized by the same stem_uid.
Consider using semantic versioning with Python versioning.
- created_at DateTimeField¶
Time of creation of record.
- updated_at DateTimeField¶
Time of last update to record.
- id AutoField¶
Internal id, valid only in one DB instance.
- uid CharField¶
A universal random id (20-char base62 ~ UUID), valid across DB instances.
- description CharField¶
A description.
- key CharField¶
Storage key, the relative path within the storage location.
- suffix CharField¶
Path suffix or empty string if no canonical suffix exists.
This is either a file suffix (".csv", ".h5ad", etc.) or the empty string "".
- accessor CharField¶
Default backed or memory accessor, e.g., DataFrame, AnnData.
Soon, also: SOMA, MuData, zarr.Group, tiledb.Array, etc.
- size BigIntegerField¶
Size in bytes.
Examples: 1KB is 1e3 bytes, 1MB is 1e6, 1GB is 1e9, 1TB is 1e12 etc.
- hash CharField¶
Hash or pseudo-hash of artifact content.
Useful to ascertain integrity and avoid duplication.
- hash_type CharField¶
Type of hash.
- n_objects BigIntegerField¶
Number of objects.
Typically, this denotes the number of files in an artifact.
- n_observations BigIntegerField¶
Number of observations.
Typically, this denotes the first array dimension.
- visibility SmallIntegerField¶
Visibility of artifact record in queries & searches (1 default, 0 hidden, -1 trash).
- key_is_virtual BooleanField¶
Indicates whether key is virtual or part of an actual file path.
- input_of ManyToManyField¶
Runs that use this artifact as an input.
- previous_runs ManyToManyField¶
Sequence of runs that created or updated the record.
- feature_sets ManyToManyField¶
The feature sets measured in the artifact (FeatureSet).
- feature_values ManyToManyField¶
Non-categorical feature values for annotation.
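To make the suffix, size, hash, and hash_type fields concrete, here is a minimal sketch of how comparable values could be derived for a local file with the standard library. The helper `describe_file` is hypothetical, and the choice of MD5 is an assumption for illustration; the hashing scheme LaminDB actually uses is not stated here.

```python
import hashlib
import tempfile
from pathlib import Path


def describe_file(path: Path) -> dict:
    """Compute suffix-, size-, and hash-like values for a local file."""
    content = path.read_bytes()
    return {
        "suffix": path.suffix,  # "" when the file name has no suffix
        "size": len(content),   # size in bytes
        "hash": hashlib.md5(content).hexdigest(),  # assumed hash type
        "hash_type": "md5",
    }


with tempfile.TemporaryDirectory() as tmp:
    csv_path = Path(tmp) / "myfile.csv"
    csv_path.write_bytes(b"a,b\n1,2\n")
    info = describe_file(csv_path)

    bare_path = Path(tmp) / "README"  # a file without a suffix
    bare_path.write_bytes(b"")
    bare_info = describe_file(bare_path)
```

Two files with identical content yield the same hash, which is what makes such a field useful for detecting duplicates.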
Methods¶
- backed(is_run_input=None)¶
Return a cloud-backed data object.
- Return type:
Notes
For more info, see tutorial: Query arrays.
Examples
Read AnnData in backed mode from cloud:
>>> artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()
>>> artifact.backed()
AnnData object with n_obs × n_vars = 70 × 765 backed at 's3://lamindb-ci/lndb-storage/pbmc68k.h5ad'
- cache(is_run_input=None)¶
Download cloud artifact to local cache.
Follows syncing logic: only caches an artifact if it’s outdated in the local cache.
Returns a path to a locally cached on-disk object (say, a .jpg file).
- Return type:
Path
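The syncing rule described for cache() (copy only when the cached copy is outdated) can be sketched with local files. This is an illustration under assumptions, not the lamindb implementation: a local directory stands in for cloud storage, "outdated" is judged by modification time, and `sync_to_cache` is a hypothetical helper.

```python
import os
import shutil
import tempfile
from pathlib import Path
from typing import Tuple


def sync_to_cache(source: Path, cache_dir: Path) -> Tuple[Path, bool]:
    """Copy source into cache_dir only if the cached copy is missing or older.

    Returns the cached path and whether a copy was made.
    """
    cached = cache_dir / source.name
    outdated = (not cached.exists()) or cached.stat().st_mtime < source.stat().st_mtime
    if outdated:
        cache_dir.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, cached)  # copy2 preserves the modification time
    return cached, outdated


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "remote" / "image.jpg"  # stands in for a cloud object
    source.parent.mkdir()
    source.write_bytes(b"v1")
    cache_dir = Path(tmp) / "cache"

    _, first_copy = sync_to_cache(source, cache_dir)   # cache is empty
    _, second_copy = sync_to_cache(source, cache_dir)  # cache is up to date

    # Simulate a newer upstream object by changing content and bumping mtime.
    source.write_bytes(b"v2")
    newer = source.stat().st_mtime + 10
    os.utime(source, (newer, newer))
    cached, third_copy = sync_to_cache(source, cache_dir)
    cached_content = cached.read_bytes()
```

Only the first and third calls copy data; the second call returns the existing cached path untouched.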
Examples
Sync file from cloud and return the local path of the cache:
>>> artifact.cache()
PosixPath('/home/runner/work/Caches/lamindb/lamindb-ci/lndb-storage/pbmc68k.h5ad')
- delete(permanent=None, storage=None, using_key=None)¶
Delete.
A first call to .delete() puts an artifact into the trash (sets visibility to -1).
A second call permanently deletes the artifact.
FAQ: Storage FAQ
- Parameters:
permanent (bool | None, default: None) – Permanently delete the artifact (skip trash).
storage (bool | None, default: None) – Indicate whether you want to delete the artifact in storage.
- Return type:
None
Examples
For an Artifact object artifact, call:
>>> artifact.delete()
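The two-stage behavior of delete(), together with restore(), can be illustrated with a toy in-memory registry. Everything here is hypothetical (`Record`, `ToyRegistry`), and the trash state is modeled as a boolean flag instead of the numeric visibility field. Storage deletion is not modeled.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    """Hypothetical stand-in for an artifact record; not the lamindb class."""
    uid: str
    in_trash: bool = False


class ToyRegistry:
    """Hypothetical in-memory registry illustrating trash semantics."""

    def __init__(self) -> None:
        self.records: Dict[str, Record] = {}

    def add(self, record: Record) -> None:
        self.records[record.uid] = record

    def delete(self, uid: str, permanent: Optional[bool] = None) -> None:
        record = self.records[uid]
        if permanent or record.in_trash:
            # Already trashed, or the caller asked to skip the trash.
            del self.records[uid]
        else:
            # First call: only move the record to the trash.
            record.in_trash = True

    def restore(self, uid: str) -> None:
        self.records[uid].in_trash = False


registry = ToyRegistry()
for uid in ("a", "b", "c"):
    registry.add(Record(uid))

registry.delete("a")                  # first call: "a" moves to the trash
trashed_after_first_call = registry.records["a"].in_trash
registry.restore("a")                 # "a" leaves the trash again
registry.delete("b")
registry.delete("b")                  # second call: "b" is removed for good
registry.delete("c", permanent=True)  # skips the trash entirely
remaining = sorted(registry.records)
```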
- classmethod from_anndata(adata, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)¶
Create from AnnData, validate & link features.
- Parameters:
adata (AnnData | str | Path) – An AnnData object or a path to an AnnData-like object.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5ad".
description (str | None, default: None) – A description.
version (str | None, default: None) – A version string.
is_new_version_of (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.
- Return type:
See also
Collection()
Track collections.
Feature
Track features.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> adata = ln.core.datasets.anndata_with_obs()
>>> artifact = ln.Artifact.from_anndata(adata, description="mini anndata with obs")
>>> artifact.save()
- classmethod from_df(df, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)¶
Create from DataFrame, validate & link features.
For more info, see tutorial: Tutorial: Artifacts.
- Parameters:
df (DataFrame) – A DataFrame object.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.parquet".
description (str | None, default: None) – A description.
version (str | None, default: None) – A version string.
is_new_version_of (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.
- Return type:
See also
Collection()
Track collections.
Feature
Track features.
Examples
>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> df.head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_code
0         0.051        0.035         0.014        0.002                   0
1         0.049        0.030         0.014        0.002                   0
2         0.047        0.032         0.013        0.002                   0
3         0.046        0.031         0.015        0.002                   0
4         0.050        0.036         0.014        0.002                   0
>>> artifact = ln.Artifact.from_df(df, description="Iris flower collection batch1")
>>> artifact.save()
- classmethod from_dir(path, key=None, *, run=None)¶
Create a list of artifact objects from a directory.
Hint
If you have a high number of files (several 100k) and don’t want to track them individually, create a single Artifact via Artifact(path) for them. See, e.g., RxRx: cell imaging.
- Parameters:
path (str | Path) – Source path of folder.
key (str | None, default: None) – Key for storage destination. If None and directory is in a registered location, an inferred key will reflect the relative position. If None and directory is outside of a registered storage location, the inferred key defaults to path.name.
run (Run | None, default: None) – A Run object.
- Return type:
list[Artifact]
Examples
>>> dir_path = ln.core.datasets.generate_cell_ranger_files("sample_001", ln.settings.storage)
>>> artifacts = ln.Artifact.from_dir(dir_path)
>>> ln.save(artifacts)
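The key inference rule described for the key parameter of from_dir() can be sketched with pathlib. The function `infer_key` and the example paths are hypothetical, and the sketch considers a single storage root only; it is an illustration of the stated rule, not the lamindb implementation.

```python
from pathlib import PurePosixPath
from typing import Optional


def infer_key(path: str, storage_root: str, key: Optional[str] = None) -> str:
    """Infer a storage key for a directory, following the rule described above."""
    if key is not None:
        return key  # an explicit key always wins
    directory = PurePosixPath(path)
    try:
        # Inside the registered location: the key reflects the relative position.
        return directory.relative_to(PurePosixPath(storage_root)).as_posix()
    except ValueError:
        # Outside the registered location: fall back to the directory name.
        return directory.name


inside = infer_key("/data/storage/runs/sample_001", "/data/storage")
outside = infer_key("/tmp/other/sample_001", "/data/storage")
explicit = infer_key("/tmp/other/sample_001", "/data/storage", key="raw/sample_001")
```

Here `inside` keeps the nested position within storage, while `outside` collapses to the bare directory name.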
- classmethod from_mudata(mdata, key=None, description=None, run=None, version=None, is_new_version_of=None, **kwargs)¶
Create from MuData, validate & link features.
- Parameters:
mdata (MuData) – A MuData object.
key (str | None, default: None) – A relative path within default storage, e.g., "myfolder/myfile.h5mu".
description (str | None, default: None) – A description.
version (str | None, default: None) – A version string.
is_new_version_of (Artifact | None, default: None) – An old version of the artifact.
run (Run | None, default: None) – The run that creates the artifact.
- Return type:
See also
Collection()
Track collections.
Feature
Track features.
Examples
>>> import bionty as bt
>>> bt.settings.organism = "human"
>>> mdata = ln.core.datasets.mudata_papalexi21_subset()
>>> artifact = ln.Artifact.from_mudata(mdata, description="a mudata object")
>>> artifact.save()
- load(is_run_input=None, stream=False, **kwargs)¶
Stage and load to memory.
Returns in-memory representation if possible, e.g., an AnnData object for an h5ad file.
- Return type:
Any
Examples
Load as a DataFrame:
>>> df = ln.core.datasets.df_iris_in_meter_batch1()
>>> ln.Artifact.from_df(df, description="iris").save()
>>> artifact = ln.Artifact.filter(description="iris").one()
>>> artifact.load().head()
   sepal_length  sepal_width  petal_length  petal_width  iris_organism_code
0         0.051        0.035         0.014        0.002                   0
1         0.049        0.030         0.014        0.002                   0
2         0.047        0.032         0.013        0.002                   0
3         0.046        0.031         0.015        0.002                   0
4         0.050        0.036         0.014        0.002                   0
Load as an AnnData:
>>> artifact.load()
AnnData object with n_obs × n_vars = 70 × 765
Fall back to cache() if no in-memory representation is configured:
>>> artifact.load()
PosixPath('/home/runner/work/lamindb/lamindb/docs/guide/mydata/.lamindb/jb7BY5UJoQVGMUOKiLcn.jpg')
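The pattern in these examples (return an in-memory object when a loader is configured for the suffix, otherwise return a path) can be sketched as a suffix-to-loader lookup. The `LOADERS` table and the helper functions are hypothetical and use only standard-library formats; the loaders lamindb configures are not shown here.

```python
import csv
import json
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List


def load_json(path: Path) -> Any:
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def load_csv(path: Path) -> List[dict]:
    with open(path, newline="", encoding="utf-8") as handle:
        return [dict(row) for row in csv.DictReader(handle)]


# Hypothetical registry mapping a file suffix to an in-memory loader.
LOADERS: Dict[str, Callable[[Path], Any]] = {".json": load_json, ".csv": load_csv}


def load(path: Path) -> Any:
    """Return an in-memory object if a loader exists, else the path itself."""
    loader = LOADERS.get(path.suffix)
    if loader is None:
        return path  # fallback: hand back the on-disk location
    return loader(path)


with tempfile.TemporaryDirectory() as tmp:
    table_path = Path(tmp) / "table.csv"
    table_path.write_text("a,b\n1,2\n", encoding="utf-8")
    image_path = Path(tmp) / "image.jpg"
    image_path.write_bytes(b"\xff\xd8")

    table = load(table_path)      # parsed rows
    image = load(image_path)      # no loader for ".jpg": a path comes back
    image_is_path = isinstance(image, Path)
```

Keeping the lookup in a plain dictionary means support for a new format is a one-line addition, and unknown suffixes degrade gracefully to a path.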
- replace(data, run=None, format=None)¶
Replace artifact content.
- Parameters:
data (str | Path) – A file path.
run (Run | None, default: None) – The run that created the artifact gets auto-linked if ln.track() was called.
- Return type:
None
Examples
Say we made a change to the content of an artifact, e.g., edited the image paradisi05_laminopathic_nuclei.jpg.
This is how we replace the old file in storage with the new file:
>>> artifact.replace("paradisi05_laminopathic_nuclei.jpg")
>>> artifact.save()
Note that this changes neither the storage key nor the filename.
However, it will update the suffix if it changes.
- restore()¶
Restore from trash.
- Return type:
None
Examples
For any Artifact object artifact, call:
>>> artifact.restore()
- save(upload=None, **kwargs)¶
Save to database & storage.
- Parameters:
upload (bool | None, default: None) – Trigger upload to cloud storage in instances with hybrid storage mode.
- Return type:
None
Examples
>>> artifact = ln.Artifact("./myfile.csv", description="myfile")
>>> artifact.save()