Dataset Repository Architecture¶
The dataset repository provides a unified API for dataset storage operations. It lets metaseed-hub and other integrations swap in custom storage backends (e.g., a database), while standalone metaseed uses filesystem storage.
Overview¶
graph TB
    subgraph Consumers
        UI[Web UI]
        MCP[MCP Server]
        API[REST API]
    end
    subgraph Manager["Dataset Manager"]
        DM[DatasetManager]
    end
    subgraph Repositories["Repository Implementations"]
        FSR[FilesystemDatasetRepository]
        DBR[DatabaseDatasetRepository - Hub]
    end
    subgraph Storage["Storage Backends"]
        FILE[JSON Files]
        DB[Database]
    end
    UI --> DM
    MCP --> DM
    API --> DM
    DM --> FSR
    DM --> DBR
    FSR --> FILE
    DBR --> DB
Components¶
| Component | Location | Responsibility |
|---|---|---|
| DatasetRepository | metaseed.repositories.dataset_repository | Sync abstract interface |
| AsyncDatasetRepository | metaseed.repositories.dataset_repository | Async abstract interface |
| DatasetInfo | metaseed.repositories.dataset_repository | Summary info for listing |
| DatasetData | metaseed.repositories.dataset_repository | Full dataset contents |
| FilesystemDatasetRepository | metaseed.repositories.filesystem_dataset | JSON file-based storage |
| DatasetManager | metaseed.ui.dataset_manager | Sync business logic + state |
| AsyncDatasetManager | metaseed.ui.dataset_manager | Async business logic + state |
DatasetRepository Interfaces¶
Two interfaces are provided: sync for filesystem/simple backends, async for database backends.
Sync Interface¶
from abc import ABC

from metaseed.repositories import DatasetRepository, DatasetInfo, DatasetData

# Shape of the interface (signatures only)
class DatasetRepository(ABC):
    def list(self) -> list[DatasetInfo]: ...
    def save(self, name: str, data: DatasetData) -> DatasetInfo: ...
    def load(self, name: str) -> DatasetData: ...
    def delete(self, name: str) -> bool: ...
    def exists(self, name: str) -> bool: ...

    # Returns an error message, or None if the name is valid
    @staticmethod
    def validate_name(name: str) -> str | None: ...
Async Interface¶
from abc import ABC

from metaseed.repositories import AsyncDatasetRepository, DatasetInfo, DatasetData

# Shape of the interface (signatures only)
class AsyncDatasetRepository(ABC):
    async def list(self) -> list[DatasetInfo]: ...
    async def save(self, name: str, data: DatasetData) -> DatasetInfo: ...
    async def load(self, name: str) -> DatasetData: ...
    async def delete(self, name: str) -> bool: ...
    async def exists(self, name: str) -> bool: ...

    # Synchronous in both interfaces; returns an error message, or None if valid
    @staticmethod
    def validate_name(name: str) -> str | None: ...
Data Classes¶
DatasetInfo¶
Summary information for listing datasets:
@dataclass
class DatasetInfo:
    name: str          # Dataset identifier
    profile: str       # Profile name (e.g., "miappe")
    version: str       # Profile version (e.g., "1.2")
    entity_count: int  # Number of entities
    modified: str      # ISO timestamp
DatasetData¶
Full dataset contents for save/load:
@dataclass
class DatasetData:
    name: str
    profile: str
    version: str
    entities: list[dict]  # Serialized entity data
    modified: str
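To illustrate how the interface and the two data classes fit together, here is a minimal in-memory backend of the kind that might be useful in unit tests. It is a sketch under stated assumptions: the data classes are local copies of the fields documented above so the example runs without metaseed installed, and `InMemoryDatasetRepository` is a hypothetical name. A real backend would subclass `DatasetRepository` instead.

```python
from dataclasses import dataclass, replace
from datetime import datetime


@dataclass
class DatasetInfo:  # local copy of the documented fields
    name: str
    profile: str
    version: str
    entity_count: int
    modified: str


@dataclass
class DatasetData:  # local copy of the documented fields
    name: str
    profile: str
    version: str
    entities: list[dict]
    modified: str


class InMemoryDatasetRepository:
    """Hypothetical backend that keeps datasets in a dict."""

    def __init__(self) -> None:
        self._store: dict[str, DatasetData] = {}

    def list(self) -> list[DatasetInfo]:
        return [self._to_info(d) for d in self._store.values()]

    def save(self, name: str, data: DatasetData) -> DatasetInfo:
        # Stamp the modification time when saving.
        stored = replace(data, name=name, modified=datetime.now().isoformat())
        self._store[name] = stored
        return self._to_info(stored)

    def load(self, name: str) -> DatasetData:
        if name not in self._store:
            raise FileNotFoundError(f"Dataset not found: {name}")
        return self._store[name]

    def delete(self, name: str) -> bool:
        # True only if something was actually removed.
        return self._store.pop(name, None) is not None

    def exists(self, name: str) -> bool:
        return name in self._store

    @staticmethod
    def _to_info(data: DatasetData) -> DatasetInfo:
        # The summary is derived from the full contents.
        return DatasetInfo(
            data.name, data.profile, data.version, len(data.entities), data.modified
        )
```

Deriving `DatasetInfo` from `DatasetData` in one helper keeps `list()` and `save()` consistent with each other.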
Usage Patterns¶
Standalone Metaseed (Default)¶
Uses filesystem storage automatically:
from metaseed.ui.dataset_manager import get_manager
from metaseed.ui.state import AppState
state = AppState(profile="miappe")
manager = get_manager(state)
# List datasets
datasets = manager.list_datasets()
# Save current state
info = manager.save_dataset("my-experiment")
# Load dataset
info = manager.load_dataset("my-experiment")
# Delete dataset
manager.delete_dataset("old-experiment")
metaseed-hub Integration (Async)¶
Swap in an async database backend at app startup:
from datetime import datetime

from sqlalchemy import select
from sqlalchemy.ext.asyncio import AsyncSession

from metaseed.repositories import AsyncDatasetRepository, DatasetInfo, DatasetData
from metaseed.ui.dataset_manager import set_async_repository, get_async_manager

# `Dataset` is assumed to be the hub's SQLAlchemy ORM model; its definition
# is not shown in this document.


class DatabaseDatasetRepository(AsyncDatasetRepository):
    """Async database-backed dataset storage for metaseed-hub."""

    def __init__(self, session: AsyncSession):
        self._session = session

    async def list(self) -> list[DatasetInfo]:
        result = await self._session.execute(select(Dataset))
        return [
            DatasetInfo(
                name=row.name,
                profile=row.profile,
                version=row.version,
                entity_count=len(row.entities),
                modified=row.modified.isoformat(),
            )
            for row in result.scalars()
        ]

    async def save(self, name: str, data: DatasetData) -> DatasetInfo:
        # Reject invalid names before touching the database
        error = self.validate_name(name)
        if error:
            raise ValueError(error)
        result = await self._session.execute(
            select(Dataset).filter_by(name=name)
        )
        dataset = result.scalar_one_or_none()
        now = datetime.now()
        if dataset:
            # Update the existing row in place
            dataset.profile = data.profile
            dataset.version = data.version
            dataset.entities = data.entities
            dataset.modified = now
        else:
            dataset = Dataset(
                name=name,
                profile=data.profile,
                version=data.version,
                entities=data.entities,
                modified=now,
            )
            self._session.add(dataset)
        await self._session.commit()
        # Use the local timestamp: with the default expire_on_commit=True,
        # ORM attributes are expired after commit and reading them here
        # would trigger a lazy load.
        return DatasetInfo(
            name=name,
            profile=data.profile,
            version=data.version,
            entity_count=len(data.entities),
            modified=now.isoformat(),
        )

    async def load(self, name: str) -> DatasetData:
        result = await self._session.execute(
            select(Dataset).filter_by(name=name)
        )
        dataset = result.scalar_one_or_none()
        if not dataset:
            raise FileNotFoundError(f"Dataset not found: {name}")
        return DatasetData(
            name=dataset.name,
            profile=dataset.profile,
            version=dataset.version,
            entities=dataset.entities,
            modified=dataset.modified.isoformat(),
        )

    async def delete(self, name: str) -> bool:
        result = await self._session.execute(
            select(Dataset).filter_by(name=name)
        )
        dataset = result.scalar_one_or_none()
        if dataset:
            await self._session.delete(dataset)
            await self._session.commit()
            return True
        return False

    async def exists(self, name: str) -> bool:
        result = await self._session.execute(
            select(Dataset).filter_by(name=name)
        )
        return result.scalar_one_or_none() is not None


# In metaseed-hub app startup:
async def create_app(async_session: AsyncSession):
    # Configure async repository BEFORE any dataset operations
    set_async_repository(DatabaseDatasetRepository(async_session))

    # Now routes can use async manager
    from metaseed.ui.app import create_app as create_metaseed_app
    return create_metaseed_app()
Using AsyncDatasetManager in Routes¶
from dataclasses import asdict

from metaseed.ui.dataset_manager import get_async_manager, get_manager

# `app` and `get_state` are provided by the hosting web application.
@app.get("/api/datasets")
async def list_datasets_api():
    state = get_state()
    manager = get_async_manager(state)
    if manager:
        # Async repository configured: use async operations
        datasets = await manager.list_datasets()
    else:
        # No async repository configured: fall back to sync
        datasets = get_manager(state).list_datasets()
    return {"datasets": [asdict(d) for d in datasets]}
Direct Repository Access¶
For scripts or external tools:
from pathlib import Path

from metaseed.repositories import FilesystemDatasetRepository, DatasetData

# Custom storage location
repo = FilesystemDatasetRepository(Path("/data/metaseed/datasets"))

# List available datasets
for info in repo.list():
    print(f"{info.name}: {info.profile}/{info.version} ({info.entity_count} entities)")

# Load and inspect
data = repo.load("experiment-2024")
for entity in data.entities:
    print(f"  {entity['_type']}: {entity.get('title', entity.get('unique_id'))}")
Module-Level Functions¶
For dependency injection configuration:
from metaseed.ui.dataset_manager import (
    # Sync
    set_repository,        # Configure sync repository
    get_repository,        # Get sync repository
    get_manager,           # Get DatasetManager instance
    # Async
    set_async_repository,  # Configure async repository
    get_async_repository,  # Get async repository
    get_async_manager,     # Get AsyncDatasetManager instance
    # Utilities
    reset_manager,         # Clear all cached managers (for testing)
)
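One common way to implement this kind of module-level injection is a set of module globals behind setter and getter functions. The sketch below is illustrative only and is not metaseed's implementation: the function names mirror the sync entries in the list above, but the bodies, the `Manager` stand-in, the lazy default, and the caching strategy are all assumptions. It also omits the `state` argument that the real `get_manager` takes.

```python
from typing import Any, Optional


class Manager:
    """Hypothetical stand-in for DatasetManager: wraps a repository."""

    def __init__(self, repository: Any) -> None:
        self.repository = repository


# Module-level slots; None means "not configured yet".
_repository: Optional[Any] = None
_manager: Optional[Manager] = None


def _default_repository() -> Any:
    # Assumption: a filesystem-like default is created lazily on first use.
    return {"kind": "filesystem-default"}


def set_repository(repository: Any) -> None:
    """Install a backend and drop any manager built on the previous one."""
    global _repository, _manager
    _repository = repository
    _manager = None


def get_repository() -> Any:
    """Return the configured backend, creating the default if none is set."""
    global _repository
    if _repository is None:
        _repository = _default_repository()
    return _repository


def get_manager() -> Manager:
    """Return a cached manager bound to the configured repository."""
    global _manager
    if _manager is None:
        _manager = Manager(get_repository())
    return _manager


def reset_manager() -> None:
    """Clear cached state, e.g. between tests."""
    global _repository, _manager
    _repository = None
    _manager = None
```

In a design like this, the repository has to be configured before the first manager is requested, otherwise the default backend is created first. Invalidating the cached manager inside `set_repository` is one way to make a later swap take effect.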
Backward Compatibility¶
The metaseed.ui.datasets module provides backward-compatible functions:
from metaseed.ui.datasets import (
    list_datasets,   # Returns list[dict]
    save_dataset,    # state, name -> dict
    load_dataset,    # state, name -> dict
    delete_dataset,  # name -> bool
    get_current_dataset_name,
    set_current_dataset_name,
    auto_save,
)
These delegate to DatasetManager internally.
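A delegating wrapper of this kind can be very small. The following sketch shows the general pattern only: `_StubManager` is a hypothetical stand-in for `DatasetManager`, the `DatasetInfo` class is a local copy of the documented fields, and passing the manager as an argument is a simplification made so the example runs without metaseed installed.

```python
from dataclasses import asdict, dataclass


@dataclass
class DatasetInfo:  # local copy of the documented fields
    name: str
    profile: str
    version: str
    entity_count: int
    modified: str


class _StubManager:
    """Hypothetical stand-in for DatasetManager."""

    def list_datasets(self) -> list[DatasetInfo]:
        return [
            DatasetInfo("my-experiment", "miappe", "1.2", 2, "2024-01-15T10:30:00")
        ]


def list_datasets(manager: _StubManager) -> list[dict]:
    """Legacy-style function: delegate, then convert dataclasses to dicts."""
    return [asdict(info) for info in manager.list_datasets()]


rows = list_datasets(_StubManager())
```

Converting with `asdict` at the boundary lets existing callers keep working with plain dictionaries while the manager and repository use typed data classes internally.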
File Format¶
FilesystemDatasetRepository stores datasets as JSON:
{
  "name": "my-experiment",
  "profile": "miappe",
  "version": "1.2",
  "modified": "2024-01-15T10:30:00",
  "entities": [
    {
      "_type": "Investigation",
      "_parent_unique_id": null,
      "unique_id": "INV-001",
      "title": "My Investigation"
    },
    {
      "_type": "Study",
      "_parent_unique_id": "INV-001",
      "unique_id": "STU-001",
      "title": "Study One",
      "investigation_id": "INV-001"
    }
  ]
}
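Because every entity carries a `_parent_unique_id`, the flat `entities` list can be turned back into a hierarchy in a single pass. The sketch below parses an abbreviated copy of the example above using only the standard library; `children_by_parent` is a hypothetical helper, not part of metaseed.

```python
import json
from collections import defaultdict
from typing import Optional

# Abbreviated copy of the example file shown above.
RAW = """
{
  "name": "my-experiment",
  "profile": "miappe",
  "version": "1.2",
  "modified": "2024-01-15T10:30:00",
  "entities": [
    {"_type": "Investigation", "_parent_unique_id": null, "unique_id": "INV-001"},
    {"_type": "Study", "_parent_unique_id": "INV-001", "unique_id": "STU-001"}
  ]
}
"""


def children_by_parent(entities: list[dict]) -> dict[Optional[str], list[str]]:
    """Group entity IDs under their parent's ID; roots are keyed by None."""
    grouped: dict[Optional[str], list[str]] = defaultdict(list)
    for entity in entities:
        grouped[entity.get("_parent_unique_id")].append(entity["unique_id"])
    return dict(grouped)


dataset = json.loads(RAW)
tree = children_by_parent(dataset["entities"])

# Number of entities in the file, the figure that DatasetInfo.entity_count reports.
entity_count = len(dataset["entities"])
```

Here `tree[None]` holds the root entities and `tree["INV-001"]` holds the studies attached to that investigation.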
Design Principles¶
- Dependency Injection: Repository set at startup, consumed via DatasetManager
- Interface Segregation: Focused DatasetRepository interface
- Single Responsibility: Manager handles state integration, repository handles storage
- Open/Closed: New backends (database, S3, etc.) can be added without modifying existing code
- Backward Compatibility: Module functions preserve existing API