Metadata Extraction Agent¶
Overview¶
An AI-powered agent that helps users fill in metadata by analyzing source files and mapping them to a selected metadata profile. The agent is exposed via an MCP (Model Context Protocol) server, so it can be used with Claude Desktop or other MCP-compatible clients.
Goals¶
- Reduce manual data entry - automatically extract metadata from existing files
- Profile-aware extraction - understand what entities and fields the selected profile expects
- Conversational workflow - MCP client (Claude Desktop) handles user interaction
- Reusable architecture - core library usable via MCP, REST API, CLI, or embedded in other tools
User Workflow¶
- User connects Claude Desktop to the metaseed MCP server
- User asks Claude to help extract metadata from their files
- Claude uses MCP tools to:
  - List available profiles
  - Parse user's source files
  - Analyze column mappings
  - Extract entity instances
  - Validate extracted data
  - Export to YAML/JSON
- Claude guides the user through confirming mappings and fixing validation errors
- Final metadata is exported and ready for use
Architecture¶
┌─────────────────────────────────────────┐
│         MCP Server (interface)          │
│  Tools: analyze_file, extract_entities  │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────┴───────────────────────┐
│          Core Library (logic)           │
│     ExtractionContext, EntityMapper     │
└─────────────────┬───────────────────────┘
                  │
┌─────────────────┴───────────────────────┐
│         Parsers (file handling)         │
│            CSV, JSON, Excel             │
└─────────────────────────────────────────┘
Key principle: The core library has no MCP dependencies; the MCP server only wraps it. This lets a future REST API or web UI reuse the same core.
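The layering can be illustrated with a small sketch. This is not metaseed's actual code: `CoreService`, `mcp_tool_list_profiles`, and `cli_list_profiles` are hypothetical names, and the profile data is hard-coded for the example. The point is only that the core object imports nothing transport-specific, while each interface is a thin adapter over it.

```python
from dataclasses import dataclass, field


@dataclass
class CoreService:
    """Hypothetical core object: plain Python, no transport imports."""

    # Hard-coded for illustration; a real core would load profile specs.
    profiles: dict[str, list[str]] = field(
        default_factory=lambda: {"miappe": ["1.2"]}
    )

    def list_profiles(self) -> list[dict]:
        return [
            {"name": name, "versions": versions}
            for name, versions in sorted(self.profiles.items())
        ]


def mcp_tool_list_profiles(core: CoreService) -> dict:
    """Thin adapter an MCP tool handler could delegate to (hypothetical)."""
    return {"profiles": core.list_profiles()}


def cli_list_profiles(core: CoreService) -> str:
    """A second adapter over the same core, e.g. for a CLI (hypothetical)."""
    return "\n".join(
        f"{p['name']}: {', '.join(p['versions'])}" for p in core.list_profiles()
    )


core = CoreService()
print(mcp_tool_list_profiles(core))
print(cli_list_profiles(core))
```

Because both adapters call the same `list_profiles` method, adding another interface would not require touching the core.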
Components¶
1. Extraction Core (src/metaseed/agent/core.py)¶
Central orchestration logic:
class ExtractionContext:
    """Manages an extraction session."""

    profile: ProfileSpec
    sources: list[ParsedContent]
    mappings: dict[str, ColumnMapping]
    extracted: dict[str, list[dict]]

    @classmethod
    def from_profile(cls, profile_name: str, version: str) -> ExtractionContext:
        """Create context from profile name and version."""

    def add_source(self, path: Path) -> ParsedContent:
        """Parse and add a source file."""

    def suggest_mapping(self, source_index: int, entity_name: str) -> list[FieldMapping]:
        """Suggest column mappings for an entity."""

    def extract_entities(self, source_index: int, entity_name: str) -> ExtractionResult:
        """Extract entity instances from a source."""

    def validate_instance(self, data: dict, entity_name: str) -> list[ValidationError]:
        """Validate extracted data against entity spec."""

    def export_yaml(self, entity_name: str | None = None) -> str:
        """Export extracted entities to YAML."""

    def export_json(self, entity_name: str | None = None) -> str:
        """Export extracted entities to JSON."""
2. Column Mapping (src/metaseed/agent/mapping.py)¶
Maps source file columns to entity fields:
class FieldMapping:
    """Mapping from a source column to an entity field."""

    field_name: str
    source_column: str | None
    confidence: float
    default_value: str | None


class ColumnMapping:
    """Complete mapping configuration for extracting an entity."""

    entity_name: str
    fields: list[FieldMapping]
    source_table: str | int | None


def suggest_mapping(
    source_columns: list[str],
    entity_spec: EntitySpec,
    threshold: float = 0.5,
) -> list[FieldMapping]:
    """Suggest column-to-field mappings based on name similarity."""
3. File Parser Registry (src/metaseed/agent/parsers/)¶
Pluggable parsers for different file formats:
class ParserRegistry:
    """Registry of file format parsers."""

    def register(self, parser: FileParser) -> None:
        """Register a parser for a file type."""

    def parse(self, path: Path) -> ParsedContent:
        """Parse file using appropriate parser."""


class FileParser(Protocol):
    """Protocol for file parsers."""

    extensions: list[str]
    mime_types: list[str]

    def parse(self, path: Path) -> ParsedContent:
        """Parse file into structured content."""
Built-in parsers:
- CSVParser - CSV/TSV files with header detection and dialect sniffing
- JSONParser - JSON files (arrays or objects with array fields)
- ExcelParser - Excel workbooks (.xlsx, .xls) with multiple sheet support
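As a rough sketch of how a registry like this could dispatch on file extension, the example below implements a reduced version of the interfaces above with one CSV/TSV parser. It is an illustration under assumptions, not the project's implementation: `ParsedContent` is simplified to a single table, `mime_types` is omitted, and delimiter detection uses the standard library's `csv.Sniffer`.

```python
import csv
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class ParsedContent:
    """Simplified stand-in: one table of headers plus rows."""
    headers: list[str]
    rows: list[dict[str, str]]


class FileParser(Protocol):
    extensions: list[str]

    def parse(self, path: Path) -> ParsedContent: ...


class CSVParser:
    extensions = [".csv", ".tsv"]

    def parse(self, path: Path) -> ParsedContent:
        text = path.read_text(encoding="utf-8")
        # Sniff the delimiter from a sample; fall back to the default dialect.
        try:
            dialect = csv.Sniffer().sniff(text[:4096], delimiters=",\t;")
        except csv.Error:
            dialect = csv.excel
        reader = csv.DictReader(text.splitlines(), dialect=dialect)
        rows = list(reader)
        return ParsedContent(headers=list(reader.fieldnames or []), rows=rows)


class ParserRegistry:
    def __init__(self) -> None:
        self._by_ext: dict[str, FileParser] = {}

    def register(self, parser: FileParser) -> None:
        # One parser instance may serve several extensions.
        for ext in parser.extensions:
            self._by_ext[ext.lower()] = parser

    def parse(self, path: Path) -> ParsedContent:
        parser = self._by_ext.get(path.suffix.lower())
        if parser is None:
            raise ValueError(f"No parser registered for {path.suffix!r}")
        return parser.parse(path)


registry = ParserRegistry()
registry.register(CSVParser())

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "plants.tsv"
    sample.write_text("plant_id\tspecies\np1\tTriticum aestivum\n", encoding="utf-8")
    content = registry.parse(sample)

print(content.headers, content.rows)
```

Registering by extension keeps format support pluggable: a new format only needs a class with `extensions` and `parse`.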
4. MCP Server (src/metaseed/agent/mcp/server.py)¶
Exposes agent capabilities via Model Context Protocol:
Resources (read-only data)¶
| Resource URI | Description |
|---|---|
| profile://list | List all available profiles |
| profile://{name}/{version} | Profile schema overview |
| profile://{name}/{version}/entity/{entity} | Single entity definition |
Tools (actions)¶
File Extraction Tools:
| Tool | Description |
|---|---|
| list_profiles | List available profiles with versions |
| get_profile_schema | Get full profile schema with entities |
| parse_source_file | Parse a file and return structure |
| analyze_mapping | Suggest how file columns map to entity fields |
| extract_entities | Extract entity instances from file using mapping |
| validate_extracted | Validate extracted data against profile |
| export_metadata | Export extracted metadata to YAML/JSON |
Dataset Management Tools:
| Tool | Description |
|---|---|
| list_datasets | List all saved datasets |
| load_dataset | Load a dataset into the editor |
| save_dataset | Save current entities to a dataset |
| create_dataset | Create a new empty dataset with a profile |
Entity Management Tools:
| Tool | Description |
|---|---|
| list_entities | List all entities in the current dataset |
| get_entity_tree | Get hierarchical tree showing parent-child relationships |
| get_entity | Get a specific entity by node ID |
| create_entity | Create entity with optional parent_id (auto-fills parent references). On success returns hints (expected_children, typical_next, cross_ref_consumers); on validation errors returns identifier_field, field_types, valid_fields, required_fields, and a hint |
| batch_create | Create multiple entities in one operation with [{entity_type, data, parent_id?}, ...]. Each created result carries the same relational hints as create_entity |
| link_entity | Link existing entity as child of another |
| update_entity | Update an existing entity (auto-saves) |
| bulk_update_entities | Update multiple entities at once |
| delete_entity | Delete an entity (auto-saves) |
Building Hierarchies: Use parent_id when creating entities. The MCP server automatically:
1. Fills the child's reference field (e.g., Study.investigation_id from parent's unique_id)
2. Updates the parent's list field (e.g., Investigation.studies to include the child)
3. Saves the dataset
create_entity("Study", {"unique_id": "study-1", "title": "..."}, parent_id="<investigation_node_id>")
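The three automatic steps can be sketched in plain Python. This is a toy model, not the server's code: the in-memory `nodes` store, the `LINKS` table, and the generated node IDs are hypothetical, and it assumes the parent's list field stores child identifiers.

```python
from __future__ import annotations

# Hypothetical in-memory store keyed by node ID; not metaseed's data model.
nodes: dict[str, dict] = {}

# Assumed wiring for one parent/child pair, taken from the example above:
# (parent type, child type) -> (child reference field, parent list field)
LINKS = {("Investigation", "Study"): ("investigation_id", "studies")}


def create_entity(entity_type: str, data: dict, parent_id: str | None = None) -> str:
    node_id = f"node-{len(nodes) + 1}"
    record = {"type": entity_type, "data": dict(data)}
    if parent_id is not None:
        parent = nodes[parent_id]
        ref_field, list_field = LINKS[(parent["type"], entity_type)]
        # 1. Fill the child's reference from the parent's identifier.
        record["data"][ref_field] = parent["data"]["unique_id"]
        # 2. Append the child to the parent's list field.
        parent["data"].setdefault(list_field, []).append(data["unique_id"])
    nodes[node_id] = record
    # 3. A real server would persist the dataset at this point.
    return node_id


inv = create_entity("Investigation", {"unique_id": "inv-1", "title": "Test"})
study = create_entity("Study", {"unique_id": "st-1", "title": "Trial"}, parent_id=inv)
print(nodes[study]["data"]["investigation_id"])  # inv-1
print(nodes[inv]["data"]["studies"])             # ['st-1']
```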
Batch Creation: Create multiple entities in one call:
batch_create([
    {"entity_type": "Investigation", "data": {"unique_id": "inv-1", "title": "Test"}},
    {"entity_type": "Study", "data": {"unique_id": "st-1", "title": "Trial"}, "parent_id": "<inv_node_id>"}
])
Returns {total, created, failed, results} with detailed error info per entity.
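The result shape can be illustrated with a small sketch. Only the four top-level keys come from the description above; the per-item keys (`index`, `status`, `node_id`, `error`), the `create_one` helper, and the choice to continue after a failed item are assumptions made for the example.

```python
from __future__ import annotations


def create_one(entity_type: str, data: dict, parent_id: str | None = None) -> str:
    """Hypothetical single-entity creator that rejects records without a title."""
    if not data.get("title"):
        raise ValueError(f"{entity_type}: missing required field 'title'")
    return f"node-{data['unique_id']}"


def batch_create(items: list[dict]) -> dict:
    results = []
    for index, item in enumerate(items):
        try:
            node_id = create_one(item["entity_type"], item["data"], item.get("parent_id"))
            results.append({"index": index, "status": "created", "node_id": node_id})
        except ValueError as exc:
            # One bad item does not abort the rest of the batch (an assumption).
            results.append({"index": index, "status": "failed", "error": str(exc)})
    created = sum(1 for r in results if r["status"] == "created")
    return {
        "total": len(items),
        "created": created,
        "failed": len(items) - created,
        "results": results,
    }


summary = batch_create([
    {"entity_type": "Investigation", "data": {"unique_id": "inv-1", "title": "Test"}},
    {"entity_type": "Study", "data": {"unique_id": "st-1"}},
])
print(summary["total"], summary["created"], summary["failed"])
```

Reporting per-item outcomes lets a client retry only the failed entries instead of resubmitting the whole batch.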
Schema Discovery: Before creating entities, discover field requirements:
get_entity_fields("Investigation", "miappe", "1.2") # Full field info
get_required_fields("Study", "miappe", "1.2") # Just required field names
get_entity_template("Investigation", "miappe", "1.2") # Template with placeholders
Validation Tools:
| Tool | Description |
|---|---|
| validate_dataset | Validate all entities (same logic as UI - single source of truth) |
| validate_relationships | Flag relationship-completeness gaps the schema implies but does not enforce: empty list-references, entities of a referenced type that nothing points at, and containers with no child of a type they can hold. Derived from the profile's own nested_fields/reference_fields (profile-agnostic) |
| validate_ontology_terms | Check every filled ontology_term field against OLS (restricted to the field's declared ontologies) and report valid plus suggested terms. Fails open when OLS is unreachable |
| get_field_spec | Get field definitions, constraints, and ontology terms |
Schema Discovery Tools:
| Tool | Description |
|---|---|
| get_profile_relationships | Entity relationship map: per type, its identifier, child types (children), and cross_references (which entity/field each reference points at). Use before creating to build relational, not flat, datasets |
| get_example_dataset | One instance of every entity type with required fields filled and every reference wired to the matching example - a correct relational pattern to copy. Values are placeholders; built from the profile schema (any spec) |
| get_entity_fields | Get all fields for an entity type with (entity_type, profile, version) |
| get_required_fields | Get list of required field names for an entity type |
| get_entity_template | Get template with placeholder values by field type |
get_entity_fields, get_required_fields, and get_entity_template all return
the entity's identifier_field, plus a note for types that break the usual
unique_id convention (e.g. Person keys on name). On validation failure,
create_entity likewise reports identifier_field, field_types, and a
corrective hint when an id alias such as unique_id is sent to the wrong type.
Each error detail also carries the field's description and constraints
(pattern, range, enum), so a format error explains the expected shape.
The validate_dataset tool uses the same validation as the web UI. It checks:
- Required fields
- Type constraints (from Pydantic models)
- Custom validation rules (from profile spec)
The result also includes a rules_checked array showing which rules were applied.
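To make the idea of a result that reports its applied rules concrete, here is a minimal sketch in plain Python. It does not use Pydantic or the real profile spec: the `SPEC` structure, the rule names, and the result keys other than `rules_checked` are hypothetical.

```python
import re

# Hypothetical spec fragment: required fields plus one custom pattern rule.
SPEC = {
    "required": ["unique_id", "title"],
    "rules": [
        {"name": "unique_id_pattern", "field": "unique_id", "pattern": r"^[a-z0-9-]+$"},
    ],
}


def validate(entity: dict, spec: dict) -> dict:
    errors = []
    rules_checked = []
    # Required-field checks: record each rule even when it passes.
    for field in spec["required"]:
        rules_checked.append(f"required:{field}")
        if entity.get(field) in (None, ""):
            errors.append({"field": field, "message": "required field is missing"})
    # Custom rules: here only regex patterns, skipped when the value is absent.
    for rule in spec["rules"]:
        rules_checked.append(rule["name"])
        value = entity.get(rule["field"])
        if value is not None and not re.fullmatch(rule["pattern"], str(value)):
            errors.append(
                {"field": rule["field"], "message": f"does not match {rule['pattern']}"}
            )
    return {"valid": not errors, "errors": errors, "rules_checked": rules_checked}


report = validate({"unique_id": "Study 1"}, SPEC)
print(report["valid"], report["rules_checked"])
```

Listing the rules that ran, not only the ones that failed, lets a client distinguish "passed" from "never checked".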
Prompts (reusable templates)¶
| Prompt | Description |
|---|---|
| extraction_guide | Step-by-step instructions for extraction |
| field_mapping_help | Help for adjusting column mappings |
Data Flow¶
1. Parse Source File¶
User provides file path
         │
         ▼
┌──────────────────┐
│ ParserRegistry   │
│ selects parser   │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Parser extracts  │
│ tables, headers, │
│ and rows         │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Return           │
│ ParsedContent    │
└──────────────────┘
2. Analyze Mapping¶
Source columns + Entity spec
         │
         ▼
┌──────────────────┐
│ Normalize names  │
│ (snake_case,     │
│ camelCase, etc)  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Compute          │
│ similarity       │
│ scores           │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Suggest mappings │
│ with confidence  │
└──────────────────┘
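The normalize, score, and suggest steps can be sketched as follows. The source does not say which similarity measure is used, so `difflib.SequenceMatcher` stands in for it here; the `normalize` helper and the dictionary result shape are likewise assumptions, with only the 0.5 default threshold taken from the signature shown earlier.

```python
import re
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    # Split camelCase, then collapse any separators to single underscores.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def suggest_mapping(source_columns, field_names, threshold=0.5):
    suggestions = []
    for field in field_names:
        best_column, best_score = None, 0.0
        for column in source_columns:
            score = SequenceMatcher(None, normalize(column), normalize(field)).ratio()
            if score > best_score:
                best_column, best_score = column, score
        # Below the threshold, report no column rather than a weak guess.
        if best_score < threshold:
            best_column = None
        suggestions.append(
            {"field_name": field, "source_column": best_column, "confidence": best_score}
        )
    return suggestions


result = suggest_mapping(["Study Title", "Unique-ID", "notes"], ["unique_id", "title"])
for item in result:
    print(item)
```

Normalizing first means `Unique-ID`, `uniqueId`, and `unique_id` all compare as the same string, so the similarity score reflects naming differences rather than formatting.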
3. Extract Entities¶
Mapping + Source table
         │
         ▼
┌──────────────────┐
│ For each row:    │
│ - Apply mapping  │
│ - Convert types  │
└────────┬─────────┘
         │
         ▼
┌──────────────────┐
│ Return instances │
│ and any errors   │
└──────────────────┘
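A minimal sketch of the row loop, assuming a mapping of field name to source column and a per-field converter table. The function signature, the error record shape, and the `plot_count` field are hypothetical; the sketch also assumes a row with a conversion error is reported and left out of the instances.

```python
def extract_entities(rows, mapping, converters):
    """Apply a field->column mapping to each row and convert types."""
    instances, errors = [], []
    for row_number, row in enumerate(rows, start=1):
        instance = {}
        failed = False
        for field, column in mapping.items():
            raw = row.get(column, "")
            try:
                # Fields without a converter are kept as strings.
                instance[field] = converters.get(field, str)(raw)
            except ValueError as exc:
                errors.append({"row": row_number, "field": field, "message": str(exc)})
                failed = True
        if not failed:
            instances.append(instance)
    return instances, errors


rows = [
    {"Study Title": "Trial 1", "Plots": "12"},
    {"Study Title": "Trial 2", "Plots": "twelve"},
]
mapping = {"title": "Study Title", "plot_count": "Plots"}
instances, errors = extract_entities(rows, mapping, {"plot_count": int})
print(instances)
print(errors)
```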
CLI Usage¶
Start the MCP server:
# For Claude Desktop (stdio transport)
metaseed mcp
# For debugging (HTTP transport)
metaseed mcp --transport http --port 8000
Claude Desktop Configuration¶
Add to ~/Library/Application Support/Claude/claude_desktop_config.json:
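As a hedged sketch of what such an entry could look like, the script below prints an `mcpServers` block for Claude Desktop. It assumes the server is launched with the same `uv` command shown in the Claude Code setup in this document; `/path/to/metaseed` is a placeholder for the local checkout, and the exact entry the project recommends may differ.

```python
import json

# Assumption: same launch command as the Claude Code setup in this document.
# /path/to/metaseed is a placeholder, not a real path.
config = {
    "mcpServers": {
        "metaseed": {
            "command": "uv",
            "args": ["--directory", "/path/to/metaseed", "run", "metaseed", "mcp"],
        }
    }
}

# Print the JSON so it can be merged into the config file by hand.
print(json.dumps(config, indent=2))
```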
Example Usage with Claude Desktop¶
User: "I have a CSV file with experiment data. Help me extract MIAPPE metadata."
Claude (using MCP tools):
1. Calls list_profiles → shows MIAPPE available
2. Calls parse_source_file with the CSV → sees columns
3. Calls get_profile_schema for MIAPPE → understands entities
4. Calls analyze_mapping → suggests column mappings
5. Asks user to confirm/adjust mappings
6. Calls extract_entities → extracts instances
7. Calls validate_extracted → checks for errors
8. Calls export_metadata → saves to file
File Structure¶
src/metaseed/agent/
├── __init__.py # Public API exports
├── core.py # ExtractionContext, extraction logic
├── mapping.py # Column-to-field mapping
├── questions.py # Question types (for future interactive use)
├── parsers/
│ ├── __init__.py
│ ├── registry.py # ParserRegistry, ParsedContent, ParsedTable
│ ├── csv.py # CSV/TSV parser
│ ├── json.py # JSON parser
│ └── excel.py # Excel parser
├── llm/ # LLM abstraction (for future direct LLM use)
│ ├── __init__.py
│ ├── base.py
│ └── anthropic.py
└── mcp/
├── __init__.py
└── server.py # MCP server implementation
Dependencies¶
Required:
- mcp>=1.0.0 - MCP server SDK
- pydantic>=2.0 - Data validation
- openpyxl>=3.1 - Excel parsing
- pyyaml>=6.0 - YAML export
Testing¶
Run agent tests:
Tests cover:
- File parsing (CSV, JSON, Excel)
- Column mapping suggestions
- Entity extraction
- MCP server tools and resources
Claude Code Integration¶
The MCP server can be used with Claude Code for AI-assisted metadata extraction.
Setup via CLI (recommended)¶
Add metaseed to your user config (available in all projects):
claude mcp add --scope user --transport stdio metaseed -- \
uv --directory /path/to/metaseed run metaseed mcp
Or add to a specific project only:
claude mcp add --scope project --transport stdio metaseed -- \
uv --directory /path/to/metaseed run metaseed mcp
Manual Setup¶
User scope - add to ~/.claude.json:
{
  "mcpServers": {
    "metaseed": {
      "command": "uv",
      "args": ["--directory", "/path/to/metaseed", "run", "metaseed", "mcp"]
    }
  }
}
Project scope - add .mcp.json to project root:
{
  "mcpServers": {
    "metaseed": {
      "command": "uv",
      "args": ["--directory", "/path/to/metaseed", "run", "metaseed", "mcp"]
    }
  }
}
Usage¶
After restarting Claude Code, ask:
- "List available metadata profiles"
- "Parse data.csv and show me the columns"
- "Suggest MIAPPE mappings for this CSV"
- "Extract Investigation entities from my file"
- "Validate the extracted data"
- "Export as YAML"
Web UI Integration¶
The metaseed web app includes an MCP toggle button in the header:
- Start the app: make dev (runs at http://127.0.0.1:8765)
- Click the MCP button to start/stop the server
- Green dot = running at http://127.0.0.1:8001
- External MCP clients can connect via HTTP transport
State Sharing: The UI and MCP server share the same state. Changes made via MCP are reflected in the UI after refreshing the browser.
Claude Desktop HTTP Config (for use with UI):
Creating Datasets via MCP¶
When creating hierarchical entities, use parent_id to link child entities to parents:
# 1. Create the Investigation first
create_entity("Investigation", {"unique_id": "inv-001", "title": "My Study"})
# 2. Create Study with parent_id - investigation_id is auto-filled
create_entity("Study", {"unique_id": "study-001", "title": "Trial 1"}, parent_id="<inv_node_id>")
The MCP server automatically:
- Fills the child's reference field (e.g., Study.investigation_id) from the parent's identifier
- Updates the parent's list field (e.g., Investigation.studies) to include the child
- Saves the dataset after each entity creation
Ask Claude to create a complete dataset:
Create a new MIAPPE dataset called "wheat-experiment" with:
Investigation:
- Unique ID: "wheat-drought-2024"
- Title: "Wheat Drought Stress Response Study"
Study (as child of Investigation):
- Unique ID: "field-trial-2024"
- Title: "Field Trial 2024"
Then validate the dataset and show me the entity tree.
After MCP creates entities, refresh your browser to see them in the UI graph.
Testing MCP with Claude Code¶
- Start the web UI: make dev
- Click the MCP button to start the MCP server (green dot = running)
- Connect Claude Code to the MCP server (see setup above)
- Test with commands like:
  - "List available metadata profiles"
  - "Create a MIAPPE dataset called test-dataset"
  - "Show me the entity tree"
After code changes:
- Click MCP button off then on to restart the server with new code
- Or run uv pip install -e . and restart the MCP server
Future Enhancements¶
- REST API - HTTP endpoints for web integration
- Direct LLM integration - Use LLM for intelligent field mapping
- Additional parsers - PDF, XML, text pattern extraction
- Batch processing - Process multiple files at once
- Persistent sessions - Save and resume extraction sessions