Profile Design Approaches

When creating profiles for scientific databases, several design decisions affect usability, maintainability, and semantic accuracy.

Combining Profiles: Naive vs Elaborate Merge

When experiments span multiple domains (e.g., phenotyping + proteomics), users need combined profiles.

Option 1: Naive Merge (Duplication)

Simply include all entities from both profiles, accepting duplication:

# Combined profile with duplicated entities
entities:
  # From MIAPPE
  Investigation:
    fields: [...]
  Study:
    fields: [...]

  # From ISA (duplicates Investigation, Study)
  Investigation:  # Conflict!
    fields: [...]

Pros:
- Simple to implement
- Each profile remains intact
- No semantic interpretation needed

Cons:
- Name conflicts require resolution
- Redundant storage
- Users must know which of the duplicated entities to use in which context
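The name-conflict problem can be made concrete with a minimal sketch. It assumes a hypothetical in-memory form of a profile (a dict mapping entity names to field names) and a hypothetical prefixing rule for clashing names; neither is taken from metaseed's code, and the field lists are illustrative only.

```python
# Hypothetical in-memory profiles: entity name -> list of field names.
miappe = {
    "Investigation": ["identifier", "title", "description", "location"],
    "Study": ["identifier", "title"],
}
isa = {
    "Investigation": ["identifier", "title", "description", "ontology_sources"],
    "Assay": ["file_name"],
}


def find_name_conflicts(first, second):
    """Return the entity names defined in both profiles, sorted."""
    return sorted(set(first) & set(second))


def naive_merge(first, second, first_label, second_label):
    """Keep every entity; prefix clashing names with the profile label."""
    conflicts = set(find_name_conflicts(first, second))
    merged = {}
    for label, profile in ((first_label, first), (second_label, second)):
        for name, fields in profile.items():
            key = f"{label}_{name}" if name in conflicts else name
            merged[key] = list(fields)
    return merged


combined = naive_merge(miappe, isa, "MIAPPE", "ISA")
print(find_name_conflicts(miappe, isa))  # ['Investigation']
print(sorted(combined))
```

Prefixing keeps both definitions intact, but it leaves the user to choose between `MIAPPE_Investigation` and `ISA_Investigation`, which is the usability cost listed above.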

Option 2: Intelligent Merge

Analyze profiles semantically and merge compatible entities:

# Merged Investigation combining both
Investigation:
  fields:
    - name: identifier      # From both
    - name: title           # From both
    - name: description     # From both
    - name: ontology_sources # ISA-specific
    - name: location        # MIAPPE-specific

Pros:
- Single coherent model
- Reduced redundancy
- Cross-domain experiments fit naturally

Cons:
- Complex merge logic
- Semantic decisions may be wrong
- Field conflicts need a resolution strategy
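As a rough illustration of the simplest possible entity merge, the sketch below unions two field lists by name and records which profile contributed each field. The function name and the list-of-names representation are assumptions made for this example, not metaseed's actual merge logic.

```python
def merge_entity(first_fields, second_fields, first_label, second_label):
    """Union two field lists by name, recording the source profile(s) of each field."""
    merged = {}
    for label, fields in ((first_label, first_fields), (second_label, second_fields)):
        for name in fields:
            merged.setdefault(name, []).append(label)
    return merged


miappe_investigation = ["identifier", "title", "description", "location"]
isa_investigation = ["identifier", "title", "description", "ontology_sources"]

merged = merge_entity(miappe_investigation, isa_investigation, "MIAPPE", "ISA")
for name, sources in merged.items():
    print(f"{name}: {', '.join(sources)}")
```

Matching on field name alone is exactly where "semantic decisions may be wrong": two profiles can use the same name for different concepts, or different names for the same one, so a real merge would need more than string equality.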

Current Implementation

Metaseed supports both via the compare and merge CLI commands:

# Compare to see differences
metaseed compare miappe/1.1 isa/1.0

# Merge with strategy
metaseed merge miappe/1.1 isa/1.0 --strategy most-restrictive

Merge strategies:
- first-wins: Keep the first profile's definition on conflict
- last-wins: Keep the last profile's definition on conflict
- most-restrictive: Use the stricter validation rule
- least-restrictive: Use the looser validation rule
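To show how the four strategies differ on a single conflict, here is a small sketch that resolves a disagreement over a field's `required` flag. This is not metaseed's implementation: the function is hypothetical, and reading "stricter" as "required if either profile requires it" is an assumption made only for illustration.

```python
def resolve_required(first, second, strategy):
    """Resolve a conflict on a field's `required` flag under a named strategy."""
    if strategy == "first-wins":
        return first
    if strategy == "last-wins":
        return second
    if strategy == "most-restrictive":
        # Assumed meaning: required if either profile requires the field.
        return first or second
    if strategy == "least-restrictive":
        # Assumed meaning: required only if both profiles require the field.
        return first and second
    raise ValueError(f"unknown strategy: {strategy}")


# First profile requires the field, second profile does not.
for strategy in ("first-wins", "last-wins", "most-restrictive", "least-restrictive"):
    print(strategy, resolve_required(True, False, strategy))
```

Other kinds of constraint (value ranges, patterns, cardinality) would each need their own definition of "stricter", which is part of the complexity noted for the intelligent merge.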

Repository-Specific Profiles: Copy vs Adapt

When creating profiles for repositories like MetaboLights, PRIDE, or ENA, a key question is how closely to follow the original schema.

MetaboLights Case Study

MetaboLights uses ISA-Tab format with assay type specializations:
- Assay (base)
- NMRAssay (NMR-specific fields)
- LCMSAssay (LC-MS-specific fields)
- GCMSAssay (GC-MS-specific fields)

The original ISA-Tab handles this via naming conventions in assay files, not entity inheritance.

Question: Should metaseed support entity inheritance (extends)?

Arguments Against Inheritance

Complexity: Pydantic model generation becomes significantly more complex
Field conflicts: What happens when a child overrides a parent field with a different type?
Validation ambiguity: Do parent validation rules apply to children?
Graph visualization: How to show inheritance vs composition relationships?

Alternative: Composition

Instead of inheritance, use composition and shared fields:

# Each assay type is standalone
NMRAssay:
  fields:
    # Shared assay fields (duplicated but explicit)
    - name: file_name
    - name: study_id
      reference: Study.identifier
    - name: measurement_type

    # NMR-specific fields
    - name: pulse_sequence
    - name: magnetic_field_strength

Pros:
- Simple model
- Each entity is self-contained
- Clear what fields exist

Cons:
- Field duplication across similar entities
- Changes to "base" fields require updating all variants
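One way to contain the maintenance cost of duplicated fields is a consistency check that reports assay variants which have drifted from the shared set. The sketch below is only an illustration: the helper is hypothetical, and `column_type` is an invented LC-MS field used to build the example.

```python
# Shared assay fields, taken from the NMRAssay example above.
SHARED_ASSAY_FIELDS = {"file_name", "study_id", "measurement_type"}

# Hypothetical standalone entities: entity name -> list of field names.
entities = {
    "NMRAssay": [
        "file_name", "study_id", "measurement_type",
        "pulse_sequence", "magnetic_field_strength",
    ],
    # measurement_type was forgotten here; column_type is an invented field.
    "LCMSAssay": ["file_name", "study_id", "column_type"],
}


def missing_shared_fields(entities, shared):
    """Map each entity to the shared fields it lacks, omitting complete entities."""
    report = {}
    for name, fields in entities.items():
        missing = sorted(shared - set(fields))
        if missing:
            report[name] = missing
    return report


print(missing_shared_fields(entities, SHARED_ASSAY_FIELDS))
```

A check like this could run when a profile is loaded or in a test suite, keeping entities standalone while still catching variants that fall out of sync.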

Recommendation

For repository profiles:

Don't blindly copy: adapt the repository schema to metaseed's model
Prefer composition over inheritance
Use reference fields to establish relationships
Document deviations from the original schema

Open Questions

Should metaseed support a fields_from directive to include fields from another entity without inheritance?

NMRAssay:
  fields_from: Assay  # Include all Assay fields
  fields:
    - name: pulse_sequence  # Additional NMR-specific fields
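If such a directive were adopted, it could be expanded at load time into plain, standalone entities. The following sketch assumes a single level of `fields_from` and rejects redefined fields outright; the function and its error handling are hypothetical, since the directive itself is only a proposal.

```python
def expand_fields_from(entities):
    """Return entity name -> full field list, resolving one level of fields_from."""
    expanded = {}
    for name, spec in entities.items():
        own = list(spec.get("fields", []))
        source = spec.get("fields_from")
        if source is None:
            expanded[name] = own
            continue
        if source not in entities:
            raise KeyError(f"{name}: fields_from refers to unknown entity {source!r}")
        borrowed = list(entities[source].get("fields", []))
        # Refuse silent overrides so the result stays unambiguous.
        clashes = sorted(set(borrowed) & set(own))
        if clashes:
            raise ValueError(f"{name}: fields redefined from {source}: {clashes}")
        expanded[name] = borrowed + own
    return expanded


profile = {
    "Assay": {"fields": ["file_name", "study_id", "measurement_type"]},
    "NMRAssay": {"fields_from": "Assay", "fields": ["pulse_sequence"]},
}
expanded = expand_fields_from(profile)
print(expanded["NMRAssay"])
```

Because the copy happens before model generation, downstream code would still see self-contained entities, and forbidding overrides sidesteps the field-conflict question raised against inheritance.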

Should field groups be supported for reuse?

field_groups:
  common_assay_fields:
    - name: file_name
    - name: study_id
    - name: measurement_type

NMRAssay:
  include_groups: [common_assay_fields]
  fields:
    - name: pulse_sequence
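Field groups could be resolved in much the same way. This sketch assumes the `field_groups` and `include_groups` keys shown above are parsed into plain dicts; the helper is hypothetical, and de-duplicating repeated names is a design choice rather than stated behavior.

```python
def expand_groups(field_groups, entity):
    """Return the entity's field list with included groups placed first."""
    fields = []
    for group in entity.get("include_groups", []):
        if group not in field_groups:
            raise KeyError(f"unknown field group: {group!r}")
        fields.extend(field_groups[group])
    fields.extend(entity.get("fields", []))
    # dict.fromkeys drops repeated names while keeping first-seen order.
    return list(dict.fromkeys(fields))


field_groups = {
    "common_assay_fields": ["file_name", "study_id", "measurement_type"],
}
nmr_assay = {
    "include_groups": ["common_assay_fields"],
    "fields": ["pulse_sequence"],
}
nmr_fields = expand_groups(field_groups, nmr_assay)
print(nmr_fields)
```

Unlike `fields_from`, a group is not itself an entity, so it never appears in the generated model or in the graph visualization.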

Is there value in maintaining strict 1:1 mapping with repository schemas, or should metaseed profiles be optimized for the metaseed data model?