ENA v1.0
ENA (European Nucleotide Archive) is one of the primary repositories for nucleotide sequence data, operated by EMBL-EBI. The ENA metadata model defines the structure for submitting sequencing projects, samples, experiments, and data files to the archive.
The model is hierarchical: Studies contain Samples and Experiments, Experiments contain Runs with sequence data files, and Analyses contain derived data products. All entities receive ENA accession numbers upon submission.
flowchart TB
subgraph core["Core Objects"]
STU[Study]
SAM[Sample]
EXP[Experiment]
RUN[Run]
ANA[Analysis]
end
subgraph files["Data Files"]
FILE[File]
end
subgraph attributes["Attributes"]
SA[SampleAttribute]
EA[ExperimentAttribute]
RA[RunAttribute]
AA[AnalysisAttribute]
PL[ProjectLink]
end
%% Core relationships
STU --> SAM
STU --> EXP
STU --> ANA
EXP --> RUN
%% References
EXP -.->|study_ref| STU
EXP -.->|sample_ref| SAM
RUN -.->|experiment_ref| EXP
ANA -.->|study_ref| STU
%% File relationships
RUN --> FILE
ANA --> FILE
%% Attribute relationships
STU --> PL
SAM --> SA
EXP --> EA
RUN --> RA
ANA --> AA
classDef core fill:#e8f5e9,stroke:#4caf50,stroke-width:2px
classDef file fill:#fff3e0,stroke:#ff9800,stroke-width:2px
classDef attr fill:#e3f2fd,stroke:#2196f3,stroke-width:2px
class STU,SAM,EXP,RUN,ANA core
class FILE file
class SA,EA,RA,AA,PL attr
Entities
| Category | Entities |
|---|---|
| Core Objects | Study, Sample, Experiment, Run, Analysis |
| Data Files | File |
| Attributes | SampleAttribute, ExperimentAttribute, RunAttribute, AnalysisAttribute, ProjectLink |
Key Concepts
Study (Project): The top-level container that groups related submissions and controls release dates. Studies receive a BioProject accession (PRJEB) or legacy study accession (ERP). All data under a study is released together.
Sample: Describes the biological source material being sequenced. Samples are linked to NCBI taxonomy and annotated using ENA checklists (standardized attribute sets). Samples receive BioSample (SAMEA) or sample (ERS) accessions.
Attributes such as collection_date and geographic_location_country are checklist-level requirements (e.g. checklist ERC000011), not properties of the base Sample. They are therefore optional on the Sample entity, so valid public ENA/DDBJ records that omit them can be imported. Apply checklist-conditional requirements via the checklist field rather than expecting these attributes on every Sample.
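A minimal sketch of how checklist-conditional requirements could be checked while keeping the attributes optional on the base Sample. The `REQUIRED_BY_CHECKLIST` mapping and `missing_checklist_fields` function are hypothetical names, not part of metaseed, and the mapping lists only the two attributes mentioned above rather than the full checklist.

```python
from typing import List, Mapping, Optional

# Hypothetical mapping for illustration: checklist ID -> attributes it requires.
# Only the two attributes named in the text are listed; real checklists define more.
REQUIRED_BY_CHECKLIST = {
    "ERC000011": ("collection_date", "geographic_location_country"),
}


def missing_checklist_fields(sample: Mapping[str, object],
                             checklist: Optional[str]) -> List[str]:
    """Return checklist-required attributes that are absent or empty on the sample."""
    required = REQUIRED_BY_CHECKLIST.get(checklist, ()) if checklist else ()
    return [name for name in required if not sample.get(name)]


base_sample = {"alias": "sample-001", "taxon_id": 3702}

# No checklist declared: nothing beyond the base Sample is required.
print(missing_checklist_fields(base_sample, None))

# Checklist declared: its attributes become required.
print(missing_checklist_fields(base_sample, "ERC000011"))
```

With this shape, a record imported without a checklist passes, while the same record validated against ERC000011 reports the two missing attributes.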
Experiment: Describes the sequencing library and platform. Key fields include:
- library_strategy: Sequencing approach (WGS, RNA-Seq, ChIP-Seq, etc.)
- library_source: Material type (GENOMIC, TRANSCRIPTOMIC, METAGENOMIC)
- library_selection: Selection method (RANDOM, PCR, PolyA)
- platform: Sequencing platform (ILLUMINA, OXFORD_NANOPORE, PACBIO_SMRT)
Run: Contains the actual sequence data files (FASTQ, BAM, CRAM). Each run has associated checksums for data integrity verification.
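The checksum attached to a run's data file can be produced with Python's standard `hashlib` module. The sketch below is one way to do it; `md5_of_file` is a hypothetical helper name, not a metaseed function, and the file is read in chunks so that large FASTQ or BAM files do not need to fit in memory.

```python
import hashlib
import tempfile
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (end of file).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Demonstration on a throwaway file; a real submission would point at the data file.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.fastq"
    demo.write_bytes(b"abc")
    print(md5_of_file(demo))  # 900150983cd24fb0d6963f7d28e17f72
```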
Analysis: Contains secondary analysis results derived from runs, such as assemblies, variant calls, or annotations.
Accession Formats
| Object | Accession Format | Example |
|---|---|---|
| Study | PRJEB or ERP | PRJEB12345, ERP123456 |
| Sample | ERS or SAMEA | ERS123456, SAMEA123456 |
| Experiment | ERX | ERX123456 |
| Run | ERR | ERR123456 |
| Analysis | ERZ | ERZ123456 |
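The prefixes in the table can be expressed as regular expressions. This is a sketch based only on the prefixes shown here: `ACCESSION_PATTERNS` and `is_valid_accession` are hypothetical names, and accepting any number of digits after the prefix is an assumption, so the patterns used by the actual profile may be stricter.

```python
import re

# Assumption: a prefix from the table followed by one or more digits.
ACCESSION_PATTERNS = {
    "Study": re.compile(r"(?:PRJEB|ERP)\d+"),
    "Sample": re.compile(r"(?:ERS|SAMEA)\d+"),
    "Experiment": re.compile(r"ERX\d+"),
    "Run": re.compile(r"ERR\d+"),
    "Analysis": re.compile(r"ERZ\d+"),
}


def is_valid_accession(object_type, accession):
    """Return True if the accession matches the pattern for its object type."""
    pattern = ACCESSION_PATTERNS.get(object_type)
    # fullmatch() rejects trailing characters such as "ERR123456x".
    return pattern is not None and pattern.fullmatch(accession) is not None


print(is_valid_accession("Study", "PRJEB12345"))  # True
print(is_valid_accession("Run", "ERX123456"))     # False: ERX is an experiment prefix
```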
Controlled Vocabularies
ENA uses controlled vocabularies for key fields:
Library Strategy (37 terms): WGS, WXS, RNA-Seq, ChIP-Seq, ATAC-seq, Hi-C, Bisulfite-Seq, etc.
Library Source (9 terms): GENOMIC, TRANSCRIPTOMIC, METAGENOMIC, SYNTHETIC, etc.
Library Selection (31 terms): RANDOM, PCR, PolyA, ChIP, MNase, Hybrid Selection, etc.
Platform (14 terms): ILLUMINA, OXFORD_NANOPORE, PACBIO_SMRT, ION_TORRENT, etc.
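A vocabulary check amounts to set membership. The sketch below uses partial vocabularies containing only the terms listed above, not the full term lists; `VOCAB_SUBSETS` and `vocabulary_errors` are hypothetical names, and treating terms as case-sensitive is an assumption.

```python
# Partial vocabularies for illustration: only the terms listed in the text,
# not the complete ENA term lists.
VOCAB_SUBSETS = {
    "library_source": frozenset({"GENOMIC", "TRANSCRIPTOMIC", "METAGENOMIC", "SYNTHETIC"}),
    "platform": frozenset({"ILLUMINA", "OXFORD_NANOPORE", "PACBIO_SMRT", "ION_TORRENT"}),
}


def vocabulary_errors(record):
    """Return one message per field whose value is outside its (partial) vocabulary."""
    errors = []
    for field, allowed in VOCAB_SUBSETS.items():
        value = record.get(field)
        # Fields that are absent are skipped; only present values are checked.
        if value is not None and value not in allowed:
            errors.append(f"{field}: {value!r} is not a recognised term")
    return errors


print(vocabulary_errors({"library_source": "GENOMIC", "platform": "Illumina"}))
# ["platform: 'Illumina' is not a recognised term"]
```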
Validation Rules
The ENA profile includes validation rules for:
- Accession format patterns (PRJEB, ERS, ERX, ERR, ERZ)
- Taxon ID must be a positive integer
- MD5 checksum format (32 hexadecimal characters)
- Library vocabulary constraints
- Coordinate ranges for geographic metadata
- Reference integrity between objects
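Three of these rules are simple enough to sketch as standalone predicates. The function names are hypothetical and do not reflect how the profile implements its rules; the coordinate check assumes decimal degrees with latitude in [-90, 90] and longitude in [-180, 180].

```python
import re

# 32 hexadecimal characters, as stated in the rule list.
MD5_RE = re.compile(r"[0-9a-fA-F]{32}")


def is_valid_taxon_id(value):
    """A taxon ID must be an int greater than zero (bool is excluded explicitly)."""
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def is_valid_md5(value):
    """An MD5 checksum must be a string of exactly 32 hex characters."""
    return isinstance(value, str) and MD5_RE.fullmatch(value) is not None


def is_valid_coordinate(latitude, longitude):
    """Assumption: decimal degrees within the usual geographic bounds."""
    return -90.0 <= latitude <= 90.0 and -180.0 <= longitude <= 180.0


print(is_valid_taxon_id(3702))                             # True
print(is_valid_md5("d41d8cd98f00b204e9800998ecf8427e"))    # True
print(is_valid_coordinate(91.0, 13.4))                     # False
```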
Use Cases
- Genome sequencing: Whole genome sequencing projects
- Transcriptomics: RNA-Seq and single-cell RNA-Seq studies
- Metagenomics: Environmental and microbiome sequencing
- Epigenomics: ChIP-Seq, ATAC-seq, bisulfite sequencing
- Assemblies: Genome and transcriptome assembly submission
- Variant data: SNP and structural variant submissions
Entity-Relationship Diagram
erDiagram
Study {
string alias PK
string accession
string title
string description
string study_type
string center_name
}
Sample {
string alias PK
string accession
string title
integer taxon_id
string scientific_name
string collection_date
string geographic_location_country
}
Experiment {
string alias PK
string accession
string study_ref FK
string sample_ref FK
string library_strategy
string library_source
string library_selection
string platform
string instrument_model
}
Run {
string alias PK
string accession
string experiment_ref FK
date run_date
}
Analysis {
string alias PK
string accession
string study_ref FK
string title
string analysis_type
}
File {
string filename PK
string filetype
string checksum_method
string checksum
}
SampleAttribute {
string tag
string value
string units
}
ProjectLink {
string db
string id
string label
}
Study ||--o{ Sample : samples
Study ||--o{ Experiment : experiments
Study ||--o{ Analysis : analyses
Study ||--o{ ProjectLink : project_links
Experiment ||--o{ Run : runs
Experiment }o--|| Study : study_ref
Experiment }o--|| Sample : sample_ref
Run }o--|| Experiment : experiment_ref
Run ||--|{ File : files
Analysis }o--|| Study : study_ref
Analysis ||--|{ File : files
Sample ||--o{ SampleAttribute : sample_attributes
References
| Resource | URL |
|---|---|
| ENA Homepage | https://www.ebi.ac.uk/ena/browser/home |
| ENA Submission Guide | https://ena-docs.readthedocs.io/ |
| ENA Metadata Model | https://ena-docs.readthedocs.io/en/latest/submit/general-guide/metadata.html |
| Webin XML Schemas | https://github.com/enasequence/webin-xml |
| ENA Checklists | https://www.ebi.ac.uk/ena/browser/checklists |
Usage
from metaseed import ena
e = ena()
# Create Study
study = e.Study(
    alias="my-study-001",
    title="Whole genome sequencing of Arabidopsis thaliana ecotypes",
    description="Genomic variation across 100 A. thaliana ecotypes"
)
# Create Sample
sample = e.Sample(
    alias="sample-001",
    title="Arabidopsis thaliana Col-0 leaf tissue",
    taxon_id=3702,
    scientific_name="Arabidopsis thaliana",
    collection_date="2024-06-15",
    geographic_location_country="Germany"
)
# Create Experiment (references the study and sample by alias)
experiment = e.Experiment(
    alias="experiment-001",
    study_ref="my-study-001",
    sample_ref="sample-001",
    library_strategy="WGS",
    library_source="GENOMIC",
    library_selection="RANDOM",
    library_layout="PAIRED",
    platform="ILLUMINA",
    instrument_model="Illumina NovaSeq 6000"
)
# Create Run (references the experiment by alias)
run = e.Run(
    alias="run-001",
    experiment_ref="experiment-001",
    run_date="2024-06-20"
)
# Create File describing one data file and its checksum
file = e.File(
    filename="sample1_R1.fastq.gz",
    filetype="fastq",
    checksum_method="MD5",
    checksum="d41d8cd98f00b204e9800998ecf8427e"
)
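The objects in this example refer to each other by alias (`study_ref`, `sample_ref`, `experiment_ref`). A reference integrity check over such aliases might look like the sketch below. It works on plain dictionaries rather than metaseed objects, since this page does not show how fields are read back from them, and `dangling_references` is a hypothetical helper name.

```python
def dangling_references(study_aliases, sample_aliases, experiments, runs):
    """Return (alias, field, target) for every reference that does not resolve."""
    experiment_aliases = {exp["alias"] for exp in experiments}
    problems = []
    for exp in experiments:
        if exp["study_ref"] not in study_aliases:
            problems.append((exp["alias"], "study_ref", exp["study_ref"]))
        if exp["sample_ref"] not in sample_aliases:
            problems.append((exp["alias"], "sample_ref", exp["sample_ref"]))
    for run in runs:
        if run["experiment_ref"] not in experiment_aliases:
            problems.append((run["alias"], "experiment_ref", run["experiment_ref"]))
    return problems


# Same aliases as the usage example, plus one run pointing at a missing experiment.
experiments = [
    {"alias": "experiment-001", "study_ref": "my-study-001", "sample_ref": "sample-001"},
]
runs = [
    {"alias": "run-001", "experiment_ref": "experiment-001"},
    {"alias": "run-002", "experiment_ref": "experiment-999"},
]
print(dangling_references({"my-study-001"}, {"sample-001"}, experiments, runs))
# [('run-002', 'experiment_ref', 'experiment-999')]
```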