Curation

MATLAB functions in RAVEN/curation of the RAVEN toolbox. Help text is collected from the source of the tracked branch.

Functions

Function | Summary
curateModelFromTables | Curate or add mets, rxns and genes from tables.
downloadGenomeData | Download genome annotation files from NCBI.
getGeneData | Build a gene-mapping table from NCBI genome annotation files.
processProteinFastaFile | Rename protein FASTA headers using a gene mapping table.
renameModelGenes | Replace gene identifiers in a RAVEN model.

Reference

curateModelFromTables

Curate or add mets, rxns and genes from tables.

Curate existing and/or add new metabolites, reactions and genes from tabular data files. Originally extracted from yeast-GEM's curateMetsRxnsGenes; generalised here so any GEM project can drive batch curation from the same set of *.tsv files.

If the .tsv files contain metabolites, reactions and/or genes that are already present in the model, the information in the model is overwritten. This includes empty fields: an empty annotation in a .tsv file replaces the existing value in the model. Metabolites are matched by metaboliteName[comp], reactions by the stoichiometry of their reactants and products, and genes by their gene name. The function can therefore be used both to add new entities to the model and to curate existing ones.
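
The metabolite matching rule can be pictured as a key comparison. The following is a minimal sketch in base MATLAB, not the function's actual implementation; the variable names and example metabolites are hypothetical. It builds metaboliteName[comp] keys and checks which table rows would hit an existing entry and which would be treated as new.

```matlab
% Hypothetical keys for metabolites already in the model, as name[comp].
modelKeys = {'glycerol[c]'; 'ATP[c]'};

% Hypothetical rows from a metsInfo table.
tblNames = {'glycerol'; 'xylitol'};
tblComps = {'c'; 'c'};

% Build the same name[comp] key for every table row.
tblKeys = strcat(tblNames, '[', tblComps, ']');

% true  -> row matches an existing metabolite (its fields would be overwritten)
% false -> row describes a new metabolite (it would be added)
isExisting = ismember(tblKeys, modelKeys);
```

Checking a table this way before curation shows which existing entries would have their fields overwritten, including by empty values.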

Input arguments:

Name | Type | Description | Default
model | struct | RAVEN model structure to be curated. | required
metsInfo | char | Path to a *.tsv file with metabolite information, or 'none' to skip metabolite curation. Columns: metNames, comps, formula, charge, inchi, metNotes, then any number of MIRIAM-namespace columns. | required

Name-value arguments:

Name | Type | Description | Default
genesInfo | char | Path to a *.tsv file with gene information, or 'none'. Columns: genes, geneShortNames, then MIRIAM. |
rxnsCoeffs | char | Path to a *.tsv file with reaction stoichiometric coefficients, or 'none'. Columns: rxnIdx, rxnNames, metNames, comps, coefficient. One row per (reaction, metabolite) pair. |
rxnsInfo | char | Path to a *.tsv file with reaction information, or 'none'. Columns: rxnIdx, rxnNames, grRules, lb, ub, rev, subSystems, eccodes, rxnNotes, rxnReferences, rxnConfidenceScores, then MIRIAM. |
metPrefix | char | Prefix used to mint fresh metabolite ids (e.g. 's_' for yeast-GEM, 'M_' for the cobrapy/BiGG default). | 'M_'
rxnPrefix | char | Prefix used to mint fresh reaction ids. | 'R_'
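
To illustrate the "one row per (reaction, metabolite) pair" layout of rxnsCoeffs, here is a small sketch that builds such a table in memory. The reaction and its metabolites are hypothetical example data, and the sign convention (negative for reactants, positive for products) is an assumption that the help text does not state.

```matlab
% One hypothetical reaction (glycerol kinase) in long format:
% one row per (reaction, metabolite) pair.
% Assumption: negative coefficients mark reactants, positive mark products.
metNames    = {'glycerol'; 'ATP'; 'glycerol 3-phosphate'; 'ADP'; 'H+'};
nRows       = numel(metNames);
rxnIdx      = ones(nRows, 1);                      % all rows belong to reaction 1
rxnNames    = repmat({'glycerol kinase'}, nRows, 1);
comps       = repmat({'c'}, nRows, 1);
coefficient = [-1; -1; 1; 1; 1];

rxnsCoeffs = table(rxnIdx, rxnNames, metNames, comps, coefficient);

% Number of metabolites participating in each reaction, grouped by rxnIdx.
metsPerRxn = accumarray(rxnsCoeffs.rxnIdx, 1);
```

A table like this could be written to disk with writetable(rxnsCoeffs, 'rxnsCoeffs.tsv', 'FileType', 'text', 'Delimiter', '\t'), where the file name is only an example.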

Output arguments:

Name | Type | Description
newModel | struct | Curated RAVEN model structure.

Examples:

newModel = curateModelFromTables(model, metsInfo, genesInfo, ...
                rxnsCoeffs, rxnsInfo, metPrefix, rxnPrefix);

Notes

The 'everything after the core columns is MIRIAM' convention applies to all three info tables: any column whose header is not one of the listed core fields is treated as a MIRIAM annotation namespace and stored on the matching entity.
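
The column convention can be sketched with a tiny metabolite table. This is an illustration only: the example row, the temporary file, and the 'chebi' header (used here as an assumed MIRIAM namespace name) are not taken from the toolbox. The sketch writes a tab-delimited file, reads it back, and separates the core columns from the remaining ones.

```matlab
% Hypothetical one-row metabolite table. The 'chebi' column header is an
% assumed MIRIAM namespace name, used here only for illustration.
metNames = {'glycerol'};
comps    = {'c'};
formula  = {'C3H8O3'};
charge   = 0;
inchi    = {''};
metNotes = {'example entry'};
chebi    = {'CHEBI:17754'};
metsTable = table(metNames, comps, formula, charge, inchi, metNotes, chebi);

% Write as tab-delimited text to a temporary file, then read it back.
tsvFile = [tempname '.tsv'];
writetable(metsTable, tsvFile, 'FileType', 'text', 'Delimiter', '\t');
readBack = readtable(tsvFile, 'FileType', 'text', 'Delimiter', '\t', ...
    'ReadVariableNames', true);
delete(tsvFile);

% Columns that are not core fields are the candidates for MIRIAM annotations.
coreCols   = {'metNames', 'comps', 'formula', 'charge', 'inchi', 'metNotes'};
miriamCols = setdiff(readBack.Properties.VariableNames, coreCols, 'stable');
```

A misspelled core header would also end up in miriamCols under this rule, so a check like this can catch typos before they are stored as annotations.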

downloadGenomeData

Download genome annotation files from NCBI.

Retrieves the GFF3 annotation and protein FASTA (.faa) files for a given NCBI genome assembly accession using the NCBI Datasets v2 API, which returns both files in a single archive. The files are saved locally and can be used directly by getGeneData to build a gene-mapping table.

Input arguments:

Name | Type | Description | Default
accession | char | NCBI genome assembly accession, e.g. 'GCF_000002595.2'. Both RefSeq (GCF_) and GenBank (GCA_) prefixes are accepted. | required
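
As a rough illustration of the accepted accession forms, the check below tests a string against an assumed pattern (GCF_ or GCA_, digits, a dot, a version number). The pattern and the helper name isAccession are assumptions made for this sketch; downloadGenomeData may validate its input differently, or not at all.

```matlab
% Assumed pattern: 'GCF_' or 'GCA_', digits, a dot, and a version number.
accessionPattern = '^GC[AF]_\d+\.\d+$';

% Hypothetical helper: true when the string looks like an assembly accession.
isAccession = @(s) ~isempty(regexp(s, accessionPattern, 'once'));

refSeqOk  = isAccession('GCF_000002595.2');   % RefSeq-style prefix
genBankOk = isAccession('GCA_000002595.2');   % GenBank-style prefix (format example only)
notOk     = isAccession('genome.gff');        % a file name, not an accession
```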

Name-value arguments:

Name | Type | Description | Default
outputDir | char | Directory where downloaded files are saved. | current working directory
verbose | logical | Print download progress to the command window. | true

Output arguments:

Name | Type | Description
gffFile | char | Full path to the downloaded GFF3 annotation file.
faaFile | char | Full path to the downloaded protein FASTA file.

Examples:

% Download to current directory
[gff, faa] = downloadGenomeData('GCF_000002595.2');

% Save to a specific directory
[gff, faa] = downloadGenomeData('GCF_000002595.2', 'data/');

% Suppress progress messages
[gff, faa] = downloadGenomeData('GCF_000002595.2', 'data/', false);

Notes

Requires an active internet connection. Files already present in outputDir are not re-fetched; delete them manually to force a refresh.

getGeneData

Build a gene-mapping table from NCBI genome annotation files.

Parses the GFF3 annotation and protein FASTA (.faa) files for a given NCBI genome assembly and produces a table mapping locus tags to gene symbols, protein IDs, and other identifiers. The resulting table can be passed directly to renameModelGenes to update gene identifiers in a RAVEN model. If no local files are provided, downloadGenomeData is called automatically to fetch them.

Input arguments:

Name | Type | Description | Default
accession | char | NCBI genome assembly accession, e.g. 'GCF_000002595.2', or a path to a local GFF3 annotation file. | required

Name-value arguments:

Name | Type | Description | Default
outputFile | char | Path to save the resulting table as a tab-delimited .tsv file. If omitted or empty, no file is written and the table is only returned. |
downloadDir | char | Directory where genome files are downloaded if not already present. | current working directory

Output arguments:

Name | Type | Description
geneTable | table | MATLAB table with one row per gene, containing the columns listed below.

  • locus_tag: stable locus identifier (e.g. 'Cre01.g000001')
  • old_locus_tag: previous locus identifier, when available
  • GeneID: NCBI Gene ID (e.g. '5723799')
  • gene_name: common gene symbol, when available (e.g. 'rbcL')
  • GenBank_protein: protein accession (e.g. 'XP_001698190.2'), matching the protein FASTA headers
  • UniProt: UniProt accession from Dbxref, when available

Examples:

% Fetch and parse gene data for Chlamydomonas reinhardtii
geneTable = getGeneData('GCF_000002595.2');

% Save the mapping table to a specific file
geneTable = getGeneData('GCF_000002595.2', 'chlamy.tsv');

% Save to a file and download to a specific directory
geneTable = getGeneData('GCF_000002595.2', 'chlamy.tsv', 'data/');

Notes

When outputFile is provided the table is written as UTF-8 encoded tab-delimited text with a header row, compatible with readtable and renameModelGenes. Rows without a locus_tag are silently discarded.
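
The effect of dropping rows without a locus_tag can be sketched on a small in-memory table. The rows below are hypothetical and the filter is only an illustration of the described behaviour, not the code used by getGeneData.

```matlab
% Hypothetical gene table; the second row has no locus_tag.
locus_tag = {'Cre01.g000001'; ''; 'Cre01.g000002'};
gene_name = {'geneA'; 'geneB'; ''};
geneTable = table(locus_tag, gene_name);

% Keep only rows whose locus_tag is non-empty.
hasLocusTag = ~cellfun(@isempty, geneTable.locus_tag);
nDropped    = sum(~hasLocusTag);
geneTable   = geneTable(hasLocusTag, :);
```

Because the real function discards such rows without a message, comparing the returned row count against the number of gene features in the GFF3 file is one way to notice missing entries.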

processProteinFastaFile

Rename protein FASTA headers using a gene mapping table.

Reads a protein FASTA file (as downloaded by downloadGenomeData) and replaces each sequence header with the value from the specified geneTable column, matched via the GenBank_protein accession present in the original FASTA header. Sequences whose accession is not found in geneTable are kept with their original header unchanged.

Input arguments:

Name | Type | Description | Default
faaFile | char | Path to the protein FASTA file (.faa) to process. | required
geneTable | char or table | Either a path to a tab-delimited .tsv file produced by getGeneData, or a MATLAB table variable. Must contain at least the columns 'GenBank_protein' and the column named by headerCol. | required
headerCol | char | Name of the geneTable column whose values will replace each FASTA header (e.g. 'locus_tag', 'gene_name', 'GenBank_protein'). | required

Name-value arguments:

Name | Type | Description | Default
outputDir | char | Directory where the processed FASTA file is saved. The output file name is the original base name with '_processed' appended before the extension. | current working directory
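
The output naming rule can be reproduced with fileparts and fullfile. This is a sketch of the rule as described, using the file name from the examples and a hypothetical output directory; the function itself returns a full path, whereas this sketch yields a path relative to outputDir.

```matlab
faaFile   = 'GCF_000002595.2_protein.faa';
outputDir = 'results';   % hypothetical output directory

% Split off the extension, append '_processed' to the base name, and
% reassemble the path inside outputDir.
[~, baseName, ext] = fileparts(faaFile);
processedFaaFile = fullfile(outputDir, [baseName '_processed' ext]);
```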

Output arguments:

Name | Type | Description
processedFaaFile | char | Full path to the written processed FASTA file.

Examples:

% Rename headers using locus_tag
[~, faa]  = downloadGenomeData('GCF_000002595.2');
geneTable = getGeneData('GCF_000002595.2', 'chlamy.tsv');
outFile   = processProteinFastaFile(faa, geneTable, 'locus_tag');

% Load both inputs from saved files, use gene_name as header
outFile = processProteinFastaFile('GCF_000002595.2_protein.faa', 'chlamy.tsv', 'gene_name');

% Save the processed file to a specific directory
outFile = processProteinFastaFile(faa, 'chlamy.tsv', 'locus_tag', 'results/');

Notes

GenBank_protein must contain accessions matching the first token of each FASTA header (after '>'). For prokaryotes this is the protein_id (e.g. WP_012345678.1); for eukaryotes it is the GenBank accession that getGeneData reads from the GFF3 annotation.
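
The first-token matching described above can be sketched on a few in-memory lines. The sequences, the second accession, and the accession-to-locus_tag pairing are hypothetical; the sketch only shows the matching and fallback logic, not the function's file handling.

```matlab
% Hypothetical FASTA content, one cell per line.
fastaLines = {
    '>XP_001698190.2 example protein description'
    'MSTKAAV'
    '>XP_000000000.1 accession absent from the table'
    'MAAAKLV'};

% Hypothetical mapping from GenBank_protein accession to locus_tag.
accessionToTag = containers.Map({'XP_001698190.2'}, {'Cre01.g000001'});

for i = 1:numel(fastaLines)
    % The accession is the first whitespace-delimited token after '>'.
    tok = regexp(fastaLines{i}, '^>(\S+)', 'tokens', 'once');
    if ~isempty(tok) && isKey(accessionToTag, tok{1})
        % Known accession: replace the whole header with the mapped value.
        fastaLines{i} = ['>' accessionToTag(tok{1})];
    end
    % Sequence lines and unknown accessions are left unchanged.
end
```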

renameModelGenes

Replace gene identifiers in a RAVEN model.

Updates model.genes and model.grRules by substituting the identifiers currently used in the model (fromCol) with new identifiers (toCol) from a mapping table produced by getGeneData or supplied manually. After renaming, model.rxnGeneMat is rebuilt automatically.

Input arguments:

Name | Type | Description | Default
model | struct | RAVEN model struct. Must contain model.genes and model.grRules; model.rxnGeneMat is rebuilt automatically. | required
geneTable | table or char | Gene-mapping table, supplied either as (a) a MATLAB table variable already loaded in the workspace, or (b) a path to a tab-delimited .tsv file, which is loaded automatically. | required
fromCol | char | Column in geneTable whose values match the identifiers currently in model.genes (e.g. 'locus_tag'). | required
toCol | char | Column in geneTable whose values will replace the current identifiers (e.g. 'gene_name'). | required

Output arguments:

Name | Type | Description
model | struct | Updated model struct with the fields listed below.

  • model.genes: renamed gene list.
  • model.grRules: gene-reaction rules using the new identifiers.
  • model.rxnGeneMat: rebuilt sparse matrix matching the new gene list.
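
To show what a reaction-by-gene matrix consistent with grRules looks like, the sketch below builds one from scratch. It assumes rules that use only the operators 'and' and 'or' plus parentheses, and it is not necessarily how renameModelGenes rebuilds model.rxnGeneMat; the gene names and rules are hypothetical.

```matlab
% Hypothetical gene list and gene-reaction rules.
genes   = {'geneA'; 'geneB'; 'geneC'};
grRules = {'geneA and geneB'; ''; 'geneC or geneA'};

% Rows are reactions, columns are genes.
rxnGeneMat = sparse(numel(grRules), numel(genes));
for i = 1:numel(grRules)
    if isempty(grRules{i})
        continue;   % reaction without a gene association
    end
    % Split the rule into tokens, ignoring whitespace and parentheses,
    % then drop the logical operators.
    tokens = regexp(grRules{i}, '[^\s()]+', 'match');
    tokens = setdiff(tokens, {'and', 'or'});
    % Mark every gene that appears in the rule.
    [~, idx] = ismember(tokens, genes);
    rxnGeneMat(i, idx(idx > 0)) = 1;
end
```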

Examples:

model     = readYAMLmodel('iCre1355.yml');
geneTable = getGeneData('GCF_000002595.2', 'chlamy.tsv');
model     = renameModelGenes(model, geneTable, 'locus_tag', 'gene_name');

% Supply a TSV file path directly
model = renameModelGenes(model, 'chlamy.tsv', 'locus_tag', 'gene_name');

Notes
  • Genes with no entry in fromCol, or whose toCol value is empty, are left unchanged; a warning lists them for investigation.
  • Word-boundary matching prevents partial substitution (e.g. 'gene1' is not replaced inside 'gene10').
  • grRules are standardized via standardizeGrRules before and after renaming to ensure consistent formatting.
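
Word-boundary matching of this kind can be sketched with regexprep and MATLAB's \< and \> anchors. The rule and identifiers are hypothetical, and the pattern is one possible way to get this behaviour, not necessarily the one used inside renameModelGenes.

```matlab
grRule = 'gene1 and (gene10 or gene1)';
oldId  = 'gene1';
newId  = 'GENEA';

% Escape regex metacharacters in the old identifier (e.g. the dot in
% 'Cre01.g000001'), then anchor the match at word boundaries.
pattern = ['\<' regexptranslate('escape', oldId) '\>'];
renamed = regexprep(grRule, pattern, newId);
```

Both standalone occurrences of gene1 are replaced, while gene10 is left alone because the character after gene1 is still part of the same word.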