Curation

MATLAB functions in RAVEN/curation of the RAVEN toolbox. Help text is collected from the source of the tracked branch.

Functions

Function	Summary
`curateModelFromTables`	Curate or add mets, rxns and genes from tables.
`downloadGenomeData`	Download genome annotation files from NCBI.
`getGeneData`	Build a gene-mapping table from NCBI genome annotation files.
`processProteinFastaFile`	Rename protein FASTA headers using a gene mapping table.
`renameModelGenes`	Replace gene identifiers in a RAVEN model.

Reference

curateModelFromTables

Curate or add mets, rxns and genes from tables.

Curate existing and/or add new metabolites, reactions and genes from tabular data files. Originally extracted from yeast-GEM's curateMetsRxnsGenes; generalised here so any GEM project can drive batch curation from the same set of *.tsv files.

If the .tsv files contain metabolites, reactions and/or genes that are already present in the model, the corresponding information in the model is overwritten. This includes empty annotations: an empty cell in a .tsv file overwrites the existing value in the model. Metabolites are matched by metaboliteName[comp], reactions by the stoichiometry of their reactants and products, and genes by their gene name. The function can therefore be used both to add new entities to the model and to curate those that already exist.
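
The metabolite matching rule can be illustrated with base MATLAB. The following is a minimal sketch that builds metaboliteName[comp] keys and checks which table rows already exist in a model. The variable names and toy data are hypothetical, and this is not necessarily how curateModelFromTables performs the lookup internally.

```matlab
% Toy data standing in for a model and a metsInfo table (hypothetical values).
modelMetNames = {'glucose'; 'ATP'; 'ATP'};
modelComps    = {'c'; 'c'; 'm'};
tableMetNames = {'ATP'; 'pyruvate'};
tableComps    = {'m'; 'c'};

% Build keys of the form metaboliteName[comp].
modelKeys = strcat(modelMetNames, '[', modelComps, ']');
tableKeys = strcat(tableMetNames, '[', tableComps, ']');

% Rows found in the model would be curated (overwritten); the rest would be added.
[isExisting, modelIdx] = ismember(tableKeys, modelKeys);
toCurate = tableKeys(isExisting);
toAdd    = tableKeys(~isExisting);
```

Because the compartment is part of the key, ATP in compartment m and ATP in compartment c are treated as different metabolites.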

Input arguments:

Name	Type	Description	Default
`model`	`struct`	RAVEN model structure to be curated.	required
`metsInfo`	`char`	Path to a *.tsv file with metabolite information, or 'none' to skip metabolite curation. Columns: metNames, comps, formula, charge, inchi, metNotes, then any number of MIRIAM-namespace columns.	required

Name-value arguments:

Name	Type	Description
`genesInfo`	`char`	Path to a *.tsv file with gene information, or 'none' to skip gene curation. Columns: genes, geneShortNames, then any number of MIRIAM-namespace columns.
`rxnsCoeffs`	`char`	Path to a *.tsv file with reaction stoichiometric coefficients, or 'none'. Columns: rxnIdx, rxnNames, metNames, comps, coefficient. One row per (reaction, metabolite) pair.
`rxnsInfo`	`char`	Path to a *.tsv file with reaction information, or 'none'. Columns: rxnIdx, rxnNames, grRules, lb, ub, rev, subSystems, eccodes, rxnNotes, rxnReferences, rxnConfidenceScores, then any number of MIRIAM-namespace columns.
`metPrefix`	`char`	Prefix used to mint fresh metabolite ids (e.g. 's_' for yeast-GEM, 'M_' for the cobrapy/BiGG default) (default 'M_').
`rxnPrefix`	`char`	Prefix used to mint fresh reaction ids (default 'R_').
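
The long format of rxnsCoeffs (one row per reaction and metabolite pair) can be turned into a stoichiometric matrix with a few lines of base MATLAB. This is a sketch under assumptions: the toy reactions are simplified, and treating rxnIdx as a plain integer column index is an assumption made for illustration, not a statement about how curateModelFromTables processes the file.

```matlab
% Toy rxnsCoeffs content (hypothetical): one row per (reaction, metabolite) pair.
rxnIdx      = [1; 1; 2; 2];
metNames    = {'glucose'; 'glucose 6-phosphate'; ...
               'glucose 6-phosphate'; 'fructose 6-phosphate'};
comps       = {'c'; 'c'; 'c'; 'c'};
coefficient = [-1; 1; -1; 1];

% Identify unique metabolites by name[comp], keeping first-seen order.
metKeys = strcat(metNames, '[', comps, ']');
[mets, ~, metIdx] = unique(metKeys, 'stable');

% Assemble a metabolites-by-reactions sparse matrix from the long table.
S = sparse(metIdx, rxnIdx, coefficient, numel(mets), max(rxnIdx));
```

Negative coefficients mark reactants and positive coefficients mark products, so each column of S describes one reaction.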

Output arguments:

Name	Type	Description
`newModel`	`struct`	Curated RAVEN model structure.

Examples:

newModel = curateModelFromTables(model, metsInfo, genesInfo, ...
                rxnsCoeffs, rxnsInfo, metPrefix, rxnPrefix);

Notes

The 'everything after the core columns is MIRIAM' convention applies to all three info tables: any column whose header is not one of the listed core fields is treated as a MIRIAM annotation namespace and stored on the matching entity.
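
As an illustration of the column convention, the sketch below writes a one-row metsInfo file with the core columns followed by one extra column, then lists the headers that fall outside the core set. The column name 'chebi' and the file name are assumptions chosen for the example; only base MATLAB functions are used.

```matlab
% Core columns in the order listed for metsInfo, plus one extra column.
% 'chebi' is a hypothetical MIRIAM namespace chosen for illustration.
metsInfo = table( ...
    {'pyruvate'}, {'c'}, {'C3H3O3'}, -1, {''}, {'added in curation'}, {'CHEBI:15361'}, ...
    'VariableNames', {'metNames','comps','formula','charge','inchi','metNotes','chebi'});

% Write as tab-delimited text; the .tsv extension needs an explicit FileType.
tsvPath = fullfile(tempdir, 'metsInfo_example.tsv');
writetable(metsInfo, tsvPath, 'FileType', 'text', 'Delimiter', '\t');

% Headers that are not core fields would be treated as MIRIAM namespaces.
coreCols   = {'metNames','comps','formula','charge','inchi','metNotes'};
readBack   = readtable(tsvPath, 'FileType', 'text', 'Delimiter', '\t', ...
                       'ReadVariableNames', true);
miriamCols = setdiff(readBack.Properties.VariableNames, coreCols, 'stable');
```

The path in tsvPath could then be supplied as the metsInfo argument.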

downloadGenomeData

Download genome annotation files from NCBI.

Retrieves the GFF3 annotation and protein FASTA (.faa) files for a given NCBI genome assembly accession using the NCBI Datasets v2 API, which returns both files in a single archive. The files are saved locally and can be used directly by getGeneData to build a gene-mapping table.

Input arguments:

Name	Type	Description	Default
`accession`	`char`	NCBI genome assembly accession, e.g. 'GCF_000002595.2'. Both RefSeq (GCF_) and GenBank (GCA_) prefixes are accepted.	required

Name-value arguments:

Name	Type	Description
`outputDir`	`char`	Directory where downloaded files are saved (default: current working directory).
`verbose`	`logical`	Print download progress to the command window (default: true).

Output arguments:

Name	Type	Description
`gffFile`	`char`	Full path to the downloaded GFF3 annotation file.
`faaFile`	`char`	Full path to the downloaded protein FASTA file.

Examples:

% Download to current directory
[gff, faa] = downloadGenomeData('GCF_000002595.2');

% Save to a specific directory
[gff, faa] = downloadGenomeData('GCF_000002595.2', 'data/');

% Suppress progress messages
[gff, faa] = downloadGenomeData('GCF_000002595.2', 'data/', false);

Notes

Requires an active internet connection. Files already present in outputDir are not re-fetched; delete them manually to force a refresh.
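
Forcing a refresh amounts to deleting the cached files before calling the function again. The sketch below shows this with placeholder files so that it runs offline. The directory and file names are hypothetical; in practice the paths would be the gffFile and faaFile values returned by an earlier call.

```matlab
% Hypothetical paths standing in for previously downloaded files.
outputDir = fullfile(tempdir, 'genome_cache_example');
if ~isfolder(outputDir)
    mkdir(outputDir);
end
cached = {fullfile(outputDir, 'example.gff'), ...
          fullfile(outputDir, 'example_protein.faa')};

% Create empty placeholders so the sketch runs without network access.
for i = 1:numel(cached)
    fclose(fopen(cached{i}, 'w'));
end

% Remove any cached copies so that the next download starts fresh.
for i = 1:numel(cached)
    if isfile(cached{i})
        delete(cached{i});
    end
end
% A subsequent call to downloadGenomeData would then fetch the files again.
```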

getGeneData

Build a gene-mapping table from NCBI genome annotation files.

Parses the GFF3 annotation and protein FASTA (.faa) files for a given NCBI genome assembly and produces a table mapping locus tags to gene symbols, protein IDs, and other identifiers. The resulting table can be passed directly to renameModelGenes to update gene identifiers in a RAVEN model. If no local files are provided, downloadGenomeData is called automatically to fetch them.

Input arguments:

Name	Type	Description	Default
`accession`	`char`	NCBI genome assembly accession, e.g. 'GCF_000002595.2', or a path to a local GFF3 annotation file.	required

Name-value arguments:

Name	Type	Description
`outputFile`	`char`	Path to save the resulting table as a tab-delimited .tsv file. If omitted or empty, no file is written and the table is only returned.
`downloadDir`	`char`	Directory where genome files are downloaded if not already present (default: current working directory).

Output arguments:

Name	Type	Description
`geneTable`	`table`	MATLAB table with one row per gene and the following columns: locus_tag: stable locus identifier (e.g. 'Cre01.g000001'); old_locus_tag: previous locus identifier, when available; GeneID: NCBI Gene ID (e.g. '5723799'); gene_name: common gene symbol, when available (e.g. 'rbcL'); GenBank_protein: protein accession (e.g. 'XP_001698190.2'), matching the protein FASTA headers; UniProt: UniProt accession from Dbxref, when available.

Examples:

% Fetch and parse gene data for Chlamydomonas reinhardtii
geneTable = getGeneData('GCF_000002595.2');

% Save the mapping table to a specific file
geneTable = getGeneData('GCF_000002595.2', 'chlamy.tsv');

% Save to a file and download to a specific directory
geneTable = getGeneData('GCF_000002595.2', 'chlamy.tsv', 'data/');

Notes

When outputFile is provided the table is written as UTF-8 encoded tab-delimited text with a header row, compatible with readtable and renameModelGenes. Rows without a locus_tag are silently discarded.
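
When a saved table is reloaded by hand, numeric-looking identifiers such as GeneID may be imported as numbers unless the import options say otherwise. The sketch below writes a toy table and reads it back with every column as text. The row is made up from the example values in the column descriptions and is not a real mapping, and the file name is hypothetical.

```matlab
% Small stand-in for a table written by getGeneData (toy row, not a real mapping).
geneTable = table({'Cre01.g000001'}, {'5723799'}, {'rbcL'}, ...
    'VariableNames', {'locus_tag','GeneID','gene_name'});
tsvPath = fullfile(tempdir, 'geneTable_example.tsv');
writetable(geneTable, tsvPath, 'FileType', 'text', 'Delimiter', '\t');

% Import every column as text so numeric-looking IDs are not converted.
opts = detectImportOptions(tsvPath, 'FileType', 'text', 'Delimiter', '\t');
opts = setvartype(opts, 'char');
reloaded = readtable(tsvPath, opts);

% Drop rows without a locus_tag, mirroring the rule described in the notes.
reloaded = reloaded(~cellfun(@isempty, reloaded.locus_tag), :);
```

Keeping identifiers as text avoids losing leading zeros and keeps comparisons against model.genes as plain string matches.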

processProteinFastaFile

Rename protein FASTA headers using a gene mapping table.

Reads a protein FASTA file (as downloaded by downloadGenomeData) and replaces each sequence header with the value from the specified geneTable column, matched via the GenBank_protein accession present in the original FASTA header. Sequences whose accession is not found in geneTable are kept with their original header unchanged.

Input arguments:

Name	Type	Description	Default
`faaFile`	`char`	Path to the protein FASTA file (.faa) to process.	required
`geneTable`	`char | table`	Either a path to a tab-delimited .tsv file produced by getGeneData, or a MATLAB table variable. Must contain at least the 'GenBank_protein' column and the column named by headerCol.	required
`headerCol`	`char`	Name of the geneTable column whose values will replace each FASTA header (e.g. 'locus_tag', 'gene_name', 'GenBank_protein').	required

Name-value arguments:

Name	Type	Description
`outputDir`	`char`	Directory where the processed FASTA file is saved (default: current working directory). The output file name is the original base name with '_processed' appended before the extension.

Output arguments:

Name	Type	Description
`processedFaaFile`	`char`	Full path to the written processed FASTA file.

Examples:

% Rename headers using locus_tag
[~, faa]  = downloadGenomeData('GCF_000002595.2');
geneTable = getGeneData('GCF_000002595.2', 'chlamy.tsv');
outFile   = processProteinFastaFile(faa, geneTable, 'locus_tag');

% Load both inputs from saved files, use gene_name as header
outFile = processProteinFastaFile('GCF_000002595.2_protein.faa', 'chlamy.tsv', 'gene_name');

% Save the processed file to a specific directory
outFile = processProteinFastaFile(faa, 'chlamy.tsv', 'locus_tag', 'results/');

Notes

GenBank_protein must contain accessions matching the first token of each FASTA header (after '>'). For prokaryotes this is the protein_id (e.g. WP_012345678.1); for eukaryotes it is the protein accession (e.g. XP_001698190.2) that getGeneData takes from the GFF3 annotation.
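
The header matching described here can be sketched in base MATLAB as follows. The accessions, locus tags and sequences are invented for illustration, and the loop is a simplified stand-in for the behaviour described, not the function's actual implementation.

```matlab
% Toy FASTA lines and mapping (hypothetical accessions and locus tags).
fastaLines = { ...
    '>XP_000000001.1 hypothetical protein [Example organism]'; ...
    'MSTNPKPQRK'; ...
    '>XP_000000002.1 another protein [Example organism]'; ...
    'MAAGLLKR'};
accessions = {'XP_000000001.1'};
newHeaders = {'LOCUS_0001'};
lookup = containers.Map(accessions, newHeaders);

for i = 1:numel(fastaLines)
    txt = fastaLines{i};
    if ~isempty(txt) && txt(1) == '>'
        % The accession is the first whitespace-delimited token after '>'.
        acc = regexp(txt(2:end), '^\S+', 'match', 'once');
        if ~isempty(acc) && isKey(lookup, acc)
            fastaLines{i} = ['>' lookup(acc)];
        end
        % Headers without a match are kept unchanged.
    end
end
```

In this toy input the first header becomes '>LOCUS_0001', while the second keeps its original text because its accession is not in the mapping.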

renameModelGenes

Replace gene identifiers in a RAVEN model.

Updates model.genes and model.grRules by substituting the identifiers currently used in the model (fromCol) with new identifiers (toCol) from a mapping table produced by getGeneData or supplied manually. After renaming, model.rxnGeneMat is rebuilt automatically.
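
To make the relationship between grRules and rxnGeneMat concrete, the sketch below derives a reactions-by-genes matrix from toy rules. The gene names are hypothetical and the tokenisation is deliberately simple; RAVEN's own rebuild logic may differ.

```matlab
% Toy gene list and gene-reaction rules (hypothetical identifiers).
genes   = {'geneA'; 'geneB'};
grRules = {'geneA and geneB'; 'geneB'; ''};

rxnGeneMat = sparse(numel(grRules), numel(genes));
for i = 1:numel(grRules)
    if isempty(grRules{i})
        continue;   % reactions without a rule keep an all-zero row
    end
    % Split the rule into tokens, ignoring brackets and whitespace.
    tokens = regexp(grRules{i}, '[^\s()]+', 'match');
    % Operators such as 'and'/'or' are not in the gene list and map to 0.
    [~, idx] = ismember(tokens, genes);
    rxnGeneMat(i, idx(idx > 0)) = 1;
end
```

Each row marks which genes appear in the rule of the corresponding reaction, which is why the matrix has to be rebuilt whenever the gene list changes.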

Input arguments:

Name	Type	Description	Default
`model`	`struct`	RAVEN model struct. Must contain model.genes and model.grRules; model.rxnGeneMat is rebuilt automatically.	required
`geneTable`	`table | char`	Gene-mapping table, supplied either as a MATLAB table variable already loaded in the workspace or as a path to a tab-delimited .tsv file, which is loaded automatically.	required
`fromCol`	`char`	Column in geneTable whose values match the identifiers currently in model.genes (e.g. 'locus_tag').	required
`toCol`	`char`	Column in geneTable whose values will replace the current identifiers (e.g. 'gene_name').	required

Output arguments:

Name	Type	Description
`model`	`struct`	Updated model struct with: model.genes, the renamed gene list; model.grRules, the gene-reaction rules using the new identifiers; model.rxnGeneMat, the rebuilt sparse matrix matching the new gene list.

Examples:

model     = readYAMLmodel('iCre1355.yml');
geneTable = getGeneData('GCF_000002595.2', 'chlamy.tsv');
model     = renameModelGenes(model, geneTable, 'locus_tag', 'gene_name');

% Supply a TSV file path directly
model = renameModelGenes(model, 'chlamy.tsv', 'locus_tag', 'gene_name');

Notes

Genes with no entry in fromCol, or whose toCol value is empty, are left unchanged; a warning lists them for investigation.
Word-boundary matching prevents partial substitution (e.g. 'gene1' is not replaced inside 'gene10').
grRules are standardized via standardizeGrRules before and after renaming to ensure consistent formatting.
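
The word-boundary behaviour can be reproduced with MATLAB's regexprep. This is a minimal sketch of the technique using the \< and \> anchors; the rule and identifiers are toy values, and the exact pattern used inside renameModelGenes is not shown in this reference.

```matlab
% Toy rule in which 'gene1' is also a prefix of 'gene10'.
grRule = '(gene1 and gene10) or gene1';
oldId  = 'gene1';
newId  = 'rbcL';

% Escape regex metacharacters in the identifier, then anchor at word boundaries.
pattern = ['\<' regexptranslate('escape', oldId) '\>'];
renamed = regexprep(grRule, pattern, newId);
```

Here renamed is '(rbcL and gene10) or rbcL'. Note that these anchors treat only letters, digits and underscores as word characters, so identifiers containing '.' or '-' need extra care with this simple pattern.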