Skip to content

Reconstruction

MATLAB functions in RAVEN/reconstruction of the RAVEN toolbox. Help text is collected from the source of the tracked branch.

Functions

Function Summary
getBlast
getDiamond
getKEGGModelForOrganism Reconstruct a model from KEGG protein homology.
getModelFromHomology
getModelFromKEGG Load the pre-built global KEGG model.
getPhylDist Load the pre-built KEGG phylogenetic distance matrix.
getWSLpath Translate a Windows-style path to its Unix WSL equivalent.
guessComposition Guess the composition of metabolites without one.
makeFakeBlastStructure Make a fake blastStructure from an ortholog list.

Reference

getBlast

Bidirectional BLAST between an organism and template organisms.

Input arguments:

Name Type Description Default
organismID char

the id of the organism of interest. This should also match with the id supplied to getModelFromHomology.

required
fastaFile char

a FASTA file with the protein sequences for the organism of interest.

required
modelIDs cell

a cell array of model ids. These must match the "model.id" fields in the "models" structure if the output is to be used with getModelFromHomology.

required
refFastaFiles cell

a cell array with the paths to the corresponding FASTA files.

required

Name-value arguments:

Name Type Description Default
developMode logical

true if blastReport should be generated that is used in the unit testing function for BLAST+ (default false).

hideVerbose logical

true if no status messages should be printed (default false).

Output arguments:

Name Type Description
blastStructure struct

structure containing the bidirectional homology measurements that can be used by getModelFromHomology.

blastReport struct

structure containing MD5 hashes for FASTA database files and non-parsed BLAST output data. Will be blank if developMode is false.

Notes

This function calls BLAST+ to perform a bidirectional homology test between the organism of interest and a set of other organisms using standard settings. The only filtering this function does is the removal of hits with an E-value higher than 10e-5.

Examples:

[blastStructure,blastReport] = getBlast(organismID,fastaFile,...
   modelIDs,refFastaFiles,developMode,hideVerbose);
See also

getModelFromHomology, getDiamond

getDiamond

Bidirectional BLAST with DIAMOND against template organisms.

Input arguments:

Name Type Description Default
organismID char

the id of the organism of interest. This should also match with the id supplied to getModelFromHomology.

required
fastaFile char

a FASTA file with the protein sequences for the organism of interest.

required
modelIDs cell

a cell array of model ids. These must match the "model.id" fields in the "models" structure if the output is to be used with getModelFromHomology.

required
refFastaFiles cell

a cell array with the paths to the corresponding FASTA files.

required

Name-value arguments:

Name Type Description Default
developMode logical

true if blastReport should be generated that is used in the unit testing function for DIAMOND (default false).

hideVerbose logical

true if no status messages should be printed (default false).

Output arguments:

Name Type Description
blastStructure struct

structure containing the bidirectional homology measurements which are used by getModelFromHomology.

diamondReport struct

structure containing MD5 hashes for FASTA database files and non-parsed BLAST output data. Will be blank if developMode is false.

Notes

This function calls DIAMOND to perform a bidirectional homology search between the organism of interest and a set of other organisms using the "--more-sensitive" setting from DIAMOND. For the most sensitive results, the use of getBlast() is adviced, however, getDiamond() is a fast alternative (>15x faster). The blastStructure generated is in the same format as those obtained from getBlast().

Examples:

[blastStructure,diamondReport] = getDiamond(organismID,fastaFile,...
   modelIDs,refFastaFiles,developMode,hideVerbose);
See also

getModelFromHomology, getBlast

getKEGGModelForOrganism

Reconstruct a model from KEGG protein homology.

Reconstructs a genome-scale metabolic model based on protein homology to the orthologies in KEGG. If the target species is not available in KEGG, the user must select a closely related species. It is also possible to circumvent protein homology search (see fastaFile parameter for more details).

Input arguments:

Name Type Description Default
organismID char

three or four letter abbreviation of the organism (as used in KEGG). If not available, use a closely related species. This is used for determing the phylogenetic distance. Use "eukaryotes" or "prokaryotes" to get a model for the whole domain. Only applicable if fastaFile is empty, i.e. no homology search should be performed.

required

Name-value arguments:

Name Type Description Default
fastaFile char

a FASTA file that contains the protein sequences of the organism for which to reconstruct a model. If no FASTA file is supplied then a model is reconstructed based only on the organism abbreviation. This option ignores all settings except for keepSpontaneous, keepUndefinedStoich, keepIncomplete and keepGeneral.

dataDir char

directory for which to retrieve the input data, styled as kegg118_prokaryotes or kegg118_eukaryotes, indicating the KEGG version and whether the HMMs were trained on pro- or eukaryotic sequences. The directory name matches the published HMM library it is paired with. The prebuilt concatenated KO HMM library (dataDir.hmm) is downloaded here from the corresponding RAVEN release if not already present. May also This parameter should ALWAYS be provided.

outDir char

directory to save the results from the quering of the hidden Markov models. The output is specific for the input sequences and the settings used. It is stored in this manner so that the function can continue if interrupted or if it should run in parallel. Be careful not to leave output files from different organisms or runs with different settings in the same folder. They will not be overwritten (default is a temporary dir where all *.out files are deleted before and after doing the reconstruction).

keepSpontaneous logical

include reactions labeled as "spontaneous" (default true).

keepUndefinedStoich logical

include reactions in the form n A <=> n+1 A. These will be dealt with as two separate metabolites (default true).

keepIncomplete logical

include reactions which have been labelled as "incomplete", "erroneous" or "unclear" (default true).

keepGeneral logical

include reactions which have been labelled as "general reaction". These are reactions on the form "an aldehyde <=> an alcohol", and are therefore unsuited for modelling purposes. Note that not all reactions have this type of annotation, and the script will therefore not be able to remove all such reactions (default false).

cutOff double

significance score from HMMer needed to assign genes to a KO (default 10^-30).

minScoreRatioKO double

ignore genes in a KO if their score is <log(score)/log(best score in KO). This is to "prune" KOs which have many genes and where some are clearly a better fit (default 0.3, lower is less strict).

minScoreRatioG double

a gene is only assigned to KOs for which the score is

=log(score)/log(best score) for that gene. This is to prevent that a gene which clearly belongs to one KO is assigned also to KOs with much lower scores (default 0.9, lower is less strict).

maxPhylDist double

-1 to only use sequences from the same domain (Prokaryota, Eukaryota); any other (positive) value to only use sequences for organisms where the phylogenetic distance is at the most this large (as calculated in getPhylDist). Only used when reconstructing from KEGG annotations (no fastaFile) (default Inf, which means that all sequences will be used).

globalModel struct

structure containing both model and KOModel structures as generated by getModelFromKEGG. These will otherwise be loaded by via getModelFromKEGG. Providing globalKEGGmodel can speed up model generation if getKEGGModelForOrganism is run multiple times for different strains. Example: [globalModel.model,globalModel.KOModel] = getModelFromKEGG (default empty, global model is loaded by getModelFromKEGG).

Output arguments:

Name Type Description
model struct

the reconstructed model.

Notes

The reconstruction works in one of two modes:

  1. From KEGG annotations (no fastaFile supplied). The reactions associated with organismID in the local KEGG database are kept; maxPhylDist controls which organism annotations are considered.
  2. From protein homology (fastaFile supplied). The query proteome is searched, in a single hmmsearch, against a prebuilt KEGG-version- and domain-specific concatenated KO HMM library (e.g. kegg118_eukaryotes), downloaded from the corresponding RAVEN release if not already present in dataDir. Hits are filtered by cutOff and the minScoreRatioKO / minScoreRatioG ratios into a KO-gene matrix, from which the model is built (spontaneous reactions without genes are kept if keepSpontaneous is true).

The global KEGG model (reaction-KO relationships) is either supplied via globalModel, built from a local KEGG FTP dump in dataDir\keggdb (which requires a KEGG FTP license), or loaded from the internal RAVEN version.

Examples:

model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
   outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
   keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist);

getModelFromHomology

Construct a new model from existing models and homology.

Constructs a new model from a set of existing models and gene homology information.

Input arguments:

Name Type Description Default
models cell

a cell array of model structures to build the model from. These models must be sorted by importance in decreasing order.

required
blastStructure struct

a blastStructure as produced by getBlast or makeFakeBlastStructure.

required
getModelFor char

a three-four letter abbreviation of the organism to build a model for. Must have BLASTP hits in both directions to the organisms in "models".

required

Name-value arguments:

Name Type Description Default
preferredOrder cell

the order in which reactions should be added from the models. If not supplied, reactions will be included from all models, otherwise one gene will only result in reactions from one model (default {}).

bidirectional logical

require acceptable BLASTP hits in both directions (new→template and template→new) for a gene pair to be used. Reciprocal best-hit mapping (bidirectional=true, bestHitsOnly=true) is the most conservative option (default true).

bestHitsOnly logical

before mapping, trim each BLASTP direction to the single best hit per query gene (by scoreBy criterion). This removes many-to-many and many-to-one mappings (default false).

scoreBy char

criterion used by bestHitsOnly to select the best hit per query gene. "bitscore" selects the hit with the highest bitscore (database-size-independent). "evalue" selects the hit with the lowest E-value. (default "bitscore").

strictness double

legacy integer alias for bidirectional/bestHitsOnly (default 1):

  • 1 : bidirectional=true, bestHitsOnly=false.
  • 2 : bidirectional=false, bestHitsOnly=false.
  • 3 : bidirectional=true, bestHitsOnly=true.

Ignored when bidirectional or bestHitsOnly are supplied explicitly.

onlyGenesInModels logical

consider BLASTP results only for genes that exist in the models. This tends to import a larger fraction from the existing models but may give less reliable results. Has effect only if bestHitsOnly=true (default false).

maxE double

only look at genes with E-values <= this value (default 10^-30).

minLen double

only look at genes with alignment length >= this value (default 200).

minIde double

only look at genes with identity >= this value (default 40 (%)).

mapNewGenesToOld logical

determines how to match genes if not looking at only 1-1 orthologs. Either map the new genes to the old or old genes to new. The default is to map the new genes (default true).

Output arguments:

Name Type Description
draftModel struct

a model structure for the new organism.

hitGenes struct

collect the old and new genes.

Examples:

draftModel = getModelFromHomology(models, blastStructure, getModelFor);
Notes

The models in the "models" structure should have named the metabolites in the same manner, have their reversible reactions in the same direction (run sortModel), and use the same compartment names. To avoid keeping unneccesary old genes, the models should not have "or"-relations in their grRules (use expandModel).

The resulting draft model contains only reactions associated with orthologous genes. The old (original) genes involved in "and" relations in grRules without any orthologs are still included in the draft model as OLD_MODELID_geneName.

"to" and "from" means relative to the new organism.

getModelFromKEGG

Load the pre-built global KEGG model.

Loads the pre-built global KEGG reaction/gene model from keggModel.mat. The artefact is generated by the raven-toolbox Python package and distributed as a raven-data release asset.

Name-value arguments:

Name Type Description Default
keepSpontaneous logical

include reactions labeled as "spontaneous" (default true).

keepUndefinedStoich logical

include reactions in the form n A <=> n+1 A (default true).

keepIncomplete logical

include reactions labelled as "incomplete", "erroneous" or "unclear" (default true).

keepGeneral logical

include reactions labelled as "general reaction" (default false).

Output arguments:

Name Type Description
model struct

a model structure with all reactions and the metabolites used in them.

KOModel struct

a model structure representing KEGG Orthology ids and their associated genes. KO ids are stored as reactions.

Notes

The model output can be used as a template for fillGaps. In that case, remove the genes and rxnGeneMat fields before parsing: model = rmfield(model, 'genes'), etc.

See also

getKEGGModelForOrganism, getPhylDist, fillGaps

getPhylDist

Load the pre-built KEGG phylogenetic distance matrix.

Loads the pre-built phylogenetic distance matrix from keggPhylDist.mat. The artefact is generated by the raven-toolbox Python package and distributed as a raven-data release asset.

Name-value arguments:

Name Type Description Default
onlyInKingdom logical

if true, returns a distance matrix with Inf for cross-domain pairs (Prokaryota vs Eukaryota) (default false).

Output arguments:

Name Type Description
phylDistStruct struct

structure with fields:

  • ids : cell array of KEGG organism abbreviations.
  • distMat : pairwise distance matrix (number of tree-nodes apart).
Notes

Distance is based on the number of nodes two organisms are apart in the KEGG taxonomy tree.

getWSLpath

Translate a Windows-style path to its Unix WSL equivalent.

Translate a Windows-style path to its Unix WSL (Windows Subsystem for Linux) equivalent.

Input arguments:

Name Type Description Default
path char

string with directory of file path, in Windows-style (e.g. 'C:\Directory\').

required

Output arguments:

Name Type Description
path char

string with directory of file path, in Unix style (e.g. '/mnt/c/Directory/').

Examples:

path = getWSLpath(path);
Notes

Uses the WSL function 'wslpath' to translate the path.

guessComposition

Guess the composition of metabolites without one.

Attempts to guess the composition of metabolites without information about elemental composition.

Input arguments:

Name Type Description Default
model struct

a model structure.

required

Name-value arguments:

Name Type Description Default
printResults logical

true if the output should be printed (default true).

Output arguments:

Name Type Description
model struct

a model structure with information about elemental composition added.

guessedFor double

indexes for the metabolites for which a composition could be guessed.

couldNotGuess double

indexes for the metabolites for which no composition could be assigned.

Examples:

[model, guessedFor, couldNotGuess] = guessComposition(model, printResults);
Notes

This function works in a rather straight forward manner:

  1. Get the metabolites which lack composition and participates in at least one reaction where all other metabolites have composition information.
  2. Loop through them and calculate their composition based on the rest of the involved metabolites. If there are any inconsistencies, so that a given metabolite should have different composition in different equations, then throw an error.
  3. Go to 1.

This simple approach requires that the rest of the metabolites have correct composition information, and that the involved reactions are correct. The function will exit with an error on any inconsistencies, which means that it could also be used as a way of checking the model for errors. Note that just because this exits sucessfully, the calculated compositions could still be wrong (in case that the existing compositions were wrong).

makeFakeBlastStructure

Make a fake blastStructure from an ortholog list.

This is a structure that would normally be generated by getBlast. It allows to feed a predefined list of orthologs to getModelFromHomology while retaining the further use of that function. For this function to work, it is crucial that the orthologList is a cell array where the first column contains the genes from the source organism, and the second column contains the genes from the target organism.

Input arguments:

Name Type Description Default
orthologList cell

cell array of orthologous genes, where the first column contains the genes from the source organism, while the second column contains the genes from the target organism.

required
sourceModelID char

ID of the model that will be used as template, that contains the genes in the first column of orthologList.

required
getModelFor char

the name of the organism to build a model for, identical to the getModelFor parameter in the getModelFromHomology function.

required

Output arguments:

Name Type Description
blastStructure struct

a fake blastStructure, where the evalue, identity and aligLen are set at extreme values, such that all orthologous pairs will pass the filter when running getModelFromHomology.

Examples:

blastStructure = makeFakeBlastStructure(orthologList,sourceModelID,getModelFor);
See also

getModelFromHomology, getBlast