Reconstruction¶
MATLAB functions in RAVEN/reconstruction of the RAVEN toolbox. Help text is collected from the source of the tracked branch.
Functions¶
| Function | Summary |
|---|---|
getBlast |
|
getDiamond |
|
getKEGGModelForOrganism |
Reconstruct a model from KEGG protein homology. |
getModelFromHomology |
|
getModelFromKEGG |
Load the pre-built global KEGG model. |
getPhylDist |
Load the pre-built KEGG phylogenetic distance matrix. |
getWSLpath |
Translate a Windows-style path to its Unix WSL equivalent. |
guessComposition |
Guess the composition of metabolites without one. |
makeFakeBlastStructure |
Make a fake blastStructure from an ortholog list. |
Reference¶
getBlast¶
Bidirectional BLAST between an organism and template organisms.
Input arguments:
| Name | Type | Description | Default |
|---|---|---|---|
organismID
|
char
|
the id of the organism of interest. This should also match with the id supplied to getModelFromHomology. |
required |
fastaFile
|
char
|
a FASTA file with the protein sequences for the organism of interest. |
required |
modelIDs
|
cell
|
a cell array of model ids. These must match the "model.id" fields in the "models" structure if the output is to be used with getModelFromHomology. |
required |
refFastaFiles
|
cell
|
a cell array with the paths to the corresponding FASTA files. |
required |
Name-value arguments:
| Name | Type | Description | Default |
|---|---|---|---|
developMode
|
logical
|
true if blastReport should be generated that is used in the unit testing function for BLAST+ (default false). |
|
hideVerbose
|
logical
|
true if no status messages should be printed (default false). |
Output arguments:
| Name | Type | Description |
|---|---|---|
blastStructure
|
struct
|
structure containing the bidirectional homology measurements that can be used by getModelFromHomology. |
blastReport
|
struct
|
structure containing MD5 hashes for FASTA database files and non-parsed BLAST output data. Will be blank if developMode is false. |
Notes
This function calls BLAST+ to perform a bidirectional homology test between the organism of interest and a set of other organisms using standard settings. The only filtering this function does is the removal of hits with an E-value higher than 10e-5.
Examples:
[blastStructure,blastReport] = getBlast(organismID,fastaFile,...
modelIDs,refFastaFiles,developMode,hideVerbose);
See also
getModelFromHomology, getDiamond
getDiamond¶
Bidirectional BLAST with DIAMOND against template organisms.
Input arguments:
| Name | Type | Description | Default |
|---|---|---|---|
organismID
|
char
|
the id of the organism of interest. This should also match with the id supplied to getModelFromHomology. |
required |
fastaFile
|
char
|
a FASTA file with the protein sequences for the organism of interest. |
required |
modelIDs
|
cell
|
a cell array of model ids. These must match the "model.id" fields in the "models" structure if the output is to be used with getModelFromHomology. |
required |
refFastaFiles
|
cell
|
a cell array with the paths to the corresponding FASTA files. |
required |
Name-value arguments:
| Name | Type | Description | Default |
|---|---|---|---|
developMode
|
logical
|
true if blastReport should be generated that is used in the unit testing function for DIAMOND (default false). |
|
hideVerbose
|
logical
|
true if no status messages should be printed (default false). |
Output arguments:
| Name | Type | Description |
|---|---|---|
blastStructure
|
struct
|
structure containing the bidirectional homology measurements which are used by getModelFromHomology. |
diamondReport
|
struct
|
structure containing MD5 hashes for FASTA database files and non-parsed BLAST output data. Will be blank if developMode is false. |
Notes
This function calls DIAMOND to perform a bidirectional homology search between the organism of interest and a set of other organisms using the "--more-sensitive" setting from DIAMOND. For the most sensitive results, the use of getBlast() is adviced, however, getDiamond() is a fast alternative (>15x faster). The blastStructure generated is in the same format as those obtained from getBlast().
Examples:
[blastStructure,diamondReport] = getDiamond(organismID,fastaFile,...
modelIDs,refFastaFiles,developMode,hideVerbose);
See also
getModelFromHomology, getBlast
getKEGGModelForOrganism¶
Reconstruct a model from KEGG protein homology.
Reconstructs a genome-scale metabolic model based on protein homology to the orthologies in KEGG. If the target species is not available in KEGG, the user must select a closely related species. It is also possible to circumvent protein homology search (see fastaFile parameter for more details).
Input arguments:
| Name | Type | Description | Default |
|---|---|---|---|
organismID
|
char
|
three or four letter abbreviation of the organism (as used in KEGG). If not available, use a closely related species. This is used for determing the phylogenetic distance. Use "eukaryotes" or "prokaryotes" to get a model for the whole domain. Only applicable if fastaFile is empty, i.e. no homology search should be performed. |
required |
Name-value arguments:
| Name | Type | Description | Default |
|---|---|---|---|
fastaFile
|
char
|
a FASTA file that contains the protein sequences of the organism for which to reconstruct a model. If no FASTA file is supplied then a model is reconstructed based only on the organism abbreviation. This option ignores all settings except for keepSpontaneous, keepUndefinedStoich, keepIncomplete and keepGeneral. |
|
dataDir
|
char
|
directory for which to retrieve the input data, styled as kegg118_prokaryotes or kegg118_eukaryotes, indicating the KEGG version and whether the HMMs were trained on pro- or eukaryotic sequences. The directory name matches the published HMM library it is paired with. The prebuilt concatenated KO HMM library (dataDir.hmm) is downloaded here from the corresponding RAVEN release if not already present. May also This parameter should ALWAYS be provided. |
|
outDir
|
char
|
directory to save the results from the quering of the hidden Markov models. The output is specific for the input sequences and the settings used. It is stored in this manner so that the function can continue if interrupted or if it should run in parallel. Be careful not to leave output files from different organisms or runs with different settings in the same folder. They will not be overwritten (default is a temporary dir where all *.out files are deleted before and after doing the reconstruction). |
|
keepSpontaneous
|
logical
|
include reactions labeled as "spontaneous" (default true). |
|
keepUndefinedStoich
|
logical
|
include reactions in the form n A <=> n+1 A. These will be dealt with as two separate metabolites (default true). |
|
keepIncomplete
|
logical
|
include reactions which have been labelled as "incomplete", "erroneous" or "unclear" (default true). |
|
keepGeneral
|
logical
|
include reactions which have been labelled as "general reaction". These are reactions on the form "an aldehyde <=> an alcohol", and are therefore unsuited for modelling purposes. Note that not all reactions have this type of annotation, and the script will therefore not be able to remove all such reactions (default false). |
|
cutOff
|
double
|
significance score from HMMer needed to assign genes to a KO (default 10^-30). |
|
minScoreRatioKO
|
double
|
ignore genes in a KO if their score is <log(score)/log(best score in KO). This is to "prune" KOs which have many genes and where some are clearly a better fit (default 0.3, lower is less strict). |
|
minScoreRatioG
|
double
|
a gene is only assigned to KOs for which the score is
|
|
maxPhylDist
|
double
|
-1 to only use sequences from the same domain (Prokaryota, Eukaryota); any other (positive) value to only use sequences for organisms where the phylogenetic distance is at the most this large (as calculated in getPhylDist). Only used when reconstructing from KEGG annotations (no fastaFile) (default Inf, which means that all sequences will be used). |
|
globalModel
|
struct
|
structure containing both model and KOModel structures as generated by getModelFromKEGG. These will otherwise be loaded by via getModelFromKEGG. Providing globalKEGGmodel can speed up model generation if getKEGGModelForOrganism is run multiple times for different strains. Example: [globalModel.model,globalModel.KOModel] = getModelFromKEGG (default empty, global model is loaded by getModelFromKEGG). |
Output arguments:
| Name | Type | Description |
|---|---|---|
model
|
struct
|
the reconstructed model. |
Notes
The reconstruction works in one of two modes:
- From KEGG annotations (no fastaFile supplied). The reactions associated with organismID in the local KEGG database are kept; maxPhylDist controls which organism annotations are considered.
- From protein homology (fastaFile supplied). The query proteome is searched, in a single hmmsearch, against a prebuilt KEGG-version- and domain-specific concatenated KO HMM library (e.g. kegg118_eukaryotes), downloaded from the corresponding RAVEN release if not already present in dataDir. Hits are filtered by cutOff and the minScoreRatioKO / minScoreRatioG ratios into a KO-gene matrix, from which the model is built (spontaneous reactions without genes are kept if keepSpontaneous is true).
The global KEGG model (reaction-KO relationships) is either supplied via globalModel, built from a local KEGG FTP dump in dataDir\keggdb (which requires a KEGG FTP license), or loaded from the internal RAVEN version.
Examples:
model=getKEGGModelForOrganism(organismID,fastaFile,dataDir,...
outDir,keepSpontaneous,keepUndefinedStoich,keepIncomplete,...
keepGeneral,cutOff,minScoreRatioKO,minScoreRatioG,maxPhylDist);
getModelFromHomology¶
Construct a new model from existing models and homology.
Constructs a new model from a set of existing models and gene homology information.
Input arguments:
| Name | Type | Description | Default |
|---|---|---|---|
models
|
cell
|
a cell array of model structures to build the model from. These models must be sorted by importance in decreasing order. |
required |
blastStructure
|
struct
|
a blastStructure as produced by getBlast or makeFakeBlastStructure. |
required |
getModelFor
|
char
|
a three-four letter abbreviation of the organism to build a model for. Must have BLASTP hits in both directions to the organisms in "models". |
required |
Name-value arguments:
| Name | Type | Description | Default |
|---|---|---|---|
preferredOrder
|
cell
|
the order in which reactions should be added from the models. If not supplied, reactions will be included from all models, otherwise one gene will only result in reactions from one model (default {}). |
|
bidirectional
|
logical
|
require acceptable BLASTP hits in both directions (new→template and template→new) for a gene pair to be used. Reciprocal best-hit mapping (bidirectional=true, bestHitsOnly=true) is the most conservative option (default true). |
|
bestHitsOnly
|
logical
|
before mapping, trim each BLASTP direction to the single best hit per query gene (by scoreBy criterion). This removes many-to-many and many-to-one mappings (default false). |
|
scoreBy
|
char
|
criterion used by bestHitsOnly to select the best hit per query gene. "bitscore" selects the hit with the highest bitscore (database-size-independent). "evalue" selects the hit with the lowest E-value. (default "bitscore"). |
|
strictness
|
double
|
legacy integer alias for bidirectional/bestHitsOnly (default 1):
Ignored when bidirectional or bestHitsOnly are supplied explicitly. |
|
onlyGenesInModels
|
logical
|
consider BLASTP results only for genes that exist in the models. This tends to import a larger fraction from the existing models but may give less reliable results. Has effect only if bestHitsOnly=true (default false). |
|
maxE
|
double
|
only look at genes with E-values <= this value (default 10^-30). |
|
minLen
|
double
|
only look at genes with alignment length >= this value (default 200). |
|
minIde
|
double
|
only look at genes with identity >= this value (default 40 (%)). |
|
mapNewGenesToOld
|
logical
|
determines how to match genes if not looking at only 1-1 orthologs. Either map the new genes to the old or old genes to new. The default is to map the new genes (default true). |
Output arguments:
| Name | Type | Description |
|---|---|---|
draftModel
|
struct
|
a model structure for the new organism. |
hitGenes
|
struct
|
collect the old and new genes. |
Examples:
draftModel = getModelFromHomology(models, blastStructure, getModelFor);
Notes
The models in the "models" structure should have named the metabolites in the same manner, have their reversible reactions in the same direction (run sortModel), and use the same compartment names. To avoid keeping unneccesary old genes, the models should not have "or"-relations in their grRules (use expandModel).
The resulting draft model contains only reactions associated with orthologous genes. The old (original) genes involved in "and" relations in grRules without any orthologs are still included in the draft model as OLD_MODELID_geneName.
"to" and "from" means relative to the new organism.
getModelFromKEGG¶
Load the pre-built global KEGG model.
Loads the pre-built global KEGG reaction/gene model from keggModel.mat. The artefact is generated by the raven-toolbox Python package and distributed as a raven-data release asset.
Name-value arguments:
| Name | Type | Description | Default |
|---|---|---|---|
keepSpontaneous
|
logical
|
include reactions labeled as "spontaneous" (default true). |
|
keepUndefinedStoich
|
logical
|
include reactions in the form n A <=> n+1 A (default true). |
|
keepIncomplete
|
logical
|
include reactions labelled as "incomplete", "erroneous" or "unclear" (default true). |
|
keepGeneral
|
logical
|
include reactions labelled as "general reaction" (default false). |
Output arguments:
| Name | Type | Description |
|---|---|---|
model
|
struct
|
a model structure with all reactions and the metabolites used in them. |
KOModel
|
struct
|
a model structure representing KEGG Orthology ids and their associated genes. KO ids are stored as reactions. |
Notes
The model output can be used as a template for fillGaps. In that case, remove the genes and rxnGeneMat fields before parsing: model = rmfield(model, 'genes'), etc.
See also
getKEGGModelForOrganism, getPhylDist, fillGaps
getPhylDist¶
Load the pre-built KEGG phylogenetic distance matrix.
Loads the pre-built phylogenetic distance matrix from keggPhylDist.mat. The artefact is generated by the raven-toolbox Python package and distributed as a raven-data release asset.
Name-value arguments:
| Name | Type | Description | Default |
|---|---|---|---|
onlyInKingdom
|
logical
|
if true, returns a distance matrix with Inf for cross-domain pairs (Prokaryota vs Eukaryota) (default false). |
Output arguments:
| Name | Type | Description |
|---|---|---|
phylDistStruct
|
struct
|
structure with fields:
|
Notes
Distance is based on the number of nodes two organisms are apart in the KEGG taxonomy tree.
getWSLpath¶
Translate a Windows-style path to its Unix WSL equivalent.
Translate a Windows-style path to its Unix WSL (Windows Subsystem for Linux) equivalent.
Input arguments:
| Name | Type | Description | Default |
|---|---|---|---|
path
|
char
|
string with directory of file path, in Windows-style (e.g. 'C:\Directory\'). |
required |
Output arguments:
| Name | Type | Description |
|---|---|---|
path
|
char
|
string with directory of file path, in Unix style (e.g. '/mnt/c/Directory/'). |
Examples:
path = getWSLpath(path);
Notes
Uses the WSL function 'wslpath' to translate the path.
guessComposition¶
Guess the composition of metabolites without one.
Attempts to guess the composition of metabolites without information about elemental composition.
Input arguments:
| Name | Type | Description | Default |
|---|---|---|---|
model
|
struct
|
a model structure. |
required |
Name-value arguments:
| Name | Type | Description | Default |
|---|---|---|---|
printResults
|
logical
|
true if the output should be printed (default true). |
Output arguments:
| Name | Type | Description |
|---|---|---|
model
|
struct
|
a model structure with information about elemental composition added. |
guessedFor
|
double
|
indexes for the metabolites for which a composition could be guessed. |
couldNotGuess
|
double
|
indexes for the metabolites for which no composition could be assigned. |
Examples:
[model, guessedFor, couldNotGuess] = guessComposition(model, printResults);
Notes
This function works in a rather straight forward manner:
- Get the metabolites which lack composition and participates in at least one reaction where all other metabolites have composition information.
- Loop through them and calculate their composition based on the rest of the involved metabolites. If there are any inconsistencies, so that a given metabolite should have different composition in different equations, then throw an error.
- Go to 1.
This simple approach requires that the rest of the metabolites have correct composition information, and that the involved reactions are correct. The function will exit with an error on any inconsistencies, which means that it could also be used as a way of checking the model for errors. Note that just because this exits sucessfully, the calculated compositions could still be wrong (in case that the existing compositions were wrong).
makeFakeBlastStructure¶
Make a fake blastStructure from an ortholog list.
This is a structure that would normally be generated by getBlast. It allows to feed a predefined list of orthologs to getModelFromHomology while retaining the further use of that function. For this function to work, it is crucial that the orthologList is a cell array where the first column contains the genes from the source organism, and the second column contains the genes from the target organism.
Input arguments:
| Name | Type | Description | Default |
|---|---|---|---|
orthologList
|
cell
|
cell array of orthologous genes, where the first column contains the genes from the source organism, while the second column contains the genes from the target organism. |
required |
sourceModelID
|
char
|
ID of the model that will be used as template, that contains the genes in the first column of orthologList. |
required |
getModelFor
|
char
|
the name of the organism to build a model for, identical to the getModelFor parameter in the getModelFromHomology function. |
required |
Output arguments:
| Name | Type | Description |
|---|---|---|
blastStructure
|
struct
|
a fake blastStructure, where the evalue, identity and aligLen are set at extreme values, such that all orthologous pairs will pass the filter when running getModelFromHomology. |
Examples:
blastStructure = makeFakeBlastStructure(orthologList,sourceModelID,getModelFor);
See also
getModelFromHomology, getBlast