Tutorial 5 — Reconstruct a GEM from KEGG¶

This exercise creates a model from KEGG, based on protein sequences in a FASTA file, and runs some functionality checks on the result. The example organism is Saccharomyces cerevisiae.

Unlike Tutorials 1–4, this is more of a showcase: its main purpose is to serve as a scaffold you can adapt to reconstruct a GEM for any organism.

Runtime

De novo reconstruction performs sequence searches against large HMM databases. Building the model takes up to 20–35 minutes on macOS and Unix systems and 40–55 minutes on Windows, depending on your hardware and the size of the target organism's proteome. gapReport can take several to many hours, depending on the number of gaps in the model.

Step by step¶

1. Reconstruct from KEGG¶

Start by downloading trained Hidden Markov Models for eukaryotes. This can be done automatically or manually from the RAVEN Wiki in its GitHub repository; here the archive euk90_kegg105 is picked for automatic download. See the RAVEN Wiki for more information regarding preparation of such an archive.

getKEGGModelForOrganism then creates a model for S. cerevisiae from the whole-proteome FASTA (sce.fa). The parameters are set to exclude general or unclear reactions and reactions with undefined stoichiometry. Type help getKEGGModelForOrganism to see what the different parameters are for.

model=getKEGGModelForOrganism('sce','sce.fa','euk90_kegg105','output',false,false,false,false,10^-30,0.8,0.3,-1);
disp(model);

The resulting model should contain around 1589 reactions, 1600 metabolites and 836 genes. Small variations are possible since it is a heuristic algorithm, and different KEGG versions give slightly different results.

2. Remove reactions that make something from nothing¶

A first control is that the model should not be able to produce any metabolites without uptake of some metabolites. This commonly happens when metabolites have a different meaning in different reactions. removeBadRxns tries to find and remove such reactions in an automated manner; type help removeBadRxns for details.

[newModel, removedRxns]=removeBadRxns(model);

This reports that H⁺ can be made even if no reactions were unbalanced. Protons are particularly problematic since it is rather arbitrary at which pH the formulas are written. For this analysis the protons can be ignored and fixed later, so rerun while allowing H⁺ to be produced.

[newModel, removedRxns]=removeBadRxns(model,1,{'H+'},true);
disp(removedRxns);

Only one reaction was removed because it enabled the model to produce something from nothing. Since it is only one reaction, it is worthwhile to look into it in more detail.

3. Investigate the problematic reaction¶

According to KEGG, the removed reaction is a general polymer reaction. Use makeSomething to look at the flux distributions in more detail and find out if there is a better alternative to delete.

[fluxes, metabolite]=makeSomething(model,{'H+'},true);
model.metNames(metabolite)
printFluxes(model, fluxes, false, [], [],'%rxnID (%rxnName):\n\t%eqn: %flux\n')

This shows the model could produce H₂O, and results in quite a lot of fluxes to look through. It is easier if the elementally balanced reactions are excluded. Since water was produced, only look at the reactions unbalanced for oxygen (column 6 of the elemental balance).

balanceStructure=getElementalBalance(model);
goodOnes=balanceStructure.leftComp(:,6)==balanceStructure.rightComp(:,6);
printFluxes(removeReactions(model,goodOnes), fluxes(~goodOnes), false, [], [],'%rxnID (%rxnName):\n\t%eqn: %flux\n');

There is still a good number of reactions. Leave only the reactions which involve amylose or starch, from one of the problematic reactions identified earlier.

printFluxes(model, fluxes, false, [], [],'%rxnID (%rxnName):\n\t%eqn: %flux\n',{'Amylose';'Starch'});

There are two elementally unbalanced reactions, including the one identified by removeBadRxns. They contradict each other: the first shows that amylose and starch are interconvertible, while the second shows that amylose contains one less glucose unit than starch. Such general reactions should be fixed manually. Trusting removeBadRxns, delete R02110.

model=removeReactions(model,'R02110');

4. Check consumption and add uptakes¶

The model can no longer make something from nothing. Check whether it can consume something without any output.

[solution, metabolite]=consumeSomething(model,{'H+'},true);
model.metNames(metabolite)

Nothing is consumed without output, so that is good. Now add some uptakes and see what the model can produce.

[~, J]=ismember({'D-Glucose';'H2O';'Orthophosphate';'Oxygen';'NH3';'Sulfate'},model.metNames);
[model, addedRxns]=addExchangeRxns(model,'in',J);

canProduce reports which metabolites can be produced given these uptakes. It allows output of all metabolites — which does not happen in a real cell, but is very useful for functionality testing.

I=canProduce(model);
fprintf('%d%%\n', round(sum(I)/numel(model.mets)*100));

Around 31% of the metabolites could be synthesized. It is not directly clear whether this is high or low — many metabolites should not be possible to synthesize from those simple precursors.

5. Fill gaps using the full KEGG model¶

Try to fill gaps using the full KEGG model to see if that gives a significantly higher number. getModelFromKEGG retrieves the full KEGG model. It is associated with more than 6,400,000 genes that are not used for gap-filling, so they are removed to make this a little faster.

keggModel=getModelFromKEGG([],false,false,false,false);
keggModel=rmfield(keggModel,'genes');
keggModel=rmfield(keggModel,'rxnGeneMat');

It is already known that there are some unbalanced reactions in KEGG. Only the balanced ones are used for gap-filling.

balanceStructure=getElementalBalance(keggModel);
keggModel=removeReactions(keggModel,balanceStructure.balanceStatus~=1,true,true);

fillGaps with these settings tries to include reactions so that there is flux through all reactions in the model. The first flag says that production of all metabolites should be allowed.

params.relGap=0.6; %Lower number for a more exhaustive search
params.printReport=true;
[newConnected, cannotConnect, addedRxns, newModel, exitFlag]=fillGaps(model,keggModel,true,false,false,[],params);

The results show that fillGaps could connect around 29 reactions (newConnected) by including around 41 reactions from the KEGG model (addedRxns). These should of course be checked manually to confirm that they exist in yeast, but here it is assumed they all occur in yeast.

6. Report on connectivity¶

Continue to improve the connectivity of the model by identifying metabolites that should be connected. gapReport gives a convenient overview of how connected the model is, together with a lot of useful data.

[noFluxRxns, noFluxRxnsRelaxed, subGraphs, notProducedMets, minToConnect,...
    neededForProductionMat]=gapReport(newModel);

The results show that 532 from 1634 reactions cannot carry flux. Because output is allowed for all metabolites, it calculates 532 in both cases. 543 from 1617 metabolites cannot be synthesized from the supplied precursors. There are 6 subnetworks in the model, of which 1605 from 1617 metabolites belong to the first one.

gapReport also prints the metabolites to connect: to enable net production of all metabolites, a total of 322 metabolites must be connected, with the top ranked one being Acyl-CoA (which connects 162 metabolites). Connecting only the top 10 in the list already connects 271/543 (50%) of them. This is a very useful way of directing the gap-filling tasks to where they are of greatest use.

7. Add minimal-media uptakes and iterate¶

The list reveals that yeast cannot grow on only the substrates defined so far; it requires some other precursors for co-factor synthesis as well. Add uptake reactions for the minimal-media constituents needed for yeast to grow.

[~, J]=ismember({'4-Aminobenzoate';'Riboflavin';'Thiamine';'Biotin';'Folate';'Nicotinate';'Zymosterol';'Choline'},newModel.metNames);
[newModel, addedRxns]=addExchangeRxns(newModel,'in',J);

Rerun gapReport and use the output for targeting the gap-filling efforts. Note that only some info is printed; most of it is available in the output structures. Work like this in an iterative manner until the model is of sufficient quality.

Full script¶

% tutorial5
%   This exercise is about creating a model from KEGG, based on protein
%   sequences in a FASTA file, and doing some functionality checks on the
%   model. The example case is for the yeast Saccharomyces cerevisiae. This
%   tutorial is more of a showcase than the previous four, and its main
%   purpose is to serve as a scaffold to reconstruct a GEM for any
%   organism.
%   This refers to Tutorial 5 from "RAVEN tutorials.docx"
% 
% Start by downloading trained Hidden Markov Models for eukaryotes. This can
% be done automatically or manually from the RAVEN Wiki in its GitHub
% repository. In this tutorial, the archive "kegg118_eukaryotes" is picked for
% the automatic download. See the documentation in the RAVEN Wiki for more
% information regarding preparation of such archive.
% 
% This creates a model for S. cerevisiae. The parameters are set to exclude
% general or unclear reactions and reactions with undefined stoichiometry.
% Type "help getKEGGModelForOrganism" to see what the different parameters
% are for. This process takes up to 20-35 minutes in macOS, Unix systems and
% 40-55 minutes in Windows, depending on your hardware and the size of
% target organism proteome
model=getKEGGModelForOrganism('sce','sce.fa','kegg118_eukaryotes','output',false,false,false,false,10^-30,0.8,0.3,-1);

% The resulting model should contain around 1589 reactions, 1600
% metabolites and 836 genes. Small variations are possible since it is an
% heuristic algorithm and different KEGG versions will give slightly
% different results.
disp(model);

% A first control is that the model should not be able to produce any
% metabolites without uptake of some metabolites. This commonly happens when
% metabolites have a different meaning in different reactions. The best way
% to find such reactions is to run makeSomething and analyze the resulting
% solution for bad reactions. An automated approach is to use removeBadRxns,
% which tries to do the same thing in an automated manner. Type
% "help removeBadRxns" for details.
[newModel, removedRxns]=removeBadRxns(model);

% One can see an error about that H+ can be made even if no reactions were
% unbalanced. Protons are particularly problematic since it is rather
% arbitary at which pH the formulas are written for. For the purpose of this
% analysis, the protons can be ignored and fixed later.
[newModel, removedRxns]=removeBadRxns(model,1,{'H+'},true);

% Only one reaction was removed because it enabled the model to produce
% something from nothing. Since it is only one reaction, it might be
% worthwhile to look into this in more detail.
disp(removedRxns);

% According to the information in KEGG about this reaction it is a general
% polymer reaction. One might want to look at the flux distributions
% in more detail to try to find out if there is any better alternative to
% delete. Use makeSomething to do this.
[fluxes, metabolite]=makeSomething(model,{'H+'},true);
model.metNames(metabolite)

% The model could produce H2O using the following reactions
printFluxes(model, fluxes, false, [], [],'%rxnID (%rxnName):\n\t%eqn: %flux\n')

% That resulted in quite a lot of fluxes to look through. Maybe it is easier
% if the elementally balanced reactions are excluded.
balanceStructure=getElementalBalance(model);

% The hydrogen balancing is a bit tricky, so this time it seems more
% relevant to only look at the ones unbalanced for oxygen (since water was
% produced)
goodOnes=balanceStructure.leftComp(:,6)==balanceStructure.rightComp(:,6);
printFluxes(removeReactions(model,goodOnes), fluxes(~goodOnes), false, [], [],'%rxnID (%rxnName):\n\t%eqn: %flux\n');

% There is still a good number of reactions. Leave only the reactions which
% involve amylose or starch, from one of the problematic reactions that was
% identified earlier.
printFluxes(model, fluxes, false, [], [],'%rxnID (%rxnName):\n\t%eqn: %flux\n',{'Amylose';'Starch'});

% There are two elementally unbalanced reactions, including, the reaction
% which was identified by removeBadRxns. When looking to these reactions
% closer, one can notice the contradiction between the reactions. The first
% one shows that amylose and starch are interconvertible, while the second
% reaction shows that amylose contains one less glucose unit than starch.
% This type of general reactions are problematic and should be fixed
% manually. It is hereby chosen to trust removeBadRxns and delete R02110.
model=removeReactions(model,'R02110');

% The model can no longer make something from nothing. Can it consume
% something without any output?
[solution, metabolite]=consumeSomething(model,{'H+'},true);
model.metNames(metabolite)

% Nope, so that was good. Add some uptakes and see what it can produce.
[~, J]=ismember({'D-Glucose';'H2O';'Orthophosphate';'Oxygen';'NH3';'Sulfate'},model.metNames);
[model, addedRxns]=addExchangeRxns(model,'in',J);

% Check which metabolites can be produced given these uptakes. The
% canProduce function allows for output of all metabolites. This will not
% happen in the real cell, but it is very useful for functionality testing
% of the model. Once it is functional, the excretion reactions based on
% evidence can be added as well.
I=canProduce(model);

fprintf('%d%%\n', round(sum(I)/numel(model.mets)*100));
% It seems that around 31% of the metabolites could be synthesized. It is
% not directly clear whether this is a high or low number, many metabolites
% should not be possible to synthesize from those simple precursors.

% Try to fill gaps using the full KEGG model to see if that gives a
% significantly higher number
keggModel=getModelFromKEGG(false,false,false,false);

% The KEGG model is associated to more than 6,400,000 genes. They will not
% be used for the gapfilling, so they are removed to make this a little
% faster
keggModel=rmfield(keggModel,'genes');
keggModel=rmfield(keggModel,'rxnGeneMat');

% It is already known from the previous commands that there are some
% unbalanced reactions in KEGG. Only use the balanced ones for the gap
% filling
balanceStructure=getElementalBalance(keggModel);
keggModel=removeReactions(keggModel,balanceStructure.balanceStatus~=1,true,true);

% The function fillGaps with these settings will try to include reactions in
% order to have flux through all reactions in the model. There are other
% settings as well. The first flag says that production of all metabolites
% should be allowed.
params.relGap=0.6; %Lower number for a more exhaustive search
params.printReport=true;
[newConnected, cannotConnect, addedRxns, newModel, exitFlag]=fillGaps(model,keggModel,true,false,false,[],params);

% The results show that fillGaps could connect around 29 reactions
% (newConnected) by including around 41 reactions from the KEGG model
% (addedRxns). Those should of course be checked manually to confirm that
% they exist in yeast, but in this tutorial it is assumed that all these
% reactions indeed occur in yeast.

% Continue to improve the connectivity of the model by identifying
% metabolites that should be connected. A convenient way to get an overview
% of how connected the model is, and at the same time getting a lot of
% useful data, is to use gapReport. Note that it can take several to many
% hours to run, depending on the number of gaps in the model.
[noFluxRxns, noFluxRxnsRelaxed, subGraphs, notProducedMets, minToConnect,...
    neededForProductionMat]=gapReport(newModel);

% The results show that 532 from 1634 reactions cannot carry flux. Remember
% that an output is allowed for all metabolites, that is why it calculates
% 532 in both cases. 543 from 1617 metabolites cannot be synthesized from
% the precursors which are supplied. There are 6 subnetworks in the model,
% of which 1605 from 1617 metabolites belong to the first one.

% It will also print something similar to:

% To enable net production of all metabolites, a total of 322 metabolites
% must be connected
% Top 10 metabolites to connect:
%   1. Acyl-CoA[s] (connects 162 metabolites)
%   2. Thiamin diphosphate[s] (connects 17 metabolites)
%   3. G00003[s] (connects 14 metabolites)
%   4. Androstenediol[s] (connects 13 metabolites)
%   5. [Dihydrolipoyllysine-residue succinyltransferase] S-glutaryldihydrolipoyllysine[s] (connects 13 metabolites)
%   6. Selenomethionyl-tRNA(Met)[s] (connects 12 metabolites)
%   7. Retinyl ester[s] (connects 11 metabolites)
%   8. 1-Alkyl-2-acylglycerol[s] (connects 10 metabolites)
%   9. G00146[s] (connects 10 metabolites)
%   10. Progesterone[s] (connects 9 metabolites)

% This is a very useful way of directing the gap-filling tasks to where they
% are of the greatest use. In this case it says that in order to have net
% production of all metabolites in the model from the simple inputs which
% have been defined (543 metabolites can currently not have net production)
% one needs to connect 322 unconnected metabolites in. However, by
% connecting only the top 10 in the list 271/543 (50%) of them will now be
% connected. When looking at the list one can see that the top ranking one
% is Acyl-CoA, which represents many different fatty acid chains. The second
% metabolite, thiamine disphosphate, is an enzyme co-factor. It turns out
% that yeast cannot grow on only the substrates that had been defined, but
% that it requires some other precursors for co-factor synthesis as well.

% Add uptake reactions for the minimal media constituents needed for yeast
% to grow.
[~, J]=ismember({'4-Aminobenzoate';'Riboflavin';'Thiamine';'Biotin';'Folate';'Nicotinate';'Zymosterol';'Choline'},newModel.metNames);
[newModel, addedRxns]=addExchangeRxns(newModel,'in',J);

% Rerun gapReport and use the output for targeting the gap-filling efforts.
% Note that only some info is printed; most of it is available in the output
% structures. Work like this in an iterative manner until the model is of
% sufficient quality.