Skip to content

Reconstruction (Python)

raven-toolbox objects in raven_toolbox.reconstruction, collected from the source of the tracked branch.

Functions

Function Summary
assemble_model_from_ko_genes Build a draft model from a {ko: [gene, ...]} assignment.
assign_kos Assign genes to KOs from HMM hits, applying the cut-off and ratio filters.
blast_from_table Load a precomputed homology hits table (CSV path or DataFrame).
build_hmm_library Build a domain ("prokaryotes"/"eukaryotes") HMM library.
build_kegg_tables Build the minimal relational tables from parsed records.
build_ko_fastas Write one <KO>.fa per KO with its member genes' sequences.
build_ko_hmm Cluster, align and train a profile HMM for one KO's multi-FASTA.
build_reference_model Assemble the gene-free KEGG reference model from parsed records.
download_kegg_dump Fetch and arrange a complete KEGG dump into dest.
extract_kegg_dump Extract and arrange the downloaded archives into the flat dump layout.
fetch_kegg_files Download the raw KEGG archives into dest (basenames). Returns the paths.
flag_set Reaction ids whose column flag is truthy (handles bool or TSV strings).
get_kegg_model_for_organism Reconstruct a draft model for a KEGG species from its KO annotations.
get_kegg_model_for_organism_from_artefacts Load the published 3b.2 artefacts from artefact_dir and build the model.
get_kegg_model_from_sequences Reconstruct a draft model for a proteome by HMM-searching the KO library.
get_kegg_model_from_sequences_with_artefacts Load reference model + tables from artefact_dir and run the HMM query.
get_model_from_homology Build a draft model for model_for by transferring reactions from templates.
HomologyResult Result of :func:get_model_from_homology.
KeggCompound A metabolite parsed from the KEGG compound flat file.
KeggKO A KEGG Orthology entry: its name and the organism genes assigned to it.
KeggReaction A reaction parsed from the KEGG reaction flat file.
make_ortholog_hits Build a bidirectional hits table from a predefined ortholog list.
organism_domains Return {organism_id: domain} (the outermost category).
organisms_in_domain Organism ids whose outermost category matches domain (case-insensitive).
parse_hmmsearch_tblout Parse hmmsearch --tblout text into a [ko, gene, evalue] table.
parse_kegg_compounds Parse <kegg_dir>/compound (+ optional compound.inchi) into records.
parse_kegg_dump Parse a full KEGG dump into the reference model + tables and write them out.
parse_kegg_kos Parse <kegg_dir>/ko into :class:KeggKO records (name + organism genes).
parse_kegg_reactions Parse <kegg_dir>/reaction into :class:KeggReaction records.
parse_taxonomy Return {organism_id: [category, ...]} from outermost to innermost.
parse_taxonomy_records Return [(organism_id, name, [category, ...]), ...] in file order.
phyl_dist Compute the KEGG phylogenetic distance matrix from the taxonomy file.
PhylDist KEGG phylogenetic distance structure — RAVEN getPhylDist / keggPhylDist.
read_kegg_table Read a KEGG table written by :func:write_kegg_tables or
run_blast Bidirectional BLAST+ between an organism and template organisms.
run_diamond Bidirectional DIAMOND between an organism and template organisms.
run_hmmsearch Search the profile library against proteome fasta; return tblout text.
stream_organism_gene_ko Stream the ko file to a sorted, gzipped organism_gene_ko.tsv.gz.
validate_hits Check a hits DataFrame has the required columns; return it unchanged.
write_kegg_tables Write each table as a gzipped TSV (<prefix><name>.tsv.gz) into out_dir.

Reference

assemble_model_from_ko_genes

Build a draft model from a {ko: [gene, ...]} assignment.

Returns (model, gpr_map) where gpr_map is the kept reactions' gene lists, so callers can add gene annotations afterwards.

assign_kos

Assign genes to KOs from HMM hits, applying the cut-off and ratio filters.

Ports RAVEN's three steps on the KO×gene E-value matrix:

  1. keep hits with evalue <= cutoff;
  2. min_score_ratio_ko — within a KO, drop genes whose log(evalue)/log(best_evalue_in_KO) < min_score_ratio_ko (prune weak members of a KO);
  3. min_score_ratio_g — within a gene, drop KOs whose log(evalue)/log(best_evalue_for_gene) < min_score_ratio_g (stop a gene that clearly belongs to one KO leaking into weaker ones).

Smaller E-value = better; since all kept values are < 1 their logs are negative, so the best (smallest) hit gives ratio 1 and weaker hits give a smaller positive ratio.

Default calibration (see IMPROVEMENTS K15). Cross-validated against the true KEGG gene→KO annotation of four organisms spanning the prok/euk libraries and the well-/lesser-studied axis (S. cerevisiae, Cyanidioschyzon merolae, E. coli, Mycoplasma genitalium): real annotations score overwhelmingly (median E ≈ 1e-100…1e-155) while spurious hits pile up at ≈1e-8, so the two are separated by ~20 orders of magnitude. RAVEN's 1e-50 sits inside the true tail and silently drops real but divergent hits — costing 16% gene→KO recall on the divergent minimal genome (M. genitalium) for no noise-rejection benefit (noise is far weaker). The default is therefore loosened to 1e-30 (recovers that tail; still ~22 orders above the noise floor), with the precision work moved to min_score_ratio_g = 0.9 — the effective precision lever (it resolves multi-KO genes). min_score_ratio_ko proved empirically inert across all four organisms (identical output at 0.0/0.3/0.5) and is kept only for RAVEN parity.

blast_from_table

Load a precomputed homology hits table (CSV path or DataFrame).

a plain CSV/DataFrame, not Excel. Must contain the HIT_COLUMNS columns.

build_hmm_library

Build a domain ("prokaryotes"/"eukaryotes") HMM library.

Restricts genes to the domain's organisms (from taxonomy), builds a multi-FASTA and a profile HMM per KO under out_dir, and (if concatenate) concatenates them into out_dir/library.hmm — searched directly with hmmsearch (no hmmpress needed). Returns {"hmms": [...], "library": path | None}.

Heavy and binary-dependent — intended for the maintainer, run once per KEGG release. Resumable: skips KOs that already have an .hmm and reuses an already-built multi-FASTA (see :func:build_ko_fastas), so re-running after a failure continues rather than restarting. force rebuilds everything.

progress shows tqdm bars for the multi-FASTA and per-KO HMM passes (the headline N of M counter for the long build); it is independent of verbose (per-KO INFO logging). With both on, per-KO log lines are routed through tqdm.write so they don't corrupt the bar.

build_kegg_tables

Build the minimal relational tables from parsed records.

Returns a dict of DataFrames keyed by table name: ko_reaction, ko_names, organism_gene_ko, rxn_flags.

build_ko_fastas

Write one <KO>.fa per KO with its member genes' sequences.

but with a stdlib offset index instead of the Java-hashtable byte scan. organisms restricts to a domain's organism codes (for the prok/euk split). Empty KOs are skipped (no file). progress shows a tqdm bar over the per-KO writing pass. Returns {ko: path} for the files written.

Resumable: a <KO>.fa that already exists is kept (not rewritten), and once a full pass completes a .ko_fastas_complete marker is written so a later call returns the existing files without re-scanning genes_pep (the costly part). force ignores both and rebuilds. The resume assumes the same inputs as the original run (as the per-KO HMM resume in :func:build_hmm_library does); use force after changing organism_gene_ko/organisms.

build_ko_hmm

Cluster, align and train a profile HMM for one KO's multi-FASTA.

Single-sequence KOs skip CD-HIT/MAFFT (a lone sequence is its own alignment). seq_identity=-1 skips CD-HIT. All (deduplicated) sequences are kept — memory on large KOs is bounded by switching MAFFT to PartTree, not by dropping sequences. fast uses MAFFT FFT-NS-2 (fast progressive) rather than --auto's slow iterative refinement. MAFFT switches to memory-light PartTree once an alignment is predicted to be too memory-heavy: by default from its DP cost (n_seqs × mean_len² — long proteins cost far more than the same residue count in short ones) against a RAM-derived budget (:func:_auto_cost_budget). Passing parttree_residues overrides this with a simple residue-count cutoff. verbose logs (via the logging module, INFO/DEBUG) which tool is running for this KO, sequence counts at each stage, timings, and the tools' own output. Returns out_hmm.

build_reference_model

Assemble the gene-free KEGG reference model from parsed records.

Only metabolites actually used by a reaction are added. Reactions carry KEGG annotations (reaction id, KO ids, EC codes, pathways) but no genes/GPRs. Bounds are (-1000, 1000) for reversible reactions and (0, 1000) otherwise.

download_kegg_dump

Fetch and arrange a complete KEGG dump into dest.

Convenience wrapper chaining :func:fetch_kegg_files and :func:extract_kegg_dump. Returns the logical-name -> path mapping of the arranged dump, ready for :func:raven_toolbox.reconstruction.kegg.parse_kegg_dump. progress (on by default) shows download/extraction bars; pass progress=False for non-interactive runs.

extract_kegg_dump

Extract and arrange the downloaded archives into the flat dump layout.

Mirrors fetch_keggdb.sh's extract step: untar the *.tar.gz archives, gunzip the *.pep.gz proteomes, lift the needed files out of their sub-directories, and concatenate compound+glycan and the two proteomes. Tar extraction uses the data filter (no path traversal). Returns a mapping of logical name -> path for the files produced. progress shows a bar over the untar step and a byte bar per proteome gunzip.

Network-free, so this is the unit-tested core; download_kegg_dump chains :func:fetch_kegg_files in front of it.

fetch_kegg_files

Download the raw KEGG archives into dest (basenames). Returns the paths.

Existing files are skipped unless force=True (the script's wget -N intent, simplified to skip-if-present). progress shows a per-file byte download bar (sized from the response Content-Length); verbose keeps the fetching/skip INFO log lines.

flag_set

Reaction ids whose column flag is truthy (handles bool or TSV strings).

get_kegg_model_for_organism

Reconstruct a draft model for a KEGG species from its KO annotations.

Parameters:

Name Type Description Default
organism_id str

Three/four-letter KEGG organism code (e.g. "eco"), or "eukaryotes"/"prokaryotes" for a whole-domain model (requires taxonomy). Matched case-insensitively.

required
reference_model Model

The gene-free KEGG reference model (from :func:build_reference_model).

required
ko_reaction DataFrame

The relational tables from :func:build_kegg_tables (or read back with :func:read_kegg_table).

required
organism_gene_ko DataFrame

The relational tables from :func:build_kegg_tables (or read back with :func:read_kegg_table).

required
rxn_flags DataFrame

The relational tables from :func:build_kegg_tables (or read back with :func:read_kegg_table).

required
taxonomy str | Path | None

Path to the KEGG taxonomy file; required only for domain mode.

None
keep_spontaneous bool

Quality filters (RAVEN's keep*). A reaction flagged in rxn_flags is dropped unless its keep flag is set; this takes precedence over having genes. Spontaneous reactions are additionally kept without genes when keep_spontaneous is true.

True
keep_undefined_stoich bool

Quality filters (RAVEN's keep*). A reaction flagged in rxn_flags is dropped unless its keep flag is set; this takes precedence over having genes. Spontaneous reactions are additionally kept without genes when keep_spontaneous is true.

True
keep_incomplete bool

Quality filters (RAVEN's keep*). A reaction flagged in rxn_flags is dropped unless its keep flag is set; this takes precedence over having genes. Spontaneous reactions are additionally kept without genes when keep_spontaneous is true.

True
keep_general bool

Quality filters (RAVEN's keep*). A reaction flagged in rxn_flags is dropped unless its keep flag is set; this takes precedence over having genes. Spontaneous reactions are additionally kept without genes when keep_spontaneous is true.

True

Returns:

Type Description
Model

A copy of the reference restricted to the organism's reactions, with GPRs built and kegg.genes annotations on the genes.

get_kegg_model_for_organism_from_artefacts

Load the published 3b.2 artefacts from artefact_dir and build the model.

Reads reference_model.yml.gz and the ko_reaction/organism_gene_ko/ rxn_flags gzipped-TSV tables, then calls :func:get_kegg_model_for_organism. If artefact_dir is None the published artefacts are fetched/cached via :func:raven_toolbox.data.ensure_kegg_data (version selects the release).

For a whole-domain model (organism_id = "prokaryotes"/"eukaryotes") the KEGG taxonomy artefact is resolved automatically — from artefact_dir if present, otherwise fetched via :func:raven_toolbox.data.ensure_kegg_taxonomy (it is a separate artefact, not part of the core set) — unless an explicit taxonomy path is given.

get_kegg_model_from_sequences

Reconstruct a draft model for a proteome by HMM-searching the KO library.

Searches the library against fasta (3b.3), assigns KOs (:func:assign_kos), and assembles the model against reference_model / ko_reaction. Genes are the query proteome's identifiers.

get_kegg_model_from_sequences_with_artefacts

Load reference model + tables from artefact_dir and run the HMM query.

If artefact_dir / library are None they are fetched/cached via :func:raven_toolbox.data.ensure_kegg_data / :func:raven_toolbox.data.ensure_kegg_hmm_library (domain selects the prok/euk library; version the release).

get_model_from_homology

Build a draft model for model_for by transferring reactions from templates.

strictness (1/2/3) is a legacy alias for bidirectional / best_hits_only.

HomologyResult

Result of :func:get_model_from_homology.

Attributes:

Name Type Description
model Model

The draft cobra.Model.

gene_map dict

{model_id: {template_gene: [new_gene, ...]}} ortholog mapping used.

KeggCompound

A metabolite parsed from the KEGG compound flat file.

KeggKO

A KEGG Orthology entry: its name and the organism genes assigned to it.

KeggReaction

A reaction parsed from the KEGG reaction flat file.

make_ortholog_hits

Build a bidirectional hits table from a predefined ortholog list.

Each (source_gene, target_gene) pair is emitted in both directions with sentinel metrics (evalue 0, identity 100, align_len 1000, bitscore 1000, ppos 100) so every pair passes any reasonable filter. Lets a known ortholog mapping feed :func:get_model_from_homology with no BLAST run — also the testing entry point.

Parameters:

Name Type Description Default
ortholog_pairs Iterable[tuple[str, str]]

Iterable of (source_gene, target_gene) — source = template/model organism, target = the organism being built.

required
source_model_id str

ID of the template model the source genes belong to.

required
target_id str

ID of the organism to build a model for (model_for).

required

organism_domains

Return {organism_id: domain} (the outermost category).

organisms_in_domain

Organism ids whose outermost category matches domain (case-insensitive).

Accepts a prefix, so "prok" matches "Prokaryotes".

parse_hmmsearch_tblout

Parse hmmsearch --tblout text into a [ko, gene, evalue] table.

With the profile library as the query, column 1 (target name) is the proteome gene, column 3 (query name) is the KO, and column 5 is the full-sequence E-value.

parse_kegg_compounds

Parse <kegg_dir>/compound (+ optional compound.inchi) into records.

parse_kegg_dump

Parse a full KEGG dump into the reference model + tables and write them out.

Writes reference_model.yml.gz (gzipped RAVEN/cobra YAML) plus the gzipped-TSV tables into out_dir and returns {name: path} for everything written. The large organism_gene_ko table is streamed to disk (see :func:stream_organism_gene_ko) rather than built in memory, so this scales to the full KEGG database; the small derived tables are built in memory.

When version is given (e.g. "kegg116") the output filenames are version-prefixed (<version>_<name>), matching the published release assets; the returned dict keys stay the logical table names.

progress reports each stage and shows a byte-progress bar over the dominant organism_gene_ko (ko) streaming pass.

parse_kegg_kos

Parse <kegg_dir>/ko into :class:KeggKO records (name + organism genes).

keep limits parsing to those KO ids (e.g. only KOs linked to reactions), mirroring RAVEN's koList argument — the gene lists are huge, so this is the usual call.

parse_kegg_reactions

Parse <kegg_dir>/reaction into :class:KeggReaction records.

Reversibility is taken from the equation arrow and, when reaction_mapformula.lst is present, refined to mark reactions that are irreversible across all KEGG maps (see :func:_irreversible_from_mapformula).

parse_taxonomy

Return {organism_id: [category, ...]} from outermost to innermost.

parse_taxonomy_records

Return [(organism_id, name, [category, ...]), ...] in file order.

name is the organism's scientific name with any trailing KEGG parenthetical kept (e.g. "Homo sapiens (human)"), matching RAVEN getPhylDist; the category list is the lineage from outermost (domain) to innermost. Order is preserved so it aligns with the rows/columns of :func:phyl_dist.

phyl_dist

Compute the KEGG phylogenetic distance matrix from the taxonomy file.

Faithful port of RAVEN getPhylDist. For organisms i and j with category lineages of lengths Li, Lj, the distance is (Li - Lj) + min(Li, Lj) - k where k is the deepest level (within the shorter lineage, comparing position by position from the root) at which the two categories coincide — 1 if none coincide. This reproduces RAVEN's exact values (including their asymmetry and occasional negatives) so the output matches MATLAB keggPhylDist that GECKO was built against.

only_in_kingdom sets the distance to +inf between organisms in different domains (RAVEN's onlyInKingdom variant); the matrix is then returned as float.

Cost is O(N²) in time and memory (N = number of KEGG organisms, ~10⁴), so this is a maintainer/GECKO-side generation step; persist the result rather than rebuilding.

PhylDist

KEGG phylogenetic distance structure — RAVEN getPhylDist / keggPhylDist.

ids/names are the KEGG organism ids and scientific names (names keep RAVEN's trailing parenthetical; consumers that need them cleaned, e.g. geckopy, strip it). dist_matrix[i, j] is RAVEN's distance from ids[i] to ids[j]. It is a faithful reproduction of RAVEN's metric, which is asymmetric and may be negative; GECKO consumes it only by taking the closest (minimum-distance) organism, so the exact values matter for parity with MATLAB but the sign/asymmetry are harmless there.

read_kegg_table

Read a KEGG table written by :func:write_kegg_tables or :func:stream_organism_gene_ko.

Compression is inferred from the suffix; all published tables are gzipped TSV (.tsv.gz), and a version-prefixed name (kegg116_<name>.tsv.gz) reads just the same.

run_blast

Bidirectional BLAST+ between an organism and template organisms.

Returns the hits DataFrame (filtered at evalue). Requires BLAST+ (blastp, makeblastdb).

run_diamond

Bidirectional DIAMOND between an organism and template organisms.

Returns the hits DataFrame. Requires DIAMOND.

run_hmmsearch

Search the profile library against proteome fasta; return tblout text.

One hmmsearch of the whole concatenated multi-profile library (query) against fasta (target) — the fast search direction, parallelised with --cpu, and no hmmpress needed. -Z is fixed to the profile count so the per-hit E-values match the convention :func:assign_kos is calibrated against (identical to a hmmscan against the same library).

stream_organism_gene_ko

Stream the ko file to a sorted, gzipped organism_gene_ko.tsv.gz.

Real KEGG has ~9M gene↔KO associations — far too many to hold in memory as a DataFrame. Rows are sorted by (organism, gene) before writing: gene IDs from one organism share long common prefixes (locus tags, numeric runs), so sorting makes them adjacent and helps the compressor; the order also matches the by-organism query pattern in :func:get_kegg_model_for_organism. Gzip (not xz) keeps the table readable by MATLAB's built-in gunzip with no external tool, at a modestly larger size.

The sort is an external merge sort bounded to chunk_rows rows in memory at a time (sorted runs spooled to gzipped temp files, then merged with :func:heapq.merge), so peak memory stays flat regardless of KEGG size. Only the small ko_names table (one row per KO) is held in full and returned.

progress shows a tqdm byte-progress bar over the (large) ko read pass.

validate_hits

Check a hits DataFrame has the required columns; return it unchanged.

write_kegg_tables

Write each table as a gzipped TSV (<prefix><name>.tsv.gz) into out_dir.

Gzipped TSV is the dependency-free cross-language format shared with MATLAB RAVEN (see docs/kegg_data_format.md) — readable by MATLAB's built-in gunzip with no external tool. prefix version-tags the filenames (e.g. kegg116_). Returns the written paths.