Reconstruction (Python)¶

raven-toolbox objects in raven_toolbox.reconstruction, collected from the source of the tracked branch.

Functions¶

Function	Summary
`assemble_model_from_ko_genes`	Build a draft model from a `{ko: [gene, ...]}` assignment.
`assign_kos`	Assign genes to KOs from HMM hits, applying the cut-off and ratio filters.
`blast_from_table`	Load a precomputed homology hits table (CSV path or DataFrame).
`build_hmm_library`	Build a domain (`"prokaryotes"`/`"eukaryotes"`) HMM library.
`build_kegg_tables`	Build the minimal relational tables from parsed records.
`build_ko_fastas`	Write one `<KO>.fa` per KO with its member genes' sequences.
`build_ko_hmm`	Cluster, align and train a profile HMM for one KO's multi-FASTA.
`build_reference_model`	Assemble the gene-free KEGG reference model from parsed records.
`download_kegg_dump`	Fetch and arrange a complete KEGG dump into `dest`.
`extract_kegg_dump`	Extract and arrange the downloaded archives into the flat dump layout.
`fetch_kegg_files`	Download the raw KEGG archives into `dest` (basenames). Returns the paths.
`flag_set`	Reaction ids whose `column` flag is truthy (handles bool or TSV strings).
`get_kegg_model_for_organism`	Reconstruct a draft model for a KEGG species from its KO annotations.
`get_kegg_model_for_organism_from_artefacts`	Load the published 3b.2 artefacts from `artefact_dir` and build the model.
`get_kegg_model_from_sequences`	Reconstruct a draft model for a proteome by HMM-searching the KO library.
`get_kegg_model_from_sequences_with_artefacts`	Load reference model + tables from `artefact_dir` and run the HMM query.
`get_model_from_homology`	Build a draft model for `model_for` by transferring reactions from templates.
`HomologyResult`	Result of :func:`get_model_from_homology`.
`KeggCompound`	A metabolite parsed from the KEGG `compound` flat file.
`KeggKO`	A KEGG Orthology entry: its name and the organism genes assigned to it.
`KeggReaction`	A reaction parsed from the KEGG `reaction` flat file.
`make_ortholog_hits`	Build a bidirectional hits table from a predefined ortholog list.
`organism_domains`	Return `{organism_id: domain}` (the outermost category).
`organisms_in_domain`	Organism ids whose outermost category matches `domain` (case-insensitive).
`parse_hmmsearch_tblout`	Parse `hmmsearch --tblout` text into a `[ko, gene, evalue]` table.
`parse_kegg_compounds`	Parse `<kegg_dir>/compound` (+ optional `compound.inchi`) into records.
`parse_kegg_dump`	Parse a full KEGG dump into the reference model + tables and write them out.
`parse_kegg_kos`	Parse `<kegg_dir>/ko` into :class:`KeggKO` records (name + organism genes).
`parse_kegg_reactions`	Parse `<kegg_dir>/reaction` into :class:`KeggReaction` records.
`parse_taxonomy`	Return `{organism_id: [category, ...]}` from outermost to innermost.
`parse_taxonomy_records`	Return `[(organism_id, name, [category, ...]), ...]` in file order.
`phyl_dist`	Compute the KEGG phylogenetic distance matrix from the `taxonomy` file.
`PhylDist`	KEGG phylogenetic distance structure — RAVEN `getPhylDist` / `keggPhylDist`.
`read_kegg_table`	Read a KEGG table written by :func:`write_kegg_tables` or
`run_blast`	Bidirectional BLAST+ between an organism and template organisms.
`run_diamond`	Bidirectional DIAMOND between an organism and template organisms.
`run_hmmsearch`	Search the profile `library` against proteome `fasta`; return tblout text.
`stream_organism_gene_ko`	Stream the `ko` file to a sorted, gzipped `organism_gene_ko.tsv.gz`.
`validate_hits`	Check a hits DataFrame has the required columns; return it unchanged.
`write_kegg_tables`	Write each table as a gzipped TSV (`<prefix><name>.tsv.gz`) into `out_dir`.

Reference¶

assemble_model_from_ko_genes¶

Build a draft model from a {ko: [gene, ...]} assignment.

Returns (model, gpr_map) where gpr_map is the kept reactions' gene lists, so callers can add gene annotations afterwards.

assign_kos¶

Assign genes to KOs from HMM hits, applying the cut-off and ratio filters.

Ports RAVEN's three steps on the KO×gene E-value matrix:

keep hits with evalue <= cutoff;
min_score_ratio_ko — within a KO, drop genes whose log(evalue)/log(best_evalue_in_KO) < min_score_ratio_ko (prune weak members of a KO);
min_score_ratio_g — within a gene, drop KOs whose log(evalue)/log(best_evalue_for_gene) < min_score_ratio_g (stop a gene that clearly belongs to one KO leaking into weaker ones).

Smaller E-value = better; since all kept values are < 1 their logs are negative, so the best (smallest) hit gives ratio 1 and weaker hits give a smaller positive ratio.

Default calibration (see IMPROVEMENTS K15). Cross-validated against the true KEGG gene→KO annotation of four organisms spanning the prok/euk libraries and the well-/lesser-studied axis (S. cerevisiae, Cyanidioschyzon merolae, E. coli, Mycoplasma genitalium): real annotations score overwhelmingly (median E ≈ 1e-100…1e-155) while spurious hits pile up at ≈1e-8, so the two are separated by ~20 orders of magnitude. RAVEN's 1e-50 sits inside the true tail and silently drops real but divergent hits — costing 16% gene→KO recall on the divergent minimal genome (M. genitalium) for no noise-rejection benefit (noise is far weaker). The default is therefore loosened to 1e-30 (recovers that tail; still ~22 orders above the noise floor), with the precision work moved to min_score_ratio_g = 0.9 — the effective precision lever (it resolves multi-KO genes). min_score_ratio_ko proved empirically inert across all four organisms (identical output at 0.0/0.3/0.5) and is kept only for RAVEN parity.

blast_from_table¶

Load a precomputed homology hits table (CSV path or DataFrame).

a plain CSV/DataFrame, not Excel. Must contain the HIT_COLUMNS columns.

build_hmm_library¶

Build a domain ("prokaryotes"/"eukaryotes") HMM library.

Restricts genes to the domain's organisms (from taxonomy), builds a multi-FASTA and a profile HMM per KO under out_dir, and (if concatenate) concatenates them into out_dir/library.hmm — searched directly with hmmsearch (no hmmpress needed). Returns {"hmms": [...], "library": path | None}.

Heavy and binary-dependent — intended for the maintainer, run once per KEGG release. Resumable: skips KOs that already have an .hmm and reuses an already-built multi-FASTA (see :func:build_ko_fastas), so re-running after a failure continues rather than restarting. force rebuilds everything.

progress shows tqdm bars for the multi-FASTA and per-KO HMM passes (the headline N of M counter for the long build); it is independent of verbose (per-KO INFO logging). With both on, per-KO log lines are routed through tqdm.write so they don't corrupt the bar.

build_kegg_tables¶

Build the minimal relational tables from parsed records.

Returns a dict of DataFrames keyed by table name: ko_reaction, ko_names, organism_gene_ko, rxn_flags.

build_ko_fastas¶

Write one <KO>.fa per KO with its member genes' sequences.

but with a stdlib offset index instead of the Java-hashtable byte scan. organisms restricts to a domain's organism codes (for the prok/euk split). Empty KOs are skipped (no file). progress shows a tqdm bar over the per-KO writing pass. Returns {ko: path} for the files written.

Resumable: a <KO>.fa that already exists is kept (not rewritten), and once a full pass completes a .ko_fastas_complete marker is written so a later call returns the existing files without re-scanning genes_pep (the costly part). force ignores both and rebuilds. The resume assumes the same inputs as the original run (as the per-KO HMM resume in :func:build_hmm_library does); use force after changing organism_gene_ko/organisms.

build_ko_hmm¶

Cluster, align and train a profile HMM for one KO's multi-FASTA.

Single-sequence KOs skip CD-HIT/MAFFT (a lone sequence is its own alignment). seq_identity=-1 skips CD-HIT. All (deduplicated) sequences are kept — memory on large KOs is bounded by switching MAFFT to PartTree, not by dropping sequences. fast uses MAFFT FFT-NS-2 (fast progressive) rather than --auto's slow iterative refinement. MAFFT switches to memory-light PartTree once an alignment is predicted to be too memory-heavy: by default from its DP cost (n_seqs × mean_len² — long proteins cost far more than the same residue count in short ones) against a RAM-derived budget (:func:_auto_cost_budget). Passing parttree_residues overrides this with a simple residue-count cutoff. verbose logs (via the logging module, INFO/DEBUG) which tool is running for this KO, sequence counts at each stage, timings, and the tools' own output. Returns out_hmm.

build_reference_model¶

Assemble the gene-free KEGG reference model from parsed records.

Only metabolites actually used by a reaction are added. Reactions carry KEGG annotations (reaction id, KO ids, EC codes, pathways) but no genes/GPRs. Bounds are (-1000, 1000) for reversible reactions and (0, 1000) otherwise.

download_kegg_dump¶

Fetch and arrange a complete KEGG dump into dest.

Convenience wrapper chaining :func:fetch_kegg_files and :func:extract_kegg_dump. Returns the logical-name -> path mapping of the arranged dump, ready for :func:raven_toolbox.reconstruction.kegg.parse_kegg_dump. progress (on by default) shows download/extraction bars; pass progress=False for non-interactive runs.

extract_kegg_dump¶

Extract and arrange the downloaded archives into the flat dump layout.

Mirrors fetch_keggdb.sh's extract step: untar the *.tar.gz archives, gunzip the *.pep.gz proteomes, lift the needed files out of their sub-directories, and concatenate compound+glycan and the two proteomes. Tar extraction uses the data filter (no path traversal). Returns a mapping of logical name -> path for the files produced. progress shows a bar over the untar step and a byte bar per proteome gunzip.

Network-free, so this is the unit-tested core; download_kegg_dump chains :func:fetch_kegg_files in front of it.

fetch_kegg_files¶

Download the raw KEGG archives into dest (basenames). Returns the paths.

Existing files are skipped unless force=True (the script's wget -N intent, simplified to skip-if-present). progress shows a per-file byte download bar (sized from the response Content-Length); verbose keeps the fetching/skip INFO log lines.

flag_set¶

Reaction ids whose column flag is truthy (handles bool or TSV strings).

get_kegg_model_for_organism¶

Reconstruct a draft model for a KEGG species from its KO annotations.

Parameters:

Name	Type	Description	Default
`organism_id`	`str`	Three/four-letter KEGG organism code (e.g. `"eco"`), or `"eukaryotes"`/`"prokaryotes"` for a whole-domain model (requires `taxonomy`). Matched case-insensitively.	required
`reference_model`	`Model`	The gene-free KEGG reference model (from :func:`build_reference_model`).	required
`ko_reaction`	`DataFrame`	The relational tables from :func:`build_kegg_tables` (or read back with :func:`read_kegg_table`).	required
`organism_gene_ko`	`DataFrame`	The relational tables from :func:`build_kegg_tables` (or read back with :func:`read_kegg_table`).	required
`rxn_flags`	`DataFrame`	The relational tables from :func:`build_kegg_tables` (or read back with :func:`read_kegg_table`).	required
`taxonomy`	`str \| Path \| None`	Path to the KEGG `taxonomy` file; required only for domain mode.	`None`
`keep_spontaneous`	`bool`	Quality filters (RAVEN's `keep`). A reaction flagged in `rxn_flags` is dropped unless its keep flag is set; this takes precedence over having genes. Spontaneous reactions are additionally kept without* genes when `keep_spontaneous` is true.	`True`
`keep_undefined_stoich`	`bool`	Quality filters (RAVEN's `keep`). A reaction flagged in `rxn_flags` is dropped unless its keep flag is set; this takes precedence over having genes. Spontaneous reactions are additionally kept without* genes when `keep_spontaneous` is true.	`True`
`keep_incomplete`	`bool`	Quality filters (RAVEN's `keep`). A reaction flagged in `rxn_flags` is dropped unless its keep flag is set; this takes precedence over having genes. Spontaneous reactions are additionally kept without* genes when `keep_spontaneous` is true.	`True`
`keep_general`	`bool`	Quality filters (RAVEN's `keep`). A reaction flagged in `rxn_flags` is dropped unless its keep flag is set; this takes precedence over having genes. Spontaneous reactions are additionally kept without* genes when `keep_spontaneous` is true.	`True`

Returns:

Type	Description
`Model`	A copy of the reference restricted to the organism's reactions, with GPRs built and `kegg.genes` annotations on the genes.

get_kegg_model_for_organism_from_artefacts¶

Load the published 3b.2 artefacts from artefact_dir and build the model.

Reads reference_model.yml.gz and the ko_reaction/organism_gene_ko/ rxn_flags gzipped-TSV tables, then calls :func:get_kegg_model_for_organism. If artefact_dir is None the published artefacts are fetched/cached via :func:raven_toolbox.data.ensure_kegg_data (version selects the release).

For a whole-domain model (organism_id = "prokaryotes"/"eukaryotes") the KEGG taxonomy artefact is resolved automatically — from artefact_dir if present, otherwise fetched via :func:raven_toolbox.data.ensure_kegg_taxonomy (it is a separate artefact, not part of the core set) — unless an explicit taxonomy path is given.

get_kegg_model_from_sequences¶

Reconstruct a draft model for a proteome by HMM-searching the KO library.

Searches the library against fasta (3b.3), assigns KOs (:func:assign_kos), and assembles the model against reference_model / ko_reaction. Genes are the query proteome's identifiers.

get_kegg_model_from_sequences_with_artefacts¶

Load reference model + tables from artefact_dir and run the HMM query.

If artefact_dir / library are None they are fetched/cached via :func:raven_toolbox.data.ensure_kegg_data / :func:raven_toolbox.data.ensure_kegg_hmm_library (domain selects the prok/euk library; version the release).

get_model_from_homology¶

Build a draft model for model_for by transferring reactions from templates.

strictness (1/2/3) is a legacy alias for bidirectional / best_hits_only.

HomologyResult¶

Result of :func:get_model_from_homology.

Attributes:

Name	Type	Description
`model`	`Model`	The draft `cobra.Model`.
`gene_map`	`dict`	`{model_id: {template_gene: [new_gene, ...]}}` ortholog mapping used.

KeggCompound¶

A metabolite parsed from the KEGG compound flat file.

KeggKO¶

A KEGG Orthology entry: its name and the organism genes assigned to it.

KeggReaction¶

A reaction parsed from the KEGG reaction flat file.

make_ortholog_hits¶

Build a bidirectional hits table from a predefined ortholog list.

Each (source_gene, target_gene) pair is emitted in both directions with sentinel metrics (evalue 0, identity 100, align_len 1000, bitscore 1000, ppos 100) so every pair passes any reasonable filter. Lets a known ortholog mapping feed :func:get_model_from_homology with no BLAST run — also the testing entry point.

Parameters:

Name	Type	Description	Default
`ortholog_pairs`	`Iterable[tuple[str, str]]`	Iterable of `(source_gene, target_gene)` — source = template/model organism, target = the organism being built.	required
`source_model_id`	`str`	ID of the template model the source genes belong to.	required
`target_id`	`str`	ID of the organism to build a model for (`model_for`).	required

organism_domains¶

Return {organism_id: domain} (the outermost category).

organisms_in_domain¶

Organism ids whose outermost category matches domain (case-insensitive).

Accepts a prefix, so "prok" matches "Prokaryotes".

parse_hmmsearch_tblout¶

Parse hmmsearch --tblout text into a [ko, gene, evalue] table.

With the profile library as the query, column 1 (target name) is the proteome gene, column 3 (query name) is the KO, and column 5 is the full-sequence E-value.

parse_kegg_compounds¶

Parse <kegg_dir>/compound (+ optional compound.inchi) into records.

parse_kegg_dump¶

Parse a full KEGG dump into the reference model + tables and write them out.

Writes reference_model.yml.gz (gzipped RAVEN/cobra YAML) plus the gzipped-TSV tables into out_dir and returns {name: path} for everything written. The large organism_gene_ko table is streamed to disk (see :func:stream_organism_gene_ko) rather than built in memory, so this scales to the full KEGG database; the small derived tables are built in memory.

When version is given (e.g. "kegg116") the output filenames are version-prefixed (<version>_<name>), matching the published release assets; the returned dict keys stay the logical table names.

progress reports each stage and shows a byte-progress bar over the dominant organism_gene_ko (ko) streaming pass.

parse_kegg_kos¶

Parse <kegg_dir>/ko into :class:KeggKO records (name + organism genes).

keep limits parsing to those KO ids (e.g. only KOs linked to reactions), mirroring RAVEN's koList argument — the gene lists are huge, so this is the usual call.

parse_kegg_reactions¶

Parse <kegg_dir>/reaction into :class:KeggReaction records.

Reversibility is taken from the equation arrow and, when reaction_mapformula.lst is present, refined to mark reactions that are irreversible across all KEGG maps (see :func:_irreversible_from_mapformula).

parse_taxonomy¶

Return {organism_id: [category, ...]} from outermost to innermost.

parse_taxonomy_records¶

Return [(organism_id, name, [category, ...]), ...] in file order.

name is the organism's scientific name with any trailing KEGG parenthetical kept (e.g. "Homo sapiens (human)"), matching RAVEN getPhylDist; the category list is the lineage from outermost (domain) to innermost. Order is preserved so it aligns with the rows/columns of :func:phyl_dist.

phyl_dist¶

Compute the KEGG phylogenetic distance matrix from the taxonomy file.

Faithful port of RAVEN getPhylDist. For organisms i and j with category lineages of lengths Li, Lj, the distance is (Li - Lj) + min(Li, Lj) - k where k is the deepest level (within the shorter lineage, comparing position by position from the root) at which the two categories coincide — 1 if none coincide. This reproduces RAVEN's exact values (including their asymmetry and occasional negatives) so the output matches MATLAB keggPhylDist that GECKO was built against.

only_in_kingdom sets the distance to +inf between organisms in different domains (RAVEN's onlyInKingdom variant); the matrix is then returned as float.

Cost is O(N²) in time and memory (N = number of KEGG organisms, ~10⁴), so this is a maintainer/GECKO-side generation step; persist the result rather than rebuilding.

PhylDist¶

KEGG phylogenetic distance structure — RAVEN getPhylDist / keggPhylDist.

ids/names are the KEGG organism ids and scientific names (names keep RAVEN's trailing parenthetical; consumers that need them cleaned, e.g. geckopy, strip it). dist_matrix[i, j] is RAVEN's distance from ids[i] to ids[j]. It is a faithful reproduction of RAVEN's metric, which is asymmetric and may be negative; GECKO consumes it only by taking the closest (minimum-distance) organism, so the exact values matter for parity with MATLAB but the sign/asymmetry are harmless there.

read_kegg_table¶

Read a KEGG table written by :func:write_kegg_tables or :func:stream_organism_gene_ko.

Compression is inferred from the suffix; all published tables are gzipped TSV (.tsv.gz), and a version-prefixed name (kegg116_<name>.tsv.gz) reads just the same.

run_blast¶

Bidirectional BLAST+ between an organism and template organisms.

Returns the hits DataFrame (filtered at evalue). Requires BLAST+ (blastp, makeblastdb).

run_diamond¶

Bidirectional DIAMOND between an organism and template organisms.

Returns the hits DataFrame. Requires DIAMOND.

run_hmmsearch¶

Search the profile library against proteome fasta; return tblout text.

One hmmsearch of the whole concatenated multi-profile library (query) against fasta (target) — the fast search direction, parallelised with --cpu, and no hmmpress needed. -Z is fixed to the profile count so the per-hit E-values match the convention :func:assign_kos is calibrated against (identical to a hmmscan against the same library).

stream_organism_gene_ko¶

Stream the ko file to a sorted, gzipped organism_gene_ko.tsv.gz.

Real KEGG has ~9M gene↔KO associations — far too many to hold in memory as a DataFrame. Rows are sorted by (organism, gene) before writing: gene IDs from one organism share long common prefixes (locus tags, numeric runs), so sorting makes them adjacent and helps the compressor; the order also matches the by-organism query pattern in :func:get_kegg_model_for_organism. Gzip (not xz) keeps the table readable by MATLAB's built-in gunzip with no external tool, at a modestly larger size.

The sort is an external merge sort bounded to chunk_rows rows in memory at a time (sorted runs spooled to gzipped temp files, then merged with :func:heapq.merge), so peak memory stays flat regardless of KEGG size. Only the small ko_names table (one row per KO) is held in full and returned.

progress shows a tqdm byte-progress bar over the (large) ko read pass.

validate_hits¶

Check a hits DataFrame has the required columns; return it unchanged.

write_kegg_tables¶

Write each table as a gzipped TSV (<prefix><name>.tsv.gz) into out_dir.

Gzipped TSV is the dependency-free cross-language format shared with MATLAB RAVEN (see docs/kegg_data_format.md) — readable by MATLAB's built-in gunzip with no external tool. prefix version-tags the filenames (e.g. kegg116_). Returns the written paths.