Reconstruction (Python)¶
raven-toolbox objects in raven_toolbox.reconstruction, collected from the source of the tracked branch.
Functions¶
| Function | Summary |
|---|---|
assemble_model_from_ko_genes |
Build a draft model from a {ko: [gene, ...]} assignment. |
assign_kos |
Assign genes to KOs from HMM hits, applying the cut-off and ratio filters. |
blast_from_table |
Load a precomputed homology hits table (CSV path or DataFrame). |
build_hmm_library |
Build a domain ("prokaryotes"/"eukaryotes") HMM library. |
build_kegg_tables |
Build the minimal relational tables from parsed records. |
build_ko_fastas |
Write one <KO>.fa per KO with its member genes' sequences. |
build_ko_hmm |
Cluster, align and train a profile HMM for one KO's multi-FASTA. |
build_reference_model |
Assemble the gene-free KEGG reference model from parsed records. |
download_kegg_dump |
Fetch and arrange a complete KEGG dump into dest. |
extract_kegg_dump |
Extract and arrange the downloaded archives into the flat dump layout. |
fetch_kegg_files |
Download the raw KEGG archives into dest (basenames). Returns the paths. |
flag_set |
Reaction ids whose column flag is truthy (handles bool or TSV strings). |
get_kegg_model_for_organism |
Reconstruct a draft model for a KEGG species from its KO annotations. |
get_kegg_model_for_organism_from_artefacts |
Load the published 3b.2 artefacts from artefact_dir and build the model. |
get_kegg_model_from_sequences |
Reconstruct a draft model for a proteome by HMM-searching the KO library. |
get_kegg_model_from_sequences_with_artefacts |
Load reference model + tables from artefact_dir and run the HMM query. |
get_model_from_homology |
Build a draft model for model_for by transferring reactions from templates. |
HomologyResult |
Result of :func:get_model_from_homology. |
KeggCompound |
A metabolite parsed from the KEGG compound flat file. |
KeggKO |
A KEGG Orthology entry: its name and the organism genes assigned to it. |
KeggReaction |
A reaction parsed from the KEGG reaction flat file. |
make_ortholog_hits |
Build a bidirectional hits table from a predefined ortholog list. |
organism_domains |
Return {organism_id: domain} (the outermost category). |
organisms_in_domain |
Organism ids whose outermost category matches domain (case-insensitive). |
parse_hmmsearch_tblout |
Parse hmmsearch --tblout text into a [ko, gene, evalue] table. |
parse_kegg_compounds |
Parse <kegg_dir>/compound (+ optional compound.inchi) into records. |
parse_kegg_dump |
Parse a full KEGG dump into the reference model + tables and write them out. |
parse_kegg_kos |
Parse <kegg_dir>/ko into :class:KeggKO records (name + organism genes). |
parse_kegg_reactions |
Parse <kegg_dir>/reaction into :class:KeggReaction records. |
parse_taxonomy |
Return {organism_id: [category, ...]} from outermost to innermost. |
parse_taxonomy_records |
Return [(organism_id, name, [category, ...]), ...] in file order. |
phyl_dist |
Compute the KEGG phylogenetic distance matrix from the taxonomy file. |
PhylDist |
KEGG phylogenetic distance structure — RAVEN getPhylDist / keggPhylDist. |
read_kegg_table |
Read a KEGG table written by :func:write_kegg_tables or |
run_blast |
Bidirectional BLAST+ between an organism and template organisms. |
run_diamond |
Bidirectional DIAMOND between an organism and template organisms. |
run_hmmsearch |
Search the profile library against proteome fasta; return tblout text. |
stream_organism_gene_ko |
Stream the ko file to a sorted, gzipped organism_gene_ko.tsv.gz. |
validate_hits |
Check a hits DataFrame has the required columns; return it unchanged. |
write_kegg_tables |
Write each table as a gzipped TSV (<prefix><name>.tsv.gz) into out_dir. |
Reference¶
assemble_model_from_ko_genes¶
Build a draft model from a {ko: [gene, ...]} assignment.
Returns (model, gpr_map) where gpr_map is the kept reactions' gene
lists, so callers can add gene annotations afterwards.
assign_kos¶
Assign genes to KOs from HMM hits, applying the cut-off and ratio filters.
Ports RAVEN's three steps on the KO×gene E-value matrix:
- keep hits with
evalue <= cutoff; - min_score_ratio_ko — within a KO, drop genes whose
log(evalue)/log(best_evalue_in_KO) < min_score_ratio_ko(prune weak members of a KO); - min_score_ratio_g — within a gene, drop KOs whose
log(evalue)/log(best_evalue_for_gene) < min_score_ratio_g(stop a gene that clearly belongs to one KO leaking into weaker ones).
Smaller E-value = better; since all kept values are < 1 their logs are
negative, so the best (smallest) hit gives ratio 1 and weaker hits give a
smaller positive ratio.
Default calibration (see IMPROVEMENTS K15). Cross-validated against the true
KEGG gene→KO annotation of four organisms spanning the prok/euk libraries and
the well-/lesser-studied axis (S. cerevisiae, Cyanidioschyzon merolae,
E. coli, Mycoplasma genitalium): real annotations score
overwhelmingly (median E ≈ 1e-100…1e-155) while spurious hits pile up at
≈1e-8, so the two are separated by ~20 orders of magnitude. RAVEN's
1e-50 sits inside the true tail and silently drops real but divergent
hits — costing 16% gene→KO recall on the divergent minimal genome
(M. genitalium) for no noise-rejection benefit (noise is far weaker). The
default is therefore loosened to 1e-30 (recovers that tail; still ~22
orders above the noise floor), with the precision work moved to
min_score_ratio_g = 0.9 — the effective precision lever (it resolves
multi-KO genes). min_score_ratio_ko proved empirically inert across all
four organisms (identical output at 0.0/0.3/0.5) and is kept only for RAVEN
parity.
blast_from_table¶
Load a precomputed homology hits table (CSV path or DataFrame).
a plain CSV/DataFrame, not Excel.
Must contain the HIT_COLUMNS columns.
build_hmm_library¶
Build a domain ("prokaryotes"/"eukaryotes") HMM library.
Restricts genes to the domain's organisms (from taxonomy), builds a
multi-FASTA and a profile HMM per KO under out_dir, and (if concatenate)
concatenates them into out_dir/library.hmm — searched directly with
hmmsearch (no hmmpress needed). Returns {"hmms": [...], "library": path | None}.
Heavy and binary-dependent — intended for the maintainer, run once per KEGG
release. Resumable: skips KOs that already have an .hmm and reuses an
already-built multi-FASTA (see :func:build_ko_fastas), so re-running after a
failure continues rather than restarting. force rebuilds everything.
progress shows tqdm bars for the multi-FASTA and per-KO HMM passes (the
headline N of M counter for the long build); it is independent of
verbose (per-KO INFO logging). With both on, per-KO log lines are routed
through tqdm.write so they don't corrupt the bar.
build_kegg_tables¶
Build the minimal relational tables from parsed records.
Returns a dict of DataFrames keyed by table name: ko_reaction,
ko_names, organism_gene_ko, rxn_flags.
build_ko_fastas¶
Write one <KO>.fa per KO with its member genes' sequences.
but with a stdlib offset index instead
of the Java-hashtable byte scan. organisms restricts to a domain's
organism codes (for the prok/euk split). Empty KOs are skipped (no file).
progress shows a tqdm bar over the per-KO writing pass. Returns
{ko: path} for the files written.
Resumable: a <KO>.fa that already exists is kept (not rewritten), and
once a full pass completes a .ko_fastas_complete marker is written so a
later call returns the existing files without re-scanning genes_pep (the
costly part). force ignores both and rebuilds. The resume assumes the same
inputs as the original run (as the per-KO HMM resume in :func:build_hmm_library
does); use force after changing organism_gene_ko/organisms.
build_ko_hmm¶
Cluster, align and train a profile HMM for one KO's multi-FASTA.
Single-sequence KOs skip CD-HIT/MAFFT (a lone sequence is its own alignment).
seq_identity=-1 skips CD-HIT. All (deduplicated) sequences are kept —
memory on large KOs is bounded by switching MAFFT to PartTree, not by
dropping sequences. fast uses MAFFT FFT-NS-2 (fast progressive) rather
than --auto's slow iterative refinement. MAFFT switches to memory-light
PartTree once an alignment is predicted to be too memory-heavy: by default from
its DP cost (n_seqs × mean_len² — long proteins cost far more than the
same residue count in short ones) against a RAM-derived budget
(:func:_auto_cost_budget). Passing parttree_residues overrides this with a
simple residue-count cutoff.
verbose logs (via the logging module, INFO/DEBUG) which tool is running
for this KO, sequence counts at each stage, timings, and the tools' own
output. Returns out_hmm.
build_reference_model¶
Assemble the gene-free KEGG reference model from parsed records.
Only metabolites actually used by a reaction are added. Reactions carry KEGG
annotations (reaction id, KO ids, EC codes, pathways) but no genes/GPRs.
Bounds are (-1000, 1000) for reversible reactions and (0, 1000)
otherwise.
download_kegg_dump¶
Fetch and arrange a complete KEGG dump into dest.
Convenience wrapper chaining :func:fetch_kegg_files and
:func:extract_kegg_dump. Returns the logical-name -> path mapping of the
arranged dump, ready for :func:raven_toolbox.reconstruction.kegg.parse_kegg_dump.
progress (on by default) shows download/extraction bars; pass
progress=False for non-interactive runs.
extract_kegg_dump¶
Extract and arrange the downloaded archives into the flat dump layout.
Mirrors fetch_keggdb.sh's extract step: untar the *.tar.gz archives,
gunzip the *.pep.gz proteomes, lift the needed files out of their
sub-directories, and concatenate compound+glycan and the two proteomes.
Tar extraction uses the data filter (no path traversal). Returns a
mapping of logical name -> path for the files produced. progress shows a
bar over the untar step and a byte bar per proteome gunzip.
Network-free, so this is the unit-tested core; download_kegg_dump chains
:func:fetch_kegg_files in front of it.
fetch_kegg_files¶
Download the raw KEGG archives into dest (basenames). Returns the paths.
Existing files are skipped unless force=True (the script's wget -N
intent, simplified to skip-if-present). progress shows a per-file byte
download bar (sized from the response Content-Length); verbose keeps
the fetching/skip INFO log lines.
flag_set¶
Reaction ids whose column flag is truthy (handles bool or TSV strings).
get_kegg_model_for_organism¶
Reconstruct a draft model for a KEGG species from its KO annotations.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
organism_id
|
str
|
Three/four-letter KEGG organism code (e.g. |
required |
reference_model
|
Model
|
The gene-free KEGG reference model (from :func: |
required |
ko_reaction
|
DataFrame
|
The relational tables from :func: |
required |
organism_gene_ko
|
DataFrame
|
The relational tables from :func: |
required |
rxn_flags
|
DataFrame
|
The relational tables from :func: |
required |
taxonomy
|
str | Path | None
|
Path to the KEGG |
None
|
keep_spontaneous
|
bool
|
Quality filters (RAVEN's |
True
|
keep_undefined_stoich
|
bool
|
Quality filters (RAVEN's |
True
|
keep_incomplete
|
bool
|
Quality filters (RAVEN's |
True
|
keep_general
|
bool
|
Quality filters (RAVEN's |
True
|
Returns:
| Type | Description |
|---|---|
Model
|
A copy of the reference restricted to the organism's reactions, with GPRs
built and |
get_kegg_model_for_organism_from_artefacts¶
Load the published 3b.2 artefacts from artefact_dir and build the model.
Reads reference_model.yml.gz and the ko_reaction/organism_gene_ko/
rxn_flags gzipped-TSV tables, then calls :func:get_kegg_model_for_organism.
If artefact_dir is None the published artefacts are fetched/cached via
:func:raven_toolbox.data.ensure_kegg_data (version selects the release).
For a whole-domain model (organism_id = "prokaryotes"/"eukaryotes")
the KEGG taxonomy artefact is resolved automatically — from artefact_dir
if present, otherwise fetched via :func:raven_toolbox.data.ensure_kegg_taxonomy
(it is a separate artefact, not part of the core set) — unless an explicit
taxonomy path is given.
get_kegg_model_from_sequences¶
Reconstruct a draft model for a proteome by HMM-searching the KO library.
Searches the library against fasta (3b.3), assigns KOs
(:func:assign_kos), and assembles the model against reference_model /
ko_reaction. Genes are the query proteome's identifiers.
get_kegg_model_from_sequences_with_artefacts¶
Load reference model + tables from artefact_dir and run the HMM query.
If artefact_dir / library are None they are fetched/cached via
:func:raven_toolbox.data.ensure_kegg_data / :func:raven_toolbox.data.ensure_kegg_hmm_library
(domain selects the prok/euk library; version the release).
get_model_from_homology¶
Build a draft model for model_for by transferring reactions from templates.
strictness (1/2/3) is a legacy alias for bidirectional / best_hits_only.
HomologyResult¶
Result of :func:get_model_from_homology.
Attributes:
| Name | Type | Description |
|---|---|---|
model |
Model
|
The draft |
gene_map |
dict
|
|
KeggCompound¶
A metabolite parsed from the KEGG compound flat file.
KeggKO¶
A KEGG Orthology entry: its name and the organism genes assigned to it.
KeggReaction¶
A reaction parsed from the KEGG reaction flat file.
make_ortholog_hits¶
Build a bidirectional hits table from a predefined ortholog list.
Each (source_gene, target_gene)
pair is emitted in both directions with sentinel metrics (evalue 0,
identity 100, align_len 1000, bitscore 1000, ppos 100) so every pair passes
any reasonable filter. Lets a known ortholog mapping feed
:func:get_model_from_homology with no BLAST run — also the testing entry
point.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
ortholog_pairs
|
Iterable[tuple[str, str]]
|
Iterable of |
required |
source_model_id
|
str
|
ID of the template model the source genes belong to. |
required |
target_id
|
str
|
ID of the organism to build a model for ( |
required |
organism_domains¶
Return {organism_id: domain} (the outermost category).
organisms_in_domain¶
Organism ids whose outermost category matches domain (case-insensitive).
Accepts a prefix, so "prok" matches "Prokaryotes".
parse_hmmsearch_tblout¶
Parse hmmsearch --tblout text into a [ko, gene, evalue] table.
With the profile library as the query, column 1 (target name) is the proteome gene, column 3 (query name) is the KO, and column 5 is the full-sequence E-value.
parse_kegg_compounds¶
Parse <kegg_dir>/compound (+ optional compound.inchi) into records.
parse_kegg_dump¶
Parse a full KEGG dump into the reference model + tables and write them out.
Writes reference_model.yml.gz (gzipped RAVEN/cobra YAML) plus the
gzipped-TSV tables into out_dir and returns {name: path} for
everything written. The large
organism_gene_ko table is streamed to disk (see
:func:stream_organism_gene_ko) rather than built in memory, so this scales
to the full KEGG database; the small derived tables are built in memory.
When version is given (e.g. "kegg116") the output filenames are
version-prefixed (<version>_<name>), matching the published release
assets; the returned dict keys stay the logical table names.
progress reports each stage and shows a byte-progress bar over the
dominant organism_gene_ko (ko) streaming pass.
parse_kegg_kos¶
Parse <kegg_dir>/ko into :class:KeggKO records (name + organism genes).
keep limits parsing to those KO ids (e.g. only KOs linked to reactions),
mirroring RAVEN's koList argument — the gene lists are huge, so this is
the usual call.
parse_kegg_reactions¶
Parse <kegg_dir>/reaction into :class:KeggReaction records.
Reversibility is taken from the equation arrow and, when
reaction_mapformula.lst is present, refined to mark reactions that are
irreversible across all KEGG maps (see :func:_irreversible_from_mapformula).
parse_taxonomy¶
Return {organism_id: [category, ...]} from outermost to innermost.
parse_taxonomy_records¶
Return [(organism_id, name, [category, ...]), ...] in file order.
name is the organism's scientific name with any trailing KEGG parenthetical
kept (e.g. "Homo sapiens (human)"), matching RAVEN getPhylDist; the category
list is the lineage from outermost (domain) to innermost. Order is preserved so it
aligns with the rows/columns of :func:phyl_dist.
phyl_dist¶
Compute the KEGG phylogenetic distance matrix from the taxonomy file.
Faithful port of RAVEN getPhylDist. For organisms i and j with category
lineages of lengths Li, Lj, the distance is (Li - Lj) + min(Li, Lj) - k where
k is the deepest level (within the shorter lineage, comparing position by position
from the root) at which the two categories coincide — 1 if none coincide. This
reproduces RAVEN's exact values (including their asymmetry and occasional negatives)
so the output matches MATLAB keggPhylDist that GECKO was built against.
only_in_kingdom sets the distance to +inf between organisms in different
domains (RAVEN's onlyInKingdom variant); the matrix is then returned as float.
Cost is O(N²) in time and memory (N = number of KEGG organisms, ~10⁴), so this
is a maintainer/GECKO-side generation step; persist the result rather than rebuilding.
PhylDist¶
KEGG phylogenetic distance structure — RAVEN getPhylDist / keggPhylDist.
ids/names are the KEGG organism ids and scientific names (names keep RAVEN's
trailing parenthetical; consumers that need them cleaned, e.g. geckopy, strip it).
dist_matrix[i, j] is RAVEN's distance from ids[i] to ids[j]. It is a
faithful reproduction of RAVEN's metric, which is asymmetric and may be negative;
GECKO consumes it only by taking the closest (minimum-distance) organism, so the exact
values matter for parity with MATLAB but the sign/asymmetry are harmless there.
read_kegg_table¶
Read a KEGG table written by :func:write_kegg_tables or
:func:stream_organism_gene_ko.
Compression is inferred from the suffix; all published tables are gzipped TSV
(.tsv.gz), and a version-prefixed name (kegg116_<name>.tsv.gz) reads
just the same.
run_blast¶
Bidirectional BLAST+ between an organism and template organisms.
Returns the hits DataFrame (filtered at
evalue). Requires BLAST+ (blastp, makeblastdb).
run_diamond¶
Bidirectional DIAMOND between an organism and template organisms.
Returns the hits DataFrame. Requires DIAMOND.
run_hmmsearch¶
Search the profile library against proteome fasta; return tblout text.
One hmmsearch of the whole concatenated multi-profile library (query)
against fasta (target) — the fast search direction, parallelised with
--cpu, and no hmmpress needed. -Z is fixed to the profile count so the
per-hit E-values match the convention :func:assign_kos is calibrated against
(identical to a hmmscan against the same library).
stream_organism_gene_ko¶
Stream the ko file to a sorted, gzipped organism_gene_ko.tsv.gz.
Real KEGG has ~9M gene↔KO associations — far too many to hold in memory as a
DataFrame. Rows are sorted by (organism, gene) before writing: gene IDs
from one organism share long common prefixes (locus tags, numeric runs), so
sorting makes them adjacent and helps the compressor; the order also matches
the by-organism query pattern in :func:get_kegg_model_for_organism. Gzip
(not xz) keeps the table readable by MATLAB's built-in gunzip with no
external tool, at a modestly larger size.
The sort is an external merge sort bounded to chunk_rows rows in
memory at a time (sorted runs spooled to gzipped temp files, then merged with
:func:heapq.merge), so peak memory stays flat regardless of KEGG size. Only
the small ko_names table (one row per KO) is held in full and returned.
progress shows a tqdm byte-progress bar over the (large) ko read pass.
validate_hits¶
Check a hits DataFrame has the required columns; return it unchanged.
write_kegg_tables¶
Write each table as a gzipped TSV (<prefix><name>.tsv.gz) into out_dir.
Gzipped TSV is the dependency-free cross-language format shared with MATLAB
RAVEN (see docs/kegg_data_format.md) — readable by MATLAB's built-in gunzip
with no external tool. prefix version-tags the filenames (e.g.
kegg116_). Returns the written paths.