This directory contains scripts to rebuild MMFP inputs and reproduce all paper results.

| Paper Table | Script | Description |
|---|---|---|
| Table 1 | `reproduce_full_model.py` | Full model train + eval with temporal text embeddings |
| Table 2 | `reproduce_modality_contribution.py` | Retrain with modality subsets (seq_only, seq_text, seq_struct, seq_ppi) |
| Table 3 | `run_ablation.py` | Ablation over fusion variants (hybrid, concat, bilinear, aux_only) |
| Table 4 | `reproduce_masking_eval.py` | Single-modality masking evaluation using trained checkpoints |

Prepare the canonical mixed text bundle:

    python scripts/extract_uniprot_text.py prepare-temporal-text

This produces:

- `data/embedding_cache/uniprot_text/protein_descriptions.tsv`
- `data/embedding_cache/uniprot_text/protein_descriptions_historical.tsv`
- `data/embedding_cache/uniprot_text/temporal_recipe/protein_descriptions_historical_punct_v1_test.tsv`
- `data/embedding_cache/uniprot_text/temporal_recipe/protein_descriptions_mixed.tsv`

Train from scratch with the temporal text cache:

    python train.py \
      --seq-model prott5 \
      --fusion-types gated_bilinear \
      --aspects BPO CCO MFO \
      --use-late-fusion \
      --text-embedding-dir data/embedding_cache/exp_text_embeddings_temporal \
      --output-base results/full_model \
      --num-workers 0 \
      --seed 42

Verify against the kept target metrics:

    python scripts/reproduce_full_model.py

    python scripts/reproduce_modality_contribution.py

This retrains 4 variants: seq_only, seq_text, seq_struct, seq_ppi. Results are written to `results/modality_contribution/`.

    python scripts/run_ablation.py

This retrains the fusion variants: hybrid_gated, concat_only, concat_plus_aux, bilinear_gated, aux_only. Results are written to `results/ablation/`.

    python scripts/reproduce_masking_eval.py

This uses the full model checkpoints and evaluates them with single modalities masked. Results are written to `results/masking_eval/`.


| Modality | Script(s) | Output Dimension | Description |
|---|---|---|---|
| PPI | `extract_ppi_embeddings.py` | 512-D | STRING protein-protein interaction network embeddings |
| Text | `extract_uniprot_text.py` + `embed_uniprot_descriptions.py` | 768-D | Current + historical UniProt text and PubMedBERT embeddings |
| Structure | `check_alphafold_coverage.py` + `extract_esm_if1_embeddings.py` | 512-D | ESM-IF1 structure embeddings from AlphaFold PDB files |
| ProtT5 | `extract_prott5_embeddings.py` | 1024-D | ProtT5-XL sequence embeddings |


    pip install torch transformers numpy tqdm requests h5py fair-esm

| Resource | Required By | Download |
|---|---|---|
| CAFA Assessment Tool | PPI, Text, Structure | https://github.com/ashleyzhou972/CAFA_assessment_tool |
| STRING Network Files | PPI | https://string-db.org/cgi/download |
| AlphaFold API | Structure | Internet access required |
| UniProt API | Text | Internet access required |
Scripts that previously had hardcoded paths now use CLI arguments or environment variables:

| Variable | Used By | Description |
|---|---|---|
| `CAFA_ASSESSMENT_DIR` | `extract_uniprot_text.py`, `extract_ppi_embeddings.py`, `check_alphafold_coverage.py` | Path to CAFA assessment tool |
| `CAFA3_RAW_DIR` | `prepare_cafa3_data.py` | Path to raw CAFA3 CSV files |
| `STRING_H5_FILE` | `extract_ppi_embeddings.py` | Path to STRING embeddings h5 file |
| `STRING_ALIAS_FILE` | `extract_ppi_embeddings.py` | Path to STRING alias file |

All scripts also accept these as CLI arguments (run with --help to see options).
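
One common way to combine an environment variable with a CLI argument is to use the variable as the argument's default. The sketch below illustrates that pattern for `CAFA_ASSESSMENT_DIR` and `--cafa-assessment-dir`. It is not taken from the scripts themselves: the function names `build_parser` and `resolve_cafa_dir` are hypothetical, and the precedence (CLI flag over environment variable) is an assumption.

```python
import argparse
import os


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser: the environment variable supplies the default."""
    parser = argparse.ArgumentParser(description="Sketch of env-backed CLI options")
    parser.add_argument(
        "--cafa-assessment-dir",
        # Read at parser construction time; an explicit flag overrides it.
        default=os.environ.get("CAFA_ASSESSMENT_DIR"),
        help="Path to CAFA assessment tool (falls back to CAFA_ASSESSMENT_DIR)",
    )
    return parser


def resolve_cafa_dir(argv=None) -> str:
    """Return the directory from the CLI flag, else from the environment."""
    args = build_parser().parse_args(argv)
    if args.cafa_assessment_dir is None:
        raise SystemExit("Set CAFA_ASSESSMENT_DIR or pass --cafa-assessment-dir")
    return args.cafa_assessment_dir
```

With this pattern, exporting the variable once covers repeated runs, while a one-off flag can still point a single run at a different checkout.
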
`extract_ppi_embeddings.py` extracts 512-D protein-protein interaction embeddings from the STRING database.
Usage:

    python scripts/extract_ppi_embeddings.py \
      --string-h5 /path/to/protein.network.embeddings.v12.0.h5 \
      --string-alias /path/to/protein.aliases.v12.0.txt \
      --cafa-assessment-dir /path/to/CAFA_assessment_tool

Output: `data/embedding_cache/ppi/{protein_id}.npy`

`extract_uniprot_text.py` supports:
- current UniProt text extraction
- historical UniSave extraction for the test split with a `2016-02-17` cutoff
- canonical temporal text assembly

Usage:

    python scripts/extract_uniprot_text.py extract-current
    python scripts/extract_uniprot_text.py extract-historical
    python scripts/extract_uniprot_text.py prepare-temporal-text

Outputs:

- `data/embedding_cache/uniprot_text/protein_descriptions.tsv`
- `data/embedding_cache/uniprot_text/protein_descriptions_historical.tsv`
- `data/embedding_cache/uniprot_text/temporal_recipe/*.tsv`

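
The historical extraction relies on a `2016-02-17` cutoff. As a rough illustration of what a temporal cutoff means, the sketch below picks the most recent entry version released on or before that date. The input format (a list of `(version_number, release_date)` pairs), the function name `latest_version_before`, and the inclusive comparison are all assumptions made for this example; the script's actual handling of UniSave history may differ.

```python
from datetime import date
from typing import Optional

CUTOFF = date(2016, 2, 17)  # cutoff date stated for the test split


def latest_version_before(
    versions: list[tuple[int, date]], cutoff: date = CUTOFF
) -> Optional[int]:
    """Return the newest version released on or before the cutoff, else None.

    `versions` holds hypothetical (version_number, release_date) pairs.
    """
    eligible = [(released, number) for number, released in versions if released <= cutoff]
    if not eligible:
        return None  # no annotation existed yet at the cutoff
    # max() orders by release date first, then by version number.
    return max(eligible)[1]


history = [(1, date(2014, 5, 1)), (2, date(2016, 1, 20)), (3, date(2016, 3, 16))]
print(latest_version_before(history))
```

Selecting text this way keeps annotations written after the cutoff out of the test-split descriptions.
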
`embed_uniprot_descriptions.py` generates PubMedBERT embeddings from the extracted text descriptions.
Usage:

    python scripts/embed_uniprot_descriptions.py --data-dir data

Primary outputs:

- `data/embedding_cache/exp_text_embeddings/{protein_id}.npy`
- `data/embedding_cache/exp_text_embeddings_temporal/{protein_id}.npy`


    python scripts/check_alphafold_coverage.py \
      --cafa-assessment-dir /path/to/CAFA_assessment_tool

Output: `data/alphafold_structures/{protein_id}.pdb`


    python scripts/extract_esm_if1_embeddings.py \
      --pdb_dir data/alphafold_structures \
      --output_dir data/embedding_cache/IF1 \
      --pooling mean \
      --device cuda

Output: `data/embedding_cache/IF1/{protein_id}.npy`

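
The `--pooling mean` option suggests that per-residue embeddings are averaged into one fixed-size vector per protein. The sketch below shows that operation on a plain NumPy array. It is an illustration under assumptions: the function name `pool_residue_embeddings` is hypothetical, the input is assumed to have shape `(num_residues, dim)`, and only the `mean` mode is shown because no other modes are documented here.

```python
import numpy as np


def pool_residue_embeddings(per_residue: np.ndarray, pooling: str = "mean") -> np.ndarray:
    """Collapse a (num_residues, dim) array into a single (dim,) vector."""
    if per_residue.ndim != 2:
        raise ValueError(f"expected a 2-D array, got shape {per_residue.shape}")
    if pooling != "mean":
        raise ValueError(f"unsupported pooling mode in this sketch: {pooling}")
    # Average over the residue axis so proteins of any length share one shape.
    return per_residue.mean(axis=0)


# Toy input: 3 residues with 2 features each (real ESM-IF1 vectors are 512-D).
toy = np.arange(6, dtype=np.float32).reshape(3, 2)
pooled = pool_residue_embeddings(toy)
```

Mean pooling makes the output dimension independent of sequence length, which is what allows one `.npy` vector per protein.
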

    python scripts/extract_prott5_embeddings.py \
      --fasta_file data/proteins.fasta \
      --output_dir data/embedding_cache/prott5 \
      --batch_size 8

Output: `data/embedding_cache/prott5/{protein_id}.npy`

`prepare_cafa3_data.py` prepares the CAFA3 benchmark data splits.

    python scripts/prepare_cafa3_data.py --cafa3-dir /path/to/cafa3

Output:

- `data/{aspect}_{split}_names.npy`: Protein IDs
- `data/{aspect}_{split}_labels.npz`: GO term labels
- `data/{aspect}_{split}_sequences.json`: Protein sequences

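
After preparing the splits, it can be useful to confirm that every expected file exists before training. The sketch below enumerates the `{aspect}_{split}_*` file names and reports the missing ones. The aspect tokens (`BPO`, `CCO`, `MFO`, as passed to `train.py`) and split tokens (`train`, `valid`, `test`) used in the file names are assumptions, as are the helper names; adjust them to match what the script actually writes.

```python
from pathlib import Path

ASPECTS = ("BPO", "CCO", "MFO")       # assumed aspect tokens in file names
SPLITS = ("train", "valid", "test")   # assumed split tokens in file names
SUFFIXES = ("names.npy", "labels.npz", "sequences.json")


def expected_split_files(data_dir="data"):
    """List every {aspect}_{split}_{suffix} path under data_dir."""
    root = Path(data_dir)
    return [
        root / f"{aspect}_{split}_{suffix}"
        for aspect in ASPECTS
        for split in SPLITS
        for suffix in SUFFIXES
    ]


def missing_split_files(data_dir="data"):
    """Return the expected paths that do not exist on disk."""
    return [path for path in expected_split_files(data_dir) if not path.exists()]
```

Calling `missing_split_files("data")` returns an empty list when all 27 files are present.
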

Modality coverage by split, based on CAFA3 benchmark data:

| Modality | Train | Valid | Test |
|---|---|---|---|
| Text | 100% | 100% | 100% |
| Structure | ~98% | ~98% | ~90% |
| PPI | ~83% | ~83% | ~87% |
| ProtT5/ESM | 100% | 100% | 100% |
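
Percentages like these can be checked locally by counting how many protein IDs in a split have a cached `{protein_id}.npy` file. The sketch below does that for one cache directory. It assumes coverage is defined as the fraction of IDs with an embedding file on disk, and the function name `embedding_coverage` is hypothetical.

```python
from pathlib import Path


def embedding_coverage(protein_ids, cache_dir) -> float:
    """Fraction of protein IDs with a {protein_id}.npy file in cache_dir."""
    ids = list(protein_ids)
    if not ids:
        return 0.0  # avoid dividing by zero on an empty split
    cache = Path(cache_dir)
    found = sum((cache / f"{pid}.npy").is_file() for pid in ids)
    return found / len(ids)
```

For example, passing the IDs loaded from a `{aspect}_{split}_names.npy` file together with `data/embedding_cache/ppi` would give the PPI coverage for that split.
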