Painless sequence-function models for proteins and nucleotides.
Made with ❤️ by Eric Ma (@ericmjl).
Fast-SeqFunc is a Python package for efficient sequence-function modeling of proteins and nucleotides. It provides a simple, high-level API that handles sequence embedding and automates model selection and training.
The core purpose of Fast-SeqFunc is to quickly determine whether there is meaningful "signal" in your sequence-function data. By rapidly building baseline models, you can discover early whether predictive relationships exist, and opportunistically use those models to score and rank candidate sequences to test. Once signal is detected, you can invest your time more effectively in developing advanced models (such as deep neural networks) as a second iteration.
- Multiple Embedding Methods:
  - One-hot encoding (currently implemented)
  - CARP (Microsoft's protein-sequence-models) - planned for future releases
  - ESM2 (Facebook's ESM) - planned for future releases
- Automated Machine Learning:
  - Uses PyCaret for model selection and hyperparameter tuning
  - Supports regression and classification tasks
  - Evaluates performance with appropriate metrics
- Sequence Handling:
  - Flexible handling of variable-length sequences
  - Configurable padding options for consistent embeddings
  - Support for custom alphabets
- Simple API:
  - Single function call to train models
  - Handles data loading and preprocessing
- Command-line Interface:
  - Train models directly from the command line
  - Make predictions on new sequences
  - Compare different embedding methods
To install from PyPI:
pip install fast-seqfunc
To install the development version from source:
git clone [email protected]:ericmjl/fast-seqfunc
cd fast-seqfunc
pixi install
To get started, train a model and make predictions with just a few lines of code:
from fast_seqfunc import train_model, predict
import pandas as pd
# Load your sequence-function data
train_data = pd.read_csv("train_data.csv")
test_data = pd.read_csv("test_data.csv")
# Train a model
model = train_model(
    train_data=train_data,
    test_data=test_data,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",  # currently the only implemented method
    model_type="regression",  # or "classification"
)
# Make predictions on new sequences
new_data = pd.read_csv("new_sequences.csv")
predictions = predict(model, new_data["sequence"])
# Save the model for later use
model.save("my_model.pkl")
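The quickstart above assumes CSV files with a sequence column and a numeric target column, matching the sequence_col and target_col arguments. As a minimal sketch (the sequences and values below are made-up placeholders), a compatible file can be prepared like this:
import pandas as pd

# Toy sequence-function data; replace with your own measurements
df = pd.DataFrame(
    {
        "sequence": ["ACDEFGHIKL", "ACDEFGHIKV", "ACDEFGHIKW"],
        "function": [0.82, 0.65, 0.91],
    }
)
df.to_csv("train_data.csv", index=False)
For regression, the target column holds numeric measurements; for classification, it holds class labels.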
The same workflow is available from the command line.
Train a model:
# All outputs (model, metrics, cache) will be saved to the 'outputs' directory
fast-seqfunc train train_data.csv --sequence-col sequence --target-col function --embedding-method one-hot --output-dir outputs
Make predictions:
# All prediction outputs will be saved to the 'prediction_outputs' directory
fast-seqfunc predict-cmd outputs/model.pkl new_sequences.csv --output-dir prediction_outputs
Compare embedding methods:
# All outputs (comparison results, metrics, models, cache) will be saved to the 'comparison_outputs' directory
fast-seqfunc compare-embeddings train_data.csv --test-data test_data.csv --output-dir comparison_outputs
Currently, only one-hot encoding is implemented. Support for multiple embedding methods is planned for future releases.
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
)
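To make the one-hot method concrete: each position in a sequence becomes a vector that is all zeros except for a single 1 at the index of that position's character in the alphabet. The sketch below is purely illustrative (a hypothetical four-letter alphabet, not the library's internal implementation):
import numpy as np

alphabet = "ACGT"  # hypothetical alphabet for illustration
index = {char: i for i, char in enumerate(alphabet)}

def one_hot(seq: str) -> np.ndarray:
    # One row per position, one column per alphabet character
    matrix = np.zeros((len(seq), len(alphabet)))
    for pos, char in enumerate(seq):
        matrix[pos, index[char]] = 1.0
    return matrix

print(one_hot("ACGT"))  # rows pick out A, C, G, T in turn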
The output directories from CLI commands contain comprehensive model performance metrics and visualizations:
outputs/                              # Main output directory
├── model.pkl                         # Saved model
├── summary.json                      # Summary of output locations and parameters
├── metrics/                          # Performance metrics and visualizations
│   ├── one-hot_metrics.json          # Detailed metrics in JSON format
│   ├── one-hot_predictions.csv       # Raw predictions and true values
│   ├── one-hot_scatter_plot.png      # Visualization plots
│   ├── one-hot_residual_plot.png
│   └── ...
└── cache/                            # Cached embeddings
For predictions:
prediction_outputs/                   # Prediction output directory
├── predictions.csv                   # Saved predictions
├── predictions_histogram.png         # Histogram of prediction values (for regression)
└── prediction_summary.json           # Summary of prediction parameters
When comparing embedding methods, a similar structure is created:
comparison_outputs/
├── embedding_comparison.csv          # Table comparing all methods
├── embedding_comparison_plot.png     # Bar chart comparing metrics across methods
├── summary.json                      # Summary of output locations and parameters
├── models/                           # Saved models for each method
│   └── one-hot_model.pkl
├── metrics/                          # Performance metrics for each method
│   └── one-hot_metrics.json
└── cache/                            # Cached embeddings
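All of these outputs are plain JSON, CSV, and PNG files, so they are easy to inspect programmatically. A sketch, assuming the paths shown in the trees above (the exact keys inside the JSON depend on the model type and the metrics computed):
import json
from pathlib import Path

import pandas as pd

# Inspect detailed metrics from a training run
metrics = json.loads(Path("outputs/metrics/one-hot_metrics.json").read_text())
print(metrics)

# Load the comparison table produced by compare-embeddings
comparison = pd.read_csv("comparison_outputs/embedding_comparison.csv")
print(comparison)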
You can also generate these outputs programmatically:
from pathlib import Path
from fast_seqfunc import train_model, save_model, save_detailed_metrics
# Create output directories
output_dir = Path("my_model_outputs")
output_dir.mkdir(exist_ok=True)
metrics_dir = output_dir / "metrics"
metrics_dir.mkdir(exist_ok=True)
cache_dir = output_dir / "cache"
cache_dir.mkdir(exist_ok=True)
# Train model
model_info = train_model(
    train_data=train_data,
    test_data=test_data,
    embedding_method="one-hot",
    cache_dir=cache_dir,
)
# Save model
save_model(model_info, output_dir / "model.pkl")
# Save detailed metrics if test data was provided
if model_info.get("test_results"):
    save_detailed_metrics(
        metrics_data=model_info["test_results"],
        output_dir=metrics_dir,
        model_type=model_info["model_type"],
        embedding_method="one-hot",
    )
Specify metrics to optimize during model selection:
model = train_model(
    train_data=train_data,
    model_type="regression",
    optimization_metric="r2",  # or "rmse", "mae", etc.
)
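The same argument applies to classification models. The set of accepted metric names comes from PyCaret, so treat the string below as an assumption and check the documentation for the exact spelling:
# Hedged sketch: "accuracy" is an assumed metric name for classification
model = train_model(
    train_data=train_data,
    model_type="classification",
    optimization_metric="accuracy",
)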
Fast-SeqFunc handles variable length sequences with configurable padding:
# Default behavior pads all sequences to the max length with "-"
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"pad_sequences": True, "gap_character": "-"},
)

# Disable padding for sequences of different lengths
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"pad_sequences": False},
)

# Set a fixed maximum length and custom gap character
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"max_length": 100, "gap_character": "X"},
)
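To see what padding does, here is a plain-Python sketch of the behavior described above (illustrative only, not the library's code; whether sequences longer than a fixed max_length are truncated or rejected is an implementation detail, and truncation is assumed here):
def pad_or_truncate(seq: str, length: int, gap: str = "-") -> str:
    # Pad with the gap character, or cut off anything beyond `length`
    return seq[:length].ljust(length, gap)

seqs = ["ACDE", "ACDEFGH", "AC"]
target = max(len(s) for s in seqs)  # default behavior: pad to the longest sequence
print([pad_or_truncate(s, target) for s in seqs])
# ['ACDE---', 'ACDEFGH', 'AC-----']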
For a complete example, see examples/variable_length_sequences.py.
For full documentation, visit https://ericmjl.github.io/fast-seqfunc/.
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.