feat(project)✨: Enhance Fast-SeqFunc with CLI, embedding, and model functionalities #1

Merged 17 commits on Mar 25, 2025
Commits
60d0b09
feat(project)✨: Enhance Fast-SeqFunc with CLI, embedding, and model f…
ericmjl Mar 23, 2025
9c8c747
fix(ci)🔧: Correct environment name in GitHub Actions workflow
ericmjl Mar 23, 2025
2bd402a
docs(documentation)📚: Enhance documentation with API reference and de…
ericmjl Mar 24, 2025
836550f
fix(dependencies)🔧: Corrected dependency formatting and updated lockf…
ericmjl Mar 24, 2025
486c716
Update pixi lockfile.
ericmjl Mar 24, 2025
0b7f5a6
feat(project)✨: Add example scripts and enhance sequence-function mod…
ericmjl Mar 24, 2025
66dcfed
docs(documentation)📚: Add comprehensive documentation for fast-seqfunc
ericmjl Mar 24, 2025
a665fe6
docs(roadmap)🗺️: Add a roadmap document outlining planned development…
ericmjl Mar 24, 2025
0cf86e5
refactor(tests)🧪: Refactor tests for OneHotEmbedder to align with upd…
ericmjl Mar 24, 2025
bef3538
docs(design)📝: Add design document for custom alphabets in fast-seqfunc
ericmjl Mar 24, 2025
fda641a
feat(synthetic data)✨: Add synthetic data generation and visualizatio…
ericmjl Mar 24, 2025
18ddde7
feat(sequence handling)✨: Add support for variable-length sequence pa…
ericmjl Mar 24, 2025
1f8035d
feat(core)✨: Add confidence score option to prediction function
ericmjl Mar 24, 2025
af9e76b
refactor(cli)🔄: Refactor model handling to use model_info structure
ericmjl Mar 25, 2025
6e3f769
Update pixi.lock file.
ericmjl Mar 25, 2025
22c34fd
refactor(embedders)🔧: Enhance OneHotEmbedder to support variable-leng…
ericmjl Mar 25, 2025
eb32487
fix(cli)🛠️: Remove unsupported 'multi-class' option from CLI model type
ericmjl Mar 25, 2025
2 changes: 1 addition & 1 deletion .github/workflows/pr-tests.yaml
@@ -19,7 +19,7 @@ jobs:
       - uses: prefix-dev/[email protected]
         with:
           cache: true
-          environments: testing
+          environments: tests

       - name: Run tests
         run: |
3 changes: 3 additions & 0 deletions .gitignore
@@ -149,3 +149,6 @@ oryx-build-commands.txt
.DS_Store
docs/cli.md
.pixi
message_log.db
catboost_info/*
examples/output/*
4 changes: 4 additions & 0 deletions .pre-commit-config.yaml
@@ -14,19 +14,23 @@ repos:
    hooks:
      - id: interrogate
        args: [-c, pyproject.toml]
        exclude: ^notebooks/.*\.py$
  - repo: https://github.com/jsh9/pydoclint
    rev: 0.6.2
    hooks:
      - id: pydoclint
        args:
          - "--config=pyproject.toml"
        exclude: ^notebooks/.*\.py$
  - repo: https://github.com/astral-sh/ruff-pre-commit
    # Ruff version.
    rev: v0.11.2
    hooks:
      - id: ruff
        args: [--fix, --exit-non-zero-on-fix, --exclude, nbconvert_config.py]
        exclude: ^notebooks/.*\.py$
      - id: ruff-format
        exclude: ^notebooks/.*\.py$
  - repo: local
    hooks:
      - id: pixi-install
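The `exclude` pattern added to each hook above is a Python regular expression that pre-commit matches against repository-relative file paths. The following is a minimal sketch of how that pattern behaves, assuming search-style matching; the helper name `is_excluded` and the sample paths are hypothetical and are not part of this PR.

```python
import re

# Pattern copied from the hooks above; the helper below is illustrative only.
EXCLUDE_PATTERN = re.compile(r"^notebooks/.*\.py$")


def is_excluded(path: str) -> bool:
    """Return True when a repository-relative path matches the exclude pattern."""
    return EXCLUDE_PATTERN.search(path) is not None


# Hypothetical paths used only to show which files the pattern skips.
checks = {
    "notebooks/demo.py": is_excluded("notebooks/demo.py"),
    "notebooks/sub/demo.py": is_excluded("notebooks/sub/demo.py"),
    "notebooks/demo.ipynb": is_excluded("notebooks/demo.ipynb"),
    "fast_seqfunc/core.py": is_excluded("fast_seqfunc/core.py"),
}
```

Only `.py` files under `notebooks/` are skipped; notebook files and package modules are still linted.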
166 changes: 164 additions & 2 deletions README.md
@@ -4,12 +4,174 @@ Painless sequence-function models for proteins and nucleotides.

Made with ❤️ by Eric Ma (@ericmjl).

## Get started for development
## Overview

To get started:
Fast-SeqFunc is a Python package for efficient sequence-function modeling of proteins and nucleotides. It provides a simple, high-level API that supports several sequence embedding methods and automates model selection and training.

### Key Features

- **Multiple Embedding Methods**:
  - One-hot encoding
  - CARP (Microsoft's protein-sequence-models)
  - ESM2 (Facebook's ESM)

- **Automated Machine Learning**:
  - Uses PyCaret for model selection and hyperparameter tuning
  - Supports regression and classification tasks
  - Evaluates performance with appropriate metrics

- **Sequence Handling**:
  - Flexible handling of variable-length sequences
  - Configurable padding options for consistent embeddings
  - Custom alphabets support

- **Simple API**:
  - Single function call to train models
  - Handles data loading and preprocessing

- **Command-line Interface**:
  - Train models directly from the command line
  - Make predictions on new sequences
  - Compare different embedding methods
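
To make the one-hot option listed above concrete, here is a minimal pure-Python sketch of one-hot encoding. It is not the package's `OneHotEmbedder`; the function name `one_hot_encode` and the flat output layout are assumptions chosen for illustration.

```python
def one_hot_encode(sequence: str, alphabet: str) -> list[int]:
    """Return a flat vector with one block of len(alphabet) entries per position."""
    index = {char: i for i, char in enumerate(alphabet)}
    vector: list[int] = []
    for char in sequence:
        if char not in index:
            raise ValueError(f"Character {char!r} is not in the alphabet")
        block = [0] * len(alphabet)
        block[index[char]] = 1  # mark the position of this character
        vector.extend(block)
    return vector


# Each of the four bases becomes a block of four numbers.
dna_vector = one_hot_encode("ACGT", alphabet="ACGT")
```

A sequence of length `n` over an alphabet of size `k` yields `n * k` features, which is why sequences of different lengths need padding before they can share one feature matrix.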

## Installation

### Using pip

```bash
pip install fast-seqfunc
```

### From Source

```bash
git clone git@github.com:ericmjl/fast-seqfunc
cd fast-seqfunc
pixi install
```

## Quick Start

### Python API

```python
from fast_seqfunc import train_model, predict
import pandas as pd

# Load your sequence-function data
train_data = pd.read_csv("train_data.csv")
val_data = pd.read_csv("val_data.csv")

# Train a model
model = train_model(
    train_data=train_data,
    val_data=val_data,
    sequence_col="sequence",
    target_col="function",
    embedding_method="one-hot",  # or "carp", "esm2", "auto"
    model_type="regression",  # or "classification"
)

# Make predictions on new sequences
new_data = pd.read_csv("new_sequences.csv")
predictions = predict(model, new_data["sequence"])

# Save the model for later use
model.save("my_model.pkl")
```
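
The quick start above assumes a CSV file with a `sequence` column and a `function` column. As a rough sketch of that layout, the snippet below writes a toy table using only the standard library; the helper `write_sequence_csv` and the numeric values are made up for illustration and do not come from the package.

```python
import csv
import io


def write_sequence_csv(rows, handle, sequence_col="sequence", target_col="function"):
    """Write (sequence, value) pairs as CSV with a header row."""
    writer = csv.writer(handle)
    writer.writerow([sequence_col, target_col])
    for sequence, value in rows:
        writer.writerow([sequence, value])


# Toy rows with invented target values, for layout illustration only.
toy_rows = [("ACGT", 0.12), ("ACGA", 0.48), ("TCGA", 0.91)]

buffer = io.StringIO()
write_sequence_csv(toy_rows, buffer)
csv_text = buffer.getvalue()
```

Replacing `buffer` with a file opened via `open("train_data.csv", "w", newline="")` would produce a file in the shape the quick start reads with `pd.read_csv`.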

### Command-line Interface

Train a model:

```bash
fast-seqfunc train train_data.csv --sequence-col sequence --target-col function --embedding-method one-hot --output-path model.pkl
```

Make predictions:

```bash
fast-seqfunc predict-cmd model.pkl new_sequences.csv --output-path predictions.csv
```

Compare embedding methods:

```bash
fast-seqfunc compare-embeddings train_data.csv --test-data test_data.csv --output-path comparison.csv
```
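
If you prefer to drive the CLI from a script, the `train` command shown above can be assembled as an argument list. This is a sketch that uses only the flags appearing in this README; the helper `build_train_command` is hypothetical, and actually running the command assumes `fast-seqfunc` is installed and on your `PATH`.

```python
import shlex


def build_train_command(
    train_csv: str,
    output_path: str,
    sequence_col: str = "sequence",
    target_col: str = "function",
    embedding_method: str = "one-hot",
) -> list[str]:
    """Build the argument list for `fast-seqfunc train` using the README's flags."""
    return [
        "fast-seqfunc", "train", train_csv,
        "--sequence-col", sequence_col,
        "--target-col", target_col,
        "--embedding-method", embedding_method,
        "--output-path", output_path,
    ]


command = build_train_command("train_data.csv", "model.pkl")
command_line = shlex.join(command)  # shell-safe string form, handy for logging
```

Passing `command` to `subprocess.run(command, check=True)` would execute it; using a list rather than a single string avoids shell-quoting problems with unusual file names.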

## Advanced Usage

### Using Multiple Embedding Methods

You can try multiple embedding methods in one run:

```python
model = train_model(
    train_data=train_data,
    embedding_method=["one-hot", "carp", "esm2"],
)
```

### Custom Metrics for Optimization

Specify metrics to optimize during model selection:

```python
model = train_model(
    train_data=train_data,
    model_type="regression",
    optimization_metric="r2",  # or "rmse", "mae", etc.
)
```
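
For reference, the three regression metrics named above have simple textbook definitions. The sketch below computes them in plain Python on made-up numbers; it illustrates what each metric measures and is not how Fast-SeqFunc or PyCaret computes them internally.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average size of the errors."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large errors more heavily."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 minus residual variance over total variance."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Invented values: three exact predictions and one that is off by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
scores = {"mae": mae(y_true, y_pred), "rmse": rmse(y_true, y_pred), "r2": r2(y_true, y_pred)}
```

Lower is better for `mae` and `rmse`, while higher is better for `r2`, so the direction of optimization depends on the metric chosen.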

### Getting Confidence Estimates

```python
predictions, confidence = predict(
    model,
    sequences,
    return_confidence=True,
)
```

### Handling Variable Length Sequences

Fast-SeqFunc handles variable-length sequences with configurable padding:

```python
# Default behavior pads all sequences to the max length with "-"
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"pad_sequences": True, "gap_character": "-"},
)

# Disable padding for sequences of different lengths
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"pad_sequences": False},
)

# Set a fixed maximum length and custom gap character
model = train_model(
    train_data=train_data,
    embedding_method="one-hot",
    embedder_kwargs={"max_length": 100, "gap_character": "X"},
)
```

For a complete example, see `examples/variable_length_sequences.py`.
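
The padding behavior described in this section can be sketched in a few lines of plain Python. This is an illustration, not the embedder's implementation: the function name `pad_to_length` is hypothetical, and truncating sequences longer than `max_length` is an assumption that the README does not confirm.

```python
def pad_to_length(sequences, gap_character="-", max_length=None):
    """Right-pad sequences with gap_character to a common length."""
    # Default target is the longest sequence, as in the README's default case.
    target = max_length if max_length is not None else max(len(s) for s in sequences)
    padded = []
    for sequence in sequences:
        trimmed = sequence[:target]  # assumption: longer sequences are truncated
        padded.append(trimmed + gap_character * (target - len(trimmed)))
    return padded


default_padded = pad_to_length(["ACGT", "AC", "A"])
fixed_padded = pad_to_length(["ACGT", "AC"], gap_character="X", max_length=3)
```

After padding, every sequence has the same length, so each one maps to a feature vector of the same size.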

## Documentation

For full documentation, visit [https://ericmjl.github.io/fast-seqfunc/](https://ericmjl.github.io/fast-seqfunc/).

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

## License

This project is licensed under the MIT License - see the LICENSE file for details.
40 changes: 40 additions & 0 deletions docs/api.md
@@ -1,3 +1,43 @@
# Top-level API for fast-seqfunc

::: fast_seqfunc

# API Reference

This page provides the API reference for Fast-SeqFunc.

## Core API

These are the main functions you'll use to train models and make predictions.

::: fast_seqfunc.core
    options:
        show_root_heading: false
        show_source: false

## Embedders

Sequence embedding methods to convert protein or nucleotide sequences into numerical representations.

::: fast_seqfunc.embedders
    options:
        show_root_heading: false
        show_source: false

## Models

Model classes for sequence-function prediction.

::: fast_seqfunc.models
    options:
        show_root_heading: false
        show_source: false

## CLI

Command-line interface for Fast-SeqFunc.

::: fast_seqfunc.cli
    options:
        show_root_heading: false
        show_source: false