metbit

An open-source Python package for reproducible ¹H NMR metabolomics - from raw FID preprocessing through normalization, chemometrics, and interactive visualization, in a single scriptable workflow.

metbit v9.1.0 extends the v9.0.0 compute backend with a full machine-learning and deep-learning layer, extended multivariate analysis (LDA, PLSR, ICA, HCA), volcano-plot and ANOVA/Kruskal statistics, flexible train/test and cross-validation utilities, and an interactive spectra viewer - all behind the same four-tier auto-dispatch backend (GPU - C+OpenMP - multiprocessing - chunked NumPy).

Installation
Quick Start
Workflow Overview
NMR Preprocessing
Normalization and Alignment
Principal Component Analysis
OPLS-DA
Extended Multivariate Analysis
Statistical Testing
Machine Learning Classifiers
Deep Learning Models
Train/Test and Cross-Validation
Spectra Visualization
STOCSY
Large-Scale Compute Backend
Interactive Dash Applications
Performance
Testing
Citation

Installation

pip install metbit

The C extension (VIP kernel, Pearson correlation, column variance) is compiled automatically at install time when a C compiler is available. OpenMP parallelism is enabled when the compiler supports -fopenmp.

# Verify the active compute backend after installation
python -c "import metbit; print(metbit.backend_info())"
# {'native_c': True, 'openmp_threads': 8, 'gpu_cupy': False, 'gpu_torch': False, ...}

Optional extras:

pip install metbit[ml]   # XGBoost classifier support
pip install metbit[dl]   # PyTorch deep-learning models
pip install metbit[all]  # both of the above

Optional GPU acceleration - install CuPy or PyTorch with CUDA support:

pip install cupy-cuda12x   # or torch with CUDA

Quick Start

import pandas as pd
from metbit import nmr_preprocessing, pca, opls_da, STOCSY

# 1. Load a preprocessed spectral matrix (samples x variables)
df = pd.read_csv("nmr_data.csv")
X = df.iloc[:, 2:]          # spectral matrix
y = df["Group"]             # binary class label
ppm = X.columns.astype(float).tolist()

# 2. PCA - exploratory overview
pca_mod = pca(X=X, label=y, features_name=ppm, n_components=3)
pca_mod.fit()
pca_mod.plot_pca_scores(pc=["PC1", "PC2"])

# 3. OPLS-DA - supervised discrimination
model = opls_da(X=X, y=y, features_name=ppm, scale="uv", auto_ncomp=True)
model.fit()
model.plot_oplsda_scores()
model.vip_scores()
model.vip_plot(threshold=2)

# 4. STOCSY - correlation from a VIP anchor peak
stocsy = STOCSY(X=X, ppm=ppm)
stocsy.fit(driver=3.05)
stocsy.plot()

Workflow Overview

Raw Bruker FID
      │
      ▼
nmr_preprocessing          # digital filter removal, zero-fill, FFT,
      │                    # ACME phase correction, arPLS baseline correction,
      │                    # TSP/DSS calibration, region exclusion
      ▼
Normalization              # PQN / MSC / TSP-area
      │
      ▼
icoshift_align             # segment-wise spectral alignment
      │
      ▼
pca / lda / plsr / ica / hca    # exploratory and supervised multivariate
      │
      ▼
opls_da                    # supervised binary classification
      │                    # VIP scoring, permutation test, CV-ANOVA
      ▼
VolcanoPlot / ANOVAStats / KruskalStats   # group comparison statistics
      │
      ▼
MLClassifier               # RF / SVM / XGBoost / ElasticNet with CV
SpectralAutoencoder        # PyTorch deep-learning models
      │
      ▼
STOCSY / ChunkedSTOCSY     # structural correlation from anchor peaks
      │
      ▼
Interactive Dash apps      # local browser exploration, no data upload

NMR Preprocessing

from metbit import nmr_preprocessing

# Process a folder of Bruker FID experiments
proc = nmr_preprocessing(
    path="path/to/bruker/experiments",
    ppm_range=(-0.5, 10.0),
    water_region=(4.7, 4.9),
    reference_peak=0.0,   # TSP/DSS at 0 ppm
)
spectra_df = proc.process()   # returns samples x ppm DataFrame

The preprocessing pipeline applies, in order:

Step	Method
Digital filter removal	FID truncation at group delay
Zero filling	Next power of two
Fourier transform	FFT with apodization
Phase correction	ACME entropy minimization (Chen et al., 2002)
Baseline correction	arPLS (Baek et al., 2015)
Chemical shift calibration	TSP/DSS reference at 0 ppm
Region exclusion	Water, solvent, user-defined windows

Normalization and Alignment

from metbit import Normalise, icoshift_align

# Probabilistic Quotient Normalization
normalised = Normalise(spectra_df).pqn()

# Multiplicative Scatter Correction
normalised = Normalise(spectra_df).msc()

# Icoshift spectral alignment
aligned = icoshift_align(normalised, n_intervals="whole")

Principal Component Analysis

from metbit import pca

pca_mod = pca(X=X, label=y, features_name=ppm, n_components=3)
pca_mod.fit()

pca_mod.plot_cumulative_observed()
pca_mod.plot_pca_scores(pc=["PC1", "PC2"], symbol_=symbol_)
pca_mod.plot_3d_pca(marker_size=10, symbol_=symbol_)
pca_mod.plot_pca_trajectory(time_=symbol_, time_order=time_order, pc=["PC1", "PC2"])

OPLS-DA

from metbit import opls_da

model = opls_da(
    X=X,
    y=y,
    features_name=ppm,
    scale="uv",          # unit-variance scaling
    auto_ncomp=True,     # automatic component selection via cross-validation
)
model.fit()

Scores and loadings

model.plot_oplsda_scores()   # predictive vs orthogonal scores
model.plot_loading()         # loading plot (color by correlation)
model.plot_s_scores()        # S-plot (covariance vs correlation)

Model validation

model.permutation_test(n_permutations=1000, n_jobs=-1)
model.plot_hist()            # R2 / Q2 null distribution

model.vip_scores()
model.vip_plot(threshold=2)

Model metrics

print(model.R2Y)    # variance explained in Y
print(model.Q2)     # cross-validated predictive ability
print(model.AUROC)  # area under the ROC curve

Extended Multivariate Analysis

v9.1.0 adds four additional multivariate methods behind the same interface.

from metbit import lda, plsr, ica, hca

# Linear Discriminant Analysis
lda_mod = lda(X=X, y=y, features_name=ppm)
lda_mod.fit()
lda_mod.plot_scores()

# Partial Least Squares Regression
plsr_mod = plsr(X=X, y=y_continuous, features_name=ppm, n_components=5)
plsr_mod.fit()
plsr_mod.plot_scores()

# Independent Component Analysis
ica_mod = ica(X=X, features_name=ppm, n_components=5)
ica_mod.fit()
ica_mod.plot_components()

# Hierarchical Cluster Analysis
hca_mod = hca(X=X, label=y, features_name=ppm)
hca_mod.fit()
hca_mod.plot_dendrogram()

Statistical Testing

v9.1.0 adds volcano-plot and multi-group statistical tests directly to the public API.

from metbit import VolcanoPlot, ANOVAStats, KruskalStats

# Volcano plot - fold change vs significance
vp = VolcanoPlot(X=X, y=y, ppm=ppm)
vp.fit()
vp.plot(fc_threshold=1.0, p_threshold=0.05)

# One-way ANOVA across groups
anova = ANOVAStats(X=X, y=y, features_name=ppm)
anova.fit()
print(anova.summary())

# Non-parametric Kruskal-Wallis (use when normality is not assured)
kruskal = KruskalStats(X=X, y=y, features_name=ppm)
kruskal.fit()
print(kruskal.summary())

Machine Learning Classifiers

MLClassifier wraps four model families behind a single sklearn-compatible interface with built-in cross-validation, feature importance, and Plotly visualizations.

pip install metbit[ml]   # adds XGBoost support

from metbit import MLClassifier

# Random Forest (default)
clf = MLClassifier(X=X, y=y, model="rf")
clf.fit(cv=5)

# Support Vector Machine
clf = MLClassifier(X=X, y=y, model="svm")
clf.fit(cv=5)

# XGBoost (requires metbit[ml])
clf = MLClassifier(X=X, y=y, model="xgb")
clf.fit(cv=5)

# ElasticNet logistic regression
clf = MLClassifier(X=X, y=y, model="elasticnet")
clf.fit(cv=5)

Metrics and plots

print(clf.cv_results_)          # accuracy, balanced_accuracy, roc_auc

clf.plot_feature_importance()   # top features by permutation importance
clf.plot_confusion_matrix()     # confusion matrix heatmap
clf.plot_roc()                  # per-class ROC curves with AUC

# Predict on new data
labels = clf.predict(X_new)
proba  = clf.predict_proba(X_new)

# Top N important features as a DataFrame
top_features = clf.get_feature_importance(top_n=20)

Model	`model=`	Notes
Random Forest	`"rf"`	No extra install needed
SVM	`"svm"`	No extra install needed
XGBoost	`"xgb"`	Requires `metbit[ml]`
ElasticNet	`"elasticnet"`	No extra install needed

Deep Learning Models

Three PyTorch-based models for spectral classification and representation learning. Requires metbit[dl].

pip install metbit[dl]

from metbit import SpectralAutoencoder, SpectralMLP, SpectralCNN

# Autoencoder - unsupervised representation learning
ae = SpectralAutoencoder(input_dim=X.shape[1], latent_dim=32)
ae.fit(X, epochs=100)
latent = ae.transform(X)          # compressed representation
reconstructed = ae.reconstruct(X)

# MLP classifier
mlp = SpectralMLP(input_dim=X.shape[1], n_classes=len(y.unique()))
mlp.fit(X, y, epochs=50)
labels = mlp.predict(X_new)
proba  = mlp.predict_proba(X_new)

# 1-D CNN classifier
cnn = SpectralCNN(input_dim=X.shape[1], n_classes=len(y.unique()))
cnn.fit(X, y, epochs=50)
labels = cnn.predict(X_new)

Train/Test and Cross-Validation

from metbit import TrainTestSplit, CrossValidator, available_cv_strategies

# Simple stratified train/test split
splitter = TrainTestSplit(X=X, y=y, test_size=0.2, stratify=True, random_state=42)
X_train, X_test, y_train, y_test = splitter.split()
splitter.plot_split()           # bar chart of class distribution in each split
print(splitter.get_summary())   # DataFrame: class, train_n, test_n, train_pct, test_pct

# Cross-validation with any sklearn-compatible estimator
print(available_cv_strategies())   # ["kfold", "stratified", "loo", "repeatedstratified"]

from sklearn.ensemble import RandomForestClassifier
cv = CrossValidator(
    X=X, y=y,
    estimator=RandomForestClassifier(n_estimators=100),
    strategy="stratified",
    n_splits=5,
)
cv.fit()
print(cv.get_scores())   # DataFrame with per-fold accuracy, balanced_accuracy, roc_auc

Spectra Visualization

SpectraPlot renders NMR spectra as interactive Plotly figures with overlay, mean±SD, stacked, and single-sample views.

from metbit import SpectraPlot

sp = SpectraPlot(X=spectra_df, ppm=ppm, label=y)

sp.overlay()      # all spectra overlaid, colored by group
sp.mean_sd()      # group mean with shaded ±1 SD ribbon
sp.stacked()      # vertically stacked spectra
sp.single(idx=0)  # single sample spectrum

STOCSY

Statistical Total Correlation Spectroscopy identifies structurally related resonances by correlating every spectral variable against an anchor peak.

from metbit import STOCSY

stocsy = STOCSY(X=spectra_df, ppm=ppm)
stocsy.fit(driver=3.05)   # anchor ppm - typically a high-VIP peak
stocsy.plot()             # interactive spectrum colored by Pearson r

ChunkedSTOCSY - for large feature counts

from metbit import ChunkedSTOCSY

cs = ChunkedSTOCSY(X=spectra_df, ppm=ppm, chunk_size=10_000)
cs.fit(driver=3.05)
cs.plot()

print(ChunkedSTOCSY.active_backend())

Large-Scale Compute Backend

The four-tier auto-dispatch backend is inherited by all v9.1.0 modules including MLClassifier, SpectralAutoencoder, and the new multivariate methods.

GPU (CuPy / PyTorch CUDA)     >500 M elements, GPU memory fits
        ↓
C + OpenMP                     >10 M elements, C extension compiled
        ↓
Multiprocessing + NumPy        C extension absent, many cores available
        ↓
Chunked NumPy                  universal fallback, bounded memory

import metbit

print(metbit.backend_info())
# {'native_c': True, 'openmp_threads': 8, 'gpu_cupy': False, 'gpu_torch': False,
#  'n_jobs': 8, 'default_chunk': 50000}

print(metbit.native_available())
print(metbit.gpu_available())

Environment overrides

Variable	Effect
`METBIT_DISABLE_NATIVE=1`	Skip the C extension, use NumPy paths
`METBIT_DISABLE_GPU=1`	Skip CuPy and PyTorch GPU backends
`METBIT_N_JOBS=N`	Override worker count for multiprocessing
`METBIT_CHUNK=N`	Override feature chunk size (default 50,000)

Memory estimation

from metbit import MemoryEstimator, memory_report

est = MemoryEstimator()
info = est.estimate(n_samples=10_000, n_features=65_536, dtype="float64")
print(info)
# {'peak_gb': 4.88, 'recommended_dtype': 'float32', 'float32_peak_gb': 2.44}

memory_report(X)

Interactive Dash Applications

from metbit.apps import stocsy_app, opls_app, pca_app, spectra_app

stocsy_app(X=spectra_df, ppm=ppm, port=8050)
opls_app(model=model, port=8051)

All apps run locally - no data is uploaded to external servers.

Performance

Benchmarks measured on Apple M5 Pro, single-threaded C extension, no GPU (minimum of 5 wall-clock repetitions, one warm-up discarded).

Kernel	Dataset (n x p)	Implementation	Speedup	Memory reduction
VIP scores	80 x 5,000	C vs Python loop	2,680x	-
VIP scores	80 x 5,000	NumPy vec. vs Python loop	817x	-
Pearson correlation	500 x 30,000	C vs full-copy NumPy	1.2x	126x
Column variance	500 x 100,000	C float32 vs NumPy	2.0x	251x
ChunkedSTOCSY	100 x 5,000	Chunked vs standard	47x	bounded

python scripts/perf_report.py

Testing

# Run full suite
pytest

# Run without slow/performance tests (faster)
pytest -m "not slow and not perf"

# Run performance benchmarks only
pytest -m perf

Links

Documentation: https://metbit-docs.vercel.app
Getting Started: https://metbit-docs.vercel.app/docs/getting-started
API Reference: https://metbit-docs.vercel.app/docs/api
PyPI: https://pypi.org/project/metbit/
Source: https://github.com/aeiwz/metbit
Changelog: CHANGELOG.md

License

MIT License. See LICENSE for details.

Name		Name	Last commit message	Last commit date
Latest commit History 603 Commits
.github		.github
.leankg		.leankg
docs		docs
figures		figures
manuscript		manuscript
metbit		metbit
scripts		scripts
sources		sources
src		src
tests		tests
.coveragerc		.coveragerc
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
Dockerfile.test		Dockerfile.test
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
README.rst		README.rst
SECURITY.md		SECURITY.md
docker-test.sh		docker-test.sh
metbit.code-workspace		metbit.code-workspace
package-lock.json		package-lock.json
paper.bib		paper.bib
paper.md		paper.md
pyproject.toml		pyproject.toml
pytest.ini		pytest.ini
requirements-dev.txt		requirements-dev.txt
setup.cfg		setup.cfg
setup.py		setup.py

Folders and files

Latest commit

History

Repository files navigation

metbit

Contents

Installation

Quick Start

Workflow Overview

NMR Preprocessing

Normalization and Alignment

Principal Component Analysis

OPLS-DA

Scores and loadings

Model validation

Model metrics

Extended Multivariate Analysis

Statistical Testing

Machine Learning Classifiers

Metrics and plots

Deep Learning Models

Train/Test and Cross-Validation

Spectra Visualization

STOCSY

ChunkedSTOCSY - for large feature counts

Large-Scale Compute Backend

Environment overrides

Memory estimation

Interactive Dash Applications

Performance

Testing

Links

License

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 218

Packages 0

Uh oh!

Uh oh!

Contributors

Uh oh!

Languages

Packages