Replies: 2 comments
Thanks for sharing this. I firmly believe that the machine learning model is the heart of AudioMuse-AI, and improving it is always among the top priorities. Googling around I found the GLAP model here: and it is 3.4 GB. To give you some data, LAION CLAP is around 700 MB and is already pretty heavy for an average homelabber (on average, people run it on a CPU, most of the time an old one, and have 50k songs in their collection). To give some numbers: on a Raspberry Pi 5 (probably the slowest supported hardware) CLAP inference takes around 50-60 seconds per song. Moving to a model that is 5x bigger would probably cut off all CPU-only users, and it would run in a reasonable amount of time only for users with a good GPU. I believe that this is not good for AudioMuse-AI. I'll still keep an eye open for new models to use, but they need to be compact enough to match homelab use.
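As a back-of-envelope illustration of the numbers above (these are rough figures from the comment, and the assumption that inference time scales roughly with model size is mine, not a benchmark):

```python
# Rough estimate of total indexing time for a typical homelab library.
# Assumed inputs (from the comment above): ~55 s/song with CLAP on a
# Raspberry Pi 5 and a 50k-song collection. The linear size-to-time
# scaling for GLAP is a crude assumption, not a measured result.
SECONDS_PER_SONG_CLAP = 55
LIBRARY_SIZE = 50_000
SIZE_RATIO = 3.4 / 0.7  # GLAP (3.4 GB) vs LAION-CLAP (~0.7 GB), ~4.9x

clap_days = SECONDS_PER_SONG_CLAP * LIBRARY_SIZE / 86_400
glap_days = clap_days * SIZE_RATIO

print(f"CLAP:  ~{clap_days:.0f} days")  # ~32 days of continuous CPU inference
print(f"GLAP:  ~{glap_days:.0f} days")  # ~155 days under the linear assumption
```

Even if the scaling is better than linear in practice, a full re-index on Pi-class hardware moves from "slow but feasible" to effectively impossible.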
I think what you're saying makes sense; it's good that as many users as possible can use it.
(I asked an AI to write this; I have no technical knowledge)
Summary
Currently, AudioMuse-AI uses LAION-CLAP for text search, which works well in English but has limited support for other languages (Spanish, French, German, etc.). This makes the text search feature much less useful for non-English-speaking users.
Proposed improvement
GLAP (General Language-Audio Pretraining) is a recently published model (June 2025) that extends CLAP with multilingual capabilities — evaluated on 50+ languages — while maintaining the same music retrieval performance as CLAP.
📄 Paper: https://arxiv.org/abs/2506.11350
💻 Code + checkpoints: https://github.com/xiaomi-research/dasheng-glap (public, Apache 2.0)
Why it fits AudioMuse-AI
Drop-in replacement for the current CLAP model — same architecture, same usage pattern
No performance regression on music tasks
Would unlock text search in Spanish, French, Italian, German, Chinese, Russian and more
Checkpoints are already public on HuggingFace, no extra training needed
Use case example
A Spanish-speaking user could search:
"música melancólica con piano" instead of "melancholic piano music"
"jazz tranquilo para trabajar" instead of "calm jazz for working"
Request
Would it be feasible to evaluate GLAP as an alternative (or selectable) model for the text search feature? Even an optional `TEXT_MODEL=glap` environment variable would be a great addition.
Thanks for the great work on this project! 🎵