Replies: 2 comments
Thanks for sharing this. I firmly believe that the machine learning model is the heart of AudioMuse-AI, and improving it is always among the top priorities. Googling around I found the GLAP model here: and it is 3.4 GB. To give you some data, LAION CLAP is around 700 MB and is already pretty heavy for an average homelabber (on average, people run it on a CPU, most of the time an old one, and have 50k songs in their collection). To give some numbers: on a Raspberry Pi 5 (probably the slowest supported hardware) CLAP inference takes around 50-60 seconds per song. Moving to a model that is 5x bigger would probably cut off all CPU-only users, and it would run in a reasonable amount of time only for users with a good GPU. I believe that this is not good for AudioMuse-AI. I'll still keep an eye open for new models to use, but they need to be compact enough to match homelab use.
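As a back-of-envelope illustration of the numbers above (these are rough figures from the comment, and the assumption that inference time scales roughly with model size is mine, not a benchmark):

```python
# Rough estimate of total indexing time for a typical homelab library.
# Assumed inputs (from the comment above): ~55 s/song with CLAP on a
# Raspberry Pi 5 and a 50k-song collection. The linear size-to-time
# scaling for GLAP is a crude assumption, not a measured result.
SECONDS_PER_SONG_CLAP = 55
LIBRARY_SIZE = 50_000
SIZE_RATIO = 3.4 / 0.7  # GLAP (3.4 GB) vs LAION-CLAP (~0.7 GB), ~4.9x

clap_days = SECONDS_PER_SONG_CLAP * LIBRARY_SIZE / 86_400
glap_days = clap_days * SIZE_RATIO

print(f"CLAP:  ~{clap_days:.0f} days")  # ~32 days of continuous CPU inference
print(f"GLAP:  ~{glap_days:.0f} days")  # ~155 days under the linear assumption
```

Even if the scaling is better than linear in practice, a full re-index on Pi-class hardware moves from "slow but feasible" to effectively impossible.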
I think what you're saying makes sense; it's good that as many users as possible can use it.
(I asked an AI to write this; I have no technical knowledge)
Summary
Currently, AudioMuse-AI uses LAION-CLAP for text search, which works well in English but has limited support for other languages (Spanish, French, German, etc.). This makes the text search feature much less useful for non-English-speaking users.
Proposed improvement
GLAP (General Language-Audio Pretraining) is a recently published model (June 2025) that extends CLAP with multilingual capabilities — evaluated on 50+ languages — while maintaining the same music retrieval performance as CLAP.
📄 Paper: https://arxiv.org/abs/2506.11350
💻 Code + checkpoints: https://github.com/xiaomi-research/dasheng-glap (public, Apache 2.0)
Why it fits AudioMuse-AI
Drop-in replacement for the current CLAP model — same architecture, same usage pattern
No performance regression on music tasks
Would unlock text search in Spanish, French, Italian, German, Chinese, Russian and more
Checkpoints are already public on HuggingFace, no extra training needed
Use case example
A Spanish-speaking user could search:
"música melancólica con piano" instead of "melancholic piano music"
"jazz tranquilo para trabajar" instead of "calm jazz for working"
Request
Would it be feasible to evaluate GLAP as an alternative (or selectable) model for the text search feature? Even an optional `TEXT_MODEL=glap` environment variable would be a great addition.
Thanks for the great work on this project! 🎵