I am trying out transformers.js with WebGPU. The performance is great, but I found that transformers.js returns a Float32Array even when the model is quantized to fp16:
```js
import { pipeline } from "@huggingface/transformers";

const extractor = await pipeline(
  "feature-extraction",
  "bge-small-zh-v1.5",
  {
    device: "webgpu",
    dtype: "fp16",
    local_files_only: true,
  },
);
// ...
const embeddings = await extractor(texts, { pooling: "mean", normalize: true });
console.log(embeddings.data);
// -> Float32Array(5120000) [...]
```
Since the model itself only has 16-bit precision, returning a Float32Array (instead of the Float16Array now supported in recent browsers) seems like a waste of memory and bandwidth. Is this assessment correct, and are there plans to support Float16Array output for better performance? Thanks!
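For context, this is the kind of downcast I currently do on the returned buffer as a workaround. It assumes the browser ships the new Float16Array typed array (available in recent browser versions); it only converts the output after the fact and does not change what transformers.js computes or transfers internally:

```js
const data32 = embeddings.data; // Float32Array returned by transformers.js

if (typeof Float16Array !== "undefined") {
  // Element-wise downcast: halves the memory footprint, at the cost of one extra copy.
  const data16 = new Float16Array(data32);
  console.log(data16.byteLength, "bytes vs", data32.byteLength, "bytes");
}
```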