Implement the following optimized sigmoid kernels for `float32` and `float16` with vectorized versions and PyTorch bindings for improved performance.

- [ ] **`sigmoid_f32_kernel`**: Standard sigmoid function for `float32` data type.
- [ ] **`sigmoid_f32x4_kernel`**: Vectorized sigmoid for `float32`, processing 4 elements at a time (`float4`).
- [ ] **`sigmoid_f16_kernel`**: Standard sigmoid function for `float16` (half-precision).
- [ ] **`sigmoid_f16x2_kernel`**: Vectorized sigmoid for `float16`, processing 2 elements at a time (`half2`).
- [ ] **`sigmoid_f16x8_kernel`**: Unpacked version of `float16`, processing 8 elements in parallel.
- [ ] **`sigmoid_f16x8_pack_kernel`**: Packed version of `sigmoid_f16x8_kernel` for efficient memory access.
- [ ] **PyTorch bindings**: Expose the above kernels through PyTorch.
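As a starting point, the first two items could look roughly like the sketch below. It is not a reference implementation: the kernel signatures, the launch configuration, and the host helper `sigmoid_f32x4_max_abs_error` are assumptions made for illustration. The vectorized kernel also assumes that `n` is a multiple of 4 and that both pointers are 16-byte aligned. CUDA error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <algorithm>
#include <cmath>
#include <vector>

// Scalar kernel: one thread per element, y = 1 / (1 + exp(-x)).
__global__ void sigmoid_f32_kernel(const float* x, float* y, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) {
    y[idx] = 1.0f / (1.0f + expf(-x[idx]));
  }
}

// Vectorized kernel: each thread loads and stores one float4 (4 elements).
// Assumption: n % 4 == 0 and x, y are 16-byte aligned; a tail that does not
// fill a whole float4 is NOT processed by this sketch.
__global__ void sigmoid_f32x4_kernel(const float* x, float* y, int n) {
  int idx = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
  if (idx + 3 < n) {
    float4 v = *reinterpret_cast<const float4*>(x + idx);
    float4 r;
    r.x = 1.0f / (1.0f + expf(-v.x));
    r.y = 1.0f / (1.0f + expf(-v.y));
    r.z = 1.0f / (1.0f + expf(-v.z));
    r.w = 1.0f / (1.0f + expf(-v.w));
    *reinterpret_cast<float4*>(y + idx) = r;
  }
}

// Hypothetical host helper (not part of the task list): runs the float4
// kernel on n values spread over [-8, 8) and returns the largest absolute
// difference from a CPU reference.
float sigmoid_f32x4_max_abs_error(int n) {
  std::vector<float> h_x(n), h_y(n);
  for (int i = 0; i < n; ++i) {
    h_x[i] = -8.0f + 16.0f * i / n;
  }

  float* d_x = nullptr;
  float* d_y = nullptr;
  cudaMalloc(&d_x, n * sizeof(float));
  cudaMalloc(&d_y, n * sizeof(float));
  cudaMemcpy(d_x, h_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  const int threads = 256;                             // threads per block
  const int blocks = (n / 4 + threads - 1) / threads;  // one float4 per thread
  sigmoid_f32x4_kernel<<<blocks, threads>>>(d_x, d_y, n);

  // cudaMemcpy on the default stream waits for the kernel to finish.
  cudaMemcpy(h_y.data(), d_y, n * sizeof(float), cudaMemcpyDeviceToHost);
  cudaFree(d_x);
  cudaFree(d_y);

  float max_err = 0.0f;
  for (int i = 0; i < n; ++i) {
    float ref = 1.0f / (1.0f + std::exp(-h_x[i]));
    max_err = std::max(max_err, std::fabs(h_y[i] - ref));
  }
  return max_err;
}
```

The `float16` variants would presumably follow the same pattern with `half`/`half2` from `cuda_fp16.h`, and the packed `f16x8` kernel with a single 128-bit load and store per thread, but those details are left open here.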