
sync : ggml #12732

Merged: 5 commits merged into master from sync-ggml-25-04-03 on Apr 7, 2025
Conversation

ggerganov (Member)

No description provided.

github-actions bot added the script and ggml labels on Apr 3, 2025
ggerganov (Member, Author)

@cmdr2 Could you take a look at the arm build failures?

cmdr2 (Contributor) commented Apr 3, 2025

@ggerganov Sure, taking a look

cmdr2 (Contributor) commented Apr 3, 2025

tl;dr - Maybe a difference in the strictness of the C++ compiler (vs compiling the C file)?

Interestingly, this isn't new behavior. Previous CI runs for this runner also raised this (as a warning), i.e. without this PR's changes - https://github.com/ggml-org/llama.cpp/actions/runs/14237462102/job/39899497124#step:9:87

src/ggml-cpu/ggml-cpu.c:1586:21: warning: incompatible pointer types assigning to 'const __fp16 *' from 'const ggml_fp16_internal_t *' (aka 'const unsigned short *') [-Wincompatible-pointer-types]
 1586 |             ax[j] = GGML_F16_VEC_LOAD(x + i + j*GGML_F16_EPR, j);

But the CI runner now raises this as an error rather than just a warning, now that these lines are part of C++.

src/ggml-cpu\vec.h:118:25: error: incompatible pointer types assigning to 'const __fp16 *' from 'const ggml_fp16_internal_t *' (aka 'const unsigned short *')
  118 |                 ax[j] = GGML_F16_VEC_LOAD(x[k] + i + j*GGML_F16_EPR, j);

Maybe a difference in the strictness of the C++ compiler (vs compiling the C file)?
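
For context, a minimal illustration (stand-in types, not the actual ggml code) of that strictness difference: C permits assignments between incompatible pointer types with only a diagnostic, while C++ rejects them outright, so the same line that was merely a warning when compiled as part of ggml-cpu.c becomes a hard error once it is compiled as C++.

// Minimal illustration of the C vs. C++ behavior (types are stand-ins).
void example(void) {
    unsigned short bits = 0;
    const unsigned short *src = &bits;       // stand-in for const ggml_fp16_internal_t *
    // const float *dst = src;               // C: -Wincompatible-pointer-types warning
                                             // C++: hard error, no implicit conversion
    const float *dst = (const float *)src;   // explicit cast: accepted by both
    (void)dst;
}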

Continuing to investigate.

cmdr2 (Contributor) commented Apr 3, 2025

tl;dr - This warning is justified, and worth fixing anyway. Rather than trying to coerce the compiler into letting this through, it's probably worth investigating this (erstwhile) warning.

Details:
On MSVC (or LLVM with MSVC), the code defines ggml_fp16_internal_t as uint16_t (unsigned short) -

#include <arm_neon.h>
#ifdef _MSC_VER
typedef uint16_t ggml_fp16_internal_t;
#define ggml_vld1q_u32(w,x,y,z) { ((w) + ((uint64_t)(x) << 32)), ((y) + ((uint64_t)(z) << 32)) }
#else
typedef __fp16 ggml_fp16_internal_t;
#define ggml_vld1q_u32(w,x,y,z) { (w), (x), (y), (z) }
#endif // _MSC_VER

It very clearly DOES NOT define ggml_fp16_internal_t as __fp16.

So I'm not sure where __fp16 comes into the picture (as mentioned in the error). Does the arm_neon.h implementation actually require __fp16 as the arg type? The stack trace below is interesting:

D:/a/llama.cpp/llama.cpp/ggml/src/ggml-cpu\vec.h:118:25: error: incompatible pointer types assigning to 'const __fp16 *' from 'const ggml_fp16_internal_t *' (aka 'const unsigned short *')
  118 |                 ax[j] = GGML_F16_VEC_LOAD(x[k] + i + j*GGML_F16_EPR, j);
      |                         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
D:/a/llama.cpp/llama.cpp/ggml/src/ggml-cpu\simd-mappings.h:127:41: note: expanded from macro 'GGML_F16_VEC_LOAD'
  127 |     #define GGML_F16_VEC_LOAD(p, i)     GGML_F32Cx4_LOAD(p)
      |                                         ^~~~~~~~~~~~~~~~~~~
D:/a/llama.cpp/llama.cpp/ggml/src/ggml-cpu\simd-mappings.h:117:60: note: expanded from macro 'GGML_F32Cx4_LOAD'
  117 |     #define GGML_F32Cx4_LOAD(x)      vcvt_f32_f16(vld1_f16((const ggml_fp16_internal_t *)(x)))
      |                                                   ~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
C:\Program Files\LLVM\lib\clang\18\include\arm_neon.h:39354:47: note: expanded from macro 'vld1_f16'
 39354 |   __ret = (float16x4_t) __builtin_neon_vld1_v(__p0, 8); \
       |                                               ^~~~

Will look at the arm header file. Continuing to investigate..

cmdr2 (Contributor) commented Apr 3, 2025

Interesting, I see a note in our code:

// 16-bit float
// on Arm, we use __fp16
// on x86, we use uint16_t

But we're not running on x86, yet we're using uint16_t, simply because the current implementation checks for MSVC and defines it as uint16_t.

@ggerganov @slaren Shouldn't we also explicitly check for x86, along with the msvc check? Before defining ggml_fp16_internal_t as uint16_t?

cmdr2 (Contributor) commented Apr 3, 2025

I'm a bit confused now. Arm neon isn't x86. So why does it set ggml_fp16_internal_t to uint16_t inside the #if defined(__ARM_NEON) block, if it is MSVC?

#if defined(__ARM_NEON)
// if YCM cannot find <arm_neon.h>, make a symbolic link to it, for example:
//
// $ ln -sfn /Library/Developer/CommandLineTools/usr/lib/clang/13.1.6/include/arm_neon.h ./src/
//
#include <arm_neon.h>
#ifdef _MSC_VER
typedef uint16_t ggml_fp16_internal_t;
#define ggml_vld1q_u32(w,x,y,z) { ((w) + ((uint64_t)(x) << 32)), ((y) + ((uint64_t)(z) << 32)) }
#else
typedef __fp16 ggml_fp16_internal_t;

We're also doing this in ggml-impl.h:

#if defined(__ARM_NEON)
#if defined(_MSC_VER) || (defined(__CUDACC__) && __CUDACC_VER_MAJOR__ <= 11)
typedef uint16_t ggml_fp16_internal_t;
#else
typedef __fp16 ggml_fp16_internal_t;
#endif
#endif

ggerganov (Member, Author)

> I'm a bit confused now. Arm neon isn't x86. So why does it set ggml_fp16_internal_t to uint16_t inside the #if defined(__ARM_NEON) block, if it is MSVC?

I'm not really sure either. I think the history of this is #5404 and the references therein.

It appears that MSVC on Arm64 had some issues (maybe it still has them) and this required the hacky ggml_vld1q_u32 macro, which in turn leads to using uint16_t instead of the native __fp16.
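
A hedged aside on what that macro does (an assumption drawn from the macro definition quoted above, not from the MSVC headers): instead of brace-initializing four 32-bit lanes directly, the MSVC variant packs each pair of 32-bit values into one 64-bit half of the vector, for example:

#include <cstdint>
#include <cstdio>

// Illustration of the lane packing performed by the MSVC ggml_vld1q_u32 variant.
int main() {
    uint32_t w = 1, x = 2, y = 3, z = 4;
    uint64_t lo = (uint64_t)(w) + ((uint64_t)(x) << 32);  // lanes 0 and 1
    uint64_t hi = (uint64_t)(y) + ((uint64_t)(z) << 32);  // lanes 2 and 3
    printf("lo = 0x%016llx, hi = 0x%016llx\n",
           (unsigned long long) lo, (unsigned long long) hi);
    return 0;
}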

cmdr2 (Contributor) commented Apr 4, 2025

Thanks, digging further in that direction..

cmdr2 (Contributor) commented Apr 4, 2025

Submitted a PR for this - ggml-org/ggml#1176

Yeah, it looks like we need to do things in uint16_t for MSVC + Arm NEON. Traced it back to the original PR for ARM64 support on Windows - b52b29a

So maybe the solution is to just tell the compiler that it's okay (by using reinterpret_cast if we're in the C++ compiler)? I've proposed that change in my PR.
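
A minimal sketch of what such a compiler-dependent cast could look like (hypothetical macro name; the actual change proposed in ggml-org/ggml#1176 may differ):

// Hypothetical helper: keep the plain cast for the C compiler and make the
// reinterpretation explicit when the header is compiled as C++.
#ifdef __cplusplus
    #define GGML_CAST_FP16_PTR(p) reinterpret_cast<const __fp16 *>(p)
#else
    #define GGML_CAST_FP16_PTR(p) ((const __fp16 *)(p))
#endif

// The load macro could then route through the helper (illustrative only):
// #define GGML_F32Cx4_LOAD(x)  vcvt_f32_f16(vld1_f16(GGML_CAST_FP16_PTR(x)))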

I have no idea about ARM or SIMD code, so please feel free to suggest alternatives :) Thanks!

ggerganov force-pushed the sync-ggml-25-04-03 branch 3 times, most recently from 8339981 to 6232cee on April 7, 2025 09:32
github-actions bot added the Nvidia GPU label on Apr 7, 2025
cmdr2 (Contributor) commented Apr 7, 2025

@ggerganov What if we keep the arm header include the same (i.e. if not MUSA), but continue allowing the fp32 conversion function declarations for MUSA?

Basically, bringing back these lines: e638450#diff-1f56ac82eed1293d4aa7c35aef0bc19e831cdb24dcb6af43582143936eb7eae4L19-L25

And removing && !defined(__MUSACC__) from this line (to match what used to happen previously): e638450#diff-1f56ac82eed1293d4aa7c35aef0bc19e831cdb24dcb6af43582143936eb7eae4R313

Thanks

cmdr2 (Contributor) commented Apr 7, 2025

Or maybe that's completely unrelated to the CI failure?

slaren (Member) commented Apr 7, 2025

Yes, it is due to the CUDA BF16 change.

ggerganov (Member, Author)

The MUSA build is simple to fix, but I am not able to fix the HIP build. I think HIP does not support cublasGemmEx with BF16 type.

@JohannesGaessler Do you have suggestions how to fix?

JohannesGaessler (Collaborator)

I pushed a fix for the HIP compilation failure. From what I can tell the problem is that the HIP header does not define __bfloat162float which is used in the CUDA code to do the type conversion. However, both the HIP and CUDA headers define the float() operator which in the case of CUDA is just a wrapper for __bfloat162float. More generally, I think the code can be simplified by just always explicitly casting the source data to float. For float <-> half and float <-> nv_bfloat16 that should be a no-op and for half <-> nv_bfloat16 that is what the code was already doing.
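
A hedged C++ sketch of that simplification (stand-in template; in the real code these are CUDA/HIP device conversions and the types are __half and nv_bfloat16, both of which convert to float):

// Illustrative only: route every conversion through float, relying on the
// float() conversion the half/bfloat16 types provide. float <-> half and
// float <-> bfloat16 need one conversion each; half <-> bfloat16 goes through
// float, which is what the code was already doing.
template <typename dst_t, typename src_t>
dst_t convert_via_float(src_t x) {
    return dst_t(float(x));
}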

I will not be able to test the code on actual AMD hardware for a few more hours.

JohannesGaessler (Collaborator)

Sorry, I missed the error messages about the HIP types. In hipblas.h the compute types are defined as follows:

#define HIPBLAS_R_16F HIP_R_16F
#define HIPBLAS_R_32F HIP_R_32F
#define HIPBLAS_R_64F HIP_R_64F
#define HIPBLAS_C_16F HIP_C_16F
#define HIPBLAS_C_32F HIP_C_32F
#define HIPBLAS_C_64F HIP_C_64F
#define HIPBLAS_R_8I HIP_R_8I
#define HIPBLAS_R_8U HIP_R_8U
#define HIPBLAS_R_32I HIP_R_32I
#define HIPBLAS_R_32U HIP_R_32U
#define HIPBLAS_C_8I HIP_C_8I
#define HIPBLAS_C_8U HIP_C_8U
#define HIPBLAS_C_32I HIP_C_32I
#define HIPBLAS_C_32U HIP_C_32U
#define HIPBLAS_R_16B HIP_R_16BF
#define HIPBLAS_C_16B HIP_C_16BF

It also lists the following types as supported for GEMM:

  |   aType    |   bType    |   cType    |     computeType     |
  | ---------- | ---------- | ---------- | ------------------- |
  | HIP_R_16F  | HIP_R_16F  | HIP_R_16F  | HIPBLAS_COMPUTE_16F |
  | HIP_R_16F  | HIP_R_16F  | HIP_R_16F  | HIPBLAS_COMPUTE_32F |
  | HIP_R_16F  | HIP_R_16F  | HIP_R_32F  | HIPBLAS_COMPUTE_32F |
  | HIP_R_16BF | HIP_R_16BF | HIP_R_16BF | HIPBLAS_COMPUTE_32F |
  | HIP_R_16BF | HIP_R_16BF | HIP_R_32F  | HIPBLAS_COMPUTE_32F |
  | HIP_R_32F  | HIP_R_32F  | HIP_R_32F  | HIPBLAS_COMPUTE_32F |
  | HIP_R_64F  | HIP_R_64F  | HIP_R_64F  | HIPBLAS_COMPUTE_64F |
  | HIP_R_8I   | HIP_R_8I   | HIP_R_32I  | HIPBLAS_COMPUTE_32I |
  | HIP_C_32F  | HIP_C_32F  | HIP_C_32F  | HIPBLAS_COMPUTE_32F |
  | HIP_C_64F  | HIP_C_64F  | HIP_C_64F  | HIPBLAS_COMPUTE_64F |

So it should be possible to use the code for HIP by just changing the vendor headers in ggml.
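
A hedged sketch of the kind of vendor-header mapping being described (the actual file and macro set in ggml may differ): the shared CUDA code keeps using the CUDA_R_* names, and a HIP vendor header maps them onto the hipBLAS spellings, where the bf16 type is HIPBLAS_R_16B rather than HIPBLAS_R_16BF.

// Illustrative mapping (left: names used by the common CUDA code,
// right: names defined by hipblas.h as listed above):
#define CUDA_R_16F   HIPBLAS_R_16F
#define CUDA_R_32F   HIPBLAS_R_32F
#define CUDA_R_16BF  HIPBLAS_R_16B   // note: hipBLAS spells it 16B, not 16BF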

ggerganov (Member, Author)

I had this change, which fixes MUSA: 5ef588b

But HIP still fails like this:

https://github.com/ggml-org/llama.cpp/actions/runs/14306928482/job/40092833687#step:6:128

[ 17%] Building HIP object ggml/src/ggml-hip/CMakeFiles/ggml-hip.dir/__/ggml-cuda/im2col.cu.o
/__w/llama.cpp/llama.cpp/ggml/src/ggml-cuda/ggml-cuda.cu:1217:50: error: use of undeclared identifier 'HIPBLAS_R_16BF'
                    &alpha_f32,  src0_ptr,       CUDA_R_16BF, ne00,

Not sure why the HIPBLAS_R_16BF identifier is not recognized after my change.

JohannesGaessler (Collaborator)

Because they're calling the type HIPBLAS_R_16B instead of HIPBLAS_R_16BF.

ggerganov (Member, Author)

Ok, let me push a fix now.

ggerganov force-pushed the sync-ggml-25-04-03 branch 2 times, most recently from b65ac0b to 709fa72 on April 7, 2025 12:35
cmdr2 and others added 2 commits April 7, 2025 15:38
… (ggml/1167)

* cpu: refactor SIMD mappings and vectorized op functions into separate files

* Fix warning for ggml_float to float

* Fix warnings

* cpu: move all the operations (except mul_mat) to a separate c++ file

* fix whitespace

* Update ggml/src/ggml-cpu/vec.h

Co-authored-by: Diego Devesa <[email protected]>

* Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp

* Reverse the order of import for ops.h and vec.h, to match what was present in ggml-cpu.c previously

---------

Co-authored-by: Diego Devesa <[email protected]>
* add bf16 support

* use convert_from_bf16_cuda instead of convert_unary_cuda for f32

* revert 7ec5085

* move functionality into convert_unary with constexpr
* ggml : simplify Arm fp16 CPU logic

ggml-ci

* cont : bring back CUDA/MUSA checks

ggml-ci
ggml-ci
ggerganov force-pushed the sync-ggml-25-04-03 branch from 709fa72 to 92d7d4d on April 7, 2025 12:39
ggerganov merged commit 1a1ab7e into master on Apr 7, 2025
56 of 59 checks passed
ggerganov deleted the sync-ggml-25-04-03 branch on April 7, 2025 15:44
zhengjun-xing pushed a commit to zhengjun-xing/llama.cpp that referenced this pull request Apr 10, 2025
Fix ggml-org#12732:
* Remove incorrect inclusion of "arm_neon.h" for CUDA versions ≥ 12
zhengjun-xing added a commit to zhengjun-xing/llama.cpp that referenced this pull request Apr 10, 2025
Fix ggml-org#12732:
* Remove incorrect inclusion of "arm_neon.h" for CUDA versions ≥ 12