Serialize Executables crashing when compiling LLaMa on async-cpu #17244
It appears the issue is in
The unrolling is needed because the LLVM backend wants 1D vectors. The issue could be in tile size selection, and vector shape optimization can potentially help with it.
Some additional guilty lines:
If we defer the unrolling of the
Okay, so this is similar to what I'm seeing in #17226 (comment). IMO, we should not fuse these two generic ops; TileAndFuse is basically broken for this case. There are no dependencies captured by the operands. I'll talk to Mahesh to see if we can disable such fusion.
@pashu123 please take a look and check whether there are other issues apart from the fusion issue.
Do we have a workaround for this or any patches we could try? I'm also seeing unusably slow behavior after running
Perhaps you can try llvm/torch-mlir#3277. It should fix the embedding lookup issue at the torch level.
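For anyone else hitting this, here is a rough sketch of how that unmerged PR could be pulled into a local torch-mlir checkout as a temporary workaround. The checkout layout and local branch name are assumptions, not a documented workflow:

```shell
# Assumption: torch-mlir is available locally (e.g. as a submodule or a sibling repo).
cd torch-mlir

# Fetch the PR branch for llvm/torch-mlir#3277 into a local branch named "pr-3277",
# then apply it on top of the current checkout.
git fetch https://github.com/llvm/torch-mlir.git pull/3277/head:pr-3277
git cherry-pick pr-3277

# Rebuild whatever consumes torch-mlir afterwards so the fix is actually picked up.
```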
That gets further, yeah :D. Might be enough to call this particular issue fixed? I do see another error pretty late in compilation: https://gist.github.com/ScottTodd/6fbe7edd118bbb53c0abc2582459158d
There is an action item at the Linalg level: #17226 (comment)
@ScottTodd can you provide the MLIR file? @pashu123 please help triage and provide possible solutions.
This is the input file I'm working with: https://sharkpublic.blob.core.windows.net/sharkpublic/scotttodd/issue_reports/open_llama_3b_v2_f16.mlir
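For context, a minimal sketch of the kind of compile invocation involved here; the exact flag set is an assumption on my part and may differ from what was actually used in this thread:

```shell
# Hypothetical reproduction sketch: compile the torch-level MLIR for the local CPU target.
# The input path and flags below are assumptions, not the exact command from this issue.
iree-compile \
  --iree-input-type=torch \
  --iree-hal-target-backends=llvm-cpu \
  open_llama_3b_v2_f16.mlir \
  -o open_llama_3b_v2_f16.vmfb
```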
@ScottTodd I think you should add
Verified that adding this generates a .vmfb.
After thinking about it for a while, I think we can close this issue. The action item I mentioned is tracked in the other issue, and we don't have action items for this one now.
This issue is blocking another model on the
Yes, the model is RAFT_vaiq_int8. I added some information to issue #17226.
I think we only need to track it in one of the issues, so we can close either this one or the other.
Yeah, that's why I opted to provide more information there. I can't close this issue because I don't have permissions.
Wherever the issue is tracked, can we follow up and get fixes or patches landed? I've needed to keep patching llvm/torch-mlir#3277 locally as a workaround for compilation crashes for several weeks now.
Actually, I think we already landed a more robust fix: 748db31 has landed, and it should create valid IR for the codegen input. @ScottTodd could you verify whether the commit fixes the issue?
Thanks! I haven't seen issues lately without the torch-mlir commit, but I'm still juggling a few other flags (notably
After circling back from #17341 (comment), I think we need that torch-mlir patch. I will address the comments there.
The following dispatches appear to cause a crash when compiling a LLaMa model. Unrolling / vectorization produces 20K+ lines of generated code, which likely causes the final LLVM compilation to fail completely.
module_prefill_bs4$async_dispatch_1.zip
module_decode_bs4$async_dispatch_2.zip
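For reference, a sketch of how per-dispatch sources like the attached ones can be dumped for triage; the flag names here are my assumption of the relevant IREE options, so double-check them against the current `iree-compile --help` output:

```shell
# Assumed triage workflow: dump each HAL executable's source MLIR to a directory
# while compiling, so individual dispatches (like the attached ones) can be inspected
# and re-compiled in isolation to narrow down the crash.
iree-compile \
  --iree-input-type=torch \
  --iree-hal-target-backends=llvm-cpu \
  --iree-hal-dump-executable-sources-to=/tmp/dispatches \
  open_llama_3b_v2_f16.mlir \
  -o /tmp/out.vmfb
```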