
Does libdnn support Mali GPU? #17

Open
295988101 opened this issue Sep 21, 2016 · 4 comments

Comments

@295988101

I use caffe-opencl with a Mali GPU, but it seems that libdnn does not support Mali.
Actually, I want to do some optimization in the OpenCL kernels for operations such as element-wise multiplication. You have done some memory optimization in the OpenCL kernels of libdnn, but as far as I know, OpenCL memory on Mali just uses CL_MEM_ALLOC_HOST_PTR for CPU data.
Would you tell me the method libdnn uses for memory optimization, or point me to some resources about this?

thank you
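
For context, a minimal host-side sketch of the CL_MEM_ALLOC_HOST_PTR zero-copy pattern mentioned above, assuming a unified-memory Mali device; the helper name make_zero_copy_buffer is made up for illustration, it is not libdnn code, and error handling is omitted:

```c
#include <CL/cl.h>
#include <string.h>

/* Sketch: allocate a driver-managed buffer and map it for the CPU to fill,
   instead of copying with clEnqueueWriteBuffer (assumed zero-copy pattern). */
cl_mem make_zero_copy_buffer(cl_context ctx, cl_command_queue queue,
                             const float *src, size_t n) {
  cl_int err = CL_SUCCESS;
  const size_t bytes = n * sizeof(float);

  /* Let the driver allocate memory that both the CPU and the Mali GPU can
     see (they share the same DRAM), rather than a separate device copy. */
  cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_ALLOC_HOST_PTR,
                              bytes, NULL, &err);

  /* Map, fill on the CPU, unmap: no extra copy to device memory is made. */
  void *ptr = clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                 0, bytes, 0, NULL, NULL, &err);
  memcpy(ptr, src, bytes);
  clEnqueueUnmapMemObject(queue, buf, ptr, 0, NULL, NULL);
  return buf;
}
```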

@naibaf7
Owner

naibaf7 commented Sep 21, 2016

@zhenghuitian
I suggest you read those two excellent articles first:

Other than that, you mainly have to find out the required FLOPS per global memory read/write to fully occupy the chip, as well as memory reading/writing strides for the individual threads.
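
As a rough illustration of that first point, the arithmetic intensity of a kernel (FLOPs per byte of global memory traffic) can be compared against the ratio of the chip's peak compute to its peak bandwidth; the figures below are placeholders, not real Mali numbers:

```c
#include <stdio.h>

int main(void) {
  /* Placeholder peak figures -- substitute the ones for your device. */
  const double peak_gflops   = 200.0; /* peak compute, GFLOP/s (assumed)  */
  const double peak_gbytes_s =  15.0; /* DRAM bandwidth, GB/s (assumed)   */

  /* FLOPs the kernel must perform per byte of global memory traffic
     before compute, rather than bandwidth, becomes the limit. */
  const double required_flops_per_byte = peak_gflops / peak_gbytes_s;

  /* An element-wise float multiply does 1 FLOP per 12 bytes moved:
     two 4-byte loads plus one 4-byte store per element. */
  const double eltwise_flops_per_byte = 1.0 / 12.0;

  printf("required FLOPs/byte to be compute bound: %.1f\n",
         required_flops_per_byte);
  printf("element-wise multiply FLOPs/byte:        %.3f (bandwidth bound)\n",
         eltwise_flops_per_byte);
  return 0;
}
```

With numbers in that range, an element-wise kernel can never reach peak FLOPS; the only thing left to optimize is how efficiently it streams memory.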

Now, GEMM and convolution are quite difficult to get exactly right; for element-wise operations it's much easier.
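
For example, a basic element-wise multiply kernel (a generic OpenCL sketch, not libdnn's code) needs no local memory or tiling at all; its performance is set almost entirely by global memory throughput:

```c
// Generic element-wise multiply: c[i] = a[i] * b[i].
// No local memory or tiling is needed; the kernel is bandwidth bound.
__kernel void eltwise_mul(__global const float *a,
                          __global const float *b,
                          __global float *c,
                          const int n) {
  const int i = get_global_id(0);
  if (i < n) {
    c[i] = a[i] * b[i];
  }
}
```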

LibDNN is mainly developed for desktop-class GPUs (AMD RX480, W9100 and NVIDIA GTX 980, 1080) at the moment.
The problem with mobile chips and Intel chips (for which we have separate spatial kernels in Caffe) is the reduced memory bandwidth, little or no local memory, and the other tricks you have to apply.

@295988101
Author

@naibaf7 thank you.
I have read those two articles, but their main content is not aimed at Mali GPUs. I have read other articles about Mali and followed them: without local memory, I use vector types, half precision, vload, and other methods to make the kernels faster, and it helps.
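
For comparison, a sketch of the kind of vectorized, half-precision variant described above (assuming the device supports cl_khr_fp16 and the element count is a multiple of 4):

```c
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

// Each work-item handles 4 elements via vload4/vstore4, halving the bytes
// moved per element compared to float and widening each memory access.
__kernel void eltwise_mul_half4(__global const half *a,
                                __global const half *b,
                                __global half *c) {
  const int i = get_global_id(0);   // i indexes groups of 4 elements
  const half4 va = vload4(i, a);
  const half4 vb = vload4(i, b);
  vstore4(va * vb, i, c);
}
```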

@295988101
Author

@naibaf7 But I do not understand what you mean by "you mainly have to find out the required FLOPS per global memory read/write to fully occupy the chip, as well as memory reading/writing strides for the individual threads." How do I do that?

@bhack

bhack commented Dec 3, 2016

I've opened a specific issue at #18
