1. Stop using ndloop and compute each operation with a single CUDA kernel using an indexer.
2. Compact dimensions to make the indexer computation fast.

Element-wise (binary) operations are already done in https://github.com/sonots/cumo/pull/64, but reductions and other operations such as `store_from` are not yet done.

Without this, cumo (and red-chainer) cannot compete with cupy (and chainer).

Current performance comparison on a K80 machine:

* chainer mnist: 5 sec / epoch
* red-chainer mnist: 13 sec / epoch
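
To make the two steps at the top of this issue concrete, here is a minimal sketch of the idea in Python. It is not cumo's implementation: the names `compact_dims` and `offset_of` are hypothetical, strides are assumed to be counted in elements rather than bytes, and the real code would live in C/CUDA, where each thread would run the equivalent of `offset_of` on its own flat index.

```python
def compact_dims(shape, strides):
    """Merge adjacent dimensions that are contiguous with each other.

    Hypothetical helper. Strides are in elements. An outer dimension can be
    merged with the next inner one when
    outer_stride == inner_stride * inner_size.
    """
    out_shape, out_strides = [], []
    for size, stride in zip(shape, strides):
        if size == 1:
            continue  # a size-1 dimension never changes the offset
        if out_shape and out_strides[-1] == stride * size:
            # The previous (outer) dimension steps exactly over this one,
            # so both behave like a single, larger dimension.
            out_shape[-1] *= size
            out_strides[-1] = stride
        else:
            out_shape.append(size)
            out_strides.append(stride)
    if not out_shape:
        # Every dimension had size 1: treat it as a single element.
        out_shape, out_strides = [1], [1]
    return out_shape, out_strides


def offset_of(flat_index, shape, strides):
    """Map a flat (row-major) element index to a memory offset.

    This is the per-thread work of an indexer: one divmod per dimension,
    which is why fewer dimensions means a cheaper kernel.
    """
    offset = 0
    for size, stride in zip(reversed(shape), reversed(strides)):
        flat_index, coord = divmod(flat_index, size)
        offset += coord * stride
    return offset


# A fully contiguous 2x3x4 array collapses to one dimension.
print(compact_dims([2, 3, 4], [12, 4, 1]))  # ([24], [1])

# A 2x3x4 view into a 2x3x8 array: only the outer two dimensions merge.
print(compact_dims([2, 3, 4], [24, 8, 1]))  # ([6, 4], [8, 1])
```

In the second case the indexer needs two divisions per element instead of three, and in the fully contiguous case the offset is simply the flat index. A transposed view such as shape `[4, 3, 2]` with strides `[1, 4, 12]` cannot be compacted by this rule and is returned unchanged.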