The computational cost of this naive Permute Kernel comes from two sources: the coordinate permutation itself, and the memory access overhead of the data movement. For these two perspectives, we introduce the following optimization schemes.

Static Dispatch of IndexType

As deep learning models get larger, the number of elements involved in an operation may exceed the range representable by int32_t. Moreover, in the coordinate permutation, the division operations have different overheads for different integer types. Therefore, we add a template parameter IndexType to the kernel function to specify the data type of the index, and select int32_t or int64_t for IndexType depending on the number of elements involved in the Permute.

Merging Dimensions

In some special cases, Permute dimensions can be merged, with the following rules:

1. Dimensions of size 1 can be removed directly.
2. Consecutive dimensions can be merged into one dimension.

For the second rule, consider a Permute case such as (0, 1, 2, 3) -> (2, 3, 0, 1): dimensions 0 and 1 stay adjacent and in order, as do dimensions 2 and 3, so each pair can be merged and the Permute treated as the 2-dimensional case (0, 1) -> (1, 0).

After merging dimensions, when calculating indexes from offsets using the NdIndexOffsetHelper, we need 4-dimensional index calculations before merging but only 2-dimensional ones after merging. This reduces the number of divisions and multiplications compared with the unmerged case, and thus increases speed.

Using a Larger Access Granularity

You may have noticed the template parameter size_t movement_size in the kernel function, which indicates the granularity at which elements are accessed. The NVIDIA performance optimization blog post Increase Performance with Vectorized Memory Access points out that the performance of a CUDA Kernel can be improved by vectorizing memory operations, which reduces the number of instructions and improves bandwidth utilization.

We set the rules for the access granularity as follows:

1. CUDA supports access granularities of 1B, 2B, 4B, 8B, and 16B; the larger the granularity, the better the performance.
2. The last dimension is moved as a whole, i.e. it remains the last dimension after the permutation, and its size in bytes is a multiple of the new access granularity.
3. The data pointers meet the alignment requirements of the new access granularity.

The second rule corresponds to a Permute scenario such as (0, 1, 2) -> (1, 0, 2): the last dimension stays in place, so consecutive elements along it can be moved with a single wider load and store. The sketches below illustrate how these ideas fit together.
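First, the index arithmetic from the dimension-merging discussion. The following is a simplified stand-in for the NdIndexOffsetHelper mentioned above, a sketch of the core idea rather than OneFlow's actual, more general implementation. The point to notice is that each offset-to-index conversion costs one division and one modulo per dimension, which is exactly the work that dimension merging reduces.

```cpp
#include <cstdint>

// Simplified sketch of an N-d index <-> linear offset helper in the spirit
// of OneFlow's NdIndexOffsetHelper (the real one is more general).
template<typename IndexType, int ndim>
struct NdIndexOffsetHelper {
  IndexType stride[ndim];

  explicit NdIndexOffsetHelper(const IndexType* dims) {
    stride[ndim - 1] = 1;
    for (int i = ndim - 2; i >= 0; --i) { stride[i] = stride[i + 1] * dims[i + 1]; }
  }

  // One division and one modulo per dimension: merging a 4-d Permute into a
  // 2-d one halves this arithmetic for every element moved.
  __host__ __device__ void OffsetToNdIndex(IndexType offset, IndexType* index) const {
    IndexType remaining = offset;
    for (int i = 0; i < ndim - 1; ++i) {
      index[i] = remaining / stride[i];
      remaining -= index[i] * stride[i];
    }
    index[ndim - 1] = remaining;
  }

  __host__ __device__ IndexType NdIndexToOffset(const IndexType* index) const {
    IndexType offset = 0;
    for (int i = 0; i < ndim; ++i) { offset += index[i] * stride[i]; }
    return offset;
  }
};
```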
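Next, a minimal sketch of how the three optimizations can appear together in one kernel for the (0, 1, 2) -> (1, 0, 2) scenario. This is an illustration under assumptions, not OneFlow's actual code: Pack, PermuteKernel, and LaunchPermute are names made up for this example, and the index math is written inline for clarity instead of going through the helper. A caller would instantiate it with the largest movement_size (16, 8, 4, ...) that divides the last dimension's byte size and matches the alignment of both pointers, per the rules above.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

// Moves `movement_size` bytes per load/store (hypothetical helper type).
template<size_t movement_size>
struct alignas(movement_size) Pack {
  char data[movement_size];
};

// (0, 1, 2) -> (1, 0, 2) Permute where the last dimension stays in place,
// so it is moved in Packs of `movement_size` bytes instead of one element
// at a time.
template<size_t movement_size, typename IndexType>
__global__ void PermuteKernel(const Pack<movement_size>* src,
                              Pack<movement_size>* dst,
                              IndexType dim0, IndexType dim1,
                              IndexType dim2_packs,  // last dim, in Packs
                              IndexType count) {     // total number of Packs
  for (IndexType i = blockIdx.x * blockDim.x + threadIdx.x; i < count;
       i += static_cast<IndexType>(blockDim.x) * gridDim.x) {
    // Decompose the destination offset into (j, i0, k); after dimension
    // merging, only these few divisions and modulos remain per element.
    const IndexType k = i % dim2_packs;
    const IndexType rest = i / dim2_packs;
    const IndexType i0 = rest % dim0;  // dst dim 1 == src dim 0
    const IndexType j = rest / dim0;   // dst dim 0 == src dim 1
    dst[i] = src[(i0 * dim1 + j) * dim2_packs + k];
  }
}

// Static dispatch of IndexType: 32-bit indexing is used whenever it cannot
// overflow, because 64-bit division is noticeably more expensive on GPU.
template<size_t movement_size>
void LaunchPermute(const void* src, void* dst,
                   int64_t dim0, int64_t dim1, int64_t dim2_bytes) {
  const int64_t dim2_packs = dim2_bytes / movement_size;  // rule 2: divisible
  const int64_t count = dim0 * dim1 * dim2_packs;
  const int block_size = 256;
  const int grid_size = static_cast<int>((count + block_size - 1) / block_size);
  const auto* s = static_cast<const Pack<movement_size>*>(src);
  auto* d = static_cast<Pack<movement_size>*>(dst);
  if (count <= INT32_MAX) {
    PermuteKernel<movement_size, int32_t><<<grid_size, block_size>>>(
        s, d, static_cast<int32_t>(dim0), static_cast<int32_t>(dim1),
        static_cast<int32_t>(dim2_packs), static_cast<int32_t>(count));
  } else {
    PermuteKernel<movement_size, int64_t><<<grid_size, block_size>>>(
        s, d, dim0, dim1, dim2_packs, count);
  }
}
```

With movement_size = 16 and half data, one Pack moves eight elements per instruction, which is where the bandwidth improvement comes from.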
Performance Comparison

The test environment is an NVIDIA A100 40GB, and the scenario is (0, 1, 2) -> (1, 0, 2), with the horizontal coordinates indicating the data shape and data type. The test data covers sizes from 16MB to 128MB, in both the fp32 and half data types.

As we can see from the two graphs above, OneFlow can approximate or even slightly exceed the bandwidth of the Copy operation in most cases. In terms of running time, OneFlow is at least 1.24 times and at most 1.4 times faster than PyTorch. The bandwidth of Permute is a little higher than that of native Copy because the Copy Kernel has no loop-unrolling optimization for inter-instruction parallelism, whereas the Permute Kernel applies that optimization internally.

BatchTranspose Optimization

Using the two optimization techniques above, OneFlow can easily be faster than PyTorch. However, regular Permute has to handle a wide range of situations, so there may still be cases where memory accesses are not merged. In some special cases we can merge those accesses to further improve bandwidth utilization and speed, which brings us to the BatchTranspose optimization.

BatchTranspose, i.e. batched matrix transpose, exchanges only the last two dimensions of the matrix. Cases such as (0, 1) -> (1, 0) and (0, 1, 2) -> (0, 2, 1) meet the definition of BatchTranspose, where the contents of the brackets indicate the order of the dimensions.

As we can see from the graph, OneFlow can approach the native Copy operation in most cases, in terms of both computation time and bandwidth utilization.
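To make this concrete, here is a minimal sketch of the classic shared-memory tiling technique that a BatchTranspose kernel builds on, modeled on NVIDIA's well-known matrix-transpose sample rather than OneFlow's actual kernel, for the (0, 1, 2) -> (0, 2, 1) case. Reading a 32x32 tile row by row and writing it back with the tile coordinates swapped keeps both the global loads and the global stores coalesced, which is exactly the "merged accesses" described above.

```cpp
#include <cuda_runtime.h>

constexpr int kTileDim = 32;   // tile is kTileDim x kTileDim elements
constexpr int kBlockRows = 8;  // each thread covers kTileDim/kBlockRows rows

// Batched (0, 1, 2) -> (0, 2, 1) transpose: src is (batch, rows, cols),
// dst is (batch, cols, rows). Launch with block (32, 8) and
// grid (ceil(cols / 32), ceil(rows / 32), batch).
template<typename T>
__global__ void BatchTransposeKernel(const T* src, T* dst, int rows, int cols) {
  __shared__ T tile[kTileDim][kTileDim + 1];  // +1 column avoids bank conflicts

  const size_t batch_offset = static_cast<size_t>(blockIdx.z) * rows * cols;
  const T* src_batch = src + batch_offset;
  T* dst_batch = dst + batch_offset;

  // Coalesced read: threads of a warp load consecutive columns of src.
  int x = blockIdx.x * kTileDim + threadIdx.x;
  int y = blockIdx.y * kTileDim + threadIdx.y;
  for (int i = 0; i < kTileDim; i += kBlockRows) {
    if (x < cols && y + i < rows) {
      tile[threadIdx.y + i][threadIdx.x] =
          src_batch[static_cast<size_t>(y + i) * cols + x];
    }
  }
  __syncthreads();

  // Coalesced write: the swap happens in shared memory, not in the global
  // addresses, so stores also hit consecutive addresses.
  x = blockIdx.y * kTileDim + threadIdx.x;  // column in dst == row in src
  y = blockIdx.x * kTileDim + threadIdx.y;  // row in dst == column in src
  for (int i = 0; i < kTileDim; i += kBlockRows) {
    if (x < rows && y + i < cols) {
      dst_batch[static_cast<size_t>(y + i) * rows + x] =
          tile[threadIdx.x][threadIdx.y + i];
    }
  }
}
```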