The computational cost of this naive Permute Kernel comes from two sources: the coordinate permutation itself, and the memory access overhead of the data movement. For these two perspectives, we introduce the following optimization schemes.

Static Dispatch of IndexType

As deep learning models get larger, the number of elements involved in an operation may exceed the range representable by int32_t. Moreover, in the coordinate permutation, the division operations have different overheads for different integer types. Therefore, we add a template parameter IndexType to the kernel function to specify the data type of the index, and select int32_t or int64_t for IndexType depending on the number of elements involved in the Permute.

Merging Dimensions

In some special cases, Permute dimensions can be merged, with the following rules:

1. Dimensions of size 1 can be removed directly.
2. Consecutive dimensions can be merged into one dimension.

For the second rule, consider a Permute case such as (0, 1, 2, 3) -> (2, 3, 0, 1): dimensions 0 and 1 stay adjacent and in order, as do dimensions 2 and 3, so each pair can be merged and the Permute treated as the 2-dimensional case (0, 1) -> (1, 0).

After merging dimensions, when calculating indexes from offsets using the NdIndexOffsetHelper, we need 4-dimensional index calculations before merging but only 2-dimensional ones after merging. This reduces the number of divisions and multiplications compared with the unmerged case, and thus increases speed.

Using a Larger Access Granularity

You may have noticed the template parameter size_t movement_size in the kernel function, which indicates the granularity at which elements are accessed. The NVIDIA performance optimization blog post Increase Performance with Vectorized Memory Access points out that the performance of a CUDA Kernel can be improved by vectorizing memory operations, which reduces the number of instructions and improves bandwidth utilization.

We set the rules for the access granularity as follows:

1. CUDA supports access granularities of 1B, 2B, 4B, 8B, and 16B; the larger the granularity, the better the performance.
2. The last dimension is moved as a whole, i.e. it remains the last dimension after the permutation, and its size in bytes is a multiple of the new access granularity.
3. The data pointers meet the alignment requirements of the new access granularity.

The second rule corresponds to a Permute scenario such as (0, 1, 2) -> (1, 0, 2): the last dimension stays in place, so consecutive elements along it can be moved with a single wider load and store. The sketches below illustrate how these ideas fit together.
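First, the index arithmetic from the dimension-merging discussion. The following is a simplified stand-in for the NdIndexOffsetHelper mentioned above, a sketch of the core idea rather than OneFlow's actual, more general implementation. The point to notice is that each offset-to-index conversion costs one division and one modulo per dimension, which is exactly the work that dimension merging reduces.

```cpp
#include <cstdint>

// Simplified sketch of an N-d index <-> linear offset helper in the spirit
// of OneFlow's NdIndexOffsetHelper (the real one is more general).
template<typename IndexType, int ndim>
struct NdIndexOffsetHelper {
  IndexType stride[ndim];

  explicit NdIndexOffsetHelper(const IndexType* dims) {
    stride[ndim - 1] = 1;
    for (int i = ndim - 2; i >= 0; --i) { stride[i] = stride[i + 1] * dims[i + 1]; }
  }

  // One division and one modulo per dimension: merging a 4-d Permute into a
  // 2-d one halves this arithmetic for every element moved.
  __host__ __device__ void OffsetToNdIndex(IndexType offset, IndexType* index) const {
    IndexType remaining = offset;
    for (int i = 0; i < ndim - 1; ++i) {
      index[i] = remaining / stride[i];
      remaining -= index[i] * stride[i];
    }
    index[ndim - 1] = remaining;
  }

  __host__ __device__ IndexType NdIndexToOffset(const IndexType* index) const {
    IndexType offset = 0;
    for (int i = 0; i < ndim; ++i) { offset += index[i] * stride[i]; }
    return offset;
  }
};
```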
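Next, a minimal sketch of how the three optimizations can appear together in one kernel for the (0, 1, 2) -> (1, 0, 2) scenario. This is an illustration under assumptions, not OneFlow's actual code: Pack, PermuteKernel, and LaunchPermute are names made up for this example, and the index math is written inline for clarity instead of going through the helper. A caller would instantiate it with the largest movement_size (16, 8, 4, ...) that divides the last dimension's byte size and matches the alignment of both pointers, per the rules above.

```cpp
#include <cstdint>
#include <cuda_runtime.h>

// Moves `movement_size` bytes per load/store (hypothetical helper type).
template<size_t movement_size>
struct alignas(movement_size) Pack {
  char data[movement_size];
};

// (0, 1, 2) -> (1, 0, 2) Permute where the last dimension stays in place,
// so it is moved in Packs of `movement_size` bytes instead of one element
// at a time.
template<size_t movement_size, typename IndexType>
__global__ void PermuteKernel(const Pack<movement_size>* src,
                              Pack<movement_size>* dst,
                              IndexType dim0, IndexType dim1,
                              IndexType dim2_packs,  // last dim, in Packs
                              IndexType count) {     // total number of Packs
  for (IndexType i = blockIdx.x * blockDim.x + threadIdx.x; i < count;
       i += static_cast<IndexType>(blockDim.x) * gridDim.x) {
    // Decompose the destination offset into (j, i0, k); after dimension
    // merging, only these few divisions and modulos remain per element.
    const IndexType k = i % dim2_packs;
    const IndexType rest = i / dim2_packs;
    const IndexType i0 = rest % dim0;  // dst dim 1 == src dim 0
    const IndexType j = rest / dim0;   // dst dim 0 == src dim 1
    dst[i] = src[(i0 * dim1 + j) * dim2_packs + k];
  }
}

// Static dispatch of IndexType: 32-bit indexing is used whenever it cannot
// overflow, because 64-bit division is noticeably more expensive on GPU.
template<size_t movement_size>
void LaunchPermute(const void* src, void* dst,
                   int64_t dim0, int64_t dim1, int64_t dim2_bytes) {
  const int64_t dim2_packs = dim2_bytes / movement_size;  // rule 2: divisible
  const int64_t count = dim0 * dim1 * dim2_packs;
  const int block_size = 256;
  const int grid_size = static_cast<int>((count + block_size - 1) / block_size);
  const auto* s = static_cast<const Pack<movement_size>*>(src);
  auto* d = static_cast<Pack<movement_size>*>(dst);
  if (count <= INT32_MAX) {
    PermuteKernel<movement_size, int32_t><<<grid_size, block_size>>>(
        s, d, static_cast<int32_t>(dim0), static_cast<int32_t>(dim1),
        static_cast<int32_t>(dim2_packs), static_cast<int32_t>(count));
  } else {
    PermuteKernel<movement_size, int64_t><<<grid_size, block_size>>>(
        s, d, dim0, dim1, dim2_packs, count);
  }
}
```

With movement_size = 16 and half data, one Pack moves eight elements per instruction, which is where the bandwidth improvement comes from.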
Performance Comparison

The test environment is an NVIDIA A100 40GB, and the scenario is (0, 1, 2) -> (1, 0, 2), with the horizontal coordinates indicating the data shape and data type. The test data covers sizes from 16MB to 128MB, in both the fp32 and half data types.

As we can see from the two graphs above, OneFlow can approximate or even slightly exceed the bandwidth of the Copy operation in most cases. In terms of running time, OneFlow is at least 1.24 times and at most 1.4 times faster than PyTorch. The bandwidth of Permute is a little higher than that of native Copy because the Copy Kernel has no loop-unrolling optimization for inter-instruction parallelism, whereas the Permute Kernel applies that optimization internally.

BatchTranspose Optimization

Using the two optimization techniques above, OneFlow can easily be faster than PyTorch. However, regular Permute has to handle a wide range of situations, so there may still be cases where memory accesses are not merged. In some special cases we can merge those accesses to further improve bandwidth utilization and speed, which brings us to the BatchTranspose optimization.

BatchTranspose, i.e. batched matrix transpose, exchanges only the last two dimensions of the matrix. Cases such as (0, 1) -> (1, 0) and (0, 1, 2) -> (0, 2, 1) meet the definition of BatchTranspose, where the contents of the brackets indicate the order of the dimensions.

As we can see from the graph, OneFlow can approach the native Copy operation in most cases, in terms of both computation time and bandwidth utilization.
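To make this concrete, here is a minimal sketch of the classic shared-memory tiling technique that a BatchTranspose kernel builds on, modeled on NVIDIA's well-known matrix-transpose sample rather than OneFlow's actual kernel, for the (0, 1, 2) -> (0, 2, 1) case. Reading a 32x32 tile row by row and writing it back with the tile coordinates swapped keeps both the global loads and the global stores coalesced, which is exactly the "merged accesses" described above.

```cpp
#include <cuda_runtime.h>

constexpr int kTileDim = 32;   // tile is kTileDim x kTileDim elements
constexpr int kBlockRows = 8;  // each thread covers kTileDim/kBlockRows rows

// Batched (0, 1, 2) -> (0, 2, 1) transpose: src is (batch, rows, cols),
// dst is (batch, cols, rows). Launch with block (32, 8) and
// grid (ceil(cols / 32), ceil(rows / 32), batch).
template<typename T>
__global__ void BatchTransposeKernel(const T* src, T* dst, int rows, int cols) {
  __shared__ T tile[kTileDim][kTileDim + 1];  // +1 column avoids bank conflicts

  const size_t batch_offset = static_cast<size_t>(blockIdx.z) * rows * cols;
  const T* src_batch = src + batch_offset;
  T* dst_batch = dst + batch_offset;

  // Coalesced read: threads of a warp load consecutive columns of src.
  int x = blockIdx.x * kTileDim + threadIdx.x;
  int y = blockIdx.y * kTileDim + threadIdx.y;
  for (int i = 0; i < kTileDim; i += kBlockRows) {
    if (x < cols && y + i < rows) {
      tile[threadIdx.y + i][threadIdx.x] =
          src_batch[static_cast<size_t>(y + i) * cols + x];
    }
  }
  __syncthreads();

  // Coalesced write: the swap happens in shared memory, not in the global
  // addresses, so stores also hit consecutive addresses.
  x = blockIdx.y * kTileDim + threadIdx.x;  // column in dst == row in src
  y = blockIdx.x * kTileDim + threadIdx.y;  // row in dst == column in src
  for (int i = 0; i < kTileDim; i += kBlockRows) {
    if (x < rows && y + i < cols) {
      dst_batch[static_cast<size_t>(y + i) * rows + x] =
          tile[threadIdx.x][threadIdx.y + i];
    }
  }
}
```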