
2....a) Hardware: ④ HTA


Motivation of HTA

Although FBTA and PATA provide high-performance solutions to DMM, their area consumption increases notably when the number of MAUs exceeds 512, because of the large width of the BV arrays in HLS and the highly parallel computation over them. FBTA and PATA therefore work well for coarse-grained allocation, such as the allocation of arrays, for which designers can define MAUs with a large size (e.g., 1 KB) so that there are not too many MAUs. However, a solution based on large MAUs is not suitable for fine-grained allocation, because it leads to serious memory fragmentation and wasted memory. To overcome this limitation, the Hybrid Tree Allocator (HTA) provided by Hi-DMM adopts the hierarchical group-tree idea from SysAlloc to store the information of the buddy tree, which saves a lot of area. Unlike SysAlloc, however, HTA does not rely on an AND-gate-tree search, so it completes an allocation request over thousands of MAUs in fewer than 20 cycles instead of suffering SysAlloc's long allocation latency (hundreds of cycles).


Structure of Hybrid Tree:

The hybrid tree consists of two different kinds of layers, trunk layers and branch layers, as shown in the figure below. The depths of the trunk and branch parts of a hybrid tree are D_t and D_b respectively. Based on this structure, HTA can manage 2^(D_t+D_b-2) MAUs. The trunk part of the hybrid tree is described in the same way as in FBTA, but each leaf of the trunk part, e.g. nodes 4, 5, 6 and 7 in the figure below, is mapped to a corresponding group tree in the branch layers, indicating whether a large memory block is available. In the branch layers, group trees are used to manage fine-grained allocation and each of their leaves is mapped to a MAU. To locate an available small memory node efficiently, these group trees are described by two different kinds of BVs: layer BVs and group tree BVs. Each branch layer has its own layer BV, and each bit of a layer BV records the status of that layer in a specific group tree, indicating whether the group tree has available nodes in that branch layer. For example, in the figure below, the first two layer BVs (01...) indicate that the first group tree has available nodes in those layers while the second group tree does not. In contrast, the third layer BV (00...) indicates that both group trees have available nodes in the third layer. Each group tree in the branch part is described by a group tree BV, where each bit indicates the status of one node of the group tree. Therefore, if the tree in the trunk layers has 2^(D_t-1) leaves, there will be 2^(D_t-1) group trees in the branch part and the width of each layer BV will be 2^(D_t-1). Similarly, the width of each group tree BV should be 2^(D_b)-1.
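To make the bookkeeping concrete, the following is a minimal C sketch of the state that a hybrid tree keeps, assuming a toy configuration with trunk depth 3 and branch depth 3 as in the figure; the type and field names (hybrid_tree_t, trunk_bv, layer_bv, group_bv) are illustrative and are not identifiers from the Hi-DMM sources.

```c
#include <stdint.h>

/* Toy configuration: trunk depth D_T = 3, branch depth D_B = 3. */
#define D_T 3
#define D_B 3
#define TRUNK_LEAVES (1u << (D_T - 1))        /* 4 trunk leaves -> 4 group trees */
#define GROUP_NODES  ((1u << D_B) - 1)        /* 7 nodes per group tree          */
#define NUM_MAUS     (1u << (D_T + D_B - 2))  /* 16 MAUs managed in this config  */

typedef struct {
    uint8_t trunk_bv;                 /* one bit per trunk node (7 nodes)          */
    uint8_t layer_bv[D_B];            /* one BV per branch layer,
                                         one bit per group tree (width 4)          */
    uint8_t group_bv[TRUNK_LEAVES];   /* one BV per group tree,
                                         one bit per node (width 7)                */
} hybrid_tree_t;
```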


Allocation Stage of HTA:

Allocation requests are classified as coarse-grained or fine-grained according to their sizes. A request whose size is greater than 2^(D_b-1) MAUs (the number of MAUs covered by one group tree) is coarse-grained. The allocation for such a request only involves the trunk part of the hybrid tree, and the process is the same as in FBTA. In contrast, fine-grained allocation involves the group trees in the branch part of the hybrid tree. According to the request size, HTA locates the target layer for allocation in the branch part. Based on the corresponding layer BV, HTA finds a group tree that has space for the request by searching for the first zero bit in that layer BV. Finally, the techniques of FBTA are applied to the located group tree and the address of the allocable block is obtained. Since HTA conducts the first-zero-bit search twice for fine-grained allocation, the latency of the allocation stage is roughly twice that of FBTA.
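The fine-grained path can be sketched in software-style C as below; the real allocator is a hardware IP, so this is only a conceptual sketch. It reuses the toy hybrid_tree_t structure, D_B and TRUNK_LEAVES from the sketch above, assumes 0 = available / 1 = occupied in every BV, and assumes layer_bv[0] describes the group-tree root layer; the helper names are hypothetical.

```c
/* index of the first zero bit among the lowest `width` bits, or -1 if none */
static int first_zero_bit(uint32_t bv, int width) {
    for (int i = 0; i < width; i++)
        if (((bv >> i) & 1u) == 0)
            return i;
    return -1;
}

/* allocate `size` MAUs (size <= 2^(D_B-1)); returns a MAU address or -1 */
int hta_alloc_fine(hybrid_tree_t *t, unsigned size) {
    /* 1. locate the target branch layer: layer l serves blocks of 2^(D_B-1-l) MAUs */
    int layer = D_B - 1;
    unsigned block = 1;
    while (block < size) { block <<= 1; layer--; }

    /* 2. the first zero bit of that layer BV picks a group tree with space */
    int gt = first_zero_bit(t->layer_bv[layer], (int)TRUNK_LEAVES);
    if (gt < 0) return -1;

    /* 3. inside the chosen group tree, scan the nodes of the target layer
     *    (nodes 2^layer .. 2^(layer+1)-1 in heap order); in hardware this
     *    step reuses the FBTA search logic instead of a loop */
    int first = 1 << layer, last = (1 << (layer + 1)) - 1;
    for (int node = first; node <= last; node++) {
        if (((t->group_bv[gt] >> (node - 1)) & 1u) == 0) {
            /* address = offset of the group tree + offset inside the tree;
             * BV maintenance (marking) is covered in the next section */
            return (int)((unsigned)gt * (1u << (D_B - 1)) + (unsigned)(node - first) * block);
        }
    }
    return -1;  /* unreachable if the layer BV is maintained correctly */
}
```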

This procedure is also illustrated in the figure below:


Maintenance Stage of HTA:

The maintenance of the trunk part and of the group tree BVs in the branch part can be implemented according to the definition of the OR-gate tree. Note that the maintenance required after allocating a given node of the buddy tree is known at compile time. Therefore, Hi-DMM maintains a group tree BV by applying an OR operation with a mask BV to it. These mask BVs can be grouped into a look-up table so that the maintenance of group tree BVs requires lower latency. For example, if node 14 is allocated, then nodes 7, 28 and 29 should be marked as 1, and the corresponding mask BV is 0011011. Other related examples are shown in the figure below. As for the layer BVs, when the request is coarse-grained, they are only involved in downward marking, and the operation applied to the BV of the leaves in the trunk layers can be passed to these layer BVs directly. However, when the request is fine-grained, each layer BV of the branch part needs to be updated according to the status of its group tree. Since only the bits for the target group tree in the layer BVs are involved in maintenance, the computation overhead is low. Moreover, since the layer BVs are independent of each other, HTA can update them concurrently.
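Below is a minimal C sketch of such a mask look-up table for one group tree with D_b = 3. The names and the bit ordering (bit 0 for the group tree's root, nodes numbered 1..7 locally in heap order) are assumptions, chosen so that the mask for the second node reproduces the 0011011 value of the node-14 example above.

```c
#include <stdint.h>

#define GT_NODES 7   /* one group tree with D_b = 3 has 7 nodes */

/* mask for allocating node n: the node itself, all its ancestors and all its
 * descendants, so a single OR marks everything that becomes unallocatable   */
static uint8_t alloc_mask[GT_NODES + 1];   /* indexed by local node number 1..7 */

static void build_mask_lut(void) {
    for (int n = 1; n <= GT_NODES; n++) {
        uint8_t m = 0;
        /* the node itself and its ancestors */
        for (int a = n; a >= 1; a /= 2)
            m |= (uint8_t)(1u << (a - 1));
        /* its descendants, level by level in heap order */
        for (int d = 2 * n; d <= GT_NODES; d *= 2) {
            int level_width = d / n;   /* 2 children, then 4 grandchildren, ... */
            for (int k = 0; k < level_width && d + k <= GT_NODES; k++)
                m |= (uint8_t)(1u << (d + k - 1));
        }
        alloc_mask[n] = m;
    }
    /* e.g. alloc_mask[2] == 0x1B, i.e. 0011011, matching the node-14 example
     * above when the group tree's nodes are renumbered 1..7 locally */
}

/* maintenance after allocating node n: a single OR with the precomputed mask */
static inline uint8_t mark_allocated(uint8_t group_bv, int n) {
    return group_bv | alloc_mask[n];
}
```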


Extension of HTA:

The method described above lets HTA handle 2K MAUs with very low latency. To let HTA handle more MAUs (up to 64K) while maintaining low latency, some extension is needed, and Ext-HTA is proposed. Its source code is also provided in the directory "/Allocator_IPs". In the original HTA, each leaf of the hybrid tree indicates a MAU, but in Ext-HTA each leaf accounts for a small buddy tree (with a small number of layers), and each small buddy tree can handle more fine-grained allocation. Apart from this extension of HTA, we also equip Ext-HTA with Ext-KWTA, which is used to find an available small buddy tree among the 2K possible candidates with low latency. Designers can look into the source code of Ext-KWTA to see the interaction between Ext-KWTA and Ext-HTA. When the accelerator requests a large block of memory, Ext-HTA responds; when the request is small, Ext-KWTA responds as well. These two allocators co-operate, and the compiler interconnects them automatically.
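As a purely conceptual illustration of this co-operation, the sketch below dispatches a request by size between the two allocators. Everything in it (function names, the threshold, the software interface) is a hypothetical stand-in: in the real design the two hardware IPs are interconnected automatically by the compiler.

```c
#define SMALL_TREE_MAUS 32u   /* hypothetical capacity of one small buddy tree */

/* stand-ins for the two allocator IPs; the real ones return a MAU address */
static int ext_hta_alloc(unsigned num_maus)  { (void)num_maus; return -1; }
static int ext_kwta_alloc(unsigned num_maus) { (void)num_maus; return -1; }

/* Large requests go to Ext-HTA (coarse-grained, trunk side); small requests
 * are served with the help of Ext-KWTA, which locates an available small
 * buddy tree among the 2K candidates before the allocation is completed. */
int hybrid_alloc(unsigned num_maus) {
    if (num_maus > SMALL_TREE_MAUS)
        return ext_hta_alloc(num_maus);
    return ext_kwta_alloc(num_maus);
}
```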