Review Src register re-use in matmul #5

rtawfik01 · 2024-05-28T16:40:17Z

Source register re-use in the llk_unpack_AB_matmul.h and llk_math_matmul.h kernels was necessary on wormhole_b0 because the unpacker bandwidth was too low to saturate the compute engine. On blackhole that is no longer the case, so we should review the matmul kernel and measure its performance without Source register re-use. The re-use logic is found here:

    for (uint t = 0; t < t_dim; t++) {

        std::uint32_t offset_address_a = tile_size_a*(tile_index_a + (reuse_a ? (t*kt_dim) : (0)));
        std::uint32_t offset_address_b = tile_size_b*(tile_index_b + (reuse_a ? (0       ) : (t)));
        std::uint32_t address_a = base_address_a + offset_address_a;
        std::uint32_t address_b = base_address_b + offset_address_b;

        // Wait for free context
        wait_for_next_context(2);

        semaphore_post(semaphore::UNPACK_SYNC);  // Trisc::SEMPOST for context acquire

        // Program unpacker 1 base address
        if (0 == unp_cfg_context) {
            cfg[THCON_SEC0_REG3_Base_address_ADDR32] = address_b;
            cfg[THCON_SEC1_REG3_Base_address_ADDR32] = address_a;
        } else {
            cfg[THCON_SEC0_REG3_Base_cntx1_address_ADDR32] = address_b;
            cfg[THCON_SEC1_REG3_Base_cntx1_address_ADDR32] = address_a;
        }

        if (reuse_a) {
            #if SKIP_UNP == 1
                TTI_NOP;
            #else
                if (partial_face) {
                    // Do face by face unpacking
                    TTI_UNPACR(SrcB, 0b00010001, 0, 0, 0, 1 /*Set OvrdThreadId*/, 0 /*Set Dvalid*/, p_unpacr::RAREFYB_DISABLE, 0, 0 /* Set ContextIdInc */, 0, 0, 1);
                    TTI_UNPACR(SrcB, 0b00010001, 0, 0, 0, 1 /*Set OvrdThreadId*/, 1 /*Set Dvalid*/, p_unpacr::RAREFYB_DISABLE, 0, 0 /* Set ContextIdInc */, 0, 0, 1);
                    TTI_SETADCZW(p_setadc::UNP_B, 0, 0, 0, 0, 0b0101); // Set ch0_z=0, ch1_z=0
                } else {
                    TTI_UNPACR(SrcB, 0, 0, 0, 0, 1 /*Set OvrdThreadId*/, 1 /*Set Dvalid*/, p_unpacr::RAREFYB_DISABLE, 0, 0 /* Set ContextIdInc */, 0, 0, 1);
                }
            #endif
        } else {
            #if SKIP_UNP == 1
                TTI_NOP;
            #else
                TTI_UNPACR(SrcA, 0, 0, 0, 0, 1 /*Set OvrdThreadId*/, 1 /*Set Dvalid*/, p_unpacr::RAREFYB_DISABLE, 0, 0 /* Set ContextIdInc */, 0, 0, 1);
            #endif
        }


        TT_MOP(0, (reuse_a ? ct_dim : rt_dim) - 1, unp_cfg_context == 0 ? 0 : 0xff); // Run the MOP

        // T6::SEMGET for context release
        t6_semaphore_get(semaphore::UNPACK_SYNC);

        // Switch unpacker config context
        switch_config_context(unp_cfg_context);
    }

If the matmul kernels' performance is similar or better without re-use, then the re-use flags and functionality should be removed.
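
For anyone reviewing this, the per-iteration offset arithmetic from the loop above can be modelled on the host as a rough sketch. `TileOffsets` and `tile_offsets` are hypothetical names made up for illustration and are not part of the LLK API; the tile sizes and indices in the checks are arbitrary example values. The sketch only mirrors the two `offset_address_*` expressions and does not model the unpacker, the MOP, or context switching.

```cpp
#include <cstdint>

// Hypothetical helper (not part of the LLK API): mirrors the two
// offset_address_* expressions from the unpack loop.
struct TileOffsets {
    std::uint32_t a;  // offset added to base_address_a
    std::uint32_t b;  // offset added to base_address_b
};

constexpr TileOffsets tile_offsets(bool reuse_a, std::uint32_t t, std::uint32_t kt_dim,
                                   std::uint32_t tile_index_a, std::uint32_t tile_index_b,
                                   std::uint32_t tile_size_a, std::uint32_t tile_size_b) {
    return TileOffsets{
        // With reuse_a, operand A advances by kt_dim tiles per iteration t.
        tile_size_a * (tile_index_a + (reuse_a ? t * kt_dim : 0u)),
        // Without reuse_a, operand B advances by one tile per iteration t.
        tile_size_b * (tile_index_b + (reuse_a ? 0u : t)),
    };
}

// Example values (assumed): t = 2, kt_dim = 4, tile_index_a = 1, tile_index_b = 3,
// tile_size_a = 2048, tile_size_b = 1024.
static_assert(tile_offsets(true, 2, 4, 1, 3, 2048, 1024).a == 2048u * 9u, "A strides by kt_dim");
static_assert(tile_offsets(true, 2, 4, 1, 3, 2048, 1024).b == 1024u * 3u, "B stays fixed");
static_assert(tile_offsets(false, 2, 4, 1, 3, 2048, 1024).a == 2048u * 1u, "A stays fixed");
static_assert(tile_offsets(false, 2, 4, 1, 3, 2048, 1024).b == 1024u * 5u, "B strides by one tile");
```

In both modes exactly one operand's address changes with `t`, which is the behaviour a kernel without the re-use flags would need to replace.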

@ttmtrajkovic @rdjogoTT fyi.


rtawfik01 added the "Performance" label (Feature that helps with performance, not a blocker for functionality) on May 28, 2024
