Review Src register re-use in matmul #5

Open
rtawfik01 opened this issue May 28, 2024 · 0 comments
Labels
Performance Feature that helps with performance, not a blocker for functionality

Comments

@rtawfik01
Collaborator

Source register re-use in the llk_unpack_AB_matmul.h and llk_math_matmul.h kernels was a necessity on wormhole_b0 because the unpacker bandwidth was too low to saturate the compute engine. That is no longer the case on blackhole, so we should review the matmul kernel and measure its performance without Source register re-use. The re-use logic in question is found here:

    for (uint t = 0; t < t_dim; t++) {

        std::uint32_t offset_address_a = tile_size_a*(tile_index_a + (reuse_a ? (t*kt_dim) : (0)));
        std::uint32_t offset_address_b = tile_size_b*(tile_index_b + (reuse_a ? (0       ) : (t)));
        std::uint32_t address_a = base_address_a + offset_address_a;
        std::uint32_t address_b = base_address_b + offset_address_b;

        // Wait for free context
        wait_for_next_context(2);

        semaphore_post(semaphore::UNPACK_SYNC);  // Trisc::SEMPOST for context acquire

        // Program unpacker 1 base address
        if (0 == unp_cfg_context) {
            cfg[THCON_SEC0_REG3_Base_address_ADDR32] = address_b;
            cfg[THCON_SEC1_REG3_Base_address_ADDR32] = address_a;
        } else {
            cfg[THCON_SEC0_REG3_Base_cntx1_address_ADDR32] = address_b;
            cfg[THCON_SEC1_REG3_Base_cntx1_address_ADDR32] = address_a;
        }

        if (reuse_a) {
            #if SKIP_UNP == 1
                TTI_NOP;
            #else
                if (partial_face) {
                    // Do face by face unpacking
                    TTI_UNPACR(SrcB, 0b00010001, 0, 0, 0, 1 /*Set OvrdThreadId*/, 0 /*Set Dvalid*/, p_unpacr::RAREFYB_DISABLE, 0, 0 /* Set ContextIdInc */, 0, 0, 1);
                    TTI_UNPACR(SrcB, 0b00010001, 0, 0, 0, 1 /*Set OvrdThreadId*/, 1 /*Set Dvalid*/, p_unpacr::RAREFYB_DISABLE, 0, 0 /* Set ContextIdInc */, 0, 0, 1);
                    TTI_SETADCZW(p_setadc::UNP_B, 0, 0, 0, 0, 0b0101); // Set ch0_z=0, ch1_z=0
                } else {
                    TTI_UNPACR(SrcB, 0, 0, 0, 0, 1 /*Set OvrdThreadId*/, 1 /*Set Dvalid*/, p_unpacr::RAREFYB_DISABLE, 0, 0 /* Set ContextIdInc */, 0, 0, 1);
                }
            #endif
        } else {
            #if SKIP_UNP == 1
                TTI_NOP;
            #else
                TTI_UNPACR(SrcA, 0, 0, 0, 0, 1 /*Set OvrdThreadId*/, 1 /*Set Dvalid*/, p_unpacr::RAREFYB_DISABLE, 0, 0 /* Set ContextIdInc */, 0, 0, 1);
            #endif
        }


        TT_MOP(0, (reuse_a ? ct_dim : rt_dim) - 1, unp_cfg_context == 0 ? 0 : 0xff); // Run the MOP

        // T6::SEMGET for context release
        t6_semaphore_get(semaphore::UNPACK_SYNC);

        // Switch unpacker config context
        switch_config_context(unp_cfg_context);
    }
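
To make the two addressing modes in the loop above easier to compare, here is a minimal host-side sketch of just the offset arithmetic. The helper name `matmul_unpack_addresses` is hypothetical and not part of the LLK API; it only mirrors the `offset_address_a` / `offset_address_b` expressions and touches no hardware registers.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical helper (not in the LLK sources): returns the
// (address_a, address_b) pair the loop would program for each t.
std::vector<std::pair<std::uint32_t, std::uint32_t>> matmul_unpack_addresses(
    std::uint32_t base_address_a, std::uint32_t base_address_b,
    std::uint32_t tile_index_a, std::uint32_t tile_index_b,
    std::uint32_t tile_size_a, std::uint32_t tile_size_b,
    std::uint32_t t_dim, std::uint32_t kt_dim, bool reuse_a) {
    std::vector<std::pair<std::uint32_t, std::uint32_t>> addresses;
    addresses.reserve(t_dim);
    for (std::uint32_t t = 0; t < t_dim; t++) {
        // reuse_a: A advances by kt_dim tiles per iteration, B stays put.
        // !reuse_a: A stays put, B advances by one tile per iteration.
        std::uint32_t offset_address_a =
            tile_size_a * (tile_index_a + (reuse_a ? (t * kt_dim) : 0));
        std::uint32_t offset_address_b =
            tile_size_b * (tile_index_b + (reuse_a ? 0 : t));
        addresses.emplace_back(base_address_a + offset_address_a,
                               base_address_b + offset_address_b);
    }
    return addresses;
}
```

With `reuse_a` set, only the A address changes between iterations (in steps of `kt_dim` tiles); with it cleared, only the B address changes (one tile per step). Removing re-use would presumably collapse this to a single addressing pattern, which is part of what the review needs to confirm.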

If the matmul kernel's performance is similar or better without re-use, then the re-use flags and functionality should be removed.

@ttmtrajkovic @rdjogoTT fyi.

@rtawfik01 added the Performance label on May 28, 2024