[Bug]: Specifying the codepage (MCP) does not work on macOS. #267

StingKo · 2024-12-24T06:49:49Z

bit7z version

4.0.x

Compilation options

BIT7Z_7ZIP_VERSION, BIT7Z_ANALYZE_CODE, BIT7Z_AUTO_FORMAT, BIT7Z_LINK_LIBCPP, BIT7Z_USE_NATIVE_STRING, BIT7Z_USE_STD_BYTE

7-zip version

v23.01

7-zip shared library used

7z.dll / 7z.so

Compilers

Clang

Compiler versions

AppleClang 16

Architecture

arm64

Operating system

macOS

Operating system versions

macOS 15.2

Bug description

const bit7z::BitArchiveReader arc{
    m_lib,
    url,
    bit7z::BitFormat::Auto
};
arc.useFormatProperty(L"cp", 932u);
for (const auto &item: arc) {
    std::cout << "path: " << item.nativePath() << std::endl;
}

To reproduce, use the archive attached to issue #248.

Steps to reproduce

No response

Expected behavior

No response

Relevant compilation output

No response

Code of Conduct

By submitting this issue, I agree to follow bit7z's Code of Conduct

rikyoz · 2024-12-24T17:07:32Z

Hi!
Unfortunately, 7-Zip doesn't support specifying the codepage on Unix systems, as stated here by its creator (he actually says Linux in that comment, but the same is true for macOS).
I also verified this in 7-Zip's source code: the Unix string conversion functions (here and here) basically ignore the codepage parameter unless it refers to the UTF-8 encoding.
In this case there is not much bit7z can do.

StingKo · 2024-12-24T21:27:28Z

Hi! Unfortunately, 7-Zip doesn't support specifying the codepage on Unix systems, as stated here by its creator (he actually says Linux in that comment, but the same is true for macOS). I also verified this in 7-Zip's source code: the Unix string conversion functions (here and here) basically ignore the codepage parameter unless it refers to the UTF-8 encoding. In this case there is not much bit7z can do.

Thank you for your answer.
Sorry, I am not deeply familiar with reading the 7-zip source code. In this case, do you know if there is another way to parse content with a specific code page?
I noticed that some people use iconv to achieve this (here), but I don’t know how to implement it in conjunction with bit7z.

rikyoz · 2024-12-27T20:08:27Z

In this case, do you know if there is another way to parse content with a specific code page?

The short answer is that the develop branch already has some API improvements that can be used to implement a partial workaround. I hope to add more for the next v4.1-beta release.

The long answer requires a bit more context, unfortunately (sorry for the long comment).

The Underlying Issue

The 7-Zip API always uses wide strings to represent Unicode text.

On Windows, wide strings are always UTF-16 encoded, and 7-Zip follows this encoding. When you use the mcp parameter, 7-Zip converts the original strings from the given codepage to UTF-16.

On Linux and macOS, wide strings are usually UTF-32 encoded (sizeof(wchar_t) = 4), but 7-Zip's handling of the encoding is a total mess.

First of all, 7-Zip's behavior depends on the format of the archive.

For example, take a .7z archive that stores a file with the Unicode character 𤭢 (U+24B62) in the name. In a valid UTF-32 wide string, it should be encoded as L"\x00024B62". Instead, 7-Zip returns the wide string as L"\x0000d852\x0000df62". That is, it simply takes the UTF-16 encoding of the character (the surrogate pair D852 DF62), and stores each of its two 16-bit code units as a 32-bit wide character.
This wide string is not a valid UTF-32 string.
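
To make the surrogate issue concrete, here is a minimal sketch of how such a string could be repaired by hand. It assumes a 32-bit wchar_t (Linux, macOS); merge_surrogates is a hypothetical helper name, not something provided by bit7z or 7-Zip.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumption: wchar_t is 32 bits wide, as on Linux and macOS.
static_assert(sizeof(wchar_t) >= 4, "this sketch targets platforms with a 32-bit wchar_t");

// Hypothetical helper (not part of bit7z or 7-Zip): merges UTF-16 surrogate pairs
// that were stored one per 32-bit unit into single UTF-32 code points.
// Units that are not part of a well-formed pair are copied unchanged.
auto merge_surrogates(const std::wstring& raw) -> std::wstring {
    std::wstring result;
    result.reserve(raw.size());
    for (std::size_t i = 0; i < raw.size(); ++i) {
        const auto unit = static_cast<char32_t>(raw[i]);
        const bool isHigh = unit >= 0xD800 && unit <= 0xDBFF;
        if (isHigh && i + 1 < raw.size()) {
            const auto next = static_cast<char32_t>(raw[i + 1]);
            if (next >= 0xDC00 && next <= 0xDFFF) {
                // Standard UTF-16 decoding of a surrogate pair.
                const char32_t codePoint = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
                assert(codePoint <= 0x10FFFF); // Always holds for a well-formed pair.
                result.push_back(static_cast<wchar_t>(codePoint));
                ++i; // The low surrogate has been consumed.
                continue;
            }
        }
        result.push_back(raw[i]);
    }
    return result;
}
```

Applied to the two units 0xD852 and 0xDF62 from the example above, this yields the single code point 0x24B62.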

Second, even if you are reading only one archive format, 7-Zip may behave differently depending on how the archive was created.

Some archivers store filenames in UTF-8. When 7-Zip detects this, it converts the UTF-8 string to UTF-16, and then stores the 16-bit units as 32-bit characters, just as it does for the 7z format, producing an invalid UTF-32 string.

On the other hand, if a zip file was created using the Shift-JIS encoding, 7-Zip returns a wide string that stores each Shift-JIS byte of the string in its own 32-bit character, without any conversion to UTF-32. For example, the character 敦 is encoded as the two bytes \x93 \xD6 in Shift-JIS, and becomes \x00000093\x000000D6 in 7-Zip, instead of the correct UTF-32 \x00006566.
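
As a rough illustration of this case, the sketch below recovers the original byte string from such a wide string, and refuses to do so if any unit does not fit in one byte (which would suggest the string is not in this "one byte per wide character" form). narrow_raw_bytes is a hypothetical helper name, and the check is only a heuristic: a pure ASCII or Latin-1 range name passes it too.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical helper (not part of bit7z or 7-Zip): returns the original bytes
// if every wide unit holds a value in [0x00, 0xFF], std::nullopt otherwise.
auto narrow_raw_bytes(const std::wstring& raw) -> std::optional<std::string> {
    std::string bytes;
    bytes.reserve(raw.size());
    for (const wchar_t unit : raw) {
        // Go through an unsigned type so that a signed wchar_t cannot hide large values.
        const auto value = static_cast<unsigned long>(unit);
        if (value > 0xFF) {
            return std::nullopt; // Not a byte: the string was probably already decoded.
        }
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(value)));
    }
    assert(bytes.size() == raw.size()); // One output byte per input unit.
    return bytes;
}
```

For the example above, the units 0x93 and 0xD6 come back as the two-byte string "\x93\xD6", which can then be handed to a Shift-JIS decoder.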

This confusing behavior of 7-Zip is a problem because bit7z expects UTF-32 encoded strings from 7-Zip and tries to convert them to UTF-8 on that assumption, producing the garbage strings you noticed.
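
To see where the garbage comes from, here is a textbook UTF-32 to UTF-8 encoder. It is only an illustration of the assumption described above, not bit7z's actual conversion code, and naive_utf32_to_utf8 is a made-up name.

```cpp
#include <cassert>
#include <string>

// Textbook UTF-32 to UTF-8 encoder, for illustration only.
// Surrogate values are not rejected, to keep the sketch short.
auto naive_utf32_to_utf8(const std::wstring& wide) -> std::string {
    std::string out;
    for (const wchar_t unit : wide) {
        const auto cp = static_cast<char32_t>(unit);
        assert(cp <= 0x10FFFF && "not a Unicode code point");
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

Fed the Shift-JIS bytes of 敦 as two separate "code points" (0x93, 0xD6), this produces the bytes C2 93 C3 96, whereas the real code point U+6566 would produce E6 95 A6.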

Workarounds

I have not yet found a clean way to handle this whole mess within bit7z.
This is actually the biggest obstacle for the release of the next v4.1-beta.

In the current stable v4.0.9 there's no workaround that I can think of, unfortunately.

However, on the develop branch, I have added a new rawPath() method to the BitArchiveItem class. This function is like path() or nativePath(), but it always returns the "raw" wide string provided by 7-Zip, without any conversion attempt by bit7z.

This means that you can now handle the string encoding yourself.

For example, if you know that the archive uses the Shift-JIS encoding, and you need to print its items in UTF-8, you can now write a string conversion function using iconv as follows:

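#include <cerrno> // For errno
#include <cstring> // For strerror
#include <format> // For std::format (C++20)
#include <memory> // For std::unique_ptr
#include <stdexcept> // For std::runtime_error
#include <string> // For std::string and std::wstring
#include <type_traits> // For std::remove_pointer_t

#include <iconv.h>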

// Define a type alias for the converter, using std::unique_ptr for automatic resource management.
using converter_t = std::unique_ptr<std::remove_pointer_t<iconv_t>, decltype(&::iconv_close)>;

auto make_converter(const char* to, const char* from) -> converter_t {
    const auto converter = ::iconv_open(to, from);
    if (converter == reinterpret_cast<decltype(converter)>(-1)) {
        throw std::runtime_error(std::format("Failed to open iconv: {}", strerror(errno)));
    }
    return {converter, ::iconv_close};
}

auto from_shiftjis_to_utf8(const std::wstring& wstr) -> std::string {
    // Create an iconv converter to transform from SHIFT_JIS to UTF-8 encoding.
    static const converter_t converter = make_converter("UTF-8", "SHIFT_JIS");

    // Convert the 7-Zip wide string (32-bit units) to a narrow string (8-bit units).
    // 7-Zip stores each byte of the original Shift-JIS string in 32-bit units
    // (e.g., the Shift-JIS byte sequence 0x93D6 becomes 0x00000093 0x000000D6 in wstr).
    // This conversion keeps each Shift-JIS byte intact, just storing them in an 8-bit string.
    std::string str{wstr.cbegin(), wstr.cend()};

    // Compute the maximum possible size for the output buffer (UTF-8 uses up to 4 bytes per codepoint).
    const std::size_t dstMaxLen = 4 * str.size();

    // Allocate space for the resulting UTF-8 string.
    std::string result(dstMaxLen, '\0');

    // Set up the input and output buffers for iconv.
    char* src = str.data();
    std::size_t srcLen = str.size();
    char* dst = result.data();
    std::size_t dstLen = result.size();

    // Perform the actual encoding conversion from Shift-JIS to UTF-8 using iconv.
    const auto iconv_result = ::iconv(converter.get(), &src, &srcLen, &dst, &dstLen);
    if (iconv_result == static_cast<decltype(iconv_result)>(-1)) {
        throw std::runtime_error(std::format("Failed to convert to UTF-8: {}", strerror(errno)));
    }

    // Adjust the result string size based on the actual number of bytes written.
    result.resize(dstMaxLen - dstLen);

    return result;
}

and then use it with bit7z:

const Bit7zLibrary lib{"./7z.so"};
const BitArchiveReader reader{lib, "./スマトラ im11462659.zip", BitFormat::Auto};
for (const auto& item : reader) {
    std::println("{:2}) {}", item.index(), from_shiftjis_to_utf8(item.rawPath()));
}

This works for printing the items in an archive, but there's still no workaround for extracting them with the correct name to the filesystem.
I'm still trying to figure out the best way to do this.

StingKo · 2024-12-27T21:24:23Z

This works for printing the items in an archive, but there's still no workaround for extracting them with the correct name to the filesystem. I'm still trying to figure out the best way to do this.

Thank you for taking the time to provide such a detailed and thoughtful response to my question. I will carefully explore the code you shared and make every effort to refine and discover the most effective approach. At the same time, I want to express my sincere respect and gratitude for your incredible work in creating such an outstanding library.

rikyoz · 2024-12-31T18:06:06Z

You're welcome, and thank you for your kind words! I really appreciate your support and enthusiasm for the project. Feel free to reach out if you have any further questions or ideas as you work through the code.

rikyoz · 2025-01-02T17:29:26Z

Hi!
I've just realized that there is already a possible workaround for the extraction problem on the develop branch as well.
I recently introduced an overload of the BitInputArchive::extractTo method that takes a RenameCallback as its second parameter. This callback is called on each item to be extracted: it receives the item's index and the path the item would have inside the output directory, and it must return a string with the desired "renamed" path.
In this particular case, the path passed to the callback is useless, because it has already been converted to UTF-8 and is likely to have been garbled by that conversion.
However, in the context of the code example I used in my previous comment, you could write something like this:

const Bit7zLibrary lib{"./7z.so"};
const BitArchiveReader reader{lib, "./スマトラ im11462659.zip", BitFormat::Auto};
reader.extractTo("./out/", [&reader](std::uint32_t index, const std::string&) {
    const auto item = reader.itemAt(index);
    return from_shiftjis_to_utf8(item.rawPath());
});

In this way, the files extracted will have the correctly encoded name on the filesystem.
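
If a conversion failure should not propagate out of the rename callback, a small wrapper along these lines might help. This is only a sketch: convert_or_fallback is a hypothetical helper, and whether falling back to the (possibly garbled) default path is acceptable depends on your use case.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of bit7z): runs a user-supplied conversion on the raw
// wide path, and returns the given fallback path if the conversion throws.
auto convert_or_fallback(const std::function<std::string(const std::wstring&)>& convert,
                         const std::wstring& rawPath,
                         const std::string& fallbackPath) -> std::string {
    assert(convert && "a conversion function is required");
    try {
        return convert(rawPath);
    } catch (const std::exception&) {
        // E.g., the name was not valid in the assumed source encoding.
        return fallbackPath;
    }
}
```

Inside the callback, one could then return convert_or_fallback(from_shiftjis_to_utf8, item.rawPath(), path), where path is the callback's second argument.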

I'm also working on fixing this issue without needing any workaround by the library user.

StingKo added the 🐞 bug label Dec 24, 2024

StingKo assigned rikyoz Dec 24, 2024
