[Bug]: Specifying the codepage (MCP) does not work on macOS. #267

Open
1 task done
StingKo opened this issue Dec 24, 2024 · 6 comments

Comments

@StingKo

StingKo commented Dec 24, 2024

bit7z version

4.0.x

Compilation options

BIT7Z_7ZIP_VERSION, BIT7Z_ANALYZE_CODE, BIT7Z_AUTO_FORMAT, BIT7Z_LINK_LIBCPP, BIT7Z_USE_NATIVE_STRING, BIT7Z_USE_STD_BYTE

7-zip version

v23.01

7-zip shared library used

7z.dll / 7z.so

Compilers

Clang

Compiler versions

AppleClang 16

Architecture

arm64

Operating system

macOS

Operating system versions

macos 15.2

Bug description

const bit7z::BitArchiveReader arc{
    m_lib,
    url,
    bit7z::BitFormat::Auto
};
arc.useFormatProperty(L"cp", 932u);
for (const auto &item: arc) {
    std::cout << "path: " << item.nativePath() << std::endl;
}

Use the compressed package provided in the Issues. #248

70f3fea4e56613db63b5a635b4ad77be

Steps to reproduce

No response

Expected behavior

No response

Relevant compilation output

No response


@rikyoz
Owner

rikyoz commented Dec 24, 2024

Hi!
Unfortunately, 7-Zip doesn't support specifying the codepage on Unix systems, as stated here by its creator (he actually says Linux in that comment, but the same is true for macOS).
I also verified this in 7-Zip's source code: the Unix string conversion functions (here and here) basically ignore the codepage parameter unless it refers to the UTF-8 encoding.
In this case there is not much bit7z can do.

@StingKo
Author

StingKo commented Dec 24, 2024

Hi! Unfortunately, 7-Zip doesn't support specifying the codepage on Unix systems, as stated here by its creator (he actually says Linux in that comment, but the same is true for macOS). I also verified this in 7-Zip's source code: the Unix string conversion functions (here and here) basically ignore the codepage parameter unless it refers to the UTF-8 encoding. In this case there is not much bit7z can do.

Thank you for your answer.
Sorry, I am not deeply familiar with reading the 7-zip source code. In this case, do you know if there is another way to parse content with a specific code page?
I noticed that some people use iconv to achieve this (here), but I don't know how to combine it with bit7z.

@rikyoz
Owner

rikyoz commented Dec 27, 2024

In this case, do you know if there is another way to parse content with a specific code page?

The short answer is that the develop branch already has some API improvements that can be used to implement a partial workaround. I hope to add more for the next v4.1-beta release.

The long answer requires a bit more context, unfortunately (sorry for the long comment).

The Underlying Issue

The 7-Zip API always uses wide strings to handle Unicode-aware strings.

On Windows, wide strings are always UTF-16 encoded, and 7-Zip follows this encoding. When you use the mcp parameter, 7-Zip converts the original strings from the given codepage to UTF-16.

On Linux and macOS, wide strings are usually UTF-32 encoded (sizeof(wchar_t) = 4), but 7-Zip's handling of the encoding is a total mess.

First of all, 7-Zip's behavior depends on the format of the archive.

For example, take a .7z archive that stores a file with the Unicode character 𤭢 (U+24B62) in the name. In a valid UTF-32 wide string, it should be encoded as L"\x00024B62". Instead, 7-Zip returns the wide string as L"\x0000d852\x0000df62". That is, it simply takes the UTF-16 encoding of the character (the surrogate pair 0xD852 0xDF62), and stores each of its two 16-bit code units as a 32-bit wide character.
This wide string is not a valid UTF-32 string.
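
To make the surrogate-pair behavior above concrete, here is a minimal sketch of how such a string could be repaired into valid UTF-32. The helper name utf16_units_to_utf32 is hypothetical and is not part of bit7z or 7-Zip. It works on std::u32string rather than std::wstring so that it behaves the same regardless of the platform's wchar_t width; on Linux or macOS you would first copy the units of the wide string into a std::u32string.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a bit7z or 7-Zip API): rebuilds valid UTF-32 from a
// string whose 32-bit units actually hold UTF-16 code units.
std::u32string utf16_units_to_utf32(const std::u32string& units) {
    std::u32string out;
    out.reserve(units.size());
    for (std::size_t i = 0; i < units.size(); ++i) {
        const char32_t hi = units[i];
        const bool isHighSurrogate = hi >= 0xD800 && hi <= 0xDBFF;
        if (isHighSurrogate && i + 1 < units.size()) {
            const char32_t lo = units[i + 1];
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                // Combine the surrogate pair into a single code point.
                out.push_back(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00));
                ++i; // The low surrogate has been consumed.
                continue;
            }
        }
        // BMP code unit (or an unpaired surrogate): copied unchanged.
        out.push_back(hi);
    }
    return out;
}
```

With the units 0xD852 and 0xDF62 from the example, the helper yields the single code point 0x24B62. Unpaired surrogates are passed through as they are; a stricter version might reject them instead.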

Second, even if you are reading only one archive format, 7-Zip may behave differently depending on how the archive was created.

Some archivers store filenames in UTF-8. When 7-Zip detects this, it converts the UTF-8 string to UTF-16, and then stores the 16-bit units as 32-bit characters, just as it does for the 7z format, producing an invalid UTF-32 string.

On the other hand, if a zip file was created using the Shift-JIS encoding, 7-Zip returns a wide string that stores each Shift-JIS byte of the string in a 32-bit character, without any conversion to UTF-32. For example, the character 敦 is encoded as \x93D6 in Shift-JIS, and becomes \x00000093\x000000D6 in 7-Zip, instead of the correct UTF-32 \x00006566.
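
As a rough illustration of the byte-per-unit layout described above, the original bytes can be recovered by narrowing every unit back to 8 bits. This is only a sketch: raw_units_to_bytes is a hypothetical name, and the range check is an added assumption so that a string that does not follow this layout (for example, one holding UTF-16 units) is rejected rather than silently truncated.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a bit7z API): recovers the original byte string from
// a wide string in which every unit holds exactly one byte.
std::string raw_units_to_bytes(const std::wstring& raw) {
    std::string bytes;
    bytes.reserve(raw.size());
    for (const wchar_t unit : raw) {
        const auto value = static_cast<std::uint32_t>(unit);
        if (value > 0xFFu) {
            // Assumption: anything above one byte means the string is not in
            // the byte-per-unit layout, so refuse to narrow it.
            throw std::invalid_argument("wide string does not hold raw bytes");
        }
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(value)));
    }
    return bytes;
}
```

For the units 0x00000093 and 0x000000D6 this returns the two bytes 0x93 0xD6, which can then be passed to a Shift-JIS decoder.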

This confusing behavior of 7-Zip is a problem because bit7z expects UTF-32 encoded strings from 7-Zip and tries to convert them to UTF-8 on that assumption, producing the garbage strings you noticed.
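
To see where the garbage comes from, consider a plain UTF-32 to UTF-8 encoder applied to the byte-per-unit string. The sketch below is a generic encoder written for illustration, not bit7z's actual conversion code; it assumes every input unit is a valid code point no larger than 0x10FFFF.

```cpp
#include <string>

// Appends the standard UTF-8 encoding of one code point (assumed <= 0x10FFFF).
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Illustrative encoder: treats every 32-bit unit as a Unicode code point.
std::string utf32_to_utf8(const std::u32string& text) {
    std::string out;
    for (const char32_t cp : text) {
        append_utf8(out, cp);
    }
    return out;
}
```

Fed the units 0x93 and 0xD6, the encoder treats them as the code points U+0093 and U+00D6 and emits the bytes C2 93 C3 96: a control character followed by "Ö", rather than the intended Japanese character.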

Workarounds

I have not yet found a clean way to handle this whole mess within bit7z.
This is actually the biggest obstacle for the release of the next v4.1-beta.

In the current stable v4.0.9 there's no workaround that I can think of, unfortunately.

However, on the develop branch, I have added a new rawPath() method to the BitArchiveItem class. This function is like path() or nativePath(), but it always returns the "raw" wide string provided by 7-Zip, without any conversion attempt by bit7z.

This means that you can now handle the string encoding yourself.

For example, if you know that the archive uses the Shift-JIS encoding, and you need to print its items in UTF-8, you can now write a string conversion function using iconv as follows:

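#include <cerrno> // For errno
#include <cstring> // For strerror
#include <format> // For std::format
#include <memory> // For std::unique_ptr
#include <stdexcept> // For std::runtime_error
#include <string> // For std::string and std::wstring
#include <type_traits> // For std::remove_pointer_t

#include <iconv.h>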

// Define a type alias for the converter, using std::unique_ptr for automatic resource management.
using converter_t = std::unique_ptr<std::remove_pointer_t<iconv_t>, decltype(&::iconv_close)>;

auto make_converter(const char* to, const char* from) -> converter_t {
    const auto converter = ::iconv_open(to, from);
    if (converter == reinterpret_cast<decltype(converter)>(-1)) {
        throw std::runtime_error(std::format("Failed to open iconv: {}", strerror(errno)));
    }
    return {converter, ::iconv_close};
}

auto from_shiftjis_to_utf8(const std::wstring& wstr) -> std::string {
    // Create an iconv converter to transform from SHIFT_JIS to UTF-8 encoding.
    static const converter_t converter = make_converter("UTF-8", "SHIFT_JIS");

    // Convert the 7-Zip wide string (32-bit units) to a narrow string (8-bit units).
    // 7-Zip stores each byte of the original Shift-JIS string in 32-bit units
    // (e.g., the Shift-JIS byte sequence 0x93D6 becomes 0x00000093 0x000000D6 in wstr).
    // This conversion keeps each Shift-JIS byte intact, just storing them in an 8-bit string.
    std::string str{wstr.cbegin(), wstr.cend()};

    // Compute the maximum possible size for the output buffer (UTF-8 uses up to 4 bytes per codepoint).
    const std::size_t dstMaxLen = 4 * str.size();

    // Allocate space for the resulting UTF-8 string.
    std::string result(dstMaxLen, '\0');

    // Set up the input and output buffers for iconv.
    char* src = str.data();
    std::size_t srcLen = str.size();
    char* dst = result.data();
    std::size_t dstLen = result.size();

    // Perform the actual encoding conversion from Shift-JIS to UTF-8 using iconv.
    const auto iconv_result = ::iconv(converter.get(), &src, &srcLen, &dst, &dstLen);
    if (iconv_result == static_cast<decltype(iconv_result)>(-1)) {
        throw std::runtime_error(std::format("Failed to convert to UTF-8: {}", strerror(errno)));
    }

    // Adjust the result string size based on the actual number of bytes written.
    result.resize(dstMaxLen - dstLen);

    return result;
}

and then use it with bit7z:

const Bit7zLibrary lib{"./7z.so"};
const BitArchiveReader reader{lib, "./スマトラ im11462659.zip", BitFormat::Auto};
for (const auto& item : reader) {
    std::println("{:2}) {}", item.index(), from_shiftjis_to_utf8(item.rawPath()));
}


This works for printing the items in an archive, but there's still no workaround for extracting them with the correct name to the filesystem.
I'm still trying to figure out the best way to do this.

@StingKo
Author

StingKo commented Dec 27, 2024

This works for printing the items in an archive, but there's still no workaround for extracting them with the correct name to the filesystem. I'm still trying to figure out the best way to do this.

Thank you for taking the time to provide such a detailed and thoughtful response to my question. I will carefully explore the code you shared and make every effort to refine and discover the most effective approach. At the same time, I want to express my sincere respect and gratitude for your incredible work in creating such an outstanding library.

@rikyoz
Owner

rikyoz commented Dec 31, 2024

You're welcome, and thank you for your kind words! I really appreciate your support and enthusiasm for the project. Feel free to reach out if you have any further questions or ideas as you work through the code.

@rikyoz
Owner

rikyoz commented Jan 2, 2025

Hi!
I've just realized that there is already a possible workaround for the extraction problem on the develop branch as well.
I recently introduced an overload of the BitInputArchive::extractTo method that takes a RenameCallback as its second parameter. This callback is called for each item to be extracted: it receives the item's index and the path the item would have inside the output directory, and must return a string with the desired "renamed" path.
In this particular case, the callback's path argument is useless, because it has already been converted to UTF-8 by bit7z and is likely to be garbled by that conversion.
However, in the context of the code example I used in my previous comment, you could write something like this:

const Bit7zLibrary lib{"./7z.so"};
const BitArchiveReader reader{lib, "./スマトラ im11462659.zip", BitFormat::Auto};
reader.extractTo("./out/", [&reader](std::uint32_t index, const std::string&) {
    const auto item = reader.itemAt(index);
    return from_shiftjis_to_utf8(item.rawPath());
});

In this way, the files extracted will have the correctly encoded name on the filesystem.

I'm also working on fixing this issue without needing any workaround by the library user.
