[Bug]: Specifying the codepage (MCP) does not work on macOS. #267
Comments
Hi!
Thank you for your answer.
The short answer is that on the [...]

The long answer requires a bit more context, unfortunately (sorry for the long comment).

The Underlying Issue

The 7-Zip API always uses wide strings to handle Unicode-aware strings. On Windows, wide strings are always UTF-16 encoded, and 7-Zip follows this encoding. When you use the [...]

On Linux and macOS, wide strings are usually UTF-32 encoded ([...]

First of all, 7-Zip's behavior depends on the format of the archive. For example, take a [...]

Second, even if you are reading only one archive format, 7-Zip may behave differently depending on how the archive was created. Some archivers store filenames in UTF-8. When 7-Zip detects this, it converts the UTF-8 string to UTF-16 and then stores the 16-bit units as 32-bit characters, just as it does for the 7z format, producing an invalid UTF-32 string. On the other hand, if a zip file was created using the Shift-JIS encoding, 7-Zip returns a wide string that stores each Shift-JIS byte of the string in a 32-bit character, without any conversion to UTF-32. For example, the character [...]

This confusing behavior of 7-Zip is a problem because bit7z expects UTF-32 encoded strings from 7-Zip and tries to convert them to UTF-8 on that assumption, producing the garbage strings you noticed.

Workarounds

I have not yet found a clean way to handle this whole mess within bit7z. In the current stable v4.0.9 there's no workaround that I can think of, unfortunately. However, on the [...]

This means that you can now handle the string encoding yourself. For example, if you know that the archive uses the Shift-JIS encoding, and you need to print its items in UTF-8, you can now write a string conversion function using iconv:

#include <cerrno> // For errno
#include <cstring> // For strerror
#include <format> // For std::format
#include <memory> // For std::unique_ptr
#include <stdexcept> // For std::runtime_error
#include <string> // For std::string and std::wstring
#include <type_traits> // For std::remove_pointer_t
#include <iconv.h>

// Define a type alias for the converter, using std::unique_ptr for automatic resource management.
using converter_t = std::unique_ptr<std::remove_pointer_t<iconv_t>, decltype(&::iconv_close)>;

auto make_converter(const char* to, const char* from) -> converter_t {
    const auto converter = ::iconv_open(to, from);
    if (converter == reinterpret_cast<decltype(converter)>(-1)) {
        throw std::runtime_error(std::format("Failed to open iconv: {}", strerror(errno)));
    }
    return {converter, ::iconv_close};
}

auto from_shiftjis_to_utf8(const std::wstring& wstr) -> std::string {
    // Create an iconv converter to transform from SHIFT_JIS to UTF-8 encoding.
    static const converter_t converter = make_converter("UTF-8", "SHIFT_JIS");

    // Convert the 7-Zip wide string (32-bit units) to a narrow string (8-bit units).
    // 7-Zip stores each byte of the original Shift-JIS string in 32-bit units
    // (e.g., the Shift-JIS byte sequence 0x93D6 becomes 0x00000093 0x000000D6 in wstr).
    // This conversion keeps each Shift-JIS byte intact, just storing them in an 8-bit string.
    std::string str{wstr.cbegin(), wstr.cend()};

    // Compute the maximum possible size for the output buffer (UTF-8 uses up to 4 bytes per codepoint).
    const std::size_t dstMaxLen = 4 * str.size();

    // Allocate space for the resulting UTF-8 string.
    std::string result(dstMaxLen, '\0');

    // Set up the input and output buffers for iconv.
    char* src = str.data();
    std::size_t srcLen = str.size();
    char* dst = result.data();
    std::size_t dstLen = result.size();

    // Perform the actual encoding conversion from Shift-JIS to UTF-8 using iconv.
    const auto iconv_result = ::iconv(converter.get(), &src, &srcLen, &dst, &dstLen);
    if (iconv_result == static_cast<decltype(iconv_result)>(-1)) {
        throw std::runtime_error(std::format("Failed to convert to UTF-8: {}", strerror(errno)));
    }

    // Adjust the result string size based on the actual number of bytes written.
    result.resize(dstMaxLen - dstLen);
    return result;
}

and then use it with bit7z:

const Bit7zLibrary lib{"./7z.so"};
const BitArchiveReader reader{lib, "./スマトラ im11462659.zip", BitFormat::Auto};
for (const auto& item : reader) {
    std::println("{:2}) {}", item.index(), from_shiftjis_to_utf8(item.rawPath()));
}

This works for printing the items in an archive, but there's still no workaround for extracting them with the correct name to the filesystem.
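
One caveat about the narrowing step in from_shiftjis_to_utf8: building a std::string from the wide string's iterators silently truncates any unit above 0xFF, so a string that does not actually hold raw bytes would be mangled without any error. Below is a minimal sketch of a checked alternative. The helper name narrow_bytes is hypothetical and not part of bit7z, and the sketch assumes every unit is expected to hold exactly one byte of the original encoding.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of bit7z): copies each wide unit into one byte,
// rejecting any unit that does not fit in 8 bits instead of truncating it.
auto narrow_bytes(const std::wstring& wstr) -> std::string {
    std::string bytes;
    bytes.reserve(wstr.size());
    for (const wchar_t unit : wstr) {
        // Converting to an unsigned type also maps negative values (where
        // wchar_t is signed) to large numbers, so they are rejected as well.
        const auto value = static_cast<unsigned long>(unit);
        if (value > 0xFF) {
            throw std::invalid_argument("wide string does not hold raw bytes");
        }
        bytes.push_back(static_cast<char>(static_cast<unsigned char>(value)));
    }
    return bytes;
}
```

If this check fails, the wide string was probably not produced by the byte-per-unit path described above, and a Shift-JIS conversion is unlikely to be the right one.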
Thank you for taking the time to provide such a detailed and thoughtful response to my question. I will carefully explore the code you shared and make every effort to refine and discover the most effective approach. At the same time, I want to express my sincere respect and gratitude for your incredible work in creating such an outstanding library.
You're welcome, and thank you for your kind words! I really appreciate your support and enthusiasm for the project. Feel free to reach out if you have any further questions or ideas as you work through the code.
Hi! [...]

const Bit7zLibrary lib{"./7z.so"};
const BitArchiveReader reader{lib, "./スマトラ im11462659.zip", BitFormat::Auto};
reader.extractTo("./out/", [&reader](std::uint32_t index, const std::string&) {
    const auto item = reader.itemAt(index);
    return from_shiftjis_to_utf8(item.rawPath());
});

In this way, the extracted files will have correctly encoded names on the filesystem. I'm also working on fixing this issue so that library users won't need any workaround.
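
The same callback approach should also cover the other case described earlier, where 7-Zip converts UTF-8 names to UTF-16 and stores each 16-bit unit in a 32-bit character. Below is a minimal sketch of a conversion for that case. The function name utf16_units_to_utf8 is hypothetical and not part of bit7z, and the sketch assumes each wide unit holds exactly one UTF-16 code unit. Unpaired surrogates and out-of-range values are replaced with U+FFFD.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of bit7z): treats every wchar_t as one UTF-16
// code unit, recombines surrogate pairs, and encodes the result as UTF-8.
auto utf16_units_to_utf8(const std::wstring& wstr) -> std::string {
    std::string out;
    for (std::size_t i = 0; i < wstr.size(); ++i) {
        auto cp = static_cast<std::uint32_t>(wstr[i]);
        // A high surrogate followed by a low surrogate encodes one code point above U+FFFF.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < wstr.size()) {
            const auto low = static_cast<std::uint32_t>(wstr[i + 1]);
            if (low >= 0xDC00 && low <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                ++i;
            }
        }
        // Anything still in the surrogate range, or beyond Unicode, is invalid.
        if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) {
            cp = 0xFFFD;
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

In the rename callback, a function like this would take the place of from_shiftjis_to_utf8, provided you know the archive's names were stored as UTF-8 rather than in a legacy code page.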
bit7z version
4.0.x
Compilation options
BIT7Z_7ZIP_VERSION, BIT7Z_ANALYZE_CODE, BIT7Z_AUTO_FORMAT, BIT7Z_LINK_LIBCPP, BIT7Z_USE_NATIVE_STRING, BIT7Z_USE_STD_BYTE
7-zip version
v23.01
7-zip shared library used
7z.dll / 7z.so
Compilers
Clang
Compiler versions
AppleClang 16
Architecture
arm64
Operating system
macOS
Operating system versions
macOS 15.2
Bug description
Use the compressed archive provided in issue #248.
Steps to reproduce
No response
Expected behavior
No response
Relevant compilation output
No response