Add char8_t mode #529

Open: wants to merge 1 commit into base: master
2 changes: 2 additions & 0 deletions CMakeLists.txt
@@ -36,6 +36,7 @@ cmake_dependent_option(PUGIXML_BUILD_SHARED_AND_STATIC_LIBS

# Expose options from the pugiconfig.hpp
option(PUGIXML_WCHAR_MODE "Enable wchar_t mode" OFF)
option(PUGIXML_CHAR8_MODE "Enable char8_t mode" OFF)
option(PUGIXML_COMPACT "Enable compact mode" OFF)

# Advanced options from pugiconfig.hpp
@@ -51,6 +52,7 @@ endif()

set(PUGIXML_PUBLIC_DEFINITIONS
$<$<BOOL:${PUGIXML_WCHAR_MODE}>:PUGIXML_WCHAR_MODE>
$<$<BOOL:${PUGIXML_CHAR8_MODE}>:PUGIXML_CHAR8_MODE>
$<$<BOOL:${PUGIXML_COMPACT}>:PUGIXML_COMPACT>
$<$<BOOL:${PUGIXML_NO_XPATH}>:PUGIXML_NO_XPATH>
$<$<BOOL:${PUGIXML_NO_STL}>:PUGIXML_NO_STL>
18 changes: 15 additions & 3 deletions docs/manual.adoc
@@ -228,6 +228,8 @@ pugixml uses several defines to control the compilation process. There are two w

[[PUGIXML_WCHAR_MODE]]`PUGIXML_WCHAR_MODE` define toggles between UTF-8 style interface (the in-memory text encoding is assumed to be UTF-8, most functions use `char` as character type) and UTF-16/32 style interface (the in-memory text encoding is assumed to be UTF-16/32, depending on `wchar_t` size, most functions use `wchar_t` as character type). See <<dom.unicode>> for more details.

[[PUGIXML_CHAR8_MODE]]`PUGIXML_CHAR8_MODE` define makes the UTF-8 style interface use `char8_t` instead of `char`.

[[PUGIXML_COMPACT]]`PUGIXML_COMPACT` define activates a different internal representation of document storage that is much more memory efficient for documents with a lot of markup (i.e. nodes and attributes), but is slightly slower to parse and access. For details see <<dom.memory.compact>>.

[[PUGIXML_NO_XPATH]]`PUGIXML_NO_XPATH` define disables XPath. Both XPath interfaces and XPath implementation are excluded from compilation. This option is provided in case you do not need XPath functionality and need to save code space.
@@ -399,7 +401,7 @@ Nodes and attributes do not exist without a document tree, so you can't create t
[[dom.unicode]]
=== Unicode interface

There are two choices of interface and internal representation when configuring pugixml: you can either choose the UTF-8 (also called char) interface or UTF-16/32 (also called wchar_t) one. The choice is controlled via <<PUGIXML_WCHAR_MODE,PUGIXML_WCHAR_MODE>> define; you can set it via `pugiconfig.hpp` or via preprocessor options, as discussed in <<install.building.config>>. If this define is set, the wchar_t interface is used; otherwise (by default) the char interface is used. The exact wide character encoding is assumed to be either UTF-16 or UTF-32 and is determined based on the size of `wchar_t` type.
There are three choices of interface and internal representation when configuring pugixml: you can either choose the UTF-8 (also called char) interface or UTF-16/32 (also called wchar_t) one. The UTF-8 interface can either use char (the default) or char8_t. The choice is controlled via the <<PUGIXML_WCHAR_MODE,PUGIXML_WCHAR_MODE>> and <<PUGIXML_CHAR8_MODE,PUGIXML_CHAR8_MODE>> defines; you can set them via `pugiconfig.hpp` or via preprocessor options, as discussed in <<install.building.config>>. If `PUGIXML_WCHAR_MODE` is set, the wchar_t interface is used; otherwise, if `PUGIXML_CHAR8_MODE` is set, the char8_t interface is used; otherwise (by default) the char interface is used. The exact wide character encoding is assumed to be either UTF-16 or UTF-32 and is determined based on the size of `wchar_t` type.

NOTE: If the size of `wchar_t` is 2, pugixml assumes UTF-16 encoding instead of UCS-2, which means that some characters are represented as two code points.

@@ -411,6 +413,14 @@ const char* xml_node::name() const;
bool xml_node::set_name(const char* value);
----

like this in char8_t mode:

[source]
----
const char8_t* xml_node::name() const;
bool xml_node::set_name(const char8_t* value);
----

and like this in wchar_t mode:

[source]
@@ -420,7 +430,7 @@ bool xml_node::set_name(const wchar_t* value);
----

[[char_t]][[string_t]]
There is a special type, `pugi::char_t`, that is defined as the character type and depends on the library configuration; it will be also used in the documentation hereafter. There is also a type `pugi::string_t`, which is defined as the STL string of the character type; it corresponds to `std::string` in char mode and to `std::wstring` in wchar_t mode.
There is a special type, `pugi::char_t`, that is defined as the character type and depends on the library configuration; it will be also used in the documentation hereafter. There is also a type `pugi::string_t`, which is defined as the STL string of the character type; it corresponds to `std::string` in char mode, `std::u8string` in char8_t mode, and to `std::wstring` in wchar_t mode.

In addition to the interface, the internal implementation changes to store XML data as `pugi::char_t`; this means that these two modes have different memory usage characteristics - generally UTF-8 mode is more memory and performance efficient, especially if `sizeof(wchar_t)` is 4. The conversion to `pugi::char_t` upon document loading and from `pugi::char_t` upon document saving happen automatically, which also carries minor performance penalty. The general advice however is to select the character mode based on usage scenario, i.e. if UTF-8 is inconvenient to process and most of your XML data is non-ASCII, wchar_t mode is probably a better choice.
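The memory claim above can be illustrated with a minimal sketch that counts only the raw buffer bytes for a string of a given length in code units. `storage_bytes` is a hypothetical helper, not part of pugixml, and it ignores the fact that non-ASCII text needs several UTF-8 code units per character.

```cpp
#include <cstddef>

// Hypothetical helper (not a pugixml API): bytes needed to store `length`
// code units plus a zero terminator for a given character type.
template <typename Char>
inline std::size_t storage_bytes(std::size_t length)
{
    return (length + 1) * sizeof(Char);
}

#ifdef __cpp_char8_t
// char8_t has the same size as char, so char8_t mode should not change
// the storage cost relative to the default char mode.
static_assert(sizeof(char8_t) == sizeof(char), "char8_t is one byte wide");
#endif
```

For ASCII-only content, `storage_bytes<wchar_t>(n)` is `sizeof(wchar_t)` times larger than `storage_bytes<char>(n)`, which is where the "especially if `sizeof(wchar_t)` is 4" remark comes from.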

@@ -443,13 +453,15 @@ std::wstring as_wide(const std::string& str);

[NOTE]
====
Most examples in this documentation assume char interface and therefore will not compile with <<PUGIXML_WCHAR_MODE,PUGIXML_WCHAR_MODE>>. This is done to simplify the documentation; usually the only changes you'll have to make is to pass `wchar_t` string literals, i.e. instead of
Most examples in this documentation assume char interface and therefore will not compile with <<PUGIXML_WCHAR_MODE,PUGIXML_WCHAR_MODE>> or <<PUGIXML_CHAR8_MODE,PUGIXML_CHAR8_MODE>>. This is done to simplify the documentation; usually the only changes you'll have to make is to pass the appropriate string literals, i.e. instead of

`xml_node node = doc.child("bookstore").find_child_by_attribute("book", "id", "12345");`

you'll have to use

`xml_node node = doc.child(L"bookstore").find_child_by_attribute(L"book", L"id", L"12345");`

in wchar_t mode.
====
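The note spells out only the `L"..."` form. A mode-independent spelling is possible with a macro in the style of `PUGIXML_TEXT`, which this PR extends with a `u8` prefix. The sketch below is self-contained and only mimics that macro: `DEMO_TEXT`, `DEMO_WCHAR_MODE`, `DEMO_CHAR8_MODE`, `demo_char_t`, `demo_string_t` and `bookstore_tag` are hypothetical names, and the char8_t branch assumes a compiler that defines `__cpp_char8_t`.

```cpp
#include <cstddef>
#include <string>

// Hypothetical stand-ins for PUGIXML_TEXT / pugi::char_t. Pick a mode by
// defining DEMO_WCHAR_MODE or DEMO_CHAR8_MODE before this point.
#if defined(DEMO_WCHAR_MODE)
#   define DEMO_TEXT(t) L ## t
typedef wchar_t demo_char_t;
#elif defined(DEMO_CHAR8_MODE) && defined(__cpp_char8_t)
#   define DEMO_TEXT(t) u8 ## t
typedef char8_t demo_char_t;
#else
#   define DEMO_TEXT(t) t
typedef char demo_char_t;
#endif

// Stand-in for pugi::string_t: the STL string of the selected character type.
typedef std::basic_string<demo_char_t> demo_string_t;

// Compiles unchanged in every mode because the literal prefix comes from the macro.
inline demo_string_t bookstore_tag()
{
    return DEMO_TEXT("bookstore");
}
```

With the real library, the equivalent would presumably be `doc.child(PUGIXML_TEXT("bookstore"))`, so the same call compiles in char, char8_t and wchar_t modes.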

[[dom.thread]]
3 changes: 3 additions & 0 deletions src/pugiconfig.hpp
@@ -17,6 +17,9 @@
// Uncomment this to enable wchar_t mode
// #define PUGIXML_WCHAR_MODE

// Uncomment this to enable char8_t mode
//#define PUGIXML_CHAR8_MODE

// Uncomment this to enable compact mode
// #define PUGIXML_COMPACT

76 changes: 70 additions & 6 deletions src/pugixml.cpp
@@ -217,6 +217,8 @@ PUGI__NS_BEGIN

#ifdef PUGIXML_WCHAR_MODE
return wcslen(s);
#elif defined(PUGIXML_CHAR8_MODE)
return strlen(reinterpret_cast<const char*>(s));
Review comment:

Why is `strlength` not implemented in terms of `std::char_traits<pugi::char_t>::length`, like you did for the tests?

Author reply:

char_traits are only available when compiling with STL-support. The tests don't have that restriction.
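
The trade-off in this thread can be sketched without the STL. The snippet below is an illustration only, not code from the PR: `demo_char_t`, `strlength_cast` and `strlength_loop` are hypothetical names, and the fallback to `unsigned char` exists only so the sketch also compiles before C++20.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical stand-in for pugi::char_t in char8_t mode; falls back to
// unsigned char when the compiler lacks char8_t (pre-C++20).
#ifdef __cpp_char8_t
typedef char8_t demo_char_t;
#else
typedef unsigned char demo_char_t;
#endif

// Option A (the approach taken in the diff): reuse strlen through a cast.
// Reading the bytes through const char* is permitted by the aliasing rules.
inline std::size_t strlength_cast(const demo_char_t* s)
{
    return std::strlen(reinterpret_cast<const char*>(s));
}

// Option B: a hand-written loop that needs neither <string> nor a cast.
inline std::size_t strlength_loop(const demo_char_t* s)
{
    const demo_char_t* end = s;
    while (*end) ++end;
    return static_cast<std::size_t>(end - s);
}
```

Both variants return the number of code units before the terminator. Neither depends on `std::char_traits`, so both remain usable when the library is built with `PUGIXML_NO_STL`.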

#else
return strlen(s);
#endif
@@ -229,6 +231,8 @@ PUGI__NS_BEGIN

#ifdef PUGIXML_WCHAR_MODE
return wcscmp(src, dst) == 0;
#elif defined(PUGIXML_CHAR8_MODE)
return strcmp(reinterpret_cast<const char*>(src), reinterpret_cast<const char*>(dst)) == 0;
#else
return strcmp(src, dst) == 0;
#endif
@@ -2300,7 +2304,7 @@ PUGI__NS_BEGIN
return wchar_decoder::process(str, length, 0, utf8_counter());
}

PUGI__FN void as_utf8_end(char* buffer, size_t size, const wchar_t* str, size_t length)
PUGI__FN void as_utf8_end(u8char_t* buffer, size_t size, const wchar_t* str, size_t length)
{
// convert to utf8
uint8_t* begin = reinterpret_cast<uint8_t*>(buffer);
@@ -2312,13 +2316,13 @@ PUGI__NS_BEGIN
}

#ifndef PUGIXML_NO_STL
PUGI__FN std::string as_utf8_impl(const wchar_t* str, size_t length)
PUGI__FN std::basic_string<u8char_t> as_utf8_impl(const wchar_t* str, size_t length)
{
// first pass: get length in utf8 characters
size_t size = as_utf8_begin(str, length);

// allocate resulting string
std::string result;
std::basic_string<u8char_t> result;
result.resize(size);

// second pass: convert to utf8
@@ -3503,7 +3507,7 @@ PUGI__NS_BEGIN
#else
static char_t* parse_skip_bom(char_t* s)
{
return (s[0] == '\xef' && s[1] == '\xbb' && s[2] == '\xbf') ? s + 3 : s;
return (s[0] == char_t('\xef') && s[1] == char_t('\xbb') && s[2] == char_t('\xbf')) ? s + 3 : s;
}
#endif

@@ -4607,6 +4611,8 @@ PUGI__NS_BEGIN
{
#ifdef PUGIXML_WCHAR_MODE
return wcstod(value, 0);
#elif defined(PUGIXML_CHAR8_MODE)
return strtod(reinterpret_cast<const char*>(value), 0);
#else
return strtod(value, 0);
#endif
@@ -4616,6 +4622,8 @@ PUGI__NS_BEGIN
{
#ifdef PUGIXML_WCHAR_MODE
return static_cast<float>(wcstod(value, 0));
#elif defined(PUGIXML_CHAR8_MODE)
return static_cast<float>(strtod(reinterpret_cast<const char*>(value), 0));
#else
return static_cast<float>(strtod(value, 0));
#endif
@@ -4674,6 +4682,8 @@ PUGI__NS_BEGIN
for (; buf[offset]; ++offset) wbuf[offset] = buf[offset];

return strcpy_insitu(dest, header, header_mask, wbuf, offset);
#elif defined(PUGIXML_CHAR8_MODE)
return strcpy_insitu(dest, header, header_mask, reinterpret_cast<const char8_t*>(buf), strlen(reinterpret_cast<const char*>(buf)));
#else
return strcpy_insitu(dest, header, header_mask, buf, strlen(buf));
#endif
@@ -5104,12 +5114,24 @@ namespace pugi

#ifndef PUGIXML_NO_STL
PUGI__FN xml_writer_stream::xml_writer_stream(std::basic_ostream<char, std::char_traits<char> >& stream): narrow_stream(&stream), wide_stream(0)
#ifdef PUGIXML_CHAR8_MODE
, utf8_stream(0)
#endif
{
}

PUGI__FN xml_writer_stream::xml_writer_stream(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& stream): narrow_stream(0), wide_stream(&stream)
#ifdef PUGIXML_CHAR8_MODE
, utf8_stream(0)
#endif
{
}

#ifdef PUGIXML_CHAR8_MODE
PUGI__FN xml_writer_stream::xml_writer_stream(std::basic_ostream<char8_t, std::char_traits<char8_t> >& stream): narrow_stream(0), wide_stream(0), utf8_stream(&stream)
{
}
#endif

PUGI__FN void xml_writer_stream::write(const void* data, size_t size)
{
@@ -5118,6 +5140,13 @@ namespace pugi
assert(!wide_stream);
narrow_stream->write(reinterpret_cast<const char*>(data), static_cast<std::streamsize>(size));
}
#ifdef PUGIXML_CHAR8_MODE
else if (utf8_stream)
{
assert(!wide_stream);
utf8_stream->write(reinterpret_cast<const char8_t*>(data), static_cast<std::streamsize>(size));
}
#endif
else
{
assert(wide_stream);
@@ -6492,6 +6521,15 @@ namespace pugi

print(writer, indent, flags, encoding_wchar, depth);
}

#ifdef PUGIXML_CHAR8_MODE
PUGI__FN void xml_node::print(std::basic_ostream<char8_t, std::char_traits<char8_t> >& stream, const char_t* indent, unsigned int flags, unsigned int depth) const
{
xml_writer_stream writer(stream);

print(writer, indent, flags, encoding_wchar, depth);
}
#endif
#endif

PUGI__FN ptrdiff_t xml_node::offset_debug() const
@@ -7314,6 +7352,15 @@ namespace pugi

return impl::load_stream_impl(static_cast<impl::xml_document_struct*>(_root), stream, options, encoding_wchar, &_buffer);
}

#ifdef PUGIXML_CHAR8_MODE
PUGI__FN xml_parse_result xml_document::load(std::basic_istream<char8_t, std::char_traits<char8_t> >& stream, unsigned int options)
{
reset();

return impl::load_stream_impl(static_cast<impl::xml_document_struct*>(_root), stream, options, encoding_utf8, &_buffer);
}
#endif
#endif

PUGI__FN xml_parse_result xml_document::load_string(const char_t* contents, unsigned int options)
@@ -7416,6 +7463,15 @@ namespace pugi

save(writer, indent, flags, encoding_wchar);
}

#ifdef PUGIXML_CHAR8_MODE
PUGI__FN void xml_document::save(std::basic_ostream<char8_t, std::char_traits<char8_t> >& stream, const char_t* indent, unsigned int flags) const
{
xml_writer_stream writer(stream);

save(writer, indent, flags, encoding_wchar);
}
#endif
#endif

PUGI__FN bool xml_document::save_file(const char* path_, const char_t* indent, unsigned int flags, xml_encoding encoding) const
@@ -7446,14 +7502,14 @@ namespace pugi
}

#ifndef PUGIXML_NO_STL
PUGI__FN std::string PUGIXML_FUNCTION as_utf8(const wchar_t* str)
PUGI__FN std::basic_string<u8char_t> PUGIXML_FUNCTION as_utf8(const wchar_t* str)
{
assert(str);

return impl::as_utf8_impl(str, impl::strlength_wide(str));
}

PUGI__FN std::string PUGIXML_FUNCTION as_utf8(const std::basic_string<wchar_t>& str)
PUGI__FN std::basic_string<u8char_t> PUGIXML_FUNCTION as_utf8(const std::basic_string<wchar_t>& str)
{
return impl::as_utf8_impl(str.c_str(), str.size());
}
@@ -8094,6 +8150,9 @@ PUGI__NS_BEGIN
{
#ifdef PUGIXML_WCHAR_MODE
return wcschr(s, c);
#elif defined(PUGIXML_CHAR8_MODE)
return reinterpret_cast<const char8_t*>(
strchr(reinterpret_cast<const char*>(s), static_cast<char>(c)));
#else
return strchr(s, c);
#endif
@@ -8104,6 +8163,9 @@ PUGI__NS_BEGIN
#ifdef PUGIXML_WCHAR_MODE
// MSVC6 wcsstr bug workaround (if s is empty it always returns 0)
return (*p == 0) ? s : wcsstr(s, p);
#elif defined(PUGIXML_CHAR8_MODE)
return reinterpret_cast<const char8_t*>(
strstr(reinterpret_cast<const char*>(s), reinterpret_cast<const char*>(p)));
#else
return strstr(s, p);
#endif
@@ -8550,6 +8612,8 @@ PUGI__NS_BEGIN
// parse string
#ifdef PUGIXML_WCHAR_MODE
return wcstod(string, 0);
#elif defined(PUGIXML_CHAR8_MODE)
return strtod(reinterpret_cast<const char*>(string), 0);
#else
return strtod(string, 0);
#endif
33 changes: 31 additions & 2 deletions src/pugixml.hpp
@@ -122,10 +122,17 @@
# endif
#endif

#if defined(PUGIXML_CHAR8_MODE) && !defined(__cpp_char8_t)
# error "char8_t mode requires C++20 or later"
#endif

// Character interface macros
#ifdef PUGIXML_WCHAR_MODE
# define PUGIXML_TEXT(t) L ## t
# define PUGIXML_CHAR wchar_t
#elif defined(PUGIXML_CHAR8_MODE)
# define PUGIXML_TEXT(t) u8 ## t
# define PUGIXML_CHAR char8_t
#else
# define PUGIXML_TEXT(t) t
# define PUGIXML_CHAR char
@@ -136,6 +143,13 @@ namespace pugi
// Character type used for all internal storage and operations; depends on PUGIXML_WCHAR_MODE
typedef PUGIXML_CHAR char_t;

// Character type used for UTF-8; depends on PUGIXML_CHAR8_MODE
#ifdef PUGIXML_CHAR8_MODE
typedef char8_t u8char_t;
#else
typedef char u8char_t;
#endif

#ifndef PUGIXML_NO_STL
// String type used for operations that work with STL string; depends on PUGIXML_WCHAR_MODE
typedef std::basic_string<PUGIXML_CHAR, std::char_traits<PUGIXML_CHAR>, std::allocator<PUGIXML_CHAR> > string_t;
@@ -351,12 +365,18 @@ namespace pugi
// Construct writer from an output stream object
xml_writer_stream(std::basic_ostream<char, std::char_traits<char> >& stream);
xml_writer_stream(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& stream);
#ifdef PUGIXML_CHAR8_MODE
xml_writer_stream(std::basic_ostream<char8_t, std::char_traits<char8_t> >& stream);
#endif

virtual void write(const void* data, size_t size) PUGIXML_OVERRIDE;

private:
std::basic_ostream<char, std::char_traits<char> >* narrow_stream;
std::basic_ostream<wchar_t, std::char_traits<wchar_t> >* wide_stream;
#ifdef PUGIXML_CHAR8_MODE
std::basic_ostream<char8_t, std::char_traits<char8_t> >* utf8_stream;
#endif
};
#endif

@@ -696,6 +716,9 @@ namespace pugi
// Print subtree to stream
void print(std::basic_ostream<char, std::char_traits<char> >& os, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, xml_encoding encoding = encoding_auto, unsigned int depth = 0) const;
void print(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& os, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, unsigned int depth = 0) const;
#ifdef PUGIXML_CHAR8_MODE
void print(std::basic_ostream<char8_t, std::char_traits<char8_t> >& os, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, unsigned int depth = 0) const;
#endif
#endif

// Child nodes iterators
@@ -1064,6 +1087,9 @@ namespace pugi
// Load document from stream.
xml_parse_result load(std::basic_istream<char, std::char_traits<char> >& stream, unsigned int options = parse_default, xml_encoding encoding = encoding_auto);
xml_parse_result load(std::basic_istream<wchar_t, std::char_traits<wchar_t> >& stream, unsigned int options = parse_default);
#ifdef PUGIXML_CHAR8_MODE
xml_parse_result load(std::basic_istream<char8_t, std::char_traits<char8_t> >& stream, unsigned int options = parse_default);
#endif
#endif

// (deprecated: use load_string instead) Load document from zero-terminated string. No encoding conversions are applied.
@@ -1094,6 +1120,9 @@ namespace pugi
// Save XML document to stream (semantics is slightly different from xml_node::print, see documentation for details).
void save(std::basic_ostream<char, std::char_traits<char> >& stream, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default, xml_encoding encoding = encoding_auto) const;
void save(std::basic_ostream<wchar_t, std::char_traits<wchar_t> >& stream, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default) const;
#ifdef PUGIXML_CHAR8_MODE
void save(std::basic_ostream<char8_t, std::char_traits<char8_t> >& stream, const char_t* indent = PUGIXML_TEXT("\t"), unsigned int flags = format_default) const;
#endif
#endif

// Save XML to file
@@ -1429,8 +1458,8 @@ namespace pugi

#ifndef PUGIXML_NO_STL
// Convert wide string to UTF8
std::basic_string<char, std::char_traits<char>, std::allocator<char> > PUGIXML_FUNCTION as_utf8(const wchar_t* str);
std::basic_string<char, std::char_traits<char>, std::allocator<char> > PUGIXML_FUNCTION as_utf8(const std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >& str);
std::basic_string<u8char_t, std::char_traits<u8char_t>, std::allocator<u8char_t> > PUGIXML_FUNCTION as_utf8(const wchar_t* str);
std::basic_string<u8char_t, std::char_traits<u8char_t>, std::allocator<u8char_t> > PUGIXML_FUNCTION as_utf8(const std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> >& str);

// Convert UTF8 to wide string
std::basic_string<wchar_t, std::char_traits<wchar_t>, std::allocator<wchar_t> > PUGIXML_FUNCTION as_wide(const char* str);