ssfang/setlocale_sample.md

Last active September 27, 2016 08:34


Internationalization and localization: Character set, Character encoding, setlocale


VS2008 (File -> Advanced Save Options -> Encoding: Chinese Simplified (GB2312) - Codepage 936)

#include <stdio.h>
#include <stdlib.h>
#include <tchar.h>
#include <locale.h>

int _tmain(int argc, _TCHAR* argv[])
{
/*
C:\Users\fangss>chcp /?
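	  Displays or sets the active code page number.

	  CHCP [nnn]

	  nnn   Specifies a code page number.
	  Type CHCP without a parameter to display the active code page number.

	  chcp 65001  switches to the UTF-8 code page
	  chcp 936    switches back to the default GBK
	  chcp 437    is US English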

	  https://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/chcp.mspx?mfr=true
*/
	//https://msdn.microsoft.com/en-us/library/x99tb11d.aspx
	printf("the thread's current locale: %s\nthe active console code page: ", setlocale(LC_ALL, NULL));
	system("CHCP");

	printf("\n");

	//setlocale();
	printf("printf(\"%%S\"): WideChars = %S", L"CN中国");
	printf(" vs ");
	wprintf(L"wprintf(L\"%%s\"): WideChars = %s", L"CN中国");

	printf("\n\n");

	// POSIX language[_territory][.codeset][@modifier]
	// MS lang[_country_region[.code_page]]
	// language is an ISO 639 language code, territory is an ISO 3166 country/region code, codeset is a character set name
	// e.g. zh_CN.GBK for POSIX vs Chinese_People's Republic of China.936 for Windows CRT

	//setlocale(LC_ALL, "Chinese_People's Republic of China.936");
	setlocale(LC_ALL, "");

	printf("printf(\"%%S\"): WideChars = %S", L"CN中国");
	printf(" vs ");
	wprintf(L"wprintf(L\"%%s\"): WideChars = %s", L"CN中国");

	printf("\n\n");
	//getchar();
	system("pause");
	return 0;
}

Output

the thread's current locale: C
the active console code page: 活动代码页: 936

printf("%S"): WideChars = CN vs wprintf(L"%s"): WideChars = CN??

printf("%S"): WideChars = CN中国 vs wprintf(L"%s"): WideChars = CN中国

请按任意键继续. . .

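To see the encoding of the source file more clearly, dump it as follows (only excerpts from the beginning and the end of the file are shown):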

C:\Users\fangss>hexdump -C D:\VSProjects\Win32\Win32\propdump.cpp
00000000  23 69 6e 63 6c 75 64 65  20 3c 73 74 64 69 6f 2e  |#include <stdio.|
00000010  68 3e 0d 0a 23 69 6e 63  6c 75 64 65 20 3c 74 63  |h>..#include <tc|

00000530  53 22 2c 20 4c 22 43 4e  d6 d0 b9 fa 22 29 3b 0d  |S", L"CN....");.| <---
00000540  0a 20 20 20 20 70 72 69  6e 74 66 28 22 20 76 73  |.    printf(" vs|
00000550  20 22 29 3b 0d 0a 20 20  20 20 77 70 72 69 6e 74  | ");..    wprint|
00000560  66 28 4c 22 77 70 72 69  6e 74 66 28 4c 5c 22 25  |f(L"wprintf(L\"%|
00000570  25 73 5c 22 29 3a 20 57  69 64 65 43 68 61 72 73  |%s\"): WideChars|
00000580  20 3d 20 25 73 22 2c 20  4c 22 43 4e d6 d0 b9 fa  | = %s", L"CN....| <---
00000590  22 29 3b 0d 0a 0d 0a 20  20 20 20 70 72 69 6e 74  |");....    print|
000005a0  66 28 22 5c 6e 5c 6e 22  29 3b 0d 0a 20 20 20 20  |f("\n\n");..    |
000005b0  2f 2f 67 65 74 63 68 61  72 28 29 3b 0d 0a 20 20  |//getchar();..  |

The lines at offsets 00000530 and 00000580 contain the Chinese characters 中国, stored as the bytes d6 d0 b9 fa.

Unihan data for U+4E2D 中 and Unihan data for U+56FD 国

Excerpts from ftp://ftp.unicode.org/Public/MAPPINGS/VENDORS/MICSFT/WindowsBestFit/bestfit936.txt:

fangss@fangss-PC ~
$ grep -n "Null" /cygdrive/c/Users/fangss/Desktop/bestfit936.txt
7:0x00  0x0000  ;Null
24465:0x0000    0x0000  ;Null

fangss@fangss-PC ~
$ grep -n "中" /cygdrive/c/Users/fangss/Desktop/bestfit936.txt
16694:0xd0      0x4e2d  ;中
25342:0x3197    0xd6d0  ;中
25486:0x4e2d    0xd6d0  ;中

fangss@fangss-PC ~
$ grep -n "国" /cygdrive/c/Users/fangss/Desktop/bestfit936.txt
11139:0xfa      0x56fd  ;国
27742:0x56fd    0xb9fa  ;国

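For the same character, the earlier matches belong to MBTABLE or a DBCSTABLE (bytes to Unicode), while the later ones belong to WCTABLE 24482 (Unicode to bytes). The file is structured as follows: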

CODEPAGE 936            ; PRC GBK (XGB) - ANSI, OEM

CPINFO 2 0x3f 0x003f    ; DBCS CP, Default Char = Question Mark

MBTABLE 130

0x00	0x0000	;Null
0x01	0x0001	;Start Of Heading
;... (omitted)
0xff	0xf8f5	;


DBCSRANGE  1            ;Lead Byte Range: 0x81-0xfe

0x81  0xfe              ;Lead Byte Range


DBCSTABLE 190           ;LeadByte = 0x81

0x40	0x4e02	;丂
0x41	0x4e04	;丄
;... (omitted)

DBCSTABLE 190           ;LeadByte = 0x82

0x40	0x4fa4	;侤
0x41	0x4fab	;侫
;... (omitted)
0xfe	0xe4c5	;


WCTABLE	24482

0x0000	0x0000	;Null
0x0001	0x0001	;Start Of Heading
;... (omitted)
0xffe5	0xa3a4	;￥

ENDCODEPAGE

Windows CRT setlocale, _wsetlocale
National Language Support (NLS) API Reference
WindowsBestFit 936

ssfang commented Aug 22, 2016 (edited)

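The C and C++ standard libraries each have their own way of handling locales: the C library sets the locale with setlocale(), while the C++ library provides the locale class and the imbue() method of stream objects. This note summarizes my own experience with setlocale().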

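setlocale() in glibc on Linux
For details, see: man 3 setlocale

The header and declaration are as follows:

#include <locale.h>
char* setlocale(int category, const char* locale);
Notes:

category: the locale category, i.e. the aspect of program behavior the locale applies to. The predefined constants are usually LC_ALL, LC_COLLATE, LC_CTYPE, LC_MESSAGES, LC_MONETARY, LC_NUMERIC, and LC_TIME, where LC_ALL stands for all of the other categories together.

locale: the name of the locale to set. On Linux/Unix a locale name usually has the form language[_territory][.codeset][@modifier], where language is an ISO 639 language code, territory is an ISO 3166 country/region code, and codeset is the name of a character set.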

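On Linux, locale -a lists all locales configured on the system, locale with no options shows the locale active in the current shell, and locale -m lists all character set encodings the locale system supports.

The related package is called locales; every locale the locale system supports is listed in the file /usr/share/i18n/SUPPORTED.

On Debian, the locales can be reconfigured with dpkg-reconfigure locales, or by editing /etc/locale.gen by hand and then running locale-gen.

On Ubuntu, edit /var/lib/locales/supported.d/local to configure a new locale, then run locale-gen.

When locale is NULL, the function only retrieves the current locale and passes it back as the return value; the current locale is not changed.

When locale is "", the locale is set from the environment. The lookup order is: the LC_ALL environment variable, then the individual per-category LC_* variable, and finally LANG. To let a program follow the environment's locale, the following call is usually added to its initialization: setlocale(LC_ALL, "").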

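When a C program starts (on entry to main()), the locale is initialized to the default C locale. Its character encoding is the common subset of all native ANSI character set encodings, the minimal character set needed to write C source programs (hence the locale name C).

When setlocale() is used to set the active locale, it returns the full name of the now-active locale on success and NULL on failure.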

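setlocale() in the Windows CRT
For details, see: setlocale - MSDN Run-Time Library Reference

The Windows CRT also provides a wide-character version that takes the locale name as a wchar_t string: _wsetlocale(). Accordingly there is also a _TCHAR macro version of setlocale(): _tsetlocale().

The Windows CRT's setlocale() has the same header and declaration as the glibc version and is used in much the same way:

Supported locale category constants: LC_ALL, LC_COLLATE, LC_CTYPE, LC_MONETARY, LC_NUMERIC, LC_TIME.

The requested locale name can take the following forms (see MSDN: Language and Country/Region Strings):

lang[_country_region[.code_page]]: the form resembles glibc's, but Windows locale names do not follow the POSIX convention. For example, for mainland Chinese with the GBK character set the POSIX name is zh_CN.GBK, whereas the Windows CRT requires Chinese_People's Republic of China.936.

For valid values of the lang field, see: Language Strings

For valid values of the country_region field, see: Country/Region Strings

Valid values of the code_page field are the code page numbers supported by Windows, see: Code Page Identifiers

.code_page: a locale can be set directly from a code page. Two pseudo code pages are also accepted: .OCP means the current OEM code page obtained from the system, and .ACP means the active ANSI code page obtained from the system.

"": sets the locale according to the active ANSI code page of the Windows environment. .OCP, .ACP, and the environment's code page are all affected by the "Regional and Language Options" settings in Control Panel. On a default installation of Simplified Chinese Windows the active ANSI code page is 936 (GBK); the chcp console command shows the console's active code page, which is likewise 936 on such a system.

NULL: retrieves the current locale without changing it.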

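What setlocale() does, with a usage example
To output wchar_t characters to a terminal or console, the locale must be set with setlocale(), because terminals and consoles generally do not support UCS encodings themselves. When a stream function such as printf() is used, the standard/runtime library internally converts the UCS characters to characters in a suitable native ANSI encoding. The conversion is based on the active locale set by setlocale(), and the resulting character sequence is then passed to the terminal. For input coming from the terminal the process is exactly the reverse.

This mechanism can be verified by redirecting the output stream to a file: whether with the Windows CRT, glibc on Linux, or the C library on Cygwin, when wprintf() prints wchar_t text, the redirected file always contains a native ANSI encoding such as GBK or UTF-8, never a UCS encoding.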

Below is a sample I wrote that uses setlocale():

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

#ifdef __GNUC__

#define CSET_GBK    "GBK"
#define CSET_UTF8   "UTF-8"

#define LC_NAME_zh_CN   "zh_CN"

// ifdef __GNUC__
#elif defined(_MSC_VER)

#define CSET_GBK    "936"
#define CSET_UTF8   "65001"

#define LC_NAME_zh_CN   "Chinese_People's Republic of China"

// ifdef _MSC_VER
#endif

#define LC_NAME_zh_CN_GBK       LC_NAME_zh_CN "." CSET_GBK
#define LC_NAME_zh_CN_UTF8      LC_NAME_zh_CN "." CSET_UTF8
#define LC_NAME_zh_CN_DEFAULT   LC_NAME_zh_CN_GBK

void print_current_loc();

int main(int argc, char* argv[])
{
    char* locname = NULL;
    const wchar_t* strzh = L"中文字符串";

    print_current_loc();

    // Use a specific locale
    locname = setlocale(LC_ALL, LC_NAME_zh_CN_DEFAULT);
    if ( NULL == locname )
    {
        printf("setlocale() with %s failed.\n", LC_NAME_zh_CN_DEFAULT);
    }
    else
    {
        printf("setlocale() with %s succeeded.\n", LC_NAME_zh_CN_DEFAULT);
    }

    print_current_loc();

    wprintf(L"Zhong text is: %ls\n", strzh);

    // Use the locale settings of the runtime environment
    locname = setlocale(LC_ALL, "");
    if ( NULL == locname )
    {
        printf("setlocale() from environment failed.\n");
    }
    else
    {
        printf("setlocale() from environment succeeded.\n");
    }

    print_current_loc();

    wprintf(L"Zhong text is: %ls\n", strzh);

    puts("End of program.");
    return 0;
}

// Print the current locale
void print_current_loc()
{
    char* locname = setlocale(LC_ALL, NULL);
    printf("Current locale is: %s\n", locname);
}

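For the program above to compile and run correctly, note the following points:

The Windows CRT does not support UTF-8 as a locale encoding; at run time setlocale(LC_ALL, ".65001") fails. This holds for the CRT versions of that era (e.g. VS2008); the Universal CRT on recent Windows 10 releases is documented to accept UTF-8 locales.

With the C libraries on Linux and Cygwin, displaying Chinese correctly in the terminal requires the following:

Do not mix the char and wchar_t versions of the stream functions, otherwise they misbehave. In standard C each stream takes on a byte or wide orientation with the first I/O operation performed on it, and calling functions of the other kind on that stream afterwards is not supported. When I tested mixing printf() and wprintf() with Cygwin GCC 4 the program even crashed, so all printf() statements in the program above have to be commented out. The Windows CRT implementation does not have this problem.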

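The locale settings of the runtime environment must agree with the locale the program sets with setlocale(). For example, the terminal's active character set and the environment variables (usually LANG) must be set to *.UTF-8 before wchar_t Chinese characters can be displayed under setlocale(LC_ALL, "zh_CN.UTF-8").

When compiling with GCC, save the source file in UTF-8. GCC needs this in order to convert wchar_t literals (those prefixed with L) correctly to UCS when storing them in the object file. If a source file containing wchar_t literals is saved in a native ANSI encoding such as GBK, GCC reports a compile error unless the source encoding is passed explicitly with the -finput-charset option; GCC on both Linux and Cygwin has this constraint. In addition, GCC on Linux stores wchar_t as UCS-4, while Windows and Cygwin GCC use UCS-2.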

With wprintf(), use %ls for a wchar_t string and %s for a char string (see man 3 wprintf); the Windows implementation, in contrast, prints a wchar_t string correctly with either %ls or %s.
