Which programming languages currently support Unicode?

This topic was created 3852 days ago, so the information in it may have since developed or changed.

PS: If any Wikipedia editors happen to pass by, please consider adding an entry on this subject.

Addendum 1 · 2014-05-02 23:02:52 +08:00

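I mean languages whose specification explicitly requires Unicode support, and whose interpreter or compiler either ships with Unicode-related modules or already supports Unicode at the syntax level.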

If a language has Unicode support but comes with obvious pitfalls, please point those out as well. (For example, Objective-C: http://www.objc.io/issue-9/unicode.html )

Addendum 2 · 2014-05-03 03:05:05 +08:00

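A short summary of the pitfalls raised in the replies:
1. On the language's specification: when fetching a character or computing a string's length, check whether the unit is a character or a Unicode code unit. A common choice is a 16-bit code unit, as in UTF-16, where a code point of the form U+XXXX fits in one unit.
2. On the Unicode standard: within a single Unicode encoding, one character can have several representations. For example, Ä may be \u00C4 or \u0041\u0308 (A followed by the combining diaeresis U+0308).
3. On the language's specification: the internal and external encodings may differ. See the two points above. A ZSH example: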
a=$(echo -n "\u00c4")
b=$(echo -n "\u0041\u0308")
echo $a ${#a} $b ${#b}
# -> Ä 1 Ä 2

Addendum 3 · 2014-05-03 03:32:46 +08:00

Ahem, I misused the example for point 3 above (a hexdump makes this clear). See reply #16 about Ruby instead: https://www.v2ex.com/t/110893#r_1066951
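
A minimal Python sketch of the hexdump check mentioned above, assuming Python 3.5 or later (for `bytes.hex`). The variable names are illustrative and do not come from the thread. It dumps the UTF-8 bytes of the two spellings of Ä and shows that Unicode normalization maps one onto the other.

```python
import unicodedata

precomposed = "\u00C4"       # Ä as one code point
decomposed = "\u0041\u0308"  # A + COMBINING DIAERESIS

# The two strings render alike but hold different bytes.
precomposed_hex = precomposed.encode("utf-8").hex()
decomposed_hex = decomposed.encode("utf-8").hex()
print(precomposed_hex, len(precomposed))  # c384 1
print(decomposed_hex, len(decomposed))    # 41cc88 2

# NFC composes, NFD decomposes; after either one the two forms compare equal.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed
nfd_equal = unicodedata.normalize("NFD", precomposed) == decomposed
print(nfc_equal, nfd_equal)  # True True
```

Normalizing both sides to the same form before comparing or measuring is the usual way around pitfall 2, although it does not help for sequences that have no precomposed form.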

29 replies • 2014-05-06 16:20:41 +08:00

wwqgtxx

2014-05-02 21:16:14 +08:00 via Android

java/python3

ochapman

2014-05-02 21:23:32 +08:00

golang: UTF-8, one of the Unicode encodings

zzNucker

2014-05-02 21:29:30 +08:00

javascript

Andor_Chen

2014-05-02 21:31:43 +08:00

Ruby 1.9+

jakwings

2014-05-02 21:32:50 +08:00

I'll naively add one myself: JavaScript (16-bit Unicode code unit)

hazard

2014-05-02 22:44:18 +08:00

bash shell?

xierch

2014-05-02 22:47:49 +08:00

I think we need to pin down what "support" means..

jakwings

2014-05-02 23:04:10 +08:00

@xierch Addendum added~

jakwings

2014-05-02 23:10:42 +08:00

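@hazard =_= It seems to count as support, provided the bash version is recent enough and the environment uses a UTF-8 compatible character encoding. echo ${#str} also reports the correct length.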

usedname

2014-05-03 01:06:32 +08:00 via Android

php6? It has native support, I think?

timothyqiu

2014-05-03 01:29:07 +08:00

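C++ has supported UTF-8/UTF-16/UTF-32 since C++11.

---

Here is a bonus pitfall that applies almost everywhere:

Strictly speaking, many languages support one particular Unicode encoding rather than Unicode itself.

* UTF-8 and UTF-16 are both variable-length encodings. In languages such as Python, Java, and JavaScript, the length of the character "𠂊" comes out as 2, because its Unicode code point U+2008A takes 2 code units when encoded as UTF-16.

* Even a fixed-length encoding like UTF-32, where one code unit corresponds to one Unicode code point, still has a problem, because characters and code points do not always map one to one: a single character may correspond to several code points. For example, the character "Ä", common in German, has two representations in Unicode: the standalone character "Ä" (U+00C4), and the letter "A" (U+0041) followed by the combining character "¨" (U+0308). Under the Unicode standard these two representations should be treated as the same character. In the vast majority of languages, however, the string "\u0041\u0308" written in the second form displays correctly as "Ä" but still has length 2.

The second point matters most: at present almost no language can guarantee the correct character count for a string.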
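
The difference between code units and code points can be sketched in Python as follows. This assumes Python 3.3 or later, where `len()` on a `str` counts code points. The helper `code_units` is hypothetical and exists only for this illustration.

```python
def code_units(text, encoding, unit_size):
    """Count code units by encoding the text and dividing by the unit width in bytes."""
    return len(text.encode(encoding)) // unit_size

han = "\U0002008A"          # the character U+2008A, outside the BMP
a_umlaut = "\u0041\u0308"   # A + combining diaeresis, displayed as one character

han_units = (
    code_units(han, "utf-8", 1),      # UTF-8 uses 8-bit units
    code_units(han, "utf-16-le", 2),  # UTF-16 uses 16-bit units
    code_units(han, "utf-32-le", 4),  # UTF-32 uses 32-bit units
)
print(han_units)  # (4, 2, 1)

# Even in UTF-32 the decomposed form still occupies two units.
umlaut_utf32_units = code_units(a_umlaut, "utf-32-le", 4)
print(umlaut_utf32_units)  # 2
```

The explicit little-endian codec names are used so that no byte order mark is added to the byte count.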

blacktulip

2014-05-03 01:30:59 +08:00

@timothyqiu

 irb
2.1.1 :001 > "𠂊".length
=> 1

blacktulip

2014-05-03 01:31:51 +08:00

 irb
2.1.1 :001 > "𠂊".length
=> 1
2.1.1 :002 > "Ä".length
=> 1

timothyqiu

2014-05-03 01:37:02 +08:00

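@blacktulip Yes, I tried Ruby before replying, which is why I left it off the list...

In many languages the "string length" function simply returns the number of code units. Ruby either stores strings as UTF-32 or decodes the string back into code points when computing the length. (I have only scratched the surface of Ruby, so I am not sure.)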

timothyqiu

2014-05-03 01:38:55 +08:00

@blacktulip The Ä example has to be written with escape sequences. Typing Ä directly will most likely just give you \u00c4.

blacktulip

2014-05-03 01:45:30 +08:00

@timothyqiu Right. In Ruby every string carries its own encoding. Have a look at this: http://yokolet.blogspot.co.uk/2009/07/design-and-implementation-of-ruby-m17n.html

"Ruby multilingualization (M17N) of Ruby 1.9 uses the code set
independent model (CSI) while many other languages use the Unicode
normalization model."

"Under the CSI model, all encodings are handled equally, which means,
Unicode is one of character sets. The most remarkable feature of the
CSI model is that the model does not require a character code
conversion since external and internal character codes are identical.
Thus, the cost for conversion can be eliminated. Besides, we can keep
away from unexpected information loss caused by the conversion,
especially by cutting bits or bytes off. Ruby uses the CSI model, so
do Solaris, Citrus, or other system based on the C library that does
not use __STDC_ISO_10646__."

"Moreover, it is possible to handle various character sets even though
they are not based on Unicode."

skydiver

2014-05-03 01:52:04 +08:00

@timothyqiu Python 3 gets this right.

In [1]: len('𠂊')
Out[1]: 1

timothyqiu

2014-05-03 02:27:31 +08:00

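@skydiver Thanks~ I looked it up, and this appears to be the default behavior introduced in Python 3.3 (PEP 393).

For versions 2.1 < Python < 3.3, a build option lets you choose UTF-32 instead of UTF-16 as the encoding for unicode.

Versions Python <= 2.1 support only UTF-16 or, more precisely, only the Unicode BMP.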

jakwings

2014-05-03 04:19:18 +08:00

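@blacktulip Ruby 2.1.1p76
irb> "Ä".length   # precomposed, U+00C4
=> 1
irb> "Ä".length   # decomposed, U+0041 U+0308
=> 2

@skydiver Python3.3.5
print(len('Ä'))   # precomposed, U+00C4
#=> 1
print(len('Ä'))   # decomposed, U+0041 U+0308
#=> 2

So it looks like combining characters have to be looked up in the code tables and counted as 0... a serious flaw...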
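
The "count combining characters as 0" idea can be sketched in Python with the standard `unicodedata` module. The function name `approx_char_count` is hypothetical, and the approach is only an approximation: it skips code points with a non-zero canonical combining class, which is not the same as full grapheme cluster segmentation, so some sequences (for instance those joined by zero-width joiners) are still miscounted.

```python
import unicodedata

def approx_char_count(text):
    """Count code points whose canonical combining class is 0, skipping combining marks."""
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)

precomposed = "\u00C4"       # Ä as one code point
decomposed = "\u0041\u0308"  # A + combining diaeresis

print(len(precomposed), approx_char_count(precomposed))  # 1 1
print(len(decomposed), approx_char_count(decomposed))    # 2 1
```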

est

2014-05-03 08:53:28 +08:00

@zzNucker You could say JavaScript only supports UCS-2 rather than Unicode.

est

2014-05-03 08:56:26 +08:00

@timothyqiu That's not right either. Python 2 lets you choose at compile time:

--enable-unicode=ucs4

>>> import sys
>>> print sys.maxunicode
1114111

--enable-unicode=ucs2:

>>> import sys
>>> print sys.maxunicode
65535

lidashuang

2014-05-03 09:05:59 +08:00 via Android

elixir

timothyqiu

2014-05-03 09:26:46 +08:00

@est Er~ I'm not quite sure which part is wrong...

The Unicode code point range is U+000000 ~ U+10FFFF, so:
ucs4 -> UTF-32 -> 0~1114111(0x10FFFF)
ucs2 -> UTF-16 -> 0~65535(0xFFFF)
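
The arithmetic behind those two limits can be checked with a short Python sketch. It assumes Python 3.3 or later, where `sys.maxunicode` no longer depends on the build; the constant names are made up for this example.

```python
import sys

MAX_CODE_POINT = 0x10FFFF  # last code point of the Unicode code space
BMP_MAX = 0xFFFF           # last code point of the Basic Multilingual Plane
PLANES = 17                # planes 0 through 16, each with 0x10000 code points

print(MAX_CODE_POINT)  # 1114111
print(BMP_MAX)         # 65535

# 17 planes of 65536 code points each end exactly at U+10FFFF.
planes_match = PLANES * 0x10000 - 1 == MAX_CODE_POINT
print(planes_match)  # True

# Assumption: Python 3.3+, where this value is fixed for every build.
maxunicode_match = sys.maxunicode == MAX_CODE_POINT
print(maxunicode_match)
```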

est

2014-05-03 10:27:15 +08:00

@timothyqiu Er. I misread. What you posted is correct.

jakwings

2014-05-03 16:38:10 +08:00

@usedname PHP has shipped with the Multibyte String module by default since 5.4.3. It supports several Unicode encodings and has an mb_split function, so I'd say its support is fairly good.
http://docs.php.net/manual/en/mbstring.encodings.php

jakwings

2014-05-03 16:43:17 +08:00

@lidashuang Thanks. I looked into it: it supports UTF-8, and its text length calculation also seems to be accurate:
http://elixir-lang.org/docs/stable/String.html

zzNucker

2014-05-03 21:15:41 +08:00

@est Take a look at ES6, which already fully supports 32-bit code points
https://github.com/Singularity-zju/understandinges6_zh_cn/blob/master/01-The-Basics.md

wssgcg1213

2014-05-04 12:38:16 +08:00

ES6 codePointAt
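
When `codePointAt` lands on the high half of a UTF-16 surrogate pair, it returns the combined code point rather than the 16-bit unit. The arithmetic involved can be sketched in Python; the function names `to_surrogates` and `from_surrogates` are hypothetical, and this is an illustration of the UTF-16 rule, not the ES6 implementation.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a (high, low) UTF-16 surrogate pair."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Combine a (high, low) surrogate pair back into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

high, low = to_surrogates(0x2008A)
print(hex(high), hex(low))              # 0xd840 0xdc8a
print(hex(from_surrogates(high, low)))  # 0x2008a
```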

jakwings

2014-05-06 16:20:41 +08:00

Lua 5.3 also plans to support UTF-8: http://www.lua.org/work/doc/manual.html#6.5