A Python character encoding question

This topic was created 3031 days ago; the information in it may have since evolved or changed.

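I wrote a crawler that pulls some Unicode characters from a web page, but when I use them in Python they show up as a str string. How do I get Python to recognize these strings as unicode? Forcing a conversion with (str) does nothing. Could someone experienced point me in the right direction?

PS: the environment is Python 2.7.11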

Characters

Python

unicode

str

11 replies • 2016-09-19 23:21:16 +08:00

Magic347

2016-09-19 18:08:10 +08:00

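Please describe the problem in detail: the URL you are scraping, the content you are extracting, and so on.
With a description this vague, how is anyone supposed to help you?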

whwq2012

2016-09-19 18:15:47 +08:00

@Magic347 OK, the URL is https://lvyou.baidu.com/shengquansi . View the page source and search for more_desc, and you will see those Unicode characters.

crazykuma

2016-09-19 18:17:57 +08:00

Post the error message.

CosimoZi

2016-09-19 18:29:16 +08:00

'\u8fd1'.decode('unicode-escape')
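To see why this works: in the page source the text is six plain ASCII characters (a backslash, `u`, and four hex digits), not one Chinese character. A minimal sketch, written so that it should run on both Python 2.7 and Python 3; the variable names are illustrative only:

```python
# The scraped data holds the escape sequence as literal text:
# backslash, 'u', '8', 'f', 'd', '1' -> six bytes, not one character.
raw = b'\\u8fd1'
raw_length = len(raw)

# The 'unicode_escape' codec interprets the escapes and returns a real
# text string (unicode on Python 2, str on Python 3).
decoded = raw.decode('unicode_escape')
decoded_length = len(decoded)
```

After decoding, `decoded` holds a single character, whereas `raw` was six bytes long.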

prefere

2016-09-19 18:33:04 +08:00 via Android

Which library are you using?
Post the code and the error message.

ClutchBear

2016-09-19 19:16:15 +08:00

For this Baidu page,
if you are using the requests library,
just do
req = requests.get(url)
req.encoding = "utf-8"
and that solves it.
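For context, setting `req.encoding` only changes which codec requests uses to turn the response bytes into `req.text`. The sketch below imitates that step with plain bytes rather than a live request; the sample body is made up, and ISO-8859-1 stands in for a wrongly guessed encoding:

```python
# Hypothetical response body: the UTF-8 bytes of a three-character
# Chinese place name (written with escapes to keep this file ASCII).
expected = u'\u5723\u6cc9\u5bfa'
body = expected.encode('utf-8')

# Decoding with the wrong codec raises no error; it silently yields mojibake.
wrong = body.decode('iso-8859-1')

# Decoding as UTF-8, which is what req.encoding = "utf-8" asks for,
# restores the original text.
right = body.decode('utf-8')
```

This repairs page text decoded with the wrong charset. It does not touch `\uXXXX` sequences written literally inside a script; those need a separate unescape step.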

whwq2012

2016-09-19 19:57:28 +08:00

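@crazykuma
@prefere
@ClutchBear
@Magic347
My problem is this: the data I need is written directly into a script tag in the page source, as Unicode escape sequences.

I tried converting that string with unicode, decode, and encode, and none of them had any effect.
Normally Python recognizes Unicode text as the unicode type, but this text comes through as str. So what I was asking is how to unescape it, so that these Unicode characters wrongly treated as str become real unicode.
The reply at #4 gave the solution: decode('unicode-escape') does the unescaping.
Thanks all the same for your careful answers.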
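Putting the accepted answer together with the scraping step, a minimal sketch might look like this. The HTML fragment and the regular expression are invented for illustration (only the `more_desc` key name comes from this thread), and `json.loads` is shown as an alternative to `unicode-escape` on the assumption that the script data is JSON:

```python
import json
import re

# Hypothetical fragment of page source: the value is stored as \uXXXX escapes.
html = '<script>var data = {"more_desc": "\\u5723\\u6cc9\\u5bfa"};</script>'

# Capture the quoted value, quotes included, so it is a valid JSON string.
match = re.search(r'"more_desc":\s*("(?:[^"\\]|\\.)*")', html)
escaped = match.group(1)

# Option 1: let the JSON parser interpret the escapes.
via_json = json.loads(escaped)

# Option 2: the unicode_escape codec from reply #4 (surrounding quotes stripped).
via_codec = escaped[1:-1].encode('ascii').decode('unicode_escape')
```

Both options should yield the same three-character string here. If the value really is JSON, `json.loads` is probably the safer choice, because it follows JSON's escaping rules rather than Python's.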

a87150

2016-09-19 20:15:12 +08:00

Python 3 doesn't have this problem.

whwq2012

2016-09-19 20:27:09 +08:00 via Android

@a87150 Now you have me wavering. Is py3 mature yet?

a87150

2016-09-19 20:41:37 +08:00

@whwq2012 http://py3readiness.org/

wind3110991

2016-09-19 23:21:16 +08:00

b_txt = '\u636e\u8bf4\u8fdc\u53e4\u65f6\u6709\u5deb\u5e08\u5728\u6c34\u4e2d\u4e0b\u6bd2\u6bd2\u5bb3\u6751\u6c11\uff0c\u5927\u795e\u56e0\u9640\u7f57\u4ee5\u77db\u523a\u5730\u6d8c\u51fa\u6cc9\u6c34\uff0c\u89e3\u6551\u4e86\u6751\u6c11\u3002\u8fd9\u5c31\u662f\u5723\u6cc9\u7684\u6765\u5386\u3002\u00100\u5723\u6cc9\u5bfa\u4e2d\u6709\u5341\u591a\u4e2a\u51fa\u6c34\u53e3\uff0c\u5386\u7ecf\u5343\u5e74\u4f9d\u7136\u6e05\u6f88\u3002\u6bcf\u4e2a\u51fa\u6c34\u53e3\u7684\u529f\u6548\u90fd\u4e0d\u540c\uff0c\u6709\u7684\u53ef\u4ee5\u6d88\u707e\u89e3\u7978\uff0c\u6709\u7684\u53ef\u4ee5\u9a71\u9010\u75c5\u75db\uff0c\u6709\u7684\u53ef\u4ee5\u6d17\u6da4\u5fc3\u7075\uff0c\u9644\u8fd1\u5c45\u6c11\u6bcf\u5929\u65e9\u3001\u4e2d\u3001\u665a\u4e09\u6b21\u6765\u6b64\u6c90\u6d74\u3002\n\u5723\u6cc9\u5bfa\u9644\u8fd1\u7684\u5c0f\u5c71\u4e0a\u6709\u4e00\u5ea7\u6b27\u5f0f\u5efa\u7b51\uff0c\u662f\u5370\u5c3c\u603b\u7edf\u884c\u5bab\uff0c\u66fe\u63a5\u5f85\u4e16\u754c\u5404\u56fd\u653f\u8981\u6765\u8bbf\u3002\n\u5723\u6cc9\u5bfa\u4e2d\u6709\u5341\u591a\u4e2a\u51fa\u6c34\u53e3\uff0c\u5386\u7ecf\u5343\u5e74\u4f9d\u7136\u6e05\u6f88\u3002\u6bcf\u4e2a\u51fa\u6c34\u53e3\u7684\u529f\u6548\u90fd\u4e0d\u540c\uff0c\u6709\u7684\u53ef\u4ee5\u6d88\u707e\u89e3\u7978\uff0c\u6709\u7684\u53ef\u4ee5\u9a71\u9010\u75c5\u75db\uff0c\u6709\u7684\u53ef\u4ee5\u6d17\u6da4\u5fc3\u7075\uff0c\u9644\u8fd1\u5c45\u6c11\u6bcf\u5929\u65e9\u3001\u4e2d\u3001\u665a\u4e09\u6b21\u6765\u6b64\u6c90\u6d74\u3002'

a_txt = b_txt.decode('unicode-escape')
print a_txt

Just use decode.
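For readers on Python 3, where `str` has no `decode` method, the same idea needs one extra step. This is a sketch under the assumption that the scraped text contains only ASCII characters and escape sequences; the short sample is the first four escapes of the string above:

```python
# Text as it would arrive from the page: literal \uXXXX sequences.
b_txt = '\\u636e\\u8bf4\\u8fdc\\u53e4'

# Convert str to bytes first, then let unicode_escape interpret the escapes.
# encode('ascii') raises UnicodeEncodeError if real non-ASCII characters are
# mixed in, which is safer than silently garbling them.
a_txt = b_txt.encode('ascii').decode('unicode_escape')
```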