While collecting an IP database recently, I ran into a problem: fetching certain pages always failed with the following error:
  File "showip_ip.py", line 39, in showIP
    content = content.decode('GB2312').encode('UTF-8')
UnicodeDecodeError: 'gb2312' codec can't decode bytes in position 7572-7573: illegal multibyte sequence
In the end I opened the Python documentation and found the answer at http://docs.python.org/howto/unicode.html
The example given on that page is:
>>> unicode('abcdef')
u'abcdef'
>>> s = unicode('abcdef')
>>> type(s)
<type 'unicode'>
>>> unicode('abcdef' + chr(255))
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 6:
ordinal not in range(128)
>>> unicode('\x80abc', errors='strict')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 0:
ordinal not in range(128)
>>> unicode('\x80abc', errors='replace')
u'\ufffdabc'
>>> unicode('\x80abc', errors='ignore')
u'abc'
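The unicode function takes this errors argument, so decode and encode presumably accept it too. Judging from the examples, my program was hitting an illegal byte sequence during processing, and that is what raised the exception.
Sure enough, at http://docs.python.org/release/2.6.7/library/stdtypes.html#sequence-types-str-unicode-list-tuple-buffer-xrange I found the documentation for the corresponding decode method: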
The default is 'strict', meaning that encoding errors raise UnicodeError.
So the default is strict: an exception is raised as soon as an illegal byte sequence is encountered.
The available options are:
Value                 Meaning
'strict'              Raise UnicodeError (or a subclass); this is the default.
'ignore'              Ignore the character and continue with the next.
'replace'             Replace with a suitable replacement character; Python will use the official U+FFFD REPLACEMENT CHARACTER for the built-in Unicode codecs on decoding and '?' on encoding.
'xmlcharrefreplace'   Replace with the appropriate XML character reference (only for encoding).
'backslashreplace'    Replace with backslashed escape sequences (only for encoding).
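To see how these handlers differ in practice, here is a small sketch. The sample bytes and the try_decode helper are made up for illustration and are not from my script: the data is the GB2312 encoding of two Chinese characters with two stray 0xFF bytes in between. The bytes are written as b'...' literals so the snippet also runs on Python 3, where decode lives on bytes rather than str.

```python
# Hypothetical sample data: GB2312 bytes for U+4E2D and U+6587 with two
# stray 0xFF bytes (not valid in GB2312) inserted between them.
raw = b'\xd6\xd0\xff\xff\xce\xc4'


def try_decode(data, errors):
    """Decode data as GB2312 with the given error handler.

    Returns the decoded text, or None if decoding raised an error.
    """
    try:
        return data.decode('gb2312', errors)
    except UnicodeDecodeError:
        return None


# Decoding side
strict_result = try_decode(raw, 'strict')   # None: the stray bytes raise
ignored = try_decode(raw, 'ignore')         # stray bytes are dropped
replaced = try_decode(raw, 'replace')       # stray bytes become U+FFFD

# Encoding side: the two characters cannot be represented in ASCII,
# so each handler substitutes something different.
text = u'\u4e2d\u6587'
as_question_marks = text.encode('ascii', 'replace')
as_xml_refs = text.encode('ascii', 'xmlcharrefreplace')
as_escapes = text.encode('ascii', 'backslashreplace')
```

How many replacement characters 'replace' produces for a run of bad bytes can differ between Python versions, so it is safer not to rely on an exact count.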
With the ignore option, the illegal bytes are simply skipped:
content = content.decode('GB2312','ignore').encode('UTF-8')
Running it again, the error is indeed gone.
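One caveat is that ignore drops bytes silently, so it can be useful to record what was thrown away. The following is only a sketch under my own assumptions: decode_lossy and the sample bytes are hypothetical names and data, not part of the original script. It tries a strict decode first, and on failure reads the position of the first bad span from the exception before falling back to ignore.

```python
def decode_lossy(data, encoding='gb2312'):
    """Decode data, dropping undecodable bytes.

    Returns (text, first_bad). first_bad is None when the data decoded
    cleanly; otherwise it is a (start, end, offending_bytes) tuple taken
    from the first UnicodeDecodeError that strict decoding raised.
    """
    try:
        return data.decode(encoding), None
    except UnicodeDecodeError as exc:
        # start/end/object are standard attributes of UnicodeDecodeError.
        first_bad = (exc.start, exc.end, exc.object[exc.start:exc.end])
        return data.decode(encoding, 'ignore'), first_bad


# Hypothetical page content: two valid GB2312 characters around stray bytes.
sample = b'\xd6\xd0\xff\xff\xce\xc4'
text, first_bad = decode_lossy(sample)
content = text.encode('utf-8')
```

Only the first bad span is reported here; later ones are still dropped by the ignore pass without being recorded.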