Decoding mail in Python ~ notes / getting the character encoding ~

I am trying to read the contents of mail with Python, but some messages just will not decode properly.

And yet the declared encoding name, iso-2022-jp, is correct.

When I looked into what was wrong, it turned out that Python (or rather, the world in general) apparently has six different kinds of iso-2022-jp in all.*1
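To make that concrete, here is a rough sketch that tries each of the six codec names from the standard encodings table against one sample. The sample (half-width katakana encoded with iso2022_jp_ext) and the helper name decodable_with are my own assumptions for illustration; the post does not say which characters caused the failures.

```python
# The six ISO-2022-JP codec names listed in Python's standard encodings table.
VARIANTS = [
    'iso2022_jp', 'iso2022_jp_1', 'iso2022_jp_2',
    'iso2022_jp_2004', 'iso2022_jp_3', 'iso2022_jp_ext',
]

# Hypothetical sample: half-width katakana, which plain iso2022_jp cannot encode.
sample_text = u'\uff83\uff7d\uff84'
sample = sample_text.encode('iso2022_jp_ext')

def decodable_with(data, names=VARIANTS):
    """Return the codec names from `names` that decode `data` without error."""
    ok = []
    for name in names:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok

print(decodable_with(sample))
```

The point is only that a message can be valid in one variant while the plain iso-2022-jp codec rejects it.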

Since there is no telling which iso-2022-jp variant a message body actually uses, here is a function that works it out automatically.

One caveat: the function below tries the candidates in a fixed priority order, so it may return the wrong encoding name.
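A small check of that caveat, as a sketch rather than anything definitive. iso-2022-jp data consists only of 7-bit bytes, so it also passes as well-formed UTF-8. Because 'utf_8' comes first in check_type, I would expect such a message to be reported as utf_8. The sample text is my own.

```python
text = u'\u3053\u3093\u306b\u3061\u306f'   # "konnichiwa" in hiragana
data = text.encode('iso2022_jp')

# Every byte is below 0x80, so the data is also valid UTF-8 (and ASCII).
all_seven_bit = all(b < 0x80 for b in bytearray(data))

as_utf8 = data.decode('utf_8')       # succeeds, but the result is escape-sequence junk
as_jis = data.decode('iso2022_jp')   # the intended text

print(all_seven_bit, as_utf8 == text, as_jis == text)
```

So "decodes without an error" is not the same as "is the right encoding", and the order of the candidates decides which name wins.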

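#
# name = get_jp_encoding_name(msg[, char_code])
# name is the name of the first encoding that decodes msg, or None.
# This only checks Japanese encodings (and UTF-8).
# Python 2 code: types.StringType and unicode() do not exist in Python 3.
#

def get_jp_encoding_name (msg,char_code = 'iso-2022-jp'):
	import types
	
	# Fall back to iso-2022-jp when char_code is None or empty.
	enc_name = char_code or 'iso-2022-jp'
	# Candidates in priority order; the first one that decodes wins.
	check_type = ['utf_8',
				 enc_name,        'iso2022_jp_1', 'iso2022_jp_2',
				 'iso2022_jp_2004','iso2022_jp_3', 'iso2022_jp_ext',
				 'shift-jis',      'cp932',        'shift_jis_2004',
				 'shift_jisx0213', 'euc_jp',       'euc_jis_2004',
				 'euc_jisx0213']
	for enc in check_type:
		try:
			if isinstance(msg, types.StringType):
				s = msg.decode(enc)
			else:
				s = unicode(msg,enc)
		except Exception:
			# Decoding failed (or the codec name is unknown): try the next one.
			continue
		return enc
	# No candidate could decode msg.
	return None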
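
The function above is Python 2. For reference, here is a minimal sketch of the same idea for Python 3 bytes. The name guess_jp_encoding and the reordering are my own assumptions, not part of the original script: the declared charset and the iso-2022-jp variants are tried before utf_8, because 7-bit iso-2022-jp data would otherwise always match utf_8 first.

```python
def guess_jp_encoding(data, declared='iso-2022-jp'):
    """Return the first candidate codec name that decodes `data`, else None."""
    candidates = [
        declared or 'iso-2022-jp',
        'iso2022_jp_1', 'iso2022_jp_2', 'iso2022_jp_2004',
        'iso2022_jp_3', 'iso2022_jp_ext',
        'utf_8',
        'shift_jis', 'cp932', 'shift_jis_2004', 'shift_jisx0213',
        'euc_jp', 'euc_jis_2004', 'euc_jisx0213',
    ]
    for enc in candidates:
        try:
            data.decode(enc)
        except (UnicodeDecodeError, LookupError):
            # Wrong encoding, or an unknown declared name: try the next one.
            continue
        return enc
    return None

# Hypothetical samples for a quick try.
plain = u'\u3053\u3093'.encode('iso2022_jp')       # ordinary iso-2022-jp text
kana = u'\uff76\uff85'.encode('iso2022_jp_ext')    # half-width katakana
print(guess_jp_encoding(plain), guess_jp_encoding(kana))
```

The same caveat applies as before: this is still a first-match guess, and Shift_JIS and EUC-JP data can decode under each other's codecs without raising.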

Reference site: 2006-12-28 - かせきのうさぎさん. Thanks, itasuke :-)

* Fixed a bug: I had forgotten to write import types, so every call returned None.

* 2008/09/01 addendum: I changed the title slightly, since this script only gets the character encoding.

*1:http://www.python.jp/doc/release/lib/standard-encodings.html

at_yasu's blog
