郵便番号をDBに突っ込みまして、その際に落ちた所とはまったところ、んで出来たデータの事。
落とし穴
現在のデータ状態(2009/01/14)
以下の数字はDBに突っ込んだ数字でありまして、CSVのレコード件数ではないです。*1
- データ総数:122254件
- 郵便番号の件数:118618件
- 郵便番号重複(同県内住所違い)件数:19799件*2
- 郵便番号重複(他県住所違い)件数:6件
- 498-0000
- 三重県桑名郡木曽岬町
- 愛知県弥富市
- 618-0000
- 大阪府三島郡島本町
- 京都府乙訓郡大山崎町
- 871-0000
- 大分県中津市
- 福岡県築上郡吉富町
- 498-0000
- 郵便番号重複(同じ住所)件数:2件
- 581-0027 大阪府八尾市八尾木
- 673-0012 兵庫県明石市和坂
まとめ
日本広い。
後、「北九州市八幡西区洞北町」は「きたきゅうしゅうしやはたにしくどうほくまち」に統一されたみたいなんだぜ。
参考
参考にしました。そしていろいろご教授下さったkhashi先生、ありがとうございます。
SQLのベンチマーク
ここからは趣味のPostgreSQLへ。バージョンは、8.3.5。まず、テーブル
lip_develop=> \d account_postaddressmap Table "public.account_postaddressmap" Column | Type | Modifiers ------------+-------------------------+--------------------------------------------------------------------- id | integer | not null default nextval('account_postaddressmap_id_seq'::regclass) postcode | character varying(8) | not null perfecture | character varying(10) | not null address | character varying(1024) | not null Indexes: "account_postaddressmap_pkey" PRIMARY KEY, btree (id) "account_postaddressmap_perfecture" btree (perfecture) "account_postaddressmap_postcode" btree (postcode) lip_develop=>
総件数の列数を取得。これぐらいはまだ速い。と言っても、0.5秒・・・
lip_develop=> EXPLAIN ANALYZE select count(*) from account_postaddressmap; QUERY PLAN ------------------------------------------------------------------------------------------------------------------------------------- Aggregate (cost=2702.58..2702.59 rows=1 width=0) (actual time=599.517..599.519 rows=1 loops=1) -> Seq Scan on account_postaddressmap (cost=0.00..2397.26 rows=122126 width=0) (actual time=0.091..316.625 rows=122254 loops=1) Total runtime: 599.611 ms (3 rows) lip_develop=>
唯一な郵便番号を取得。1.4秒・・・
lip_develop=> EXPLAIN ANALYZE select distinct(postcode) from account_postaddressmap; QUERY PLAN ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Unique (cost=0.00..8596.04 rows=91932 width=9) (actual time=0.220..1219.124 rows=118618 loops=1) -> Index Scan using account_postaddressmap_postcode on account_postaddressmap (cost=0.00..8290.73 rows=122126 width=9) (actual time=0.211..545.461 rows=122254 loops=1) Total runtime: 1486.844 ms (3 rows) lip_develop=>
郵便番号重複(同県内住所違い)を取得。ここらへんから、SQL文が怪しくなってくる。ここで、2.4秒。
lip_develop=> EXPLAIN ANALYZE select a.id,b.id,a.postcode,a.perfecture,a.address from account_postaddressmap as a, account_postaddressmap as b where a.id != b.id and a.postcode = b.postcode and a.address != b.address and a.perfecture = b.perfecture; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Merge Join (cost=0.00..29929.92 rows=7757 width=52) (actual time=3.496..2318.163 rows=39598 loops=1) Merge Cond: ((a.postcode)::text = (b.postcode)::text) Join Filter: ((a.id <> b.id) AND ((a.address)::text <> (b.address)::text) AND ((a.perfecture)::text = (b.perfecture)::text)) -> Index Scan using account_postaddressmap_postcode on account_postaddressmap a (cost=0.00..8290.73 rows=122126 width=48) (actual time=0.252..405.424 rows=122254 loops=1) -> Index Scan using account_postaddressmap_postcode on account_postaddressmap b (cost=0.00..8290.73 rows=122126 width=48) (actual time=0.020..642.018 rows=161862 loops=1) Total runtime: 2408.005 ms (6 rows) lip_develop=>
郵便番号重複(他県住所違い)を取得。2.2秒。上の同県内住所違いとどっこいどっこい。
lip_develop=> EXPLAIN ANALYZE select a.id,b.id,a.postcode,a.perfecture,a.address from account_postaddressmap as a, account_postaddressmap as b where a.id != b.id and a.postcode = b.postcode and a.address != b.address and a.perfecture != b.perfecture; QUERY PLAN -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- Merge Join (cost=0.00..29929.92 rows=234982 width=52) (actual time=838.928..2209.211 rows=6 loops=1) Merge Cond: ((a.postcode)::text = (b.postcode)::text) Join Filter: ((a.id <> b.id) AND ((a.address)::text <> (b.address)::text) AND ((a.perfecture)::text <> (b.perfecture)::text)) -> Index Scan using account_postaddressmap_postcode on account_postaddressmap a (cost=0.00..8290.73 rows=122126 width=48) (actual time=0.259..409.955 rows=122254 loops=1) -> Index Scan using account_postaddressmap_postcode on account_postaddressmap b (cost=0.00..8290.73 rows=122126 width=48) (actual time=0.021..663.080 rows=161862 loops=1) Total runtime: 2209.386 ms (6 rows) lip_develop=>
以上。MySQLだともっと速くなる・・・?