PDAWIKI (掌上百科)


[Tutorial] Making the Dictionary of Phrase and Fable, E. Cobham Brewer, 1894

Posted 2013-11-14 08:24:13
Last edited by Oeasy on 2013-11-17 09:54

A web-scraping and mdx-making tutorial that could hardly be simpler (20131114)

Software used
0. Operating system: Windows 7 Ultimate 64-bit
1. Scraping tool: wget, http://users.ugent.be/~bpuype/wget/ (reference: http://baike.baidu.com/view/1312507.htm)
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ This dictionary is in the public domain, and the site has no crawl restrictions (at least none for now, as far as I can tell). Getting the index is also very easy, so I use it as the example.
Also: there is a PDF at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536; the edition is unknown, but it appears to be the 17th.

Steps
1. Get the index
Look at http://www.infoplease.com/dictionary/brewers/: the site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt file with the following content:
http://www.infoplease.com/dictionary/brewers/index-a.html
http://www.infoplease.com/dictionary/brewers/index-b.html
http://www.infoplease.com/dictionary/brewers/index-c.html
http://www.infoplease.com/dictionary/brewers/index-d.html
http://www.infoplease.com/dictionary/brewers/index-e.html
http://www.infoplease.com/dictionary/brewers/index-f.html
http://www.infoplease.com/dictionary/brewers/index-g.html
http://www.infoplease.com/dictionary/brewers/index-h.html
http://www.infoplease.com/dictionary/brewers/index-i.html
http://www.infoplease.com/dictionary/brewers/index-j.html
http://www.infoplease.com/dictionary/brewers/index-k.html
http://www.infoplease.com/dictionary/brewers/index-l.html
http://www.infoplease.com/dictionary/brewers/index-m.html
http://www.infoplease.com/dictionary/brewers/index-n.html
http://www.infoplease.com/dictionary/brewers/index-o.html
http://www.infoplease.com/dictionary/brewers/index-p.html
http://www.infoplease.com/dictionary/brewers/index-q.html
http://www.infoplease.com/dictionary/brewers/index-r.html
http://www.infoplease.com/dictionary/brewers/index-s.html
http://www.infoplease.com/dictionary/brewers/index-t.html
http://www.infoplease.com/dictionary/brewers/index-u.html
http://www.infoplease.com/dictionary/brewers/index-v.html
http://www.infoplease.com/dictionary/brewers/index-w.html
http://www.infoplease.com/dictionary/brewers/index-x.html
http://www.infoplease.com/dictionary/brewers/index-y.html
http://www.infoplease.com/dictionary/brewers/index-z.html

These addresses were all worked out by looking at the site above. Name the txt file download.txt.
I put download.txt and wget.exe in D:\DOPF (if the wget you downloaded is named wget+version number.exe, you may as well rename it to wget.exe).
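
If you would rather not type the 26 lines by hand, the list can be generated. This is only a small Python sketch, not part of the original workflow; the function names are made up for illustration, and the URL pattern is the index-a.html to index-z.html one listed above.

```python
import string

# Base address taken from the tutorial above.
BASE = "http://www.infoplease.com/dictionary/brewers/"


def index_urls(base=BASE):
    """Return the 26 index page URLs, index-a.html through index-z.html."""
    return [f"{base}index-{letter}.html" for letter in string.ascii_lowercase]


def write_url_list(path, urls):
    """Write one URL per line, which is the layout `wget -i` reads."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("\n".join(urls) + "\n")
```

Calling write_url_list(r"D:\DOPF\download.txt", index_urls()) would produce the same file as the hand-written one.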

cmd.exe -> CD /D D:\DOPF -> wget -i download.txt

Soon the 26 html files are downloaded. Tidying up these 26 html files gives:
http://www.infoplease.com/dictionary/brewers/a.html
http://www.infoplease.com/dictionary/brewers/a1.html
http://www.infoplease.com/dictionary/brewers/a-b.html
http://www.infoplease.com/dictionary/brewers/a-b-c.html
http://www.infoplease.com/dictionary/brewers/a-b-c-book.html
http://www.infoplease.com/dictionary/brewers/a-b-c-process.html
http://www.infoplease.com/dictionary/brewers/a-e-i-o-u.html
http://www.infoplease.com/dictionary/brewers/a-u-c.html
http://www.infoplease.com/dictionary/brewers/aaron.html
http://www.infoplease.com/dictionary/brewers/ab.html
http://www.infoplease.com/dictionary/brewers/aback.html
http://www.infoplease.com/dictionary/brewers/abacus.html
http://www.infoplease.com/dictionary/brewers/abaddon.html
http://www.infoplease.com/dictionary/brewers/abambou.html
http://www.infoplease.com/dictionary/brewers/abandon.html
http://www.infoplease.com/dictio ... on-fait-larron.html
http://www.infoplease.com/dictionary/brewers/abaris.html
http://www.infoplease.com/dictionary/brewers/abate.html
http://www.infoplease.com/dictionary/brewers/abaton.html
http://www.infoplease.com/dictionary/brewers/abbassides.html
http://www.infoplease.com/dictionary/brewers/abbey-laird.html
http://www.infoplease.com/dictionary/brewers/abbey-lubber.html
……

There are 16698 links like these in total.
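
The "tidying up" of the index pages can also be scripted. Below is a rough Python sketch of one way to do it, not the method actually used here. It assumes the entries appear as ordinary <a href> links pointing into /dictionary/brewers/; the real markup of the index pages is not shown in this post, so the filter may need adjusting. All names are illustrative.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"
# The A-Z index pages themselves, which are not dictionary entries.
INDEX_PAGE = re.compile(r"index-[a-z]\.html")


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def entry_links(html, page_url, base=BASE):
    """Return absolute entry URLs found in one index page, in page order."""
    parser = LinkCollector()
    parser.feed(html)
    seen, links = set(), []
    for href in parser.hrefs:
        url = urljoin(page_url, href)  # resolve relative links
        if not url.startswith(base):
            continue
        name = url[len(base):]
        # Assumption: entries are single .html files directly under the base.
        if "/" in name or not name.endswith(".html"):
            continue
        if INDEX_PAGE.fullmatch(name) or url in seen:
            continue
        seen.add(url)
        links.append(url)
    return links
```

Running entry_links over each of the 26 downloaded files and writing the results out one per line would give a new download.txt for the next step.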

2. Scrape the content
In the same way, run wget -i download.txt (with download.txt now holding the links above)
to fetch all N html pages; after that it is simple.
- 2013-11-14 16:35:47
Successfully fetched 16695 html files; 3 were missed, and I could not be bothered to work out which 3.
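
For anyone who does want to know which 3 went missing, comparing the URL list with the downloaded file names is enough. A small Python sketch, assuming wget saved each page under the last path component of its URL (its default behaviour when no output options are given); the function names are made up for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def expected_name(url):
    """File name assumed for a URL: the last component of its path."""
    return posixpath.basename(urlsplit(url).path)


def missing_urls(urls, downloaded_names):
    """Return the URLs whose expected file is not among the downloaded names."""
    have = set(downloaded_names)
    return [u for u in urls if expected_name(u) not in have]
```

Usage would be something like missing_urls(open("download.txt").read().split(), os.listdir(r"D:\DOPF")). The leftover URLs can go into a new txt and be fed to wget -i again.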

3. Text extraction
Inspection shows that the entry content lies between the first <h1> and <div class="source">:
<h1>Charybdis</h1>

<p> [ch=k]. A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>

Use TextForever to extract the text.

Once extraction is done, merge the resulting 16695 html files.

While making this dictionary I thought it over and decided not to use "prefix file content with the file name"; in some cases you do need it, since it makes extracting keywords easier. After testing, though, "add a blank line after file content" is still needed.

This gives dopf-src.txt. Work on this txt to get a txt that can be built into an mdx.
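
The same extract-and-merge step can be done without TextForever. The Python sketch below is an alternative, not what was actually run for this dictionary. It keeps everything from the first <h1> up to, but not including, <div class="source">, and adds a blank line after each entry, as described above. The function names are made up for illustration.

```python
import re

H1_OPEN = re.compile(r"<h1\b", re.IGNORECASE)
SOURCE_DIV = '<div class="source">'


def extract_entry(html):
    """Return the text from the first <h1> up to <div class="source">.

    Returns None when either marker is missing, so odd pages can be
    inspected by hand instead of being merged silently.
    """
    start = H1_OPEN.search(html)
    if start is None:
        return None
    end = html.find(SOURCE_DIV, start.start())
    if end == -1:
        return None
    return html[start.start():end].strip()


def merge_entries(pages):
    """Concatenate the extracted entries, with a blank line after each one."""
    parts = []
    for html in pages:
        entry = extract_entry(html)
        if entry is not None:
            parts.append(entry + "\n\n")
    return "".join(parts)
```

Feeding merge_entries the contents of the downloaded html files and writing the result to dopf-src.txt would give the merged source. The encoding of the pages is not stated in this post, so it has to be checked before reading them.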

4. Make the mdx
The merged text looks like this:

The dictionary at http://www.infoplease.com/dictionary/brewers/ is evidently xml. Since the PC version of MDict does not support xml+css, the xml tags have to be replaced with html tags, through the series of operations below.

The final text after processing looks like this:
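
As a rough illustration of what a buildable source text can look like, here is a Python sketch that turns one extracted entry into one record. It assumes the plain-text layout commonly used with MdxBuilder: a headword line, then the HTML body, then a line containing only </>. That layout and the function name are assumptions of this sketch. The exact tag replacements used for this dictionary are not listed in the post and are not reproduced here.

```python
import html
import re

H1_TEXT = re.compile(r"<h1[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)
ANY_TAG = re.compile(r"<[^>]+>")


def to_mdx_record(entry_html):
    """Build one source record: headword, one-line HTML body, then </>.

    Assumed layout (not confirmed by the post): the headword is the plain
    text of the first <h1>, and each record ends with a line holding </>.
    """
    match = H1_TEXT.search(entry_html)
    if match is None:
        raise ValueError("entry has no <h1> headword")
    # Strip inner tags and decode entities so the headword is plain text.
    headword = html.unescape(ANY_TAG.sub("", match.group(1)))
    headword = " ".join(headword.split())
    # Put the body on one line; wrapped lines are joined with a space.
    body = " ".join(entry_html.split())
    return f"{headword}\n{body}\n</>\n"
```

Concatenating the records for all entries gives a single txt that a builder expecting this layout could read.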

Then write a bit of css.

I ran into a few small problems along the way and solved them one by one. Finally, the finished product:

Doesn't it look a little easier on the eye than the online version?
http://www.infoplease.com/dictionary/brewers/comb.html

PS: Although it is finished, I found some problems. As the screenshots above show, some words are missing the space between them. I have no plans to fix this for now; I will share the result once I find time to fix it. If anyone is interested in fixing it for practice, PM me and I will send you the downloaded pages.
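
One guess about the missing spaces, not confirmed: the entry text in the pages is hard-wrapped (for example "Scylla and" at the end of one line and "Charybdis" at the start of the next), so if line breaks are deleted during processing instead of being replaced by a space, the two words fuse together. If that is the cause, joining lines as in this Python sketch would avoid it. The function name is made up for illustration.

```python
import re

# A run of line breaks, together with any spaces or tabs around it.
LINE_BREAK_RUN = re.compile(r"[ \t]*(?:\r?\n)+[ \t]*")


def join_wrapped_lines(text):
    """Replace each run of line breaks with a single space."""
    return LINE_BREAK_RUN.sub(" ", text).strip()
```

This has to be applied to the original downloaded pages; once the words are already fused in the merged txt, the spaces cannot be recovered reliably.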

Posted 2013-11-14 16:26:41
This thread deserves a bump!

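Posted 2013-11-15 23:30:45 (from mobile)
Hello, and thank you for the tutorial.
Following your tutorial I finished step 1, but I am at a loss as to how to do step 2 efficiently, that is, the step you describe of fetching over a thousand pages. Typing them in by hand one at a time is one way, but it is not efficient. Do you have a way to get the page for every word in bulk? I would be grateful for a pointer or two.

The site I want to scrape is:
http://zokugo-dict.com/

The gojūon table on the right is the index.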


Original poster | Posted 2013-11-16 14:14:53
liuyunrushui, posted 2013-11-15 23:30:
Hello, and thank you for the tutorial.
Following your tutorial I finished step 1, but how to do step 2 efficiently, that is, the step you describe ...

cmd.exe

wget -i download.txt
Put all the page links in download.txt; see http://baike.baidu.com/view/1312507.htm for reference. You can also write your own program to do the scraping. Combined with awk and the like it can actually be faster, and once the scraping is done the dictionary is as good as made.
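
As an example of the "write your own program" route, here is a minimal Python sketch of a downloader that reads the same kind of URL list. It is only an illustration, not what was used for this dictionary. The function names and the delay parameter are made up, and only the standard library is used.

```python
import os
import posixpath
import time
import urllib.request
from urllib.parse import urlsplit


def fetch(url, timeout=30):
    """Download one URL and return the raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, out_dir, fetch=fetch, delay=0.0):
    """Save every URL into out_dir; return a list of (url, error) failures.

    Files that already exist are skipped, so an interrupted run can simply
    be started again.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url in urls:
        name = posixpath.basename(urlsplit(url).path)
        target = os.path.join(out_dir, name)
        if os.path.exists(target):
            continue
        try:
            data = fetch(url)
        except OSError as exc:  # urllib's URLError and HTTPError are OSErrors
            failed.append((url, str(exc)))
            continue
        with open(target, "wb") as f:
            f.write(data)
        time.sleep(delay)  # optional pause between requests
    return failed
```

The returned list shows at once which pages failed, so they can be retried later.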


Posted 2014-4-1 09:02:21
thank you very much