Last edited by Oeasy on 2013-11-17 09:54

A dead-simple tutorial on scraping a website and building an mdx from it (20131114)

Software used
0. Operating system: Windows 7 Ultimate, 64-bit
1. Scraping tool: wget, http://users.ugent.be/~bpuype/wget/, http://baike.baidu.com/view/1312507.htm
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ This dictionary is in the public domain, and the site does not restrict crawling (at least not as far as I can tell right now). Getting the index is also very easy, so I use it as the example.
Also: there is a pdf at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536, edition unknown; it seems to be the 17th edition.

Steps
1. Get the index
Looking at http://www.infoplease.com/dictionary/brewers/, the site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt file with the following content:

These addresses all come from looking at the site above. Name the txt download.txt.
I put download.txt and wget.exe (if the wget you downloaded is named wget+version number.exe, you may as well rename it to wget.exe) together in D:\DOPF.

cmd.exe -> CD /D D:\DOPF -> wget -i download.txt
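
For anyone without wget at hand, here is a minimal Python sketch of roughly what wget -i download.txt does: read one URL per line and save each page under its last path segment. The function names read_url_list and download_all are made up for illustration. Unlike wget, this sketch overwrites a file when two URLs share the same name and makes no retries.

```python
import os
import urllib.request
from urllib.parse import urlsplit


def read_url_list(path):
    """Return the non-empty lines of a URL list file such as download.txt."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def download_all(urls, out_dir):
    """Fetch each URL and save it under its last path segment.

    Returns the list of URLs that failed so they can be retried later.
    Note: a later URL with the same file name overwrites the earlier one.
    """
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url in urls:
        name = os.path.basename(urlsplit(url).path) or "index.html"
        try:
            with urllib.request.urlopen(url) as response:
                data = response.read()
        except OSError:  # URLError and HTTPError are subclasses of OSError
            failed.append(url)
            continue
        with open(os.path.join(out_dir, name), "wb") as out:
            out.write(data)
    return failed
```

With the layout used in this post, a call would look like download_all(read_url_list(r"D:\DOPF\download.txt"), r"D:\DOPF"). The returned list tells you which addresses to try again.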

Soon the 26 html files are downloaded. Tidying up these 26 html files gives:
http://www.infoplease.com/dictionary/brewers/a.html
http://www.infoplease.com/dictionary/brewers/a1.html
http://www.infoplease.com/dictionary/brewers/a-b.html
http://www.infoplease.com/dictionary/brewers/a-b-c.html
http://www.infoplease.com/dictionary/brewers/a-b-c-book.html
http://www.infoplease.com/dictionary/brewers/a-b-c-process.html
http://www.infoplease.com/dictionary/brewers/a-e-i-o-u.html
http://www.infoplease.com/dictionary/brewers/a-u-c.html
http://www.infoplease.com/dictionary/brewers/aaron.html
http://www.infoplease.com/dictionary/brewers/ab.html
http://www.infoplease.com/dictionary/brewers/aback.html
http://www.infoplease.com/dictionary/brewers/abacus.html
http://www.infoplease.com/dictionary/brewers/abaddon.html
http://www.infoplease.com/dictionary/brewers/abambou.html
http://www.infoplease.com/dictionary/brewers/abandon.html
http://www.infoplease.com/dictio ... on-fait-larron.html
http://www.infoplease.com/dictionary/brewers/abaris.html
http://www.infoplease.com/dictionary/brewers/abate.html
http://www.infoplease.com/dictionary/brewers/abaton.html
http://www.infoplease.com/dictionary/brewers/abbassides.html
http://www.infoplease.com/dictionary/brewers/abbey-laird.html
http://www.infoplease.com/dictionary/brewers/abbey-lubber.html
……
There are 16698 such links in total.

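The post does not say how the 26 index pages were tidied into this list (a text editor would do). As one possible approach, here is a minimal Python sketch that pulls entry links out of an index page. It assumes the entries appear as ordinary <a href="..."> links pointing under /dictionary/brewers/, which has not been verified against the live pages. LinkCollector and entry_links are illustrative names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def entry_links(index_html, base_url=BASE):
    """Return the unique .html links under base_url, in page order."""
    parser = LinkCollector(base_url)
    parser.feed(index_html)
    seen, result = set(), []
    for link in parser.links:
        if link.startswith(base_url) and link.endswith(".html") and link not in seen:
            seen.add(link)
            result.append(link)
    return result
```

Running entry_links over each of the 26 downloaded pages and writing the results one per line would produce a new list that wget -i can read.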
2. Fetch the content
Same as before, wget -i download.txt,
this time to pull down all N of the html pages listed above; after that it is easy.
-2013-11-14 16:35:47
16695 html files were fetched successfully and 3 were missed. I could not be bothered to work out which 3.
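
If you do want to track down the missing pages, a minimal Python sketch follows. It assumes each page was saved under the last path segment of its URL, which is wget's usual behaviour. The helper names are made up for illustration. It also lists file names that occur more than once in the link list, since duplicates are one possible reason for ending up with fewer files than links. That explanation is only a guess.

```python
import os
from collections import Counter
from urllib.parse import urlsplit


def expected_name(url):
    """File name wget would normally use for a URL: its last path segment."""
    return os.path.basename(urlsplit(url).path)


def missing_urls(urls, downloaded_names):
    """Return the URLs whose expected file name is not among downloaded_names."""
    present = set(downloaded_names)
    return [url for url in urls if expected_name(url) not in present]


def duplicate_names(urls):
    """Return file names that several URLs map to (case-insensitive, as on Windows)."""
    counts = Counter(expected_name(url).lower() for url in urls)
    return sorted(name for name, count in counts.items() if count > 1)
```

Typical use: read the lines of download.txt into a list, pass it to missing_urls together with os.listdir(r"D:\DOPF"), and retry whatever comes back.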
3. Extract the text
Inspection shows that each entry's content sits between the first <h1> and <div class="source">:
<h1>Charybdis</h1>

<p> [ch=k]. A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>
I used TextForever to extract the text.
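
For readers without TextForever, the same cut can be sketched in a few lines of Python. This is not TextForever's behaviour, just a plain string search for the two markers named above. extract_entry is an illustrative name, and the sketch leaves out the <div class="source"> line itself; whether to keep that line is a matter of taste.

```python
def extract_entry(page_html):
    """Return the text from the first <h1> up to <div class="source">.

    Returns None when either marker is missing, so odd pages can be
    reviewed by hand instead of being silently mangled.
    """
    start = page_html.find("<h1>")
    if start == -1:
        return None
    end = page_html.find('<div class="source">', start)
    if end == -1:
        return None
    return page_html[start:end].strip()
```

Pages for which this returns None are worth opening by hand, since they do not follow the layout observed above.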

Once extraction is done, merge the 16695 resulting html files.

For this dictionary I decided, after some thought, not to use the "prefix file contents with the file name" option; in some cases you do need it, because it makes extracting keywords easier. Testing showed that "append a blank line after file contents" is still needed.

This gives dopf-src.txt. Working on this txt produces a txt that can be built into an mdx.

4. Build the mdx
The merged text looks like this:

Clearly the dictionary at http://www.infoplease.com/dictionary/brewers/ is xml, and since the PC version of MDict does not support xml+css, the xml tags have to be replaced with html tags. This is done through the series of operations below.

The final text after processing looks like this:

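As a rough illustration of a buildable source text, the following Python sketch wraps one extracted entry in the record layout commonly used for MdxBuilder source files: a headword line, the html body, then a line containing only </>. That layout is an assumption here, as the post itself does not spell it out, and the sketch does not reproduce the author's tag replacements. to_mdx_record is an illustrative name; the headword is taken from the entry's <h1>.

```python
import html
import re

H1 = re.compile(r"<h1>(.*?)</h1>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def to_mdx_record(entry_html):
    """Turn one extracted entry into a headword/body/terminator record."""
    match = H1.search(entry_html)
    if match is None:
        raise ValueError("entry has no <h1> headword")
    # Strip any inline tags and decode entities so the headword is plain text.
    headword = html.unescape(TAG.sub("", match.group(1))).strip()
    body = entry_html.strip()
    return f"{headword}\n{body}\n</>\n"
```

Concatenating the records for all entries gives one text file to feed to the builder.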
Then write a little css:

I ran into a few small problems along the way and solved them one by one. Finally, the finished product:

Doesn't it look a bit easier on the eye than the online version?
http://www.infoplease.com/dictionary/brewers/comb.html
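
PS: It is finished, but I have spotted some problems. As the screenshot above shows, some words are missing the space between them. I do not plan to fix this for now; I will share the result once I find time to fix it. If anyone would like to fix it as practice, PM me and I will send you the downloaded pages.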
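
One plausible cause of the missing spaces, not confirmed in the post, is that the source html wraps sentences across lines (as in the Charybdis sample) and the line breaks were deleted outright when the text was merged. The minimal Python sketch below contrasts that with collapsing each whitespace run to a single space; both function names are illustrative.

```python
import re


def join_lines_naive(fragment):
    """Drop line breaks outright; this fuses the words on either side."""
    return fragment.replace("\r", "").replace("\n", "")


def join_lines_safe(fragment):
    """Collapse every whitespace run, line breaks included, to one space."""
    return re.sub(r"\s+", " ", fragment).strip()
```

For the fragment "Scylla and" followed by a line break and "Charybdis are employed", the naive version yields "andCharybdis" while the safe version keeps the space.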