Oeasy posted on 2013-11-14 08:24:13

Making Dictionary of Phrase and Fable, E. Cobham Brewer, 1894

Last edited by Oeasy on 2013-11-17 09:54


A tutorial on scraping web pages and building an mdx that could hardly be simpler (20131114)

Software used
0. Operating system: Windows 7 Ultimate, 64-bit
1. Scraping tool: wget, http://users.ugent.be/~bpuype/wget/, http://baike.baidu.com/view/1312507.htm
2. Text processing: EditPlus, UltraEdit, TextForever (http://www.comicer.com/stronghorse/software/index.htm#TextForever)

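Target dictionary
Dictionary of Phrase and Fable, 1894: http://www.infoplease.com/dictionary/brewers/ This dictionary is in the public domain, the site has no scraping restrictions (at least none that are apparent so far), and the index is very easy to obtain, so it serves as the example here.
Also: there is a pdf at http://pan.baidu.com/share/link?shareid=267207&uk=2063908536; the edition is unknown, though it appears to be the 17th.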

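Steps
1. Get the index
Looking at http://www.infoplease.com/dictionary/brewers/, the site itself lets you browse the whole dictionary, so getting the index is very easy.
Create a new txt file with the following content: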
http://www.infoplease.com/dictionary/brewers/index-a.html
http://www.infoplease.com/dictionary/brewers/index-b.html
http://www.infoplease.com/dictionary/brewers/index-c.html
http://www.infoplease.com/dictionary/brewers/index-d.html
http://www.infoplease.com/dictionary/brewers/index-e.html
http://www.infoplease.com/dictionary/brewers/index-f.html
http://www.infoplease.com/dictionary/brewers/index-g.html
http://www.infoplease.com/dictionary/brewers/index-h.html
http://www.infoplease.com/dictionary/brewers/index-i.html
http://www.infoplease.com/dictionary/brewers/index-j.html
http://www.infoplease.com/dictionary/brewers/index-k.html
http://www.infoplease.com/dictionary/brewers/index-l.html
http://www.infoplease.com/dictionary/brewers/index-m.html
http://www.infoplease.com/dictionary/brewers/index-n.html
http://www.infoplease.com/dictionary/brewers/index-o.html
http://www.infoplease.com/dictionary/brewers/index-p.html
http://www.infoplease.com/dictionary/brewers/index-q.html
http://www.infoplease.com/dictionary/brewers/index-r.html
http://www.infoplease.com/dictionary/brewers/index-s.html
http://www.infoplease.com/dictionary/brewers/index-t.html
http://www.infoplease.com/dictionary/brewers/index-u.html
http://www.infoplease.com/dictionary/brewers/index-v.html
http://www.infoplease.com/dictionary/brewers/index-w.html
http://www.infoplease.com/dictionary/brewers/index-x.html
http://www.infoplease.com/dictionary/brewers/index-y.html
http://www.infoplease.com/dictionary/brewers/index-z.html
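These addresses all come from looking at the site above. Name the txt file download.txt.
I put download.txt and wget.exe (if the wget you downloaded is named wget plus a version number, you may as well rename it to wget.exe) together in D:\DOPF.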
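The 26 index addresses differ only in the letter, so the list does not have to be typed by hand. Below is a minimal Python sketch that writes the same download.txt. The function names index_urls and write_download_list are made up for illustration; the URL pattern is the one listed above.

```python
import string

BASE = "http://www.infoplease.com/dictionary/brewers/"


def index_urls():
    """Return the 26 index page URLs, index-a.html through index-z.html."""
    return [f"{BASE}index-{letter}.html" for letter in string.ascii_lowercase]


def write_download_list(path="download.txt"):
    """Write one URL per line, the format that wget -i reads."""
    with open(path, "w", encoding="ascii", newline="\n") as f:
        f.write("\n".join(index_urls()) + "\n")


if __name__ == "__main__":
    write_download_list()
```

Run it in D:\DOPF and the resulting download.txt can be fed to wget exactly as in the next step.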

cmd.exe -> cd /d D:\DOPF -> wget -i download.txt

Very soon the 26 html files are downloaded. Extracting the entry links from these 26 files gives:
http://www.infoplease.com/dictionary/brewers/a.html
http://www.infoplease.com/dictionary/brewers/a1.html
http://www.infoplease.com/dictionary/brewers/a-b.html
http://www.infoplease.com/dictionary/brewers/a-b-c.html
http://www.infoplease.com/dictionary/brewers/a-b-c-book.html
http://www.infoplease.com/dictionary/brewers/a-b-c-process.html
http://www.infoplease.com/dictionary/brewers/a-e-i-o-u.html
http://www.infoplease.com/dictionary/brewers/a-u-c.html
http://www.infoplease.com/dictionary/brewers/aaron.html
http://www.infoplease.com/dictionary/brewers/ab.html
http://www.infoplease.com/dictionary/brewers/aback.html
http://www.infoplease.com/dictionary/brewers/abacus.html
http://www.infoplease.com/dictionary/brewers/abaddon.html
http://www.infoplease.com/dictionary/brewers/abambou.html
http://www.infoplease.com/dictionary/brewers/abandon.html
http://www.infoplease.com/dictionary/brewers/abandon-fait-larron.html
http://www.infoplease.com/dictionary/brewers/abaris.html
http://www.infoplease.com/dictionary/brewers/abate.html
http://www.infoplease.com/dictionary/brewers/abaton.html
http://www.infoplease.com/dictionary/brewers/abbassides.html
http://www.infoplease.com/dictionary/brewers/abbey-laird.html
http://www.infoplease.com/dictionary/brewers/abbey-lubber.html
……

There are 16698 links like these in total.
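The post does not say how the links were pulled out of the 26 index pages (a text editor would do). One way to do it in Python is sketched below. It assumes the index pages use ordinary <a href="..."> links to the entry pages and that every entry lives directly under /dictionary/brewers/; both are assumptions, and the names LinkCollector and entry_links are invented for this example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

BASE = "http://www.infoplease.com/dictionary/brewers/"
# Only index-a.html .. index-z.html are treated as index pages; an entry
# whose file name merely starts with "index" is kept.
INDEX_PAGE = re.compile(r"^index-[a-z]\.html$")


class LinkCollector(HTMLParser):
    """Collect the absolute URL of every <a href> on one page."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.page_url, href))


def entry_links(html_text, page_url):
    """Return entry-page URLs found in one index page, in order, no duplicates."""
    parser = LinkCollector(page_url)
    parser.feed(html_text)
    seen = set()
    result = []
    for url in parser.links:
        if not url.startswith(BASE):
            continue  # link leaves the dictionary
        name = url[len(BASE):]
        if "/" in name or not name.endswith(".html") or INDEX_PAGE.match(name):
            continue  # sub-directory, non-page, or another index page
        if url not in seen:
            seen.add(url)
            result.append(url)
    return result
```

Calling entry_links on each downloaded index file and writing the combined result, one URL per line, gives the link list for the next step.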

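2. Scrape the content
Same as before, wget -i download.txt, this time with download.txt holding the entry links above.
This downloads all N html pages listed above, and the rest is simple.
-2013-11-14 16:35:47
16695 html files were downloaded successfully and 3 were missed; I did not bother to work out which 3.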
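Finding the missing pages does not take much: compare the link list with the file names on disk. The sketch below assumes each page was saved under the last segment of its URL (wget's default when no renaming is needed); the function names are hypothetical.

```python
import posixpath
from urllib.parse import urlsplit


def expected_filename(url):
    """File name a page is assumed to be saved under: the last URL path segment."""
    return posixpath.basename(urlsplit(url).path)


def missing_urls(urls, downloaded_names):
    """Return the URLs whose expected file name is not among the downloaded files."""
    have = set(downloaded_names)
    return [u for u in urls if expected_filename(u) not in have]
```

A possible use, run from the download folder: missing_urls(open("download.txt").read().split(), os.listdir(".")) after importing os. Note that on Windows file names are case-insensitive, so two links that differ only in letter case would be reported as both present even if one overwrote the other.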

3. Extract the text
Inspection shows that each entry's content lies between the first <h1> and <div class="source">:
<h1>Charybdis</h1>

<p> . A whirlpool on the coast of Sicily. Scylla and
Charybdis are employed to signify two equal dangers. Thus Horace says
an author trying to avoid Scylla, drifts into Charybdis,<em> i.e.</em>
seeking to avoid one fault, falls into another. The tale is that
Charybdis stole the oxen of Hercules, was killed by lightning, and
changed into the gulf.</p>
<p>“Thus when I shun Scylla, your father, I fall into Charybdis, your
mother.” —<cite>Shakespeare: Merchant of Venice,</cite> iii. 5.
</p>

<div class="source">Source: <cite>Dictionary of Phrase and Fable</cite>, E. Cobham Brewer, 1894</div>
Use TextForever to extract the text:
https://pdawiki.com/forum/data/attachment/album/201311/14/164301dkbgkujbjkfjfkj3.png
-
https://pdawiki.com/forum/data/attachment/album/201311/14/165405tyeokeajp5dwe1zr.png
Once extraction is done, merge the resulting 16695 html files:
https://pdawiki.com/forum/data/attachment/album/201311/14/170052k6q0zqqw0q0zw2qs.png
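For this dictionary I decided, after some thought, that there is no need to enable "prepend the file name to the file content"; in some cases that option is needed to make extracting keywords easier. Testing showed that "append a blank line after the file content" is still needed.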
https://pdawiki.com/forum/data/attachment/album/201311/14/170532emg5ze7g340ef6dl.png
This produces dopf-src.txt. Further processing of this txt gives a txt that can be built into an mdx.
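For readers without TextForever, the extract-and-merge step can be approximated in Python. This is only a sketch of the rule stated above (keep what lies between the first <h1> and <div class="source">), not a reproduction of the TextForever settings in the screenshots. It assumes the heading is a bare <h1> with no attributes, as in the Charybdis sample, and it keeps the <h1> line itself so the headword stays available; extract_entry and merge_entries are invented names.

```python
def extract_entry(html_text):
    """Return the text from the first <h1> up to <div class="source">, or None."""
    start = html_text.find("<h1>")
    if start == -1:
        return None
    end = html_text.find('<div class="source">', start)
    if end == -1:
        return None
    return html_text[start:end].strip()


def merge_entries(pages):
    """Join the extracted entries, adding a blank line after each one."""
    parts = []
    for html_text in pages:
        entry = extract_entry(html_text)
        if entry is not None:
            parts.append(entry + "\n\n")
    return "".join(parts)
```

Reading every downloaded html file, passing the contents to merge_entries and writing the result to a file yields something comparable to dopf-src.txt.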

4. Build the mdx
The merged text looks like this:
https://pdawiki.com/forum/data/attachment/album/201311/14/172055rklvl00cc1chhgal.png
https://pdawiki.com/forum/data/attachment/album/201311/14/173007i53xe8exmebx8s3l.png
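Clearly the dictionary at http://www.infoplease.com/dictionary/brewers/ is xml. Since the PC version of MDict does not support xml+css, the xml tags have to be replaced with html tags, through the following series of operations: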
https://pdawiki.com/forum/data/attachment/album/201311/14/184818l5vyao7ke6gk13e8.png

After processing, the final text looks like this:
https://pdawiki.com/forum/data/attachment/album/201311/14/190542iojoz48ggge9k93o.png
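The exact replacements are only visible in the screenshots, so they are not reproduced here. What can be sketched is the general shape of an mdx source txt as MdxBuilder reads it: a headword line, the html definition, then a line containing only </>. The Python below turns one extracted fragment into such a record. It assumes the headword is the content of a bare <h1>, and the stylesheet name dopf.css and the function name to_mdx_record are made up for illustration.

```python
import html
import re

H1 = re.compile(r"<h1>(.*?)</h1>", re.DOTALL)
TAG = re.compile(r"<[^>]+>")


def to_mdx_record(fragment, css_name="dopf.css"):
    """Build one source record: headword line, html body, '</>' separator line."""
    match = H1.search(fragment)
    if match is None:
        return None
    # Headword: heading text with inner tags removed, entities decoded,
    # and whitespace collapsed onto a single line.
    headword = html.unescape(TAG.sub("", match.group(1)))
    headword = " ".join(headword.split())
    link = f'<link rel="stylesheet" type="text/css" href="{css_name}" />'
    return f"{headword}\n{link}\n{fragment.strip()}\n</>\n"
```

Concatenating the records for all entries gives a txt in the headword / body / </> layout; the tag replacements described in the post would still have to be applied to the body.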

Then write a little css:
https://pdawiki.com/forum/data/attachment/album/201311/14/184900is7eqdczrde7l757.png

A few small problems came up along the way and were solved one by one. Finally, the finished product:
https://pdawiki.com/forum/data/attachment/album/201311/14/185139c2yufcmwx4gwufyr.png
Doesn't it look a little nicer than the online version?
http://www.infoplease.com/dictionary/brewers/comb.html
https://pdawiki.com/forum/data/attachment/album/201311/14/185140lco0r0mzooo10cr9.png

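PS: Although it is finished, I found some problems. As the screenshots above show, some words are missing the space between them. I do not plan to fix this for now and will share the result once I have time to finish the corrections. If anyone would like to fix it for practice, PM me and I will send you the downloaded pages.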
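The post does not say what causes the missing spaces. One guess, and it is only a guess, is that the source html wraps its text across lines (as in the Charybdis sample, where "Scylla and" ends one line and "Charybdis" starts the next) and that the line breaks were deleted rather than replaced with a space somewhere in the processing. If that is the cause, joining lines as in this Python sketch would avoid it; join_lines is a hypothetical helper.

```python
import re


def join_lines(text):
    """Join wrapped lines, turning each line break (and the whitespace around it) into one space."""
    return re.sub(r"\s*\n\s*", " ", text).strip()


wrapped = "Scylla and\nCharybdis are employed"
print(wrapped.replace("\n", ""))  # naive deletion fuses "and" with "Charybdis"
print(join_lines(wrapped))        # a single space is kept between the two words
```

The fix has to be applied to the downloaded pages before the line breaks are removed, since fused words cannot be split again reliably afterwards.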

Hugh posted on 2013-11-14 16:26:41

This thread deserves a bump!

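liuyunrushui posted on 2013-11-15 23:30:45

Hello, and thank you for the tutorial.
I followed it and finished step one, but I am at a loss as to how to complete step two efficiently, that is, the step where you scrape the thousand-plus pages. Entering them by hand one at a time is one way, but it is not efficient. Is there a way to obtain the page for every word in bulk? I would be grateful for any pointers.

The site I want to scrape is:
http://zokugo-dict.com/

The fifty-sounds (gojūon) table on the right is the index.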

Oeasy posted on 2013-11-16 14:14:53

liuyunrushui posted on 2013-11-15 23:30
Hello, and thank you for the tutorial.
I followed it and finished step one, but how to complete step two efficiently, that is, the step where you ...


cmd.exe

wget -i download.txt
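All the page links go in download.txt; see http://baike.baidu.com/view/1312507.htm. You can also write your own program to do the scraping. Combined with awk and the like it can actually be faster: by the time the scraping finishes, the dictionary is built too.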
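As a rough idea of what "write your own program" could look like, here is a minimal Python sketch that does the job of wget -i: it reads a list of URLs, saves each page under the last segment of its URL, and returns the ones that failed instead of dropping them silently. The names fetch_url and download_all are invented for this example, and there is no retry or rate limiting.

```python
import os
import posixpath
import urllib.request
from urllib.parse import urlsplit


def fetch_url(url, timeout=30):
    """Download one page and return its raw bytes."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()


def download_all(urls, out_dir, fetch=fetch_url):
    """Save every URL into out_dir; return a list of (url, error message) for failures."""
    os.makedirs(out_dir, exist_ok=True)
    failed = []
    for url in urls:
        name = posixpath.basename(urlsplit(url).path) or "index.html"
        try:
            data = fetch(url)
        except OSError as exc:  # urllib's URLError and HTTPError are OSError subclasses
            failed.append((url, str(exc)))
            continue
        with open(os.path.join(out_dir, name), "wb") as f:
            f.write(data)
    return failed
```

A possible call is download_all(open("download.txt").read().split(), "pages"); whatever comes back in the failed list can simply be fed in again.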

tovaremeterio posted on 2014-4-1 09:02:21

thank you very much