Word and Phrase Origins [2008 edition, high resolution, copyable text]
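Could everyone take a look at this: http://www.baidu.com/link?url=YB5QcrR_p5U5jT2-nxrWpjMnsxwQQziO1UtOOzcp52wFa0JQWY70UqkZySn_g8bANBJigasAC20NXFW8hNMJda and see whether it can be converted to mdx?
dingyang posted on 2013-9-27 23:06:
PDFs are very hard to convert to mdx.
It isn't hard, exactly. It's just that, expert or beginner, you would probably need at least two hundred hours to turn the PDF in the original poster's link into a presentable mdx.
A text-based PDF only saves you the OCR step compared with a scanned PDF; it is still a very long way from a txt file that can be built directly into an mdx.
Self-help is better than help from others; God helps those that help themselves. Roll up your own sleeves and you'll want for nothing. Whoever is interested should get started.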
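Here are a few ideas:
1. Convert the PDF to HTML. The bold on the headwords may survive, but there will be plenty of problems: the PDF pages are set in two columns, so the content ends up out of order after conversion. In the end it wears you out, and you might as well copy and paste entry by entry.
2. Convert the PDF to Word. The two columns may turn into text boxes, which is somewhat easier to work with, but you may still find in the end that copying and pasting entry by entry is no worse.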
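The out-of-order text in idea 1 comes from reading a two-column page row by row. A minimal sketch of one way around it, assuming some extractor can hand back text blocks with bounding boxes (the `Block` tuple layout, `reading_order`, and the sample coordinates are all hypothetical, not taken from any tool mentioned in this thread):

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): a text block and its bounding box, origin at top-left.
# This layout is an assumption; adapt it to whatever your extractor returns.
Block = Tuple[float, float, float, float, str]


def reading_order(blocks: List[Block], page_width: float) -> List[str]:
    """Return block texts with the left column first, each column top to bottom."""
    middle = page_width / 2

    def sort_key(block: Block):
        x0, y0, x1, _y1, _text = block
        center = (x0 + x1) / 2
        column = 0 if center < middle else 1  # 0 = left column, 1 = right column
        return (column, y0)

    return [block[4] for block in sorted(blocks, key=sort_key)]


# Hypothetical blocks, interleaved row by row the way a naive conversion reads them.
blocks = [
    (50, 100, 280, 140, "aardvark ..."),
    (320, 100, 550, 140, "abacus ..."),
    (50, 150, 280, 190, "aback ..."),
    (320, 150, 550, 190, "abandon ..."),
]
ordered = reading_order(blocks, page_width=600)
print(ordered)
```

This only handles a clean two-column page; headers, footers, and blocks that span both columns would need extra rules.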
I haven't actually tried any of this; for reference only.
Last edited by mikeee on 2018-11-25 18:08
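One approach should work: convert to docx with ABBYY FineReader first, then convert the docx to htm.
I don't have FineReader installed, so I ran ten pages through the online version at https://finereaderonline.com (the online service only allows ten OCR pages per day). The result is good: the page headers disappear from the htm automatically, the two columns become a single column, bold is preserved, and the end-of-line hyphens from the original PDF seem to have been removed. Paragraphs that run across a page break in the original PDF do not seem to have been merged, though.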
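The unmerged cross-page paragraphs could probably be stitched back together with a simple heuristic. A rough sketch, assuming the paragraphs are already available as a list of strings (`merge_split_paragraphs` and the list of paragraph-ending characters are illustrative assumptions, not something FineReader provides):

```python
from typing import List

# Characters that plausibly end a finished paragraph (an assumption; tune as needed).
PARAGRAPH_ENDINGS = ('.', '!', '?', '"', '”', ')')


def merge_split_paragraphs(paragraphs: List[str]) -> List[str]:
    """Join a paragraph to the previous one when it looks like a continuation.

    A paragraph counts as a continuation when the previous one does not end
    with closing punctuation AND this one starts with a lowercase letter.
    """
    merged: List[str] = []
    for para in paragraphs:
        para = para.strip()
        if not para:
            continue
        if merged and not merged[-1].endswith(PARAGRAPH_ENDINGS) and para[0].islower():
            merged[-1] = merged[-1] + ' ' + para
        else:
            merged.append(para)
    return merged


paras = [
    "A-Rod. People who have little or no knowledge of baseball might have",
    "trouble with these initials.",
    "around Cape Horn. An expression once used in whaling communities.",
]
result = merge_split_paragraphs(paras)
print(result)
```

Both conditions are needed because many headwords in this dictionary start with a lowercase letter, so a lowercase first letter alone does not prove a continuation.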
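A quick look in Chrome DevTools shows:
CSS selector p.Bodytext21 locates all the definitions
CSS selector p.Bodytext21>span.Bodytext2Bold locates the bold text inside a definition
I can't post images, so here are the docx and htm files (10 pages only). Baidu Pan link: https://pan.baidu.com/s/15Qc4tQeWcePy7AhTJLiJXQ extraction code: encg
After some tinkering, this python3 code processes the htm mentioned above, and what it produces can more or less be made into an mdx: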
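'''word and phrase origins test
'''
from pyquery import PyQuery as pq

file = r'WordandPhraseOrigins.htm'

# Read the htm file; try utf8 first, then fall back to gb2312.
try:
    html = open(file, 'rt', encoding='utf8').read()
except Exception as exc:
    print('error: {}. Trying gb2312...'.format(exc))
    try:
        html = open(file, 'rt', encoding='gb2312').read()
        print('Looks good')
    except Exception as exc:
        # SystemExit has to be raised, otherwise the script carries on without html.
        raise SystemExit('error: {}. Giving up...'.format(exc))

doc = pq(html)
css_text = 'p.Bodytext21'  # every definition paragraph
css_bold = 'p.Bodytext21>span.Bodytext2Bold'  # bold headword inside a definition

items = doc(css_text)
# For each paragraph: the bold headword tagged with "(hw)" on the first line
# (an empty line if there is no bold text), then the definition text.
text = doc(css_text).map(lambda idx, elm: pq(elm)(
    'span.Bodytext2Bold').text() + ('(hw)\n' if pq(elm)('span.Bodytext2Bold').text() else '\n') + pq(elm)('span.Bodytext20').text())

print('\n\n'.join(text[:60]))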
The output of the code above looks roughly like this: ...
A-Rod.(hw)
People who have little or no knowledge of baseball might have trouble with these initials. They are short for Alex Rodriguez, the famous Yankee baseball star.
around Cape Horn.(hw)
An expression once used in whaling communities to mean “being away on a whaling voyage.” One old poem went:
“I’ll tell your father, boys,” I cried To lads at play upon my lawn.
They chorused back, “You’ll have to go Around Cape Horn.”
around the horn.(hw)
In the days of the tall ships any sailor who had sailed around Cape Horn was entitled to spit to windward; otherwise, it was a serious infraction of nautical rules of conduct. Thus, the permissible practice of spitting to windward was called Cape Horn isn’t so named because it is shaped like a horn. Captain Schouten, the Dutch navigator who first rounded it in 1616, named it after Hoorn, his birthplace in northern Holland.
arrant thief; knight errant.(hw)
was originally just a variation of nomadic or vagabond, the word best known in a knight who roamed the country performing good deeds. But from its persistent use in expressions such as an a thief who roamed the countryside holding up victims, came to mean thorough, downright, or out-
...
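In the output above, some paragraphs (the lines of the poem, for instance) have no "(hw)" headword of their own. A minimal sketch of how such paragraphs might be folded into the preceding entry, assuming each item is a string shaped like the ones the script builds (`group_entries` and the sample items are hypothetical helpers, not part of the original script):

```python
from typing import List, Tuple

HW_MARK = '(hw)'  # the headword tag the script above appends


def group_entries(items: List[str]) -> List[Tuple[str, str]]:
    """Turn 'headword(hw)\\ndefinition' items into (headword, definition) pairs.

    Items without a headword are appended to the previous entry's definition.
    A headword-less item that comes before any entry is dropped.
    """
    entries: List[Tuple[str, str]] = []
    for item in items:
        head, _, body = item.partition('\n')
        if head.endswith(HW_MARK):
            headword = head[:-len(HW_MARK)].strip()
            entries.append((headword, body.strip()))
        elif entries:
            headword, previous = entries[-1]
            extra = (head + '\n' + body).strip()
            entries[-1] = (headword, (previous + '\n' + extra).strip())
    return entries


sample_items = [
    "around Cape Horn.(hw)\nAn expression once used in whaling communities.",
    "\n“I’ll tell your father, boys,” I cried",
    "around the horn.(hw)\nIn the days of the tall ships.",
]
entries = group_entries(sample_items)
for headword, definition in entries:
    print(headword, '=>', definition)
```

The headwords keep their trailing period here, exactly as they appear in the output; stripping it would be a separate clean-up step.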
While I'm at it, a plug for pyquery: doesn't it blow regex, bs4, and lxml out of the water?
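PDFs are very hard to convert to mdx.
After converting the PDF to txt, the format follows certain patterns, so with some programming it can be made into an mdict dictionary. I'm working on the code now...
shbf posted on 2013-9-29 17:38:
After converting the PDF to txt, the format follows certain patterns, so with some programming it can be made into an mdict dictionary. I'm working on the code now...
Looking forward to the new work. Thank you for the effort.
shbf posted on 2013-9-29 17:38:
After converting the PDF to txt, the format follows certain patterns, so with some programming it can be made into an mdict dictionary. I'm working on the code now...
Looking forward to the new work. Thank you for the effort. Many thanks!
The dictionary text has been exported and processed. It is essentially error-free, and the two-column problem is completely solved.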
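A few small defects remain, which you can correct yourself: 1. A space is missing after some periods, commas, and closing parentheses. 2. A space is missing between some year numbers and English words. Both problems are easy to fix.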
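Both spacing defects look like a job for a few regular expressions. A hedged sketch follows; `fix_spacing` and its patterns are illustrative guesses at what "easy to fix" could mean, and they are deliberately conservative so that abbreviations such as "U.S." and forms such as "1990s" are left alone:

```python
import re


def fix_spacing(text: str) -> str:
    """Insert the spaces that are missing after punctuation and around years."""
    # 1a. Comma or closing parenthesis directly followed by a letter.
    text = re.sub(r'([,)])(?=[A-Za-z])', r'\1 ', text)
    # 1b. Period after at least two lowercase letters, directly followed by a
    #     capital letter (skips "U.S." and lowercase strings like "www.foo").
    text = re.sub(r'(?<=[a-z]{2})\.(?=[A-Z])', '. ', text)
    # 2a. Four-digit number directly followed by two or more letters ("1820and").
    text = re.sub(r'(?<!\d)(\d{4})(?=[A-Za-z]{2})', r'\1 ', text)
    # 2b. Letter directly followed by a four-digit number ("in1616").
    text = re.sub(r'([A-Za-z])(\d{4})(?!\d)', r'\1 \2', text)
    return text


broken = ("He rounded it in1616,naming it after Hoorn.Cape Horn (the cape)is "
          "not horn-shaped; by the 1990s the U.S. Navy agreed.")
fixed = fix_spacing(broken)
print(fixed)
```

These are heuristics, so a spot check against the PDF is still worthwhile after running them.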
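Of course, to make an mdx the headwords still have to be marked. I marked them with {} up to the letter C; the rest needs to be checked against the PDF, which is a fair amount of work, so I'm stopping here. I'm posting the source text; anyone with time to spare is welcome to finish it!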
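Once the headwords are marked, turning the text into MDict source is mostly mechanical. A sketch under two assumptions that are not confirmed by the posted file: that each entry begins at the start of a line with its headword in {}, and that the target is the usual MdxBuilder text layout (headword line, definition line, then a line containing `</>`). `split_entries` and `to_mdict_source` are hypothetical names:

```python
import html
import re
from typing import List, Tuple

# An entry is assumed to start at the beginning of a line with "{headword}".
ENTRY_START = re.compile(r'^\{([^{}]+)\}', re.MULTILINE)


def split_entries(text: str) -> List[Tuple[str, str]]:
    """Split {headword}-marked text into (headword, definition) pairs."""
    matches = list(ENTRY_START.finditer(text))
    entries = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        # Collapse line breaks and repeated spaces inside the definition.
        definition = ' '.join(text[match.end():end].split())
        entries.append((match.group(1).strip(), definition))
    return entries


def to_mdict_source(entries: List[Tuple[str, str]]) -> str:
    """Render entries as headword / definition / </> blocks."""
    blocks = []
    for headword, definition in entries:
        # Definitions are treated as HTML, so escape &, < and >.
        blocks.append('{}\n{}\n</>\n'.format(headword, html.escape(definition, quote=False)))
    return ''.join(blocks)


sample = ("{A-Rod} People who have little or no knowledge of baseball.\n"
          "{around Cape Horn} An expression once used\n"
          "in whaling communities.\n")
source = to_mdict_source(split_entries(sample))
print(source)
```

Text that appears before the first marked headword is ignored, so the unmarked part of the file (after the letter C) would be silently dropped until it is marked.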
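http://pan.baidu.com/share/link?shareid=1686563253&uk=3759036089
Thanks to shbf for the hard work. I've made a rough-draft mdx. It isn't perfect, but it is usable, and I'll find time to improve it. It costs 15 mi (forum credits), so it's practically free.
Download the mdx: https://www.pdawiki.com/forum/forum.php?mod=viewthread&tid=32266&page=1&extra=#pid1035923
Anyone who wants to produce a carefully proofread version is welcome to; I can provide the materials for every step from pdf to mdx free of charge (text, python programs, and so on). For detailed steps and related materials, see this thread: https://www.pdawiki.com/forum/forum.php?mod=viewthread&tid=32208&extra=page%3D1 .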