Posted on 2018-11-25 16:50:10
Last edited by mikeee on 2018-11-25 18:08
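
There is one approach that should work: first convert the PDF to docx with ABBYY FineReader, then convert the docx to htm.

I don't have FineReader installed locally, so I ran ten pages through the online version at https://finereaderonline.com (it only OCRs ten pages per day). The result is quite good: the page headers disappear from the htm automatically, the two columns become a single column, bold text is preserved, and the hyphens from line breaks in the original PDF appear to have been removed. Paragraphs that span a page break in the original PDF, however, do not seem to be merged.

A quick look in Chrome DevTools:
css selector: p.Bodytext21 locates all the definitions
css selector: p.Bodytext21>span.Bodytext2Bold locates the bold text inside the definitions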
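
To make concrete what those two selectors pick out, here is a minimal, dependency-free sketch that uses the standard library's html.parser instead of a CSS engine. The SAMPLE fragment and the EntryParser class are hypothetical and only imitate the class names seen in the exported htm. Unlike the `>` combinator, this sketch does not check that the bold span is a direct child of the paragraph.

```python
from html.parser import HTMLParser

# Hypothetical fragment imitating the class names seen in the exported htm.
SAMPLE = (
    '<p class="Bodytext21"><span class="Bodytext2Bold">A-Rod.</span> '
    '<span class="Bodytext20">Short for Alex Rodriguez.</span></p>'
    '<p class="Heading1">A</p>'
)


class EntryParser(HTMLParser):
    """Collect (bold_text, full_text) for every <p class="Bodytext21">."""

    def __init__(self):
        super().__init__()
        self.entries = []      # finished paragraphs
        self._in_p = False     # currently inside p.Bodytext21
        self._in_bold = False  # currently inside span.Bodytext2Bold
        self._bold, self._all = [], []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'p' and 'Bodytext21' in classes:
            self._in_p, self._bold, self._all = True, [], []
        elif tag == 'span' and self._in_p and 'Bodytext2Bold' in classes:
            self._in_bold = True

    def handle_endtag(self, tag):
        if tag == 'span':
            self._in_bold = False
        elif tag == 'p' and self._in_p:
            self.entries.append(
                (''.join(self._bold).strip(), ''.join(self._all).strip()))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._all.append(data)
            if self._in_bold:
                self._bold.append(data)


parser = EntryParser()
parser.feed(SAMPLE)
parser.close()
print(parser.entries)
```

The paragraph with class Heading1 is skipped, which mirrors how p.Bodytext21 filters out everything that is not a definition.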

I can't post images here, so here are the docx and htm files (10 pages only). Baidu Pan link: https://pan.baidu.com/s/15Qc4tQeWcePy7AhTJLiJXQ extraction code: encg

After some tinkering, this Python 3 code processes the htm mentioned above; its output can more or less be turned into an mdx:

    '''Word and phrase origins test.'''
    from pyquery import PyQuery as pq

    file = r'WordandPhraseOrigins.htm'
    # Read the exported htm as UTF-8 first; fall back to GB2312 if that fails.
    try:
        html = open(file, 'rt', encoding='utf8').read()
    except Exception as exc:
        print('error: {}. Trying gb2312...'.format(exc))
        try:
            html = open(file, 'rt', encoding='gb2312').read()
            print('Looks good')
        except Exception as exc:
            raise SystemExit('error: {}. Giving up...'.format(exc))
    doc = pq(html)
    css_text = 'p.Bodytext21'  # every definition paragraph
    css_bold = 'p.Bodytext21>span.Bodytext2Bold'  # bold text inside a definition

    items = doc(css_text)

    # For each paragraph: bold headword tagged with "(hw)", then the body text.
    text = items.map(lambda idx, elm: pq(elm)('span.Bodytext2Bold').text()
                     + ('(hw)\n' if pq(elm)('span.Bodytext2Bold').text() else '\n')
                     + pq(elm)('span.Bodytext20').text())
    print('\n\n'.join(text[:60]))

The output of the code above looks roughly like this:

...
A-Rod.(hw)
People who have little or no knowledge of baseball might have trouble with these initials. They are short for Alex Rodriguez, the famous Yankee baseball star.

around Cape Horn.(hw)
An expression once used in whaling communities to mean “being away on a whaling voyage.” One old poem went:

“I’ll tell your father, boys,” I cried To lads at play upon my lawn.

They chorused back, “You’ll have to go Around Cape Horn.”

around the horn.(hw)
In the days of the tall ships any sailor who had sailed around Cape Horn was entitled to spit to windward; otherwise, it was a serious infraction of nautical rules of conduct. Thus, the permissible practice of spitting to windward was called Cape Horn isn’t so named because it is shaped like a horn. Captain Schouten, the Dutch navigator who first rounded it in 1616, named it after Hoorn, his birthplace in northern Holland.

arrant thief; knight errant.(hw)
was originally just a variation of nomadic or vagabond, the word best known in a knight who roamed the country performing good deeds. But from its persistent use in expressions such as an a thief who roamed the countryside holding up victims, came to mean thorough, downright, or out-
...
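
One possible next step toward an mdx, sketched under assumptions: paragraphs without a bold headword (such as the poem lines above) are attached to the entry before them, and each entry is written as a headword line, an HTML body line, and a `</>` separator line. That layout is the plain-text source format MdxBuilder-style tools accept as far as I know, so check it against your own tool. The sample `paragraphs` list, the function names `group_entries` and `to_mdx_source`, and the choice to strip the trailing period from headwords are all hypothetical and not part of the original script.

```python
import html

# Hypothetical (headword, body) pairs shaped like the script's per-paragraph
# output; an empty headword means the paragraph had no bold span.
paragraphs = [
    ('A-Rod.', 'They are short for Alex Rodriguez.'),
    ('around Cape Horn.', 'One old poem went:'),
    ('', 'They chorused back, "You\'ll have to go Around Cape Horn."'),
]


def group_entries(paragraphs):
    """Attach headword-less paragraphs to the entry that precedes them."""
    entries = []
    for headword, body in paragraphs:
        if headword:
            # Assumption: drop the trailing period the book prints after a
            # headword (this would also strip it from an entry like "etc.").
            entries.append((headword.rstrip('.'), [body]))
        elif entries:
            entries[-1][1].append(body)
        # A headword-less paragraph before any entry is silently dropped.
    return entries


def to_mdx_source(entries):
    """Render entries as headword / HTML body / '</>' records."""
    records = []
    for headword, bodies in entries:
        body_html = ''.join(
            '<p>{}</p>'.format(html.escape(b, quote=False)) for b in bodies)
        records.append('{}\n{}\n</>'.format(headword, body_html))
    return '\n'.join(records) + '\n'


source = to_mdx_source(group_entries(paragraphs))
print(source)
```

The same grouping step is also a natural place to rejoin paragraphs split across a page break, since those continuation paragraphs have no bold headword either.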

By the way, a plug for pyquery: doesn't it beat regex, bs4, and lxml hands down?