Posted on 2018-11-16 21:27:49
'''
Based on xmllarge.py
'''
# from pyquery import PyQuery as pq
from pathlib import Path


def xml_iter(file, tag):
    '''
    Process huge xml files
    <tag> </tag> need to be in separate lines
    # TODO: in the middle of lines
    :file: file path
    :tag: element to retrieve
    '''
    # Opening tag without attributes, e.g. b'<tu>'
    tagb1 = '<' + tag + '>'
    tagb1 = tagb1.encode()

    # Opening tag with attributes, e.g. b'<tu '
    tagb2 = '<' + tag + ' '
    tagb2 = tagb2.encode()

    # Closing tag, e.g. b'</tu>'
    tagb3 = '</' + tag + '>'
    tagb3 = tagb3.encode()

    with open(file, 'rb') as inputfile:
        append = False
        # Start with an empty buffer so a stray closing tag cannot raise
        # an UnboundLocalError.
        inputbuffer = b''
        for line in inputfile:
            #~ if b'<tu>' in line or b'<tu ' in line:
            if tagb1 in line:
                inputbuffer = line[line.index(tagb1):]
                append = True
            elif tagb2 in line:
                inputbuffer = line[line.index(tagb2):]
                append = True
            #~ elif b'</tu>' in line:
            elif tagb3 in line:
                inputbuffer += line[:line.index(tagb3) + len(tagb3)]
                append = False
                yield inputbuffer
                #~ docitem = process_buffer(inputbuffer, id_num)
                #~ print(id_num)
                #~ id_num += 1
                inputbuffer = b''
            elif append:
                # Inside an element: keep collecting lines
                inputbuffer += line
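The docstring's TODO (tags in the middle of lines, or opening and closing tags on the same line) could be handled roughly as follows. This is only a sketch, not part of the original function: the name `xml_iter_inline` is made up for illustration, and it assumes elements with the same tag name are never nested.

```python
def xml_iter_inline(file, tag):
    """Yield each <tag>...</tag> element as bytes, even when the opening
    and closing tags share a line (hypothetical variant of xml_iter)."""
    open_plain = ('<' + tag + '>').encode()   # e.g. b'<tu>'
    open_attr = ('<' + tag + ' ').encode()    # e.g. b'<tu '
    close = ('</' + tag + '>').encode()       # e.g. b'</tu>'

    with open(file, 'rb') as inputfile:
        buffer = None  # None means we are currently outside an element
        for line in inputfile:
            pos = 0
            while pos < len(line):
                if buffer is None:
                    # Find the earliest opening tag at or after pos.
                    starts = [i for i in (line.find(open_plain, pos),
                                          line.find(open_attr, pos))
                              if i != -1]
                    if not starts:
                        break  # nothing more to pick up on this line
                    pos = min(starts)
                    buffer = b''
                end = line.find(close, pos)
                if end == -1:
                    # Element continues on the next line.
                    buffer += line[pos:]
                    break
                end += len(close)
                buffer += line[pos:end]
                yield buffer
                buffer = None
                pos = end  # keep scanning the rest of the line
```

Like the original, it reads the file line by line, so only the current line and the element being assembled are held in memory.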
So many people are looking for this? I'll package it up as a small tool and post it in a while.

Usage of the Python 3 function above:
resu = b''  # xml_iter yields bytes, so accumulate into a bytes object
for elm in xml_iter(filename, 'tu'):
    resu += elm
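
The memory footprint stays tiny no matter how large the file is, because the generator only buffers one element at a time (provided each element is processed as it is yielded rather than accumulated).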
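
To keep memory use low in practice, each yielded element can be parsed and discarded right away instead of being appended to one big result. Here is a minimal sketch using the standard library's `xml.etree.ElementTree`. The helper name `iter_segment_texts` and the child tag `seg` are assumptions for illustration (`seg` is typical of TMX files), and it assumes each fragment is well-formed on its own with no undeclared namespace prefixes.

```python
import xml.etree.ElementTree as ET


def iter_segment_texts(elements, child='seg'):
    """Parse each bytes fragment and yield the text of its <child> nodes.

    `elements` is any iterable of bytes fragments, for example
    xml_iter(filename, 'tu'). Hypothetical helper, not from the post.
    """
    for raw in elements:
        root = ET.fromstring(raw)  # parse one small fragment at a time
        for node in root.iter(child):
            yield node.text or ''  # empty elements have text None
```

It could be used as `for text in iter_segment_texts(xml_iter(filename, 'tu')): ...`, so that only one element is parsed and held at any moment.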