Posted on 2018-11-16 21:27:49

    '''
    Based on xmllarge.py
    '''
    # from pyquery import PyQuery as pq
    from pathlib import Path


    def xml_iter(file, tag):
        '''
        Process huge xml files
        <tag> </tag> need to be in separate lines
        # TODO: in the middle of lines

        :file: file path
        :tag: element to retrieve
        '''
        # Opening tag without attributes, e.g. b'<tu>'
        tagb1 = '<' + tag + '>'
        tagb1 = tagb1.encode()

        # Opening tag followed by attributes, e.g. b'<tu '
        tagb2 = '<' + tag + ' '
        tagb2 = tagb2.encode()

        # Closing tag, e.g. b'</tu>'
        tagb3 = '</' + tag + '>'
        tagb3 = tagb3.encode()

        with open(file, 'rb') as inputfile:
            append = False
            inputbuffer = b''
            for line in inputfile:
                #~ if b'<tu>' in line or b'<tu ' in line:
                if tagb1 in line:
                    inputbuffer = line[line.index(tagb1):]
                    append = True
                elif tagb2 in line:
                    inputbuffer = line[line.index(tagb2):]
                    append = True
                #~ elif b'</tu>' in line:
                elif tagb3 in line and append:
                    # Ignore a closing tag that has no matching opening tag
                    inputbuffer += line[:line.index(tagb3) + len(tagb3)]
                    append = False
                    yield inputbuffer
                    #~ docitem = process_buffer(inputbuffer, id_num)
                    #~ print(id_num)
                    #~ id_num += 1
                    inputbuffer = b''
                elif append:
                    inputbuffer += line

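The TODO in the docstring (tags in the middle of lines, or opening and closing tags sharing a line) could be tackled by scanning fixed-size chunks instead of lines. The following is only a rough sketch under that assumption, not the author's code: the name xml_iter_chunks and the chunk_size parameter are made up for illustration, and nested elements with the same tag or self-closing forms like <tu/> are not handled.

```python
import os
import tempfile


def xml_iter_chunks(file, tag, chunk_size=64 * 1024):
    """Yield each <tag ...>...</tag> element as bytes, wherever it sits on a line.

    Hypothetical variant of xml_iter: it scans chunks rather than lines.
    """
    open_plain = ('<' + tag + '>').encode()
    open_attrs = ('<' + tag + ' ').encode()
    close = ('</' + tag + '>').encode()
    # Longest prefix of an opening tag that may be cut off at a chunk boundary.
    keep = max(len(open_plain), len(open_attrs)) - 1

    buffer = b''
    with open(file, 'rb') as inputfile:
        while True:
            chunk = inputfile.read(chunk_size)
            buffer += chunk
            while True:
                candidates = (buffer.find(open_plain), buffer.find(open_attrs))
                starts = [i for i in candidates if i != -1]
                if not starts:
                    # No opening tag yet: keep only a tail that could hold a split tag.
                    buffer = buffer[-keep:]
                    break
                start = min(starts)
                end = buffer.find(close, start)
                if end == -1:
                    # Element is incomplete: drop the text before it and read more.
                    buffer = buffer[start:]
                    break
                end += len(close)
                yield buffer[start:end]
                buffer = buffer[end:]
            if not chunk:
                # End of file reached.
                break


# Small demo on a temporary file where both elements share one line.
with tempfile.NamedTemporaryFile('wb', suffix='.xml', delete=False) as tmp:
    tmp.write(b'<tmx><body><tu tuid="1"><seg>a</seg></tu>'
              b'<tu><seg>b</seg></tu></body></tmx>')
try:
    for elm in xml_iter_chunks(tmp.name, 'tu', chunk_size=8):
        print(elm)
finally:
    os.remove(tmp.name)
```

The buffer only ever holds the element currently being assembled plus a few trailing bytes, so memory use should still be bounded by the largest single element rather than by the file size.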
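So many people are looking for this? I'll package it up as a small tool and post it in a while.

Usage of the Python 3 function above:

    resu = b''  # xml_iter yields bytes, so accumulate into bytes, not str
    for elm in xml_iter(filename, 'tu'):
        resu += elm

The memory footprint is tiny no matter how large the file is, because the generator buffers only one element at a time. That holds as long as each element is processed inside the loop; concatenating everything into resu keeps all matched elements in memory.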
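For comparison, the standard library can stream elements too, via xml.etree.ElementTree.iterparse. This is a different technique from the line-scanning function in this post, shown here only as a sketch: iter_elements is a made-up helper name, the sample data is invented, and the input is assumed to be well-formed XML without namespaces (with namespaces, elem.tag carries the namespace URI prefix).

```python
import io
import xml.etree.ElementTree as ET


def iter_elements(source, tag):
    """Yield each <tag> element serialized as bytes, streaming with iterparse.

    Hypothetical helper, not part of the original post. `source` is a file
    path or a binary file object containing well-formed XML.
    """
    for _event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:
            yield ET.tostring(elem)
            # Release the element's children and text once it has been handled.
            elem.clear()


# Invented sample data; a real TMX file would be passed as a path instead.
sample = io.BytesIO(
    b'<tmx><body>'
    b'<tu tuid="1"><seg>hello</seg></tu><tu tuid="2"><seg>world</seg></tu>'
    b'</body></tmx>'
)
count = 0
for chunk in iter_elements(sample, 'tu'):
    # Handle one element at a time instead of accumulating them all.
    seg = ET.fromstring(chunk).findtext('seg')
    count += 1
    print(count, seg)
```

Note that elem.clear() empties each handled element but leaves the emptied node attached to its parent, so on extremely large files memory can still creep up slowly; the plain byte-scanning approach above avoids building a tree at all.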