python - Using lxml, how can I read text inside nested elements?
I'm trying to search 500 XML documents for specific phrases, and output the id of any element that contains one of the phrases. Currently, this is my code:
from lxml import etree
import os
import re

files = os.listdir('c:/users/me/desktop/xml')
search_words = ['house divided', 'committee divided', 'on division', 'division list', 'the ayes and noes',]

for f in files:
    doc = etree.parse('c:/users/me/desktop/xml/' + f)
    for elem in doc.iter():
        for word in search_words:
            # elem.text only holds the text before the element's first child
            if elem.text is not None and str(elem.attrib) != "{}" and word in elem.text and len(re.findall(r'\d+', elem.text)) > 1:
                votes = re.findall(r'\d+', elem.text)
                string = str(elem.attrib)[8:-2] + ","
                string += (str(votes[0]) + "," + str(votes[1]) + ",")
                string += word + ","
                string += str(elem.sourceline)
                print(string)

This input is output properly:
<p id="s3v0001p0-01869">the house divided; against motion 83; 23—majority 60.</p>
But input with nested elements is missed, because the text inside them is not being searched for the phrases:
<p id="s3v0141p0-01248"><member>the chancellor of exchequer</member><membercontribution> said, precedent occurred on 8th of april, 1850, on motion going committee of supply. amendment moved captain boldero on subject of assistant-surgeons in navy, when, on division being called for, question put words proposed left out stand part of question. house divided, when numbers were—ayes, 40; noes, 48. question, "that proposed words added" put , agreed to; main question, amended, put , agreed to; , question being put, "that mr. speaker leave chair," motion agreed to, , house went committee of supply.</membercontribution></p>
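Is there a way to read the text inside nested elements and return the id?
With lxml there is the xpath method, and XPath has a contains function that you can use. Inside the predicate, "." is the element's string value, which includes the text of all its descendants, so nested text is searched as well, e.g.: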
from lxml import etree as et

doc = et.fromstring('<p id="s3v0141p0-01248"><member>the chancellor of exchequer</member><membercontribution> said, precedent occurred on 8th of april, 1850, on motion going committee of supply. amendment moved captain boldero on subject of assistant-surgeons in navy, when, on division being called for, question put words proposed left out stand part of question. house divided, when numbers were—ayes, 40; noes, 48. question, "that proposed words added" put , agreed to; main question, amended, put , agreed to; , question being put, "that mr. speaker leave chair," motion agreed to, , house went committee of supply.</membercontribution></p>')
# select every element that has an id attribute and whose full text contains the phrase
result = doc.xpath('//*[@id and contains(., $word)]', word = 'house divided')
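If you would rather keep the loop from the question, a minimal sketch along the following lines might work: it joins itertext() for each element that has an id, so descendant text is searched too. The function name find_divisions, the shortened phrase list, and the sample with id "x1" are made up for illustration. Only calls that lxml shares with the standard library's ElementTree are used, which is why the import falls back to the standard library when lxml is not installed.

```python
import re

try:
    from lxml import etree  # the library used in the question
except ImportError:
    # Fallback: the standard library offers the same calls used below.
    import xml.etree.ElementTree as etree

# Shortened, illustrative subset of the question's phrase list.
SEARCH_WORDS = ['house divided', 'on division']


def find_divisions(xml_text, search_words=SEARCH_WORDS):
    """Return (id, first_number, second_number, phrase) for each match."""
    root = etree.fromstring(xml_text)
    rows = []
    for elem in root.iter():
        elem_id = elem.get('id')
        if elem_id is None:
            continue  # only report elements that carry an id
        # itertext() yields the element's own text plus the text of all
        # descendants, so <member>/<membercontribution> content is included.
        full_text = ''.join(elem.itertext())
        numbers = re.findall(r'\d+', full_text)
        if len(numbers) < 2:
            continue
        for word in search_words:
            if word in full_text:
                rows.append((elem_id, numbers[0], numbers[1], word))
    return rows


# Hypothetical, shortened sample in the shape of the question's nested input.
sample = ('<p id="x1"><member>the chancellor</member>'
          '<membercontribution> said, house divided, when numbers '
          'were ayes, 40; noes, 48.</membercontribution></p>')
print(find_divisions(sample))
```

Note that on the full paragraph from the question the first two numbers in the text come from the date ("8th of april, 1850"), not the vote counts, so the number extraction would probably need to be restricted to the text following the matched phrase.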