python - Using lxml, how can I read text inside nested elements?


I'm trying to search 500 XML documents for specific phrases, and output the id of any element that contains any of the phrases. Currently, this is my code:

    from lxml import etree
    import os
    import re

    files = os.listdir('c:/users/me/desktop/xml')
    search_words = ['house divided', 'committee divided', 'on division', 'division list',
                    'the ayes and noes',]

    for f in files:
        doc = etree.parse('c:/users/me/desktop/xml/' + f)
        for elem in doc.iter():
            for word in search_words:
                # elem.text only holds the text before the element's first child
                if elem.text is not None and str(elem.attrib) != "{}" and word in elem.text and len(re.findall(r'\d+', elem.text)) > 1:
                    votes = re.findall(r'\d+', elem.text)
                    string = str(elem.attrib)[8:-2] + ","
                    string += (str(votes[0]) + "," + str(votes[1]) + ",")
                    string += word + ","
                    string += str(elem.sourceline)
                    print(string)

Input like this is output properly:

<p id="s3v0001p0-01869">the house divided; against motion 83; 23&#x2014;majority 60.</p> 

But input with nested elements is missed, because the text inside the child elements is not being searched for the phrases:

<p id="s3v0141p0-01248"><member>the chancellor of exchequer</member><membercontribution> said, precedent occurred on 8th of april, 1850, on motion going committee of supply. amendment moved captain boldero on subject of assistant-surgeons in navy, when, on division being called for, question put words proposed left out stand part of question. house divided, when numbers were&#x2014;ayes, 40; noes, 48. question, "that proposed words added" put , agreed to; main question, amended, put , agreed to; , question being put, "that mr. speaker leave chair," motion agreed to, , house went committee of supply.</membercontribution></p> 

Is there a way to read the text inside nested elements and return the id?

With lxml there is the xpath method, and XPath has a contains function that you can use, e.g.

    doc = etree.fromstring('<p id="s3v0141p0-01248"><member>the chancellor of exchequer</member><membercontribution> said, precedent occurred on 8th of april, 1850, on motion going committee of supply. amendment moved captain boldero on subject of assistant-surgeons in navy, when, on division being called for, question put words proposed left out stand part of question. house divided, when numbers were&#x2014;ayes, 40; noes, 48. question, "that proposed words added" put , agreed to; main question, amended, put , agreed to; , question being put, "that mr. speaker leave chair," motion agreed to, , house went committee of supply.</membercontribution></p>')
    result = doc.xpath('//*[@id and contains(., $word)]', word='house divided')
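To tie this back to the loop in the question, here is a minimal sketch of how the XPath test could replace the `elem.text` check. It relies on the fact that `contains(., $word)` looks at the element's string value (its own text plus the text of all descendants), whereas `elem.text` only holds the text before the first child. The names `find_matches` and `SEARCH_WORDS` and the shortened sample document are made up for illustration. Taking the first two numbers after the phrase is an assumption about where the vote counts sit, not something the question states.

```python
import re
from lxml import etree

# Hypothetical phrase list, shortened from the one in the question.
SEARCH_WORDS = ['house divided', 'committee divided', 'on division', 'division list']

def find_matches(root, words=SEARCH_WORDS):
    """Return (id, first_number, second_number, phrase, sourceline) tuples."""
    rows = []
    for word in words:
        # Select every element that has an id attribute and whose string
        # value (own text plus all descendant text) contains the phrase.
        for elem in root.xpath('//*[@id and contains(., $word)]', word=word):
            # string() returns the same string value that contains() tested.
            text = elem.xpath('string()')
            start = text.find(word)
            if start == -1:
                continue
            # Assumption: the vote counts are the first two numbers
            # that appear after the matched phrase.
            numbers = re.findall(r'\d+', text[start:])
            if len(numbers) > 1:
                rows.append((elem.get('id'), numbers[0], numbers[1],
                             word, elem.sourceline))
    return rows

# Shortened, made-up sample with the same nesting as the input in the question.
sample = etree.fromstring(
    '<p id="x1"><member>the chancellor</member>'
    '<membercontribution> said the house divided: ayes, 40; noes, 48.'
    '</membercontribution></p>'
)

rows = find_matches(sample)
for row in rows:
    print(','.join(str(value) for value in row))
```

For the files on disk, the same function should work on a parsed document, e.g. `find_matches(etree.parse(path).getroot())`. One caveat: because the string value includes descendant text, an ancestor that carries an id also matches whenever one of its descendants does, so a document with ids at several nesting levels may report the same passage more than once.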
