bash - Extract multiple values between XML tags
I have an XML file of a TripAdvisor page, which shows restaurants in a specific area.
I want to extract the 'cuisines' offered by all of the restaurants in the search results. All of the values are stored between <a> and <span> HTML tags.
For each restaurant, the data is stored inside a <div> tag. A snippet of the cuisines for one restaurant is shown below:
<div class="cuisines"> <span class="item price">££ - £££</span> <span class="item cuisine" onclick="ta.restaurant_list_tracking.clicknonlinkedcuisine()">bar</span> <a class="item cuisine" href="/restaurants-g1096751-c7-whittlebury_northamptonshire_england.html" onclick="ta.setevtcookie('restaurant_details', 'restaurants_details_cuisine', '', 0, this.href);">british</a> <span class="item cuisine" onclick="ta.restaurant_list_tracking.clicknonlinkedcuisine()">pub</span> <span class="item cuisine" onclick="ta.restaurant_list_tracking.clicknonlinkedcuisine()">gastropub</span> <a class="item cuisine" href="/restaurants-g1096751-zfz10665-whittlebury_northamptonshire_england.html" onclick="ta.setevtcookie('restaurant_details', 'restaurants_details_cuisine', '', 0, this.href);">vegetarian friendly</a> <a class="item cuisine" href="/restaurants-g1096751-zfz10992-whittlebury_northamptonshire_england.html" onclick="ta.setevtcookie('restaurant_details', 'restaurants_details_cuisine', '', 0, this.href);">gluten free options</a> </div>
How do I go about extracting all the cuisines between these div tags for each restaurant, and outputting them to a new text file?
The expected output I want for the snippet above would be:
bar, british, pub, gastropub, vegetarian friendly, gluten free options
Mind you, there are several <div> tags in the XML file, and I want to process all of them, extracting the different cuisines into one text file, with each line showing the cuisines for one <div> block.
Thanks!
This basic bash script (using awk) will do the job, at least for the example provided:
#!/bin/bash
# gensub() is a GNU awk (gawk) extension, so this needs gawk.
awk '
  /item cuisine/ { item = gensub(/<[^>]*>/, "", "g"); ans = (ans == "") ? item : ans ", " item }
  END { print ans }' in.xml > out.txt
The script removes the text inside angle brackets and keeps the text between them, and does so only on lines containing "item cuisine". It therefore assumes that each cuisine element sits on its own line.
However, note that this is a very fragile way of extracting values from an XML file, or, for that matter, from any data exchange format (like JSON, YAML, etc.), and it will stop working for a dozen different reasons (badly formed XML, a line containing the term "item cuisine" outside of the brackets, XML tags not being broken by newlines, etc.).
One could extend the above script to cover an increasing number of errors, but there's no need to reinvent a wheel that has already been built in a better way. Tools like xmllint or xgrep offer more robust XML parsing, letting you concentrate on the task at hand instead of error handling.
If this is more than a quick personal hack/experiment, I'd urge you to use one of the available tools.
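For the quick-hack case, here is a rough sketch of how the awk approach could be extended to cope with one of the failure modes mentioned above (tags not being broken by newlines) and to print one line per <div> block. The function name extract_cuisines is made up for illustration. The sketch assumes that each <div class="cuisines"> block opens and closes on the same line, that there are no nested <div>s inside it, and that the class attribute reads exactly "item cuisine". It is still plain pattern matching, not a real parser, so the caveats above still apply.

```shell
#!/bin/bash
# Sketch only: print one comma-separated line of cuisines per
# <div class="cuisines"> block. extract_cuisines is a hypothetical helper name.
# Assumes each block opens and closes on the same input line, with no nested <div>.
extract_cuisines() {
  awk '
  {
    line = $0
    # Handle every cuisines block found on this line.
    while (match(line, /<div class="cuisines">/)) {
      line = substr(line, RSTART + RLENGTH)
      div_end = index(line, "</div>")
      if (div_end == 0) break                 # block not closed on this line: give up
      block = substr(line, 1, div_end - 1)
      line = substr(line, div_end + 6)        # continue after "</div>"
      out = ""
      # Pick the text that follows each opening tag with class="item cuisine".
      while (match(block, /class="item cuisine"[^>]*>/)) {
        block = substr(block, RSTART + RLENGTH)
        stop = index(block, "<")
        if (stop == 0) break
        item = substr(block, 1, stop - 1)
        out = (out == "") ? item : out ", " item
        block = substr(block, stop)
      }
      if (out != "") print out
    }
  }' "$@"
}
```

It can be called as `extract_cuisines in.xml > out.txt`, or fed from a pipe. It uses only match(), substr() and index(), so it should not need gawk. On the snippet from the question it should print the expected line, skipping the "item price" span.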