bash - Extract multiple values between XML tags

March 15, 2014

I have an XML file of a TripAdvisor page, which shows the restaurants in a specific area.

I want to extract the 'cuisines' offered by all of the restaurants in the search result. All of the values are stored between <a> and <span> HTML tags.

For each restaurant, the data is stored inside a <div> tag. A snippet of the cuisines for one restaurant is shown below:

<div class="cuisines">
  <span class="item price">££ - £££</span>
  <span class="item cuisine" onclick="ta.restaurant_list_tracking.clicknonlinkedcuisine()">bar</span>
  <a class="item cuisine" href="/restaurants-g1096751-c7-whittlebury_northamptonshire_england.html" onclick="ta.setevtcookie('restaurant_details', 'restaurants_details_cuisine', '', 0, this.href);">british</a>
  <span class="item cuisine" onclick="ta.restaurant_list_tracking.clicknonlinkedcuisine()">pub</span>
  <span class="item cuisine" onclick="ta.restaurant_list_tracking.clicknonlinkedcuisine()">gastropub</span>
  <a class="item cuisine" href="/restaurants-g1096751-zfz10665-whittlebury_northamptonshire_england.html" onclick="ta.setevtcookie('restaurant_details', 'restaurants_details_cuisine', '', 0, this.href);">vegetarian friendly</a>
  <a class="item cuisine" href="/restaurants-g1096751-zfz10992-whittlebury_northamptonshire_england.html" onclick="ta.setevtcookie('restaurant_details', 'restaurants_details_cuisine', '', 0, this.href);">gluten free options</a>
</div>

How do I go about extracting the cuisines between these div tags for each restaurant, and writing them to a new text file?

The expected output I want for this snippet would be:

bar, british, pub, gastropub, vegetarian friendly, gluten free options

Mind you, there are several such <div> tags in the XML file, and I want to process all of them, extracting the cuisines into one text file, with each line showing the cuisines for one <div> block.

Thanks!

This basic bash script (using awk) will do the job, at least for the example provided:

#!/bin/bash
# gensub() requires GNU awk (gawk)
awk '/item cuisine/ {
    item = gensub(/<[^>]*>/, "", "g")    # strip the tags, keep the text
    ans = (ans == "") ? item : ans ", " item
}
END { print ans }' in.xml > out.txt

The script removes the text inside angle brackets and retains the text between them, and it does so only on lines containing "item cuisine".
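The question also asks for one output line per <div> block, which the script above does not do: it joins everything into a single line. A minimal sketch of one way to get there, still with awk, is to treat each closing </div> as a record separator and pull the cuisine items out of each record with match(). The file names sample.html and out.txt and the abridged sample markup are assumptions made for illustration, and a multi-character RS needs gawk or mawk.

```shell
#!/bin/bash
# Hypothetical sample input, abridged from the question: two cuisine blocks,
# one broken across lines and one on a single line.
cat > sample.html <<'EOF'
<div class="cuisines">
<span class="item price">££ - £££</span>
<span class="item cuisine" onclick="x()">bar</span>
<a class="item cuisine" href="/a.html">british</a>
</div>
<div class="cuisines"> <span class="item cuisine">pub</span> <a class="item cuisine" href="/b.html">vegetarian friendly</a> </div>
EOF

# Each record is the text up to the next </div> (regex RS: gawk or mawk).
awk -v RS='</div>' '
/class="cuisines"/ {
    line = ""
    rest = $0
    # Find the next element whose class is exactly "item cuisine".
    while (match(rest, /class="item cuisine"[^>]*>[^<]*</)) {
        item = substr(rest, RSTART, RLENGTH)
        sub(/^[^>]*>/, "", item)                 # drop the rest of the opening tag
        sub(/<$/, "", item)                      # drop the "<" of the closing tag
        gsub(/^[ \t\n]+|[ \t\n]+$/, "", item)    # trim surrounding whitespace
        line = (line == "") ? item : line ", " item
        rest = substr(rest, RSTART + RLENGTH)    # continue after this match
    }
    if (line != "") print line
}' sample.html > out.txt

cat out.txt
```

This no longer depends on where the line breaks fall, but it is still pattern matching rather than parsing: a <div> nested inside a cuisines block, or a class attribute written in a different order, would break it.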

However, note that this is a very fragile way of extracting values from an XML file or, for that matter, from any data exchange format (JSON, YAML, etc.), and it will stop working for a dozen different reasons (malformed XML, a line containing the term "item cuisine" outside of a tag, tags not being separated by newlines, etc.).

One could extend the above script to cover an increasing number of errors, but there's no need to reinvent a wheel that has already been built in a better way. Tools like xmllint or xgrep offer more robust XML parsing, letting you concentrate on the task at hand instead of on error handling.

If this is anything more than a quick personal hack or experiment, I'd urge you to use one of the available tools.
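As a rough illustration of the parser-based route, here is a sketch using xmllint's --html and --xpath options. It assumes xmllint (from libxml2) is installed; the file names sample2.html and out_xmllint.txt and the abridged markup are made up for the example. It only uses the XPath functions count() and string(), which return single values, so it does not rely on how a particular xmllint version prints node sets. It starts one xmllint process per item, so it is slow on large pages.

```shell
#!/bin/bash
# Hypothetical sample input, abridged from the question's snippet.
cat > sample2.html <<'EOF'
<html><body>
<div class="cuisines">
<span class="item price">$$</span>
<span class="item cuisine">bar</span>
<a class="item cuisine" href="/a.html">british</a>
</div>
<div class="cuisines">
<span class="item cuisine">pub</span>
<a class="item cuisine" href="/b.html">vegetarian friendly</a>
</div>
</body></html>
EOF

if ! command -v xmllint >/dev/null 2>&1; then
    echo "xmllint is not installed; skipping" >&2
else
    divs='//div[@class="cuisines"]'
    # Number of cuisine blocks in the page.
    n=$(xmllint --html --xpath "count($divs)" sample2.html 2>/dev/null)
    : > out_xmllint.txt
    for d in $(seq 1 "$n"); do
        # Number of cuisine items inside the d-th block.
        m=$(xmllint --html --xpath "count(($divs)[$d]/*[@class=\"item cuisine\"])" sample2.html 2>/dev/null)
        line=""
        for i in $(seq 1 "$m"); do
            # Text content of the i-th cuisine item in the d-th block.
            item=$(xmllint --html --xpath "string(($divs)[$d]/*[@class=\"item cuisine\"][$i])" sample2.html 2>/dev/null)
            line=${line:+$line, }$item
        done
        echo "$line" >> out_xmllint.txt
    done
    cat out_xmllint.txt
fi
```

Because the HTML parser does the tokenising, line breaks, attribute order inside a tag, and nested markup matter far less than with the awk approach. The trade-off is speed, which a single XPath-capable script in another language would avoid.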
