web scraping - Python BeautifulSoup: extract HTML table cells that contain images and text


I want to extract the table from this URL, but I got lost. See what I have done below:

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"

    headers = {'User-Agent': 'Mozilla/5.0'}
    raw_html = requests.get(url, headers=headers)

    raw_data = raw_html.text
    soup_data = BeautifulSoup(raw_data, "lxml")

    # Skip the first row, which holds the table headers
    td = soup_data.find_all('tr')[1:]

    country = []

    for data in td:
        col = data.find_all('td')
        country.append(col)

How can I get the text and the URLs of all of the columns (Country, Port Name, UN/LOCODE, Type, and Port's Map)?

I did the scraping for you. You can use a dictionary whose keys are the table headers, as shown below. Iterate through the individual td elements and, for each required column, use find('tag_name')['attribute_name'] to get a URL (src, href, etc.) and .text to get the text. Hope this helps.
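To try the find('tag_name')['attribute_name'] and text-extraction pattern without hitting the site, here is a minimal offline sketch. The SAMPLE_HTML markup and the key names are made up for illustration and are only assumed to resemble the real page; it uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; the real page's structure may differ.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Port Name</th><th>Map</th></tr>
  <tr>
    <td><img src="/img/flags/png40/cn.png"></td>
    <td>
      SHANGHAI
    </td>
    <td><a href="/en/ais/home/portid:1253">map</a></td>
  </tr>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

rows = []
for tr in soup.find_all("tr")[1:]:  # skip the header row
    cells = tr.find_all("td")
    rows.append({
        # Attribute lookup: find the child tag, then index by attribute name
        "img_src": cells[0].find("img")["src"],
        # Text lookup: get_text(strip=True) trims surrounding whitespace
        "port_name": cells[1].get_text(strip=True),
        "map_href": cells[2].find("a")["href"],
    })

print(rows)
```

get_text(strip=True) is used here in place of chained .replace() calls; either approach works for cells that contain only whitespace-padded text.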

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"

    headers = {'User-Agent': 'Mozilla/5.0'}
    raw_html = requests.get(url, headers=headers)

    raw_data = raw_html.text
    soup_data = BeautifulSoup(raw_data, "lxml")

    # Skip the first row, which holds the table headers
    td = soup_data.find_all('tr')[1:]

    country = []

    for data in td:
        col = data.find_all('td')
        details = {}
        # Pick the wanted cells by their position in the row
        for i, cell in enumerate(col):
            if i == 0:
                details['img-src'] = "https://www.marinetraffic.com" + cell.find('img')['src']
            if i == 1:
                details['port_name'] = cell.text.replace('\n', '')
            if i == 2:
                details['un/locode'] = cell.text.replace('\r\n', '').replace(" ", "")
            if i == 4:
                details['type'] = cell.text.replace('\r\n', '').replace(" ", "")
            if i == 5:
                details['map_url'] = "https://www.marinetraffic.com" + cell.find('a')['href']
        country.append(details)

Output:

    [{'img-src': 'https://www.marinetraffic.com/img/flags/png40/cn.png',
      'port_name': 'shanghai',
      'un/locode': 'cnsha',
      'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:9/centerx:121.614746/centery:31.3663635/showports:true/portid:1253',
      'type': 'port'},
     {'img-src': 'https://www.marinetraffic.com/img/flags/png40/cn.png',
      'port_name': 'maanshan',
      'un/locode': 'cnmaa',
      'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:14/centerx:118.459503/centery:31.7180004/showports:true/portid:2746',
      'type': 'port'},
     {'img-src': 'https://www.marinetraffic.com/img/flags/png40/hk.png',
      'port_name': 'hong kong',
      'un/locode': 'hkhkg',
      'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:14/centerx:114.181366/centery:22.2879486/showports:true/portid:2429',
      'type': 'port'},
     ...
    ]
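
As an optional variation on the string concatenation used for img-src and map_url, the standard library's urljoin can build the absolute URLs. This is only a sketch: BASE_URL and the absolute() helper are names introduced here for illustration, not part of the original answer.

```python
from urllib.parse import urljoin

# Assumed site root, taken from the URLs shown in the output
BASE_URL = "https://www.marinetraffic.com"

def absolute(path):
    """Resolve a site-relative src/href against the site root.

    Values that are already absolute URLs are returned unchanged.
    """
    return urljoin(BASE_URL, path)

flag_url = absolute("/img/flags/png40/cn.png")
same_url = absolute("https://www.marinetraffic.com/img/flags/png40/hk.png")
print(flag_url)
print(same_url)
```

Unlike plain concatenation, this keeps working if the page ever emits a full URL in src or href instead of a site-relative path.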
