Web scraping - Python BeautifulSoup: extract HTML table cells that contain images and text
I want to extract a table from a URL, but I got lost. See what I have done below:
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"
    headers = {'User-Agent': 'Mozilla/5.0'}
    raw_html = requests.get(url, headers=headers)
    raw_data = raw_html.text
    soup_data = BeautifulSoup(raw_data, "lxml")
    td = soup_data.find_all('tr')[1:]  # skip the header row
    country = []
    for data in td:
        col = data.find_all('td')
        country.append(col)
How do I get the text and URL of each of the columns (Country, Port Name, UN/LOCODE, Type, and Port's Map)?
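I did the scraping for you. You can use a dictionary whose keys are the table headers and whose values are the cell contents, as below. Iterate through the individual td elements of each row, pick the required columns by index, and use find('tag_name')['attribute_name'] to read an attribute (a URL in src, href, etc.) and .text to read the text. Hope this helps.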
    import requests
    from bs4 import BeautifulSoup

    url = "https://www.marinetraffic.com/en/ais/index/ports/all/per_page:50"
    headers = {'User-Agent': 'Mozilla/5.0'}
    raw_html = requests.get(url, headers=headers)
    raw_data = raw_html.text
    soup_data = BeautifulSoup(raw_data, "lxml")
    td = soup_data.find_all('tr')[1:]  # skip the header row
    country = []
    for data in td:
        col = data.find_all('td')
        details = {}
        for i, cell in enumerate(col):
            if i == 0:
                # flag image: read the src attribute and make it absolute
                details['img-src'] = "https://www.marinetraffic.com" + cell.find('img')['src']
            if i == 1:
                details['port_name'] = cell.text.replace('\n', '')
            if i == 2:
                details['un/locode'] = cell.text.replace('\r\n', '').replace(" ", "")
            if i == 4:
                details['type'] = cell.text.replace('\r\n', '').replace(" ", "")
            if i == 5:
                # map link: read the href attribute and make it absolute
                details['map_url'] = "https://www.marinetraffic.com" + cell.find('a')['href']
        country.append(details)  # one dictionary per table row
Output:

    [{'img-src': 'https://www.marinetraffic.com/img/flags/png40/cn.png',
      'port_name': 'shanghai',
      'un/locode': 'cnsha',
      'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:9/centerx:121.614746/centery:31.3663635/showports:true/portid:1253',
      'type': 'port'},
     {'img-src': 'https://www.marinetraffic.com/img/flags/png40/cn.png',
      'port_name': 'maanshan',
      'un/locode': 'cnmaa',
      'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:14/centerx:118.459503/centery:31.7180004/showports:true/portid:2746',
      'type': 'port'},
     {'img-src': 'https://www.marinetraffic.com/img/flags/png40/hk.png',
      'port_name': 'hong kong',
      'un/locode': 'hkhkg',
      'map_url': 'https://www.marinetraffic.com/en/ais/home/zoom:14/centerx:114.181366/centery:22.2879486/showports:true/portid:2429',
      'type': 'port'},
     ... ]
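If you want to try the attribute-versus-text pattern without requesting the site, the following is a minimal sketch under a few assumptions. The SAMPLE_HTML table is invented for illustration and has only three columns, so the real page layout may well differ. The helper name parse_rows is hypothetical. The sketch also swaps in Python's built-in "html.parser" instead of "lxml", uses urljoin instead of string concatenation to build absolute URLs, and checks for a missing img or a tag so that an empty cell gives None rather than an exception.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.marinetraffic.com"

# Hypothetical sample markup, invented for illustration only.
SAMPLE_HTML = """
<table>
  <tr><th>Country</th><th>Port Name</th><th>Map</th></tr>
  <tr>
    <td><img src="/img/flags/png40/cn.png"></td>
    <td>
      SHANGHAI
    </td>
    <td><a href="/en/ais/home/portid:1253">Show on map</a></td>
  </tr>
  <tr>
    <td></td>
    <td>EXAMPLE PORT</td>
    <td></td>
  </tr>
</table>
"""


def parse_rows(html):
    """Return one dict per data row; URL fields are None when the tag is absent."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for tr in soup.find_all("tr")[1:]:  # skip the header row
        cells = tr.find_all("td")
        if len(cells) < 3:
            continue  # ignore rows that do not have the expected columns
        img = cells[0].find("img")
        link = cells[2].find("a")
        rows.append({
            # attribute access for URLs, get_text for visible text
            "img-src": urljoin(BASE_URL, img["src"]) if img is not None else None,
            "port_name": cells[1].get_text(strip=True),
            "map_url": urljoin(BASE_URL, link["href"]) if link is not None else None,
        })
    return rows


ports = parse_rows(SAMPLE_HTML)
for port in ports:
    print(port)
```

Using get_text(strip=True) removes the surrounding whitespace and newlines in one step, which is roughly what the chained replace calls above are doing by hand.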