Complex python3 csv scraper


I've got the code below working great when pulling data from a single column, in my case row[0]. I'm wondering how to tweak it to pull data from multiple columns.



Also, I would love to be able to specify which divTag class (see the code below) to use for a specific column.



Something like, for row[1] and row[2], use:


divTag = soup.find("div", {"class": "productsPicture"})



and for row[4] and row[5], use:


divTag = soup.find("div", {"class": "product_content"})



I hope that makes sense.


from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile, delimiter=';')
    writer = csv.writer(results)

    for row in reader:
        # get the url
        url = row[0]
        print(url)

        # fetch content from server
        try:
            html = requests.get(url).content
        except requests.exceptions.ConnectionError as e:
            writer.writerow([url, '', 'bad url'])
            continue

        # soup fetched content
        soup = BeautifulSoup(html, 'html.parser')

        divTag = soup.find("div", {"class": "productsPicture"})

        if divTag:
            # Return all 'a' tags that contain an href
            for a in divTag.find_all("a", href=True):
                url_sub = a['href']

                # Test that link is valid
                try:
                    r = requests.get(url_sub)
                    writer.writerow([url, url_sub, 'ok'])
                except requests.exceptions.ConnectionError as e:
                    writer.writerow([url, url_sub, 'bad link'])
        else:
            writer.writerow([url, '', 'no results'])



urls.csv sample:




https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E705Y-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E703Y-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E702Y-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E706Y-9093;



Example classes to search for:



[screenshot: classes to search for, per column]





Anyone able to help me solve this? @Martin Evans?
– AnotherUser31
May 22 at 13:57





Please have a look here and help me solve this puzzle.
– AnotherUser31
May 23 at 6:54




1 Answer



To add per-column find parameters, you could create a dictionary mapping each column index to the required find parameters, as follows:


from bs4 import BeautifulSoup
import requests
import csv

class_1 = {"class": "productsPicture"}
class_2 = {"class": "product_content"}
class_3 = {"class": "id-fix"}

# map a column number to the required find parameters
class_to_find = {
    0: class_3,   # Not defined in question
    1: class_1,
    2: class_1,
    3: class_3,   # Not defined in question
    4: class_2,
    5: class_2,
}

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile)
    writer = csv.writer(results)

    for row in reader:
        # build one output row per input row
        output_row = []

        for index, url in enumerate(row):
            url = url.strip()

            # Skip any empty URLs
            if len(url):
                #print('col: {}\nurl: {}\nclass: {}\n\n'.format(index, url, class_to_find[index]))

                # fetch content from server
                try:
                    html = requests.get(url).content
                except requests.exceptions.ConnectionError as e:
                    output_row.extend([url, '', 'bad url'])
                    continue
                except requests.exceptions.MissingSchema as e:
                    output_row.extend([url, '', 'missing http...'])
                    continue

                # soup fetched content
                soup = BeautifulSoup(html, 'html.parser')

                divTag = soup.find("div", class_to_find[index])

                if divTag:
                    # Return all 'a' tags that contain an href
                    for a in divTag.find_all("a", href=True):
                        url_sub = a['href']

                        # Test that link is valid
                        try:
                            r = requests.get(url_sub)
                            output_row.extend([url, url_sub, 'ok'])
                        except requests.exceptions.ConnectionError as e:
                            output_row.extend([url, url_sub, 'bad link'])
                else:
                    output_row.extend([url, '', 'no results'])

        writer.writerow(output_row)



The enumerate() function returns a counter whilst iterating over a list, so index will be 0 for the first URL and 1 for the next. This can then be used with the class_to_find dictionary to get the required parameters to search on.


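As a minimal sketch of that lookup (hypothetical URLs, reusing the class_to_find dictionary defined above):

for index, url in enumerate(['http://example.com/a', 'http://example.com/b']):
    # index drives the dictionary lookup, so each column gets its own find parameters
    print(index, url, class_to_find[index])
# 0 http://example.com/a {'class': 'id-fix'}
# 1 http://example.com/b {'class': 'productsPicture'}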



Each URL results in three columns being written: the URL, the sub-URL (if successful) and the status. These can be removed if not needed.
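For example, if the first URL in a row matches a product link and the second returns no results, the corresponding results.csv row would look something like this (hypothetical values):

https://example.com/search?q=1,https://example.com/product/1,ok,https://example.com/search?q=2,,no results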





yes ! yes ! yes ! you are the man ! BUUUUUT it's not doing it per column, it's doing it per row. If we can figure out how to do it per COLUMN, that would be faaaaaaaantastic ! Thank you man again for your time !!!
– AnotherUser31
May 24 at 10:35





Try recopying, there was a bug.
– Martin Evans
May 24 at 10:35





Alright @Martin Evans, I tested it out; it looks like you still can't use 2 different classes for 2 COLUMNS at a time, am I right? Say for COLUMN 2 use class productsPicture and for COLUMN 3 use class product_content? You are so so close! Please don't give up!
– AnotherUser31
May 24 at 11:06





Currently class_1 is used for columns 0 and 3 (they are numbered from 0). class_2 is used for columns 1 and 4.
– Martin Evans
May 24 at 11:08







Emmm... I have 5 rows of links in columns 3 and 4, and I get only one result :( Check it out: docs.google.com/spreadsheets/d/…
– AnotherUser31
May 24 at 11:20





