Complex python3 csv scraper


I've got the code below working great when pulling data from a single column, in my case row[0]. I'm wondering how to tweak it to pull data from multiple columns.



Also, I would love to be able to specify which divTag class (see the code below) to use for a specific column.



Something like, for row[1] and row[2], use:


divTag = soup.find("div", {"class": "productsPicture"})



and for row[4] and row[5], use:


divTag = soup.find("div", {"class": "product_content"})



I hope that makes sense.


from bs4 import BeautifulSoup
import requests
import csv

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile, delimiter=';')
    writer = csv.writer(results)

    for row in reader:
        # get the url
        url = row[0]
        print(url)

        # fetch content from server
        try:
            html = requests.get(url).content
        except requests.exceptions.ConnectionError as e:
            writer.writerow([url, '', 'bad url'])
            continue

        # soup fetched content
        soup = BeautifulSoup(html, 'html.parser')

        divTag = soup.find("div", {"class": "productsPicture"})

        if divTag:
            # Return all 'a' tags that contain an href
            for a in divTag.find_all("a", href=True):
                url_sub = a['href']

                # Test that link is valid
                try:
                    r = requests.get(url_sub)
                    writer.writerow([url, url_sub, 'ok'])
                except requests.exceptions.ConnectionError as e:
                    writer.writerow([url, url_sub, 'bad link'])
        else:
            writer.writerow([url, '', 'no results'])



urls.csv sample:




https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E705Y-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E703Y-0193;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E702Y-4589;
https://www.tennis-point.com/index.php?stoken=737F2976&lang=1&cl=search&searchparam=E706Y-9093;



Example classes to search for:



[screenshot: classes to search for, per column]





Anyone able to help me solve this? @Martin Evans?
– AnotherUser31
May 22 at 13:57





Please have a look here and help me solve this puzzle.
– AnotherUser31
May 23 at 6:54




1 Answer



To add per-column find parameters, you could create a dictionary mapping each column index to the required find parameters, as follows:


from bs4 import BeautifulSoup
import requests
import csv

class_1 = {"class": "productsPicture"}
class_2 = {"class": "product_content"}
class_3 = {"class": "id-fix"}

# map a column number to the required find parameters
class_to_find = {
    0: class_3,   # Not defined in question
    1: class_1,
    2: class_1,
    3: class_3,   # Not defined in question
    4: class_2,
    5: class_2,
}

with open('urls.csv', 'r') as csvFile, open('results.csv', 'w', newline='') as results:
    reader = csv.reader(csvFile)
    writer = csv.writer(results)

    for row in reader:
        # build one output row per input row
        output_row = []

        for index, url in enumerate(row):
            url = url.strip()

            # Skip any empty URLs
            if len(url):
                #print('col: {}\nurl: {}\nclass: {}\n\n'.format(index, url, class_to_find[index]))

                # fetch content from server
                try:
                    html = requests.get(url).content
                except requests.exceptions.ConnectionError as e:
                    output_row.extend([url, '', 'bad url'])
                    continue
                except requests.exceptions.MissingSchema as e:
                    output_row.extend([url, '', 'missing http...'])
                    continue

                # soup fetched content
                soup = BeautifulSoup(html, 'html.parser')

                divTag = soup.find("div", class_to_find[index])

                if divTag:
                    # Return all 'a' tags that contain an href
                    for a in divTag.find_all("a", href=True):
                        url_sub = a['href']

                        # Test that link is valid
                        try:
                            r = requests.get(url_sub)
                            output_row.extend([url, url_sub, 'ok'])
                        except requests.exceptions.ConnectionError as e:
                            output_row.extend([url, url_sub, 'bad link'])
                else:
                    output_row.extend([url, '', 'no results'])

        writer.writerow(output_row)



The enumerate() function returns a counter whilst iterating over a list, so index will be 0 for the first URL and 1 for the next. This can then be used with the class_to_find dictionary to get the required parameters to search on.


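As a minimal sketch of that lookup (hypothetical URLs, reusing the class_to_find dictionary defined above):

for index, url in enumerate(['http://example.com/a', 'http://example.com/b']):
    # index drives the dictionary lookup, so each column gets its own find parameters
    print(index, url, class_to_find[index])
# 0 http://example.com/a {'class': 'id-fix'}
# 1 http://example.com/b {'class': 'productsPicture'}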



Each URL results in three columns being written: the URL, the sub-URL (if successful) and the status. These can be removed if not needed.
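For example, if the first URL in a row matches a product link and the second returns no results, the corresponding results.csv row would look something like this (hypothetical values):

https://example.com/search?q=1,https://example.com/product/1,ok,https://example.com/search?q=2,,no results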





yes ! yes ! yes ! you are the man ! BUUUUUT it's not doing it per column, it's doing it per row. If we can figure out how to do it per COLUMN, that would be faaaaaaaantastic ! Thank you man again for your time !!!
– AnotherUser31
May 24 at 10:35





Try recopying, there was a bug.
– Martin Evans
May 24 at 10:35





Alright @Martin Evans, I tested it out; it looks like you still can't use 2 different classes for 2 COLUMNS at a time, am I right? Say for COLUMN 2 use class productsPicture and for COLUMN 3 use class product_content? You are so so close! Please don't give up!
– AnotherUser31
May 24 at 11:06





Currently class_1 is used for columns 0 and 3 (they are numbered from 0). class_2 is used for columns 1 and 4.
– Martin Evans
May 24 at 11:08







Emmm... I have 5 rows of links in columns 3 and 4, and I get only one result :( Check it out: docs.google.com/spreadsheets/d/…
– AnotherUser31
May 24 at 11:20





