Using my Python Web Crawler in my site - javascript

I created a web crawler in Python 3.7 that pulls different info and stores it in four different lists. I have now come across an issue that I am not sure how to fix. I want to use the data from those four lists in my site, placing it into a table made with JS and HTML/CSS. How do I go about accessing the info from my Python file in my JavaScript file? I tried searching in other places before creating an account and came across some answers that talk about using JSON, but I am not too familiar with it and would appreciate some help if that is the way to do it. I will post my code below; it is stored in the same directory as my other site files. Thanks in advance!
from requests import get
from bs4 import BeautifulSoup
from flask import Flask

app = Flask(__name__)

@app.route("/")  # must be the @ decorator; with '#' it is a comment and the route is never registered
def main():
    # lists to store data
    names = []
    gp = []
    collectionScore = []
    arenaRank = []
    url = 'https://swgoh.gg/g/21284/gid-1-800-druidia/'
    response = get(url)
    soup = BeautifulSoup(response.content, 'html.parser')
    # usernames of the guild members
    # (in Python 3, .encode() returns bytes, which never compare equal to a
    # str, so compare the stripped text directly)
    for users in soup.findAll('strong'):
        text = users.text.strip()
        if text != '':
            if text == '\u9093\u6d77':  # the b'\xe9\x82\x93\xe6\xb5\xb7' bytes from the original, decoded
                names.append('Deniz')
            else:
                names.append(text)
        if text in ('Note', 'GP', 'CS'):  # table headers, not usernames
            names.remove(text)
    print(names)
    # GP of the guild members (every 4th cell, starting at offset 0):
    for galacticPower in soup.find_all('td', class_='text-center'):
        gp.append(galacticPower.text.strip())
    totLen = len(gp)
    i = 0
    finGP = []
    while i < totLen:
        finGP.append(gp[i])
        i += 4
    print(finGP)
    # CS of the guild members (offset 1):
    j = 1
    while j < totLen:
        collectionScore.append(gp[j])
        j += 4
    print(collectionScore)
    # Arena rank of the guild members (offset 2):
    k = 2
    while k < totLen:
        arenaRank.append(gp[k])
        k += 4
    print(arenaRank)

if __name__ == "__main__":
    app.run()
TL;DR: I want to use the four lists (finGP, names, collectionScore, and arenaRank) in a JavaScript or HTML file. How do I go about doing this?

Ok, this will be somewhat long, but I'm going to try breaking it down into simple steps. The goals of this answer are to:
Have you get a basic webpage being generated and served from Python.
Insert the results of your script as JavaScript into the page.
Do some basic rendering with the data.
What this answer is not:
An in-depth JavaScript and Python tutorial. We don't want to overload you with too many concepts at once. You should eventually learn about databases and caching, but that's further down the road.
Ok, here's what I want you to do first. Read and implement this tutorial up until the "Creating a Signup Page" section. That section starts to get into MySQL, which isn't something you need to worry about right now.
Next, you need to execute your scraping script when a request hits the server. When you get the results back, output them into the HTML page template inside a script tag that looks like this:
<script>
const data = [];
console.log(data);
</script>
Inside the brackets in data = [], use json.dumps (https://docs.python.org/2/library/json.html) to format your Python list data as JSON. JSON is valid JavaScript syntax, so you just output it as a raw string here and it gets loaded into the webpage via the script tag.
The console.log statement in the script tag will show the data in your browser's dev tools.
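To make this concrete, here is a minimal sketch of how the pieces fit together, assuming your scraping logic is wrapped in a hypothetical scrape_guild() helper that returns the four lists (that helper is not part of your code yet; you would extract it from main()):

import json
from flask import Flask, render_template_string

app = Flask(__name__)

PAGE = """
<html>
  <body>
    <script>
      const data = {{ data_json | safe }};
      console.log(data);  // inspect this in the browser dev tools
    </script>
  </body>
</html>
"""

@app.route("/")
def index():
    # scrape_guild() is hypothetical: wrap your scraping code so that it
    # returns the four lists instead of printing them.
    names, finGP, collectionScore, arenaRank = scrape_guild()
    data_json = json.dumps({
        "names": names,
        "gp": finGP,
        "cs": collectionScore,
        "arena": arenaRank,
    })
    return render_template_string(PAGE, data_json=data_json)

The | safe filter stops Jinja2 from HTML-escaping the quotes in the JSON string; without it the embedded data would be garbled.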
For now, let's pause here. Get all of this working first (probably a few hours to a day's work). Getting into HTML rendering with JavaScript is a different topic, and again, I don't want to overload you with too much information right now.
Leave comments on this answer if you need extra help.

Related

Python Beautiful Webscraping Simulate Click to Scrape All Pages

I ran into an interesting issue while trying to scrape http://www.o-review.com/database_filter_model.php?table_name=glasses&tag= containing 42 pages of data. I was able to successfully scrape the first page of information, but while trying to scrape all pages I found that the URL remains unchanged, and changing the page uses a button at the bottom of the website.
The html code in the inspector reads:
<div onclick="filter_page('1')" class="filter_nav_button round5" style="cursor:pointer;"><img src="/images/icon_arrow_next.svg"></div>
I'm very new to scraping and Python, but I was told I need to simulate a "click" in the JavaScript, which I have absolutely no idea how to do, and I wasn't sure if it could be hard-coded. My weak attempt to try something so far:
response = get('http://www.o-review.com/database_filter_model.php?table_name=glasses&tag=')
soup = bs(response.text, 'html.parser')
print(soup)
for page in range(1, 42):
    pages = soup.find('div', onclick_='filter_page()')
Hopefully, someone has solved this issue in the past. Help would be greatly appreciated! Thanks!
Edit: Here is the code I'm trying to add:
## Find all frame models
for find_frames in soup.find_all('a', class_='round5 grid_model'):
    # Each iteration grabs the child's text and prints it
    all_models = find_frames.text
    print(all_models)
This would be added where the comment says to add code! Thanks!
The data is fetched via a POST request; you can check my previous answer to see how to find the actual API.
Also, html.parser versus lxml is not part of your issue.
The reason I used lxml is that it's faster than html.parser, according to the documentation.
import requests
from bs4 import BeautifulSoup
from pprint import pp

def main(url):
    with requests.Session() as req:
        for page in range(1, 44):
            print("[*] - Extracting Page# {}".format(page))
            data = {
                "table_name": "glasses",
                "family": "",
                "page": "{}".format(page),
                "sort": "",
                "display": "list",
            }
            r = req.post(url, data=data)
            soup = BeautifulSoup(r.text, 'lxml')
            pp([x.text for x in soup.select('.text-clip')
                if x.get_text(strip=True)])

if __name__ == "__main__":
    main('http://www.o-review.com/ajax/database_filter_model.php')
I saw the answer to this question and your comment. The reason αԋɱҽԃ αмєяιcαη's code works is that it sends the request to the actual AJAX API the site is getting its data from. You could easily use your browser's developer tools to track it down. It's not because of lxml or whatever; you just had to find the right source ;)
And of course, αԋɱҽԃ αмєяιcαη should have explained some parts of his answer to clarify everything for you.

Is there a way to get js display data using the requests python module?

So I am trying to access data on a video game stat-tracker website. When I inspect the element on the website and look at the code, it says:
<div class="trn-defstat__value">Division 7</div>
But when I use requests.get(url).text the same element shows up as:
<div class="trn-defstat__value">{{ activeArena.division.metadata.description }}</div>
I am trying to get the "Division 7" part but keep getting this activeArena placeholder instead. I am using Python; the code I have tried is:
import requests
url = ('https://fortnitetracker.com/profile/all/tl%20starrlol/competitive?season=16')
file = open("myfilename", "w")
r = requests.get(url)
info = r.content
info = str(info)
file.write(info)
file.close()
and I have also tried
import requests
url = ('https://fortnitetracker.com/profile/all/tl%20starrlol/competitive?season=16')
file = open("myfilename", "w")
r = requests.get(url)
info = r.text
file.write(info)
file.close()
I am pretty new to coding so if the answer is obvious I apologize, but I am lost.
The HTML you're receiving contains template-engine code; the JavaScript on the page loads it and fills it with values. If you examine the page via the Network panel in the browser, you'll notice a call to a stats API. Make the same call from your code to extract the data you need.
import requests
url = "https://fortnitetracker.com/api/v0/profile/863f1c3c-2e61-487e-8987-ceefff2981ad/stats"
querystring = {"season":"16","isCompetitive":"true"}
response = requests.get(url, params=querystring)
data = response.json()
print (data[0]['arena']['division']['displayValue'])
# prints "Contender League Division 7"
It's better to check for official APIs instead of using this approach. Parameters in the API, like the UUID after profile, may only be valid for a certain time. It's also worth evaluating the Selenium or Puppeteer approach recommended in the comments (under the question) to see if it fits your overall problem.

Error whilst trying to use the .difference() function in Python Jupyter

Context
I am currently going through a course on web scraping. Upon getting to the module on scraping JavaScript, a function set_1.difference(set_2) was used to distinguish the old variables from the newly created ones. But when I ran it, it brought up this error:
AttributeError: 'list' object has no attribute 'difference'
I searched online and stumbled on this website, but running the example on their own site also brought up an error.
Problem
Any reason why this is not working? I want to print the newly generated JavaScript links. Below is the code I am trying to run:
from requests_html import AsyncHTMLSession
session = AsyncHTMLSession()
r = await session.get('https://www.ons.gov.uk/economy/economicoutputandproductivity/output/datasets/economicactivityfasterindicatorsuk')
r.status_code
divs = r.html.find('div')
downloads = r.html.find('a')
urls = r.html.absolute_links
# Now need to render the javascript. Downloads chromium the first time we use it,
# It is a browser that has no GUI
await r.html.arender()
new_divs = r.html.find('div')
new_downloads = r.html.find('a')
new_urls = r.html.absolute_links
# Get only the newly created html
new_downloads.difference(downloads)
I don't know what the "r" object is, so I can't verify your code, but difference is a method of sets, not lists:
https://docs.python.org/3/library/stdtypes.html#frozenset.difference
This should do the trick: set(new_downloads).difference(downloads)
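A quick illustration of the difference, using plain lists:

old_links = ['a.csv', 'b.csv']
new_links = ['a.csv', 'b.csv', 'c.csv']

# new_links.difference(old_links)  # AttributeError: 'list' object has no attribute 'difference'
print(set(new_links).difference(old_links))  # {'c.csv'} - only the new link remains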

scrapy + selenium: <a> tag has no href, but content is loaded by javascript

I'm almost there with my first try at using Scrapy and Selenium to collect data from a website with JavaScript-loaded content.
Here is my code:
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.webdriver.common.by import By
import time

class FreePlayersSpider(scrapy.Spider):
    name = 'free_players'
    allowed_domains = ['www.forge-db.com']
    start_urls = ['https://www.forge-db.com/fr/fr11/players/?server=fr11']
    driver = {}

    def __init__(self):
        self.driver = webdriver.Chrome('/home/alain/Documents/repository/web/foe-python/chromedriver')
        self.driver.get('https://forge-db.com/fr/fr11/players/?server=fr11')

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        #time.sleep(1)
        sel = Selector(text=self.driver.page_source)
        players = sel.xpath('.//table/tbody/tr')
        for player in players:
            joueur = player.xpath('.//td[3]/a/text()').get()
            guilde = player.xpath('.//td[4]/a/text()').get()
            yield {
                'player': joueur,
                'guild': guilde
            }
        # note: '@class', not '#class', in the XPath predicate
        next_page_btn = self.driver.find_element_by_xpath('//a[@class="paginate_button next"]')
        if next_page_btn:
            time.sleep(2)
            next_page_btn.click()
            yield scrapy.Request(url=self.start_urls, callback=self.parse)
        # Close the selenium driver, so in fact it closes the testing browser
        self.driver.quit()

    def parse_players(self):
        pass
I want to collect the user names and their guilds and output them to a CSV file.
For now, my issue is how to proceed to the NEXT PAGE and parse the JavaScript-loaded content again.
Even if I'm able to simulate a click on the NEXT button, I'm not 100% sure the code will walk through all the pages, and I'm not able to parse the new content using the same function.
Any idea how I could solve this issue?
Thx.
Instead of using Selenium, you should try to recreate the request that updates the table. If you look closely at the HTML under Chrome devtools, you can see that the request is made with parameters and a response is sent back with the data in a nice structured format.
Please see here with regards to dynamic content in Scrapy. As it explains, the first step is to ask: is it necessary to recreate browser activity, or can I get the information I need by reverse engineering the HTTP GET requests? Sometimes the information is hidden within <script></script> tags, and you can use some regex or string methods to extract what you want. Rendering the page and then simulating browser activity should be thought of as a last resort.
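As a generic sketch of that second case (not specific to this site, and the variable name tableData is hypothetical), a JSON blob embedded in a <script> tag can often be pulled out with a regex:

import json
import re
import requests

html = requests.get('https://example.com/page').text  # placeholder URL
# grab everything between 'var tableData = ' and the closing ';'
match = re.search(r'var\s+tableData\s*=\s*(\[.*?\])\s*;', html, re.DOTALL)
if match:
    data = json.loads(match.group(1))  # now an ordinary Python list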
Now, before I go into some background on reverse engineering the requests: the website you're trying to get information from requires only that you reverse engineer the HTTP requests.
Reverse Engineering HTTP requests in Scrapy
Now, in terms of the actual web page, we can use Chrome devtools by right clicking on the page and choosing Inspect. Clicking the Network tab allows you to see all the requests the browser makes to render the page. In this case, you want to see what happens when you click next.
Image1: here
Here you can see all the requests made when you click next on the page. I always look for the biggest-sized response, as that will most likely have your data.
Image2: here
Here you can see the request headers, params, etc.: the things you need to make a proper HTTP request. We can see that the referring URL is actually getplayers.php, with all the params to get the next page added on. If you scroll down, you can see all the parameters it sends to getplayers.php. Keep this in mind; sometimes we need to send headers, cookies, and parameters.
Image3: here
Here is a preview of the data we would get back from the server if we make the correct request. It's a nice, neat format, which is great for scraping.
Now, you could copy the headers, parameters, and cookies into Scrapy here, but it's always worth checking first whether just making an HTTP request to the URL on its own gets you the data you want; if so, that is the simplest way.
In this case it's true, and in fact you get the data in a nice, neat format.
Code example
import scrapy

class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['forge-db.com']

    def start_requests(self):
        url = 'https://www.forge-db.com/fr/fr11/getPlayers.php?'
        yield scrapy.Request(url=url)

    def parse(self, response):
        for row in response.json()['data']:
            yield {'name': row[2], 'guild': row[3]}
Settings
In settings.py, you need to set ROBOTSTXT_OBEY = False. The site doesn't want you to access this data, so we need to set it to False; be careful, you could end up getting banned from the server.
I would also suggest a couple of other settings, to be respectful and to cache the results, so that if you want to play around with this large dataset you don't hammer the server:
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 3
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
Comments on the code
We make a request to https://www.forge-db.com/fr/fr11/getPlayers.php?, and if you were to print the response you would get all the data from the table; it's quite a lot... It looks like JSON, so we use Scrapy's newer feature for handling JSON, response.json(), to convert it into a Python dictionary; be sure you have an up-to-date Scrapy to take advantage of this. Otherwise, you could use the json library that Python provides to do the same thing, as sketched below.
Now, you have to look at the preview data a bit here, but the individual rows are within response.json()['data'][i], where i is the row index. The name and guild are within response.json()['data'][i][2] and response.json()['data'][i][3], so we loop over every row in response.json()['data'] and grab the name and guild.
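For instance, on an older Scrapy the same parse step might look like this sketch, using the standard json module instead of response.json():

import json

def parse(self, response):
    data = json.loads(response.text)  # what response.json() does on newer Scrapy
    for row in data['data']:
        yield {'name': row[2], 'guild': row[3]}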
If the data weren't as structured as it is here and needed modifying, I would strongly urge you to use Items or ItemLoaders for creating the fields you then output. You can modify the extracted data more easily with ItemLoaders, and you can deal with duplicate items etc. using a pipeline. These are just some thoughts for the future; I almost never yield a plain dictionary when extracting data, particularly with large datasets.
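For reference, here is a sketch of what that could look like for this data; the processors chosen are just one reasonable setup, not something this site requires (on older Scrapy versions these processors live in scrapy.loader.processors instead of itemloaders.processors):

import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import MapCompose, TakeFirst

class PlayerItem(scrapy.Item):
    # strip whitespace on the way in, unwrap the single value on the way out
    name = scrapy.Field(input_processor=MapCompose(str.strip),
                        output_processor=TakeFirst())
    guild = scrapy.Field(input_processor=MapCompose(str.strip),
                         output_processor=TakeFirst())

# inside the spider, the parse method from the code example becomes:
def parse(self, response):
    for row in response.json()['data']:
        loader = ItemLoader(item=PlayerItem())
        loader.add_value('name', row[2])
        loader.add_value('guild', row[3])
        yield loader.load_item()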

Using Wikia API

I am trying to access the X-Men API on Wikia, to extract the name and image of each character, to then be used in an SPA built with JavaScript.
This is the link to the page on the wiki:
http://x-men.wikia.com/wiki/Category:Characters
I cannot for the life of me figure out how to access the API. It doesn't seem to be RESTful, and that's all I have any experience with.
Has anyone used the Wikia API successfully before? I can get some articles and such, but nothing useful.
(The documentation is shocking; I've been searching around for hours.)
Probably you have already found a solution, but I think you should write something like this:
import requests

xmen_url = "http://x-men.wikia.com/api/v1/Articles/List?expand=1&category=Characters&limit=10000"
r = requests.get(xmen_url)
response = r.json()
# print(response)
a = 0
for item in response['items']:
    a += 1
    print("{}\t{}\t({})".format(str(a), item['title'].encode(encoding='utf-8'), item['id']))
This will print a list of all the articles in the Characters category (I think there are also some subcategories; you should check). If you want to take a deeper look at the JSON, you can uncomment the commented line.
Hope it helps.
