Context
I am currently going through a course on web scraping. In the module on scraping JavaScript, the call set_1.difference(set_2) was used to distinguish the old values from the newly created ones. But when I ran it, I got this error:
AttributeError: 'list' object has no attribute 'difference'
I searched online and stumbled on this website, but running the example from their own site also brought up an error.
Problem
Any reason why this is not working? I want to print the newly generated JavaScript links. Below is the code I am trying to run:
from requests_html import AsyncHTMLSession
session = AsyncHTMLSession()
r = await session.get('https://www.ons.gov.uk/economy/economicoutputandproductivity/output/datasets/economicactivityfasterindicatorsuk')
r.status_code
divs = r.html.find('div')
downloads = r.html.find('a')
urls = r.html.absolute_links
# Now render the JavaScript. The first call downloads Chromium,
# a browser that has no GUI
await r.html.arender()
new_divs = r.html.find('div')
new_downloads = r.html.find('a')
new_urls = r.html.absolute_links
# Get only the newly created html
new_downloads.difference(downloads)
I don't know what the "r" object is, so I can't verify your code, but difference is a method of sets, not lists.
https://docs.python.org/3/library/stdtypes.html#frozenset.difference
This should do the trick: set(new_downloads).difference(downloads)
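A minimal illustration of the fix, with hypothetical link lists rather than the ONS page:
old_links = ['a.html', 'b.html']
new_links = ['a.html', 'b.html', 'c.html']
# Lists have no .difference method, so this would raise AttributeError:
# new_links.difference(old_links)
# Converting one side to a set is enough; difference() accepts any iterable
fresh = set(new_links).difference(old_links)
print(fresh)  # {'c.html'}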
Related
So I'm trying to open a new window by executing a script in Selenium using driver.execute_script("window.open('');")
But I want to open a link given by the user.
So I took the link input from my array and put it into my JavaScript code like this:
driver.execute_script("window.open(data[0]);")
Now it's giving an error like this:
selenium.common.exceptions.JavascriptException: Message: javascript error: data is not defined
How to fix this? Thanks for your time.
EDIT: A part of my code is something like this:
from selenium import webdriver
import PySimpleGUI as sg
import time
global data
data = []
layouts = [[[sg.Text("Enter the Wordpress New Post link: "), sg.InputText(key=0)]],
[sg.Button('Start The Process'), [sg.Button('Exit')]]]
window = sg.Window("Title", layouts)
def selenium_process():
    # Getting the driver path
    driver = webdriver.Chrome(r'Driver\path')
    driver.get('https://google.com')
    driver.execute_script(f"window.open({data[0]});")
    time.sleep(10000)

while True:
    event, values = window.read()
    if event in (sg.WIN_CLOSED, 'Exit'):
        break
    data.append(values[0])
    selenium_process()
Did you try string interpolation?
Try this:
driver.execute_script(f"window.open({data[0]});")
Your original code does not work because data[0] only exists in Python; inside the script string, JavaScript sees an undefined variable named data. You instead need to substitute data[0] with its value, in a form JavaScript can understand (for a URL, a quoted string).
Please read the description of JavaScript's window.open: https://developer.mozilla.org/fr/docs/Web/API/Window/open
If you just need to get to an URL:
driver.get(data[0])
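For completeness, a safer pattern than interpolating into the script string is to pass the value as a script argument, which Selenium exposes to the script as arguments[0] (a minimal sketch, assuming data[0] holds the URL):
# Selenium serializes Python arguments for the script,
# so no manual quoting or escaping is needed
driver.execute_script("window.open(arguments[0]);", data[0])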
So I am trying to access data on a video game stat tracker website. When I inspect the element on the website and look at the code, it says:
<div class="trn-defstat__value">Division 7</div>
But when I use requests.get(url).text the same element shows up as:
<div class="trn-defstat__value">{{ activeArena.division.metadata.description }}</div>
I am trying to get the "Division 7" part but keep getting this activeArena placeholder. I am using Python; the code I have tried is:
import requests
url = ('https://fortnitetracker.com/profile/all/tl%20starrlol/competitive?season=16')
file = open("myfilename", "w")
r = requests.get(url)
info = r.content
info = str(info)
file.write(info)
file.close()
and I have also tried
import requests
url = ('https://fortnitetracker.com/profile/all/tl%20starrlol/competitive?season=16')
file = open("myfilename", "w")
r = requests.get(url)
info = r.text
file.write(info)
file.close()
I am pretty new to coding so if the answer is obvious I apologize, but I am lost.
The HTML you're receiving contains template-engine placeholders; the JavaScript on the page loads the values and fills them in. If you examine the page via the Network panel in the browser's dev tools, you'll notice a stats API call. Make the same call from your code to extract the data you need.
import requests
url = "https://fortnitetracker.com/api/v0/profile/863f1c3c-2e61-487e-8987-ceefff2981ad/stats"
querystring = {"season": "16", "isCompetitive": "true"}
response = requests.get(url, params=querystring)
data = response.json()
print(data[0]['arena']['division']['displayValue'])
# prints "Contender League Division 7"
It's better to check for official APIs before relying on this approach. Parameters in the URL, such as the UUID after /profile/, may only be valid for a limited time. It's also worth evaluating the Selenium or Puppeteer approach recommended in the comments (under the question) to see if it fits your overall problem.
I looked up various questions and answers but unfortunately none of the problems I found dealt with a case similar to mine. In a typical question, the JavaScript table builds up directly when the website is loaded. In my case, however, I first have to navigate through the JavaScript module and select several criteria before I get the sought-after result.
This is my case: I have to scrape the exchange rates for various currencies from the website www.globocambio.co. To do that, I have (1) to navigate to “I WANT COLOMBIAN PESO”, (2) select the currency (e.g., “Chilean Peso”), and (3) select the collection destination (e.g., “El Dorado International Airport”). Only then is the respective exchange rate loaded. See this screenshot for illustration: the three selection steps are marked in red, and the data point that I want to scrape for different currencies is marked in green.
I am not very familiar with JavaScript but I tried to understand what is going on. Here is what I found out:
Using Chrome DevTools, I investigated the Network activity when loading an exchange rate. There is an XHR called “GetPrice” that requests the price using this URL: https://reservations.globocambio.co/DesktopModules/GlobalExchange/API/Widget/GetPrice and using the following Form Data
ISOAOrigen=CLP&cantidadOrigen=9000&ISOADestino=COP&cantidadDestino=0&centerId=27&operationType=OperationTypesBuying
I understand that the Form Data contains the information that I initially selected manually:
operationType=OperationTypesBuying: this is the “I WANT COLOMBIAN PESO” option
ISOAOrigen=CLP: this is the “Chilean Peso”
centerId=27: this is the “El Dorado International Airport”
The server responds to my request with the following information:
{"MonedaOrigen":{"ISOA":"CLP","Nombre":null,"Margen":0.1630000000,"Tramo":0.0,"Fixing":2.9000000000},"CantidadOrigen":9000.00,"MonedaDestino":{"ISOA":"COP","Nombre":null,"Margen":0.0,"Tramo":0.0,"Fixing":0.0},"CantidadDestino":21845.70,"TipoCambio":2.42730000000000000000,"MargenOrigen":0.0,"TramoOrigen":0.0,"FixingOrigen":0.0,"MargenDestino":0.0,"TramoDestino":0.0,"FixingDestino":0.0,"IdCentro":"27","Comision":null,"ComisionTramoSuperior":null,"ComisionAplicada":{"CodigoMoneda":null,"CodigoTipoMoneda":0,"ComisionFija":0.0,"ComisionVariable":0.0,"TramoInicio":0.0,"TramoFin":null,"Orden":0}}
From this response, "TipoCambio":2.42730000000000000000 is then being written on the website using this line of HTML code: <span id="spTipoCambioCompra">2.427300</span>
This means that "TipoCambio" is the value that I am looking for.
So, I have to communicate somehow via R with the server using the Form Data as input variables. Can anyone tell me how to do this?
I mean, I understand that I have to combine the URL https://reservations.globocambio.co/DesktopModules/GlobalExchange/API/Widget/GetPrice with the Form Data "ISOAOrigen=CLP&cantidadOrigen=9000&ISOADestino=COP&cantidadDestino=0&centerId=27&operationType=OperationTypesBuying" somehow, but I do not know how it works.
Any help will be appreciated!
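For reference, a minimal sketch of replaying that XHR directly, shown in Python with requests (the endpoint and form fields are copied from the Form Data above; whether the server answers without the widget's session cookies is an assumption, and the same POST can be issued from R with httr::POST):
import requests
# Replay the widget's XHR with the form data observed in DevTools;
# the server may additionally require cookies or headers from the widget page
url = "https://reservations.globocambio.co/DesktopModules/GlobalExchange/API/Widget/GetPrice"
form = {
    "ISOAOrigen": "CLP",                      # Chilean Peso
    "cantidadOrigen": "9000",
    "ISOADestino": "COP",
    "cantidadDestino": "0",
    "centerId": "27",                         # El Dorado International Airport
    "operationType": "OperationTypesBuying",  # "I WANT COLOMBIAN PESO"
}
resp = requests.post(url, data=form)
print(resp.json()["TipoCambio"])              # e.g. 2.4273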
Update:
I still have no idea how to solve the above issue, yet. However, I try to approach it with small steps.
Using RSelenium, I am currently trying to find out how to click on the option “I WANT COLOMBIAN PESO”. My idea was to use the following code:
library(RSelenium)
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$navigate("https://www.globocambio.co/en/home")
webElem <- remDr$findElement("id", "tabCompra") #What is wrong here?
webElem$clickElement() # Click on "I WANT COLOMBIAN PESO"
But I get an error message after executing webElem <- remDr$findElement("id", "tabCompra"):
Selenium message:no such element: Unable to locate element: {"method":"css selector","selector":"#tabCompra"}
(Session info: chrome=81.0.4044.113)
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
...
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
What am I doing wrong here?
I solved my problem using Selenium in Python. The crucial point is that the widget is rendered inside an iframe (iframeWidget), which is why the element could not be located from the top-level document:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # needed for Keys.CLEAR, Keys.ENTER, etc.

driver = webdriver.Firefox(executable_path = '/your_path/geckodriver')
driver.get("https://www.globocambio.co/en/")
driver.switch_to.frame("iframeWidget")  # the widget lives inside this iframe

elem = driver.find_element_by_id('tabCompra')  # "I WANT COLOMBIAN PESO"
elem.click()
elem = driver.find_element_by_id('inputddlMonedaOrigenCompra')
elem.click()
elem.send_keys(Keys.CLEAR)
elem.send_keys("Chilean Peso")  # select the currency
elem.send_keys(Keys.ENTER)
elem.send_keys(Keys.ARROW_DOWN)
elem.send_keys(Keys.RETURN)
elem = driver.find_element_by_id('info-change-compra')  # the exchange rate
print(elem.text)
I am trying to make a small web app using Django and JavaScript; however, I have run into a problem.
In my model I have an entity named "entry" which has the attributes id, name, location_lat and location_lon, all as character fields. I pass the entries associated with a certain sport to a "map.html" template using the following code:
def sportpage(request, sportname):
    t = get_template('map.html')
    try:
        entry_list = Feature.objects.filter(sport=sportname)
    except BaseException:
        raise Http404
    sport_list = Sport.objects.all().order_by('name')
    context = RequestContext(request, {'sport_list': sport_list, 'entry_list': entry_list, 'page_title': sportname})
    return HttpResponse(t.render(context))
and in the JavaScript file I attempt to read "entry_list" in using:
var data = "{{entry_list|safe}}";
However, the data only contains a list of the entry names, not the location_lat or location_lon attributes. How would I gain access to these?
Thank you very much for any help
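A queryset rendered into a template like this prints each object's string representation, which here is evidently the name. One common approach to expose all the attributes (a sketch, not from the thread; entry_json is an illustrative name) is to serialize the fields to JSON in the view and parse them in the template:
import json
# In the view: pull only the fields the script needs into plain dicts
entry_list = Feature.objects.filter(sport=sportname)
entry_json = json.dumps(list(entry_list.values('id', 'name', 'location_lat', 'location_lon')))
# add 'entry_json' to the template context, then in the template:
#   <script>var data = {{ entry_json|safe }};</script>
# data is now an array of objects exposing all four attributes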
I have a program that scrapes values from https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj
My current code is:
require 'nokogiri'
require 'open-uri'
require 'date'

doc = Nokogiri::HTML(open(source_url))
puts doc.css('span.indexDate').text
date = doc.css('span.indexDate').text
date = Date.parse(date)
puts date
values = doc.css('table#CdsIndexTable td.col2 span')
puts values
This scrapes the date and the values of the second column from the "CDS Indexes" table correctly, which is fine. Now I want to scrape the similar values from the "Bond Indexes" table, and that is where I am facing the problem.
I can see that a JavaScript function switches the tables without reloading the page and without changing the URL. The difference between the two tables is that their IDs are different, which is exactly as it should be. But unfortunately, when I try:
values = doc.css('table#BondIndexTable')
puts values
I get nothing from the Bond Indexes table. But I do get values from the CDS Indexes table if I use:
values = doc.css('table#CdsIndexTable')
puts values
How can I get the values from both tables?
You can use Capybara with the Poltergeist driver to execute the JavaScript and render the page. Poltergeist is a wrapper for the PhantomJS headless browser. Here's an example of how you can do it:
require 'rubygems'
require 'nokogiri'
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

Capybara.default_driver = :poltergeist
Capybara.run_server = false

module GetPrice
  class WebScraper
    include Capybara::DSL

    def get_page_data(url)
      visit(url)
      doc = Nokogiri::HTML(page.html)
      doc.css('td.col2 span')
    end
  end
end
scraper = GetPrice::WebScraper.new
puts scraper.get_page_data('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj').map(&:text).inspect
Visit here for a complete example using Amazon.com:
https://github.com/wakproductions/amazon_get_price/blob/master/getprice.rb
If you don't want to use PhantomJS, you can also use the network sniffer in the Firefox or Chrome development tools, and you will see that the HTML table data is returned by a JavaScript POST request to the server.
Then, rather than opening the original page URL with Nokogiri, you'd run this request from your Ruby script and parse and interpret that data instead. It looks like it's just JSON data with HTML embedded in it; you could extract the HTML and feed that to Nokogiri.
It takes a bit of extra detective work, digging into the inner workings of the page and its network traffic, but I've used this method many times with JavaScript-heavy pages, and it works fine for most simple scraping tasks.
Here's an example of the JSON data endpoints used by the JavaScript requests:
Bonds:
https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ
CDS:
https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=cds&ClientCode=WSJ
Here's a quick-and-dirty solution just so you get the idea. It grabs the cookie from the initial page and uses it in the request for the JSON data, then parses the JSON and feeds the extracted HTML to Nokogiri:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'json'
# Open the initial page to grab the cookie from it
p1 = open('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj')
# Save the cookie
cookie = p1.meta['set-cookie'].split('; ',2)[0]
# Open the JSON data page using our cookie we just obtained
p2 = open('https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ',
          'Cookie' => cookie)
# Get the raw JSON
json = p2.read
# Parse it
data = JSON.parse(json)
# Feed the html portion to Nokogiri
doc = Nokogiri.parse(data['html'])
# Extract the values
values = doc.css('td.col2 span')
puts values.map(&:text).inspect
=> ["0.02%", "0.02%", "n.a.", "-0.03%", "0.02%", "0.04%",
"0.01%", "0.02%", "0.08%", "-0.01%", "0.03%", "0.01%", "0.05%", "0.04%"]
PhantomJS is a headless browser with a JavaScript API. Since you need to run the scripts on the page you are scraping, a browser will do that for you, and PhantomJS will let you manipulate and scrape the page after the scripts have executed.