Web scraping in R by first navigating through a JavaScript module

I looked up various questions and answers but unfortunately none of the problems I found dealt with a case that is similar to mine. In a typical question, the JavaScript table builds up directly when the website is loaded. In my case, however, I first have to navigate through the JavaScript module and select several criteria before I get the sought-after result.
This is my case: I have to scrape the exchange rates for various currencies from the website www.globocambio.co. To do that, I have to (1) navigate to "I WANT COLOMBIAN PESO", (2) select the currency (e.g., "Chilean Peso"), and (3) select the collection destination (e.g., "El Dorado International Airport"). Only then is the respective exchange rate loaded. See this screenshot for illustration: the three selection steps are marked in red, and green marks the data point that I want to scrape for different currencies.
I am not very familiar with JavaScript but I tried to understand what is going on. Here is what I found out:
Using Chrome DevTools, I investigated the network activity when loading an exchange rate. There is an XHR called "GetPrice" that requests the price from this URL: https://reservations.globocambio.co/DesktopModules/GlobalExchange/API/Widget/GetPrice using the following Form Data:
ISOAOrigen=CLP&cantidadOrigen=9000&ISOADestino=COP&cantidadDestino=0&centerId=27&operationType=OperationTypesBuying
I understand that the Form Data contains the information that I initially selected manually:
operationType=OperationTypesBuying: this is the “I WANT COLOMBIAN PESO” option
ISOAOrigen=CLP: this is the “Chilean Peso”
centerId=27: this is the “El Dorado International Airport”
The server responds to my request with the following information:
{"MonedaOrigen":{"ISOA":"CLP","Nombre":null,"Margen":0.1630000000,"Tramo":0.0,"Fixing":2.9000000000},"CantidadOrigen":9000.00,"MonedaDestino":{"ISOA":"COP","Nombre":null,"Margen":0.0,"Tramo":0.0,"Fixing":0.0},"CantidadDestino":21845.70,"TipoCambio":2.42730000000000000000,"MargenOrigen":0.0,"TramoOrigen":0.0,"FixingOrigen":0.0,"MargenDestino":0.0,"TramoDestino":0.0,"FixingDestino":0.0,"IdCentro":"27","Comision":null,"ComisionTramoSuperior":null,"ComisionAplicada":{"CodigoMoneda":null,"CodigoTipoMoneda":0,"ComisionFija":0.0,"ComisionVariable":0.0,"TramoInicio":0.0,"TramoFin":null,"Orden":0}}
From this response, "TipoCambio":2.42730000000000000000 is then written to the page in this line of HTML: <span id="spTipoCambioCompra">2.427300</span>
This means that "TipoCambio" is the value that I am looking for.
So, I have to communicate with the server from R somehow, using the Form Data as input variables. Can anyone tell me how to do this?
I mean, I understand that I have to combine the URL https://reservations.globocambio.co/DesktopModules/GlobalExchange/API/Widget/GetPrice with the Form Data "ISOAOrigen=CLP&cantidadOrigen=9000&ISOADestino=COP&cantidadDestino=0&centerId=27&operationType=OperationTypesBuying" somehow, but I do not know how that works.
Any help will be appreciated!
Update:
I still have no idea how to solve the issue above. However, I am trying to approach it in small steps.
Using RSelenium, I am currently trying to find out how to click on the option “I WANT COLOMBIAN PESO”. My idea was to use the following code:
library(RSelenium)
remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()
remDr$navigate("https://www.globocambio.co/en/home")
webElem <- remDr$findElement("id", "tabCompra") # What is wrong here?
webElem$clickElement() # Click on "I WANT COLOMBIAN PESO"
But I get an error message after executing webElem <- remDr$findElement("id", "tabCompra"):
Selenium message:no such element: Unable to locate element: {"method":"css selector","selector":"#tabCompra"}
(Session info: chrome=81.0.4044.113)
For documentation on this error, please visit: https://www.seleniumhq.org/exceptions/no_such_element.html
...
Error: Summary: NoSuchElement
Detail: An element could not be located on the page using the given search parameters.
class: org.openqa.selenium.NoSuchElementException
Further Details: run errorDetails method
What am I doing wrong here?

I solved my problem using selenium in Python:
from selenium import webdriver
from selenium.webdriver.common.keys import Keys  # needed for Keys.CLEAR etc.

driver = webdriver.Firefox(executable_path='/your_path/geckodriver')
driver.get("https://www.globocambio.co/en/")
driver.switch_to.frame("iframeWidget")  # the widget lives inside an iframe
elem = driver.find_element_by_id('tabCompra')
elem.click()  # "I WANT COLOMBIAN PESO"
elem = driver.find_element_by_id('inputddlMonedaOrigenCompra')
elem.click()
elem.send_keys(Keys.CLEAR)
elem.send_keys("Chilean Peso")
elem.send_keys(Keys.ENTER)
elem.send_keys(Keys.ARROW_DOWN)
elem.send_keys(Keys.RETURN)
elem = driver.find_element_by_id('info-change-compra')
print(elem.text)
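Since the GetPrice endpoint identified above returns the exchange rate as JSON, the same value could presumably be fetched without a browser at all by replicating the captured POST. Here is a minimal sketch with Python's requests library, assuming the endpoint accepts a plain form-encoded POST and does not need session cookies or extra headers (untested):

import requests

# form fields copied from the DevTools capture above
url = "https://reservations.globocambio.co/DesktopModules/GlobalExchange/API/Widget/GetPrice"
form_data = {
    "ISOAOrigen": "CLP",                      # currency to convert from (Chilean Peso)
    "cantidadOrigen": "9000",                 # amount of the origin currency
    "ISOADestino": "COP",                     # currency to convert to (Colombian Peso)
    "cantidadDestino": "0",
    "centerId": "27",                         # El Dorado International Airport
    "operationType": "OperationTypesBuying",  # the "I WANT COLOMBIAN PESO" option
}

r = requests.post(url, data=form_data)
r.raise_for_status()
print(r.json()["TipoCambio"])  # the rate shown in <span id="spTipoCambioCompra">

The same pattern should translate to R with httr, e.g. httr::POST(url, body = form_data, encode = "form").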

Related

Selenium/Beautiful Soup scraper failing after looping through one page (Javascript)

I'm attempting to scrape data on food seasonality from the Seasonal Food Guide but hitting a snag. The site has a fairly simple URL structure:
https://www.seasonalfoodguide.org/produce_name/state_name
I've been able to use Selenium and Beautiful Soup to successfully scrape the seasonality information from one page, but on subsequent loops the section of text I'm looking for doesn't actually load so I get AttributeError: 'NoneType' object has no attribute 'text'. I know it's because months_list_raw is coming back empty due to the fact that the 'wheel-months-list' portion of the page isn't loading on the second loop. Code is below. Any ideas?
for ingredient in produce_list:
    for state in state_list:
        # grab page content
        search_url = 'https://www.seasonalfoodguide.org/{}/{}'.format(ingredient, state)
        driver.get(search_url)
        page_soup = soup(driver.page_source, 'lxml')
        # grab list of months
        months_list_raw = page_soup.find('p', {'id': 'wheel-months-list'})
        months_list = months_list_raw.text
The page is being rendered on the client side, which means when you open the page, another request is being made to a backend server to fetch the data based on your selected filters. So the issue is that when you open the page and read the HTML, the content is not fully loaded yet. The simplest thing you could do is sleep for some time after opening the page with Selenium in order to wait for it to fully load. I've tested your code by throwing in time.sleep(3) after the driver.get(search_url) and it worked fine.
To prevent the error from occurring and to continue with your loop, you need to check whether the months_list_raw element is None. It seems like some of the produce pages do not have any data for some states, so you will need to handle that case in your program however you want.
import time

for ingredient in produce_list:
    for state in state_list:
        # grab page content
        search_url = 'https://www.seasonalfoodguide.org/{}/{}'.format(ingredient, state)
        driver.get(search_url)
        time.sleep(3)  # give the client-side rendering time to finish
        page_soup = soup(driver.page_source, 'lxml')
        # grab list of months
        months_list_raw = page_soup.find('p', {'id': 'wheel-months-list'})
        if months_list_raw is not None:
            months_list = months_list_raw.text
        else:
            # handle case where ingredient/state data doesn't exist
            continue
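If a fixed sleep feels too slow or fragile, an explicit wait is another option. Here is a sketch of the inner loop body using Selenium's WebDriverWait; only the 'wheel-months-list' id is taken from the question, the rest assumes the same driver setup:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.common.exceptions import TimeoutException

driver.get(search_url)
try:
    # wait up to 10 seconds for the months list to be rendered client-side
    months_el = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, 'wheel-months-list'))
    )
    months_list = months_el.text
except TimeoutException:
    months_list = None  # no data for this ingredient/state combination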

Using Wikia API

I am trying to access the X-men API on Wikia to extract the name and image of each character, to then be used in a SPA built with JavaScript.
This is the link to the page on the wiki:
http://x-men.wikia.com/wiki/Category:Characters
I cannot for the life of me figure out how to access the API. It doesn't seem to be RESTful, and that's all I have any experience with.
Has anyone used the Wikia API successfully before? I can get some articles and such, but nothing useful.
(The documentation is shocking, been searching around for hours.)
Probably you have already found a solution, but I think you should write something like this:
import requests

xmen_url = "http://x-men.wikia.com/api/v1/Articles/List?expand=1&category=Characters&limit=10000"
r = requests.get(xmen_url)
response = r.json()
# print(response)
a = 0
for item in response['items']:
    a += 1
    print("{}\t{}\t({})".format(str(a), item['title'].encode(encoding='utf-8'), item['id']))
This will print a list of all the articles in the Characters category (I think there are also some subcategories; you should check). If you want to take a deeper look at the JSON, you can uncomment the commented line.
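Since the question also asks for each character's image, the same v1 API exposes (or at least used to expose) an Articles/Details endpoint that returns a thumbnail URL per article id. Here is a sketch building on the list above; the endpoint and field names are assumptions that you should verify against the wiki's /api/v1 documentation:

import requests

base = "http://x-men.wikia.com/api/v1"
articles = requests.get(base + "/Articles/List",
                        params={"expand": 1, "category": "Characters", "limit": 10000}).json()
ids = [str(item["id"]) for item in articles["items"]]

# Assumption: Articles/Details accepts a comma-separated list of ids and returns
# an "items" dict keyed by id, with "title" and "thumbnail" fields per article.
details = requests.get(base + "/Articles/Details",
                       params={"ids": ",".join(ids[:50])}).json()
for article_id, info in details["items"].items():
    print(info["title"], info.get("thumbnail"))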
Hope it helps.

Scraping a webpage with python to get onclick values

First of all I have to say: be patient with me, because I am not very familiar with the topic I am about to describe.
I'd like to download the intraday historical values of some equities on Frankfurt Boerse website. Let me take this equity for example: http://www.boerse-frankfurt.de/en/equities/adidas+ag+DE000A1EWWW0/price+turnover+history/tick+data#page=1
As you can see there are two options: trades on Frankfurt and trades on Xetra. I'd love to download the latter. I tried to scrape the data, but my knowledge of Python is very poor.
How can I 'select' the desired onclick option?
Thanks in advance for your replies. Regards
PS: For your information, I noticed the following while inspecting the Xetra element: its value changes when I move on to the next page, and when I come back the value is different again. Here is an example: the first time on page 1 I got
a onclick="d39081344_fkt_set_par('6');d39081344_fkt_set_active(this);" class="brs_d39081344_li current last"
, then I moved on to page 2 and I got
a onclick="d51109535_fkt_set_par('6');d51109535_fkt_set_active(this);" class="brs_d51109535_li current last" and coming back to page 1 I got a onclick="d96086211_fkt_set_par('6');d96086211_fkt_set_active(this);" class="brs_d96086211_li current last"
The trick is to look at what calls are made when you navigate through the pages. Your browser's network analysis tool is invaluable for this. When I go from page to page, a POST is made to http://www.boerse-frankfurt.de/en/parts/boxes/history/_tickdata_full.m with data about the request.
Then the goal is to replicate and loop the requests using python. Here is code to get you started:
import requests
r = requests.post('http://www.boerse-frankfurt.de/en/parts/boxes/history/_tickdata_full.m', data={'component_id':'PREKOP97077bf9dec39f14320bf9d40b636c7c589', 'page':"3", 'page_size':'50', 'boerse_id':'6', 'titel':'Tick-Data', 'lang':'en', 'text':'LOcbaec84ecad1b94ad2fd257897c87361', 'items_per_page':'50', 'template':'0', 'pages_total':'50', 'use_external_secu':'1', 'item_count':'2473', 'include_url':'/parts/boxes/history/_tickdata_full.m', 'ag':'291', 'secu':'291', })
print(r.text)  # here is your data of interest, it still needs to be parsed
That is the general idea. You would then put that in a loop, adding one to the page parameter each time.
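For instance, the page loop might look roughly like this; the form fields are copied from the captured request above, and values such as component_id and text will probably need to be refreshed from your own browser session, since they appear to be generated per visit:

import requests

url = 'http://www.boerse-frankfurt.de/en/parts/boxes/history/_tickdata_full.m'
payload = {
    'component_id': 'PREKOP97077bf9dec39f14320bf9d40b636c7c589',
    'page_size': '50', 'boerse_id': '6', 'titel': 'Tick-Data', 'lang': 'en',
    'text': 'LOcbaec84ecad1b94ad2fd257897c87361', 'items_per_page': '50',
    'template': '0', 'pages_total': '50', 'use_external_secu': '1',
    'item_count': '2473', 'include_url': '/parts/boxes/history/_tickdata_full.m',
    'ag': '291', 'secu': '291',
}

pages = []
for page in range(1, 51):          # pages_total was 50 in the captured request
    payload['page'] = str(page)
    r = requests.post(url, data=payload)
    pages.append(r.text)           # raw HTML fragment, still needs parsing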

finding the ajax request in dojo

I am working on a crawler to scrape all data from the website. They use AJAX for pagination. I found this in the href of the page numbers:
javascript:dojo.publish("showResultsForPageNumber",[{pageNumber:"4",pageSize:"15", linkId:"WC_SearchBasedNavigationResults_pagination_link_4_categoryResults"}])
What is happening here? I am not familiar with Dojo. Can anyone help me find the corresponding server script so that I can scrape all the data, including pagination?
Update #1:
In the console I found the following; this is the code that the call is redirected to:
showResultsPage: function(data) {
    var pageNumber = data['pageNumber'];
    var pageSize = data['pageSize'];
    pageNumber = dojo.number.parse(pageNumber);
    pageSize = dojo.number.parse(pageSize);
    setCurrentId(data["linkId"]);
    if (!submitRequest()) {
        return;
    }
    console.debug(wc.render.getContextById('searchBasedNavigation_context').properties); //line 773
    var beginIndex = pageSize * (pageNumber - 1);
    cursor_wait();
    wc.render.updateContext('searchBasedNavigation_context', {"productBeginIndex": beginIndex, "resultType": "products"});
    this.updateHistory();
    MessageHelper.hideAndClearMessage();
},
It's part of the publisher/subscriber mechanism of the Dojo framework and does not, by itself, tell you anything about the AJAX request that gets executed.
If you're not familiar with the publisher/subscriber pattern, then let's explain that first. To decouple certain components/parts of the application, this pattern is commonly used.
On one side, someone publishes information, while on the other side (= some other part of the application) someone listens to it.
In this case, the following data is published (= second parameter):
[{
    pageNumber: "4",
    pageSize: "15",
    linkId: "WC_SearchBasedNavigationResults_pagination_link_4_categoryResults"
}]
Obviously, not all subscribers in the application need to know about this data, so there is a topic system; in this case, the data is published to a topic called "showResultsForPageNumber" (= first parameter).
To know what happens next, you will have to look through the code for someone who subscribes to that topic. So somewhere in the code you will find something like this:
dojo.subscribe("showResultsForPageNumber", function(data) {
    // Does something with the data, perhaps an AJAX call?
});
To answer your question: look in the code for something like dojo.subscribe("showResultsForPageNumber", as it will tell you what happens next.
However, if you're just interested in the AJAX calls, it will be easier to check the network requests. If you're using Google Chrome/Mozilla Firefox/... you can press F12 to open the developer tools, then select the Network tab and activate it if necessary. Now click on the pagination controls and you will see a log of all network traffic, together with the request and response data.
Here you are publishing to the topic named "showResultsForPageNumber", where "pageNumber", "pageSize" and "linkId" are properties of the object in your argument array.
See the following links: ref1, ref2

Long initial page load on firefox/chrome using faye/nodejs due to transport layer /meta/connect call

I'm having a strange issue with faye/nodejs where the page appears to be loading for a long time on an initial page load due to a /meta/connect call. This page load appears to last for exactly 45s (which is the value of the timeout set on the server)
Here are the details of the call:
The call in question is the following:
RAW GET:
https://MYURL.com:8089/notifications?message=%5B%7B%22channel%22%3A%22%2Fmeta%2Fconnect%22%2C%22clientId%22%3A%220c3gocq1rwi3sl0dskn4u00e8wj7%22%2C%22connectionType%22%3A%22callback-polling%22%2C%22id%22%3A%225%22%7D%5D&jsonp=__jsonp3__
params:
jsonp: __jsonp3__
message: [{"channel":"/meta/connect","clientId":"0c3gocq1rwi3sl0dskn4u00e8wj7","connectionType":"callback-polling","id":"5"}]
response:
__jsonp3__([{"id":"5","clientId":"0c3gocq1rwi3sl0dskn4u00e8wj7","channel":"/meta/connect","successful":true,"advice":{"reconnect":"retry","interval":0,"timeout":45000}}]);
I've tried it without SSL, but the problem still persists, so it doesn't appear to be related to that.
The page is completely responsive during this time, but it's obviously an issue for my customers, as they just see the loading bar in Firefox or Chrome and end up waiting the full 45 seconds for it to stop before proceeding. Any help in debugging or mitigating this issue is appreciated; possibly making the initial connect call asynchronous so it doesn't trigger on the initial page load?
I've also posted on the faye google group here: https://groups.google.com/forum/?fromgroups#!topic/faye-users/xZI4adt3DpA%5B1-25%5D
But I have not gotten a reply yet, though it does seem that I am not the only one with this issue.
Any help is appreciated.
Thanks!
Kevin
Just in case any future googlers stumble on this topic: the issue in question has been resolved in the newer versions of Faye. There are some further details on the google group link in my original question - the issue should be fixed as of faye 0.8.4 (currently 0.8.6)
I can confirm that this fixed the issue for me, I no longer see any timeouts on page load.
Sounds like you're not end()ing the response you're sending out, so your server is keeping the connection open.
When sending to channel /meta/connect add this to your params:
"advice":{"timeout": 0}
So your connect message should look like this:
{"channel":"/meta/connect","clientId":"0c3gocq1rwi3sl0dskn4u00e8wj7","connectionType":"callback-polling","id":"5","advice":{"timeout":0}}
You can follow my solution starting from this point:
# server.rb
@engine.connect(response['clientId'], message['advice']) do |events|
  callback.call([response] + events)
end
...
# proxy.rb
def connect(client_id, options = {}, &callback)
  debug 'Accepting connection from ?', client_id
  @engine.ping(client_id)
  conn = connection(client_id, true)
  conn.connect(options, &callback)
  @engine.empty_queue(client_id)
end
...
# connection.rb
def connect(options, &block)
  options = options || {}
  timeout = options['timeout'] ? options['timeout'] / 1000.0 : @engine.timeout
  set_deferred_status(:deferred)
  callback(&block)
  begin_delivery_timeout
  begin_connection_timeout(timeout)
end
These methods are called when a message arrives on the /meta/connect channel.
