Selenium WebDriver: wait for a JavaScript-generated table to be loaded/visible - javascript

I'm trying to scrape the data within the tables of this F5 article (https://support.f5.com/csp/article/K15386).
The problem I'm facing is that it sometimes crawls the page correctly (the tables are in the DOM), but other times it does not pick up the JavaScript-generated tables at all.
I have tried using implicit and explicit waits, but with no success. With an explicit wait I am not able to select the table; I always get timeouts.
Any idea on how to reliably access the data within the tables?
I'm using Java 8, Selenium 3.141.59 & ChromeDriver 85.0.4183.87.
driver.manage().timeouts().implicitlyWait(15, TimeUnit.SECONDS);
WebDriverWait wait = new WebDriverWait(driver, 15);
WebElement element = wait
.until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("table")));
Edit:
I wish I could use their API; however, since it is not documented, I am not allowed to.

What about this:
List<WebElement> tables = driver.findElements(By.tagName("table"));
if (tables.isEmpty()) {
    try {
        wait.until(ExpectedConditions.presenceOfElementLocated(By.tagName("table")));
        tables.addAll(driver.findElements(By.tagName("table")));
    } catch (TimeoutException e) {
        // no table appeared within the wait period
    }
}
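Also worth noting: the table element can be present in the DOM before its rows are rendered, so it may be more reliable to wait for the row data itself, and to avoid mixing the implicit wait with the explicit one. A rough sketch (the "table tbody tr" locator is an assumption about the article's markup, not something verified against it):
// Sketch: wait for actual row data rather than the bare <table> element.
// "table tbody tr" is an assumed locator for the K15386 article's tables.
driver.manage().timeouts().implicitlyWait(0, TimeUnit.SECONDS); // don't mix implicit and explicit waits
WebDriverWait wait = new WebDriverWait(driver, 30);
List<WebElement> rows = wait.until(
        ExpectedConditions.numberOfElementsToBeMoreThan(By.cssSelector("table tbody tr"), 0));
rows.forEach(row -> System.out.println(row.getText()));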

Related

JavaScript / Python interaction in Linux without a REST framework?

I'm working on some changes to a page that needs to retrieve information from some files under /proc so the page can display version information to the user. Currently, the page is generated entirely by the Python script, which allows me to just read the file and put everything in the page at creation time.
However, this led to the issue that the version numbers wouldn't update when a new version of the software was uploaded. I don't want to regenerate the page every time a new package is installed, so I made the main page static and want to instead just query the information from a Python script and return it to the page to populate the page when loaded.
The Python scripts are set up as CGI and have sudo access, so there's no issue with them retrieving those files. However, if I wanted to use something like AJAX to call the Python script, is there any way I could return the data without using a REST framework such as Flask or Django? The application needs to be lightweight and preferably not rely on a new framework.
Is there a way I can do this with vanilla JavaScript and Python?
OK, so the solution was fairly simple; I had just made a few syntax errors that kept it from working the first few times I tried it.
So the request looked like this:
window.onload = function() {
    var xhr = new XMLHttpRequest();
    xhr.onreadystatechange = function() {
        if ((this.readyState == 4) && (this.status == 200)) {
            var response = JSON.parse(this.responseText);
            // Do stuff with the JSON here...
        }
    };
    xhr.open("GET", scriptURL, true);
    xhr.send();
};
From there, the Python script simply needed to do something like this to return JSON data containing my version numbers:
import sys, cgi, json

result = {}
result['success'] = True
result['message'] = "The command completed successfully"
d = {}
# ... write version information to the 'd' map ...
result['data'] = d

sys.stdout.write("Content-Type: text/plain\n\n")
sys.stdout.write(json.dumps(result))
sys.stdout.write("\n")
sys.stdout.close()
The most persistent problem, which took me forever to find, was that I had forgotten a closing quotation mark in my script tag, which caused the whole page not to load.

Node.js: requesting a page and allowing the page to build before scraping

I've seen some answers to this that refer the asker to other libraries (like phantom.js), but I'm wondering here whether it is at all possible to do this in just Node.js.
Consider my code below. It requests a webpage using request, then uses cheerio to explore the DOM and scrape the page for data. It works flawlessly, and if everything had gone as planned, I believe it would have output a file as I imagined in my head.
The problem is that the page I am requesting builds the table I'm looking at asynchronously, using either AJAX or JSONP; I'm not entirely sure how .jsp pages work.
So here I am, trying to find a way to "wait" for this data to load before I scrape it into my new file.
var cheerio = require('cheerio'),
    request = require('request'),
    fs = require('fs');

// Go to the page in question
request({
    method: 'GET',
    url: 'http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp'
}, function(err, response, body) {
    if (err) return console.error(err);
    // Tell Cheerio to load the HTML
    var $ = cheerio.load(body);
    // Create an empty object to write to the file later
    var toSort = {};
    // Iterate over the DOM and fill the toSort object
    $('#emb table td.list_right').each(function() {
        var row = $(this).parent();
        toSort[$(this).text()] = {
            [$("#lastdate").text()]: $(row).find(".idx1").html(),
            [$("#currdate").text()]: $(row).find(".idx2").html()
        };
    });
    // Write/overwrite a new file
    var stream = fs.createWriteStream("/tmp/shipping.txt");
    var toWrite = "";
    stream.once('open', function(fd) {
        toWrite += "{\r\n";
        for (var i in toSort) {
            toWrite += "\t" + i + ": { \r\n";
            for (var j in toSort[i]) {
                toWrite += "\t\t" + j + ":" + toSort[i][j] + ",\r\n";
            }
            toWrite += "\t" + "}, \r\n";
        }
        toWrite += "}";
        stream.write(toWrite);
        stream.end();
    });
});
The expected result is a text file with information formatted like a JSON object.
It should contain multiple entries that look something like this:
"QINHUANGDAO - GUANGZHOU (50,000-60,000DWT)": {
    "2016-09-29": 26.7,
    "2016-09-30": 26.8,
},
But since the name is the only thing that doesn't load asynchronously (the dates and values do), I get a messed-up object.
Actually, I tried just setting a setTimeout in various places in the code. The script will only be touched by developers who can afford to run it several times if it fails, so while not ideal, even a setTimeout (up to maybe 5 seconds) would be good enough.
It turns out the setTimeout calls don't work. I suspect that once I request the page, I'm stuck with a snapshot of the page "as is" when I receive it; I'm not in fact looking at a live thing whose dynamic content I can wait for.
I've considered investigating how to intercept the packets as they come in, but I don't understand HTTP well enough to know where to start.
The setTimeout will not make any difference even if you increase it to an hour. The problem here is that you are making a request against this url:
http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp
and their server returns the HTML, which contains the JS and CSS imports. That is as far as your case goes: you just have the raw HTML and nothing else. A browser, on the other hand, knows how to parse the HTML document and execute the JavaScript it references, and that is exactly your problem: your program never runs the scripts that build the table. You need to find or write a scraper that is able to run JavaScript. I just found a similar issue on Stack Overflow:
Web-scraping JavaScript page with Python
The answer there suggests https://github.com/niklasb/dryscrape, and it seems that this tool is able to run JavaScript. It is written in Python, though.
You are trying to scrape the original page, which doesn't include the data you need.
When the page is loaded, the browser evaluates the JS code it includes, and that code knows where and how to get the data.
The first option is to evaluate the same code, as PhantomJS does.
The other option (and you seem to be interested in this one) is to investigate the page's network activity and work out which additional requests you should perform to get the data you need.
In your case, these are:
http://index.chineseshipping.com.cn/servlet/cbfiDailyGetContrast?SpecifiedDate=&jc=jsonp1475577615267&_=1475577619626
and
http://index.chineseshipping.com.cn/servlet/allGetCurrentComposites?date=Tue%20Oct%2004%202016%2013:40:20%20GMT+0300%20(MSK)&jc=jsonp1475577615268&_=1475577620325
In both requests:
_ is a cache-busting parameter to prevent caching.
jc is the name of a JS wrapper function which is invoked with the result (https://en.wikipedia.org/wiki/JSONP).
So, by scraping the table template at http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp and performing the two additional requests, you will be able to combine them into the same data structure you see in the browser.
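If it helps to see the JSONP part spelled out: the response body is just the JSON payload wrapped in a call to the function named by jc, so you strip that wrapper before parsing. A minimal sketch of the unwrapping step, shown in Java only because that is the language of the main question on this page (the callback name and payload are made up, not the real response):
// A JSONP response is "<callback>(<json>)": strip the wrapper, then parse what's left.
// The callback name and payload below are illustrative only.
String raw = "jsonp1475577615267({\"date\":\"2016-09-30\",\"value\":26.8})";
String json = raw.substring(raw.indexOf('(') + 1, raw.lastIndexOf(')'));
System.out.println(json); // hand this string to any JSON parser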

IE WebDriver PageSource is not refreshing after change in page due to JavaScript

I'm trying to automate a web application using Selenium in C#.
On the home page, I click a link which leads to another page.
Then I switch to this new page using the following code:
string parent = webDriver.CurrentWindowHandle;
while (webDriver.WindowHandles.Count <= 1) ; // wait for the new tab to open
foreach (string handle in webDriver.WindowHandles)
{
    if (handle != parent)
    {
        webDriver.SwitchTo().Window(handle);
        break;
    }
}
This new page has only two links (to select a user role).
After clicking on the second link, the entire page is changed by JavaScript and new data is loaded into the same page.
However, even after the page has changed, the WebDriver returns the same PageSource (that of the page with the two links).
The title of the changed page is reported correctly by the browser.
I've read in the documentation that the IE WebDriver does not always return the latest PageSource.
Considering that, I assumed only the PageSource was incorrect and that the driver was otherwise handling the changed page I was expecting.
So I did a small test using
webDriver.FindElements(By.XPath("//a"));
but it did not give the tags from the changed page; instead it gave the tags from the page with the two selection links.
Why is the driver not returning the latest tags?
I'm stuck on this issue and will really appreciate any help.
Thanks in advance!
I would wait for an element indicating that the page is fully loaded before getting the page source:
WebDriverWait wait = new WebDriverWait(driver, 20);

// switch to the next window
String main_handle = driver.getWindowHandle();
wait.until((WebDriver drv) -> {
    for (String handle : drv.getWindowHandles()) {
        if (!handle.equals(main_handle)) {
            drv.switchTo().window(handle);
            return true;
        }
    }
    return false;
});

// wait for an element whose presence indicates that the page is loaded
wait.until(ExpectedConditions.presenceOfElementLocated(By.id("...")));

// get the page source
String page_source = driver.getPageSource();
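If the role selection rewrites the same window with JavaScript instead of opening a new one, another option is to keep a reference to an element from the old view and wait for it to go stale before reading the source. A rough sketch (the locator for the role link is hypothetical):
// Sketch: grab an element from the old view, click, then wait for it to be detached.
// By.linkText("Second role") is a hypothetical locator for one of the two role links.
WebElement roleLink = driver.findElement(By.linkText("Second role"));
roleLink.click(); // triggers the JavaScript that rewrites the page
wait.until(ExpectedConditions.stalenessOf(roleLink));
String refreshedSource = driver.getPageSource();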

Executing JavaScript on Selenium/PhantomJS

I'm using PhantomJS via Selenium Webdriver in Python and I'm trying to execute a piece of JavaScript on the page in hopes of returning a piece of data:
from selenium import webdriver
driver = webdriver.PhantomJS("phantomjs.cmd") # or add to your PATH
driver.set_window_size(1024, 768) # optional
driver.get('http://google.com') # EXAMPLE, not actual URL
driver.save_screenshot('screen.png') # save a screenshot to disk
jsres = driver.execute('$("#list").DataTable().data()')
print(jsres)
However, when run, it reports a KeyError. I was unable to find much documentation on the available commands, so I'm a bit stuck here.
The method for executing JavaScript is called execute_script(), not execute():
driver.execute_script('return $("#list").DataTable().data();')
FYI, execute() is used internally for sending WebDriver commands.
Note that if you want something returned by the JavaScript code, you need to use return.
Also note that this can throw a "Can't find variable: $" error. In that case, locate the element with Selenium and pass it into the script:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# explicitly wait for the element to become present
wait = WebDriverWait(driver, 10)
element = wait.until(EC.presence_of_element_located((By.ID, "list")))
# pass the found element into the script
jsres = driver.execute_script('return arguments[0].DataTable().data();', element)
print(jsres)
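As an aside, for the Java/Chrome setup in the main question at the top of this page, the same idea is exposed through JavascriptExecutor. A minimal sketch (the script body is only illustrative; it counts tables rather than calling DataTables):
// Sketch: executing JavaScript from Selenium's Java bindings.
// The script here just counts <table> elements; adjust it to return whatever you need.
JavascriptExecutor js = (JavascriptExecutor) driver;
Long tableCount = (Long) js.executeScript("return document.querySelectorAll('table').length;");
System.out.println("Tables in DOM: " + tableCount);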

invoking onclick event with beautifulsoup python

I am trying to fetch the links to all accommodations in Cyprus from this website:
http://www.zoover.nl/cyprus
So far I can retrieve the first 15 links, which are already shown. Now I have to invoke the click on the "volgende" (next) link. However, I don't know how to do that, and in the source code I am not able to track down the function that is called, so I can't use something like what is posted here:
Issues with invoking "on click event" on the html page using beautiful soup in Python
I only need the step where the "clicking" happens, so I can fetch the next 15 links and so on.
Does anybody know how to help?
Thanks already!
EDIT:
My code looks like this now:
def getZooverLinks(country):
    zooverWeb = "http://www.zoover.nl/"
    url = zooverWeb + country
    parsedZooverWeb = parseURL(url)
    driver = webdriver.Firefox()
    driver.get(url)
    button = driver.find_element_by_class_name("next")
    links = []
    for page in xrange(1, 3):
        for item in parsedZooverWeb.find_all(attrs={'class': 'blue2'}):
            for link in item.find_all('a'):
                newLink = zooverWeb + link.get('href')
                links.append(newLink)
        button.click()
and I get the following error:
selenium.common.exceptions.StaleElementReferenceException: Message: Element is no longer attached to the DOM
Stacktrace:
at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:8956)
at Utils.getElementAt (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver#googlecode.com/components/command-processor.js:8546)
at fxdriver.preconditions.visible (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver#googlecode.com/components/command-processor.js:9585)
at DelayedCommand.prototype.checkPreconditions_ (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver#googlecode.com/components/command-processor.js:12257)
at DelayedCommand.prototype.executeInternal_/h (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver#googlecode.com/components/command-processor.js:12274)
at DelayedCommand.prototype.executeInternal_ (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver#googlecode.com/components/command-processor.js:12279)
at DelayedCommand.prototype.execute/< (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver#googlecode.com/components/command-processor.js:12221)
I'm confused :/
While it might be tempting to look for something like an evaluateJavaScript method in BeautifulSoup, in the end BeautifulSoup is a parser rather than an interactive web-browsing client.
You should seriously consider solving this with selenium, as briefly shown in this answer. There are pretty good Python bindings available for selenium.
You could just use selenium to find the element and click it, and then pass the page on to Beautifulsoup, and use your existing code to fetch the links.
Alternatively, you could use the JavaScript that's listed in the onclick handler. I pulled this from the source: EntityQuery('Ns=pPopularityScore%7c1&No=30&props=15292&dims=530&As=&N=0+3+10500915');. The No parameter increments by 15 for each page, but the props parameter has me guessing. I'd recommend not going down this path, though, and instead interacting with the website as a client would, using selenium. That's much more robust to changes on their side as well.
I tried the following code and was able to load the next page. I hope this helps you too.
Code:
from selenium import webdriver
import os
chromedriver = r"C:\Users\pappuj\Downloads\chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
url='http://www.zoover.nl/cyprus'
driver.get(url)
driver.find_element_by_class_name('next').click()
Thanks
