I want to use HtmlUnit (v2.21) to get some search result pages from Google. This requires me to click the "people also looked for" link when searching for a person (right side, see example link), which triggers some JavaScript and changes the content of the current page. But this gives me a JavaScript WrappedException (see below).
Clickable example link: https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj
Simple TestCase with errors:
String url = "https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj";
WebClient client = new WebClient(BrowserVersion.BEST_SUPPORTED);
HtmlPage page = client.getPage(url);
HtmlElement link = page.getFirstByXPath("//a[@class='_Zjg']");
HtmlPage newPage = link.click(); //throws exception
this.storeResultFile(newPage.asXml(), "test");
client.close();
Result:
net.sourceforge.htmlunit.corejs.javascript.WrappedException: Wrapped java.lang.NullPointerException
at net.sourceforge.htmlunit.corejs.javascript.Context.throwAsScriptRuntimeEx(Context.java:2053)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.doProcessPostponedActions(JavaScriptEngine.java:947)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine.processPostponedActions(JavaScriptEngine.java:1012)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:799)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:742)
at com.gargoylesoftware.htmlunit.html.DomElement.click(DomElement.java:689)
I stored the xml of the "page" object and made sure that the XPath expression is valid and has results.
Anybody got any ideas?
It looks like the JavaScript engine (based on Rhino) is very easy to upset and gives up on script issues that other browsers can still run past.
I don't know whether there is a mistake in Google's scripts, but these two lines solved it for me:
JavaScriptEngine engine = client.getJavaScriptEngine();
engine.holdPosponedActions();
Nevertheless, when running multiple HtmlUnit instances in multiple threads it is still possible to run across this error, so this is more a workaround than a solution.
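For context, here is a minimal sketch of how those two lines slot into the test case from the question (same URL and XPath as above; the class name is just for the sketch, and the storeResultFile helper is left out):

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlElement;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine;

public class PeopleAlsoLookedFor {
    public static void main(String[] args) throws Exception {
        String url = "https://www.google.de/search?ie=UTF-8&safe=off&q=nicki+minaj";
        try (WebClient client = new WebClient(BrowserVersion.BEST_SUPPORTED)) {
            // Hold back postponed JavaScript actions so the failing script
            // cannot blow up inside DomElement.click()
            JavaScriptEngine engine = client.getJavaScriptEngine();
            engine.holdPosponedActions();

            HtmlPage page = client.getPage(url);
            HtmlElement link = page.getFirstByXPath("//a[@class='_Zjg']");
            if (link != null) {
                HtmlPage newPage = link.click();
                System.out.println(newPage.asXml());
            }
        }
    }
}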
Related
I am trying to log in to a website that uses an Ajax form. I managed to retrieve the form and reach the XPath of the desired button, but when I call #click I get this error:
EcmaError: lineNumber=[193] column=[0] lineSource=[<no source>] name=[ReferenceError] sourceName=[script in https://test.paypo.com/Account/Login?ReturnUrl=%2FHome%2FStart from (177, 32) to (221, 10)]
message=[ReferenceError: "Paypo" is not defined.
(script in https://test.paypo.com/Account/Login?ReturnUrl=%2FHome%2FStart from (177, 32) to (221, 10)#193)]
com.gargoylesoftware.htmlunit.ScriptException: ReferenceError: "Paypo" is not defined. (script in https://test.paypo.com/Account/Login?ReturnUrl=%2FHome%2FStart from (177, 32) to (221, 10)#193)
I am honestly clueless about how to get around this... An important note is that I have no access to the source of the website; logging in on the actual website works perfectly fine.
I've tried every kind of BrowserVersion and different HtmlUnit versions...
Current code:
final HtmlPage thePage = ((HtmlPage) page);
final HtmlButtonInput button = (HtmlButtonInput) thePage.getByXPath("//input[@type='button']").get(0);
webClient.getOptions().setThrowExceptionOnScriptError(true);
final HtmlPage newPage = button.click();
The error arises when #click is called!
Any clue? Please!
OK, I have done a short check with this code:
final String url = "https://test.paypo.com/Account/Login?ReturnUrl=%2FHome%2FStart";
try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
    HtmlPage page = webClient.getPage(url);
}
Running this produces a bunch of errors; the first one is
com.gargoylesoftware.htmlunit.ScriptException: identifier is a reserved word: class (https://test.paypo.com/bundles/SharedJS?v=qrYYsvxJCv4nRnx8xzi1sMLBQPQlIPteJjoj8eCO1go1#7)
What does this mean?
The page includes some JS code from the URL https://test.paypo.com/bundles/SharedJS?v=qrYYsvxJCv4nRnx8xzi1sMLBQPQlIPteJjoj8eCO1go1 and there is a problem with this code. In detail, the code uses the JavaScript 'class' language feature, and HtmlUnit (in the end Rhino) does not support this syntax in the current version.
Because of this, the JavaScript from this external resource is not compilable and therefore not available to the other JavaScript on that page.
And finally this leads to the error you are facing.
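If you only need the page load to survive despite the broken bundle, a minimal sketch is below. Note the assumption: this merely stops HtmlUnit from throwing on script errors; the code that uses the 'class' syntax still will not run, so anything on the page that depends on it stays broken.

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class PaypoLoginCheck {
    public static void main(String[] args) throws Exception {
        final String url = "https://test.paypo.com/Account/Login?ReturnUrl=%2FHome%2FStart";
        try (final WebClient webClient = new WebClient(BrowserVersion.FIREFOX_60)) {
            // Log script problems instead of aborting the whole page load
            webClient.getOptions().setThrowExceptionOnScriptError(false);
            HtmlPage page = webClient.getPage(url);
            System.out.println(page.getTitleText());
        }
    }
}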
I am trying to fetch the links to all accommodations in Cyprus from this website:
http://www.zoover.nl/cyprus
So far I can retrieve the first 15, which are already shown. Now I have to invoke a click on the "volgende" (next) link. However, I don't know how to do that, and in the source code I am not able to track down the function that gets called, so that I could use, e.g., something like what is posted here:
Issues with invoking "on click event" on the html page using beautiful soup in Python
I only need the step where the "clicking" happens so I can fetch the next 15 links and so on.
Does anybody know how to help?
Thanks already!
EDIT:
My code looks like this now:
def getZooverLinks(country):
    zooverWeb = "http://www.zoover.nl/"
    url = zooverWeb + country
    parsedZooverWeb = parseURL(url)
    driver = webdriver.Firefox()
    driver.get(url)
    button = driver.find_element_by_class_name("next")
    links = []

    for page in xrange(1, 3):
        for item in parsedZooverWeb.find_all(attrs={'class': 'blue2'}):
            for link in item.find_all('a'):
                newLink = zooverWeb + link.get('href')
                links.append(newLink)
        button.click()
and I get the following error:
selenium.common.exceptions.StaleElementReferenceException: Message: Element is no longer attached to the DOM
Stacktrace:
at fxdriver.cache.getElementAt (resource://fxdriver/modules/web-element-cache.js:8956)
at Utils.getElementAt (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:8546)
at fxdriver.preconditions.visible (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:9585)
at DelayedCommand.prototype.checkPreconditions_ (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12257)
at DelayedCommand.prototype.executeInternal_/h (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12274)
at DelayedCommand.prototype.executeInternal_ (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12279)
at DelayedCommand.prototype.execute/< (file:///var/folders/n4/fhvhqlmx23s8ppxbrxrpws3c0000gn/T/tmpKFL43_/extensions/fxdriver@googlecode.com/components/command-processor.js:12221)
I'm confused :/
While it might be tempting to try to do this with BeautifulSoup alone, BeautifulSoup is a parser rather than an interactive web browsing client, so it cannot execute the JavaScript behind that link.
You should seriously consider solving this with Selenium, as briefly shown in this answer. There are pretty good Python bindings available for Selenium.
You could just use Selenium to find the element and click it, then pass the page source on to BeautifulSoup and use your existing code to fetch the links. A rough sketch of that combination is below.
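This sketch assumes Firefox, the class names from the question ('next' for the paging link, 'blue2' for the result items), the question's way of joining URLs, and a crude fixed wait after each click:

import time

from bs4 import BeautifulSoup
from selenium import webdriver

def get_zoover_links(country, pages=3):
    base = "http://www.zoover.nl/"
    driver = webdriver.Firefox()
    driver.get(base + country)
    links = []
    try:
        for _ in range(pages):
            # Parse whatever the browser is currently showing
            soup = BeautifulSoup(driver.page_source, "html.parser")
            for item in soup.find_all(attrs={"class": "blue2"}):
                for link in item.find_all("a"):
                    links.append(base + link.get("href"))
            # Re-locate the "next" button on every iteration; the old reference
            # goes stale as soon as the page content is replaced
            driver.find_element_by_class_name("next").click()
            time.sleep(2)  # crude wait for the next batch to load
    finally:
        driver.quit()
    return links

print(get_zoover_links("cyprus"))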
Alternatively, you could use the JavaScript that's listed in the onclick handler. I pulled this from the source: EntityQuery('Ns=pPopularityScore%7c1&No=30&props=15292&dims=530&As=&N=0+3+10500915');. The No parameter increments by 15 for each page, but the props has me guessing. I'd recommend not getting into this, though, and just interacting with the website as a client would, using Selenium. That's much more robust against changes on their side as well.
I tried the following code and was able to load the next page. Hope this will help you too.
Code:
from selenium import webdriver
import os
chromedriver = r"C:\Users\pappuj\Downloads\chromedriver"
os.environ["webdriver.chrome.driver"] = chromedriver
driver = webdriver.Chrome(chromedriver)
url='http://www.zoover.nl/cyprus'
driver.get(url)
driver.find_element_by_class_name('next').click()
Thanks
I'm trying to migrate away from Feedly, as it is unacceptable (at least to me) that search is (fully) enabled only in the Pro version.
Anyhow, to export my lengthy list of "saved for later" articles I found some lovely scripts:
Simple script that exports a user's "Saved For Later" list out of Feedly as a JSON string, and feedly-to-pocket, where I am instructed to:
You must switch off SSL (http rather than https) or jQuery won't load!
So I thought I did, by adding (Ubuntu 14.04 / Chrome 40 x64)
--ssl-version-min=tls1
to my /usr/share/applications/google-chrome.desktop file (all lines starting with Exec=). However, when I try to run it in the browser console I get
This request has been blocked; the content must be served over HTTPS.
So, any suggestions? (also, excuse me for noobness)
Go to your Feedly "saved" list and scroll down until all articles have loaded.
Open console and paste the following Javascript into it:
function loadJQuery() {
    script = document.createElement('script');
    script.setAttribute('src', '//code.jquery.com/jquery-2.1.3.js');
    script.setAttribute('type', 'text/javascript');
    script.onload = loadSaveAs;
    document.getElementsByTagName('head')[0].appendChild(script);
}

function loadSaveAs() {
    saveAsScript = document.createElement('script');
    saveAsScript.setAttribute('src', 'https://cdn.rawgit.com/eligrey/FileSaver.js/5733e40e5af936eb3f48554cf6a8a7075d71d18a/FileSaver.js');
    saveAsScript.setAttribute('type', 'text/javascript');
    saveAsScript.onload = saveToFile;
    document.getElementsByTagName('head')[0].appendChild(saveAsScript);
}

function saveToFile() {
    // Loop through the DOM, grabbing the information from each bookmark
    map = jQuery(".entry.quicklisted").map(function(i, el) {
        var $el = jQuery(el);
        var regex = /Published:(.*)(.*)/i;
        return {
            title: $el.attr("data-title"),
            url: $el.attr("data-alternate-link"),
            summary: $el.find(".summary")[0].innerHTML,
            time: regex.exec($el.find("span.ago").attr("title"))[1]
        };
    }).get(); // Convert jQuery object into an array

    // Convert to a nicely indented JSON string
    json = JSON.stringify(map, undefined, 2);
    var blob = new Blob([json], {type: "text/plain;charset=utf-8"});
    saveAs(blob, "FeedlySavedForLater" + Date.now().toString() + ".txt");
}

loadJQuery()
Source: Feedly-Export-Save4Later
Not JavaScript, but here is how I saved an HTML page with all the links and excerpts...
Open the saved pages in Feedly in Chrome
scroll down so they are all there
inspect any element (the top article is a good choice) so it opens the generated HTML
find the div id="section0_column0" node
right-click & copy it
paste into Notepad++
this HTML is untidy, so carry on...
Do a Regex find & replace
find: (?s)<div id=.+?_main.+?>.+?(<a href=")(.+?)(").+?sans-serif">(.+?)</span>.+?</div>.+?</div>.+?</div>
replace: <div>$1$2$3>$2</a></div> <div> $4<br /> <br /></div>
save the html page.
open it in Chrome
I posted the question in the jQuery forum and the solution was rather simple (remove http: from the attribute string).
line 34 should be
script.setAttribute('src', '//code.jquery.com/jquery-latest.min.js');
So to close the loop - for a fully searchable/archived list of links, not only by title/URL but by context as well(!), you can:
Follow the instructions in https://github.com/ShockwaveNN/feedly-to-pocket (with the correction suggested by kind stranger jakecigar; you also have to register a Pocket app (obtain a consumer key) for the Ruby script to work)
Export the HTML list from your Pocket account
Import the Pocket list into a Kifi library
and at last be Feedly-free, with my personal search engine
I know I'm a bit late to the party, but I've been hunting around for a few days to find a reasonably simple solution, and none have been listed clearly or concisely on Stack Overflow or elsewhere on the web. I have in fact found a much easier way to do this.
Use the JavaScript from this Gist just as it instructs: https://gist.github.com/ShockwaveNN/a0baf2ca26d1711f10e2 (note this is referenced above and found through the link @gep shared in step one)
Once the JS has completed running it will download a text file. (It does still run successfully, even on large numbers; I just exported almost 2500 articles.)
Create a blank test.json in SublimeText.
Copy all entries from your exported text file into this JSON file
Weirdly, it does seem you need to copy and paste; I tried just renaming the text file, and when I did that I received errors on the next step
Make sure you are signed into pocket
Go here: https://getpocket.com/import/springpad
Select your newly created test.json
Upload
Note: On large uploads the import page fails to refresh (this did not seem to be an issue as all my articles did make it into my account)
This allows you to upload JSON directly into your Pocket account, with no more messing around with other supposed fixes. I hope this makes it a lot easier for everyone in the future.
I'm trying to do this by using a Tampermonkey Script. However I'm open to new approaches...
What I want to do is extract some data (data-video) from a specific <div>. However, this data is not available in the HTML source of the page, but it is available under Dev Tools -> Resources and then under Frames.
Does anyone know whether it's possible to get that information available under DevTools? And how can I do that?
A comparison between the two pages can be found here: "Original HTML PAGE" and "HTML PAGE under DevTools"
In the first hyperlink the id=video-canvas cannot be seen; however, it is in the <object type="application/x-shockwave-flash(...)
As you state in your question the data you're looking for is available in DevTools under the "Resources" tab in the "Frames" folder. What you are looking at there is the Source HTML, similar to View Source.
The code you want is what is getting replaced. It appears the site is using the JW Player plugin, which replaces the <div id="video-canvas"> with the appropriate HTML for the detected device/browser to play the video. All of my browsers on my Mac are forced to use Flash, even when it's disabled. When I use my iPhone, which can't play Flash, and inspect the page, it uses JW's own custom video element. It appears that the player must be storing the file location in memory, since it is not in the generated markup.
I am able to poke around through the console in the dev tools and access their JS class. It appears I can call jwplayer._tracker, which has an object b. Object b has an object AlWv3iHmEeOzwBIxOUCPzg. This object seems to be consistent each time I check between different browsers; you can use the for loop in my first example to get the correct value, trimming it down to .b. Inside that object is e, and in e is an object keyed by http://i.n.jwpltx.com/v1..., a really long string that appears to contain a URL, so it will need to be parsed.
So to get that string I ran
for (var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e) {
    console.log(loc);
}
So if we put that in a function to parse the string and return a value:
function getSubURL() {
    var initURL;
    for (var loc in jwplayer._tracker.b.AlWv3iHmEeOzwBIxOUCPzg.e) {
        initURL = loc;
    }
    // look for 'mp4:' (URL-encoded as 'mp4%3A'); this sits in front of the file path
    var start = initURL.indexOf("mp4%3A");
    // look for '.mp4' to find the end of the file name
    var stop = initURL.indexOf(".mp4");
    // grab the string in between:
    // start+6 to skip the characters used to find it,
    // stop+4 to include the characters used to find it
    var subPath = (initURL.substring((start + 6), (stop + 4))).split("%2F").join("/");
    return subPath;
}
//and run it
getSubURL();
It will return ciencia/astronomia/fimsol.mp4.
You can run this from your console. I am not sure how you would use it in Tampermonkey, but I think it gets you a lot closer to what you wanted.
This is the approach I've used to solve my problem... I couldn't grab the code I wanted under Dev Tools, but I found a way to get the data from jwplayer with the getPlaylistItem function. This is how I get the URL filename of each video:
function getFilename(filename) {
    if (jwplayer().getPlaylistItem) {
        filename = jwplayer().getPlaylistItem()['file'];
    }
    else {
        return filename;
    }
    filename = filename.substring(filename.indexOf("/mp4:") + 5);
    return filename;
}
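For what it's worth, here is a rough sketch of how such a helper could sit inside a Tampermonkey userscript. The @match pattern is a placeholder, the fixed delay is a crude guess for when JW Player has finished initialising, and with @grant none the script runs in the page context so jwplayer() is reachable directly:

// ==UserScript==
// @name         JW Player file grabber (sketch)
// @match        http://example.com/*
// @grant        none
// ==/UserScript==

(function () {
    'use strict';
    window.addEventListener('load', function () {
        // Crude wait: give JW Player some time to set up its playlist
        setTimeout(function () {
            if (typeof jwplayer === 'function' && jwplayer().getPlaylistItem) {
                var file = jwplayer().getPlaylistItem()['file'];
                console.log('video file:', file);
            }
        }, 2000);
    });
})();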
I'm building a web app that makes heavy use of AJAX requests to an XML service. In fact, my app is a front-end with almost no server-side code whatsoever and uses AJAX to communicate with the back-end.
Everything was going fine (I developed and tested on Ubuntu 9.04 with Firefox 3.0 as the browser). One day I decided to see how my app did in IE8... horror! Nothing was working as it marvelously did in Firefox. To be more specific, the Request.HTML calls were not working. As I said, my app relied heavily on them, so nothing worked.
I spent a day trying to get something running but I had no luck. The only conclusion I arrived at was that the XML was being parsed incorrectly (I hope I'm mistaken). Let's get to the code:
var req = new Request.HTML({
    url: 'service/Catalog.groovy',
    onSuccess: function(responseTree, responseElements) {
        var catz = responseElements.filter('category');
        catz.each(function(cat){
            // cat = $(cat);
            var cat_id = cat.get('id');
            var subcategory = cat.getElement('subcategory');
            alert(cat_id);
            alert(cat.get('html'));
            alert(subcategory.get('html'));
        });
    },
    onFailure: function(){...}
});
Take that piece of code, for example. In Firefox, it worked perfectly: it alerted an ID (for example, 7), then it showed the contents of the category element, for example:
<subcategory id='1'>
    <category_id>7</category_id>
    <code>ACTIO</code>
    <name>Action</name>
</subcategory>
and then it showed the contents of some inner element, in this case:
<category_id>7</category_id>
<code>ACTIO</code>
<name>Action</name>
In IE8, the first alert worked OK (it alerted 7), but the next alert (alert(cat.get('html'));) gave an empty string, and the last one threw an exception... it said something about subcategory being null.
What I concluded from all this is that the elements were parsed correctly in Firefox, but in IE8 only the tags and the attributes came through OK; everything else was completely wrong (in fact, missing). I mean, the inner content of all the elements of the response was gone!
Another fact you could use: the code alert(cat.get('tag')); resulted in
Firefox: category
IE8: /category <----------- (?)
Hmm, what else... oh yeah: the line you see commented out above (cat = $(cat);) was something I tried in order to fix this. I read in the MooTools docs that IE needs the $ function to be called explicitly on elements to get all the Element magic... but this didn't fix anything.
I was so desperate that I even fiddled around with the mootools.js code.
OK, so... what I want from you, dear MooTools pros, is help solving this problem, because I REALLY need the site to work in IE8; in fact, I chose MooTools precisely to forget about compatibility problems...
PS: if something is not clear, please ask! I'd appreciate any help :D
I had a similar issue some time ago using jQuery. The problem was that, in IE, the incoming response data needs to be handled by the Microsoft.XMLDOM ActiveX object.
The general steps are to:
Instantiate the ActiveX object.
var oXmlDoc = new ActiveXObject("Microsoft.XMLDOM");
Pass it the incoming response data and load it.
oXmlDoc.loadXML(sXmlResponseData);
Parse it as needed.
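Putting those steps together, here is a minimal sketch of a cross-browser parse helper. The function name is hypothetical, and the DOMParser branch for non-IE browsers is an addition of mine, not part of the original answer:

// Hypothetical helper: parse an XML response string in IE and in other browsers.
function parseXmlResponse(sXmlResponseData) {
    if (window.ActiveXObject) {
        // IE: use the Microsoft.XMLDOM ActiveX object
        var oXmlDoc = new ActiveXObject("Microsoft.XMLDOM");
        oXmlDoc.async = false;
        oXmlDoc.loadXML(sXmlResponseData);
        return oXmlDoc;
    }
    // Other browsers: use the standard DOMParser
    return new DOMParser().parseFromString(sXmlResponseData, "text/xml");
}

// Usage: walk the parsed document instead of relying on responseElements
var doc = parseXmlResponse("<subcategory id='1'><category_id>7</category_id></subcategory>");
alert(doc.getElementsByTagName("category_id")[0].firstChild.nodeValue); // alerts "7"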
You can check out the full resolution here.