I am trying to understand the difference between the output of a simple page load with QtWebKit and the output of a wget command, apart from the fact that QtWebKit exposes a large API we can use from Python to do a lot of things with a webpage.
What does wget actually do, and how does it download a webpage with all of its components (images, etc.)? Is there a difference in the output size between the two approaches?
And one last question: what JavaScript gets executed during a page load with QtWebKit (besides an onload event handler)?
By default, wget does not retrieve any page requisites unless you tell it to via the -p/--page-requisites or the -r/--recursive flags. It processes no JavaScript commands, nor does it attempt to do anything with the markup or CSS unless you specifically tell it to. Even then, I'm pretty sure it just uses simple string matching to determine resource names and link URLs. All in all, it's pretty stupid until you configure it correctly (the basis for just about every powerful *NIX tool).
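For instance, a typical invocation that also grabs a page's images, CSS and scripts would look something like this (a sketch; check your wget version's man page for the exact flags it supports):

# Fetch the page plus the requisites it references, rewriting links so the
# local copy renders properly.
wget --page-requisites --convert-links http://example.com/some-page.html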
Since the WebKit library is so extensive, it would be useful to know what you're trying to do with it, i.e. what code you're executing. But since you already know that what you're doing performs JavaScript calls, it's reasonable to assume that it's doing a lot more than just retrieving the page.
Perhaps if you gave some examples of what you're trying to do I would be able to more thoroughly answer your question.
Good afternoon!
We're looking to get a JavaScript variable from a webpage, one that we can usually retrieve by typing app in the Chrome DevTools console.
However, we need to do this headlessly, as it has to be performed on numerous apps.
Our ideas:
Using a Puppeteer instance to go to the page, evaluate the command and return the variable. This works, but it's very resource-consuming.
Using a GET/POST request to the page and trying to inject the JS command, but we didn't succeed.
We're now wondering whether there is an easier solution, such as a special API that could extract the variable.
The goal would be to automate this process with no human interaction.
Thanks for your help!
Your question is not so much about a JS API (since the webpage is not yours to edit, you can only request it) as it is about webcrawling / browser automation.
You have to add details to get a definitive answer, but I see two scenarios:
The website actively checks for evidence of human browsing (for example, it sits behind Cloudflare and has requested this option), or the scripts depend heavily on a browser execution environment being available. In this case, the simplest option is to automate a browser, because a headless option has to get many things right to fool the server or the scripts. I would use Karate, which is easier than, say, Selenium and can execute in-browser scripts. It is written in Java, but you can run it externally and just read its reports. (A minimal sketch of the browser-automation route follows these two scenarios.)
The website does not check for such evidence and the scripts do not really require a browser execution environment. Then you can simply download everything it requires locally and attempt to jury-rig the JS into executing in any JS environment. According to your post, this fails, but it is impossible to help unless you can describe how it fails. This option can be headless.
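Since the question already mentions Puppeteer, here is a minimal sketch of the browser-automation route (the URL is a placeholder, and it assumes the variable is a global named app, as described in the question):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  await page.goto('https://app-page.example.com', { waitUntil: 'networkidle0' });

  // Evaluate in the page context, just like typing `app` in DevTools.
  // Only JSON-serializable data makes it back out of evaluate().
  const appValue = await page.evaluate(() => window.app);
  console.log(JSON.stringify(appValue));

  await browser.close();
})();

If resource consumption is the main concern, launching one browser and reusing it with a fresh page per app is usually much cheaper than launching a new browser for every app.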
You can embed Chrome into your application and instrument it. It will be headless.
We've used this approach in the past to copy content from PowerPoint Online.
We were using .NET to do this and therefore used CEFSharp.
There are a lot of mixed results when I search around for emulating a browser. Long story short, I need my Node server to do GET and POST requests. Usually I'd just do this with the http package. However, there are some anti-scripting measures in place on the other side, namely JavaScript that lets the server know it's a real browser. So I need that JavaScript to be executed.
I actually solved this problem about 5 years ago, but my site was only using PHP then. The solution involved a Qt WebKit widget and a fake X server. Not elegant, but it was pretty easy to do. The only JavaScript engines I found available in Perl, PHP, or Python at the time were crazy slow.
As NodeJS is built on V8, I gotta think there's an easy way to do this. For the record, I'm hoping to get something a la the following.
// Omitting some error handling
var http = require('http');

http.get('http://remote.site', function (res) {
  var body = '';
  res.on('data', function (chunk) { body += chunk; });
  res.on('end', function () {
    // body is the page returned by the request. Anything found in a
    // <script> tag would (ideally) have been executed by this point.
  });
});
As NodeJS is built on V8, I gotta think there's an easy way to do this.
Actually, no! There's a lot more to running in the context of a browser than simply being able to execute JavaScript. All of the DOM stuff and whatnot is not present in Node.js. Node.js has the JavaScript engine only.
Without the browser engine, you won't know what scripts to load, in what order, or be able to provide everything that comes with the document or window, which is likely a required part of what you're trying to do.
The solution involved a Qt WebKit widget and a fake X server. Not elegant, but it was pretty easy to do.
This is actually the right solution... mostly. Fortunately these days there are existing tools which have optimized this reasonably well.
Take a look at PhantomJS. http://phantomjs.org/ You can write scripts for it much in the same way you do Node.js. (It supports require() and what not, and most of the NPM packages you'd want work.) PhantomJS will allow you to run the page and pull the DOM contents out easily.
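As a rough sketch (reusing the question's placeholder URL), a PhantomJS script that loads a page, lets its scripts run, and dumps the resulting DOM could look like this; run it with phantomjs script.js:

var page = require('webpage').create();

page.open('http://remote.site', function (status) {
  if (status !== 'success') {
    console.log('Failed to load page');
    phantom.exit(1);
    return;
  }
  // By this point the page's <script> tags have executed, so the serialized
  // DOM reflects whatever that JavaScript did.
  var html = page.evaluate(function () {
    return document.documentElement.outerHTML;
  });
  console.log(html);
  phantom.exit();
});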
In the event PhantomJS' built in JavaScript environment doesn't contain some Node.js component you need (for filesystem or network access for example), you can always control PhantomJS from your Node.js application. https://github.com/amir20/phantomjs-node
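With that bridge, driving PhantomJS from your Node.js code looks roughly like this (a sketch against phantomjs-node's promise-based API; check the README of the version you install):

const phantom = require('phantom');

(async () => {
  const instance = await phantom.create();
  const page = await instance.createPage();

  const status = await page.open('http://remote.site');
  console.log('Page load status:', status);

  // `content` is the DOM serialized after the page's scripts have run.
  const content = await page.property('content');
  console.log(content);

  await instance.exit();
})();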
I have a webpage that calls Oracle, then does some processing, and then runs a lot of JavaScript.
The problem is that all of this makes it slow for the user. I have to use Internet Explorer 6, so the JavaScript takes a very long time to run, around 15 seconds.
How can I make my server do all of this every minute, for example, and save the page, so that if a user requests it the server can give them the page that has already been calculated?
I'm using a Tomcat server; my webpage is mainly JavaScript and HTML.
Edit:
By the way, I cannot rewrite my webpage; it would have to remain as it is.
I'm looking for something that would give the user a snapshot of the webpage that the server loaded.
YSlow recommendations would tell you that you should put all your CSS in the head of your page and all JavaScript at the bottom, just before the closing body tag. This will allow the page to fully load the DOM and render it.
You should also minify and compress your JavaScript to reduce download size.
To do that, you'd need to have your server build up the DOM, run the JavaScript in an environment that looks (enough) like a web browser, and then serialize the result as HTML.
There have been various attempts to do that; Jaxer is one of them (it was originally a product from Aptana, now an Apache project). Another related answer here on SO pointed to the jsdom project, which is a DOM implementation in JavaScript.
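As a rough illustration of the jsdom route (a sketch assuming a reasonably recent jsdom release; the URL and the fixed delay are placeholders), the server could pre-render the page and cache the serialized result:

const { JSDOM } = require('jsdom');

JSDOM.fromURL('http://your-tomcat-server/slow-page', {
  runScripts: 'dangerously', // let the page's own scripts execute
  resources: 'usable',       // fetch external scripts and stylesheets
}).then((dom) => {
  // Crude placeholder: give the page's scripts a few seconds to finish,
  // then serialize the DOM they produced.
  setTimeout(() => {
    const html = dom.serialize();
    // Cache `html` (file, memcache, ...) and serve that to users instead of
    // making IE6 run the JavaScript itself.
    console.log(html.length + ' bytes of pre-rendered HTML');
  }, 5000);
});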
Re:
By the way I can not rewrite my webpage, it would have to remain as it is
That's very unlikely to be successful. There is bound to be some modification involved. At the very least, you're going to have to tell your server-side framework what parts it should process and what parts should be left to the client (e.g., user-interaction code).
Edit:
You might also look for "website thumbnail" services like shrinktheweb.com and similar. Their "pro" account allows full-size thumbnails (what I don't know is whether it's an image or HTML). But I'm not specifically suggesting them, just a line you might pursue. If you can find a project that does thumbnails, you may be able to adapt it to do what you want.
But again, take a look at Jaxer, you may find that it does what you need or very similar (and it's open-source, so you can modify it or extract the bits you want).
"How can i make my server do all of this every minute for example"
If you are asking how you can make your database server 'pre-run' a query, then look into materialized views.
If the Oracle query is responsible for, say, 10 seconds of the delay, there may be other things you can do to speed it up, but we'd need a lot more information on what the query does.
As mentioned in a previous question, I'm coding a crawler for the QuakeLive website.
I've been using WWW::Mechanize to get the web content, and this worked fine for all the pages except the one with matches. The problem is that I need to get all IDs of this kind:
<div id="ffa_c14065c8-d433-11df-a920-001a6433f796_50498929" class="areaMapC">
These are used to build the URLs of specific matches, but I simply can't get them.
I managed to see those IDs only via Firebug; no page downloader, parser, or getter I tried was able to help here. All I can get is a simpler version of the page, whose code is the one you can see via "View Page Source" in Firefox.
Since Firebug shows the IDs, I can safely assume they are already loaded, but then I can't understand why nothing else gets them. It might have something to do with JavaScript.
You can find a page example HERE
To get at the DOM containing those IDs you'll probably have to execute the JavaScript code on that site. I'm not aware of any libraries that would allow you to do that and then introspect the resulting DOM from within Perl, so controlling an actual browser and later asking it for the DOM, or only parts of it, seems like a good way to go about this.
Various browsers provide ways to be controlled programmatically. With a Mozilla-based browser, such as Firefox, this could be as easy as loading mozrepl into the browser, opening a socket from Perl space, sending a few lines of JavaScript code over to actually load that page, and then some more JavaScript code to give you back the parts of the DOM you're interested in. The result of that you could then parse with one of the many JSON modules on CPAN.
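The JavaScript you send over for that last step could be as small as collecting the IDs you care about and handing them back as JSON, for example (a sketch, assuming the IDs all sit on areaMapC divs like the one shown above):

// Run inside the browser (e.g. via mozrepl) once the page has finished loading.
JSON.stringify(
  Array.prototype.map.call(
    document.querySelectorAll('div.areaMapC'),
    function (el) { return el.id; }
  )
);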
Alternatively, you could work through the javascript code executed on your page and figure out what it actually does, to then mimic that in your crawler.
The problem is that Mechanize mimics the networking layer of the browser but not the rendering or JavaScript execution layer.
Many folks use the web browser control provided by Microsoft. This is a full instance of IE in a control that you can host in a WinForm, WPF or plain old Console app. It allows you to, among other things, load the web page and run javascript as well as send and receive javascript commands.
Here's a reasonable intro into how to host a browser control: http://www.switchonthecode.com/tutorials/csharp-snippet-tutorial-the-web-browser-control
A ton of data is sent over ajax requests. You need to account for that in your crawler somehow.
It looks like they are using AJAX; I can see where the requests are being sent using Firebug. You may need to pick up on this by parsing and executing the JavaScript that affects the DOM.
You should be able to use WWW::HtmlUnit - it loads and executes javascript.
Read the FAQ. WWW::Mechanize doesn't do javascript. They're probably using javascript to change the page. You'll need a different approach.
Google hosts some popular JavaScript libraries at:
http://code.google.com/apis/ajaxlibs/
According to google:
The most powerful way to load the libraries is by using google.load() ...
What are the real advantages of using
google.load("jquery", "1.2.6")
vs.
<script type="text/javascript" src="http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js"></script>
?
Aside from the benefit of Google being able to bundle multiple files together on the request, there is no perk to using google.load. In fact, if you know all libraries that you want to use (say just jQuery 1.2.6), you're possibly making the user's browser perform one unneeded HTTP connection. Since the whole point of using Google's hosting is to reduce bandwidth consumption and response time, the best decision - if you're just using 1 library - is to call that library directly.
Also, if your site will be using any SSL certificates, you want to plan for this by calling the script via Google's HTTPS connection. There's no downside to calling an https script from an http page, but calling an http script from an https page will cause more obscure debugging problems than you would want to think about.
It allows you to dynamically load the libraries in your code, wherever you want.
Because it lets you switch directly to a new version of the library in the javascript, without forcing you to rebuild/change templates all across your site.
It lets Google change the URL (but they can't since the URL method is already established)
In theory, if you do several google.load()s, Google can bundle them into one file, but I don't think that is implemented.
I find it's very useful for testing different libraries and different methods, particularly if you're not used to them and want to see their differences side by side, without having to download them. It appears that one of the primary reasons to do it would be that it is asynchronous, versus the synchronous script call. You also get some neat stuff that is directly included in the Google loader, like client location: you can get the user's latitude and longitude from it. Not necessarily useful, but it may be helpful if you're planning to have targeted advertising or something of the like.
Not to mention that dynamic loading is always useful. Particularly to smooth out the initial site load. Keeping the initial "site load time" down to as little as possible is something every web designer is fighting an uphill battle on.
You might want to load a library only under special conditions.
Additionally, the google.load method would speed up the initial page display. Otherwise, if you include script tags in your HTML code, page rendering will freeze until the file has been loaded.
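The asynchronous pattern looks roughly like this (a sketch of the Google Loader usage documented at the time; it assumes the page already includes Google's jsapi loader script, and #content is a placeholder selector):

// <script src="http://www.google.com/jsapi"></script> must already be on the page.
google.load("jquery", "1.2.6");
google.setOnLoadCallback(function () {
  // jQuery is only guaranteed to be available from here on.
  $(function () {
    $("#content").show();
  });
});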
Personally, I'm interested in whether there's a caching benefit for browsers that will already have loaded that library as well. Seems like if someone browses to google and loads the right jQuery lib and then browses to my site and loads the right jQuery lib... ...both might well use the same cached jQuery. That's just a speculative possibility, though.
Edit: Yep, at the very least when using the direct script tags to that location, the JavaScript library will be cached if someone has already requested the library from Google (e.g. if it was included by another site somewhere).
If you were to write a boatload of JavaScript that only used the library when a particular event happens, you could wait until the event happens to download the library, which avoids unnecessary HTTP requests for those who don't actually end up triggering the event. However, in the case of libraries like Prototype + Scriptaculous, which downloads over 300kb of JavaScript code, this isn't practical.
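When the library is small enough for this to pay off, the lazy-load idea can be as simple as injecting the script tag when the triggering event fires (a sketch; the element IDs here are hypothetical):

// Inject a <script> tag on demand and run a callback once it has loaded.
function loadScript(src, onload) {
  var s = document.createElement('script');
  s.src = src;
  s.onload = onload;
  document.getElementsByTagName('head')[0].appendChild(s);
}

// Hypothetical trigger: only fetch jQuery when the user opens the editor.
document.getElementById('open-editor').onclick = function () {
  loadScript('http://ajax.googleapis.com/ajax/libs/jquery/1.2.6/jquery.min.js', function () {
    $('#editor').show(); // jQuery is available only after the script has loaded
  });
};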