Retrieve HTML code of Skyscanner result using PhantomJS - javascript

Since Skyscanner only provides their API to big commercial websites, I wanted to build a small application of my own to retrieve the results for multiple destinations, for my own (non-commercial) purposes.
I found that getting the result of a flight search is pretty difficult, as the page takes a few seconds to complete the flight search and display the results.
Using wget, lynx, links2 or edbrowse didn't work for me: I kept getting the response that JavaScript is not enabled in my browser, even when links2 was compiled with JavaScript support. Maybe I did something wrong, I don't know.
PhantomJS has provided the best results so far, and I tried multiple code fragments to retrieve the flight search results.
Sources from:
[Stackoverflow#1][1]
[Stackoverflow#2][2]
[Techslides][3]
[Stackoverflow#3][4]
[Stackoverflow#4][5]
[1]: http://stackoverflow.com/questions/18526140/how-to-get-html-generated-from-javascript-using-phantomjs
[2]: http://stackoverflow.com/questions/28209509/get-javascript-rendered-html-source-using-phantomjs
[3]: http://techslides.com/grabbing-html-source-code-with-phantomjs-or-casperjs
[4]: http://stackoverflow.com/questions/12450868/how-to-print-html-source-to-console-with-phantomjs
[5]: http://stackoverflow.com/questions/8692038/phantomjs-page-dump-script-issue
Even with the time lag described in [Stackoverflow#4][5] it did not work.
The scripts returned (in the case of a successful load) only a Skyscanner error page saying that they encountered a problem.
The last attempt, which also resulted in the described error page, was:
var fs = require('fs');
var page = new WebPage();
var url = 'http://www.skyscanner.at/transport/fluge/nyca/lax/150626/150627/flugpreise-von-new-york-nach-los-angeles-international-im-juni-2015.html?adults=1&children=0&infants=0&cabinclass=economy&rtn=1&preferdirects=false&outboundaltsenabled=false&inboundaltsenabled=false';
var address = encodeURI(url);

page.open(address, function (status) {
    if (status !== 'success') {
        console.log('FAIL to load the address');
    } else {
        var markup = page.content;
        console.log(markup);
        try {
            var f = fs.open('htmlcode.txt', 'w');
            f.write(markup);
            f.close();
        } catch (e) {
            console.log(e);
        }
    }
    phantom.exit();
});
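For completeness, the usual PhantomJS workaround for late-rendered content is a polling helper rather than a fixed delay. A minimal sketch of that pattern follows; the result selector is a guess on my part, not Skyscanner's actual markup:

// Poll until testFn returns true (or the timeout passes), then call onReady(success).
function waitFor(testFn, onReady, timeoutMs) {
    var start = Date.now();
    var interval = setInterval(function () {
        if (testFn()) {
            clearInterval(interval);
            onReady(true);
        } else if (Date.now() - start > (timeoutMs || 30000)) {
            clearInterval(interval);
            onReady(false); // gave up waiting
        }
    }, 250);
}

// Inside page.open's callback, before reading page.content:
waitFor(function () {
    return page.evaluate(function () {
        // '.day-list-item' is a hypothetical selector for the result list
        return document.querySelectorAll('.day-list-item').length > 0;
    });
}, function (ready) {
    console.log(ready ? page.content : 'results never appeared');
    phantom.exit();
}, 30000);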
Did someone try something like that before and succeed? How did you get it working? I am trying to build a PHP-based and/or shell-script-based solution on a GUI-less Debian Linux system.

I work in engineering at Skyscanner. This isn't an answer to your question but, if you end up on that error page (or a captcha page), it is likely that our bot-blocker is catching you. Which is kind of "by design" :)
I can get you an API key, with a conservative rate limit. Would that be of interest?
Cheers,
Iain

Related

"Illegal group end indicator...(not a group)" when decoding gtfs-r data

I'm trying to use a node.js app to regularly decode some gtfs-realtime data. It's mostly running fine, but every few hours I run into an error that crashes my app. The error message in my log says that there is an "Illegal group end indicator for Message .transit_realtime.FeedMessage 7 (not a group)"
I found this question/answer on StackOverflow but it doesn't seem to solve my particular problem. Here is an outline of the code I am using to decode the gtfs-r feed:
// process the response
var processBuffers = function (response) {
    var data = [];
    response.on('data', function (chunk) {
        data.push(chunk);
    });
    response.on('end', function () {
        data = Buffer.concat(data);
        var decodedFeedMessage = transit.FeedMessage.decode(data);
        allData = decodedFeedMessage.entity;
        // continues processing with allData...
    });
};
Thanks!
A Node.js process will crash every time any kind of fatal error is triggered and not caught. And since you receive the data from a third party, it is very hard to guarantee the data is always correct and thereby prevent the error.
The simple solution is to use another system to deploy your Node.js application. I recommend two tools that are very popular today: PM2 and Passenger (PM2 is very simple to use). These tools will automatically restart your Node.js application once it crashes:
http://pm2.keymetrics.io/
https://www.phusionpassenger.com/library/walkthroughs/deploy/nodejs/ownserver/nginx/oss/install_passenger_main.html
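Supervision aside, a small guard around the decode call keeps a single malformed payload from taking the whole process down. A minimal sketch of that idea; the transit decoder object and the feed URL handling are assumptions carried over from the question:

var https = require('https');

function fetchAndDecode(feedUrl, onEntities) {
    https.get(feedUrl, function (response) {
        var chunks = [];
        response.on('data', function (chunk) {
            chunks.push(chunk);
        });
        response.on('end', function () {
            try {
                // decode() throws on truncated or corrupt protobuf payloads
                var feed = transit.FeedMessage.decode(Buffer.concat(chunks));
                onEntities(feed.entity);
            } catch (err) {
                // skip this poll instead of crashing the app
                console.error('Skipping bad GTFS-R payload:', err.message);
            }
        });
    }).on('error', function (err) {
        console.error('Request failed:', err.message);
    });
}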

What is the best technology for a chat/shoutbox system? Plain JS, JQuery, Websockets, or other?

I have an old site running which also has a chat system, which always used to work fine. But recently I picked the project up again, started improving it, and the user base has been increasing a lot. (It runs on a VPS.)
Now this shoutbox of mine (running at http://businessgame.be/shoutbox) has been having issues lately: when there are over 30 people online at the same time, it starts to really slow down the entire site.
This shoutbox system was written years ago by the old me (which, ironically, was the young me), who was way too much into old-school Plain Old JavaScript (POJS?) and refused to use frameworks like jQuery.
What I do is poll every 3 seconds with AJAX to check for new messages, and if there are any, load all those messages (which are handed over as an XML file that is then parsed by the JS code into HTML blocks and added to the shoutbox content).
Simplified, the script is like this:
The AJAX functions
function createRequestObject() {
    var xmlhttp;
    if (window.XMLHttpRequest) { // code for IE7+, Firefox, Chrome, Opera, Safari
        xmlhttp = new XMLHttpRequest();
    } else { // code for IE6, IE5
        xmlhttp = new ActiveXObject("Microsoft.XMLHTTP");
    }
    // Create the object
    return xmlhttp;
}

function getXMLObject(XMLUrl, onComplete, onFail) {
    var XMLhttp = createRequestObject();

    // Check to see if the latest shout time has been initialized
    if (typeof getXMLObject.counter == "undefined") {
        getXMLObject.counter = 0;
    }
    getXMLObject.counter++;

    XMLhttp.onreadystatechange = function () {
        if (XMLhttp.readyState == 4) {
            if (XMLhttp.status == 200) {
                if (onComplete) {
                    onComplete(XMLhttp.responseXML);
                }
            } else {
                if (onFail) {
                    onFail();
                }
            }
        }
    };

    XMLhttp.open("GET", XMLUrl, true);
    XMLhttp.send();

    // Abort and report failure if the request hasn't finished within 5 seconds
    setTimeout(function () {
        if (typeof XMLhttp != "undefined" && XMLhttp.readyState != 4) {
            XMLhttp.abort();
            if (onFail) {
                onFail();
            }
        }
    }, 5000);
}
}
Chat functions
function initShoutBox() {
    // Check for new shouts every 3 seconds
    shoutBoxInterval = setInterval(shoutBoxUpdate, 3000);
}

function shoutBoxUpdate() {
    // Get the XML document
    getXMLObject("/ajax/shoutbox/shoutbox.xml?time=" + shoutBoxAppend.lastShoutTime, shoutBoxAppend);
}

function shoutBoxAppend(xmlData) {
    // process all the XML and add it to the content,
    // also remember the timestamp of the newest shout
}
The real script is far more convoluted, with slower polling when the page is not in focus, keeping track of AJAX calls to avoid simultaneous double calls, the ability to post a shout, loading settings, etc. All not very relevant here.
For those interested, full codes here:
http://businessgame.be/javascripts/xml.js
http://businessgame.be/javascripts/shout.js
Example of the XML file containing the shout data
http://businessgame.be/ajax/shoutbox/shoutbox.xml?time=0
I do the same for getting a list of the online users every 30 seconds and checking for new private messages every 2 minutes.
My main question is: since this old-school JS is slowing down my site, will porting the code to jQuery increase performance and fix this issue? Or should I go for another technology altogether, like Node.js, websockets or something else? Or maybe I am overlooking a fundamental bug in this old code?
Rewriting the entire chat and private message system (which share the same backend) requires a lot of effort, so I'd like to get this right from the start, not rewrite the whole thing in jQuery just to find out that it doesn't solve the issue at hand.
Having 30 people online in the chatbox at the same time is not really an exception anymore, so it should be a robust system.
Could changing from XML data files to JSON perhaps increase performance as well?
PS: The backend is PHP and MySQL.
I'm biased, as I love Ruby and I prefer using plain JS over jQuery and other frameworks.
I believe you're wasting a lot of resources by using AJAX and should move to websockets for your use-case.
30 users is not much... When using websockets, I would assume a single server process should be able to serve thousands of simultaneous updates per second.
The main reason for this is that websockets are persistent (no authentication happening with every request) and broadcasting to a multitude of connections will use the same amount of database queries as a single AJAX update.
In your case, instead of everyone reading the whole XML every time, a POST event will only broadcast the latest (posted) shout (not the whole XML), and store it in the XML for persistent storage (used for new visitors).
Also, you don't need all the authentication and requests that end up being answered with a "No, there aren't any pending updates".
Minimizing the database requests (XML reads) should prove to be a huge benefit when moving from AJAX to websockets.
Another benefit relates to the fact that enough simultaneous users will make AJAX polling behave the same as a DoS attack.
Right now, 30 users == 10 requests per second. This isn't much, but it can get heavy if each request takes more than 100ms - meaning the server would be answering fewer requests than it receives.
The home page for the Plezi Ruby Websocket Framework has this short example of a shout box (I'm Plezi's author, so I'm biased):
# finish with `exit` if running within `irb`
require 'plezi'

class ChatServer
  def index
    render :client
  end

  def on_open
    return close unless params[:id] # authentication demo
    broadcast :print, "#{params[:id]} joined the chat."
    print "Welcome, #{params[:id]}!"
  end

  def on_close
    broadcast :print, "#{params[:id]} left the chat."
  end

  def on_message data
    self.class.broadcast :print, "#{params[:id]}: #{data}"
  end

  protected

  def print data
    write ::ERB::Util.html_escape(data)
  end
end

path_to_client = File.expand_path( File.dirname(__FILE__) )
host templates: path_to_client
route '/', ChatServer
The POJS client looks like this (the DOM updates and form-data access ($('#text')[0].value) use jQuery):
ws = NaN
handle = ''

function onsubmit(e) {
    e.preventDefault();
    if ($('#text')[0].value == '') { return false }
    if (ws && ws.readyState == 1) {
        ws.send($('#text')[0].value);
        $('#text')[0].value = '';
    } else {
        handle = $('#text')[0].value
        var url = (window.location.protocol.match(/https/) ? 'wss' : 'ws') +
            '://' + window.document.location.host +
            '/' + $('#text')[0].value
        ws = new WebSocket(url)
        ws.onopen = function (e) {
            output("<b>Connected :-)</b>");
            $('#text')[0].value = '';
            $('#text')[0].placeholder = 'your message';
        }
        ws.onclose = function (e) {
            output("<b>Disconnected :-/</b>")
            $('#text')[0].value = '';
            $('#text')[0].placeholder = 'nickname';
            $('#text')[0].value = handle
        }
        ws.onmessage = function (e) {
            output(e.data);
        }
    }
    return false;
}

function output(data) {
    $('#output').append("<li>" + data + "</li>")
    $('#output').animate({ scrollTop: $('#output')[0].scrollHeight }, "slow");
}
If you want to add more events or data, you can consider using Plezi's auto-dispatch feature, that also provides you with an easy to use lightweight Javascript client with an AJAJ (AJAX + JSON) fallback.
...
But, depending on your server's architecture and whether you mind using heavier frameworks or not, you can use the more common socket.io (although it starts with AJAX and only moves to websockets after a warmup period).
EDIT
Changing from XML to JSON will still require parsing. The question is really one of XML vs. JSON parsing speed.
JSON will be faster in client-side JavaScript, according to the following SO question and answer: Is parsing JSON faster than parsing XML
JSON also seems to be favored on the server side for PHP (though this might be opinion-based rather than tested): PHP: is JSON or XML parser faster?
BUT... I really think your bottleneck isn't the JSON or the XML. I believe the bottleneck is the multitude of times that the data is accessed, (parsed?) and reviewed by the server when using AJAX.
EDIT2 (due to comment about PHP vs. node.js)
You can add a PHP websocket layer using Ratchet... although PHP wasn't designed for long-running processes, so I would consider adding a dedicated websocket stack (using a local proxy to route websocket connections to a different application).
I love Ruby since it allows you to quickly and easily code a solution. Node.js is also commonly used as a dedicated websocket stack.
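For a sense of how small a dedicated broadcast stack can be, here is a minimal Node.js sketch using the npm `ws` package (the package choice and port are my assumptions, not part of the original setup):

var WebSocket = require('ws');
var wss = new WebSocket.Server({ port: 8080 });

wss.on('connection', function (socket) {
    socket.on('message', function (message) {
        // One incoming shout fans out to every open connection -
        // no per-client polling, no repeated database reads.
        wss.clients.forEach(function (client) {
            if (client.readyState === WebSocket.OPEN) {
                client.send(message.toString());
            }
        });
    });
});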
I would personally avoid socket.io, because it abstracts the connection method (AJAX vs. websockets) and always starts as AJAX before "warming up" to an "upgrade" (websockets)... Also, socket.io uses long-polling when not using websockets, which I think is terrible. I'd rather show a message telling the client to upgrade their browser.
Jonny Whatshisface pointed out that using a node.js solution he reached a limit of ~50K concurrent users (which could be related to the local proxy's connection limit). Using a C solution, he states he has no issues with more than 200K concurrent users.
This obviously also depends on the number of updates per second and on whether you're broadcasting the data or sending it to specific clients... If you're sending 2 updates per user per second for 200K users, that's 400K updates per second. However, updating all the users only once every 2 seconds is 100K updates per second. So trying to figure out the maximum load can be a headache.
Personally, I didn't get to reach these numbers on my apps, so I never discovered Plezi's limits first-hand... although, during testing, I had no issues sending hundreds of thousands of updates per second (but I did have a connection limit due to available ports and open file handle limits on my local machine).
This definitely shows how vast an improvement you can achieve by utilizing websockets (especially since you stated you noticed slowdowns with 30 concurrent users).

Authorization header using OAuth1.0A for Twitter API v1.1

I am building a search engine as a start-up project in Web Development. The search engine takes a string and queries the Wiki and Twitter API. I have been working on it for days and, I believe, I have found a nice, clean way to implement it.
BoyCook's TwitterJSClient
In his code (which is beautifully written) he has set up a Twitter function which takes in a config variable and sets up the authorization for us. Is there something it is missing? I have been through it and it all looks great. I am new to JavaScript and might be missing something.
My code:
var error = function (err, response, body) {
    console.log('ERROR [%s]', err);
};
var success = function (data) {
    console.log('Data [%s]', data);
};

var config = {
    "consumerKey": "{Mine}",
    "consumerSecret": "{Mine}",
    "accessToken": "{Mine}",
    "accessTokenSecret": "{Mine}",
    "callBackUrl": ""
};

var Twitter = require('twitter-node-client').Twitter;
var twitter = new Twitter(config);

twitter.getSearch({ 'q': 'Lebron James', 'count': 10 }, error, success);
Can anyone help? Have you done this before and know an easier way? Has anyone been able to get it working using postMessage()?
And yes... the origin is using HTTPS protocol. See: JBEngine. My app permission (on Twitter) is set to read only.
[EDIT] I should probably also mention that I bundled it all together with Browserify and am running the script client-side.

PhantomJS using too many threads

I wrote a PhantomJS app to crawl over a site I built and check for a JavaScript file to be included. The JavaScript is similar to Google's, where some inline code loads in another JS file. The app looks for that other JS file, which is why I used Phantom.
What's the expected result?
The console output should read through a ton of URLs and then tell if the script is loaded or not.
What's really happening?
The console output will read as expected for about 50 requests and then just start spitting out this error:
2013-02-21T10:01:23 [FATAL] QEventDispatcherUNIXPrivate(): Can not continue without a thread pipe
QEventDispatcherUNIXPrivate(): Unable to create thread pipe: Too many open files
This is the block of code that opens a page and searches for the script include:
page.open(url, function (status) {
    console.log(YELLOW, url, status, CLEAR);
    var found = page.evaluate(function () {
        return document.querySelectorAll("script[src='***']").length > 0;
    });
    if (found) {
        console.log(GREEN, 'JavaScript found on', url, CLEAR);
    } else {
        console.log(RED, 'JavaScript not found on', url, CLEAR);
    }
    self.crawledURLs[url] = true;
    self.crawlURLs(self.getAllLinks(page), depth - 1);
});
The crawledURLs object is just a map of URLs that I've already crawled. The crawlURLs function goes through the links from the getAllLinks function and calls open on all links that have the base domain of the domain that the crawler started on.
Edit
I modified the last block of the code as follows, but still have the same issue. I have added page.close() to the file.
if (!found) {
    console.log(RED, 'JavaScript not found on', url, CLEAR);
}
self.crawledURLs[url] = true;
var links = self.getAllLinks(page);
page.close();
self.crawlURLs(links, depth - 1);
From the documentation:
Due to some technical limitations, the web page object might not be completely garbage collected. This is often encountered when the same object is used over and over again.
The solution is to explicitly call close() of the web page object (i.e. page in many cases) at the right time.
Some included examples, such as follow.js, demonstrate multiple page objects with explicit close.
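A minimal sketch of that per-URL lifecycle (the `webpage` module and its create/open/close calls are PhantomJS's standard API; the `***` selector is kept from the redacted original):

var webpage = require('webpage');

function checkUrl(url, done) {
    var page = webpage.create();
    page.open(url, function (status) {
        var found = page.evaluate(function () {
            return document.querySelectorAll("script[src='***']").length > 0;
        });
        page.close(); // release the page object before moving to the next URL
        done(url, status, found);
    });
}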
Open Files Limit.
Even with closing files properly, you might still run into this error.
After scouring the internets I discovered that you need to increase the limit on the number of files a single process is allowed to have open. In my case, I was generating PDFs with hundreds to thousands of pages.
There are different ways to adjust this setting based on the system you are running but here is what worked for me on an Ubuntu server:
Add the following to the end of /etc/security/limits.conf:
# Sets the open file maximum here.
# Generating large PDFs hits the default ceiling (1024) quickly.
* hard nofile 65535
* soft nofile 65535
root hard nofile 65535 # Need these two lines because the wildcards (above)
root soft nofile 65535 # are not applied to the root user as well.
A good reference for the ulimit command can be found here.
I hope that puts some people on the right track.
I had this error come up while running multiple threads in my Ruby program.
I was running PhantomJS with Capybara-Poltergeist, and each thread was visiting a page, opening the same CSV file, and writing to it.
I was able to fix it by using the Mutex class.
lock = Mutex.new
lock.synchronize do
  CSV.open("reservations.csv", "w") do |file|
    file << ["Status","Name","Res-Code","LS-Num","Check-in","Check-out","Talk-URL"]
    $status.length.times do |i|
      file << [$status[i], $guest_name[i], $reservation_code[i], $listing_number[i], $check_in[i], $check_out[i], $talk_url[i]]
    end
  end
  # `user` and `p` come from the surrounding (omitted) loop
  puts "#{user.email} PAGE NUMBER ##{p+1} WRITTEN TO CSV"
end

Issuing a synchronous HTTP GET request or invoking shell script in JavaScript from iOS UIAutomation

I am trying to use Apple's UIAutomation to write unit tests for an iOS application that has a server-side component. In order to set up the test server in various states (as well as simulate two clients communicating through my server), I would like to issue HTTP GET requests from within my JavaScript-based test.
Can anyone provide an example of how to either issue HTTP GET requests directly from within UIAutomation JavaScript tests, or how to invoke a shell script from within my UIAutomation JavaScript tests?
FWIW, most of the core objects made available by all browsers are missing within the UIAutomation runtime. Try to use XMLHttpRequest, for example, and you will get an exception reporting that it cannot find the variable.
Thanks!
Folks,
I was able to work around this by sending HTTP requests to the iOS client to process and return the results in a UIAlertView. Note that all iOS code modifications are wrapped in #if DEBUG conditional compilation directives.
First, setup your client to send out notifications in the event of a device shake. Read this post for more information.
Next, in your iOS main app delegate add this code:
[[NSNotificationCenter defaultCenter] addObserver:self
                                         selector:@selector(deviceShakenShowDebug:)
                                             name:@"DeviceShaken"
                                           object:nil];
Then add a method that looks something like this:
- (void)deviceShakenShowDebug:(id)sender
{
    if (!self.textFieldEnterDebugArgs)
    {
        self.textFieldEnterDebugArgs = [[[UITextField alloc] initWithFrame:CGRectMake(0, 0, 260.0, 25.0)] autorelease];
        self.textFieldEnterDebugArgs.accessibilityLabel = @"AlertDebugArgsField";
        self.textFieldEnterDebugArgs.isAccessibilityElement = YES;
        [self.textFieldEnterDebugArgs setBackgroundColor:[UIColor whiteColor]];
        [self.tabBarController.selectedViewController.view addSubview:self.textFieldEnterDebugArgs];
        [self.tabBarController.selectedViewController.view bringSubviewToFront:self.textFieldEnterDebugArgs];
    }
    else
    {
        if ([self.textFieldEnterDebugArgs.text length] > 0)
        {
            if ([self.textFieldEnterDebugArgs.text hasPrefix:@"http://"])
            {
                [self doDebugHttpRequest:self.textFieldEnterDebugArgs.text];
            }
        }
    }
}

- (void)requestDidFinishLoad:(TTURLRequest *)request
{
    NSString *response = [[[NSString alloc] initWithData:((TTURLDataResponse *)request.response).data
                                                encoding:NSUTF8StringEncoding] autorelease];
    UIAlertView *resultAlert =
        [[[UIAlertView alloc] initWithTitle:NSLocalizedString(@"Request Loaded", @"")
                                    message:response
                                   delegate:nil
                          cancelButtonTitle:NSLocalizedString(@"OK", @"")
                          otherButtonTitles:nil] autorelease];
    resultAlert.accessibilityLabel = @"AlertDebugResult";
    [resultAlert show];
}
This code will add a UITextField to the very top view controller after a shake, slapped right above any navigation bar or other UI element. UIAutomation, or you the user, can manually enter a URL into this UITextField. When you shake the device again, if the text begins with "http" it will issue an HTTP request in code (exercise for the reader to implement doDebugHttpRequest).
Then, in my UIAutomation JavaScript file, I have defined the following two functions:
function httpGet(url, delayInSec) {
    if (!delayInSec) delayInSec = 1;
    var alertDebugResultSeen = false;
    var httpResponseValue = null;

    UIATarget.onAlert = function onAlert(alert) {
        httpResponseValue = alert.staticTexts().toArray()[1].name();
        alert.buttons()[0].tap();
        alertDebugResultSeen = true;
    }

    var target = UIATarget.localTarget();
    var application = target.frontMostApp();
    target.shake(); // bring up the input field
    application.mainWindow().textFields()["AlertDebugArgsField"].setValue(url);
    target.shake(); // send back to be processed
    target.delay(delayInSec);
    assertTrue(alertDebugResultSeen);
    return httpResponseValue;
}

function httpGetJSON(url, delayInSec) {
    var response = httpGet(url, delayInSec);
    return eval('(' + response + ')');
}
Now, in my JavaScript file, I can call
httpGet('http://localhost:3000/do_something')
and it will execute an HTTP request. If I want JSON data back from the server, I call
var jsonResponse = httpGetJSON('http://localhost:3000/do_something')
If I know it is going to be a long-running call, I call
var jsonResponse = httpGetJSON('http://localhost:3000/do_something', 10 /* timeout */)
I've been using this approach successfully now for several weeks.
Try performTaskWithPathArgumentsTimeout
UIATarget.host().performTaskWithPathArgumentsTimeout("/usr/bin/curl", "http://google.com", 30);
Just a small correction. The answer that suggests using UIATarget.host().performTaskWithPathArgumentsTimeout is an easy way to make a request on a URL in iOS 5.0+, but the syntax of the example is incorrect. Here is the correct way to make this call:
UIATarget.host().performTaskWithPathArgumentsTimeout("/usr/bin/curl", ["http://google.com"], 30);
The "[" around the "args" param is important, and the test will die with an exception similar to the following if you forget the brackets:
Error: -[__NSCFString count]: unrecognized selector sent to instance
Here is a fully working example that hits google.com and logs all the output:
var result = UIATarget.host().performTaskWithPathArgumentsTimeout("/usr/bin/curl", ["http://www.google.com"], 30);
UIALogger.logDebug("exitCode: " + result.exitCode);
UIALogger.logDebug("stdout: " + result.stdout);
UIALogger.logDebug("stderr: " + result.stderr);
+1 for creative use of "shake()". However, that's not an option for some projects, especially those that actually use the shake feature.
Think outside the box. Do the fetching with something else (Python, Ruby, Node.js, bash+wget, etc.). Then you can use the pre-canned response and auto-generate the ui-test.js on the fly by including that dynamically generated JSON payload as the "sample data" in the test. Then you simply execute the test.
In my opinion, the test is the test; leave that alone. The test data you are using, if it's that dynamic, ought to be separated from the test itself. By fetching / generating the JSON this way and referencing it from the test, you can update that JSON however often you like: either immediately before every test, or on a set interval, like when you know the server has been updated. I'm not sure you would want to generate it while the test is running; that seems like it would create problems. Taking it to the next level, you could get fancy and use functions that calculate what values ought to be based on other values, and expose them as "dynamic properties" of the data rather than having that math inside the test. But at that point I think the discussion is more academic than practical.
Apple has recently updated UIAutomation to include a new UIAHost element for performing a task on the Host that is running the instance of Instruments that is executing the tests.
