How to parse DOM (REACT) - javascript

I am trying to scrape data from a website. The website uses Facebook's React. As such the source code that I can parse using Jaunt is completely different to the code I see when inspecting the elements using Chrome's inspector.
I know very little about all of this, but having done some research I think this is something to do with DOM rather than the source code. I need a way to be able to get my hands on this DOM code as the original source contains nothing I want, but I don't have the foggiest idea where to begin (even having read many answers on here).
Here is an example of one of the pages I want to scrape. For example, to scrape the description I'd want to grab what is inside this tag:
<span class="light-font extended-card-description list-group-item">Example description....</span>
But as you can see this element only appears when you "Inspect Element", and not when I just view the page's source.
My question to you geniuses on here is, how can I grab this DOM Code and start scraping the elements I actually want to?
Forgive me if my terminology is completely off but as I say this is a completely new area for me, and I've done the research that I can.
Thank you very much in advance!

ReactJS, like many other Javascript libraries / frameworks, uses client-side code (Javascript) to render the final HTML. This means that when you, Jaunt, or your browser fetch the HTML source code from the server, it doesn't yet contain the final code the user will see. The browser needs to run the Javascript program(s) contained in the page, in order to generate the final content you wish to scrape.
My favorite tool for this kind of job is CasperJS.
It (or rather the PhantomJS tool that CasperJS uses) is a headless browser, meaning it's a version of Webkit (like Chrome or Safari) that has been stripped of all the GUI (windows, buttons, menus.) What's left is a tool that you can run from a terminal or from your Java program. It won't show any window on the screen, but it will fetch the webpages you ask it to; run any Javascript they contain; and then respond to your commands, such as "click on this link", "give me that text", "capture a screenshot", and so on.
Let's start with a simple ReactJS example:
We want to scrape the "Hello John" text, but if you look at the plain HTML source (Ctrl+U or Alt+Ctrl+U) you won't see it. On the other hand, if you open the console in your browser and use the following selector, you will get the text:
> document.querySelector('#helloExample .playgroundPreview').textContent
"Hello John"
Here is a simple CasperJS script to do the same thing:
var casper = require("casper").create();
casper.start("http://facebook.github.io/react/index.html", function() {
    this.echo(this.fetchText("#helloExample .playgroundPreview"));
});
casper.run();
You can save it as hello.js and execute it with casperjs hello.js from a terminal, or use the equivalent Java code Runtime.getRuntime().exec(...)
Here is a better script, that avoids loading images and third-party resources (such as Facebook button, Twitter button, Google Analytics, and such) cutting the loading time by half. It also adds a waitForSelector step, so that we don't risk trying to fetch the text before ReactJS has had a chance to create it.
var casper = require("casper").create({
    pageSettings: {
        loadImages: false
    }
});
casper.on('resource.requested', function(requestData, request) {
    if (requestData.url.indexOf("http://facebook.github.io/") != 0) {
        request.abort();
    }
});
casper.start("http://facebook.github.io/react/index.html", function() {
    this.waitForSelector("#helloExample .playgroundPreview", function() {
        this.echo(this.fetchText("#helloExample .playgroundPreview"));
    });
});
casper.run();
How to install CasperJS
I have had some trouble scraping ReactJS and other modern Javascript pages with the older versions of PhantomJS and CasperJS, so I recommend installing PhantomJS 2.0 and the latest CasperJS from GitHub.
For PhantomJS you can just download the official 2.0 package.
For CasperJS, since it's a Python script, you should be able to check out the latest commit from GitHub and link bin/casperjs onto your PATH. Here's a script for Linux or Mac OS X:
> git clone git://github.com/n1k0/casperjs.git
> cd casperjs
> ln -sf `pwd`/bin/casperjs /usr/local/bin/casperjs
You may also want to comment out the line printing Warning PhantomJS v2.0 ... from your bin/bootstrap.js file.


Why don't web workers work?

So, I'm trying to use a web worker in my project to run a long-running process that is currently tying up the UI. I've been to I don't know how many sites trying to get a worker to work, but to no avail.
All of my javascript is kept in separate files and referenced in the HTML file. As a test to get my feet wet, I created a test.js file and put the following code in it:
self.addEventListener('message', function(e) {
    self.postMessage('return');
}, false);
Then, in the UI page's javascript file I placed this code in a function triggered by a button click event:
var w = new Worker('test.js');
w.addEventListener('message', function(e) {
    alert(e.data);
}, false);
w.postMessage('hi');
The code is derived from:
html5rocks.com/en/tutorials/workers/basics
Other websites I visited provided similar instructions on how to set up a worker.
For the life of me, I cannot get this to work. When I execute it does absolutely nothing and I seemingly get no errors. Stepping through the code, it appears to create the worker, but I don't see any evidence of the event listener being created and the 'postMessage' event doesn't do anything. I've tried IE11 and Chrome with the same results.
In my research, I came across a part of Chrome's developer tools that revealed the test.js file couldn't be found. Yet, the file is in the same folder as the page's js file. So, I tried adding in the relative directory information as I do in the page's HTML section. That didn't work either.
I then found claims that for security reasons you couldn't have one js file reference another js in the code. It's unclear whether this is a Chrome-only feature or part of some spec.
So, now I'm in a quandary. The worker requires a reference to a separate js file for the code to be executed, yet, the browser isn't allowed to reference another file? How is the worker supposed to work if you aren't allowed to do what it requires to work?
To now, I've successfully pissed away two days trying to get this one seemingly simple function to work. To say I'm mildly frustrated would be an understatement. Being a fairly novice programmer and not understanding every last little nuance about web programming I'm clearly missing a key part of this whole thing.
How the heck is one supposed to make web workers work?
It turns out that browsers (Chrome in particular) refuse to load a worker script when the page is opened from the local file system (a file:// URL), because allowing that would let a page read your personal files. So you need to develop and test your project through a web server. The easiest way to do this for me was to install:
docker-compose
and make sure it works. Then create a file named:
docker-compose.yml
inside root folder of my project with index.html file. Then put this inside the docker-compose.yml file:
version: '3'
services:
  nginx:
    image: nginx:alpine
    volumes:
      - .:/usr/share/nginx/html
    ports:
      - "80:80"
Then inside the root folder of my project run:
docker-compose up
And then in the browser go to:
http://localhost/
And it worked!
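Since the question mentions that nothing happened and no error was visible, here is a minimal sketch of creating the worker defensively so that a failed load shows up somewhere. The helper name createLoggedWorker and its parameters are hypothetical (they are not from the question); the Worker constructor and the 'error' event are standard. The constructor and protocol are passed in only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: refuse to start from file:// and log worker errors.
function createLoggedWorker(url, WorkerCtor, protocol, log) {
    if (protocol === 'file:') {
        // Assumption: the browser blocks worker scripts for pages opened from disk.
        log('Page opened from file://; serve it over http:// so the worker script can load.');
        return null;
    }
    var w = new WorkerCtor(url);
    // The 'error' event fires for uncaught errors in the worker; a failed
    // script load may arrive as an event without a message.
    w.addEventListener('error', function (e) {
        log('Worker error: ' + (e.message || 'script could not be loaded'));
    }, false);
    return w;
}

// In a browser this would be called roughly like:
// var w = createLoggedWorker('js/test.js', Worker, location.protocol, console.log.bind(console));
```

With this in place, opening the page from disk or pointing at a missing file should at least produce a console message instead of silence.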
I appear to have found a solution, though it escapes me why.
If I use:
var w = new Worker('js\test.js');
the worker doesn't work.
But, if I use:
var w = new Worker('js/test.js');
the worker does work.
I characteristically use the back slash throughout the project to delineate paths without issue. Why the forward slash must be used to set the worker's file location is a mystery. I have seen nothing in any documentation that even remotely addresses that tiny, yet seemingly critical detail.
Thank you, Mr. Starke, for your help!
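A likely explanation for the slash mystery above: inside a JavaScript string literal the backslash starts an escape sequence, so '\t' in 'js\test.js' is read as a tab character rather than a path separator. The sketch below only demonstrates the string behaviour; the variable names are illustrative. Backslashes elsewhere in the project presumably sit in places that do no escape processing (HTML attributes, for example), which would be why they caused no trouble there.

```javascript
var withBackslash = 'js\test.js';   // "\t" is parsed as a single tab character
var withSlash = 'js/test.js';

console.log(withBackslash.length);            // 9: the "\t" pair collapsed into one character
console.log(withSlash.length);                // 10
console.log(withBackslash.indexOf('\t'));     // 2: a tab sits where the separator should be
console.log(withBackslash === 'js\\test.js'); // false: only "\\" yields a literal backslash
```

URLs use forward slashes anyway, so 'js/test.js' is the form to pass to new Worker(...).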

selenium + chrome.fileSystem.chooseEntry = invalid calling page error

I am writing a Selenium script to test a Chrome app that uses the Chrome.fileSystem.chooseEntry API to select a directory. When I do this manually, it works fine. But when I do this in a Selenium script, I get back this error:
Unchecked runtime.lastError while running fileSystem.chooseEntry: Invalid calling page. This function can't be called from a background page.
Any ideas on how to make Selenium and chooseEntry play nicely together?
I updated to the latest Chromedriver, but still no luck. I also looked at ChromeOptions, but didn't see anything that looked like it would be helpful. The interwebs doesn't seem to have much to say about Selenium and chooseEntry. I'm on version 51 of Chrome.
I'm down to thinking I'll need a special javascript entry point to set the path values for testing instead of using chooseEntry. But I would strongly prefer to not have a separate code execution path for my tests. Anybody have a cleaner solution?
EDIT: per commenter's request, here's the offending code:
chrome.fileSystem.chooseEntry({type:'openDirectory'},function(entry) {
    chrome.fileSystem.getWritableEntry(entry,function(writeable_entry) {
        console.log("got writeable entry");
    });
}, function(e) { errorHandler(e); });
EDIT #2: I've gone with the special javascript entry point hack. In manual mode -- i.e., not running under Selenium -- I run code that executes chooseEntry, and then use the retainEntry API to get the entry id. I added an entry point in my javascript to take an entry id and call the restoreEntry API to turn it back into an entry. I also modified my code so if this entry object is set, then use that as the file instead of calling chooseEntry. Lastly, I modified my Selenium script to call the restoreEntry entry point before running the rest of the script.
This is not ideal, since now my test code execution path is somewhat different from my actual live-human-being-at-the-controls code execution path. But at least it lets me use Selenium scripts now. Of course, if anyone can think of a non-horrible way to solve this problem, I'd love to hear about it.
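For anyone wanting to try the workaround from EDIT #2, here is a rough sketch of how it might be wired up. The function and variable names (rememberEntry, useRetainedEntryForTests, pickDirectory, testEntry) are hypothetical and not taken from my actual code; chooseEntry, retainEntry and restoreEntry are the chrome.fileSystem calls mentioned above.

```javascript
// Entry restored for test runs; stays null when a human is driving the app.
var testEntry = null;

// Manual mode: after a real chooseEntry, keep an id that can be restored later.
function rememberEntry(entry) {
    return chrome.fileSystem.retainEntry(entry); // returns an id string
}

// Test entry point: Selenium calls this with a previously retained id.
function useRetainedEntryForTests(entryId, done) {
    chrome.fileSystem.restoreEntry(entryId, function (entry) {
        testEntry = entry;
        done(entry);
    });
}

// Used by the app wherever it previously called chooseEntry directly.
function pickDirectory(callback) {
    if (testEntry) {
        callback(testEntry); // skip the native dialog under Selenium
        return;
    }
    chrome.fileSystem.chooseEntry({ type: 'openDirectory' }, callback);
}
```

The Selenium script would call useRetainedEntryForTests once (for example through executeScript) before exercising the rest of the app.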
EDIT #3: Per @Xan's comment, corrected my terminology from "extension" to "Chrome App."
I can only offer this horrible hack. For Chrome Apps under OSX I created folder favorites and use Robot keyPress to navigate and select the 'favorite' folders needed for the App. The only possible redeeming factor is that it does mirror a valid/possible actual human interaction with the file interface.
private void selectOSXFolderFavorite(int favorite) {
    // With an OSX file folder dialog open, Shift-Tab to favorites list
    robot.keyPress(KeyEvent.VK_SHIFT);
    robot.keyPress(KeyEvent.VK_TAB);
    robot.keyRelease(KeyEvent.VK_TAB);
    robot.keyRelease(KeyEvent.VK_SHIFT);
    // move to the top of favorites list
    int i = 40;
    while (i-- > 0) {
        robot.keyPress(KeyEvent.VK_UP);
        robot.keyRelease(KeyEvent.VK_UP);
    }
    while (favorite-- > 0) {
        robot.keyPress(KeyEvent.VK_DOWN);
        robot.keyRelease(KeyEvent.VK_DOWN);
    }
    // Send an enter key to Select the selected folder
    robot.keyPress(KeyEvent.VK_ENTER);
    robot.keyRelease(KeyEvent.VK_ENTER);
}

how do you allow javascript communications with Flex/Flash/Actionscript

Well here's a problem.
I've got a website with large javascript backend. This backend talks to a server over a socket with a socket bridge using http://blog.deconcept.com/swfobject/
The socket "bridge" is a Flex/Flash .swf application/executable/plugin/thing for which the source is missing.
I've got to change it.
More facts:
file appExePluginThing.swf
appExePluginThing.swf Macromedia Flash data (compressed), version 9
I've used https://www.free-decompiler.com/flash/ to decompile the .swf file and I think I've sorted out what's the original code vs the libraries and things Flash/Flex built into it.
I've used FDT (the free version) to rebuild the decompiled code into MYappExePluginThing.swf so I can run it with the javascript code and see what happens.
I'm here because what happens isn't good. Basically, my javascript code (MYjavascript.js) gets to the point where it does
window.log("init()");
var so = new SWFObject("flash/MYappExePluginThing.swf", socketObjectId, "0", "0", "9", "#FFFFFF");
window.log("init() created MYappExecPluginThing!!!");
so.addParam("allowScriptAccess", "always");
log("init() added Param!!");
so.write(elId);
log("init() wrote!");
IE9's console (yeah, you read that right) shows
init()
created MYappExecPluginThing!!!
init() added Param!!
init() wrote!
but none of the debugging I've got in MYappExePluginThing.as displays and nothing else happens.
I'm trying to figure out what I've screwed up/what's going on? Is MYappExePluginThing.as running? Is it waiting on something? Did it fail? Why aren't the log messages in MYappExePluginThing.as showing up?
The first most obvious thing is I'm using FDT which, I suspect, was not used to build the original. Is there some kind of magic "build javascript accessible swf thing" in FlashBuilder or some other IDE?
First noteworthy thing I find is:
file MYappExePluginThing.swf
MYappExePluginThing.swf Macromedia Flash data (compressed), version 14
I'm using Flex 4.6 which, for all I know, may have a completely different mechanism for allowing javascript communication than was used in appExePluginThing.swf
Does anyone know if that's true?
For example, when FDT runs this thing (I can compile but FDT does not create a .swf unless I run it) I get a warning in the following method:
private function init() : void
{
    Log.log("console.log", "MYappExePluginThing init()");
    //var initCallback:String = Application.application.parameters.initCallback?Application.application.parameters.initCallback:"MYjavascript.MYappExePluginThing_init";
    var initCallback:String = FlexGlobals.topLevelApplication.parameters.initCallback?FlexGlobals.topLevelApplication.parameters.initCallback:"MYjavascript.MYappExePluginThing_init";
    try
    {
        ExternalInterface.addCallback("method1Callback",method1);
        ExternalInterface.addCallback("method2Callback",method2);
        ExternalInterface.call(initCallback);
    }
    catch(err:Error)
    {
        Log.log("console.log", "MYappExePluginThing init() ERROR err="+err);
    }
}
I got a warning that Application.application was deprecated and I should change:
var initCallback:String = Application.application.parameters.initCallback?Application.application.parameters.initCallback:"MYjavascript.MYappExePluginThing_init";
to:
var initCallback:String = FlexGlobals.topLevelApplication.parameters.initCallback?FlexGlobals.topLevelApplication.parameters.initCallback:"MYjavascript.MYappExePluginThing_init";
which I did but which had no effect on making the thing work.
(FYI Log.log() is something I added:
public class Log{
    public static function log(dest:String, mssg:String):void{
        if(ExternalInterface.available){
            try{
                ExternalInterface.call(dest, mssg);
            }
            catch(se:SecurityError){
            }
            catch(e:Error){
            }
        }
        trace(mssg);
    }
}
)
Additionally, in MYjavascript.js MYappExePluginThing_init looks like this:
this.MYappExePluginThing_init = function () {
log("MYjavascript.js - MYappExePluginThing_init:");
};
It's supposed to be executed when MYappExePluginThing finishes initializing itself.
Except it's not. The message is NOT displaying on the console.
Unfortunately, I cannot find any references explaining how you allow javascript communication in Flex 4.6 so I can check if I've got this structured correctly.
Is it a built in kind of thing all Flex/Flash apps can do? Is my swf getting accessed? Is it having some kind of error? Is it unable to communicate back to my javascript?
Does anyone have any links to references?
If this was YOUR problem, what would you do next?
(Not a full solution but I ran out of room in the comment section.)
To answer your basic question, there's nothing special you should need to do to allow AS3-to-JS communication beyond what you've shown. However, you may have sandbox security issues on localhost; to avoid problems, set your SWFs as local-trusted (right-click Flash Player > Global Settings > Advanced > Trusted Location Settings). I'm guessing this is not your problem, though, because you'd normally get a sandbox violation error.
More likely IMO is that something is broken due to decompilation and recompilation. SWFs aren't meant to do that, it's basically a hack made mostly possible due to SWF being an open format.
What I suggest is that you debug your running SWF. Using break-points and stepping through the code you should be able to narrow down where things are going wrong. You can also more easily see any errors your SWF is throwing.
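One cheap check on the JavaScript side, before digging into the SWF: as far as I know, the name passed to ExternalInterface.call is looked up from the page's global scope, so "MYjavascript.MYappExePluginThing_init" only works if MYjavascript is reachable as a global and the property is a function by the time the SWF initializes. The helpers below are hypothetical names of mine, a sketch for verifying that from the browser console rather than anything Flash provides.

```javascript
// Hypothetical helper: walk a dotted name starting from a root object.
function resolveGlobalPath(root, path) {
    return path.split('.').reduce(function (obj, key) {
        return obj == null ? undefined : obj[key];
    }, root);
}

// True only if the dotted name currently points at a function.
function isCallableFromFlash(root, path) {
    return typeof resolveGlobalPath(root, path) === 'function';
}

// In the browser console (assumes the page has finished loading):
// isCallableFromFlash(window, 'MYjavascript.MYappExePluginThing_init');
// isCallableFromFlash(window, 'console.log');
```

If the first check returns false, the callback was never going to be found regardless of how the SWF was built; in old IE the second can also be false when the developer tools are closed.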
Not really an answer, but an idea to get you started is to start logging everything on the Flash side to see where the breakage is.
Since you're using IE, I recommend getting the Debug flash player, installing it, then running Vizzy alongside to show your traces.
Should give you a good idea of where the app is breaking down.
Vizzy
Debug Player

How do I change the underlying Phantomjs object settings using Chutzpah?

We have some QUnit javascript tests running in Visual Studio using the Chutzpah test adapter. Everything was working fine until we changed our api (the one being tested by the js files) recently, and added some validations over the UserAgent http header. When I tried to update the tests to change/mock the user agent I realized it was not directly possible even by overriding the default browser property.
After a few days of digging, I finally found out exactly what is happening. Chutzpah creates a PhantomJS page object for the test files to run on. This is done in a base javascript file (chutzpahRunner.js) located at the Chutzpah adapter installation path. These are the last lines of the file, which effectively start the tests:
...
// Allows local files to make ajax calls to remote urls
page.settings.localToRemoteUrlAccessEnabled = true; //(default false)
// Stops all security (for example you can access content in other domain IFrames)
page.settings.webSecurityEnabled = false; //(default true)
page.open(testFile, pageOpenHandler);
...
PhantomJS supports changing the user agent header by specifying it in the page settings object. If I edit this chutzpahRunner.js file on my machine, and manually set the user agent there, like this:
page.settings.userAgent = "MyCustomUserAgent";
My tests start to work again. The problem is that this is not in the project itself, and thus cannot be shared with the rest of the team.
Is it possible to change the properties of the phantomjs objects created by Chutzpah to run the tests? I'd like to either change them from inside my own tests, or from another script file I could embed on the pipeline.
Without a code change in Chutzpah it is not possible to set those properties on the PhantomJS object. Please file an issue at https://github.com/mmanela/chutzpah asking for this functionality and then fork/patch Chutzpah to add it (or wait for a developer on the project to hopefully get to this).
Update:
I pushed a fix for this issue. Once this is released you can use the following in a Chutzpah.json file:
{
"userAgent": "myUserAgent"
}

Can I create a cross browser compatible silverlight page without code behind?

I have been developing a silverlight page using just xaml, javascript and html (I literally only have a .html, .js and .xaml file). The problem is, I just realized that it isn't working in any browser EXCEPT Internet Explorer (7 for sure).
I have too many lines of code to want to add vb.net or visual c code behind and use the html bridge. I just want the xaml mouse events to work directly as before. In other words, when the xaml's MouseLeftButtonDown says "highlightMe" I want that highlightMe function to be a javascript function. But I also want my page to work in any browser.
Now, I've played around with creating a brand new visual studio project with vb.net or visual c.net but the xaml file events seem to point to code behind events. Also, it compiles the silverlight into a .XAP file. The XAP file is actually a .ZIP file with a compiled dll and an appmanifest.xaml.
So, how do I configure my appManifest.xaml to handle a silverlight page that has only javascript and xaml (and an html file pointing to the .XAP as the source). The html part, I THINK I understand. AppManifest is a different story and I definitely need help with that one.
I think it has something to do with creating an app.xaml and page.xaml and using the x:Class value of the main tag.
Since I asked this question I found a page...
http://pagebrooks.com/archive/2009/02/19/custom-loading-screens-in-silverlight.aspx
...that 1) showed people recently using a similar model of .js, .xaml and .html for their silverlight page and 2) someone in the comments recommended using firebug to track down issues with silverlight javascript errors.
This proved to me it's ok to use this model of silverlight and that it should work in other browsers. This also made me go try firebug. Firebug is AWESOME. If you enable the console tab, you can see exactly where the javascript was hanging up. And now that it's working, I can see the result of my gets/posts to google app engine.
Firebug showed that I was using if then else statements in a way that only internet explorer allows. For example,
if (blah == 1) { blah2 = 3}
else { blah2 = 5};
works in every browser, but this doesn't:
if (blah == 1) { blah2 = 3} ;
else { blah2 = 5};
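Firefox, Chrome and Safari all apparently need there to NOT be a ; end statement character between the closing brace of the if block and the else. As far as I can tell, the semicolon ends the if statement, so the else that follows has nothing to attach to.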
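A quick way to confirm this outside the Silverlight page is to ask the JavaScript engine to parse both forms. This is only a sketch; the helper name parses is mine, and it relies on new Function throwing a SyntaxError for source it cannot parse.

```javascript
// Returns true if the engine can parse the given statement list.
function parses(source) {
    try {
        new Function(source); // compiles the body without running it
        return true;
    } catch (e) {
        if (e instanceof SyntaxError) {
            return false;
        }
        throw e;
    }
}

var good = 'var blah = 1, blah2; if (blah == 1) { blah2 = 3} else { blah2 = 5};';
var bad  = 'var blah = 1, blah2; if (blah == 1) { blah2 = 3} ; else { blah2 = 5};';

console.log(parses(good)); // true
console.log(parses(bad));  // false: the stray ; ends the if, leaving else orphaned
```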
So, for the moment, I appear to have fixed my problem with cross-browser compatibility, but I'd still like to know more about appmanifest.xaml and how to make a .xap file with only javascript. I might need it later.
