Solution for server side web scraping/navigation (with JavaScript support)

Solution for server side web scraping/navigation (with JavaScript support) - javascript

I need to do server side web scraping/navigation, including sites with JavaScript, and I need a solution that would work on a hosting plan - I dont have my own server.
I came across python/pyside/pyqt4 - this would work perfectly/allow me to navigate sites like a headless browser. However I don't know if this would be possible to install on a remote server/host...

If you need a headless browser, you should check out PhantomJS, and in particular PyPhantomJS, the Python implementation. These might work in a shared hosting context - it really depends on the host. See the build instructions for different platforms - you'd likely need to ask your hosting provider to install.
If you can get this running, you might be interested in checking out pjscrape (disclaimer: this is my project). It's a command-line tool using PhantomJS to allow scraping using JavaScript and jQuery in a full browser context.

Related

Questions regarding AppJS / Tidesdk

So not sure if this would be the correct place to ask these but I know I could perhaps get some answers.
I am getting into Meteor and now would like to make some desktop apps. I was going to go the route of just making a native Mac app. But then I found the app wunderlist and its open source making use of the tidesdk.
Anyways I was hoping to get some feedback just in general about these frameworks (pros/cons etc). I don't really have a conceptual understanding of what they do. (or what the main difference between the two is).
I notice you can do routing in them. How is this working exactly? Because there is no URLs or client/server side.
Another thing I was wondering is if it would be possible to use MeteorJS on the desktop in a similar way?
Thanks.

Working with TideSDK is quite easy. We are working to make the experience great for developers. You are essentially just creating an HTML5 app in a special Resources folder. In most cases you can drop an HTML5 app directly into the Resources folder, point to the index.html using TideSDK's configuration and have it running in minutes. TideSDK can be used to run clients, servers, processes, and workers. I tend to work with frameworks such as backbone.js where routing is baked into a single page app.
At the core of TideSDK is WebKit, the core technology that powers the Safari and Chrome web browsers. We use three different ports of WebKit in TideSDK, one to reach each platform (Windows, Mac, Linux). On OSX, we can also use the native WebKit. The APIs of TideSDK provide native UI capabilities (that we are enhancing over time). These include native windows, system trays, menus, and dialogs. You can also interact with the clipboard. We have networking and database capabilities, system notifications, and more. We patch Webkit to allow the interpretation of python, php or ruby in the DOM in script tags and are able to bridge objects between languages. Our API's really allow your to reach the resources of your system including interacting with its filesystem.
It would be fun to run meteor in TideSDK. It is currently possible to run node.js within TideSDK using an appropriate startup process so I cannot see an issue running meteor so that it can run client and server within an app.
If you need your apps to reach Apple's AppStore, TideSDK is the only framework that I am aware of that has this potential. Competitive frameworks use ports of WebKit that are not native to the Mac such as the Chrome port (appjs) or the QT port (Sencha Ion). Apple's scan of an app based on these ports will reveal the use of "private APIs". Therefore, you would could not enter the AppStore marketplace with an app based on these. TideSDK is different and can use the native WebKit implementation on OSX. More about this capability will be revealed in the upcoming TideSDK-1.4.0 release. Our upgraded WebKit will also bring the HTML5 capabilities right up to date with the trunk of WebKit. Many of our users are waiting for this important update.
With WebKit eliminated as a barrier to the AppStore, the last issue facing a developer is Apple's sandboxing and entitlement to the resources of the system. We are looking at possible solutions to aid developers with sandboxing requirements. Some apps will be suitable for sandboxing and others will not. That said, if your goal is AppStore compliance, you will need to work with restrictions Apple has in place. I hope this helps.

How to change registry settings on a system through web browser

I am trying to develop a web page that will allow user to edit registry settings in windows system. Can i achieve it with client side scripting language.? If yes please suggest me language to do.
Can we do it with jQuery or any other type of library.

Due to obvious security concerns, this is only possible in Internet Explorer(!). This is not a jQuery library, but an activeX control; so it's quite unpleasant to use.
You have been warned, so here is the documentation :
http://technet.microsoft.com/en-us/library/ee156602.aspx

Fortunately is impossible to access the registry from a web app: the only way you have is through an ActiveX control but I would not go down this road.
have a look at the below
Access registry from a web aplication

Far from ideal but ...
If you serve up a ".hta" file (HTml Application) from your web server, Windows will run it as a program outside of IE and give it the privileges of the PC user. It will be in a separate window and there won't be any browser features (Back/Refresh/Address bar etc).
Even then, modern versions of Windows will prompt the user with security warnings if a HTA is launched from anywhere other than a local drive.

I know this thread is old, but I am not sure I like any answers for this problem. Instead of trying to access the Registry directly through Javascript, try writing a Java Applet and talk to the java applet using Javascript. Then in the JavaApplet you can write some JNI code to write a native dll to do what you need. It isn't a direct solution to your problem, but it will allow you to do what you need across multiple browsers. The downside is that you can't use it on browsers that do not support running a Java Applet, such as a mobile platform.
This method will also require you to sign your Java Applet. This is how you get around the security issues. The user will have to accept the applet the first time to give the security access.

how to start up a desktop application in client side

In my web page, I have to start a desktop application on the client's computer if it's installed. Any idea how I can do this?
If the application is MS Office or Adobe Reader, I know how to start them, but the application I want to start is a custom application. You can not find it on the internet.
How can I open the application?

Basically it's not possible to achieve unless an application registers a protocol that will trigger it. If it does that all you need to do is to provide a link using this protocol
yourcustomapp://some.parameters
Another way the 3rd party app can integrate with the browser is if it hooks to it as a plugin. This is how flash apps work etc.
If the app you are trying to launch does not support something like that it's going to be close to impossible to achieve what you want.

The browser sandbox prohibits you from executing local resources, for good reason - to thwart a website destroying your box with malicious code. I've been researching the same functionality.
The only solution I've found is to build an extension in Mozilla Firefox which can launch your app. Extensions live outside the sandbox so they can execute local resources. See this page for how to do that. You may be able to do it cross-browser using crossrider, though I haven't had success with that yet.
You could alternatively build a thick client populated from a web service, and launched from the browser through an extension as mentioned above. This is what I'm doing to get around the sandbox. I'm using local XUL for this.
See my question for additional discussion.

First off - you can't do it using javascript in any sort of a portable mechanism.
If the application is ms office or adobe reader,I know how to startup them
No you don't - you know how to send a document, which the browser associates with these applications and invokes them supplying the name of the local copy of the response. You can't just start the programs.
You just need to do the same for your app - invent a new mime type (the major type would be 'application' and by convention, non-standard minor types are prefixed with 'x-', so you might use application/x-hguser) then associate that mimetype with the relevant program browser side.
i.e: You need to explicitly configure each browser

I already encouter that problem in some complex production environnements.
I do the trick using the following code :
function launch(p_app_path)
{
var oShell = new ActiveXObject("WScript.Shell");
oShell.Run('"' + p_app_path + '"', 1);
}
In IE options > Security > Customize the level > ActiveX controls and plugins > Initialization and script ActiveX controls not marked as safe for scripting, set the value to Ask or Active.
It isn't a security problem when your website is enclosed into a specific security context.
And as they say, it's not worth it to build a gas plant.

JavaScript alone can't do this. (No, not even with MS Office or Adobe Reader.) Thankfully.
There are a number of old ways, including using ActiveX, which may work for your needs. As others have pointed out while typing this, you can customize responses based on the mime type or the protocol, etc.
Any way you look at it, you're going to need control over the end users' browser. If you're in a close environment where you can dictate policy (users must use a specific browser, with a specific configuration), then you're going to need to do that. For an open environment with no control over the end users, you're out of luck.

I'm actually having a lot of success right now with SiteFusion. It's a PHP client/server application framework that serves out XUL/JavaScript applications from a server deamon running in Apache. You access applications from a very thin client in XULRunner, or potentially off a web page using extensions. Clients can execute on any platform, and they're outside of the browser sandbox so you can access local resources such as executables. It'a a fairly elegant solution, their website provides great examples and documentation, and their forum is very responsive. I actually found a minor bug in passing arguments to local executables, posted a question about the forum, and it was fixed by the chief developer in under 15 minutes. Very impressive, overall!

How are windows executables [.exe] launched out of browsers?

I'm not talking about browser exploits. I'm talking about real applications used in real companies, like Ijji and Nexon.
Basically, from their websites you can click a "Start Game" button, which will launch an executable located at c:\ijji\english or c\nexon[gamename] respectively. These applications are real desktop applications, meaning that they can take advantage of the filesystem, direct3d, and OS [in the form of executing other applications]. The applications can also be launched through command line [as opposed to going to the game host's website].
I figured this would be possible if the application created an ActiveX object to call for the creation of a new process. However, the websites are able to launch applications from multiple browsers other than Internet Explorer, including chrome, which, to my knowledge, does not implement ActiveX.
Obviously the people developing these applications use their own means to do this.
From looking at the services list as well as currently running applications list, I have no indication that they're running something like "gameLaunchingServer.exe" which listens to some obscure port for an incoming connection [to be accessed using iframe - HTTP Protocol] and responds by launching an application...
I'm stumped, and this is sort of stuck in my mind. Obviously, they're not using some random browser exploit, otherwise people at http://www.[insertMaliciousWebsiteHere].com would have jumped on the opportunity already to install random crap. Regardless, it seems pretty cool, and I wanted to know how it worked.
Just curious, hehe.

I believe what they're doing is setting up their own protocol handler on install - when a browser is asked to access an address with a protocol that it doesn't know how to handle (for instance, a steam:// address), it looks at all the installed protocol handlers to find a match.
So you can register your application as a myApplication:// protocol handler, and then your web page can link to a myApplication:// address and launch your application.

I didn't quite find the button you are talking about, but I'm thinking it works only after you installed the application once, isn't it?
In that case, the application probably created its own protocol, just as skype, msn and a bunch of clients.

Having a protocol is the easiest way (and very easy indeed to implement - a simple registry key).
Another way which is used is an extension or plugin.

I thought they were run through plug-ins or like applets.
For example, MS SilverLight

Can you use the JavaScript engine in web browsers to process local files?

I have a number of users with multi-megabyte files that need to be processed before they can be uploaded. I am trying to find a way to do this without having to install any executable software on their machines.
If every machine shipped with, say, Python it would be easy. I could have a Python script do everything. The only scripting language I can think of that's on every machine is JavaScript. However I know there are security restrictions that prevent reading and writing local files from web browsers.
Is there any way to use this extremely pervasive scripting language for general purpose computing tasks?
EDIT: To clarify the requirements, this needs to be a cross platform, cross browser solution. I believe that HTA is an Internet Explorer only technology (or that the Firefox equivalent is broken).

Would Google Gears work here? Yes, users have to install something, but I think the experience is fairly frictionless. And once it's installed, no more worries.

The application that I maintain and develop for work is an HTML Application or HTA, linked with a SQL Server 2005 backend. This allows various security restrictions to be "avoided". All the client-side components in the application are done with javascript, including writing files to locally mapped network drives and loading data into screens/pages in an AJAXy way.
Perhaps HTA could be helpful for your situation.

For an example of javascript accessing a local file, you might try taking a look at the source of TiddlyWiki, specifically the saveFile, mozillaSaveFile, and ieSaveFile functions. Just click the download link, open the html file it sends you, and search for those functions.
Of course, tiddlywiki is supposed to be used as a local file, not served over the web, so the methods it uses may only work locally.. But it might be a start.

Why not use a flash uploader? http://swfupload.org/

Adobe Flex 4 lets you to open and process a file on a local machine:
http://livedocs.adobe.com/flex/3/langref/flash/net/FileReference.html#load()
It's not exactly JavaScript, but hope that helps.

I believe you can accomplish this using the HTML5 File API.
It is supported in Opera, IE, Safari, Firefox, and Chrome.

you can use fs module from nodeJS to manipulate with filesystem nowadays!

We Keep Coding

JavaScript is the programming language of the Web.