Make an SPA crawlable - javascript

I have an SPA. When a user clicks any button or link, it retrieves new content from the server, but it does not update the URL. My task is to make the site crawlable for search engines such as Google. I heard that PhantomJS can be used to fetch the fully rendered HTML of a page and make the site crawlable that way somehow, but I am not sure about this method. How can I use PhantomJS to make the website crawlable? Any help with this?

One solution would be to pre-render the pages with PhantomJS on the server, and when a robot requests a page, the server returns static HTML. The angular-seo project shows this in detail; you start the rendering server like so:

    phantomjs --disk-cache=no angular-seo-server.js 9090 http://127.0.0.1:9000

This starts a PhantomJS server with disk caching disabled on port 9090. It's important to note that PhantomJS's port needs to be different from the port that your application runs on.
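On the application side you then route crawler traffic to that rendering server. Below is a minimal sketch, not the project's exact API: an Express app that serves the SPA normally but hands known crawler user agents to the PhantomJS server started above. The bot regex, the static folder name, and the assumption that the prerender server renders "base URL + path" for requests on port 9090 are mine to adapt.

    const express = require('express');
    const http = require('http');

    const app = express();
    const BOT_UA = /googlebot|bingbot|yandex|baiduspider|facebookexternalhit/i;

    app.use((req, res, next) => {
      if (!BOT_UA.test(req.headers['user-agent'] || '')) return next();
      // Crawler: fetch the pre-rendered static HTML from PhantomJS.
      http.get({ host: '127.0.0.1', port: 9090, path: req.url }, (phantomRes) => {
        res.status(phantomRes.statusCode || 200);
        phantomRes.pipe(res);
      }).on('error', () => next()); // fall back to the SPA shell on error
    });

    app.use(express.static('app')); // regular browsers get the normal SPA

    app.listen(9000, () => console.log('SPA with prerender proxy on :9000'));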


Single Page Application Web crawlers and SEO

I have created my blog as a single-page application using the Mithril framework on the front end, with a REST API and Django at the back end. Since everything is rendered by JavaScript, when crawlers hit my blog all they see is an empty page. To add to that, whenever I share a post on social media, all Facebook sees is an empty page rather than the post title and content.
I was thinking of looking at the user agent and, whenever the USER-AGENT is from a crawler, feeding it the rendered version of the pages, but I'm having problems implementing that method.
What is the best practice for making a single-page app that uses a REST API and Django on the back end SEO-friendly for web crawlers?
I'm doing this on a project right now, and I would really recommend doing it with Node instead of Python, like this:
https://isomorphic-mithril.mvlabs.it/en/
You might want to look into server-side rendering of the pages that crawlers visit.
Here is a good article on client-side vs. server-side rendering.
I hadn't heard of Mithril before, but you might find a plugin that does this for you.
https://github.com/MithrilJS/mithril-node-render
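If mithril-node-render fits, here is a minimal sketch based on its README; the PostPage component is a stand-in for your own blog view, and the render call resolves to an HTML string you can serve to crawlers.

    // Render a Mithril component to an HTML string on the server.
    const m = require('mithril/hyperscript');
    const render = require('mithril-node-render');

    const PostPage = {
      view: () => m('article', [
        m('h1', 'Post title'),
        m('p', 'Post body rendered on the server.'),
      ]),
    };

    render(m(PostPage)).then((html) => {
      // Serve this string to crawlers instead of the empty index.html.
      console.log(html);
    });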
This might help you: https://github.com/sharjeel619/SPA-SEO
The example above is built with Node/Express, but you can apply the same logic with your Django server.
Logic

1. A browser requests your single-page application from the server, which is loaded from a single index.html file.
2. You program some intermediary server code which intercepts the client request and differentiates whether the request came from a browser or from some social crawler bot.
3. If the request came from a crawler bot, make an API call to your back-end server, gather the data you need, fill that data into HTML meta tags, and return those tags in string format back to the client.
4. If the request didn't come from a crawler bot, simply return the index.html file from the build or dist folder of your single-page application.
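A hedged sketch of that logic in Node/Express follows; the bot regex, the API URL, and the Node 18+ global fetch are assumptions, and with Django you would implement the same branch in a view or middleware instead.

    const express = require('express');
    const path = require('path');

    const app = express();
    const BOTS = /facebookexternalhit|twitterbot|linkedinbot|googlebot/i;

    app.get('/posts/:slug', async (req, res) => {
      if (!BOTS.test(req.headers['user-agent'] || '')) {
        // Regular browser: serve the SPA shell from the build folder.
        return res.sendFile(path.join(__dirname, 'dist', 'index.html'));
      }
      // Crawler: gather the data from the back-end API and fill the meta tags.
      const r = await fetch('http://127.0.0.1:8000/api/posts/' + req.params.slug);
      const post = await r.json();
      res.send('<!doctype html><html><head>' +
        '<title>' + post.title + '</title>' +
        '<meta property="og:title" content="' + post.title + '">' +
        '<meta property="og:description" content="' + post.summary + '">' +
        '</head><body></body></html>');
    });

    app.listen(3000);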

Is it possible to run a Node script from a web page?

I've been searching for days now but could not get an answer.
I would like to do the following:

1. A user connects to editor.html (Apache2 with basic HTTP auth).
2. The user wants to open a file (let's say /home/user1/myfile.txt) on the server with his user/pass (same as in passwd).
3. A Node.js script gets started with the user rights from above, and the user can edit the file.
4. The Node script handles the connection via WebSockets and reads/writes files.

I think the biggest problem is that it's not possible to run a Node script on the server from a web page... and I don't want to involve any PHP/CGI scripts... only Apache and Node.js / JS.
Please also comment or answer if you know that it is really not possible.
Thanks!
Kodak
Edit: The workflow should be the following:
User accesses the web page -> enters his credentials (same as in passwd) -> a Node.js script gets started with the rights of the logged-in user -> files get read or written with those user rights.
Biggest problem: who starts the Node.js script? Apache? How?
I hate to be this person, but...
That is not the way Node is designed; it is designed around the event loop. I would recommend having Node serve the static files, maybe using Apache as a proxy, and then doing whatever needs to be done when someone requests a certain page. If you really must spawn a child process, use child_process.spawn. As for the rights of the user, I recommend just passing in a code, like 1=admin, 2=user, 3=guest, and the child process can do what it needs.
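A minimal sketch of that spawn suggestion; editor-worker.js is a hypothetical script name, and the numeric rights codes follow the 1=admin, 2=user, 3=guest convention described above.

    const { spawn } = require('child_process');

    function runEditor(rightsCode, file) {
      // Start a worker process, passing the rights code and target file as args.
      const child = spawn('node', ['editor-worker.js', String(rightsCode), file]);
      child.stdout.on('data', (chunk) => process.stdout.write(chunk));
      child.stderr.on('data', (chunk) => process.stderr.write(chunk));
      child.on('close', (code) => console.log('worker exited with ' + code));
      return child;
    }

    runEditor(2, '/home/user1/myfile.txt'); // 2 = ordinary user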
Use Socket.IO (see the official Socket.IO website).
You can also use Express with Socket.IO to create a separate app server (see the Express website).
You may want to consider the security implications of allowing a user to connect directly using their server-side account. There are also many existing applications that already do this, which you might consider deploying instead of writing your own, with all the security that properly needs to be built in.
Let your users GET a static auth.html page (via Apache) without any authentication.
Let the form's submit action be some auth.js (a Node.js script). This auth.js checks whether the user's authentication succeeds. If so, it starts the Node.js server, sets up Socket.IO on it, and redirects the user to some editor.html.
As you can notice, in this case the authentication is based on Node.js scripting. If you want basic Apache2 authentication, I can recommend the following scenario:
There are auth.html and editor.html pages on the server. The latter is placed in a /private folder, and direct access to that folder is denied by .htaccess. So when the user passes Apache2 authentication on auth.html, he GETs auth.html, which is an empty document with an onload event handler that sends an AJAX request to auth.js (a Node.js script). Node.js reads private/editor.html and sends it to the user as /editor.html.
In this case the user never has access to the editor without passing authentication. And after authentication the Node.js server is started, Socket.IO is set up, and everything is fine.
I found a solution:
It is possible to write a custom authentication program for Apache with mod-auth-external:
https://code.google.com/p/mod-auth-external/
With basic authentication enabled, the web server passes the credentials to an external script/program, and this program can then run the Node app.
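A hedged sketch of such an external authenticator in Node, for mod_authnz_external's "pipe" method, where Apache writes the username on the first line of stdin and the password on the second, and exit code 0 means success; checkCredentials is a placeholder for your own passwd/PAM check, and this is also the point where you could start the per-user Node app.

    // External auth program: read "user\npass" from stdin, exit 0 on success.
    let input = '';
    process.stdin.on('data', (chunk) => { input += chunk; });
    process.stdin.on('end', () => {
      const [user, pass] = input.split('\n');
      process.exit(checkCredentials(user, pass) ? 0 : 1);
    });

    function checkCredentials(user, pass) {
      // Placeholder: validate against passwd/PAM here (e.g. by calling pwauth).
      return false;
    }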

Scraping a website which has javascript

I'm looking for a method to scrape a website (which uses JavaScript) from the server side and save the output into a MySQL database after analyzing the data. I need to navigate from page to page by clicking links and submitting data from the database, without the session expiring. Is this possible using the phpQuery web browser plugin? I've started doing this using CasperJS. I would like to know the pros and cons of both methods. I'm a beginner in the coding space. Please help.
I would recommend that you use PhantomJS or CasperJS and parse the DOM with JavaScript selectors to get the parts of the pages you want back. Don't use phpQuery, as it's based on PHP and would require a separate step in your processing, versus using just JavaScript DOM parsing. Also, you won't be able to perform click events using PHP; anything client-side would need to run in PhantomJS or CasperJS.
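For example, here is a small CasperJS sketch (run with `casperjs scrape.js`); the URL, form fields, and selectors are illustrative, and the JSON it prints could be piped into a server-side script that inserts rows into MySQL.

    // Illustrative CasperJS script: log in, click a link, extract text.
    var casper = require('casper').create();

    casper.start('http://example.com/login', function () {
      // Fill and submit the login form (selector names are assumptions).
      this.fill('form#login', { username: 'user', password: 'pass' }, true);
    });

    casper.thenClick('a.next-page', function () {
      // Extract the part of the page you want to store later.
      this.echo(JSON.stringify({ title: this.fetchText('h1') }));
    });

    casper.run();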
It might even be possible to write a full scraping engine using just PHP, if that's your server-side language of choice. You would need to reverse-engineer the login process and maintain a cookie jar with your cURL requests to keep your login valid with each request. Once you've established a session with the website, you can then set up your navigation path with an array of links that you would like to crawl. The idea behind web crawling is that you load a page from some link, process the page, and then move to the next link. You continue this process until all pages have been processed, and then your crawl is complete.
I would check out Google's guide Making AJAX Applications Crawlable (note that Google has since deprecated this scheme); the website you're trying to scrape might have adopted it to make its content crawlable.
Look for #! in the URL's hash fragment; this indicates to the crawler that the site supports the AJAX crawling scheme.
To put it simply, when you come across a URL like www.example.com/ajax.html#!key=value, you would modify it to www.example.com/ajax.html?_escaped_fragment_=key=value. The server should respond with an HTML snapshot of that page.
Here is the Full Specification
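The transformation itself is mechanical; a small JavaScript sketch follows (plain string handling, and per the spec a few special characters in the fragment would additionally need percent-escaping).

    // Rewrite an AJAX-crawling-scheme URL into its _escaped_fragment_ form.
    function toEscapedFragmentUrl(url) {
      var parts = url.split('#!');
      if (parts.length < 2) return url; // no #! marker: nothing to do
      var sep = parts[0].indexOf('?') === -1 ? '?' : '&';
      return parts[0] + sep + '_escaped_fragment_=' + parts[1];
    }

    // -> http://www.example.com/ajax.html?_escaped_fragment_=key=value
    console.log(toEscapedFragmentUrl('http://www.example.com/ajax.html#!key=value'));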

Crazy need to ENABLE cross site scripting

Yes, I need to enable cross-site scripting for internal testing of an application I am working on. I would have used Chrome's disable-xss-auditor or disable-web-security switches, but it looks like they are no longer included in the Chrome build:
http://src.chromium.org/svn/trunk/src/chrome/common/chrome_switches.cc
What I am basically trying to achieve is to have a javascript application running locally on pages served by Apache (also running locally) be allowed to run scripts from a resource running on another server on our network.
Failing a way to enable XSS in Firefox, Chrome, or (my least favourite) IE, is there a way to run some kind of proxy process that modifies headers to allow the cross-site requests to happen? Any quick way to use Apache mod_rewrite or some such to do this?
Again, this is for testing only. In production, all these scripts run from the same server, so there isn't even a need to sign them. But during development and testing, it is much easier to work only on the parts of the application you are concerned with and not have to run the rest, which requires a full-on application server setup.
What you need is just a little passthrough service running on the first server that passes requests over to the second server, and returns the results it gets back from the second server.
You don't say what language the server side of your application is written in or what kind of data is passed to or returned from your service, so I can't be more specific than that, but it really should be about 15 lines of code to write the passthrough service.
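For what it's worth, here is a bare-bones sketch of such a passthrough in Node (the target host is an assumption); it relays the method, path, headers, and body to the second server and pipes the response back, which keeps everything same-origin from the browser's point of view.

    const http = require('http');

    http.createServer((clientReq, clientRes) => {
      const upstream = http.request({
        host: 'second-server.example', // the other server on your network (assumption)
        port: 80,
        path: clientReq.url,
        method: clientReq.method,
        headers: clientReq.headers,
      }, (upstreamRes) => {
        clientRes.writeHead(upstreamRes.statusCode, upstreamRes.headers);
        upstreamRes.pipe(clientRes); // relay the response body
      });
      clientReq.pipe(upstream); // forward any request body
    }).listen(8080);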
What you are asking for isn't cross-site scripting, which is a type of security vulnerability where user input (e.g. from the URL) is injected into the page in such a way that third-party scripts could be added via a link.
If you just want to run a script on a different server, then just use an absolute URI.
<script src="http://example.com/foo.js"></script>
If you need to perform Ajax requests to a remote server, use CORS or run a proxy on the current origin.
Again, this is for testing only
Just for testing, look at Charles Proxy. Its Map Remote feature allows you to (transparently) forward some requests to a remote server (based on wildcard URL matching).

Running command with browser

I want to have a "control panel" on a website, and when a button is pressed, I want it to run a command on the server (my computer). The panel is to run different Python scripts I wrote (one script for each button), and I want to use the panel from my Mac, my iPod touch, and my Wii. The best way I can see to do this is a website, since they all have browsers. Is there some JavaScript or something that can run a command on my computer whenever the button is pressed?
EDIT: I heard AJAX might work for server-based things like this, but I have no idea how to do that. Is there a 'system' block or something I can use?
Here are three options:

1. Have each button submit a form with the name of the script in a hidden field. The server will receive the form parameters and can then branch off to run the appropriate script.
2. Have each button hooked to its own unique URL and use JavaScript on the button click to set window.location to that new URL. Your server will receive that URL and can decide which script to run based on it. You could even just use a link on the web page with no JavaScript.
3. Use Ajax to issue a unique URL to your server. This is essentially the same (from the server's point of view) as the previous two options. The main difference is that the web browser doesn't change which URL it's pointing to; the Ajax call just directs the server to do something and return some data, which the host web page can then do whatever it wants with.
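For option 3, the browser side can be as small as the sketch below (it assumes a reasonably modern browser with fetch; older browsers would use XMLHttpRequest, and /run/backup and the element ids are illustrative names).

    // Wire a panel button to a unique URL on the server via Ajax.
    document.querySelector('#backup-button').addEventListener('click', function () {
      fetch('/run/backup', { method: 'POST' })
        .then(function (res) { return res.text(); })
        .then(function (out) {
          document.querySelector('#result').textContent = out; // show script output
        });
    });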
On the client side (the browser), you can do it with the simplest approach: just an HTML form. JavaScript would make it nicer for validation, and Ajax calls would avoid a page refresh. But your main focus is handling it on the server. You can receive the form request in the language of your choice. If you are already running Python, you could write a quick CGI script; look at the cgi module for Python. You would need to set this up in the Apache server on OS X, if that's where you will host it.
Unfortunately, exactly how to write it is beyond the scope of a simple answer. But google how to write an HTML form, or look at jQuery to build a quick form that can make Ajax calls easily.
Then search for how to use the Python cgi module and receive POST requests.
JavaScript is basically for doing work in the browser (usually to render something nice for the end user to look at). What you want (as others have said already) is a way to connect an HTML form action to an action on the web server "back end". And this is exactly (as RobG has pointed out) what CGI is for. An alternative to CGI which is quite popular with Apache users is mod_python; the difference is basically whether the back-end operation runs as a standalone process (CGI) or inside a web server process (mod_python), but for most basic applications your server-side scripts don't need to care. And if you're in a shared hosting environment you may not have a choice; ask your sysadmin (or read your hosting service docs) to learn how best to run CGI scripts in this case.
Caveats:
You will probably need fairly elevated webserver admin access & expertise in order to get everything set up the way you want. You will at least need to be able (both in the sense of permissions and technical understanding) to view your webserver logs, edit your webserver configs and bounce (restart) your http service.
Whatever "back end" operations you want done will be done with the permissions/privileges of the webserver, which may not be the same as the permissions/privileges of the user account which you normally use to perform these operations. There are various ways around this (using custom daemons and/or sudo operations), but you really need to have a clear understanding with the webserver sysadmin (if the webserver is exposed to the Big Bad Internet) about how this is going to work before you deploy anything, otherwise you run the very real risk (especially if you are a noob) of making it possible for hackers to exploit your "command gateway" to hack the webserver.
Of course, if you're just doing all this for fun on your personal laptop (there is an OSX tag on the question, after all), then you are the web server sysadmin, and you're free to hack away and happily shoot yourself in the foot repeatedly while learning everything you need to know along the way. That's fine as long as you're not on a network. In this case, you may find this tutorial to be useful.
