Is there a way to check if the current request was for page source (HTML) not actual site?
And if not (which I think is the case), is there a way I could somehow "parse" this out of request parameters and maybe times or something?
I need this so I can serve the real source when someone views it, and a trimmed-down version when the browser renders it.
Is there a way to check if the current request was for page source (HTML) not actual site?
No. The request is always for the page source. There is no way to distinguish what the browser is going to do with it.
Also, many browsers (like IE) can't make a request for "view source" at all - you always load the whole site, render it, and then do a "view source".
Workaround ideas: (All terribly flawed)
Add some JavaScript to the page making an Ajax call. If the call is made, the page was rendered.
Add some image resource to the page. If it's loaded, the page was rendered. (A rough sketch of both ideas follows.)
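A rough sketch of either idea in plain JavaScript; the /rendered endpoint here is a made-up logging URL on your own server, not anything standard:

// If this ever fires, the HTML was rendered by a browser rather than merely fetched.
// A plain "view source" request won't run scripts, so the beacon never arrives in that case.
window.addEventListener('load', function () {
    var beacon = new Image();
    beacon.src = '/rendered?page=' + encodeURIComponent(location.pathname) +
                 '&t=' + Date.now(); // cache-buster so every render gets logged
});

Anyone with scripts disabled or an image blocker will skew the results, which is part of why these ideas are flawed.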
If this is to protect your HTML source code, forget it and go do something productive instead. :)
It sounds like you want to send optimized/compressed/bandwidth-friendly code when the page is being viewed in the browser, and readable/understandable/indented code when the user wants to view the source, right?
Unfortunately, that's not possible, and I suspect it would be counter-productive, since it would prevent you from debugging problems if your JavaScript minifier or HTML compressor were introducing them. It would be far better to use something that reintroduces whitespace and indentation for readability, for example the View Source Chart extension for Firefox. (I don't know what options there are for un-minifying JavaScript code.)
Related
I am making a web crawler that tries to find security issues in web sites (something like w3af), for example finding XSS and so on.
But I have a problem: my crawler tries to find the forms in the pages using regex (I know regex can't parse HTML, but I'm only parsing the form tag). I found that some pages (like Google) have the entire page encoded in some sort of ugly JS code, and then nothing works. In addition, when I inject scripts into the URLs and want to check whether they execute (reflected XSS), I need something that can tell me if (for example) an alert showed up.
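Just for context, the regex extraction I do is roughly like this (a simplified sketch; the function name is made up):

// Grab every <form ...>...</form> block and pull out its action attribute.
var formBlock = /<form\b[^>]*>[\s\S]*?<\/form>/gi;
var actionAttr = /action\s*=\s*["']?([^"'\s>]+)/i;

function findForms(html) {
    var matches = html.match(formBlock) || [];
    return matches.map(function (form) {
        var action = form.match(actionAttr);
        return { raw: form, action: action ? action[1] : null };
    });
}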
So I need a module that takes the source code of a page and can give me the rendered page, or tell me whether an alert occurred.
Does someone know of something that might work? (I've found some solutions such as Selenium, but it requires a browser and is therefore way too slow.)
Thanks in advance!
I've read a lot of related answers but I still don't see the problem. I think MY problem is that I don't have a good grasp of the basics of HTML and potentially javascript. I'm talking about how they are stuck together and operate, not the particular language syntax. Perhaps somebody could give me the big picture explanation of what is going wrong here.
I'm using a simple WebBrowser control to navigate to a web page. This results in everything displaying correctly. Now, I'd like to save that HTML content locally on the machine and open it again later, then render it in another WebBrowser control. This has not worked so far. The page renders briefly but without images and effects, then I get an exception regarding scripts. So I decided to do a very simple test. I would get the HTML from the browser, then immediately read that text back into the browser.
After navigating to the page successfully, I get the HTML text as follows:
string html = myWebBrowser.DocumentText;
I then immediately set the DocumentText property to its original value.
myWebBrowser.DocumentText = html;
This gives me the same error and effect as if my other application was reading the saved HTML. So what is going on here? The browser initially shows all content successfully but then extracting and reloading the HTML text breaks it all. I must be missing a very obvious and basic concept here. Is it that the WebBrowser control's DocumentText property does not actually return the original HTML code, but rather a modified version? Or is it that setting that property modifies something? Is it neither? Thanks to anyone who can sort out my understanding of how all this works.
This is by design. An HTML file is not fully self-contained. The browser parses the HTML and pulls resources from other URLs, including scripts, images, styles, etc. If you save only the HTML and then open that file in a browser, many of the resources will not be found, since they are loaded from locations relative to the original page. Once the file is on your computer, any relative link to a resource is invalid, so the browser shows only the basic HTML plus whatever resources were referenced by absolute paths rather than relative ones.
I am creating a Ruby on Rails app. A specific page in my app is divided into several sections by <div> tags. Each <div> includes a combination of text (using different fonts), symbols and mathematical formulas. I use MathJax and a few other JavaScript libraries to display them correctly, and everything works great on my computer. However, JavaScript is not enabled in everyone's browser, and some of the JavaScript might not load correctly in other people's browsers. One solution I was thinking of is this: after all the JavaScript has finished processing and the page is displayed correctly on my computer (server), I use some code to generate a PNG snapshot of each <div> and send them to the server (for example, I click a <button> tag on the page to activate this code once I'm happy that what is displayed is correct). Then I'll save these images in the database and serve them instead, which will look the same on everyone's computer regardless of whether JavaScript is enabled, what browser they're using, etc. Is anyone aware of a code or command that I can use? Please note, currently the JavaScript processes the HTML content and produces the correct display after the page is loaded. Also, I don't want a snapshot of the whole page, but of each <div> separately.
Thanks a lot.
Well, this is a client-side problem. Here is a JavaScript library that will do the job for you: http://experiments.hertzen.com/jsfeedback/
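That link points at a demo of the html2canvas library. A rough sketch of using it on a single <div> (this assumes a recent, promise-based build of html2canvas; the element id and the /snapshots upload URL are made up):

// Capture one <div> as a canvas and POST the PNG data to the server.
html2canvas(document.getElementById('formula-div')).then(function (canvas) {
    $.post('/snapshots', {
        div: 'formula-div',
        png: canvas.toDataURL('image/png') // base64-encoded image data
    });
});

On the Rails side you would decode the base64 data, store the image, and serve it to clients that don't run JavaScript.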
You've got a bit of a problem there. Javascript is not executed until the page has finished loading, i.e. all of the information has already been sent to the client. You're not executing javascript at the server level, so you wouldn't be able to do that kind of processing at all. If they have javascript disabled, your code will never get executed.
You could generate the images using ImageMagick or something similar; I know PHP has bindings for that. There are a couple of extremely messy solutions like rendering the page in a browser on the server side with something like Selenium, but I definitely wouldn't recommend doing that. Overall, it depends on the platform you're developing on, but most major languages have support for generating images in a way that doesn't require 100% JavaScript.
What is the best thing to do when a user doesn't have JavaScript enabled? What is the best way to deliver content to that kind of user? What is the best way to keep a site readable by search engines?
I can think of two ways to achieve this, but do not know what is better (or if a 3rd option is better):
Rely on the meta-refresh tag to redirect users to a non-JavaScript version of the site. Wrap the meta-refresh tag in a noscript tag so it will be ignored by those with JavaScript.
Rely on an iframe tag located within the body tag to deliver a non-JavaScript version of the site. Wrap the iframe tag in a noscript tag so it will be ignored by those with JavaScript.
I would also appreciate high-profile examples of the correct or incorrect way to do this.
--------- ADDITION TO QUESTION -----------
Here is an example of what I have done in the past to address this: http://photocontest.highpoint.edu/
I want to make sure there aren't better ways to do this.
You are talking about graceful degradation: designing and building the site to work with JavaScript, then making sure it still works with JavaScript turned off. The easiest thing to do is to include the HTML "noscript" tag somewhere near the top of your page with a message saying that the site REQUIRES JavaScript or things won't work right. SO is a perfect example of this. Most of the buttons at the top of the screen run via JavaScript. Turn it off and you get a nice red banner and the drop-down JS effects are gone.
I prefer progressive enhancement development. Get the site working in its entirety without JavaScript / Flash / CSS3 / whatever, THEN enhance it bit by bit (still include the noscript tag) to improve the user experience. This ensures you have a fully working, readable website regardless of whether you're a disabled user with a screen reader or a search engine, whilst providing a good user experience for users with newer browsers.
Bottom line: for any dynamically generated content (for example page elements generated via AJAX) there has to be a static page alternative where this content must be available via a standard link. If you are using javascript for tabbed content, then show all the content in a way that is consistent with the rest of the webpage.
An example is http://www.bbc.co.uk/news/ Turn off javascript and you have a full page of written content, pictures, links etc. Turn on javascript and you get scrolling news stories, tabbed content, scrolling pictures and so on.
I'm going to be naughty and post links to wikipedia:
Progressive Enhancement
Graceful Degradation
You have another option: just load the same page but make it work for noscript users (progressive enhancement / graceful degradation).
A simple example:
You want to load content into a div with Ajax: make an <a> tag linking to the full page with the new content (noscript behavior), then bind the <a> tag with jQuery to intercept clicks and load the content with Ajax (script behavior).
$('a.ajax').click(function () {
    var anchor = $(this);
    // Pull just the #content fragment out of the linked page and swap it in.
    $('#content').load(anchor.attr('href') + ' #content');
    return false; // stop the browser from following the link normally
});
I'm not entirely sure if Progressive Enhancement is considered to be best practice these days but it's the approach I personally favour. In this case you write your server side code so that it functions like a standard web 1.0 web app (no JavaScript) to provide at least enough functionality for the system to work without JavaScript. You then start layering JavaScript functionality on top of this to make the system more user friendly. If done properly you should end up with a web app that at least provides enough functionality to be useful for non-JavaScript users.
A related process is known as Graceful Degradation, which works in a similar way but starts with the assumption that the user has JavaScript enabled and builds in workarounds for cases where they don't. This has a drawback, however, in that if you overlook something you can leave a non-JavaScript user without a fallback.
Progressive Enhancement example for a search page: build your search page so that it normally just returns an HTML page of search results, but also add a flag that can be set via GET so that, when set, it returns XML or JSON instead. On the search page, include a script that makes an AJAX request to the search page with the flag appended to the query string and then replaces the main content of the page with the result of the AJAX call. JavaScript users get the benefit of AJAX, but those without JavaScript still get a workable search page.
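A rough jQuery sketch of that pattern; the format=json flag, the element ids, and the response shape are all assumptions for illustration, not a real API:

// Without JavaScript the form submits normally and the server returns a full
// HTML results page; with JavaScript we request JSON and rebuild the list in place.
$('#search-form').submit(function (e) {
    e.preventDefault();
    $.getJSON($(this).attr('action'), $(this).serialize() + '&format=json')
        .done(function (results) {
            var list = $('#results').empty();
            $.each(results, function (i, r) {
                list.append($('<li>').append($('<a>').attr('href', r.url).text(r.title)));
            });
        })
        .fail(function () {
            e.target.submit(); // fall back to the plain page if the JSON request fails
        });
});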
http://en.wikipedia.org/wiki/Progressive_enhancement
If your application must have javascript to function then there's nothing you can do except show them a polite message in a noscript tag.
Otherwise, you should be thinking the other way around.
Build your site without JS
Give an awesome user experience and make it fully functional
Add JS and make the UX even more functional. Layer the JS on top.
So if the user doesn't have JS, your site will simply fall back to the state described in step 2.
As for crawling: if your site depends on AJAX and a lot of JS to work, you can make Google aware of it: http://code.google.com/web/ajaxcrawling/docs/getting-started.html
One quick tip that may help you: just install Lynx, a command-line web browser, and you'll immediately see how Google and other search engines see your site (and blind people too). This is very useful. Of course, in a command-line window there are no graphics and JavaScript is disabled.
If you're doing "serious" Ajax (e.g. client-side routing) the following technique could be useful:
Use URLs without GET/"?" parameters (it makes your life easier later on)
Use http://baseurl.com/#!/path/to/resource for client-side routing
Implement rendering of non-script HTML-version of your site (HTML snapshot is what Google calls it) at http://baseurl.com/path/to/resource
Wrap the whole content of your HTML snapshot in noscript-tags and redirect via top.location.href to the full version of the site
Handle http://baseurl.com/?_escaped_fragment_=/path/to/resource - it should redirect via a 301 response to http://baseurl.com/path/to/resource (see the sketch after this list)
Use a-tags only for GET-links, use forms for POST/PUT/DELETE-links - unstyle the hell out of them if necessary
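A rough server-side sketch of the snapshot route and the _escaped_fragment_ redirect, using Express purely as an example framework (renderSnapshot is a made-up stand-in for whatever produces your HTML snapshot):

var express = require('express');
var app = express();

// Stand-in for whatever produces your HTML snapshot (server-side template, curl, etc.).
function renderSnapshot(path) {
    return '<html><body><h1>Snapshot of ' + path + '</h1></body></html>';
}

// Redirect the crawler form http://baseurl.com/?_escaped_fragment_=/path/to/resource
// to the plain snapshot URL with a 301.
app.get('/', function (req, res, next) {
    var fragment = req.query._escaped_fragment_;
    if (fragment !== undefined) {
        return res.redirect(301, fragment || '/');
    }
    next();
});

// Serve the non-script HTML snapshot at the plain URL.
app.get('/path/to/resource', function (req, res) {
    res.send(renderSnapshot('/path/to/resource'));
});

app.listen(3000);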
A nice code example for links that I found while researching "How to write proper Ajax code":
Resource
This is of course a pretty complex solution but it should enable both SEO (including non-search engine crawlers) and accessibility. The problem is that you have to be able to render your page server- AND client side.
One solution could be to use a templating framework like mustache where implementations for different platforms exist.
Use something like {{#pagelet}}/path/to/partial{{/pagelet}} for dynamic parts of your page - example: {{#pagelet}}/image/{{image_id}}/preview{{/pagelet}}
In your client-side rendering, pagelet would be implemented to be dynamically replaced with something loaded via Ajax (for example: render )
In your server-side rendering, pagelet would just be rendered directly (if in doubt, just curl the pagelet and render it right away; or, if you can write the code asynchronously, do it just as you would client side: write a temporary span into a buffer, start fetching all the pagelets, replace the temporary spans as the pagelets arrive, and flush the buffer once all pagelets have been rendered). A rough client-side sketch follows.
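Here is one way the client-side half could look with mustache.js section lambdas; the placeholder/Ajax wiring is just a sketch of the idea, not an established API (mustache.js and jQuery are assumed to be loaded, and the ids are made up):

var template = 'Preview: {{#pagelet}}/image/{{image_id}}/preview{{/pagelet}}';

var pending = [];                      // placeholders still waiting for their Ajax content
var view = {
    image_id: 42,
    pagelet: function () {             // mustache.js section lambda
        return function (text, render) {
            var url = render(text);    // e.g. "/image/42/preview"
            var id = 'pagelet-' + pending.length;
            pending.push({ id: id, url: url });
            return '<span id="' + id + '">loading...</span>';
        };
    }
};

document.getElementById('content').innerHTML = Mustache.render(template, view);
// The placeholders are now in the DOM, so fetch each pagelet via Ajax.
$.each(pending, function (i, p) { $('#' + p.id).load(p.url); });

Server side, the same pagelet lambda would simply fetch (or directly render) the partial and return its markup inline, so the template stays identical on both ends.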
That's the best general design I found so far. You can deep link into your app, it's search engine friendly and it should force you to build a page that gracefully degrades.
P.S.: One advantage of the techniques described above is that both the Ajax and the "Web 1.0" rendering of a page can benefit from memcached caching of whole pagelets.
I would prefer to code the page without JavaScript and then, if JavaScript is enabled, redirect users to a similar page that uses it (the same concept as progressive enhancement).
redirecting with javascript
As part of a job I'm doing on a web site I have to copy a few thousand lines of text from several pages of the old site and paste them into the HTML for the new site. The long and painstaking way of going to the old page and copying the many lines of text and then going to my editor and pasting it there line by line is getting really old. I thought of using injected JavaScript to do this but I'm not quite sure where to start. Thanks in advance for any help.
Here are links to a page of the old site and a page of the new site. As you can see in the tables on each page it would take a ton of time to copy it all manually.
Old site: http://temp.delridgelegalformscom.officelive.com/macorporation1.aspx
New Site: http://ezwebsites.us/delridge/macorporation1.html
In order to do this type of work, you need two things: a way of injecting or executing your script on that page, and a good working knowledge of the Document Object Model for the target site.
I highly recommend the Firefox plugin Firebug, or an equivalent tool in your browser of choice. Firebug lets you execute commands from a JavaScript console, which will help. Hopefully the old site does not have a bunch of <FONT>, <OBJECT> or <IFRAME> tags, which would make this even more tedious.
Using a library like Prototype or jQuery will also help with selecting the parts of the website you need. You can submit results using jQuery like this:
$(function () {
    // Grab the markup of the old page's content area and POST it to the new server.
    var snippet = $('#content-id').html(); // html() is a method, so it needs the parentheses
    $.post('http://myserver/page', { content: snippet });
});
A problem you will very likely run into is the "same-origin policy" many browsers enforce for JavaScript. So if your JavaScript was loaded from http://myserver, as in this example, you would be OK.
Perhaps another route you can take is to use a scripting language like Ruby, Python, or (if you really have patience) VBA. The script can automate the list of pages to scrape and a target location for the information. It can just as easily package it up as a request to the new server if that's how pages get updated. This way you don't have to worry about injecting the JavaScript and hoping all works without problems.
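For what it's worth, here is a rough sketch of the same idea in Node.js rather than the Ruby or Python mentioned above (the cheerio package is assumed for the HTML parsing, and the selectors are guesses that will need adjusting to the old site's markup):

var http = require('http');
var cheerio = require('cheerio');      // jQuery-like HTML parsing on the server

var pages = ['http://temp.delridgelegalformscom.officelive.com/macorporation1.aspx'];

pages.forEach(function (url) {
    http.get(url, function (res) {
        var html = '';
        res.setEncoding('utf8');
        res.on('data', function (chunk) { html += chunk; });
        res.on('end', function () {
            var $ = cheerio.load(html);
            $('table tr').each(function () {
                // Join the cells of each row with tabs so the output pastes cleanly.
                var cells = $(this).find('td').map(function () {
                    return $(this).text().trim();
                }).get();
                console.log(cells.join('\t'));
            });
        });
    });
});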
I think you need Greasemonkey: http://www.greasespot.net/