Parse a JavaScript-generated HTML page - javascript

I am currently implementing a Chrome extension to parse certain websites. I came across a site whose content is generated by inline/external JS code (I think!). How can I parse a website of this kind? I am trying to fetch the whole page through XMLHttpRequest() inside my parser. I tried using eval() and jQuery's html(). With jQuery I could parse some of the elements, but the results were inaccurate.
Sample code of my parser:
var siteaddress = "http://www.xyz.com/search?q=abcd"; // include the scheme, otherwise the URL is treated as relative
var req = new XMLHttpRequest();
req.open('GET', siteaddress, true);
parseHT(req, x);
req.send(null);
function parseHT(req_new, x) {
    req_new.onload = function () {
        //console.log(this.responseText);
        var jshtml = req_new.responseText;
        var el = $('<div></div>');
        var html = el.html(jshtml);
        // processing steps follow
    };
}
Thanks

Related

Modify <div> container on click of a button, insert HTML from GET request

Prerequisites
I have a website that displays a page with an input and a button. On the other end is a server that exposes a very basic HTTP API. The API is called like this:
http://127.0.0.1/api/arg1/arg2/arg3
where argX are the arguments. It returns raw HTML. This HTML code needs to be inserted into the Website (another domain). There is a
<div id="container5"></div>
on the website. The HTML needs to be inserted into this container. The code returned by the API is specifically made to be inserted into this container, as it uses CSS classes and scripts from the website, i.e., the code is not valid on its own.
The Goal
Here is what I have: I've got the API to return what I want, and I got a small JavaScript to run on the website to change the contents of the container:
var element = document.getElementById("container5");
element.innerHTML = "New Contents";
This works so far. Now I need a way to get the HTML from the API to the page. By reading numerous SO questions, it quickly became clear that reading HTML from another URL is close to impossible in JavaScript, due to security constraints.
Is there an easy way to do this with JavaScript, or do I need to rethink the whole process somehow? One last constraint on my side is that I can only insert JS into the website; I can't, for example, upload a new file to the server.
Edit 1: Workaround!
I solved this for me by using a PHP intermediate file on the requesting server:
<?php
echo file_get_contents('http://example.com');
?>
This will generate a site using the HTML content of any URL. Now the requesting site can read this by using JavaScript:
var getHTML = function ( url, callback ) {
// Feature detection
if ( !window.XMLHttpRequest ) return;
// Create new request
var xhr = new XMLHttpRequest();
// Setup callback
xhr.onload = function() {
if ( callback && typeof( callback ) === 'function' ) {
callback( this.responseXML );
}
}
// Get the HTML
xhr.open( 'GET', url );
xhr.responseType = 'document';
xhr.send();
};
This modifies any element:
var element = document.getElementById("resultpage");
getHTML( 'http://localserver.org/test.php', function (response) {
element.innerHTML = response.documentElement.innerHTML;
});
Check out CORS: https://en.wikipedia.org/wiki/Cross-origin_resource_sharing
JSONP is covered in the same article.
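As a rough illustration of the CORS approach: if you control the API server (the question suggests this), it can opt in to cross-origin reads by sending an Access-Control-Allow-Origin response header. The sketch below is JavaScript for a Node-style server; the function name corsHeaders, the ALLOWED_ORIGINS list, and the origin in it are hypothetical placeholders, not part of the original setup.

```javascript
// Hypothetical list of sites that are allowed to read the API's responses.
var ALLOWED_ORIGINS = ["http://www.example.com"];

// Return the CORS response headers for the value of a request's Origin header.
function corsHeaders(origin) {
    if (ALLOWED_ORIGINS.indexOf(origin) === -1) {
        // No CORS headers: the browser will refuse to expose the response
        // to scripts running on that origin.
        return {};
    }
    return {
        "Access-Control-Allow-Origin": origin,
        // Tell caches that the response depends on the Origin header.
        "Vary": "Origin"
    };
}
```

The server would merge these headers into every API response. With them in place, a plain XMLHttpRequest from the website to the API should work without the intermediate PHP file.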

Parse dynamically loaded document on background using createHTMLDocument

Following this post, I'm trying to load a document via Ajax and find the contents of specific document nodes so that I can display them without navigating the browser away from the current page.
However, my document always seems to be an empty document.
Ajax callback:
function processRatingToken(data) { // data is just a standard HTML document string
var doc = document.implementation.createHTMLDocument();
doc.open();
//Replace scripts
data = data.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, "");
//Write HTML to the new document
doc.write(data);
doc.close();
console.log(doc.body); //Empty
}
So what's wrong?
Note: I'm using this strategy because I'm building a Greasemonkey userscript. If you are developing an Ajax application, this strategy is NOT recommended. Use JSON instead.
There is a workaround with .innerHTML property:
doc.childNodes[1].innerHTML = data;
Where .childNodes[1] is the <html> element.
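The script-stripping step from the callback above can be checked in isolation. The sketch below wraps the same regular expression in a hypothetical helper, stripScripts, and adds a browser-only alternative, parseHtml, that assumes DOMParser with "text/html" support is available; both names are illustrative and not from the original post.

```javascript
// Remove <script>...</script> blocks from an HTML string, using the same
// regular expression as the Ajax callback above.
function stripScripts(html) {
    return html.replace(/<script\b[^<]*(?:(?!<\/script>)<[^<]*)*<\/script>/gi, "");
}

// Browser-only alternative (assumes DOMParser supports "text/html"):
// parse the string into a separate document without writing to it.
function parseHtml(html) {
    return new DOMParser().parseFromString(stripScripts(html), "text/html");
}

var sample = '<p>before</p><script>alert("x");</script><p>after</p>';
var cleaned = stripScripts(sample);
```

Here cleaned contains only the two paragraphs. In a browser, parseHtml(data).body can then be queried with the usual DOM methods.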

How do I use jQuery in Windows Script Host?

I'm working on some code that needs to parse numerous files that contain fragments of HTML. It seems that jQuery would be very useful for this, but when I try to load jQuery into something like WScript or CScript, it throws an error because of jQuery's many references to the window object.
What practical way is there to use jQuery in code that runs without a browser?
Update: In response to the comments, I have successfully written JavaScript code to read the contents of files using new ActiveXObject('Scripting.FileSystemObject');. I know that ActiveX is evil, but this is just an internal project to get some data out of some files that contain HTML fragments and into a proper database.
Another Update: My code so far looks about like this:
var fileIo, here;
fileIo = new ActiveXObject('Scripting.FileSystemObject');
here = unescape(fileIo.GetParentFolderName(WScript.ScriptFullName) + "\\");
(function() {
var files, thisFile, thisFileName, thisFileText;
for (files = new Enumerator(fileIo.GetFolder(here).files); !files.atEnd(); files.moveNext()) {
thisFileName = files.item().Name;
thisFile = fileIo.OpenTextFile(here + thisFileName);
thisFileText = thisFile.ReadAll();
// I want to do something like this:
s = $(thisFileText).find('input#txtFoo').val();
}
})();
Update: I posted this question on the jQuery forums as well: http://forum.jquery.com/topic/how-to-use-jquery-without-a-browser#14737000003719577
Following along with your code, you could create an instance of IE using Windows Script Host, load your HTML file into the instance, append jQuery dynamically to the loaded page, then script from that.
This works in IE8 with XP, but I'm aware of some security issues in Windows 7/IE9. If you run into problems you could try lowering your security settings.
var fileIo, here, ie;
fileIo = new ActiveXObject('Scripting.FileSystemObject');
here = unescape(fileIo.GetParentFolderName(WScript.ScriptFullName) + "\\");
ie = new ActiveXObject("InternetExplorer.Application");
ie.visible = true
function loadDoc(src) {
var head, script;
ie.Navigate(src);
while(ie.busy){
WScript.sleep(100);
}
head = ie.document.getElementsByTagName("head")[0];
script = ie.document.createElement('script');
script.src = "http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js";
head.appendChild(script);
return ie.document.parentWindow;
}
(function() {
var files, thisFile, win;
for (files = new Enumerator(fileIo.GetFolder(here).files); !files.atEnd(); files.moveNext()) {
thisFile = files.item();
if(fileIo.GetExtensionName(thisFile)=="htm") {
win = loadDoc(thisFile);
// your jQuery reference = win.$
WScript.echo(thisFile + ": " + win.$('input#txtFoo').val());
}
}
})();
This is pretty easy to do in Node.js with the cheerio package. You can read in arbitrary HTML from whatever source you want, parse it with cheerio and then access the parsed elements using jQuery style selectors.

Parsing XML / RSS from URL using JavaScript

Hi, I want to parse XML/RSS from a live URL like http://rss.news.yahoo.com/rss/entertainment using pure JavaScript (not jQuery). I have googled a lot, but nothing worked for me. Can anyone help with a working piece of code?
(You cannot have googled a lot.) Once you have worked around the Same Origin Policy, and if the resource is served with an XML MIME type (which it is in this case, text/xml), you can do the following:
var x = new XMLHttpRequest();
x.open("GET", "http://feed.example/", true);
x.onreadystatechange = function () {
if (x.readyState == 4 && x.status == 200)
{
var doc = x.responseXML;
// …
}
};
x.send(null);
(See also AJAX, and the XMLHttpRequest Level 2 specification [Working Draft] for other event-handler properties.)
In essence: No parsing necessary. If you then want to access the XML data, use the standard DOM Level 2+ Core or DOM Level 3 XPath methods, e.g.
/* DOM Level 2 Core */
var title = doc.getElementsByTagName("channel")[0].getElementsByTagName("title")[0].firstChild.nodeValue;
/* DOM Level 3 Core */
var title = doc.getElementsByTagName("channel")[0].getElementsByTagName("title")[0].textContent;
/* DOM Level 3 XPath (not using namespaces) */
var title = doc.evaluate('//channel/title/text()', doc, null, 0, null).iterateNext();
/* DOM Level 3 XPath (using namespaces) */
var namespaceResolver = (function () {
var prefixMap = {
media: "http://search.yahoo.com/mrss/",
ynews: "http://news.yahoo.com/rss/"
};
return function (prefix) {
return prefixMap[prefix] || null;
};
}());
var url = doc.evaluate('//media:content/@url', doc, namespaceResolver, 0, null).iterateNext();
(See also JSX:xpath.js for a convenient, namespace-aware DOM 3 XPath wrapper that does not use jQuery.)
However, if for some (wrong) reason the MIME type is not an XML MIME type, or if it is not recognized by the DOM implementation as such, you can use one of the parsers built into recent browsers to parse the responseText property value. See pradeek's answer for a solution that works in IE/MSXML. The following should work everywhere else:
var parser = new DOMParser();
var doc = parser.parseFromString(x.responseText, "text/xml");
Proceed as described above.
Use feature tests at runtime to determine the correct code branch for a given implementation. The simplest way is:
if (typeof DOMParser != "undefined")
{
var parser = new DOMParser();
// …
}
else if (typeof ActiveXObject != "undefined")
{
var xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
// …
}
See also DOMParser and HTML5: DOM Parsing and Serialization (Working Draft).
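The two branches of the feature test can be combined into one helper. This is a minimal sketch, with parseXml as a hypothetical name; the ActiveX branch assumes the MSXML document object's async property and loadXML method.

```javascript
// Parse an XML string with whichever parser the environment provides.
function parseXml(text) {
    if (typeof DOMParser != "undefined") {
        // Standards-based browsers
        return new DOMParser().parseFromString(text, "text/xml");
    }
    if (typeof ActiveXObject != "undefined") {
        // Older IE / MSXML
        var xmlDoc = new ActiveXObject("Microsoft.XMLDOM");
        xmlDoc.async = false; // parse synchronously so the document is ready on return
        xmlDoc.loadXML(text);
        return xmlDoc;
    }
    throw new Error("No XML parser available in this environment");
}
```

A caller would then write var doc = parseXml(x.responseText); and proceed with the DOM or XPath methods shown earlier.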
One big problem you might run into is that, generally, you cannot get data across domains. This is a big issue with most RSS feeds.
The common way to deal with loading data cross-domain in JavaScript is called JSONP. Basically, this means that the data you are retrieving is wrapped in a JavaScript callback function. You load the URL with a script tag, and you define the function in your code. So when the script loads, it executes the function and passes the data to it as an argument.
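The mechanism described above can be sketched as follows. The helper names buildJsonpUrl and loadJsonp are hypothetical, and the "callback" query parameter is an assumption: each service documents its own parameter name, if it supports JSONP at all.

```javascript
// Build the script URL for a JSONP request.
// Assumption: the service reads the callback name from a "callback" parameter.
function buildJsonpUrl(endpoint, callbackName) {
    var url = new URL(endpoint);
    url.searchParams.set("callback", callbackName);
    return url.toString();
}

// Browser-only: define the global callback, then load the URL with a script tag.
function loadJsonp(endpoint, callbackName, onData) {
    window[callbackName] = function (data) {
        delete window[callbackName]; // remove the temporary global
        onData(data);
    };
    var script = document.createElement("script");
    script.src = buildJsonpUrl(endpoint, callbackName);
    script.onload = function () { script.remove(); };
    document.head.appendChild(script);
}
```

The server is expected to answer with a script such as handleFeed({...}), which the browser executes, passing the data to the callback defined in loadJsonp.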
The problem with most xml/rss feeds is that services that only provide xml tend not to provide JSONP wrapping capability.
Before you go any further, check to see if your data source provides a JSON format and JSONP functionality. That will make this a lot easier.
Now, if your data source doesn't provide json and jsonp functionality, you have to get creative.
One relatively easy way to handle this is to use a proxy server. Your proxy runs somewhere under your control and acts as a middleman to get your data. The server loads your XML, and your JavaScript makes its requests to the proxy instead. If the proxy server runs on the same domain name, you can just use standard XHR (Ajax) requests and you don't have to worry about cross-domain restrictions.
Alternatively, your proxy server can wrap the data in a jsonp callback and you can use the method mentioned above.
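A same-domain proxy of this kind might look like the sketch below, written for Node.js. It assumes a Node version with a global fetch; the FEEDS whitelist, the /feed/<name> path scheme, and the function names are hypothetical choices for illustration.

```javascript
// Hypothetical whitelist: feed names the proxy will fetch, mapped to upstream URLs.
var FEEDS = {
    entertainment: "http://rss.news.yahoo.com/rss/entertainment"
};

// Map a request path such as "/feed/entertainment" to an upstream URL.
// Returns null for anything not on the whitelist, so the proxy cannot be
// used to fetch arbitrary URLs.
function resolveFeed(path) {
    var match = /^\/feed\/([a-z]+)$/.exec(path);
    if (!match) return null;
    return Object.prototype.hasOwnProperty.call(FEEDS, match[1]) ? FEEDS[match[1]] : null;
}

// Request handler in the shape expected by Node's http.createServer.
async function handleProxy(req, res) {
    var target = resolveFeed(req.url);
    if (!target) {
        res.writeHead(404, { "Content-Type": "text/plain" });
        res.end("Unknown feed");
        return;
    }
    try {
        var upstream = await fetch(target); // assumes global fetch is available
        var body = await upstream.text();
        res.writeHead(upstream.status, { "Content-Type": "text/xml" });
        res.end(body);
    } catch (err) {
        res.writeHead(502, { "Content-Type": "text/plain" });
        res.end("Upstream request failed");
    }
}

// To run: require("http").createServer(handleProxy).listen(8080);
```

The page would then request /feed/entertainment from its own domain and read x.responseXML as usual. Restricting the proxy to a whitelist is a design choice that keeps it from becoming an open relay.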
If you are using jQuery, then XHR and JSONP requests are built-in methods, which makes the coding very easy. Other common JS libraries should also support these. If you are coding all of this from scratch, it's a little more work but not terribly difficult.
Now, once you get your data, hopefully it's just JSON. Then there's no parsing needed.
However, if you end up having to stick with an XML/RSS version, and if you're using jQuery, you can simply use jQuery.parseXML http://api.jquery.com/jQuery.parseXML/.
It is better to convert the XML to JSON: http://jsontoxml.utilities-online.info/
After converting, if you need to print the JSON object, check this tutorial:
http://www.w3schools.com/json/json_eval.asp

How to receive XML code without AJAX

In an app that I am creating, I have to receive from the server an XML string in this format, e.g.:
<reply>
<script>
alert('Hello World!');
</script>
</reply>
When I did this using Ajax it worked perfectly, but when I try to receive the data in an iframe I can't extract the data from the frame because it is not there. IE and FF open new tabs and put the data in that tab. How do I avoid that and make them insert the data into the frame?
I can make this work, still using JavaScript: get the result of the Ajax request and write it inside the iframe.
First, create your iframe tag, giving it the id iftarget.
Then use this JavaScript code to insert the Ajax result:
var t = document.getElementById('iftarget');
var h = t.contentWindow.document.getElementsByTagName('html');
h[0].innerHTML = '<h1>Hello</h1> This must work! Put your data here';
I have created a jsFiddle for this
http://jsfiddle.net/nunomazer/JGyEr/
Best Regards
