How to parse javascript from web page - javascript

I need to get links from the JavaScript on a page. I am using jsoup, but it didn't work.
[screenshot]
I need to get this link from the page source. Can anyone help me do this?
String url = "http://www.cda.pl/video/149016ec/Rybki-z-ferajny-2004-1080p-Dubbing-pl";
Document doc = Jsoup.connect(url).get();
Elements scriptElements = doc.getElementsByTag("script");
for (Element element : scriptElements) {
    for (DataNode node : element.dataNodes()) {
        System.out.println(node.getWholeData());
    }
    System.out.println("-------------------");
}
I marked the URLs I want to get on the screenshot.

You can use this code:
String url = "http://www.cda.pl/video/149016ec/Rybki-z-ferajny-2004-1080p-Dubbing-pl";
Document doc = Jsoup.connect(url).get();
//we pick the script node
Element script = doc.select("#player > script").get(0);
String text = script.html();
//then we parse the script for the desired uri
final String prefix = "l='";
int p1 = text.indexOf(prefix) + prefix.length();
int p2 = text.indexOf("'", p1);
String uri = text.substring(p1, p2);
System.out.println(uri);
It will give the desired output:
http://vgra001.cda.pl/lqcc6f8b3c8f76d1b58c1234813fcf67c7.mp4?st=SjoQ8DDcnH7pW8_XNNkA3w&e=1416438406
Please note that this is only an example; you will need to add error checking.
Now the explanation:
You almost had it: you had already located the code containing the URI, so it was easy to find the relevant script node. There is a <div class="wrapqualitybtn"> near the script tag, and the element that contains both that div and the script tag is <div id="player" ... >, the script tag's parent node.
Once you have the script node, all that remains is string parsing. Parsing JavaScript code this way is fragile, because a small change in the page's code can break your parser, but in this case I think looking for l=' is a solid bet.
A few tips:
When a page uses jQuery, you can use jQuery in the browser console too. If you enter $('#player > script')[0] in the console, you will see your script tag.
You can search the DOM of a page in your browser's developer tools (F12), then right-click a node and choose Copy CSS Path (in Chrome; Firefox has something similar) to get a selector usable in jsoup.
For more resilient script parsing, you could use regular expressions instead of a plain indexOf search.
I hope this helps, and excuse the verbosity.
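As a rough illustration of the regular-expression tip, here is a minimal sketch in JavaScript, assuming the script text still contains an assignment of the form l='...'. The function name and the sample string are hypothetical; the same pattern should carry over to java.util.regex once the backslashes are escaped for a Java string literal.

```javascript
// Hypothetical helper: pull the URI out of a script body that contains l='...'.
function extractPlayerUri(scriptText) {
  // \b keeps names that merely end in "l" (such as url=) from matching;
  // \s* tolerates optional whitespace around the equals sign.
  const match = /\bl\s*=\s*'([^']*)'/.exec(scriptText);
  return match ? match[1] : null; // null signals "not found" instead of throwing
}

// Made-up script text, for illustration only.
const sample = "var player = {}; l='http://example.com/video.mp4?st=abc&e=123'; play();";
console.log(extractPlayerUri(sample));
```

Unlike the indexOf version, this returns null when the marker is missing, which makes the error checking mentioned above straightforward.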

Related

How to apply styles for appended element? [duplicate]

I have a little testcase over at:
http://jsfiddle.net/9xwUx/1/
The code boils down to the following (given a node with id "target"):
var string = '<div class="makeitpink">this should be pink, but is not</div>';
var parser = new DOMParser();
var domNode = parser.parseFromString(string,"text/xml");
document.getElementById("target").appendChild(domNode.firstChild);
If you run the test case, inspect the target node via Firebug or the Chrome web inspector, select any node within the body tag of jsfiddle's iframe, choose "edit as HTML", add a random character anywhere as text (not as an attribute of a DOM node, to be clear), and "save", the style is applied. But not before that.
To say that i'm confused is an understatement.
Can anybody please clarify what is going on here?
Thanks.
You can change the mime type to text/html and do the following:
var parser = new DOMParser()
var doc = parser.parseFromString(markup, 'text/html')
return doc.body.firstChild
I didn't test on every browser but it works on Chrome and Firefox. I don't see any reason it wouldn't work elsewhere.
A bit late, but the reason is that you parsed the string using the text/xml option, which means the results are XML nodes, which don't have CSS applied to them. When you right-click and choose "edit as HTML", the browser reinterprets them as HTML, and the change in the element causes a redraw that reapplies the CSS.
I've been parsing mine using the relatively hack-ish, yet definitely working, method of creating a temporary element and setting its innerHTML property, making the browser do the parsing instead:
var temp = document.createElement("div")
//assuming you have some HTML partial called 'fragment'
temp.innerHTML = fragment
return temp.firstChild
Which you've noted in your jsfiddle. Basically it boils down to the output of the DOMParser being an instance of XMLDocument when you use the text/xml option.

href attribute set, javascript sees as empty string

I am experiencing odd behaviour in Chrome (v43.0.2357.134) whereby I am reading an anchor element's .href attribute, but in specific circumstances its value is an empty string.
I would like the .href value to be populated on all anchors.
Issue
Specifically, this is what is being observed:
//Bad (unwanted) behaviour
var currentElem = ; //Code to pick out an anchor element
console.info(currentElem.href); //"" (empty string)
console.info(currentElem.getAttribute('href')); //"path/to/other/page.html"
Edited to add/clarify: Note that in this screenshot, at the point of reaching the fourth line of code the value of nextPageUri is an empty string (otherwise would not have reached the debugger; line). The fifth line then populates nextPageUri with the .getAttribute('href') value, hence the value showing next to line two.
This is what is (correctly) seen within Firefox, and on the first TWO DOMs via Chrome:
//Good (desired) behaviour
var currentElem = ; //Code to pick out an anchor element
console.info(currentElem.href); //"http://example.org/root/dir/path/to/other/page.html"
console.info(currentElem.getAttribute('href')); //"path/to/other/page.html"
Background
Context: This is within a script to inline multiple pages of search results to a single page, and the anchor elements are located within a DOM retrieved via xmlHttpRequest. The code runs perfectly via Firefox for >100 pages.
Confusingly, the incorrect behaviour described above only occurs on the third and subsequent requests in the Chrome browser.
This is an issue with Chromium/Blink-based browsers: if you use DOMParser to parse a string into a document, the href property of any link with a relative URI (i.e. one that doesn't start with http[s]) evaluates to an empty string.
To quote tkent from Chromium issue 291791:
That's because a document created by DOMParser doesn't have baseURI. Without baseURI, relative URI references are assumed as invalid.
Same thing happens if you use createHTMLDocument. (Chromium issue 568886)
Also, based on this test code posted by scarfacedeb on Github, src properties also exhibit the same behavior with relative URIs.
As you have pointed out, using getAttribute() instead of the dot notation works fine.
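If the absolute URL is still needed, one option is to resolve the raw attribute value yourself with the standard URL constructor. This is only a sketch: the helper name and the page URL below are assumptions, and in the real script the base would be whatever URL was passed to the XMLHttpRequest.

```javascript
// Hypothetical helper: resolve a raw href attribute against the URL the page was fetched from.
function resolveHref(rawHref, pageUrl) {
  // new URL(relative, base) performs standard relative-URL resolution
  // and throws a TypeError if the result is not a valid URL.
  return new URL(rawHref, pageUrl).href;
}

// Assumed URL of the fetched results page, for illustration only.
const pageUrl = "http://example.org/root/dir/results.html";
console.log(resolveHref("path/to/other/page.html", pageUrl));
```

In the browser this would be called as resolveHref(currentElem.getAttribute('href'), pageUrl), which avoids depending on the parsed document having a base URI.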
Chrome's "element.href" doesn't act any differently on the third try than it does on the first two. You mentioned that you are paginating when this error happens. How does the "href" attribute on the Next Page link get set each time you arrive at the page? It seems likely that your code to evaluate the element's href attribute is simply running before the href is set, as evidenced by your debugger being able to evaluate it after a pause.
Try to reproduce this issue in a Plunker.
I know you're checking if nextPageUri is empty.
But, could you try always using
nextPageUri = currentElem.getAttribute('href');
and see if that works?
I experienced a similar problem when using a DOMParser to parse text/html pages returned by Ajax requests and then reading the href of <a> elements inside them.
For illustration, this is how I'm using the parser:
var parser = new DOMParser();
$.ajax({....}).done(function(request){
    var page = parser.parseFromString(request, 'text/html');
});
Test yourself
If you want to test the behaviour of .href and .getAttribute("href") yourself, please run this code at chrome dev tools console:
parser = new DOMParser(); // create your DOMParser
// the next line creates a "document" element with an <a> tag inside it
parsed_page = parser.parseFromString('<a href="test">click here</a>', 'text/html');
link = parsed_page.getElementsByTagName('a')[0]; // locate your <a> tag
link.href; // this line returns ""
link.getAttribute('href'); // this line returns "test"

Remove HTML tags and entites from string coming from server

In an app I receive some HTML text: since the app can't display (interpret) HTML, I need to remove any HTML tag and entity from the string I receive from the server.
I tried the following, but it removes only the HTML tags, not the entities (e.g. &nbsp;):
stringFromServer.replace(/(<([^>]+)>)/ig,"");
Any help is appreciated.
Disclaimer: I need a pure JavaScript solution (no jQuery, Underscore, etc.).
[UPDATE] I'm reading all your answers now and I forgot to mention that I'm using JavaScript BUT the environment is not a web page, so I have no DOM.
You can try something like this:
var placeholder = document.createElement('div');
placeholder.innerHTML = stringFromServer;
var theText = placeholder.innerText;
.innerText only grabs text content from the element.
However, since it appears you don't have access to any DOM manipulation at all, you're probably going to have to use some kind of HTML parser, like these:
https://www.npmjs.org/package/htmlparser
http://ejohn.org/blog/pure-javascript-html-parser/
A solution without using regexes or phantom divs can be found on Mozilla's MDN.
I put the code in a JSfiddle here:
var sMyString = "<a id=\"a\"><b id=\"b\">hey!<\/b><\/a>";
var oParser = new DOMParser();
var oDOM = oParser.parseFromString(sMyString, "text/xml");
// print the name of the root element or error message
alert(oDOM.documentElement.nodeName == "parsererror" ?
"error while parsing" : oDOM.documentElement.textContent);
Alternatively, parse the HTML snippet in a new document and do your dom manipulations from that (if you'd rather keep it separate from the current document):
var tmpDoc=document.implementation.createHTMLDocument("");
tmpDoc.body.innerHTML="<a href='#'>some text</a><p style=''> more text</p>";
tmpDoc.body.textContent;
tmpDoc.body.textContent evaluates to:
some text more text
stringFromServer.replace(/(<([^>]+)>|&[^;]+;)/ig, "")
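Deleting entities outright also drops characters you probably want to keep, such as the & behind &amp;. Below is a minimal DOM-free sketch that strips tags and then decodes numeric entities plus a small set of named ones. The helper name and the entity table are assumptions for illustration; a full decoder would need the complete HTML entity list.

```javascript
// Assumed, deliberately tiny entity table; extend as needed.
// Note: nbsp is mapped to a plain space here, which is a simplification.
const NAMED_ENTITIES = { amp: "&", lt: "<", gt: ">", quot: '"', apos: "'", nbsp: " " };

// Hypothetical helper: strip tags first, then decode entities.
function htmlToText(input) {
  // Remove anything that looks like a tag.
  const withoutTags = input.replace(/<[^>]+>/g, "");
  // Decode &#33; / &#x21; style and known named entities; leave anything else untouched.
  return withoutTags.replace(/&(#x[0-9a-f]+|#[0-9]+|[a-z][a-z0-9]*);/gi, (match, body) => {
    if (body[0] === "#") {
      const isHex = body[1] === "x" || body[1] === "X";
      const code = parseInt(body.slice(isHex ? 2 : 1), isHex ? 16 : 10);
      return code <= 0x10ffff ? String.fromCodePoint(code) : match;
    }
    return Object.prototype.hasOwnProperty.call(NAMED_ENTITIES, body)
      ? NAMED_ENTITIES[body]
      : match;
  });
}

console.log(htmlToText("<p>Fish &amp; chips&nbsp;&#33;</p>"));
```

Stripping tags before decoding matters: text that was escaped as &lt;b&gt; ends up as the literal characters <b> rather than being removed as if it were a tag.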

Is there a way to prevent loading of images in a jQuery objectified HTML fragment

I have an HTML fragment that I'm objectifying via jQuery in order to extract some data from it. The fragment contains image resources that I don't want the browser to download. Is there a way to do that?
A simplified version of my current code:
var html = '<p class="data">Blah Blah</p>...<img src="/a/b/...png">...<div>...</div>';
var obj = $(html); // this makes the browser download the contained images as well!!!
var myData = {
    item_1: obj.find('.data:first').text(),
    item_2: obj.find('.data2:first').text(),
    .... // and so on..
};
Unless you think there will be instances of the substring src= in the string that matter to you, you could do:
html = html.replace(/src=/g, "data-src=");
or possibly
html = html.replace(/src\s*=/g, "data-src=");
...to allow for whitespace. That way, the browser won't see the src attribute, and won't trigger the download.
Sometimes the direct approach is best. Of course, if you think there may be src= substrings that will matter in terms of what you're trying to extract, then...
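A slightly stricter variant of the same idea is sketched below, assuming every src attribute in the fragment is preceded by whitespace, as it is in normal markup. The helper name is hypothetical. Requiring the leading whitespace means an attribute that is already called data-src is left alone.

```javascript
// Hypothetical helper: rename src attributes so the browser never sees them.
function neutralizeSrc(html) {
  // (\s) requires whitespace before "src", so "data-src" does not match;
  // (\s*=) tolerates optional whitespace before the equals sign.
  return html.replace(/(\s)src(\s*=)/gi, "$1data-src$2");
}

const fragment = '<p class="data">Blah</p><img src="/a/b.png"><img data-src="x">';
console.log(neutralizeSrc(fragment));
```

The neutralized string can then be passed to $(...) as before, and the original image path stays available through the data-src attribute if it is needed later.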

Remove formatting tags from string body of email

How do you remove all formatting tags when calling:
GmailApp.getInboxThreads()[0].getMessages()[0].getBody()
so that only the readable text remains?
The formatting can be destroyed; the body text only needs to be parsed. But markup such as:
"&"
<br>
and possibly others, need to be removed.
Even though there's no DOM in Apps Script, you can parse out HTML and get the plain text this way:
function getTextFromHtml(html) {
    return getTextFromNode(Xml.parse(html, true).getElement());
}
function getTextFromNode(x) {
    switch(x.toString()) {
        case 'XmlText': return x.toXmlString();
        case 'XmlElement': return x.getNodes().map(getTextFromNode).join('');
        default: return '';
    }
}
calling
getTextFromHtml("hello <div>foo</div>& world <br /><div>bar</div>!");
will return
"hello foo& world bar!".
To explain, Xml.parse with the second param as "true" parses the document as an HTML page. We then walk the document (which will be patched up with missing HTML and BODY elements, etc. and turned into a valid XHTML page), turning text nodes into text and expanding all other nodes.
This is admittedly poorly documented; I wrote this by playing around with the Xml object and logging intermediate results until I got it to work. We need to document the Xml stuff better.
I noticed you are writing a Google Apps Script. There's no DOM in Google Apps Script, nor can you create elements and read the innerText property.
getBody() gives you the email's body in HTML. You can replace tags with this code:
var html = GmailApp.getInboxThreads()[0].getMessages()[0].getBody();
html=html.replace(/<\/div>/ig, '\n');
html=html.replace(/<\/li>/ig, '\n');
html=html.replace(/<li>/ig, ' *');
html=html.replace(/<\/ul>/ig, '\n');
html=html.replace(/<\/p>/ig, '\n');
html=html.replace(/<br\s*\/?>/ig, '\n');
html=html.replace(/<[^>]+>/ig, '');
Maybe you can find more tags to replace. Remember that this code isn't meant for arbitrary HTML, only for the HTML returned by getBody(). Gmail has its own way of formatting the body and uses only a subset of the existing HTML tags, so this Gmail-specific code can be shorter.
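To try the replacements on a sample string outside Apps Script, they can be wrapped in a small function. This is a sketch only: the function name and sample input are made up, the closing-tag rules are folded into one pattern, and the <br> pattern also tolerates whitespace before the slash.

```javascript
// Hypothetical helper wrapping the replacements so they can be tried on any string.
function gmailHtmlToText(html) {
  return html
    .replace(/<\/(div|li|ul|p)>/gi, "\n") // block-level closing tags become line breaks
    .replace(/<li>/gi, " *")               // list items get a bullet marker
    .replace(/<br\s*\/?>/gi, "\n")         // <br>, <br/> and <br /> become line breaks
    .replace(/<[^>]+>/g, "");              // drop every remaining tag
}

// Made-up body, for illustration only.
const sampleBody = "<div>Hello<br>world</div><ul><li>one</li></ul>";
console.log(JSON.stringify(gmailHtmlToText(sampleBody)));
```

Entities such as &amp; are not touched by this function and would need a separate decoding step.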
I found an easier way to accomplish this task.
Use the htmlBody advanced argument of sendEmail(). Here's an example:
var threads = GmailApp.search ('is:unread'); //searches for unread messages
var messages = GmailApp.getMessagesForThreads(threads); //gets messages in 2D array
for (var i = 0; i < messages.length; ++i)
{
    var j = messages[i].length; //message count; the most recent message contains the earlier ones as well, so sending only it reduces redundancy
    var messageBody = messages[i][j-1].getBody(); //gets body of message in HTML
    var messageSubject = messages[i][j-1].getSubject();
    GmailApp.sendEmail("dummyuser@dummysite.com", messageSubject, "", {htmlBody: messageBody});
}
First I find all the threads containing unread messages. Then I load the messages in those threads into a two-dimensional array using the getMessagesForThreads() method of GmailApp. The for loop runs once per thread. I set j to the thread's message count so that I can send only the most recent message in the thread (j-1). I get the HTML body of the message with getBody() and the subject with getSubject(). I use sendEmail(recipients, subject, body, optAdvancedArgs) to send the email and process the HTML body. The result is a properly formatted email with all of its HTML features intact. The documentation for these methods can be found here: https://developers.google.com/apps-script/service_gmail
I hope this helps. The manual parsing method does work, but I still found bits and pieces of HTML left hanging around, so I gave this a try and it worked for me. If I find any issues in the long run I will update this post. So far so good!
Google now has the getPlainBody() function, which returns the body of an email as plain text. It is a method of the GmailMessage class.
I had been using a script that converts emails to tasks, and Google broke it with a change to the functionality used in Corey's answer above. I've replaced it with the following.
var taskNote = ((thread.getMessages()[0]).getPlainBody()).substring(0,1000);
I am not sure what you mean by .getBody() - is this supposed to return a DOM body element?
However, the simplest solution for removing HTML tags is probably to let the browser render the HTML and ask it for the text content:
var myHTMLContent = "hello & world <br />!";
var tempDiv = document.createElement('div');
tempDiv.innerHTML = myHTMLContent;
// retrieve the cleaned content:
var textContent = tempDiv.innerText;
With the above example, the textContent variable will contain the text
"hello & world
!"
(Note the line break due to the <br /> tag.)
