Jsoup: Parsing JavaScript fields from an HTML document?
I'm fairly new to Jsoup, and I've had no issues parsing with Element.select on tags or id values. The issue I'm having is how to screen-scrape JavaScript code in the page. Here I load the document:
Document doc = Jsoup.connect(pageUrl)
.userAgent(Agent)
.timeout(5000)
.get();
The JavaScript field values I'm trying to extract are the following:
arrayGPSLocation["0"] = "-19473982376,6848295867";
arrayGPSLocation["1"] = "-19473982376,6848296245";
Since these array values are not inside a standard HTML tag (<>), is Jsoup the appropriate way to do this? I like Jsoup's API. The only other method is hacking together a String routine...
i.e.:
String key = "arrayGPSLocation[\"" + counter + "\"]";
int start = pageBuffer.indexOf(key);
int end = pageBuffer.indexOf(";", start); // first semicolon after the match, not in the whole page
String result = pageBuffer.substring(start, end);
This pseudo-code example would have a serious performance problem when parsing a large page. Does anyone know how to accomplish this with Jsoup, or should I write my own scraper?
All you can do with Jsoup is select the element that contains the JavaScript code, get its content as a String, and work with that string, much as you are doing in your example.
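As a rough sketch of the "work with this string" step, a single regular-expression pass over the script text avoids repeated indexOf scans. The class name GpsLocationExtractor and the method extract are hypothetical names chosen for illustration, and the sketch assumes the assignments always have the exact shape shown in the question. With Jsoup, the input string would typically come from calling data() on a selected script element; here it is a hard-coded sample so the code runs without the library.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Jsoup: pulls arrayGPSLocation entries out of script text.
class GpsLocationExtractor {

    // Matches assignments such as: arrayGPSLocation["0"] = "-19473982376,6848295867";
    // Group 1 is the numeric index, group 2 is the quoted value.
    private static final Pattern ASSIGNMENT = Pattern.compile(
            "arrayGPSLocation\\[\"(\\d+)\"\\]\\s*=\\s*\"([^\"]*)\"\\s*;");

    // Scans the script text once and returns index -> value in order of appearance.
    static Map<Integer, String> extract(String scriptText) {
        Map<Integer, String> result = new LinkedHashMap<>();
        Matcher m = ASSIGNMENT.matcher(scriptText);
        while (m.find()) {
            result.put(Integer.parseInt(m.group(1)), m.group(2));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample input; in the real page this would be the text of the script element.
        String script = "var arrayGPSLocation = [];\n"
                + "arrayGPSLocation[\"0\"] = \"-19473982376,6848295867\";\n"
                + "arrayGPSLocation[\"1\"] = \"-19473982376,6848296245\";\n";
        extract(script).forEach((index, value) -> System.out.println(index + " -> " + value));
    }
}
```

If the page formats the assignments differently (single quotes, no quotes around the index, extra whitespace), the pattern would need to be adjusted accordingly.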
Related
HTML entities in CSS content (convert entities to escape-string at runtime)
I know that HTML entities like &nbsp; or &ouml; or &eth; cannot be used inside CSS like this: div.test:before { content:"text with html-entities like &nbsp; or &ouml; or &eth;"; } There is a good question with good answers dealing with this problem: Adding HTML entities using CSS content. But I am reading the strings that are put into the CSS content from a server via AJAX. The JavaScript running on the user's client receives text with embedded HTML entities and creates style content from it, instead of putting it as a text element into an HTML element's content. This method helps against thieves who try to steal my content via copy&paste. Text that is not part of the HTML document (but part of the CSS content) is really hard to copy. This method works fine. There is only this nasty problem with those HTML entities. So I need to convert HTML entities into Unicode escape sequences at runtime. I can do this either on the server with a Perl script or on the client with JavaScript, but I don't want to write a subroutine that contains a complete list of all existing named entities. There are more than 2200 named entities in HTML5, as listed here: http://www.w3.org/TR/2011/WD-html5-20110113/named-character-references.html And I don't want to change my subroutine every time this list changes. (Numeric entities are no problem.) Is there any trick to perform this conversion with JavaScript? Maybe by adding, reading and removing content to the DOM? (I am using jQuery)
I've found a solution:
var text = 'Text that contains html-entities';
var myDiv = document.createElement('div');
$(myDiv).html(text);
text = $(myDiv).text();
$('#id_of_a_style-element').html('#id_of_the_protected_div:before{content:"' + text + '"}');
Writing the question got me halfway to this answer. I hope it helps others too.
JavaScript - search for HTML elements in string
I have a string with HTML elements. There are tables with captions. I need to find the table whose caption contains certain text and then return this table as a string. What is the best way to do this with plain JavaScript, without any libraries? For example, this is the initial string: <table border="1"><caption><strong>First</strong></caption><tbody><tr><td>...</td></tr></tbody></table><table border="1"><caption><strong>Result</strong></caption><tbody><tr><td>...</td></tr></tbody></table><table border="1"><caption><strong>Last</strong></caption><tbody><tr><td>...</td></tr></tbody></table> I want to get this string: <table border="1"><caption><strong>Result</strong></caption><tbody><tr><td></td></tr></tbody></table> Any advice or algorithm on how to solve this problem efficiently? The challenge is to solve it with JavaScript without using any third-party libraries and also without converting the text into XML or something similar (because some of the HTML code is not well formed and it causes errors).
I have not had time to completely test this, but you might be able to use a regular expression with the match() function. Assuming your table string is in a variable called str, then something along the lines of var res = str.match(/<table[\s\S]*?<\/table>/g); res will be an array of the substrings that begin with '<table' and end with '</table>', which you could then check to see which one contains the caption that you need. Hope that helps!
Unable to remove html tags from response JSON
Hi, I am new to AngularJS. I am having a problem parsing JSON data into the proper format. The JSON response itself contains HTML-formatted data (it contains HTML tags like <BR> etc). If I check the response in a browser it displays fine, but on a device (tablet, mobile) the HTML tags are shown as text as well. I am using AngularJS to bind the JSON response to the DOM. Is there any way to simply ignore HTML tags in jQuery or in AngularJS? At the same time I don't want to remove the HTML tags, as they are necessary to define "new line", "space", "table tag" etc. A sample response I am getting looks like this: A heavier weight, stretchy, wrinkle resistant fabric.<BR><BR>Fabric Content:<BR>100% Polyester<BR><BR>Wash Care:<BR> If I apply the binding using {{pdp.desc}}, the HTML tags are also displayed. Is there any way to accomplish this? I have added ng-bind-html-unsafe="pdp.desc", but the "BR" tags are still appearing.
Unwanted HTML tags can be removed using a regular expression; try this: str.replace(/<\/?[^>]+>/gi, '')
Try using three pairs of braces: {{{pdp.desc}}} It works in Handlebars, so it may work in your case too.
Use a JS HTML parser. var pattern = @"<(img|a)[^>]*>(?<content>[^<]*)<"; var regex = new Regex(pattern); var m = regex.Match(sSummary); if ( m.Success ) { sResult = m.Groups["content"].Value; } Courtesy of Stack Overflow.
Get the current html content of UIWebView?
I load an HTML template from the NSBundle using UIWebView's loadRequest method. The HTML template itself has a large number of textboxes (say 20). After I fill in all the textboxes, I want the HTML with these filled-in values. I tried [myWebView stringByEvaluatingJavaScriptFromString:@"document.body.innerHTML"] //even body.outerHTML, documentElement.innerHTML won't work But it gives me the raw HTML template as it was before the values were filled in. How can I do it? I am thinking of doing this by reading the content of each textbox with JavaScript, [myWebView stringByEvaluatingJavaScriptFromString:@"document.getElementById('name').value"] and then replacing the corresponding strings in the HTML. Is there a better way to do this? Or are there any JavaScript methods to read the current content of the web page?
You are completely right that .innerHTML and .outerHTML won't give you the updated state of the DOM. However, since you are already querying via JavaScript, you can create a JSON string containing a list of all the currently entered values: NSString *jsonString = [myWebView stringByEvaluatingJavaScriptFromString:@"(function(fields) { var O=[]; for(var i=0; i<fields.length;i++) {O.push(fields[i].value);} return JSON.stringify(O); })(document.querySelectorAll('input[type=\"text\"]'))"]; You can then parse that in your native code, and if it worked it should come out as an NSArray of strings.
Create 2d array from string
I have the following string : [[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,],[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,],] How can I create a 2d array of strings from it ? EDIT I've removed html tags since they're not the problem here. Also I'd like to do it without using any additional libs to keep it lightweight.
Apart from the HTML tags in it, it would be valid JSON. You could remove the HTML tags and parse it using any library that handles JSON, like jQuery:
var arr = $.parseJSON(theString.replace(/<br\/>/g,''));
It would also be valid JavaScript code with the HTML tags removed, so if you have full control over where the string comes from and are certain that it can never contain any harmful code, you could use the eval function to execute the string:
// Warning: 'eval' is subject to code injection vulnerabilities
var arr = eval(theString.replace(/<br\/>/g,''));
You will need to remove the <br/> from the string. Then you should be able to do: var my2darray = eval(mystring);