Best way to parse HTML in JavaScript

I am having a lot of trouble learning RegExp and coming up with a good algorithm to do this. I have this string of HTML that I need to parse. Note that when I am parsing it, it is still a string object and not yet HTML on the browser as I need to parse it before it gets there. The HTML looks like this:
<html>
<head>
<title>Geoserver GetFeatureInfo output</title>
</head>
<style type="text/css">
table.featureInfo, table.featureInfo td, table.featureInfo th {
border:1px solid #ddd;
border-collapse:collapse;
margin:0;
padding:0;
font-size: 90%;
padding:.2em .1em;
}
table.featureInfo th {
padding:.2em .2em;
font-weight:bold;
background:#eee;
}
table.featureInfo td{
background:#fff;
}
table.featureInfo tr.odd td{
background:#eee;
}
table.featureInfo caption{
text-align:left;
font-size:100%;
font-weight:bold;
text-transform:uppercase;
padding:.2em .2em;
}
</style>
<body>
<table class="featureInfo2">
<tr>
<th class="dataLayer" colspan="5">Tibetan Villages</th>
</tr>
<!-- EOF Data Layer -->
<tr class="dataHeaders">
<th>ID</th>
<th>Latitude</th>
<th>Longitude</th>
<th>Place Name</th>
<th>English Translation</th>
</tr>
<!-- EOF Data Headers -->
<!-- Data -->
<tr>
<!-- Feature Info Data -->
<td>3394</td>
<td>29.1</td>
<td>93.15</td>
<td>བསྡམས་གྲོང་ཚོ།</td>
<td>Dam Drongtso </td>
</tr>
<!-- EOF Feature Info Data -->
<!-- End Data -->
</table>
<br/>
</body>
</html>
and I need to get it like this:
3394,
29.1,
93.15,
བསྡམས་གྲོང་ཚོ།,
Dam Drongtso
Basically an array. Even better would be if each value were matched to its field header and to the table it came from, which look like this:
Tibetan Villages
ID
Latitude
Longitude
Place Name
English Translation
Finding out JavaScript does not support wonderful mapping was a bummer and I have what I want working already. However it is VERY VERY hard coded and I'm thinking I should probably use RegExp to handle this better. Unfortunately I am having a real tough time :(. Here is my function to parse my string (very ugly IMO):
function parseHTML(html){
//Getting the layer name
alert(html);
//Lousy attempt at RegExp
var somestring = html.replace('/m//\<html\>+\<body\>//m/',' ');
alert(somestring);
var startPos = html.indexOf('<th class="dataLayer" colspan="5">');
var length = ('<th class="dataLayer" colspan="5">').length;
var endPos = html.indexOf('</th></tr><!-- EOF Data Layer -->');
var dataLayer = html.substring(startPos + length, endPos);
//Getting the data headers
startPos = html.indexOf('<tr class="dataHeaders">');
length = ('<tr class="dataHeaders">').length;
endPos = html.indexOf('</tr><!-- EOF Data Headers -->');
var newString = html.substring(startPos + length, endPos);
newString = newString.replace(/<th>/g, '');
newString = newString.substring(0, newString.lastIndexOf('</th>'));
var featureInfoHeaders = new Array();
featureInfoHeaders = newString.split('</th>');
//Getting the data
startPos = html.indexOf('<!-- Data -->');
length = ('<!-- Data -->').length;
endPos = html.indexOf('<!-- End Data -->');
newString = html.substring(startPos + length, endPos);
newString = newString.substring(0, newString.lastIndexOf('</tr><!-- EOF Feature Info Data -->'));
var featureInfoData = new Array();
featureInfoData = newString.split('</tr><!-- EOF Feature Info Data -->');
for(var s = 0; s < featureInfoData.length; s++){
startPos = featureInfoData[s].indexOf('<!-- Feature Info Data -->');
length = ('<!-- Feature Info Data -->').length;
endPos = featureInfoData[s].lastIndexOf('</td>');
featureInfoData[s] = featureInfoData[s].substring(startPos + length, endPos);
featureInfoData[s] = featureInfoData[s].replace(/<td>/g, '');
featureInfoData[s] = featureInfoData[s].split('</td>');
}//end for
alert(featureInfoData);
//Put all the feature info in one array
var featureInfo = new Array();
var len = featureInfoData.length;
for(var j = 0; j < len; j++){
featureInfo[j] = new Object();
featureInfo[j].id = featureInfoData[j][0];
featureInfo[j].latitude = featureInfoData[j][1];
featureInfo[j].longitude = featureInfoData[j][2];
featureInfo[j].placeName = featureInfoData[j][3];
featureInfo[j].translation = featureInfoData[j][4];
}//end for
//This can be ignored for now...
var string = redesignHTML(featureInfoHeaders, featureInfo);
return string;
}//end parseHTML
So as you can see if the content in that string ever changes, my code will be horribly broken. I want to avoid that as much as possible and try to write better code. I appreciate all the help and advice you can give me.

Do the following steps:
Create a new documentFragment
Put your HTML string in it
Use selectors to get what you want
Why do all the parsing work - which won't work anyways, since HTML is not parsable via RegExp - when you have the best HTML parser available? (the Browser)
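The steps above could be sketched roughly as follows. This is only a sketch under some assumptions: it uses a detached <template> element (whose content is a DocumentFragment, not available in very old browsers), it takes the selectors from the sample markup in the question, and the function names zipRow, texts and parseFeatureInfo are made up for illustration.

```javascript
// Pair each header with the cell in the same column,
// e.g. (["ID"], ["3394"]) gives { ID: "3394" }.
function zipRow(headers, cells) {
  var row = {};
  for (var i = 0; i < headers.length; i++) {
    row[headers[i]] = i < cells.length ? cells[i] : "";
  }
  return row;
}

// Collect the trimmed text of every element under `root` matching `selector`.
function texts(root, selector) {
  return Array.prototype.map.call(root.querySelectorAll(selector), function (el) {
    return el.textContent.trim();
  });
}

// Browser-only: let the browser parse the string into a detached fragment,
// then read the table with selectors instead of indexOf/substring.
function parseFeatureInfo(html) {
  var template = document.createElement("template");
  template.innerHTML = html;
  var table = template.content.querySelector("table");
  if (!table) {
    return null; // no table in the markup
  }
  var headers = texts(table, "tr.dataHeaders th");
  var rows = [];
  Array.prototype.forEach.call(table.querySelectorAll("tr"), function (tr) {
    var cells = texts(tr, "td");
    if (cells.length > 0) {
      rows.push(zipRow(headers, cells)); // only data rows contain <td>
    }
  });
  return { layer: texts(table, "th.dataLayer")[0], headers: headers, rows: rows };
}
```

In a browser, parseFeatureInfo(html).rows[0]["Place Name"] would then give the place name of the first data row, and nothing is ever attached to the visible page.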

You can use jQuery to easily traverse the DOM and create an object with the structure automatically.
var $dom = $('<html>').html(the_html_string_variable_goes_here);
var featureInfo = {};
$('table:has(.dataLayer)', $dom).each(function(){
var $tbl = $(this);
var section = $tbl.find('.dataLayer').text();
var obj = [];
var $structure = $tbl.find('.dataHeaders');
var structure = $structure.find('th').map(function(){return $(this).text().toLowerCase();});
var $datarows= $structure.nextAll('tr');
$datarows.each(function(i){
obj[i] = {};
$(this).find('td').each(function(index,element){
obj[i][structure[index]] = $(element).text();
});
});
featureInfo[section] = obj;
});
Working Demo
The code works with multiple tables of different structures, and with multiple data rows inside each table.
featureInfo holds the final structure and data. Because the header names are lowercased when the keys are built, it can be accessed like
alert( featureInfo['Tibetan Villages'][0]['english translation'] );
or
alert( featureInfo['Tibetan Villages'][0].id );

The "correct" way to do it is with DOMParser. Do it like this:
var parsed = new DOMParser().parseFromString(htmlString, 'text/html');
Or, if you're worried about browser compatibility, use the polyfill from the MDN documentation:
/*
* DOMParser HTML extension
* 2012-09-04
*
* By Eli Grey, http://eligrey.com
* Public domain.
* NO WARRANTY EXPRESSED OR IMPLIED. USE AT YOUR OWN RISK.
*/
/*! @source https://gist.github.com/1129031 */
/*global document, DOMParser*/
(function(DOMParser) {
"use strict";
var
DOMParser_proto = DOMParser.prototype
, real_parseFromString = DOMParser_proto.parseFromString
;
// Firefox/Opera/IE throw errors on unsupported types
try {
// WebKit returns null on unsupported types
if ((new DOMParser).parseFromString("", "text/html")) {
// text/html parsing is natively supported
return;
}
} catch (ex) {}
DOMParser_proto.parseFromString = function(markup, type) {
if (/^\s*text\/html\s*(?:;|$)/i.test(type)) {
var
doc = document.implementation.createHTMLDocument("")
;
if (markup.toLowerCase().indexOf('<!doctype') > -1) {
doc.documentElement.innerHTML = markup;
}
else {
doc.body.innerHTML = markup;
}
return doc;
} else {
return real_parseFromString.apply(this, arguments);
}
};
}(DOMParser));

Change server-side code if you can (add JSON)
If you are the one generating the HTML on the server side, you could also generate JSON there and pass it inside the HTML along with the content. You wouldn't have to parse anything on the client side, and all the data would be immediately available to your client scripts.
You could put the JSON in the table element as a data attribute value. It has to be strict JSON (double-quoted keys and strings) for JSON.parse to accept it, so wrap the attribute value in single quotes:
<table class="featureInfo2" data-json='{"ID":3394, "Latitude":29.1, "Longitude":93.15, "PlaceName":"བསྡམས་གྲོང་ཚོ།", "Translation":"Dam Drongtso"}'>
...
</table>
Or you could add data attributes to the TDs that contain data and read only those using jQuery selectors, building a JavaScript object out of them. No need for RegExp parsing.
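Reading such an attribute back on the client could look something like the sketch below. It assumes the attribute is called data-json and holds strict JSON (double-quoted keys and strings), since JSON.parse rejects unquoted keys and single-quoted strings. The helper name readFeature is made up for illustration.

```javascript
// Read and decode the JSON stored in an element's data-json attribute.
// `element` is anything with getAttribute(), for example the result of
// document.querySelector("table.featureInfo2") in a browser.
function readFeature(element) {
  var raw = element.getAttribute("data-json");
  if (raw === null) {
    return null; // attribute not present
  }
  return JSON.parse(raw); // throws SyntaxError if the value is not strict JSON
}

// Browser usage (assumes the table from the example is on the page):
// var feature = readFeature(document.querySelector("table.featureInfo2"));
// alert(feature.PlaceName);
```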

Use John Resig's* pure javascript html parser
See demo here
*John Resig is the creator of jQuery

I had a similar requirement and, not being that experienced with JavaScript, I let jQuery handle it for me with parseHTML and find. In my case I was looking for divs with a particular class name.
function findElementsInHtmlString(document, htmlString, query) {
var domArray = $.parseHTML(htmlString, document),
dom = $();
// create the dom collection from the array
$.each(domArray, function(i, o) {
dom = dom.add(o);
});
// return a collection of elements that match the query
return dom.find(query);
}
var elementsWithClassBuild = findElementsInHtmlString(document, htmlString, '.build');

Related

How to show table rows based on URL parameter

Currently the page is accessed using a link, for instance help.html?show=charInPw.
The table is written in the following manner:
<table>
...
<tbody class="table-body">
<tr class="pwLen"><td colspan="5" class="subheading">Password Length</td></tr>
<tr class="pwLen"><td class="first">Minimum length</td><td>Y</td><td> </td><td>Y</td><td>Y</td></tr>
<tr class="pwLen"><td class="first">Maximum length</td><td>Y</td><td> </td><td>Y</td><td>Y</td></tr>
<tr class="charInPw"><td colspan="5" class="subheading">Characters in Password</td></tr>
<tr class="charInPw"><td class="first">Minimum numeric characters</td><td>Y</td><td> </td><td>Y</td><td>Y</td></tr>
<tr class="charInPw"><td class="first">Minimum alphabetic characters</td><td>Y</td><td> </td><td>Y</td><td>Y</td></tr>
...
The CSS is as follows:
table tbody tr{
text-align: center;
display: none;
}
(There are also trs in thead and they are always shown by default.)
Then I have some Javascript code as follows (jQuery is not an option):
<script>
var url = new URL(window.location.href);
var c = url.searchParams.get("show");
for (i = 0; i < document.getElementsByClassName(c); i++)
document.getElementsByClassName(c)[i].style.display='table-row';
</script>
However I am not able to get my rows to show.
How should I change the code on the page to show only the rows referenced by the show parameter?
Edit #1: As a test I did the following hard-coding, but it didn't work either!
<script>
var url = new URL(window.location.href);
var c = url.searchParams.get("show");
var trs = document.getElementsByClassName('pwLen');
for (i = 0; i < trs.length; i++)
trs[i].style.display='table-row';
</script>
Edit #2: I have combined two solutions below into one - please see https://jsfiddle.net/tea45p2o/. The demo output shown is what I want, however I am not able to see that when I save the file and open the page in my browser. What is going on?
With much help from BJohn and MrJ, I have come up with my solution as follows:
<script>
function show() {
var url = new URL(window.location.href);
var c = '.' + url.searchParams.get("show");
for (let el of document.querySelectorAll(c)) el.style.display = 'table-row';
}
</script>
Then, I changed my <body> to
<body onLoad="show();">
Use the code below:
for (let el of document.querySelectorAll(c)) el.style.display = 'table-row';
You also need to prepend a dot (.) to the class name that you get from the URL.
Please find the fiddle here
You are missing .length:
for (i = 0; i < document.getElementsByClassName(c).length; i++)
But I think it is more readable this way:
for(let TR_x of [... document.getElementsByClassName(c) ]) {
TR_x.style.display='table-row';
}
but I definitely prefer
document.querySelectorAll('.'+c).forEach(trX=>{ trX.style.display='table-row' })
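Putting the pieces from these answers together, one possible sketch is shown below. The function names selectorFromUrl and showRows are made up for illustration, and the check that the parameter is a simple class name is an extra assumption, added so that an odd value in the URL cannot produce an invalid selector.

```javascript
// Turn the "show" query parameter of a URL into a class selector,
// or return null when the parameter is missing or not a simple class name.
function selectorFromUrl(href) {
  var name = new URL(href).searchParams.get("show");
  if (name === null || !/^[A-Za-z_][\w-]*$/.test(name)) {
    return null;
  }
  return "." + name; // querySelectorAll needs the leading dot
}

// Show every row matching the selector. `doc` is the page's document.
function showRows(doc, selector) {
  if (selector === null) {
    return; // nothing requested, leave all rows hidden
  }
  doc.querySelectorAll(selector).forEach(function (row) {
    row.style.display = "table-row";
  });
}

// Browser usage: run only once the table exists in the DOM.
// document.addEventListener("DOMContentLoaded", function () {
//   showRows(document, selectorFromUrl(window.location.href));
// });
```

Waiting for DOMContentLoaded (or using onLoad as in the accepted solution) matters because a script that runs before the table has been parsed finds no rows to show.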

Get value from a JSON file with multi dimensional array using jQuery $.getJSON method

I've been trying to fetch some values from a JSON file using the $.getJSON method. The first two loops are static, so I wrote the code below to fetch the value of "layers.name". From the third loop on, the data in the layers may or may not be available. How can I fetch the value of every "layers.name" present in the JSON file?
PS: The JSON file is output generated by software, where the layers are presented in this format.
Here is the code I've written so far, which gets the layers of the first two loops.
HTML
<body>
<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.2.1/jquery.min.js"></script>
<script src="test.js"></script>
</body>
JavaScript
$.getJSON('https://api.myjson.com/bins/6atbz', function(data) {
var layer = data.layers.reverse()
for (i=0; i<layer.length; i ++){
name = data.layers[i].name
id= data.layers[i].do_objectID
var className = '.'+id
var main = "<div class=\""+id+"\" data-number=\""+i+"\">"+name+"<\/div>"
$('body').append(main);
var subLayer = data.layers[i].layers.reverse()
for(j=0; j<subLayer.length; j++){
newname = data.layers[i].layers[j].name
$().append(' '+newname);
var subsubLayer = data.layers[i].layers[j]
var sub = "<div class=\""+newname+"\" data-number=\""+j+"\">"+newname+"<\/div>"
$(className).append(sub);
}
}
})
Thanks
Link to Fiddle
I think it's a good idea to use recursion. Here is an example:
var container = document.getElementById("container");
$.getJSON('https://api.myjson.com/bins/6atbz', function(data) {
buildTree(data, container)
})
function buildTree (node, container) {
var layers = node.layers || [];
console.info(node);
layers.forEach(function(item) {
var newContainer = document.createElement('div');
var span = document.createElement('span');
span.innerHTML = item.name;
newContainer.appendChild(span);
container.appendChild(newContainer);
if(item.layers){
buildTree(item, newContainer)
}
});
}
Here is live demo
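If only the names are needed rather than DOM nodes, the same recursion can collect them into a flat list. The sketch below assumes, as in the question, that every node keeps its children in an optional layers array. The function name collectLayerNames and the depth field are made up for illustration.

```javascript
// Walk the tree depth-first and return every layer name with its nesting depth.
function collectLayerNames(node, depth, out) {
  depth = depth || 0;
  out = out || [];
  (node.layers || []).forEach(function (layer) {
    out.push({ name: layer.name, depth: depth });
    collectLayerNames(layer, depth + 1, out); // no-op when layer.layers is missing
  });
  return out;
}

// Usage with the question's endpoint (requires jQuery):
// $.getJSON('https://api.myjson.com/bins/6atbz', function (data) {
//   console.log(collectLayerNames(data));
// });
```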

text string output stops after first space, js/html

I apologize in advance, this is the first Stack Overflow question I've posted. I was tasked with creating a new ADA compliant website for my school district's technology helpdesk. I started with minimal knowledge of HTML and have been teaching myself through W3Schools. So here's my ordeal:
I need to create a page for all of our pdf and html guides. I'm trying to create a somewhat interactable menu that is very simple and will populate a link array from an onclick event, but the title="" text attribute drops everything after the first space and I've unsuccessfully tried using a replace() method since it's coming from an array and not static text.
I know I'm probably supposed to use an example, but my work day is coming to a close soon and I wanted to get this posted so I just copied a bit of my actual code.
So here's what's happening, in example 1 of var gmaildocAlt the tooltip will drop everything after Google, but will show the entire string properly with example 2. I was hoping to create a form input for the other helpdesk personnel to add links without knowing how to code, but was unable to resolve the issue of example 1 with a
var fix = gmaildocAlt.replace(/ /g, "&nb sp;")
//minus the space
//this also happens to break the entire function if I set it below the rest of the other variables
I'm sure there are a vast number of things I'm doing wrong, but I would really appreciate the smallest tip to make my tooltip display properly without requiring a replace method.
// GMAIL----------------------------
function gmailArray() {
var gmaildocLink = ['link1', 'link2'];
var gmaildocTitle = ["title1", "title2"];
var gmaildocAlt = ["Google Cheat Sheet For Gmail", "Google 10-Minute Training For Gmail"];
var gmailvidLink = [];
var gmailvidTitle = [];
var gmailvidAlt = [];
if (document.getElementById("gmailList").innerHTML == "") {
for (i = 0; i < gmaildocTitle.length; i++) {
arrayGmail = "" + gmaildocTitle[i] + "" + "<br>";
document.getElementById("gmailList").innerHTML += arrayGmail;
}
for (i = 0; i < gmailvidTitle.length; i++) {
arrayGmail1 = "";
document.getElementById("").innerHTML += arrayGmail1;
}
} else {
document.getElementById("gmailList").innerHTML = "";
}
}
<div class="fixed1">
<p id="gmail" onclick="gmailArray()" class="gl">Gmail</p>
<ul id="gmailList"></ul>
<p id="calendar" onclick="calendarArray()" class="gl">Calendar</p>
<ul id="calendarList"></ul>
</div>
Building HTML manually with strings can cause issues like this. It's better to build them one step at a time, and let the framework handle quoting and special characters - if you're using jQuery, it could be:
var $link = jQuery("<a></a>")
.attr("href", gmaildocLink[i])
.attr("title", gmaildocAlt[i])
.html(gmaildocTitle[i]);
jQuery("#gmailList").append($link).append("<br>");
Without jQuery, something like:
var link = document.createElement("a");
link.setAttribute("href", gmaildocLink[i]);
link.setAttribute("title", gmaildocAlt[i]);
link.innerHTML = gmaildocTitle[i];
document.getElementById("gmailList").innerHTML += link.outerHTML + "<br>";
If it matters to your audience, setAttribute doesn't work in IE7, and you have to access the attributes as properties of the element: link.href = "something";.
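If you do want to keep building the markup as a string, a tooltip that stops at the first space is the usual symptom of an unquoted attribute value. This is a different technique from the DOM-building approach above, shown only as a sketch: the helper names escapeHtml and buildLink are made up, and it assumes the links are plain anchors with href and title attributes.

```javascript
// Escape the characters that are special inside HTML text and quoted attributes.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build one anchor with every attribute value wrapped in double quotes,
// so spaces and apostrophes in the title stay part of the value.
function buildLink(href, title, label) {
  return '<a href="' + escapeHtml(href) + '" title="' + escapeHtml(title) + '">' +
    escapeHtml(label) + "</a>";
}

// Usage inside the question's loop:
// document.getElementById("gmailList").innerHTML +=
//   buildLink(gmaildocLink[i], gmaildocAlt[i], gmaildocTitle[i]) + "<br>";
```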
If you add ' to either side of the variable strings then it will ensure that the whole value is read as a single string. Initially, it was assuming that the space was exiting the Title attribute.
Hope the below helps!
UPDATE: If you're worried about using apostrophes in the title strings, you can use " instead by escaping it with a backslash (\"). This forces JS to read it as a character and not as part of the code structure. See the example below.
Thanks for pointing this one out guys! Sloppy code on my part.
// GMAIL----------------------------
function gmailArray() {
var gmaildocLink = ['link1', 'link2'];
var gmaildocTitle = ["title1", "title2"];
var gmaildocAlt = ["Google's Cheat Sheet For Gmail", "Google 10-Minute Training For Gmail"];
var gmailvidLink = [];
var gmailvidTitle = [];
var gmailvidAlt = [];
if (document.getElementById("gmailList").innerHTML == "") {
for (i = 0; i < gmaildocTitle.length; i++) {
var arrayGmail = "" + gmaildocTitle[i] + "" + "<br>";
document.getElementById("gmailList").innerHTML += arrayGmail;
}
for (var i = 0; i < gmailvidTitle.length; i++) {
var arrayGmail1 = "";
document.getElementById("").innerHTML += arrayGmail1;
}
} else {
document.getElementById("gmailList").innerHTML = "";
}
}
<div class="fixed1">
<p id="gmail" onclick="gmailArray()" class="gl">Gmail</p>
<ul id="gmailList"></ul>
<p id="calendar" onclick="calendarArray()" class="gl">Calendar</p>
<ul id="calendarList"></ul>
</div>

How to set byte[] property of ActiveX component from Javascript?

I'd like to set RTF formatted calendar entries, but don't know how to pass the byte[] to the ActiveX object, i.e. the RTFBody property.
The following code reads the RTFBody property after some content has been set, so reading the byte[] is working. But when I try to write exactly the same content (+ trailing 0) back, neither a U/Int8Array nor a Scripting.Dictionary works.
Maybe it's possible to work around this with some .NET objects, but I don't know how to instantiate those non-ActiveX components. An alternative solution shouldn't require scripting the formatting, e.g. "go to line 2 and make it bold"; I would like to generate the RTF via a template and only paste the result into the calendar object.
I'm aware that this has to be eventually encoded in Windows-1252, but for a start I simply want to see the same bytes to be written successfully. The script is executed within a HTA context - so script security is not an issue.
<html>
<head>
<hta:application id="foo" applicationname="foo" version="1" navigable="yes" sysMenu="yes"></hta>
</head>
<script language="JavaScript" type="text/javascript">
function doit2() {
var rtfBody =
"{\\rtf1\\ansi\\ansicpg1252\\deff0\\nouicompat\\deflang1031{\\fonttbl{\\f0\\fswiss\\fcharset0 Calibri;}}\r\n"+
"{\\*\\generator Riched20 14.0.7155.5000;}{\\*\\mmathPr\\mwrapIndent1440}\\viewkind4\\uc1\r\n"+
"\\pard\\f0\\fs22 bla\\par\r\n"+
"}\r\n";
// https://github.com/mathiasbynens/windows-1252
var rtfBody1252 = rtfBody; // windows1252.encode(rtfBody);
var dict = new ActiveXObject("Scripting.Dictionary");
for (var i = 0; i < rtfBody1252.length; i++) {
dict.add(i, rtfBody1252.charCodeAt(i));
}
dict.add(rtfBody1252.length, 0);
// Alternative setting via U/Int8Array also doesn't work ...
// var buf = new ArrayBuffer(rtfBody1252.length+1);
// var bufView = new Int8Array(buf);
// for (var i=0, strLen=rtfBody1252.length; i<strLen; i++) {
// bufView[i] = rtfBody1252.charCodeAt(i);
// }
// bufView[rtfBody1252.length] = 0;
var myOlApp = new ActiveXObject("Outlook.Application");
var nameSpace = myOlApp.GetNameSpace("MAPI");
var recipient = nameSpace.CreateRecipient("user@host.com");
var cFolder = nameSpace.GetSharedDefaultFolder(recipient,9);
var appointment = cFolder.Items.Add(1);
appointment.Subject = "Subject";
appointment.Location = "Location";
appointment.Start = "22.02.2017 17:00";
appointment.Duration = "120";
appointment.Categories = "deleteme";
appointment.Body = "bla";
var va = new VBArray(appointment.RTFBody).toArray();
var bla = String.fromCharCode.apply(null, va);
document.forms[0].output.value = bla;
// var bla2 = windows1252.decode(bla);
appointment.RTFBody = dict.Items();
appointment.ReminderSet = "true";
appointment.Save();
entryId = appointment.EntryId;
appointment.Display();
delete appointment;
delete cFolder;
delete recipient;
delete nameSpace;
delete myOlApp;
}
</script>
<body>
<form>
<input type="button" onclick="doit2()" value="doit"/>
<textarea name="output" rows="5" cols="50"></textarea>
</form>
</body>
</html>

Javascript for loop running out of sequence

I have a confusing problem where a line of code in my function is running before a loop which is stated above it. In my HTML I have:
<textarea id="textBox"></textarea>
<button id="submitButton" onclick="parseData()">submit</button>
<div id="put"></div>
And my JS function is:
function parseData() {
var data = $("#textBox").val();
var tab = /\t/;
data = data.split(tab);
$("#put").html($("#put").html() + "<table>");
for (var i = 0; i < data.length; i++) {
$("#put").html($("#put").html() + "<tr>"+data[i]+"</tr>");
};
$("#put").html($("#put").html() + "</table>");
return;
};
The resulting html in $("#put") is this:
<table></table>
"SiO2 (%)Al2O3 (%)TiO2 (%)CaO2 (%)MgO2 (%) 8.21.25.31.51.8 45.32.52.60.210.5 65.23.48.70.0662.3 20.11.85.42.540.2 18.91.12.34.810.7"
I'm not sure why the final </table> is being placed before the for loop runs, and I'm not sure why the <tr> tags aren't being added within the for loop. Any pointers?
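Each call to .html() hands the string to the browser, which automatically closes any unclosed tags on insertion, so the <table> is already closed before the loop runs. Build the whole string first and insert it once: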
function parseData() {
var data = $("#textBox").val();
var tab = /\t/;
var put_html = $("#put").html() + "<table>";
data = data.split(tab);
for (var i = 0; i < data.length; i++) {
put_html += "<tr>"+data[i]+"</tr>";
};
put_html += '</table>';
$("#put").html(put_html);
return;
};
However, I notice that you aren't using <td> elements. You might want to look into fixing that too.
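A version with <td> elements might look like the sketch below. It makes one assumption the original code does not: rows are separated by newlines and cells by tabs, which is how spreadsheet data usually pastes into a textarea. The function names escapeCell and buildTableHtml are made up for illustration.

```javascript
// Escape characters that would otherwise be read as markup inside a cell.
function escapeCell(text) {
  return text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Build the complete table as one string: one <tr> per line, one <td> per tab-separated cell.
function buildTableHtml(text) {
  var lines = text.split(/\r?\n/).filter(function (line) {
    return line !== ""; // skip blank lines, e.g. a trailing newline
  });
  var html = "<table>";
  lines.forEach(function (line) {
    html += "<tr>";
    line.split("\t").forEach(function (cell) {
      html += "<td>" + escapeCell(cell) + "</td>";
    });
    html += "</tr>";
  });
  return html + "</table>";
}

// Usage with jQuery, inserting the finished markup in a single call:
// $("#put").html(buildTableHtml($("#textBox").val()));
```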
You are adding content through html() on every step rather than building the entire content first and adding it once.
Since you are using jQuery, you can also bind the event using jQuery rather than adding it directly in the HTML:
<textarea id="textBox"></textarea>
<button id="submitButton">submit</button>
<div id="put"></div>
$(function(){
$("#submitButton").click(function(){
parseData();
});
function parseData() {
var data = $("#textBox").val();
var tab = /\t/;
data = data.split(tab);
// Build your html
$("#put").html('htmlstructure');
return;
}
});
You can also collect the pieces in an array and join them at the end, so that you don't create a new string instance each time you append, since strings are immutable.
Good example
