Strategy to extract structured data with xpath - javascript

Is there a pattern to extract structured data from an HTML page using XPath? I'm trying to extract data from one or more HTML tables on a page. XPath makes it easy to find the table(s), but I'm struggling once I've got that far.
I'm currently doing the following:
Iterate the tables (there may be more than one)
Iterate the rows within that table
Iterate the cells within that row
(Then probably put them in an array and parse the contents)
My code is something like this:
var tables = mydoc.evaluate( "//table", mydoc, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null );
table = tables.iterateNext();
while (table)
{
var rows = mydoc.evaluate("tbody/tr", table, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null);
row = rows.iterateNext();
while (row)
{
var tds = mydoc.evaluate("td", row, null, XPathResult.ORDERED_NODE_ITERATOR_TYPE, null)
td = tds.iterateNext()
while(td)
{
// TODO: store content in an array to process later
print('*' + td.textContent);
td = tds.iterateNext();
}
row = rows.iterateNext();
}
table = tables.iterateNext();
}
This seems a little nasty as all the XPath examples seem to do their processing in one step. There appear to be few non-trivial examples where two types of data (e.g. labels and values in a table) are selected and combined. I can use the following selectors, but I end up with two lists with no structure:
//table/tbody/tr/td[@class='label']
//table/tbody/tr/td/a[@class='value']
(I know I'm using XPath for HTML parsing for which it wasn't really intended, but it seems to work so far.)

There appear to be few non-trivial examples where two types of data (e.g. labels and values in a table) are selected and combined. I can use the following selectors, but I end up with two lists with no structure:
//table/tbody/tr/td[@class='label']
//table/tbody/tr/td/a[@class='value']
Use:
//table/tbody/tr/td[@class='label']
|
//table/tbody/tr/td/a[@class='value']
This single XPath expression selects all the wanted nodes, and every XPath engine I am aware of returns the selected nodes in document order. The | (union) operator produces the set union of its arguments.
If the (X)HTML document has a regular structure, you can expect every selected td element (label) in the result to be followed by its corresponding a element (value).
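As a rough sketch of that last point, the flat, document-ordered result could be turned back into label/value pairs along these lines. It assumes the page is regular (each label td comes before its value a); the helper names snapshotToArray and pairLabelsWithValues are made up for illustration and are not part of any XPath API.

```javascript
// Copy an ORDERED_NODE_SNAPSHOT_TYPE XPathResult into a plain array.
function snapshotToArray(result) {
  var nodes = [];
  for (var i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}

// Walk the nodes in document order: a TD starts a new pair,
// the next A element fills in that pair's value.
function pairLabelsWithValues(nodes) {
  var pairs = [];
  var current = null;
  nodes.forEach(function (node) {
    var name = node.nodeName.toLowerCase();
    if (name === "td") {
      current = { label: node.textContent.trim(), value: null };
      pairs.push(current);
    } else if (name === "a" && current && current.value === null) {
      current.value = node.textContent.trim();
    }
  });
  return pairs;
}

// In a browser (not run here):
// var xpath = "//table/tbody/tr/td[@class='label'] | " +
//             "//table/tbody/tr/td/a[@class='value']";
// var result = document.evaluate(xpath, document, null,
//     XPathResult.ORDERED_NODE_SNAPSHOT_TYPE, null);
// var pairs = pairLabelsWithValues(snapshotToArray(result));
```

A label with no matching link simply keeps value: null, so irregular rows show up in the output instead of silently shifting later pairs.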

If it's on the main HTML page, you could just do:
for(var tables=document.getElementsByTagName("table"),i=0;i<tables.length;++i)
for(var rows=tables[i].getElementsByTagName("tr"),j=0;j<rows.length;++j)
for(var cells=rows[j].getElementsByTagName("td"),k=0;k<cells.length;++k)
print("*"+cells[k].textContent);
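Note that getElementsByTagName does not return an array: it returns a live collection which, like an ORDERED_NODE_ITERATOR_TYPE result, is tied to the current state of the document rather than being a snapshot of it.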
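If a structured snapshot is wanted rather than printed output, one possible approach is to copy the live collections into plain arrays. This is only a sketch: tableToMatrix is a hypothetical helper name, and it assumes tables are not nested (getElementsByTagName returns all descendants, so rows of an inner table would be included too).

```javascript
// Snapshot one table into an array of rows, each an array of cell texts.
// Array.from copies the live collections, so later DOM changes
// do not affect the returned arrays.
function tableToMatrix(table) {
  return Array.from(table.getElementsByTagName("tr"), function (row) {
    return Array.from(row.getElementsByTagName("td"), function (cell) {
      return cell.textContent;
    });
  });
}

// In a browser (not run here):
// var allTables = Array.from(document.getElementsByTagName("table"),
//                            tableToMatrix);
```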

Related

Getting table name from cell selection in Apple Numbers using Javascript for Automation (JXA)

I'm trying to convert this Applescript to Javascript for Automation.
Is there anyone who can help write the line below in JavaScript for Automation?
set thisTable to the first table whose class of selection range is range
Appreciate any help!
This will require the use of a whose clause for querying the element array of tables in the selected sheet—essentially, you want to find all Table objects whose selectionRange attribute is not null (which is analogous to asserting that its class is range). Unfortunately, there seems to be a bug in JXA that prevents comparisons with null. So while it would be optimal to simply check _not: [{ selectionRange: null }] in the code below, I found that that approach did not work. Instead, the following appears (albeit not particularly elegantly) to do what you're looking for:
Numbers = Application('Numbers')
selectedTables = Numbers.documents[0].activeSheet.tables.whose({
_not: [{
_match: [ObjectSpecifier().selectionRange.name, '']
}]
})
if (selectedTables.length > 0) {
thisTable = selectedTables[0]
} else {
// No table is selected -- handle accordingly
}
This is a pretty hacky approach (it relies on the fact that selectionRange.name isn't even a valid key for non-selected tables), but given the buggy state of JXA, it may be the best one can do.
Further Reading:
Apple's documentation on whose clauses
An OmniGraffle bug report showing some of the varieties of (and issues with) the whose notation
A list of supported whose filter operators
Another way to get the selected table and its selection range:
const app = Application("Numbers"),
doc = app.documents[0],
sheet= doc.activeSheet,
tbls = sheet.tables;
let seltable = tbls[tbls.selectionRange().findIndex(el => !!el)];
let selrange = seltable.selectionRange;

Recursive function for detailCellRendererParams, Ag-Grid?

I have data which consists of multiple rows of data. Each row contains a 'children' array property, which may have data in the form of more rows, or may be empty. On top of that, each of the rows within the 'children' array property may also contain more 'children' data or rows and so on, so it looks like this (think of each line as a row and each indented line as a child row of that row):
r|-------
r1|------*
r1a|------
r1b|------*
r1b1|------
r1c|------*
r1c1|------
r1c2|------
r2|------
r3|------*
r3a|------
r3b|------
Each parent containing child rows (I marked them with '*') must have detailCellRendererParams defined. That would be fine if I were going to define each one manually (as shown in the Ag-Grid documentation under Nesting Master / Detail), but it is uncertain how many parent/child rows there will be. I am looking to create a recursive function that defines the detailCellRendererParams for each parent row with children. How might I write something like this?
No recursion required, just use the tree data functionality of ag-grid:
https://www.ag-grid.com/javascript-grid-tree-data/
You need to enable tree functionality with:
var gridOptions = {
treeData: true,
...
}
and provide the grid with the field that creates your tree-hierarchy
gridOptions.getDataPath = function(data) {
return data.myHierarchyField;
};
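Tree data expects a flat list of rows in which each row carries its own path, so nested children arrays have to be flattened first. The following is a minimal sketch of that step, assuming each row has a unique id field and an optional children array; the field names id, children and path, and the helper flattenWithPaths, are illustrative rather than anything ag-Grid prescribes. The recursion here walks the data only, not the grid configuration.

```javascript
// Flatten nested rows into a flat list, recording each row's path
// from the root. `id`, `children` and `path` are assumed field names.
function flattenWithPaths(rows, parentPath) {
  var flat = [];
  (rows || []).forEach(function (row) {
    var path = (parentPath || []).concat(row.id);
    var copy = Object.assign({}, row, { path: path });
    delete copy.children; // the hierarchy now lives in `path`
    flat.push(copy);
    flat = flat.concat(flattenWithPaths(row.children, path));
  });
  return flat;
}

// Hypothetical sample data shaped like the rows in the question.
var nested = [
  { id: "r1", children: [{ id: "r1a" }, { id: "r1b", children: [{ id: "r1b1" }] }] },
  { id: "r2", children: [] }
];

var gridOptions = {
  treeData: true,
  rowData: flattenWithPaths(nested),
  getDataPath: function (data) { return data.path; }
};
```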

jQuery DataTables 1.10: Which column is it sorted on right now?

Using DataTables 1.10,
I have a DataTable with a default sort and the user can resort by some of the other columns.
How do I detect the column by which the table is currently sorted?
Some context which may not be relevant to answering the question: What I'm really trying to do is "export" the table to a non-interactive HTML table. This DataTable is generated programmatically and then turned into a DataTable, so after some searching for export options it looks like it will be easier to essentially regenerate the original table than to actually export. But I need the regenerated table to have the rows in the same order as the current sort.
The current sort state sortInfo can be retrieved like this:
var apiObject = $("#myPlainTable").DataTable( dtOptions );
// ...
var sortInfo = apiObject.settings().order()
More specifically, the column and direction are encoded like this:
var sortCol = sortInfo[0][0]; // counting from left, starting with 0
var sortDir = sortInfo[0][1]; // either "asc" or "desc"
Caveats:
The sortInfo object will have the above format after the user changes the sorting; if you specify the initial sort by setting dtOptions.order using a different format, then the sortInfo object will have the original value you specified until the user changes the sorting. (For example, DataTables will accept [1,'asc'] in addition to the above [[1,'asc']]; I didn't test what happens if you pass a value DataTables can't use.)
This describes the default case where you sort by one column only, not using the multi-column sort feature.
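To sidestep the first caveat, the value could be normalized before use. This is a small sketch, not part of DataTables: normalizeOrder is a hypothetical helper that accepts either a single [column, direction] pair or an array of such pairs.

```javascript
// Accepts either a single [column, direction] pair or an array of pairs
// and always returns an array of { column, direction } objects.
function normalizeOrder(order) {
  if (!Array.isArray(order) || order.length === 0) {
    return [];
  }
  var pairs = Array.isArray(order[0]) ? order : [order];
  return pairs.map(function (pair) {
    return { column: pair[0], direction: pair[1] };
  });
}

// Usage with the sortInfo value from above (not run here):
// var primary = normalizeOrder(sortInfo)[0];
// primary.column, primary.direction
```

Returning every pair rather than only the first also leaves room for multi-column sorting.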
Since you are already using DataTables 1.10.x, why not use the API? It makes this easy:
table.rows().data()
returns the data for every row as the table is currently sorted (an array of arrays when each row's data is an array). So if you want to export or clone the content of a DataTable to a static table, you can do it very simply:
$("#clone").click(function() {
var cloneTable = '';
table.rows().data().each(function(row) {
cloneTable += '<tr><td>' + row.join('</td><td>') + '</td></tr>';
})
$('#cloneTable tbody').html(cloneTable);
})
demo -> http://jsfiddle.net/zuxm2e68/
If you are sending the sort state to the server, check the
order[i][column]
order[i][dir]
parameters for the column and direction.
Since you are using 1.10, you can use the following:
var table = $('.myTable').dataTable({ /* Your options */ });
var sortArray = table.api().settings()[0].aaSorting; // gives array
If you are using API already via $('.myTable').DataTable({...}), you can omit the .api().

DataTables: How to bypass the filtering rules?

How can I exempt a single row in a DataTables.js table from DataTables' built-in filtering, so that it is always shown?
Background: I'm building a table editing component using the jQuery-based DataTables.js library. Instead of using dialogs or overlays, I wanted to present editing controls right within the datatable, like this:
This works like a charm, even with active filters: I keep the original, unchanged data in the record while it is being edited, so I can use that data for the 'sort' and 'filter' modes of mDataProp, and my row stays in place and visible until editing is finished.
A bigger problem arises when I add a new row: There is no data to use for filtering, so if a filter is active, my row won't be visible. This breaks the workflow where the user searches through the dataset, sees that some record is missing, and (without clearing the filter) presses the "Add" button, waiting for an empty row with edit controls to appear:
How can I exempt this special row from DataTables' filtering?
After reading through the source code of DataTables.js for some time, I came to the conclusion that there is no way to hook into the filtering in the desired way. There are hooks for custom filters, but they can only be used to hide stuff, not to show stuff.
However, there's a 'filter' event which is triggered after filtering, but before the table is rendered. My solution installs a handler for this event:
$('table#mydatatable').bind('filter', function() {
var nTable = $(this).dataTable();
var oSettings = nTable.fnSettings();
//collect the row IDs of all unsaved rows
var aiUnsavedRowIDs = $.grep(oSettings.aiDisplayMaster, function(iRowID) {
var oRowData = nTable.fnGetData(iRowID);
return is_unsaved(oRowData);
});
//prepare lookup table
var oUnsavedRecordIDs = {};
$.each(aiUnsavedRowIDs, function(idx, iRowID) {
oUnsavedRecordIDs[iRowID] = true;
});
//remove unsaved rows from display (to avoid duplicates after the
//following step)
for (var i = oSettings.aiDisplay.length - 1; i >= 0; i--) {
//iterate backwards, because otherwise, removal from aiDisplay
//would mess up the iteration
if (oUnsavedRecordIDs[ oSettings.aiDisplay[i] ]) {
oSettings.aiDisplay.splice(i, 1);
}
}
//insert unsaved records at the top of the display
Array.prototype.unshift.apply(oSettings.aiDisplay, aiUnsavedRowIDs);
//NOTE: cannot just say oSettings.aiDisplay.unshift(aiUnsavedRowIDs)
//because this would add the array aiUnsavedRowIDs as an element to
//aiDisplay, not its contents.
});
What happens here? First, I find all unsaved rows by looking through oSettings.aiDisplayMaster. This array references all rows that are in this DataTable, in the correct sorting order. The elements of aiDisplayMaster are integer indices into DataTables' internal data storage (one index per row).
The filtering process goes through the rows in aiDisplayMaster, and places the row IDs of all matching rows in oSettings.aiDisplay. This array controls which rows will be rendered (after this event handler has finished!). The whole process looks like this:
[1, ..., numRows]
|
| sorting
v
oSettings.aiDisplayMaster
|
| filtering
v
oSettings.aiDisplay
|
| rendering
v
DOM
So after having located all unsaved records in aiDisplayMaster (using custom logic that I wrapped in an is_unsaved() function for the sake of this snippet), I add them all to aiDisplay (after removing existing instances of these rows, to avoid duplicates).
A side-effect of this particular implementation is that all unsaved rows appear at the top of the table, but in my case, this is actually desirable.
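Stripped of the DataTables internals, the array manipulation in that handler amounts to the following. This is a simplified sketch that returns a new array instead of modifying aiDisplay in place; pinUnsavedRows is a hypothetical name, and isUnsaved stands in for whatever per-row check the application uses.

```javascript
// Returns a new display list with all unsaved row IDs first (in master,
// i.e. sorted, order), followed by the filtered rows that are not unsaved.
function pinUnsavedRows(masterIds, displayIds, isUnsaved) {
  var unsaved = masterIds.filter(function (id) { return isUnsaved(id); });
  var unsavedLookup = {};
  unsaved.forEach(function (id) { unsavedLookup[id] = true; });
  // Drop unsaved rows that survived filtering, to avoid duplicates.
  var rest = displayIds.filter(function (id) { return !unsavedLookup[id]; });
  return unsaved.concat(rest);
}
```

For example, with masterIds [3, 1, 2, 0], a filtered displayIds of [1, 0] and rows 1 and 2 unsaved, the rows to render would be [1, 2, 0].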

More efficient way to generate elements and associate $.data()? [jQuery]

I'm looking for a more efficient way to generate some DOM changes and update the .data() object.
Presently, data is provided in an array of objects and I'm building the strings in sections, appending them to the table body, and adding the .data object onto the element.
Here's a snapshot.
for (i in data.users) {
//Build the string
var rowData = "<tr class=\"";
rowData += (i%2) ? "odd":"even";
rowData +="\">";
rowData +="<td>";
rowData +=data.users[i]["FirstName"];
rowData +="</td>";
rowData +="<td>";
rowData +=data.users[i]["LastName"];
rowData +="</td>";
rowData +="<td>";
rowData +=data.users[i]["UserName"];
rowData +="</td>";
//Could be many more columns
rowData +="</tr>";
//Change the DOM, add to the data object.
$(rowData).appendTo("#list > table > tbody").data('ID', data.users[i]["ID"]);
}
I'm mostly upset over having to manipulate the DOM n times, rather than make one change with all the data-- or be able to release data as if there were a buffer. I would also like to update .data() all-at-once.
With jQuery 1.4.3+ you can put a data-XXX attribute directly on a node; jQuery automatically picks it up into the data() hash for that node.
var Buffer = [];
for (i in data.users) {
//Build the string
// HTML lowercases attribute names, so read this back with .data('id')
Buffer.push("<tr data-id=\"" + data.users[i]["ID"] + "\" class=\"");
Buffer.push((i%2) ? "odd":"even");
Buffer.push("\">");
Buffer.push("<td>");
Buffer.push(data.users[i]["FirstName"]);
Buffer.push("</td>");
// and so forth
Buffer.push("</tr>");
}
$("#list > table > tbody").append(Buffer.join(''));
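The same single-append idea could be written as a pure function that also escapes the values, which matters if names can contain characters such as < or &. This is a sketch under assumptions: escapeHtml and buildRowsHtml are hypothetical helpers, and the lowercase data-id attribute is chosen because HTML attribute names are case-insensitive.

```javascript
// Escape text before placing it in HTML markup.
function escapeHtml(value) {
  return String(value)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the markup for all rows as one string, so the DOM is
// touched only once when the result is appended.
function buildRowsHtml(users, columns) {
  return users.map(function (user, index) {
    var cells = columns.map(function (column) {
      return "<td>" + escapeHtml(user[column]) + "</td>";
    }).join("");
    return '<tr data-id="' + escapeHtml(user.ID) + '" class="' +
      (index % 2 ? "odd" : "even") + '">' + cells + "</tr>";
  }).join("");
}

// In a page with jQuery loaded (not run here):
// $("#list > table > tbody").append(
//   buildRowsHtml(data.users, ["FirstName", "LastName", "UserName"]));
```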
I am assuming that the data is coming from some server-side source. Why not use something like jqGrid? You pass it a JSON object (or XML, etc.) from your server, so all the manipulation is done server-side, and it just displays the data to the users. The grids that jqGrid generates are very nice, and you won't need to do all the work you are currently doing.
Take a look at the jQuery Templates plugin. It is still in beta but very nice!
http://api.jquery.com/category/plugins/templates/
I would vote for not putting that information in the DOM at all. Use a JS model to manage that data, then query it as needed. This keeps the DOM uncluttered and properly and efficiently separates semantic markup and the data that supports/drives it. MVC ftw!
