What is the best way to parse HTML in Google Apps Script?


var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);
The above code gives a parse error. However, if I replace the XmlService class with the deprecated Xml class, with the lenient flag set, it parses the HTML properly.
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
The problem is mostly caused by the JavaScript part of the HTML not being wrapped in CDATA, so the parser complains with the following error:
The entity name must immediately follow the '&' in the entity reference.
Even if I remove all the <script>(.*?)</script> blocks using a regex, it still complains because the <br> tags aren't closed.
Is there a clean way of parsing HTML into a DOM tree?
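One detail about the regex mentioned in the question: `.` does not match newlines by default, so `<script>(.*?)</script>` skips any script block that spans several lines or has attributes on the opening tag. Below is a minimal sketch of a more tolerant version. The function name `stripScripts` is made up for illustration, and this only deals with the script blocks; unclosed tags such as <br> will still break a strict XML parser.

```javascript
// Remove <script>...</script> blocks, including ones that span several lines.
// [\s\S] matches any character including newlines, which "." does not by default.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script\s*>/gi, '');
}

var sample = '<p>a</p><script type="text/javascript">\nif (x && y) {}\n</script><p>b</p>';
console.log(stripScripts(sample)); // <p>a</p><p>b</p>
```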

I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, then passing its XML string into the new XmlService.parse method:
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();
Note: This solution may not work if the old Xml.parse is completely removed from Google Scripts.

In 2021, the best way to parse HTML on the .gs side that I know of is the Cheerio library for Google Apps Script. To add it:
Click + next to Library
Enter 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
Click "Look up"
Click Add
Sample usage:
const contentText = UrlFetchApp.fetch('https://www.somesite.com/').getContentText();
const $ = Cheerio.load(contentText);
$('.some-class').first().text();
That's it -- this is probably the closest we'll get to doing jQuery-like DOM selection in GAS. The .first() is important or else you may extract more content than you expected (think of it as using querySelector() instead of querySelectorAll()).
Credit where credit is due: https://github.com/tani/cheeriogs

As of May 2020, you can now use the Cheerio library for Google Apps Script to do this.
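The examples below call getContent_, which is not defined in the snippets; it is assumed to be a small helper such as:
function getContent_(url) {
  // Fetch the page and return its body as text.
  return UrlFetchApp.fetch(url).getContentText();
}
Returns the content of Wikipedia's Main Page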
const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('#mp-right').text());
Returns the content of the first paragraph <p> of Wikipedia's Main Page
const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('p').first().text());
To add to your project:
Select Resources - Libraries... in the Google Apps Script editor. Enter the project key 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0 in the Add a library field, and click "Add". Select the highest version number, and click "Save".

I found that the best way to parse HTML in Google Apps Script is to avoid XmlService.parse and Xml.parse altogether. XmlService.parse doesn't work well with the malformed HTML found on some websites.
Here is a basic example of how to parse any website without XmlService.parse or Xml.parse. In this example, I retrieve a list of presidents from "wikipedia.org/wiki/President_of_the_United_States"
with a regular JavaScript document.getElementsByTagName() call, which runs in the browser inside a sidebar, and paste the values into my Google spreadsheet.
1- Create a new Google Sheet;
2- Click the menu Tools > Script editor... to open a new tab with the code editor window and copy the following code into your Code.gs:
function onOpen() {
  // Add a custom menu to the spreadsheet UI.
  var ui = SpreadsheetApp.getUi();
  ui.createMenu("Parse Menu")
    .addItem("Parse", "parserMenuItem")
    .addToUi();
}
function parserMenuItem() {
  // Show test.html as a sidebar.
  var sideBar = HtmlService.createHtmlOutputFromFile("test");
  SpreadsheetApp.getUi().showSidebar(sideBar);
}
function getUrlData(url) {
  // Return the raw HTML of the page as a string.
  var doc = UrlFetchApp.fetch(url).getContentText();
  return doc;
}
function writeToSpreadSheet(data) {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var row = 1;
  for (var i = 0; i < data.length; i++) {
    // Write each value into column A, one row per value.
    var x = data[i];
    var range = sheet.getRange(row, 1);
    range.setValue(x);
    row = row + 1;
  }
}
3- Add an HTML file to your Apps Script project. Open the Script Editor, choose File > New > Html File, and name it 'test'. Then copy the following code into your test.html:
<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <input id="mButon" type="button" value="Click here to get list"
           onclick="parse()">
    <div hidden id="mOutput"></div>
  </body>
  <script>
    window.onload = onOpen;
    function onOpen() {
      var url = "https://en.wikipedia.org/wiki/President_of_the_United_States"
      google.script.run.withSuccessHandler(writeHtmlOutput).getUrlData(url)
      document.getElementById("mButon").style.visibility = "visible";
    }
    function writeHtmlOutput(x) {
      document.getElementById('mOutput').innerHTML = x;
    }
    function parse() {
      var list = document.getElementsByTagName("area");
      var data = [];
      for (var i = 0; i < list.length; i++) {
        var x = list[i];
        data.push(x.getAttribute("title"))
      }
      google.script.run.writeToSpreadSheet(data);
    }
  </script>
</html>
4- Save your .gs and .html files and go back to your spreadsheet. Reload the spreadsheet. Click on "Parse Menu" - "Parse". Then click on "Click here to get list" in the sidebar.
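A side note on writeToSpreadSheet: calling setValue once per cell makes one spreadsheet call per value, which gets slow for long lists. A possible alternative is to reshape the list into rows and write it in one call with Range.setValues(). The helper below is only a sketch; `toColumn` is a made-up name, and the Apps Script usage is shown in comments because it cannot run outside a spreadsheet.

```javascript
// Turn a flat list of values into the rows-by-columns shape that
// Range.setValues() expects: one inner array per row.
function toColumn(values) {
  return values.map(function (value) {
    return [value];
  });
}

// In Apps Script this could be used roughly as follows (not run here):
//   var rows = toColumn(data);
//   if (rows.length > 0) {
//     sheet.getRange(1, 1, rows.length, 1).setValues(rows);
//   }
console.log(toColumn(['Washington', 'Adams']));
```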

Xml.parse() has an option to turn on lenient parsing, which helps when parsing HTML. Note that the Xml service is deprecated however, and the newer XmlService doesn't have this functionality.

For simple tasks such as grabbing one value from a webpage, you could use a regular expression. Regex is notoriously bad for parsing HTML because there are all sorts of odd cases that can trip it up, but if you're confident about the HTML you're accessing, this can sometimes be the simplest way.
Here's an example that fetches the contents of the page's <title> tag:
var page = UrlFetchApp.fetch(contestURL);
var regExp = new RegExp("<title>(.*)</title>", "gi");
var result = regExp.exec(page.getContentText());
// [1] is the first capture group (the part in parentheses in the pattern)
var value = result ? result[1] : 'No title found';
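The pattern above is greedy and `.` does not cross line breaks, so it can misbehave when the title is split over several lines or when the text "</title>" appears again later in the page (for example inside an SVG). Here is a slightly more defensive sketch of the same idea as a standalone function. `extractTitle` is a hypothetical name, and it takes the HTML string you would get from getContentText().

```javascript
// Return the text of the first <title> element, or null if there is none.
function extractTitle(html) {
  // [\s\S]*? is non-greedy and also matches newlines;
  // [^>]* allows attributes on the opening tag.
  var match = /<title\b[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

console.log(extractTitle('<head><title>\n  My page\n</title></head>')); // My page
```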

I know it is not exactly what OP asked, but I found this question when I was looking for some html parsing options - so it might be useful for others as well.
There is an easy-to-use library for text parsing. It's useful if you want to get only one piece of information from the HTML (or XML) code.
EDIT 2021: The script library id is:
1Mc8BthYthXx6CoIz90-JiSzSafVnT6U3t0z_W3hLTAX5ek4w0G_EIrNw
function getData() {
  var url = "https://chrome.google.com/webstore/detail/signaturesatori-central-s/fejomcfhljndadjlojamaklegghjnjfn?hl=en";
  var fromText = '<span class="e-f-ih" title="';
  var toText = '">';
  var content = UrlFetchApp.fetch(url).getContentText();
  // Extract the text between fromText and toText.
  var scraped = Parser
    .data(content)
    .from(fromText)
    .to(toText)
    .build();
  Logger.log(scraped);
  return scraped;
}

If you are using the Cheerio library for Google Apps Script, install it by library ID:
1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
A function to get current emojis from unicode.org:
function getEmojis() {
  var t = new Date();
  var url = 'https://unicode.org/emoji/charts/full-emoji-list.html';
  var fetch = UrlFetchApp.fetch(url);
  var contentText = fetch.getContentText();
  //console.log(new Date() - t);
  // Cheerio
  var $ = Cheerio.load(contentText);
  var data = [];
  // Collect the text of every cell, one inner array per table row.
  $("table > tbody > tr").each((index, element) => {
    var row = [];
    $(element).find("td").each((index, child) => {
      row.push($(child).text());
    });
    // Skip rows without <td> cells (for example header rows).
    if (row.length > 0) {
      data.push(row);
    }
  });
  //console.log(data);
  //console.log(new Date() - t);
  // Result
  return data;
}
↑ The sample code shows how to parse a table and put it into a two-dimensional array ([[array]]).
It may be used as a custom function.
Bonus
Parsing the site may be a time-consuming operation, and you may reach the usage limits.
Here's a test file with a full version of the script:
https://docs.google.com/spreadsheets/d/1iO7YjYWyfseQu_YCfRbGDPg7NskOgMu_iO1iGjr7KxY/edit#gid=93365395
↑ It uses CacheService to reduce the number of calls.
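The linked sheet isn't reproduced here, so the following is only a sketch of the caching idea, not the author's script. It assumes a cache object with get(key) and put(key, value, seconds), which is the shape of the object returned by CacheService.getScriptCache(). The names `getCached` and `makeFakeCache` are made up, and the in-memory stand-in exists only so the sketch can run outside Apps Script. Keep in mind that CacheService limits the size of each cached value, so a large table may need to be trimmed or split.

```javascript
// cache: any object with get(key) and put(key, value, seconds).
// compute: function that produces the value (for example, fetch and parse).
function getCached(cache, key, ttlSeconds, compute) {
  var hit = cache.get(key);
  if (hit !== null && hit !== undefined) {
    return JSON.parse(hit); // cached values are stored as strings
  }
  var value = compute();
  cache.put(key, JSON.stringify(value), ttlSeconds);
  return value;
}

// Minimal in-memory stand-in for the Apps Script cache (ignores expiry).
function makeFakeCache() {
  var store = {};
  return {
    get: function (key) {
      return Object.prototype.hasOwnProperty.call(store, key) ? store[key] : null;
    },
    put: function (key, value, seconds) {
      store[key] = value;
    }
  };
}

// In Apps Script this might look like (not run here):
//   var data = getCached(CacheService.getScriptCache(), 'emojis', 21600, getEmojis);
```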

Natively there's no way, unless you do what you already tried, which won't work if the HTML doesn't conform to the XML format.

There are two options
a) One is to use JavaScript's string functions. First locate your tag using string.indexOf() and then extract the data you want using string.substring().
b) The other option is to make use of the Xml Service.
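Option (a) could be sketched roughly like this. The function name `between` and the sample markup are made up for illustration; it returns null when either marker is missing instead of silently returning a wrong slice.

```javascript
// Return the text between the first startMarker and the next endMarker,
// or null if either marker cannot be found.
function between(text, startMarker, endMarker) {
  var start = text.indexOf(startMarker);
  if (start === -1) return null;
  start += startMarker.length; // skip past the marker itself
  var end = text.indexOf(endMarker, start);
  if (end === -1) return null;
  return text.substring(start, end);
}

console.log(between('<span class="price" title="42 USD">', 'title="', '">')); // 42 USD
```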

It's not possible to create an HTML DOM server-side in Apps Script. Using regular expressions is likely your best option, at least for simple parsing.

Related

Simple code to extract substring into a variable of GTM (Google Tag Manager)

I do not know how to code in JavaScript and am not an expert in GTM. However, the code I need is simple. It worked in an online editor, but I have been trying to get it to work in GTM and it does not validate.
I need to extract the email address from a long string (the {{Click URL}} variable in GTM) that contains a complete "mailto:" URL with many parameters, keeping only the short email address (without the additional parameters after the ".com?").
Just an example of this kind of URL:
'mailto:information@example.com?subject=Demande%20de%20renseignements&body=Votre%20nom:%20%0A%0ANom%20du%20produit:%20%0A%0AVotre%20tel.%20si%20vous%20souhaitez%20recevoir%20un%20appel%20de%20notre%20part:%20%0A%0AVotre%20demande%20de%20renseignements:%20%0A'
Here is the code,
let shortmailto2 = {{Click URL}},
let fin = shortmailto2.indexOf('?'),
let debut = shortmailto2.indexOf(':'),
let shortmailto = shortmailto2.slice(debut+1,fin);
It pulls out the right email address when I test it in an online editor, but when I insert it into GTM (using a pre-existing variable, "Click URL") I get an error (see the Monosnap link for the screenshot): https://monosnap.com/file/eBFYfEwLv9LrPwGrGl6rzaHCbmoeYj
Thanks!

GTM Custom JavaScript Variables:
This field should be a JavaScript function that returns a value using the 'return' statement. If the function does not explicitly return a value, it will return undefined and your container may not behave as expected. Below is an example of this field:
function() {
  var now = new Date();
  return now.getTime();
}
The following worked for me when I tested it, returning just the email address.
function() {
  var shortmailto2 = {{Click URL}};
  var fin = shortmailto2.indexOf('?');
  var debut = shortmailto2.indexOf(':');
  return shortmailto2.slice(debut+1,fin);
}
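One edge case worth knowing about: if the link has no query string, indexOf('?') returns -1 and slice(debut+1, -1) drops the last character of the address. Below is a sketch of the same logic with that case handled, written as a plain named function so it can be run outside GTM. The name `extractEmail` is made up; in GTM the body would go inside the anonymous function, with {{Click URL}} in place of the parameter.

```javascript
// Return the address part of a mailto: URL, without the query string.
function extractEmail(clickUrl) {
  var start = clickUrl.indexOf(':') + 1; // 0 when there is no ':'
  var end = clickUrl.indexOf('?', start);
  if (end === -1) {
    end = clickUrl.length; // no parameters: keep everything to the end
  }
  return clickUrl.slice(start, end);
}

console.log(extractEmail('mailto:information@example.com?subject=Demande')); // information@example.com
console.log(extractEmail('mailto:information@example.com'));                 // information@example.com
```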

This message keeps popping up: <SyntaxError: Unexpected token class('SheetConverter'>

I am writing code to copy a range from a spreadsheet into an email, but I keep getting this error:
<SyntaxError: Unexpected token class('SheetConverter'>
My code:
function myFunction() {
  var s = SpreadsheetApp.getActive().getSheetByName('MAIL');
  // Range to convert into an HTML table.
  var range = s.getRange('C7:I24');
  var to = "example@ex.com";
  var htmlTable = SheetConverter.convertRange2html(range);
  var body = "Here is the table:<br/><br/>"
    + htmlTable
    + "<br/><br/>The end.";
  MailApp.sendEmail(to, 'Subject', body, {htmlBody: body});
}

There is a bug report about V8 support in the SheetConverter library. As a short-term workaround, you could remove the library reference, create the file yourself inside your project by copying the library's source code, and edit lines 58-60 to read:
function objIsClass_(object,className) {
  return (toClass_.call(object).indexOf(className) !== -1);
}
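For context, here is a small standalone sketch of what a check like this does. It assumes that toClass_ in the library refers to Object.prototype.toString, which the snippet above does not show, so treat that as a guess. The names `toClass` and `objIsClass` (without underscores) are used here to make clear this is not the library's code.

```javascript
// Object.prototype.toString returns strings such as "[object Date]",
// so looking for the class name inside that string identifies the type.
var toClass = Object.prototype.toString;

function objIsClass(object, className) {
  return toClass.call(object).indexOf(className) !== -1;
}

console.log(objIsClass(new Date(), 'Date')); // true
console.log(objIsClass('text', 'Date'));     // false
```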

This seems to be a bug in the SheetConverter library; the error only appears after enabling the new Apps Script runtime. Try disabling it:
In the Script Editor, choose Run > Disable new Apps Script runtime.
This should work.

Go to Project Settings and untick this option:
Enable Chrome V8 runtime.

Making a checkbox Ui with Google Apps Script

This is my first time posting on Stack Overflow, so I apologize in advance if I made any "posting" mistakes.
I am using Google Apps Script for the project I am doing, because I am interacting with Google Docs and Google sheets. Google Script has a deprecated Ui Service, which is very frustrating to me.
So far, I have written a JavaScript program that parses the "table of contents" part of a Google Doc and retrieves the items I want. (If the table of contents is changed, the items retrieved might be different too.) Now, my intention is to make a checkbox list that uses all the items I retrieved from the table of contents as its options; after the user chooses some and clicks the "submit" button, a new Google Sheet would be generated. Of course I will operate on that new Google Sheet, but I want to stick with this problem first.
I did look up HTML Service, a Ui recommended by Google Script, but I don't know too much about HTML. What is most difficult is that if I use HTML to make the checkbox, how am I supposed to get all the items I parsed from the JavaScript passed into the HTML file, check which options are checked, and then get checked options passed back to the JavaScript so that I can do further operation? Is it even possible? Or am I having a misconception?
What I have so far:
//custom menu
function onOpen()
{
  var ui = DocumentApp.getUi();
  ui.createMenu('Custom Menu')
    .addItem('Export Checksheets', 'customMenu')
    .addToUi();
}
//parse the table of contents
function readFile()
{
  var options = [];
  var testDoc = DocumentApp.getActiveDocument();
  var file = testDoc.getBody();
  var tableOfCon = DocumentApp.ElementType.TABLE_OF_CONTENTS;
  var searchRes = file.findElement(tableOfCon);
  //If the element exists
  if (searchRes)
  {
    var TC = searchRes.getElement().asTableOfContents();
    var numChild = TC.getNumChildren();
    for (var i=0; i < numChild; i++)
    {
      var info = {};
      var TCItem = TC.getChild(i).asParagraph();
      var TCItemText = TCItem.getChild(0).asText();
      var TCItemAttrs = TCItemText.getAttributes(); //for future usage
      //get all the options
      if (!TCItemText.isBold())
      {
        options.push(TCItem.getText());
      }
    }
  }
  Logger.log(options); //which successfully displays the items retrieved
}
//linked to the HTML file
function customMenu()
{
  var html = HtmlService.createHtmlOutputFromFile('Index');
  html.setHeight(350).setWidth(280);
  var ui = DocumentApp.getUi().showModalDialog(html, 'Hello!');
}
My HTML file only has a little code that displays some message when the custom menu is chosen.
Any help would be appreciated!
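No answer is recorded for this question here, so what follows is only one possible direction, not an accepted solution. The first half of the problem (turning the parsed items into checkboxes) can be done by building the HTML string on the server side. The sketch below shows that step as plain JavaScript; `escapeHtml` and `buildCheckboxHtml` are made-up names, and it assumes readFile() is changed to return its options array.

```javascript
// Escape characters that have a special meaning in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build a form with one checkbox per option.
function buildCheckboxHtml(options) {
  var items = options.map(function (option) {
    var value = escapeHtml(option);
    return '<label><input type="checkbox" name="item" value="' + value + '"> ' +
      value + '</label><br>';
  });
  return '<form id="checks">' + items.join('') + '</form>';
}

// In Apps Script the string could be shown in the dialog (not run here):
//   var html = HtmlService.createHtmlOutput(buildCheckboxHtml(options));
//   DocumentApp.getUi().showModalDialog(html, 'Hello!');
console.log(buildCheckboxHtml(['Chapter 1', 'Q & A']));
```

For the way back, a script in the dialog could collect the checked values and pass them to a server function through google.script.run, the same mechanism used in the sidebar example earlier on this page.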

How do I use jQuery in Windows Script Host?

I'm working on some code that needs to parse numerous files that contain fragments of HTML. It seems that jQuery would be very useful for this, but when I try to load jQuery into something like WScript or CScript, it throws an error because of jQuery's many references to the window object.
What practical way is there to use jQuery in code that runs without a browser?
Update: In response to the comments, I have successfully written JavaScript code to read the contents of files using new ActiveXObject('Scripting.FileSystemObject');. I know that ActiveX is evil, but this is just an internal project to get some data out of some files that contain HTML fragments and into a proper database.
Another Update: My code so far looks about like this:
var fileIo, here;
fileIo = new ActiveXObject('Scripting.FileSystemObject');
here = unescape(fileIo.GetParentFolderName(WScript.ScriptFullName) + "\\");
(function() {
  var files, thisFile, thisFileName, thisFileText;
  for (files = new Enumerator(fileIo.GetFolder(here).files); !files.atEnd(); files.moveNext()) {
    thisFileName = files.item().Name;
    thisFile = fileIo.OpenTextFile(here + thisFileName);
    thisFileText = thisFile.ReadAll();
    // I want to do something like this:
    s = $(thisFileText).find('input#txtFoo').val();
  }
})();
Update: I posted this question on the jQuery forums as well: http://forum.jquery.com/topic/how-to-use-jquery-without-a-browser#14737000003719577

Following along with your code, you could create an instance of IE using Windows Script Host, load your HTML file into the instance, append jQuery dynamically to the loaded page, then script from that.
This works in IE8 with XP, but I'm aware of some security issues in Windows 7/IE9. If you run into problems, you could try lowering your security settings.
var fileIo, here, ie;
fileIo = new ActiveXObject('Scripting.FileSystemObject');
here = unescape(fileIo.GetParentFolderName(WScript.ScriptFullName) + "\\");
ie = new ActiveXObject("InternetExplorer.Application");
ie.visible = true;
function loadDoc(src) {
  var head, script;
  ie.Navigate(src);
  // Wait until IE has finished loading the file.
  while(ie.busy){
    WScript.sleep(100);
  }
  // Inject jQuery into the loaded page.
  head = ie.document.getElementsByTagName("head")[0];
  script = ie.document.createElement('script');
  script.src = "http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js";
  head.appendChild(script);
  return ie.document.parentWindow;
}
(function() {
  var files, thisFile, win;
  for (files = new Enumerator(fileIo.GetFolder(here).files); !files.atEnd(); files.moveNext()) {
    thisFile = files.item();
    if(fileIo.GetExtensionName(thisFile)=="htm") {
      win = loadDoc(thisFile);
      // your jQuery reference = win.$
      WScript.echo(thisFile + ": " + win.$('input#txtFoo').val());
    }
  }
})();

This is pretty easy to do in Node.js with the cheerio package. You can read in arbitrary HTML from whatever source you want, parse it with cheerio and then access the parsed elements using jQuery style selectors.

How can I get the XUL as a string from a Mozilla add-on at run-time using JavaScript?

I'm trying to get a window's XUL text as a String in Javascript. I need it to be done at runtime because the window adds/removes UI elements dynamically.
I have tried the following:
document.toXML()
document.xml
document.documentElement.toXML()
Among other things. Nothing works! Can anyone help?

You use XMLSerializer:
new XMLSerializer().serializeToString(document);

I don't think there is a function or field to get the XUL text, but you can work around it by reading the content from the XUL URL:
function getContentFromURL(url) {
  var Cc = Components.classes;
  var Ci = Components.interfaces;
  // XPCOM contract IDs start with '@'.
  var ioService = Cc['@mozilla.org/network/io-service;1'].getService(Ci.nsIIOService);
  var scriptableStream = Cc['@mozilla.org/scriptableinputstream;1'].getService(Ci.nsIScriptableInputStream);
  var channel = ioService.newChannel(url, null, null);
  var input = channel.open();
  scriptableStream.init(input);
  return scriptableStream.read(input.available());
}
So you can call getContentFromURL(document.location) to get the XUL content.
