If I use the following function to get Google search output:
function myFunction() {
var post_url, result;
post_url = "http://www.google.com/search?q=stack+overflow";
result = UrlFetchApp.fetch(post_url);
Logger.log(result);
}
it doesn't work.
P.S.
Sorry, I had to explore some dependencies.
I took this example:
function scrapeGoogle() {
var response = UrlFetchApp.fetch("http://www.google.com/search?q=labnol");
var myRegexp = /<h3 class=\"r\">([\s\S]*?)<\/h3>/gi;
var elems = response.getContentText().match(myRegexp);
for(var i in elems) {
var title = elems[i].replace(/(^\s+)|(\s+$)/g, "")
.replace(/<\/?[^>]+>/gi, "");
Logger.log(title);
}
}
and it works. Then I began making some modifications and noticed that whenever my code contains an error, the request fails with:
Request failed for http://www.google.com/search?q=labnol returned code 503.
So I experimented further without errors in the code, and the solution worked. But when I began moving it into a function in a library, it started throwing a 503 error every time!
I'm very surprised by this behavior...
Here is a short video, just to document it: https://youtu.be/Lem9eiIVY0I
P.P.S.
Oh! I violated some rules, so Google put me on a block list.
So I ran this:
function scrapeGoogle() {
var options =
{
'muteHttpExceptions': true
}
var response = UrlFetchApp.fetch("http://www.google.com/search?q=labnol", options);
Logger.log(response);
}
and got
About this pageOur systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. Why did this happen?
As I understand it, I have to use some special Google service to get the search output without being blocked?
You can use a simple regex to extract Google search results.
var regex = /<h3 class=\"r\">([\s\S]*?)<\/h3>/gi;
var items = response.getContentText().match(regex);
Alternatively, you can use the ImportXML function in sheets.
=IMPORTXML(GOOGLE_URL, "//h3[@class='r']")
See: Scrape Google Search with Sheets
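Before running a regex over the response, it may help to check whether Google returned results at all or the "unusual traffic" page quoted in the question. Here is a minimal sketch, assuming a hypothetical helper `classifyResponse` (not part of Apps Script) that receives the status code and body of a request made with `muteHttpExceptions`:

```javascript
// Hypothetical helper: decide what a search response means before parsing it.
// code: HTTP status, e.g. response.getResponseCode()
// body: response text, e.g. response.getContentText()
function classifyResponse(code, body) {
  if (code === 200) {
    return 'ok';
  }
  if (code === 429 || code === 503) {
    // The block page quoted in the question contains this phrase.
    return /unusual traffic/i.test(body) ? 'blocked' : 'retry-later';
  }
  return 'error';
}

// Assumed usage in Apps Script:
// var response = UrlFetchApp.fetch(url, { muteHttpExceptions: true });
// var status = classifyResponse(response.getResponseCode(), response.getContentText());
```

Only an 'ok' result is worth passing to the regex; the other outcomes suggest slowing down or stopping rather than retrying immediately.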
I was developing a small extension for Firefox. I wanted to log messages while a part of my extension is executing.
CODE:
var aConsoleService = Components.classes["@mozilla.org/consoleservice;1"].getService(Components.interfaces.nsIConsoleService);
aConsoleService.logStringMessage("created");
Here "created" is the message, but I am unable to see it in the browser console. Am I missing something? I searched and learned that you have to enable devtools.errorconsole.enabled in about:config. I did that too. Please help me out.
Are you sure you're opening the browser console? Ctrl + Shift + J?
var {utils:Cu, interfaces:Ci} = Components;
var consoleService = Components.classes["@mozilla.org/consoleservice;1"].getService(Components.interfaces.nsIConsoleService);
consoleService.logStringMessage(text);
You can also try this:
var {utils:Cu, interfaces:Ci} = Components;
Cu.import('resource://gre/modules/Services.jsm');
Services.console.logStringMessage(text);
Or this:
var {utils:Cu, interfaces:Ci} = Components;
Cu.import('resource://gre/modules/Services.jsm');
Services.appShell.hiddenDOMWindow.console.log('blah');
If you're using the Add-on SDK, then instead of var {utils:Cu, interfaces:Ci} = Components; you have to write var {Cu, Ci} = require('chrome');
var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);
The above code gives a parse error. However, if I replace the XmlService class with the deprecated Xml class, with the lenient flag set, it parses the HTML properly.
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
The problem is mostly caused by the absence of CDATA sections around the JavaScript in the HTML; the parser complains with the following error:
The entity name must immediately follow the '&' in the entity reference.
Even if I remove all the <script>(.*?)</script> blocks using a regex, it still complains because the <br> tags aren't closed.
Is there a clean way of parsing HTML into a DOM tree?
I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, and then passing its XML string into the new XmlService.parse method:
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();
Note: This solution may not work if the old Xml.parse is completely removed from Google Scripts.
In 2021, the best way to parse HTML on the .gs side that I know of is the Cheerio library. To add it:
Click + next to Library
Enter 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
Click "Look up"
Click Add
Sample usage:
const contentText = UrlFetchApp.fetch('https://www.somesite.com/').getContentText();
const $ = Cheerio.load(contentText);
$('.some-class').first().text();
That's it -- this is probably the closest we'll get to doing jQuery-like DOM selection in GAS. The .first() is important or else you may extract more content than you expected (think of it as using querySelector() instead of querySelectorAll()).
Credit where credit is due: https://github.com/tani/cheeriogs
As of May 2020, you can now use the Cheerio library for Google Apps Script to do this.
Returns the content of Wikipedia's Main Page
const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('#mp-right').text());
Returns the content of the first paragraph <p> of Wikipedia's Main Page
const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('p').first().text());
To add to your project:
Select Resources - Libraries... in the Google Apps Script editor. Enter the project key 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0 in the Add a library field, and click "Add". Select the highest version number, and click "Save".
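The snippets above call a helper named `getContent_` that is not shown in this answer. A plausible sketch, assuming it simply fetches the page and returns the body as text; the optional `fetcher` parameter is an illustrative addition so the function can be exercised outside Apps Script:

```javascript
// Hypothetical helper assumed by the snippets above: fetch a URL and return its body as text.
// `fetcher` is an illustrative parameter, not part of any library;
// when omitted, the function falls back to the built-in UrlFetchApp.
function getContent_(url, fetcher) {
  var service = fetcher || UrlFetchApp;
  return service.fetch(url).getContentText();
}
```

In an Apps Script project, calling `getContent_('https://en.wikipedia.org')` with one argument would use `UrlFetchApp` directly.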
I found that the best way to parse HTML in Google Apps Script is to avoid XmlService.parse and Xml.parse altogether. XmlService.parse doesn't work well with the malformed HTML that certain websites serve.
Here is a basic example of how you can parse any website easily without using XmlService.parse or Xml.parse. In this example, I retrieve a list of presidents from "wikipedia.org/wiki/President_of_the_United_States"
with a regular JavaScript document.getElementsByTagName() call and paste the values into my Google spreadsheet.
1- Create a new Google Sheet;
2- Click the menu Tools > Script editor... to open a new tab with the code editor window and copy the following code into your Code.gs:
function onOpen() {
var ui = SpreadsheetApp.getUi();
ui.createMenu("Parse Menu")
.addItem("Parse", "parserMenuItem")
.addToUi();
}
function parserMenuItem() {
var sideBar = HtmlService.createHtmlOutputFromFile("test");
SpreadsheetApp.getUi().showSidebar(sideBar);
}
function getUrlData(url) {
var doc = UrlFetchApp.fetch(url).getContentText()
return doc
}
function writeToSpreadSheet(data) {
var ss = SpreadsheetApp.getActiveSpreadsheet();
var sheet = ss.getSheets()[0];
var row=1
for (var i = 0; i < data.length; i++) {
var x = data[i];
var range = sheet.getRange(row, 1)
range.setValue(x);
row = row + 1;
}
}
3- Add an HTML file to your Apps Script project. Open the Script Editor, choose File > New > Html File, and name it 'test'. Then copy the following code into your test.html:
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<input id= "mButon" type="button" value="Click here to get list"
onclick="parse()">
<div hidden id="mOutput"></div>
</body>
<script>
window.onload = onOpen;
function onOpen() {
var url = "https://en.wikipedia.org/wiki/President_of_the_United_States"
google.script.run.withSuccessHandler(writeHtmlOutput).getUrlData(url)
document.getElementById("mButon").style.visibility = "visible";
}
function writeHtmlOutput(x) {
document.getElementById('mOutput').innerHTML = x;
}
function parse() {
var list = document.getElementsByTagName("area");
var data = [];
for (var i = 0; i < list.length; i++) {
var x = list[i];
data.push(x.getAttribute("title"))
}
google.script.run.writeToSpreadSheet(data);
}
</script>
</html>
4- Save your gs and html files and go back to your spreadsheet. Reload the spreadsheet. Click "Parse Menu" > "Parse". Then click "Click here to get list" in the sidebar.
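The writeToSpreadSheet function above writes one cell per loop iteration, which can be slow for long lists. A possible alternative, sketched here with a hypothetical helper `toColumn`, is to build the whole column first and write it with a single `setValues` call:

```javascript
// Hypothetical helper: turn a flat list into the rows-by-columns
// array that Range.setValues() expects (one value per row).
function toColumn(data) {
  return data.map(function (value) {
    return [value];
  });
}

// Assumed usage in Apps Script, replacing the loop in writeToSpreadSheet:
// var rows = toColumn(data);
// if (rows.length > 0) {
//   sheet.getRange(1, 1, rows.length, 1).setValues(rows);
// }
```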
Xml.parse() has an option to turn on lenient parsing, which helps when parsing HTML. Note, however, that the Xml service is deprecated, and the newer XmlService doesn't have this functionality.
For simple tasks such as grabbing one value from a webpage, you could use a regular expression. Regex is notoriously bad for parsing HTML, as there are all sorts of odd cases that can trip it up, but if you're confident about the HTML you're accessing, this can sometimes be the simplest way.
Here's an example that fetches the contents of the page's <title> tag:
var page = UrlFetchApp.fetch(contestURL);
var regExp = new RegExp("<title>(.*)</title>", "gi");
var result = regExp.exec(page.getContentText());
// [1] is the match group when using parenthesis in the pattern
var value = result ? result[1] : 'No title found';
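One caveat with the pattern above: `(.*)` is greedy and does not cross line breaks, so it can over-match or miss a title that spans several lines. Here is a small variation as a sketch; the function name `extractTitle` is only illustrative:

```javascript
// Illustrative helper: return the text of the first <title> element, or null.
function extractTitle(html) {
  // [\s\S]*? matches lazily across line breaks, so it stops at the first </title>.
  var match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match ? match[1].trim() : null;
}

// Assumed usage in Apps Script:
// var value = extractTitle(UrlFetchApp.fetch(contestURL).getContentText()) || 'No title found';
```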
I know it is not exactly what the OP asked, but I found this question when I was looking for HTML parsing options, so it might be useful for others as well.
There is an easy-to-use library for text parsing. It's useful if you want to get only one piece of information from the HTML (or XML) code.
EDIT 2021: The script library id is:
1Mc8BthYthXx6CoIz90-JiSzSafVnT6U3t0z_W3hLTAX5ek4w0G_EIrNw
It works like this:
function getData() {
var url = "https://chrome.google.com/webstore/detail/signaturesatori-central-s/fejomcfhljndadjlojamaklegghjnjfn?hl=en";
var fromText = '<span class="e-f-ih" title="';
var toText = '">';
var content = UrlFetchApp.fetch(url).getContentText();
var scraped = Parser
.data(content)
.from(fromText)
.to(toText)
.build();
Logger.log(scraped);
return scraped;
}
If you are using the Cheerio library for Google Apps Script (see its source code and library page), install it by library ID:
1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
A function to get current emojis from unicode.org:
function getEmojis() {
var t = new Date();
var url = 'https://unicode.org/emoji/charts/full-emoji-list.html';
var fetch = UrlFetchApp.fetch(url);
var contentText = fetch.getContentText();
//console.log(new Date() - t);
// Cheerio
var $ = Cheerio.load(contentText);
var data = [];
$("table > tbody > tr").each((index, element) => {
var row = [];
$(element).find("td").each((index, child) => {
row.push($(child).text());
});
if (row.length > 0) {
data.push(row);
}
});
//console.log(data);
//console.log(new Date() - t);
// Result
return data;
}
↑ The sample code shows how to parse a table and put it into a [[2D array]].
May be used as a custom function:
Bonus
Parsing the site may be a time-consuming operation, and you may reach the quota limit.
Here's a test file with a full version of the script:
https://docs.google.com/spreadsheets/d/1iO7YjYWyfseQu_YCfRbGDPg7NskOgMu_iO1iGjr7KxY/edit#gid=93365395
↑ It uses CacheService to reduce the number of calls.
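The caching idea can be sketched roughly as follows. This is not the code from the linked spreadsheet; `getCached` is a hypothetical helper, and the cache argument is any object with `get` and `put` methods shaped like the one returned by `CacheService.getScriptCache()`:

```javascript
// Hypothetical helper: return cached data if present, otherwise compute and store it.
// cache: object with get(key) and put(key, value, seconds), e.g. CacheService.getScriptCache()
// loader: function that produces the data when the cache misses
function getCached(cache, key, loader, ttlSeconds) {
  var hit = cache.get(key);
  if (hit !== null && hit !== undefined) {
    // Cached values are strings, so the data is stored as JSON.
    return JSON.parse(hit);
  }
  var fresh = loader();
  cache.put(key, JSON.stringify(fresh), ttlSeconds);
  return fresh;
}

// Assumed usage in Apps Script:
// var data = getCached(CacheService.getScriptCache(), 'emojis', getEmojis, 3600);
```

CacheService limits the size of each cached value, so a table as large as the full emoji list may need to be trimmed or split across several keys.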
Natively there's no way, unless you do what you already tried, which won't work if the HTML doesn't conform to the XML format.
There are two options
a) One is to use JavaScript's string functions. First locate your tag using string.indexOf() and then extract the data you want using string.substring().
b) The other option is to make use of the Xml Service.
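Option (a) could look something like this. The helper name `extractBetween` is hypothetical, and it returns only the first match:

```javascript
// Hypothetical helper: return the text between the first occurrence of fromText
// and the next occurrence of toText, or null if either marker is missing.
function extractBetween(content, fromText, toText) {
  var start = content.indexOf(fromText);
  if (start === -1) {
    return null;
  }
  start += fromText.length;
  var end = content.indexOf(toText, start);
  if (end === -1) {
    return null;
  }
  return content.substring(start, end);
}

// Assumed usage in Apps Script:
// var html = UrlFetchApp.fetch(url).getContentText();
// var title = extractBetween(html, '<title>', '</title>');
```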
It's not possible to create an HTML DOM server-side in Apps Script. Using regular expressions is likely your best option, at least for simple parsing.
I'm working on some code that needs to parse numerous files that contain fragments of HTML. It seems that jQuery would be very useful for this, but when I try to load jQuery into something like WScript or CScript, it throws an error because of jQuery's many references to the window object.
What practical way is there to use jQuery in code that runs without a browser?
Update: In response to the comments, I have successfully written JavaScript code to read the contents of files using new ActiveXObject('Scripting.FileSystemObject');. I know that ActiveX is evil, but this is just an internal project to get some data out of some files that contain HTML fragments and into a proper database.
Another Update: My code so far looks about like this:
var fileIo, here;
fileIo = new ActiveXObject('Scripting.FileSystemObject');
here = unescape(fileIo.GetParentFolderName(WScript.ScriptFullName) + "\\");
(function() {
var files, thisFile, thisFileName, thisFileText;
for (files = new Enumerator(fileIo.GetFolder(here).files); !files.atEnd(); files.moveNext()) {
thisFileName = files.item().Name;
thisFile = fileIo.OpenTextFile(here + thisFileName);
thisFileText = thisFile.ReadAll();
// I want to do something like this:
s = $(thisFileText).find('input#txtFoo').val();
}
})();
Update: I posted this question on the jQuery forums as well: http://forum.jquery.com/topic/how-to-use-jquery-without-a-browser#14737000003719577
Following along with your code, you could create an instance of IE using Windows Script Host, load your html file in to the instance, append jQuery dynamically to the loaded page, then script from that.
This works in IE8 on XP, but I'm aware of some security issues in Windows 7/IE9. If you run into problems, you could try lowering your security settings.
var fileIo, here, ie;
fileIo = new ActiveXObject('Scripting.FileSystemObject');
here = unescape(fileIo.GetParentFolderName(WScript.ScriptFullName) + "\\");
ie = new ActiveXObject("InternetExplorer.Application");
ie.visible = true
function loadDoc(src) {
var head, script;
ie.Navigate(src);
while(ie.busy){
WScript.sleep(100);
}
head = ie.document.getElementsByTagName("head")[0];
script = ie.document.createElement('script');
script.src = "http://ajax.googleapis.com/ajax/libs/jquery/1.8.3/jquery.min.js";
head.appendChild(script);
return ie.document.parentWindow;
}
(function() {
var files, thisFile, win;
for (files = new Enumerator(fileIo.GetFolder(here).files); !files.atEnd(); files.moveNext()) {
thisFile = files.item();
if(fileIo.GetExtensionName(thisFile)=="htm") {
win = loadDoc(thisFile);
// your jQuery reference = win.$
WScript.echo(thisFile + ": " + win.$('input#txtFoo').val());
}
}
})();
This is pretty easy to do in Node.js with the cheerio package. You can read in arbitrary HTML from whatever source you want, parse it with cheerio and then access the parsed elements using jQuery style selectors.
I'm trying to get a window's XUL text as a String in Javascript. I need it to be done at runtime because the window adds/removes UI elements dynamically.
I have tried the following:
document.toXML()
document.xml
document.documentElement.toXML()
Among other things. Nothing works! Can anyone help?
You use XMLSerializer:
new XMLSerializer().serializeToString(document);
I don't think there is a function or field that returns the XUL text, but you can work around this by reading the content from the XUL URL:
function getContentFromURL(url) {
var Cc = Components.classes;
var Ci = Components.interfaces;
var ioService = Cc['@mozilla.org/network/io-service;1'].getService(Ci.nsIIOService);
var scriptableStream = Cc['@mozilla.org/scriptableinputstream;1'].getService(Ci.nsIScriptableInputStream);
var channel = ioService.newChannel(url, null, null);
var input = channel.open();
scriptableStream.init(input);
return scriptableStream.read(input.available());
}
So you can call getContentFromURL(document.location) to get the XUL content.