I have filled a Google spreadsheet with around 500 URLs and XPaths, but I have discovered that IMPORTXML has some drawbacks: it gets perpetual loading errors, even when there are only 10 or so functions running. So I am looking for another way to populate the sheet. My first attempt was an iterative script that simply wrote an IMPORTXML formula into a working cell and then wrote in the resulting value for each URL. I thought that by having only one IMPORTXML running at a time it would work fine, but it still gets perpetual loading errors.
Sample sheet:
https://docs.google.com/spreadsheets/d/1QgW4LVkB_oraO9gdS5DsnNta3GVlqsH0_uC1QP0iE7w/edit?usp=sharing
(Note: the sample sheet actually works OK with the iterative IMPORTXML script, though it still returns some errors. I suspect there must be some limit on historical IMPORTXML calls, not just the ones currently on the sheet, because my main sheet now has real problems handling even a few.)
Is there a simple script that will work? I have tried variations using URLFetch, xml.evaluate, and xmlService, but with my limited knowledge I can't get them to work.
Any guidance much appreciated.
Thanks!
Here's a working method - I tested it for you.
Add this function above the one you currently have in your Apps Script:
function importprice(url) {
  var content = '';
  var response = UrlFetchApp.fetch(url);
  if (response) {
    var html = response.getContentText();
    // Pull the text out of the price span; guard against pages where it's missing
    var found = html && html.match(/<span id="product_price" itemprop="price">(.*)<\/span>/i);
    if (found) content = found[1];
  }
  return content;
}
Then replace your IMPORTXML line, which currently looks like this:
var cellFunction1 = '=IMPORTXML("' + sheet.getRange(row,4).getValue() + '?' + queryString + '","' + sheet.getRange(row,5).getValue() + '")';
with this:
var cellFunction1 = importprice(sheet.getRange(row,4).getValue());
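If you'd rather fill the whole column in one pass without any formulas, here's a minimal sketch of a batch loop built on importprice() - I'm assuming, based on your snippet, that the URLs are in column D starting at row 2, and I'm writing the results to column F (a placeholder; adjust to your layout):

function fillPrices() {
  var sheet = SpreadsheetApp.getActiveSheet();
  var lastRow = sheet.getLastRow();
  for (var row = 2; row <= lastRow; row++) {
    var url = sheet.getRange(row, 4).getValue(); // column D holds the URL
    try {
      sheet.getRange(row, 6).setValue(importprice(url)); // column F gets the plain value
    } catch (e) {
      sheet.getRange(row, 6).setValue('ERROR: ' + e.message); // skip bad URLs, keep going
    }
    Utilities.sleep(500); // be gentle with the target server
  }
}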
tl;dr - After exporting a Google Doc as an HTML file and pasting the HTML into a Gmail draft, the draft does not contain the formatting from the original Google Doc (other than hyperlinks).
Code snippet:
// Exports the doc in HTML format
var htmlExport = "https://docs.google.com/feeds/download/documents/export/Export?id=" + docID + "&exportFormat=html";
var param = {
  method: "get",
  headers: {"Authorization": "Bearer " + ScriptApp.getOAuthToken()},
  muteHttpExceptions: true,
};
var htmlExportText = UrlFetchApp.fetch(htmlExport, param).getContentText();

// The variables below (contactEmail & emailSubject) are both taken from a spreadsheet.
// Copies the recent draft body into a new email, then prepends the HTML export.
var draftEmailBody = GmailApp.getMessageById(draftEmailID).getBody();
var draftToSend = GmailApp.createDraft(contactEmail, emailSubject, '', {htmlBody: htmlExportText + draftEmailBody}).getMessageId();
Long version:
I am building a mail merge that pulls contact info from a GSheet and uses a GDoc as the template for the body. The GDoc has several bits of formatting in it (bold, italics, superscript) that, when it is exported as HTML using the script above, appear in the Gmail draft devoid of formatting (for some reason the hyperlinks survive). Oddly enough, it even keeps the images from the doc!
The Gmail draft pulled into the body (draftEmailBody) does, however, keep all its formatting. I can only assume this means I'm doing something wrong by using getContentText, but I don't know how else to go about it.
(This is completely separate and I should probably just make another question for this, but I'm here so...)
Separately, I wanted to have the script edit specific fields within the GDoc template, but I have run into 2 issues.
Problem 1 - I have found no way to replace specific text within a Gmail draft.
Workaround 1 - I have the script edit the text in a GDoc instead, using replaceText. This, however, leads to:
Problem 2 - Using replaceText in a GDoc requires you to call saveAndClose before the script can recognize the change. For some reason I can never get my script to open the GDoc again, despite including openById in various places of the script!
Workaround 2 - I create a copy of the doc for each contact, replace the text within that copy, then trash all of the copies on completion so there's no clutter. Quite clunky and slow, but it gets the job done.
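For anyone curious, here's a minimal sketch of that workaround; templateDocId, contactName and the {{name}} placeholder are hypothetical stand-ins, not the names from my real script:

function buildHtmlForContact(templateDocId, contactName) {
  // Copy the template so the original doc is never modified
  var copyFile = DriveApp.getFileById(templateDocId).makeCopy('tmp-' + contactName);
  var copyDoc = DocumentApp.openById(copyFile.getId());
  copyDoc.getBody().replaceText('{{name}}', contactName); // hypothetical placeholder
  copyDoc.saveAndClose(); // required before the export can see the change

  // Export the edited copy as HTML
  var url = "https://docs.google.com/feeds/download/documents/export/Export?id=" + copyFile.getId() + "&exportFormat=html";
  var html = UrlFetchApp.fetch(url, {
    headers: {"Authorization": "Bearer " + ScriptApp.getOAuthToken()},
    muteHttpExceptions: true
  }).getContentText();

  copyFile.setTrashed(true); // clean up the temporary copy
  return html;
}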
While it's not the prettiest solution, I found something that helps:
Google Scripts: Generating Email from Docs Loses Formatting
I am using Google Apps Script and I am relatively new to coding. I am trying to send an email with an HTML body, but all of the formatting is dropped in Gmail.
To fix it, I came across the Juice inliner tool, which moves the CSS stylesheet rules inline into the HTML source. But I have no clue how to use it in Google Apps Script. Please help me with some guidance or any reference code on this.
So far I have the code to convert a Google Doc into an HTML page. Now I want to feed the HTML code finalhtml through the Juice inliner tool and return that result instead of finalhtml.
function doc2html(googleDocId) {
  // Build the export URL that returns the document as HTML
  const exporturl = 'https://docs.google.com/feeds/download/documents/export/Export?id=' + googleDocId + '&exportFormat=html';
  // Set the required fetch parameters for UrlFetchApp
  const fetchParam = {
    method: "get",
    headers: {
      "Authorization": "Bearer " + ScriptApp.getOAuthToken()
    },
    muteHttpExceptions: true
  };
  var finalhtml = UrlFetchApp.fetch(exporturl, fetchParam).getContentText(); // holds the actual HTML code of the doc
  GmailApp.sendEmail('mailId', 'Subject', finalhtml, {htmlBody: finalhtml});
}
I use this script to scrape data from a website every 15 minutes. I want the script to automatically remove the IMPORTXML formula and keep the value only, but I haven't been able to achieve that yet.
function fetchData() {
  var wrkBk = SpreadsheetApp.getActiveSpreadsheet();
  var wrkSht = wrkBk.getSheetByName("Sheet1");
  var url = "https://coinmarketcap.com/currencies";
  for (var i = 2; i <= 6; i++) {
    var coin = wrkSht.getRange('A' + i).getValue();
    var formula = "=IMPORTXML(" + String.fromCharCode(34) + url + "/" + coin + String.fromCharCode(34) + "," + String.fromCharCode(34) + "//span[@class='cmc-details-panel-price__price']" + String.fromCharCode(34) + ")";
    wrkSht.getRange('C' + i).activate();
    wrkSht.getActiveRangeList().clear({contentsOnly: true, skipFilteredRows: true});
    wrkSht.getRange('C' + i).setFormula(formula);
    Utilities.sleep(1000);
  }
}
I tried putting this snippet before Utilities.sleep(1000);, but still without success.
First try
var range = wrkSht.getRange('C'+i);
range.copyTo(range, {contentsOnly: true});
Second try
var range = wrkSht.getCurrentCell();
range.copyTo(range, {contentsOnly: true});
This is my Google Spreadsheet
https://docs.google.com/spreadsheets/d/1vykBSNJQ9xO23jA1ZT8fQAjfmtUQOQTqzQXFfCqz8oQ/edit?usp=sharing
Hope someone can help me. Thank you!
By default, Google Apps Script doesn't apply the changes made by the code until the execution ends. Use SpreadsheetApp.flush() to force the changes to be applied before doing the copy/paste-as-values-only operation.
Instead of
var range = wrkSht.getCurrentCell();
Use
SpreadsheetApp.flush(); // This force to apply the previous changes (add the formula)
Utilities.sleep(30000); // This is required to wait for the spreadsheet to be recalculated (importxml import the data)
var range = wrkSht.getDataRange(); // This is in case that you want to paste the whole sheet as values
Instead of sleep you could use a loop to poll the spreadsheet until the spreadsheet is recalculated.
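A rough sketch of that polling idea, placed inside your loop right after the setFormula call (the 'Loading...' sentinel and the 30-second cap are assumptions; adjust them to what your sheet actually shows while IMPORTXML runs):

SpreadsheetApp.flush(); // apply the formula that was just set
var cell = wrkSht.getRange('C' + i);
// Poll until the formula has recalculated, for up to ~30 seconds
for (var tries = 0; tries < 30; tries++) {
  var value = cell.getValue();
  if (value !== '' && value !== 'Loading...') break; // data has arrived
  Utilities.sleep(1000);
  SpreadsheetApp.flush();
}
cell.copyTo(cell, {contentsOnly: true}); // freeze the imported value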
NOTE: Whenever possible we should avoid using Google Apps Script classes and methods inside loops, because they are (extremely?) slow, and the execution time limit is small for free accounts (6 mins) and not that big for G Suite accounts (30 mins). The official docs explain this, and we have several questions about it here.
Resources
Best Practices | Google Apps Script
I've seen some answers to this that refer the asker to other libraries (like phantom.js), but I'm here wondering if it is at all possible to do this in just Node.js.
Consider my code below. It requests a webpage using request, then uses cheerio to explore the DOM and scrape the page for data. It works flawlessly, and if everything had gone as planned, I believe it would have output a file just as I imagined it.
The problem is that the page I am requesting builds the table I'm looking at asynchronously, using either AJAX or JSONP; I'm not entirely sure how .jsp pages work.
So here I am, trying to find a way to "wait" for this data to load before I scrape it for my new file.
var cheerio = require('cheerio'),
    request = require('request'),
    fs = require('fs');

// Go to the page in question
request({
  method: 'GET',
  url: 'http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp'
}, function(err, response, body) {
  if (err) return console.error(err);
  // Tell Cheerio to load the HTML
  var $ = cheerio.load(body);
  // Create an empty object to write to the file later
  var toSort = {};
  // Iterate over the DOM and fill the toSort object
  $('#emb table td.list_right').each(function() {
    var row = $(this).parent();
    toSort[$(this).text()] = {
      [$("#lastdate").text()]: $(row).find(".idx1").html(),
      [$("#currdate").text()]: $(row).find(".idx2").html()
    };
  });
  // Write/overwrite a new file
  var stream = fs.createWriteStream("/tmp/shipping.txt");
  var toWrite = "";
  stream.once('open', function(fd) {
    toWrite += "{\r\n";
    for (var i in toSort) {
      toWrite += "\t" + i + ": { \r\n";
      for (var j in toSort[i]) {
        toWrite += "\t\t" + j + ":" + toSort[i][j] + ",\r\n";
      }
      toWrite += "\t" + "}, \r\n";
    }
    toWrite += "}";
    stream.write(toWrite);
    stream.end();
  });
});
The expected result is a text file with information formatted like a JSON object.
It should contain several entries that look like this:
"QINHUANGDAO - GUANGZHOU (50,000-60,000DWT)": {
"2016-09-29": 26.7,
"2016-09-30": 26.8,
},
But since the name is the only thing that doesn't load asynchronously (the dates and values do), I get a messed-up object.
Actually, I tried just putting a setTimeout in various places in the code. The script will only be touched by developers who can afford to re-run it if it fails a few times, so while not ideal, even a setTimeout (of up to maybe 5 seconds) would be good enough.
It turns out the setTimeouts don't work. I suspect that once I request the page, I'm stuck with a snapshot of the page "as is" when I receive it, and I'm in fact not looking at a live thing I can wait on while it loads its dynamic content.
I've considered investigating how to intercept the packets as they come in, but I don't understand HTTP well enough to know where to start.
The setTimeout will not make any difference even if you increase it to an hour. The problem here is that you are making a request against this URL:
http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp
and their server returns the HTML, and in this HTML there are the JS and CSS imports. For your case that is the end of the story: you just have the HTML, and that's it. A browser, by contrast, knows how to parse the HTML document, understand the JavaScript it references, and execute/run it, and this is exactly your problem: your program does nothing with the scripts that the HTML contents point to. You need to find or write a scraper that is able to run JavaScript. I just found this similar issue on Stack Overflow:
Web-scraping JavaScript page with Python
The guy there suggests https://github.com/niklasb/dryscrape, and it seems that this tool is able to run JavaScript. It is written in Python, though.
You are trying to scrape the original page, which doesn't include the data you need.
When the page is loaded, the browser evaluates the JS code it includes, and this code knows where and how to get the data.
The first option is to evaluate the same code, like PhantomJS does.
The other (and you seem to be interested in it) is to investigate the page's network activity and understand which additional requests you should perform to get the data you need.
In your case, these are:
http://index.chineseshipping.com.cn/servlet/cbfiDailyGetContrast?SpecifiedDate=&jc=jsonp1475577615267&_=1475577619626
and
http://index.chineseshipping.com.cn/servlet/allGetCurrentComposites?date=Tue%20Oct%2004%202016%2013:40:20%20GMT+0300%20(MSK)&jc=jsonp1475577615268&_=1475577620325
In both requests:
_ is a cache-busting parameter to prevent caching.
jc is a name of a JS wrapper function which should be invoked with the result (https://en.wikipedia.org/wiki/JSONP)
So, by scraping the table template at http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp and performing the two additional requests, you will be able to combine everything into the same data structure you see in the browser.
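As a rough sketch of that second approach, reusing the request library from the question (the callback name cb is arbitrary; I'm assuming the servlet echoes back whatever jc value it is given, and the decache parameter is simplified):

var request = require('request');

var url = 'http://index.chineseshipping.com.cn/servlet/cbfiDailyGetContrast' +
          '?SpecifiedDate=&jc=cb&_=' + Date.now();

request({method: 'GET', url: url}, function(err, response, body) {
  if (err) return console.error(err);
  // The body looks like cb({...}); strip the JSONP wrapper to get plain JSON
  var json = body.replace(/^[^(]*\(/, '').replace(/\);?\s*$/, '');
  var data = JSON.parse(json);
  console.log(data); // merge this with the table template scraped from the HTML page
});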
Hi, I have 4 Google Sheets using IMPORTXML (because of Google's 50-import limit), each of which is sorting and then feeding data from a webpage to another 'summary' sheet. I just need a script for the 4 Google Sheets that refreshes the IMPORTXML every minute or so.
Either that, or something that refreshes the IMPORTXML when the specified information on the target (source) web page changes.
Also, as this would be used from a mobile device some of the time, would the 4 sheets have to be kept open?
I eventually came across this (sorry, I can't find the OP for it now) - it seems to do the trick:
function getData() {
  var queryString = Math.random();
  var cellFunction1 = '=IMPORTXML("' + SpreadsheetApp.getActiveSheet().getRange('D1').getValue() + '?' + queryString + '","' + SpreadsheetApp.getActiveSheet().getRange('E1').getValue() + '")';
  SpreadsheetApp.getActiveSheet().getRange('D2').setValue(cellFunction1);
}