script to get source code of website (js) - javascript

My school blocked CTRL + U, but you can use 'view-source:' before a link to view the code. It takes a while, so I've been trying to make a script that automatically redirects to the source code. However, I keep getting errors because 'view-source:...' is not a real link.
I have tried the following:
var code = fetch(`view-source:https://${location.hostname}${location.pathname}`);
location.href = (code);
and
var code = (`view-source:https://${location.hostname}${location.pathname}`);
location.href = (code);
In the first one, I see a bad request, and in the second, I see a blank page with the words "view-source:" followed by the link.

view-source: isn't a real protocol you can fetch().
However, just
var resp = await fetch('http://...');
var text = await resp.text();
document.body.textContent = text;
should replace the current document's body with the text contents of that URL...
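If top-level await isn't available (it works in the console and in module scripts, but not in a classic script tag), the same idea can be wrapped in an async function. A minimal sketch, assuming the target is the current, same-origin page:
// Fetch the current page and display its raw HTML as plain text.
// Same-origin only: cross-origin URLs will hit CORS, as the next answer notes.
(async () => {
  const resp = await fetch(location.href);
  const text = await resp.text();
  document.body.textContent = text;
})();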

If you try to fetch the source code from the frontend, you will run into CORS problems. But you can use a proxy, as in the example below:
fetch('https://api.codetabs.com/v1/proxy?quest=https://stackoverflow.com/questions/75440023/script-to-get-source-code-of-website-js#75440023')
  .then((response) => response.text())
  .then((text) => console.log(text));
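Combining that proxy with the original goal might look like this; a minimal sketch that assumes the proxy service is still available:
// Fetch the current page's HTML through the CORS proxy and show it as text.
fetch('https://api.codetabs.com/v1/proxy?quest=' + encodeURIComponent(location.href))
  .then((response) => response.text())
  .then((text) => { document.body.textContent = text; });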

Related

Why does my request-promise function work in Chrome but not Firefox or Safari?

I'm trying to use a script for a webpage that scrapes and then parses an RSS feed using request-promise. From what I understand, request-promise has to run server-side, so I'm calling the script externally in my html. I've been using localStorage.setItem and localStorage.getItem to pass the variables between the external script and another embedded script that assigns the variables to html elements that should be displayed throughout the page as text.
In Chrome, when the page first loads, it's blank, but then works fine after I refresh it once. In Firefox and Safari, it remains blank no matter how many times I refresh. The issue seems to be that the html is not "receiving" the variables from localStorage.getItem. I suspect it has something to do with the way the different browsers are handling the request-promise function, but I'm stumped. What am I doing wrong?
Here's the request-promise script:
var xmltext
var latlonquery, latquery, lonquery, url

const rp = require('request-promise')
const options = {
  headers: {'user-agent': 'node.js'}
}

function doQuery() {
  latlonquery = new URLSearchParams(window.location.search)
  latquery = latlonquery.get("lat")
  lonquery = latlonquery.get("lon")
  url = "https://forecast.weather.gov/MapClick.php?lat=" + latquery + "&lon=" + lonquery + "&FcstType=digitalDWML"
}

function doRequest() {
  return rp(url, options).then(function(html){
    return html
  })
}

doQuery()
doRequest().then(function(html){
  xmltext = html
  localStorage.setItem("xml_says_this", xmltext)
  localStorage.setItem("lat_says_this", latquery)
  localStorage.setItem("lon_says_this", lonquery)
})
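The receiving side isn't shown in the question, but it presumably looks something like this hypothetical embedded script (the element IDs are invented for illustration):
// Hypothetical embedded script: read the stored values and display them.
window.addEventListener('DOMContentLoaded', function () {
  document.getElementById('lat').textContent = localStorage.getItem('lat_says_this');
  document.getElementById('lon').textContent = localStorage.getItem('lon_says_this');
  // If this runs before the async request above has finished, the values are
  // still null -- consistent with the blank first load described in the question.
});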
You can see an example of the problem in action here.

Node.js: requesting a page and allowing the page to build before scraping

I've seen some answers to this that refer the asker to other libraries (like phantom.js), but I'm here wondering if it is at all possible to do this in just Node.js?
Consider my code below. It requests a webpage using request, then uses cheerio to explore the DOM and scrape the page for data. It works flawlessly, and if everything had gone as planned, I believe it would have output a file as I imagined it in my head.
The problem is that the page I am requesting builds the table I'm looking at asynchronously, using either AJAX or JSONP; I'm not entirely sure how .jsp pages work.
So here I am, trying to find a way to "wait" for this data to load before I scrape it for my new file.
var cheerio = require('cheerio'),
    request = require('request'),
    fs = require('fs');

// Go to the page in question
request({
  method: 'GET',
  url: 'http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp'
}, function(err, response, body) {
  if (err) return console.error(err);
  // Tell Cheerio to load the HTML
  $ = cheerio.load(body);
  // Create an empty object to write to the file later
  var toSort = {}
  // Iterate over the DOM and fill the toSort object
  $('#emb table td.list_right').each(function() {
    var row = $(this).parent();
    toSort[$(this).text()] = {
      [$("#lastdate").text()]: $(row).find(".idx1").html(),
      [$("#currdate").text()]: $(row).find(".idx2").html()
    }
  });
  // Write/overwrite a new file
  var stream = fs.createWriteStream("/tmp/shipping.txt");
  var toWrite = "";
  stream.once('open', function(fd) {
    toWrite += "{\r\n"
    for (i in toSort) {
      toWrite += "\t" + i + ": { \r\n";
      for (j in toSort[i]) {
        toWrite += "\t\t" + j + ":" + toSort[i][j] + ",\r\n";
      }
      toWrite += "\t" + "}, \r\n";
    }
    toWrite += "}"
    stream.write(toWrite)
    stream.end();
  });
});
The expected result is a text file with information formatted like a JSON object.
It should look something like different instances of this
"QINHUANGDAO - GUANGZHOU (50,000-60,000DWT)": {
 "2016-09-29": 26.7,
"2016-09-30": 26.8,
},
But since the name is the only thing that doesn't load asynchronously (the dates and values do), I get a messed-up object.
I actually tried just setting a setTimeout in various places in the code. The script will only be touched by developers who can afford to run it several times if it fails a few times, so while not ideal, even a setTimeout (up to maybe 5 seconds) would be good enough.
It turns out the setTimeouts don't work. I suspect that once I request the page, I'm stuck with a snapshot of the page "as is" when I receive it; I'm not in fact looking at a live thing I can wait on to load its dynamic content.
I've considered investigating how to intercept the packets as they come in, but I don't understand HTTP well enough to know where to start.
The setTimeout will not make any difference even if you increase it to an hour. The problem here is that you are making a request against this URL:
http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp
and their server returns the HTML, and in this HTML there are the JS and CSS imports. That is where your case ends: you just have the HTML, and that's it. The browser, by contrast, knows how to parse the HTML document, understand the JavaScript it references, and execute it, and this is exactly your problem: your program never runs the JavaScript that builds the table. You need to find or write a scraper that is able to run JavaScript. I just found this similar issue on Stack Overflow:
Web-scraping JavaScript page with Python
The guy there suggests https://github.com/niklasb/dryscrape and it seems that this tool is able to run JavaScript. It is written in Python though.
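A JavaScript-side equivalent today would be a headless browser such as Puppeteer. This is an illustration added here, not the answerer's suggestion; a minimal sketch:
// npm install puppeteer
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // networkidle0 waits until the page's async requests have settled.
  await page.goto('http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp',
                  { waitUntil: 'networkidle0' });
  const html = await page.content(); // the DOM after the page's JS has run
  console.log(html);
  await browser.close();
})();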
You are trying to scrape the original page, which doesn't include the data you need.
When the page is loaded, the browser evaluates the JS code it includes, and this code knows where and how to get the data.
The first option is to evaluate the same code, like PhantomJS does.
The other (and you seem to be interested in it) is to investigate the page's network activity and understand what additional requests you should perform to get the data you need.
In your case, these are:
http://index.chineseshipping.com.cn/servlet/cbfiDailyGetContrast?SpecifiedDate=&jc=jsonp1475577615267&_=1475577619626
and
http://index.chineseshipping.com.cn/servlet/allGetCurrentComposites?date=Tue%20Oct%2004%202016%2013:40:20%20GMT+0300%20(MSK)&jc=jsonp1475577615268&_=1475577620325
In both requests:
_ is a cache-busting parameter to prevent caching.
jc is the name of a JS wrapper function which should be invoked with the result (https://en.wikipedia.org/wiki/JSONP).
So, by scraping the table template at http://www1.chineseshipping.com.cn/en/indices/cbcfinew.jsp and performing the two additional requests, you will be able to combine the results into the same data structure you see in the browser.
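A minimal sketch of performing one of those requests and unwrapping the JSONP payload, using the same request module as the question; the endpoint's continued availability and response shape are assumptions:
var request = require('request');

// The jc value mirrors the query string captured above; _ busts the cache.
var url = 'http://index.chineseshipping.com.cn/servlet/cbfiDailyGetContrast' +
          '?SpecifiedDate=&jc=jsonp1475577615267&_=' + Date.now();

request(url, function (err, response, body) {
  if (err) return console.error(err);
  // The body looks like: jsonp1475577615267({...});
  // Strip the wrapper function call to keep only the JSON.
  var json = body.replace(/^[^(]*\(/, '').replace(/\);?\s*$/, '');
  console.log(JSON.parse(json));
});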

Loading a JavaScript file dynamically

I was asked to add this code to my pitch pages by the vendor I sell through:
<script>
(function() {
  var p = '/?vendor=2knowmysel&time=' + new Date().getTime();
  var cb = document.createElement('script'); cb.type = 'text/javascript';
  cb.src = '//header.clickbank.net' + p;
  document.getElementsByTagName('head')[0].appendChild(cb);
})();
</script>
The code should let the page load within a header that has a red ClickBank logo. When I added the code in the head section, nothing happened.
Next I tried to isolate the problem by putting the code on a blank HTML page (away from Drupal), which is http://www.2knowmyself.com/testpage.htm.
But the frame doesn't show up.
What's wrong here, given that ClickBank claims the code is perfect?
Here's what your code does:
<script>
// the following line creates an anonymous immediately-invoked function
(function() {
  // this builds a string named 'p' containing the vendor's ID and the current time
  var p = '/?vendor=2knowmysel&time=' + new Date().getTime();
  // this creates a new 'script' element, names it 'cb', and marks it as JavaScript
  var cb = document.createElement('script'); cb.type = 'text/javascript';
  // this takes the URL plus the query string 'p' built earlier and sets it as the source of 'cb'
  cb.src = '//header.clickbank.net' + p;
  // now 'cb' is inserted into the HTML head, so the browser loads that JavaScript file
  document.getElementsByTagName('head')[0].appendChild(cb);
// the function doesn't run automatically upon declaration, so the trailing parentheses invoke it immediately
})();
</script>
Summing it up, it basically sends the vendor's ID and the current time to the given server, and expects a JavaScript file in return; it then loads this file into your HTML document.
Currently it seems not to be working because the server is receiving the information from your page but not sending the JavaScript file back. When they adjust it to answer with the right file, you'll see it run accordingly.
EDIT (to answer your final question):
Up to this point, I can see that their server isn't sending the expected JS file back to your page, so it doesn't work. If you want to check this yourself, use a JS debugger or a network monitor in your browser (most modern web browsers have these built in; try pressing F12 and then reloading the page).
If you want to check whether iframes work on your server, you may contact its administrator or try to embed an iframe in the page yourself. Paste the following code into the document. If you see the SO homepage, it works; otherwise it'll show nothing. If you see "Your browser does not support iframes.", you might have to update your web browser and check again.
<iframe src="http://stackoverflow.com" width="300" height="300">
<p>Your browser does not support iframes.</p>
</iframe>

What is the best way to parse html in google apps script

var page = UrlFetchApp.fetch(contestURL);
var doc = XmlService.parse(page);
The above code gives a parse error when used; however, if I replace the XmlService class with the deprecated Xml class with the lenient flag set, it parses the HTML properly.
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
The problem is mostly caused by the lack of CDATA in the JavaScript part of the HTML, and the parser complains with the following error:
The entity name must immediately follow the '&' in the entity reference.
Even if I remove all the <script>(.*?)</script> blocks using regex, it still complains because the <br> tags aren't closed.
Is there a clean way of parsing HTML into a DOM tree?
I ran into this exact same problem. I was able to circumvent it by first using the deprecated Xml.parse, since it still works, then selecting the body XmlElement, then passing its XML string into the new XmlService.parse method:
var page = UrlFetchApp.fetch(contestURL);
var doc = Xml.parse(page, true);
var bodyHtml = doc.html.body.toXmlString();
doc = XmlService.parse(bodyHtml);
var root = doc.getRootElement();
Note: This solution may not work if the old Xml.parse is completely removed from Google Scripts.
In 2021, the best way to parse HTML on the .gs side that I know of is...
Click + next to Library
Enter 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
Click "Look up"
Click Add
Sample usage:
const contentText = UrlFetchApp.fetch('https://www.somesite.com/').getContentText();
const $ = Cheerio.load(contentText);
$('.some-class').first().text();
That's it -- this is probably the closest we'll get to doing jQuery-like DOM selection in GAS. The .first() is important or else you may extract more content than you expected (think of it as using querySelector() instead of querySelectorAll()).
Credit where credit is due: https://github.com/tani/cheeriogs
As of May 2020, you can now use the Cheerio library for Google Apps Script to do this.
Returns the content of Wikipedia's Main Page
const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('#mp-right').text());
Returns the content of the first paragraph <p> of Wikipedia's Main Page
const content = getContent_('https://en.wikipedia.org');
const $ = Cheerio.load(content);
Logger.log($('p').first().text());
To add to your project:
Select Resources - Libraries... in the Google Apps Script editor. Enter the project key 1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0 in the Add a library field, and click "Add". Select the highest version number, and click "Save".
I found that the best way to parse HTML in Google Apps Script is to avoid using XmlService.parse or Xml.parse; XmlService.parse doesn't work well with bad HTML code from certain websites.
Here is a basic example of how you can parse any website easily without using XmlService.parse or Xml.parse. In this example, I am retrieving a list of presidents from "wikipedia.org/wiki/President_of_the_United_States"
with regular JavaScript document.getElementsByTagName() calls, and pasting the values into my Google Spreadsheet.
1- Create a new Google Sheet;
2- Click the menu Tools > Script editor... to open a new tab with the code editor window and copy the following code into your Code.gs:
function onOpen() {
  var ui = SpreadsheetApp.getUi();
  ui.createMenu("Parse Menu")
    .addItem("Parse", "parserMenuItem")
    .addToUi();
}

function parserMenuItem() {
  var sideBar = HtmlService.createHtmlOutputFromFile("test");
  SpreadsheetApp.getUi().showSidebar(sideBar);
}

function getUrlData(url) {
  var doc = UrlFetchApp.fetch(url).getContentText()
  return doc
}

function writeToSpreadSheet(data) {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];
  var row = 1
  for (var i = 0; i < data.length; i++) {
    var x = data[i];
    var range = sheet.getRange(row, 1)
    range.setValue(x);
    row = row + 1
  }
}
3- Add an HTML file to your Apps Script project. Open the Script Editor and choose File > New > HTML file, and name it 'test'. Then copy the following code into your test.html:
<!DOCTYPE html>
<html>
  <head>
  </head>
  <body>
    <input id="mButon" type="button" value="Click here to get list"
           onclick="parse()">
    <div hidden id="mOutput"></div>
  </body>
  <script>
    window.onload = onOpen;

    function onOpen() {
      var url = "https://en.wikipedia.org/wiki/President_of_the_United_States"
      google.script.run.withSuccessHandler(writeHtmlOutput).getUrlData(url)
      document.getElementById("mButon").style.visibility = "visible";
    }

    function writeHtmlOutput(x) {
      document.getElementById('mOutput').innerHTML = x;
    }

    function parse() {
      var list = document.getElementsByTagName("area");
      var data = [];
      for (var i = 0; i < list.length; i++) {
        var x = list[i];
        data.push(x.getAttribute("title"))
      }
      google.script.run.writeToSpreadSheet(data);
    }
  </script>
</html>
4- Save your .gs and .html files and go back to your spreadsheet. Reload the spreadsheet, click "Parse Menu" - "Parse", and then click "Click here to get list" in the sidebar.
Xml.parse() has an option to turn on lenient parsing, which helps when parsing HTML. Note that the Xml service is deprecated however, and the newer XmlService doesn't have this functionality.
For simple tasks such as grabbing one value from a webpage, you could use a regular expression. Regex is notoriously bad for parsing HTML, as there are all sorts of weird cases it can get tripped up on, but if you're confident about the HTML you're accessing, it can sometimes be the simplest way.
Here's an example that fetches the contents of the page's <title> tag:
var page = UrlFetchApp.fetch(contestURL);
var regExp = new RegExp("<title>(.*)</title>", "gi");
var result = regExp.exec(page.getContentText());
// [1] is the match group when using parentheses in the pattern
var value = result ? result[1] : 'No title found';
I know it is not exactly what the OP asked, but I found this question when I was looking for HTML parsing options, so it might be useful for others as well.
There is an easy-to-use library for TEXT parsing. It's useful if you want to get only one piece of information from the HTML (XML) code.
EDIT 2021: The script library ID is:
1Mc8BthYthXx6CoIz90-JiSzSafVnT6U3t0z_W3hLTAX5ek4w0G_EIrNw
Example usage:
function getData() {
  var url = "https://chrome.google.com/webstore/detail/signaturesatori-central-s/fejomcfhljndadjlojamaklegghjnjfn?hl=en";
  var fromText = '<span class="e-f-ih" title="';
  var toText = '">';

  var content = UrlFetchApp.fetch(url).getContentText();
  var scraped = Parser
    .data(content)
    .from(fromText)
    .to(toText)
    .build();
  Logger.log(scraped);
  return scraped;
}
If you are using the Cheerio library for Google Apps Script:
Source code
Library page (⭐ star it!)
Installation by library ID:
1ReeQ6WO8kKNxoaA_O0XEQ589cIrRvEBA9qcWpNqdOP17i47u6N9M5Xh0
A function to get current emojis from unicode.org:
function getEmojis() {
  var t = new Date();
  var url = 'https://unicode.org/emoji/charts/full-emoji-list.html';
  var fetch = UrlFetchApp.fetch(url);
  var contentText = fetch.getContentText();
  //console.log(new Date() - t);

  // Cheerio
  var $ = Cheerio.load(contentText);
  var data = [];
  $("table > tbody > tr").each((index, element) => {
    var row = [];
    $(element).find("td").each((index, child) => {
      row.push($(child).text());
    });
    if (row.length > 0) {
      data.push(row);
    }
  });
  //console.log(data);
  //console.log(new Date() - t);

  // Result
  return data;
}
↑ The sample code shows how to parse a table and put it into a 2D array.
It may also be used as a custom function.
Bonus
Parsing a site may be a time-consuming operation, and you may hit quota limits.
Here's a test file with a full version of the script:
https://docs.google.com/spreadsheets/d/1iO7YjYWyfseQu_YCfRbGDPg7NskOgMu_iO1iGjr7KxY/edit#gid=93365395
↑ it uses CacheService to reduce the number of calls.
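A minimal sketch of that caching idea (an assumption about how the linked spreadsheet might do it, not its actual code):
// Cache fetched HTML for 10 minutes to reduce UrlFetchApp calls.
function getCachedContent(url) {
  var cache = CacheService.getScriptCache();
  var cached = cache.get(url);
  if (cached) return cached;
  var content = UrlFetchApp.fetch(url).getContentText();
  // Note: CacheService values are capped at 100 KB, so large pages won't fit.
  cache.put(url, content, 600); // time-to-live in seconds
  return content;
}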
Natively there's no way, unless you do what you already tried, which won't work if the HTML doesn't conform to the XML format.
There are two options:
a) One is to use JavaScript's string functions: first locate your tag using string.indexOf(), then extract the data you want using string.substring() (see the sketch after this list).
b) The other is to make use of the XML Service.
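A minimal sketch of option (a); the function name and the choice of the <title> tag are illustrative only:
// Extract the contents of the page's <title> tag with plain string functions.
function getTitleWithStringFunctions(html) {
  var start = html.indexOf('<title>');
  var end = html.indexOf('</title>');
  if (start === -1 || end === -1) return null;
  // Skip past the opening tag itself.
  return html.substring(start + '<title>'.length, end);
}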
It's not possible to create an HTML DOM server-side in Apps Script. Using regular expressions is likely your best option, at least for simple parsing.

Send and receive data to and from another domain

I'm trying to write a plugin. I cannot use any libraries or frameworks.
On any website (domain), I would like to start a script from my own domain.
For example:
In the code of the website under domain A, I put code that starts the script from domain B:
<script src="http://domain-b.com/myscript.js" type="text/javascript"></script>
The JavaScript code (myscript.js):
type = 'GET';
url = 'http://domain-b.com/echojson.php';
data = ‘var1=1&var2=2’;

_http = new XMLHttpRequest();
_http.open(type, url + '?callback=jsonp123' + '&' + data, true);
_http.onreadystatechange = function() {
  alert(‘Get data: ’ + _http.responseText);
}
_http.send(null);
The script at http://domain-b.com/echojson.php always gives the answer:
jsonp123({answer:”answer string”});
But in the JavaScript console I see an error (200), and the AJAX call doesn't receive anything.
Script loaders like LAB, yepnope, or Frame.js were designed to get around the same-origin policy. They load a script file, so the requested file would have to look like:
response = {answer:”answer string”};
Also, your code as posted here does not work because you are using typographic quotes (‘…’) instead of plain apostrophes for the data variable!
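For completeness, a minimal sketch of the script-tag (JSONP) approach the answer hints at; the callback name jsonp123 mirrors the question's query string, and everything else is an assumption:
// Define the callback that the server's JSONP response will invoke.
function jsonp123(data) {
  alert('Get data: ' + data.answer);
}

// Load the cross-domain endpoint as a script instead of via XMLHttpRequest,
// which sidesteps the same-origin policy.
var s = document.createElement('script');
s.src = 'http://domain-b.com/echojson.php?callback=jsonp123&var1=1&var2=2';
document.getElementsByTagName('head')[0].appendChild(s);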
