HTML I am getting using node.js is much different than HTML I can see in the browser (using google chrome inspect feature). I assume this is happening because when using browser I have to wait for some elements to load but I don't wait for them when creating a request. How can I request a fully loaded HTML? Is it possible without pretending to be a real user (puppeteer)?
For example, this is my attempt to get a video element from this link https://clips.twitch.tv/IronicPoisedTermite4Head
but video element is not present at all in the HTML I have fetched.
const fetch = require("node-fetch");
const jsdom = require("jsdom");
(async () => {
let htmlDoc = await fetch("https://clips.twitch.tv/IronicPoisedTermite4Head")
.then((res) => res.text())
.then((body) => body); //body is totally different than HTML in the browser
try {
const document = new jsdom.JSDOM().window.document;
console.log(htmlDoc);
console.log(document.getElementsByTagName('video')[0]);
} catch (e) {
console.log(e);
}
})();
When a browser loads a web page, it does an HTTP GET and gets back a static piece of HTML. Let's call that the "original content". It then parses that HTML and runs any <script> tags it finds in that HTML. Those script tags may then modify the content you see. In particular some sites make additional HTTP requests to retrieve additional content and then they insert that content into the page. The produces what I will call the "full content". Those scripts may even continue running over time to continue to update the content.
When you do a fetch() of some URL, that retrieves what was labeled above as the "original content". That's all it does. fetch() just does the initial HTTP GET for that URL. It doesn't parse the resulting HTML and it doesn't run any of the <script> tags it could find in that HTML. Thus, fetch() does not produce the "full content" as described above. Sometimes, the "original content" is sufficient for your work and sometimes the "full content" is what you need - it really depends upon the specific web site.
To get the "full content", you have to feed the "original content" to a browser-like environment that can "run" it to let its scripts do their things, to provide a DOM environment for those scripts to run in so you can then query the resulting DOM to get the "full content". puppeteer is one such tool for obtaining the "full content". It actually uses the Chromium engine (same engine the Chrome browser uses) to literally "run" the web page and let its <script> tags do their thing and you can then obtain the "full content" from it after those scripts run.
fetch(), by itself, cannot get the "full content" because it doesn't parse or run the page's scripts and doesn't offer a DOM environment for them to run in either. That's what a tool like puppeteer can do.
How can I request a fully loaded HTML? Is it possible without pretending to be a real user (puppeteer)?
If the site builds its "full content" uses Javascript in <script> tags, then you have to use a tool like puppeteer to get the "full content". It's not just a matter of waiting. You need a tool that actually runs the scripts in the page.
Related
How would I overwrite the response body for an image with a dynamic value in a Manifest V3 Chrome extension?
This overwrite would happen in the background, as per the Firefox example, (see below) meaning no attaching debuggers or requiring users to press a button every time the page loads to modify the response.
I'm creating a web extension that would store an image in the extension's IndexedDB storage and then override the response body with that image on requests to a certain image. A redirect to a dataurl: I have it working in a Manifest V2 extension in Firefox via the browser.webRequest.onBeforeRequest api with the following code, but browser.webRequest and MV2 are depreciated in Chrome. In MV3, browser.webRequest was replaced with browser.declarativeNetRequest, but it doesn't have the same level of access, as you can only redirect and modify headers, not the body.
Firefox-compatible example:
browser.webRequest.onBeforeRequest.addListener(
(details) => {
const request = browser.webRequest.filterResponseData(details.requestId);
request.onstart = async () => {
request.write(racetrack);
request.disconnect();
};
},
{
urls: ['https://www.example.com/image.png'],
},
['requestBody', 'blocking']
);
The Firefox solution is the only one that worked for me, albeit being exlusive to Firefox. I attempted to write a POC userscript with xhook to modify the content of a DOM image element, but it didn't seem to return the modified image as expected. Previously, I tried using a redirect to a data URI and an external image, but while the redirect worked fine, the website threw an error that it couldn't load the required resources.
I'm guessing I'm going to have to write a content script that injects a Service Worker (unexplored territory for me) into the page and create a page rule that redirects, say /extension-injected-sw.js to either a web-available script, but I'm not too sure about how to pull that off, or if I'd still be able to have the service worker communicate with the extension, or if that would even work at all. Or is there a better way to do this that I'm overlooking?
Thank you for your time!
I am trying to create a script to download an ebook into a pdf. When I try to use beautifulsoup in it I to print the contents of a single page, I get a message in the console stating "Oh no! It looks like JavaScript is disabled in your browser. Please re-enable to access the reader."
I have already enabled Javascript in Chrome and this same piece of code works for a page like a stackO answer page. What could be blocking Javascript in this page and how can I bypass it?
My code for reference:
url = requests.get("https://platform.virdocs.com/r/s/0/doc/350551/sp/14552484/mi/47443495/?cfi=%2F4%2F2%5BP7001013978000000000000000003FF2%5D%2F2%2F2%5BP7001013978000000000000000010019%5D%2F2%2C%2F1%3A0%2C%2F1%3A0")
url.raise_for_status()
soup = bs4.BeautifulSoup(url.text, "html.parser")
elems = soup.select("p")
print(elems[0].getText())
The problem is that the page actually contains no content. To load the content it needs to run some JS code. The requests.get method does not run JS, it just loads the basic HTML.
What you need to do is to emulate a browser, i.e. 'open' the page, run JS, and then scrape content. One way to do it is to use a browser driver as described here - https://stackoverflow.com/a/57912823/9805867
I'm trying to scrape some site in Node.js. I've followed a great tutorial however realize that it might not be what I am looking for, ie. might be looking at scraping the javascript portion of the page instead of the html one.
Is that possible ?
Reason for that is that I am looking for loading the content of the below portion of the code I could find by inspecting in Safari (not showing in Chrome) a kayak.com page (see url below) and seems to be in a scripting section.
reducer: {"reducerPath":"flights\/results\/react\/reducers\/
https://www.kayak.com/flights/TYO-PAR/2019-07-05-flexible/2019-07-14-flexible/1adults/children-11?fs=cfc=1;legdur=-960;stops=~0;bfc=1&sort=bestflight_a&attempt=2&lastms=1550392662619
UPDATE: Unfortunately, this site uses bot/scrape protection: tools like curl get a page with bot warning, headless browser tools like puppeteer get a page with captcha.
===============
As this line is present in the HTML source code and is not added dynamically by JavaScript execution, you can use something like this with the appropriate library API:
const extractedString = [...document.querySelectorAll('script')]
.map(({ textContent }) => textContent)
.find(txt => txt.includes('string'))
.match(/regexp/);
I'm trying to fetch an entire webpage using JavaScript by plugging in the URL. However, the website is built as a Single Page Application (SPA) that uses JavaScript / backbone.js to dynamically load most of it's contents after rendering the initial response.
So for example, when I route to the following address:
https://connect.garmin.com/modern/activity/1915361012
And then enter this into the console (after the page has loaded):
var $page = $("html")
console.log("%c✔: ", "color:green;", $page.find(".inline-edit-target.page-title-overflow").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
Then I'll get the dynamically loaded activity title as well as the statically loaded page footer:
However, when I try to load the webpage via an AJAX call with either $.get() or .load(), I only get delivered the initial response (the same as the content when over view-source):
view-source:https://connect.garmin.com/modern/activity/1915361012
So if I use either of the the following AJAX calls:
// jQuery.get()
var url = "https://connect.garmin.com/modern/activity/1915361012";
jQuery.get(url,function(data) {
var $page = $("<div>").html(data)
console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
// jQuery.load()
var url = "https://connect.garmin.com/modern/activity/1915361012";
var $page = $("<div>")
$page.load(url, function(data) {
console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim() );
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
I'll still get the initial footer, but won't get any of the other page contents:
I've tried the solution here to eval() the contents of every script tag, but that doesn't appear robust enough to actually load the page:
jQuery.get(url,function(data) {
var $page = $("<div>").html(data)
$page.find("script").each(function() {
var scriptContent = $(this).html(); //Grab the content of this tag
eval(scriptContent); //Execute the content
});
console.log("%c✖: ", "color:red;", $page.find(".page-title").text().trim());
console.log("%c✔: ", "color:green;", $page.find("footer .details").text().trim());
});
Q: Any options to fully load a webpage that will scrapable over JavaScript?
You will never be able to fully replicate by yourself what an arbitrary (SPA) page does.
The only way I see is using a headless browser such as PhantomJS or Headless Chrome, or Headless Firefox.
I wanted to try Headless Chrome so let's see what it can do with your page:
Quick check using internal REPL
Load that page with Chrome Headless (you'll need Chrome 59 on Mac/Linux, Chrome 60 on Windows), and find page title with JavaScript from the REPL:
% chrome --headless --disable-gpu --repl https://connect.garmin.com/modern/activity/1915361012
[0830/171405.025582:INFO:headless_shell.cc(303)] Type a Javascript expression to evaluate or "quit" to exit.
>>> $('body').find('.page-title').text().trim()
{"result":{"type":"string","value":"Daily Mile - Round 2 - Day 27"}}
NB: to get chrome command line working on a Mac I did this beforehand:
alias chrome="'/Applications/Google Chrome.app/Contents/MacOS/Google Chrome'"
Using programmatically with Node & Puppeteer
Puppeteer is a Node library (by Google Chrome developers) which provides a high-level API to control headless Chrome over the DevTools Protocol. It can also be configured to use full (non-headless) Chrome.
(Step 0 : Install Node & Yarn if you don't have them)
In a new directory:
yarn init
yarn add puppeteer
Create index.js with this:
const puppeteer = require('puppeteer');
(async() => {
const url = 'https://connect.garmin.com/modern/activity/1915361012';
const browser = await puppeteer.launch();
const page = await browser.newPage();
// Go to URL and wait for page to load
await page.goto(url, {waitUntil: 'networkidle'});
// Wait for the results to show up
await page.waitForSelector('.page-title');
// Extract the results from the page
const text = await page.evaluate(() => {
const title = document.querySelector('.page-title');
return title.innerText.trim();
});
console.log(`Found: ${text}`);
browser.close();
})();
Result:
$ node index.js
Found: Daily Mile - Round 2 - Day 27
First off: avoid eval - your content security policy should block it and it leaves you open to easy XSS attacks. Scraping bots definitely won't run it.
The problem you're describing is common to all SPAs - when a person visits they get your app shell script, which then loads in the rest of the content - all good. When a bot visits they ignore the scripts and return the empty shell.
The solution is server side rendering. One way to do this is if you're using a JS renderer (say React) and Node.js on the server you can fairly easily build the JS and serve it statically.
However, if you aren't then you'll need to run a headless browser on your server that executes all the JS a user would and then serves up the result to the bot.
Fortunately someone else has already done all the work here. They've put a demo online that you can try out with your site:
I think you should know the concept of SPA,
SPA is Single Page Application, it is only static html file. when the route changs, the page will create or modify DOM nodes dynamically to achieve the effect of switch page by using Javascript.
Therefore, if you use $.get(), the server will response a static html file that has a stable page, so you won't load what you want.
If you wants to use $.get() , it has two ways, the first is using headless browser, for example, headless chrome, phantomJS and etc. It will help you load the page and you can get dom nodes of the loaded page.The second is SSR (Server Slide Render), if you use SSR, you will get HTML data of page directly by $.get, because the server response HTML data of correspond page when requesting different routes.
Reference:
SSR
the SRR frame of vue: Nuxt.js
PhantomJS
Node API of Headless Chrome
I am writing a chrome extension that injects a div into a website with a content script. The content script makes an AJAX request to a website that I cleared in the manifest.json file and it inserts the data into the div with innerHTML. Part of what the AJAX request returns is javascript that needs to be executed. The AJAX request from within the content script works fine.
When I make the same AJAX request from a regular website, the javascript that is returned executes just fine, but when I make the AJAX request from the content script it does not execute. No errors are displayed in the console. I don't want to reload the website, if possible.
I assume that this is a security 'feature' and not a bug. How can I turn off or circumvent this behavior?
First off what Rob W said is very important, if you don't already know it, a good explanation of the different environment a content script runs in is useful.
You might want to check this out. It's not 100% what you're looking for but the main part is there. Basically from your background page (if you don't have one already create one), you use chrome.tabs.executeScript() to execute the script you've downloaded. That runs the javascript in the real page context instead of the "content script" context. All you need now is to get that script (in string form) to the background page, and determine the tabId to execute it on (from the sender tab)
You can use chrome.extension.sendMessage to send it to the background page, and in the background.js, use chrome.extension.onMessage to receive the message with your script. From there use the sender argument to get the tabId (sender.tab.id), and build your executeScript call.
One more helpful hint, page scripts (dynamic javascript executions) in chrome by default don't show up in any set way in the chrome debugger, but you can append something like this to the string of your javascript:
"\n//# sourceURL=/myFolder/myDynamicJavascript.js"
This will make this script always show up with the "/myFolder/myDynamicJavascript.js" path for the chrome debugger, allowing you to set breakpoints in the js code you've inserted. It's a lifesaver.