Scrape all used javascripts on a website using python


I am looking for a way to determine the names of all JavaScript files that are used on a website. Simply downloading the website's source code with the requests library is not enough, because it does not reveal scripts that are loaded dynamically.
For example, the website https://www.grantthornton.global/en/ uses Google Analytics (analytics.js), as Chrome's "Network" tab shows.
However, you cannot detect analytics.js from the source code alone, because it is loaded through Google Tag Manager.
My current approach is to load the website with Selenium and record all traffic through browsermob-proxy. I can then find the scripts that were requested by checking the URLs (example: https://www.google-analytics.com/analytics.js).
Is there any better way than this:
from selenium import webdriver
from browsermobproxy import Server
import pprint, time
server = Server("browsermob-proxy-2.1.4\\bin\\browsermob-proxy")
server.start()
proxy = server.create_proxy({'captureHeaders': True, 'captureContent': True, 'captureBinaryContent': True})
service_args = ["--proxy=%s" % proxy.proxy, '--ignore-ssl-errors=yes']
driver = webdriver.PhantomJS("phantomjs-2.1.1-windows\\bin\\phantomjs", service_args=service_args)
proxy.new_har()
driver.get('URL GOES HERE')
time.sleep(3)
all_requests = [entry['request']['url'] for entry in proxy.har['log']['entries']]
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(proxy.har)
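A rough sketch of the URL check described above, assuming the standard HAR layout (log, entries, request.url and response.content.mimeType). The function name javascript_urls and the sample data are made up for illustration; in the script above you would pass proxy.har instead of sample_har.

```python
from urllib.parse import urlsplit


def javascript_urls(har):
    """Return the URLs of HAR entries that look like JavaScript files."""
    urls = []
    for entry in har["log"]["entries"]:
        url = entry["request"]["url"]
        # The MIME type may be missing or empty, so fall back to the
        # file extension of the URL path.
        content = entry.get("response", {}).get("content", {})
        mime = content.get("mimeType") or ""
        path = urlsplit(url).path
        if "javascript" in mime.lower() or path.lower().endswith(".js"):
            urls.append(url)
    return urls


# Hypothetical HAR fragment, for illustration only.
sample_har = {"log": {"entries": [
    {"request": {"url": "https://www.google-analytics.com/analytics.js"},
     "response": {"content": {"mimeType": "text/javascript"}}},
    {"request": {"url": "https://example.com/logo.png"},
     "response": {"content": {"mimeType": "image/png"}}},
]}}
print(javascript_urls(sample_har))
```

Checking both the MIME type and the extension is a judgment call: some servers label scripts with a generic type, and some script URLs have no .js suffix.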
EDIT:
Solution based on Florent B's approach. PhantomJS has been replaced by the Chrome webdriver (chromedriver), which needs to be downloaded separately:
from selenium import webdriver
import pprint, time
driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.URLGOESHERE.com")
time.sleep(3)
scripts = driver.execute_script("""return window.performance.getEntriesByType("resource").filter(e => e.initiatorType === 'script').map(e => e.name.match(/.+\/([^?]+)/)[1]);""")
driver.close()
pp = pprint.PrettyPrinter(indent=4)
pp.pprint(scripts)

You could also get all the downloaded scripts via the window.performance API:
scripts = driver.execute_script("""
return window.performance.getEntriesByType("resource")
.filter(e => e.initiatorType === 'script')
.map(e => e.name);
""")
print(scripts)
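If only the file names are needed, the full URLs returned by the snippet above can be reduced in Python instead of with a regex inside the browser. This is a minimal sketch; script_file_names and the example URLs are hypothetical, and the list stands in for the value returned by execute_script.

```python
import posixpath
from urllib.parse import urlsplit


def script_file_names(urls):
    """Map resource URLs to de-duplicated file names, preserving order."""
    names = []
    for url in urls:
        # urlsplit separates the query string, so "?id=..." never ends up
        # in the file name.
        name = posixpath.basename(urlsplit(url).path)
        if name and name not in names:
            names.append(name)
    return names


# Hypothetical URLs standing in for the list returned by execute_script.
example = [
    "https://www.google-analytics.com/analytics.js",
    "https://www.googletagmanager.com/gtm.js?id=GTM-XXXX",
    "https://www.google-analytics.com/analytics.js",
]
print(script_file_names(example))
```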

Related

View and edit file permissions disappear unexpectedly in Chrome with Plotly Dash

Problem:
My Plotly Dash app (python) has a clientside callback (javascript) which prompts the user to select a folder, then saves a file in a subfolder within that folder. Chrome asks for permission to read and write to the folder, which is fine, but I want the user to only have to give permission once. Unfortunately the permissions, which should persist until the tab closes, disappear often. Two "repeatable cases" are:
when the user clicks a simple button ~15 times very fast, previously accepted permissions will disappear (plotting a figure also does this in my real application)
downloading a file within a few seconds of reloading the page results in the permissions automatically going away within about 5 seconds
I can see the permissions (file and pen icon) disappear at the right of the chrome url banner.
What I've tried:
testing with Ublock Origin on/off (and removed from chrome) to see if the extension interfered (got idea from the only somewhat similar question I've come across: window.confirm disappears without interaction in Chrome)
turning debug mode off
using Edge instead of chrome (basically the same behavior was observed)
adding more computation to Test button to find repeatable case, but still needed to click it a lot to remove permissions (triggering callbacks / updating Dash components seems to be the issue, not server resources)
Example python script (dash app) to show permissions disappearing:
import dash
import dash_bootstrap_components as dbc
from dash.dependencies import Input, Output
from dash import html
app = dash.Dash(__name__, external_stylesheets=[dbc.themes.BOOTSTRAP])
app.layout = html.Div([
    dbc.Button(id="model-export-button", children="Export Model"),
    dbc.Label(id="test-label1", children="Click to download"),
    html.Br(),
    dbc.Button(id="test-button", children="Test button"),
    dbc.Label(id="test-label2", children="Button not clicked")
])
# Chrome web API used for downloading: https://web.dev/file-system-access/
app.clientside_callback(
    """
    async function(n_clicks) {
        // Select directory to download
        const directoryHandle = await window.showDirectoryPicker({id: 'save-dir', startIn: 'downloads'});
        // Create sub-folder in that directory
        const newDirectoryHandle = await directoryHandle.getDirectoryHandle("test-folder-name", {create: true});
        // Download files to sub-folder
        const fileHandle = await newDirectoryHandle.getFileHandle("test-file-name.txt", {create: true});
        const writable = await fileHandle.createWritable();
        await writable.write("Hello world.");
        await writable.close();
        // Create status message
        const event = new Date(Date.now());
        const msg = "File(s) saved successfully at " + event.toLocaleTimeString();
        return msg;
    }
    """,
    Output('test-label1', 'children'),
    Input('model-export-button', 'n_clicks'),
    prevent_initial_call=True
)
@app.callback(
    Output('test-label2', 'children'),
    Input('test-button', 'n_clicks'),
    prevent_initial_call=True
)
def test_button_function(n):
    return "Button has been clicked " + str(n) + " times"
if __name__ == "__main__":
    app.run_server(debug=False)

This is now possible! In your code, replace the line…
await window.showDirectoryPicker({id: 'save-dir', startIn: 'downloads'});
…with…
await window.showDirectoryPicker({
id: 'save-dir',
startIn: 'downloads',
mode: 'readwrite', // This is new!
});

Unable to download research article from scihub using browser emulation with selenium

I am trying to automate the download of research articles from scihub (https://sci-hub.scihubtw.tw/) based on their corresponding article titles. I am using a library called scholarly (https://pypi.org/project/scholarly/) to get the url, author information related to the given article title as shown in the code below.
I use the fetched url (as described above) to emulate the download process using scihub. But I am unable to download directly, since I can't press the open button on the search page (https://sci-hub.scihubtw.tw/). And pressing enter after populating the query forwards me to another page with an open button. I am unable to fetch and press the open button for some reason and it always returns me a null element using the selenium library.
However, I am able to execute the following in the browser console and successfully download the paper:
document.querySelector("#open-button").click()
But trying to get the same result from Selenium fails.
Kindly help me resolve this issue.
## This part of code fetches url using scholarly library from google scholar
from scholarly import scholarly
search_query = scholarly.search_pubs('Hydrogen-hydrogen pair correlation function in liquid water')
search_query = [query for query in search_query][0]
## This part of code uses selenium to automate download process
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import time
download_dir = '/Users/cacsag4/Downloads'
# setup the browser
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
"download.default_directory": download_dir, #Change default directory for downloads
"download.prompt_for_download": False, #To auto download the file
"download.directory_upgrade": True,
"plugins.always_open_pdf_externally": True #It will not show PDF directly in chrome
})
browser = webdriver.Chrome('./chromedriver', options=options)
browser.delete_all_cookies()
browser.get('https://sci-hub.scihubtw.tw/')
# Find the search element to send the url string to it
searchElem = browser.find_element(By.CSS_SELECTOR, 'input[type="textbox"]')
searchElem.send_keys(search_query.bib['url'])
# Emulate pressing enter two different ways, either by pressing return key or by executing JS
#searchElem.send_keys(Keys.ENTER) # This produces the same effect as the next line
browser.execute_script("javascript:document.forms[0].submit()")
# Wait for page to load
time.sleep(10)
# Try to press the open button using JS or by fetching the button by its ID
# This returns error since its unable to fetch open-button id
browser.execute_script('javascript:document.querySelector("#open-button").click()')
#openElem = browser.find_element(By.ID, "open-button") ## This also returns a null element

Ok, so I got the answer to this question. Sci-hub stores its pdf inside an iframe, so all you got to do is fetch the src attribute of the iframe after pressing enter on the first page. The following code does the job.
from scholarly import scholarly
search_query = scholarly.search_pubs('Hydrogen-hydrogen pair correlation function in liquid water')
search_query = [query for query in search_query][0]
print(search_query.bib['url'])
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.ui import WebDriverWait
import time
download_dir = '/Users/cacsag4/Downloads'
# setup the browser
options = webdriver.ChromeOptions()
options.add_experimental_option('prefs', {
"download.default_directory": download_dir, #Change default directory for downloads
"download.prompt_for_download": False, #To auto download the file
"download.directory_upgrade": True,
"plugins.always_open_pdf_externally": True #It will not show PDF directly in chrome
})
browser = webdriver.Chrome('./chromedriver', options=options)
browser.delete_all_cookies()
browser.get('https://sci-hub.scihubtw.tw/')
# Find the search element to send the url string to it
searchElem = browser.find_element(By.CSS_SELECTOR, 'input[type="textbox"]')
searchElem.send_keys(search_query.bib['url'])
# Emulate pressing enter two different ways, either by pressing return key or by executing JS
#searchElem.send_keys(Keys.ENTER) # This produces the same effect as the next line
browser.execute_script("javascript:document.forms[0].submit()")
# Wait for page to load
time.sleep(2)
# The PDF is embedded in an iframe, so read the iframe's src instead of clicking the open button
#browser.execute_script('javascript:document.querySelector("#open-button").click()')
openElem = browser.find_element(By.CSS_SELECTOR, "iframe")
# Navigate to the PDF URL; with the prefs above, Chrome downloads it instead of displaying it
browser.get(openElem.get_attribute('src'))

Is there a way to get a webpage's Network activity (which you can see on Chrome Dev Tools) on load via Python?

I want to listen to the Network events (basically all of the activity that you can see when you go to the Network tab on Chrome's Developer Tools / Inspect) and record specific events when a page is loaded via Python.
Is this possible? Thanks!
Specifically:
go to webpage.com
open Chrome Dev Tools and go to the Network tab
add api.webpage.com as a filter
refresh page [scroll]
I want to be able to capture the names of these events because there are specific IDs that aren't available via the UI.

Update 2021
I had to make a few changes to Zach's answer to make it work. Lines marked with ### are my comments.
def get_perf_log_on_load(url, headless=True, filter=None):
    # init Chrome driver (Selenium)
    options = Options()
    options.add_experimental_option('w3c', False) ### added this line
    options.headless = headless
    cap = DesiredCapabilities.CHROME
    cap["loggingPrefs"] = {"performance": "ALL"}
    ### installed chromedriver.exe and identified its path
    driver = webdriver.Chrome(r"C:\Users\asiddiqui\Downloads\chromedriver_win32\chromedriver.exe", desired_capabilities=cap, options=options)
    # record and parse performance log
    driver.get(url)
    if filter:
        log = [item for item in driver.get_log("performance") if filter in str(item)]
    else:
        log = driver.get_log("performance")
    driver.close()
    return log

Although it didn't completely answer the question, @mihai-andrei's answer got me the closest.
If anyone is looking for a Python solution, then the following code should do the trick:
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.chrome.options import Options
def get_perf_log_on_load(url, headless=True, filter=None):
    # init Chrome driver (Selenium)
    options = Options()
    options.headless = headless
    cap = DesiredCapabilities.CHROME
    cap['loggingPrefs'] = {'performance': 'ALL'}
    driver = webdriver.Chrome(desired_capabilities=cap, options=options)
    # record and parse performance log
    driver.get(url)
    if filter:
        log = [item for item in driver.get_log('performance')
               if filter in str(item)]
    else:
        log = driver.get_log('performance')
    driver.close()
    return log
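The entries returned by get_log('performance') are dictionaries whose "message" field is a JSON string wrapping a DevTools event, so matching against str(item) is a fairly coarse filter. Below is a sketch of decoding the entries and keeping only the request URLs. The helper name requested_urls and the sample entries are hypothetical; in practice the log returned by the function above would be passed in.

```python
import json


def requested_urls(perf_log, url_filter=None):
    """Extract request URLs from ChromeDriver performance log entries."""
    urls = []
    for entry in perf_log:
        # Decode the JSON string and unwrap the DevTools event.
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.requestWillBeSent":
            continue
        url = event["params"]["request"]["url"]
        if url_filter is None or url_filter in url:
            urls.append(url)
    return urls


# Hypothetical entries shaped like the output of get_log('performance').
sample_log = [
    {"level": "INFO", "timestamp": 0, "message": json.dumps({
        "message": {"method": "Network.requestWillBeSent",
                    "params": {"request": {
                        "url": "https://api.webpage.com/items?id=42"}}},
        "webview": "hypothetical-id"})},
    {"level": "INFO", "timestamp": 1, "message": json.dumps({
        "message": {"method": "Network.responseReceived",
                    "params": {"response": {"url": "https://webpage.com/"}}},
        "webview": "hypothetical-id"})},
]
print(requested_urls(sample_log, url_filter="api.webpage.com"))
```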

You could sidestep Chrome and use a scriptable proxy like mitmproxy.
https://mitmproxy.org/
Another idea is to use Selenium to drive the browser and get the events from the performance logs:
https://sites.google.com/a/chromium.org/chromedriver/logging/performance-log
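For the proxy route, a mitmproxy addon can be as small as a class with a request hook. The sketch below assumes the script is loaded with mitmdump -s and that the browser is configured to use the proxy. The class name RequestRecorder is made up, and api.webpage.com is the placeholder host from the question.

```python
from types import SimpleNamespace


class RequestRecorder:
    """mitmproxy addon that records the URLs of requests to one host."""

    def __init__(self, host_filter):
        self.host_filter = host_filter
        self.urls = []

    def request(self, flow):
        # mitmproxy calls this hook once for every client request.
        if self.host_filter in flow.request.pretty_host:
            self.urls.append(flow.request.pretty_url)
            print(flow.request.pretty_url)


# mitmproxy reads this module-level list when the script is loaded.
addons = [RequestRecorder("api.webpage.com")]

if __name__ == "__main__":
    # Stand-in flow object so the class can be tried without a running proxy.
    fake_flow = SimpleNamespace(request=SimpleNamespace(
        pretty_host="api.webpage.com",
        pretty_url="https://api.webpage.com/items?id=42"))
    addons[0].request(fake_flow)
```

Because the proxy sees the traffic directly, this works with any browser or client, at the cost of having to trust mitmproxy's certificate for HTTPS sites.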

perform browser action with node.js with using API

I want to perform some events, e.g. clicks, on a website. I can do this in Chrome with JavaScript (or a Chrome extension), but is it possible to do it with server-side code, without opening Chrome? No API is provided. It's not scraping; I want to perform some sort of action.

NodeJS uses Google's V8 engine to run JavaScript code. It does not run in a browser environment and therefore lacks the DOM and browser event handling. However, you can mock a browser in the NodeJS environment using the mock-browser package.
const MockBrowser = require('mock-browser/lib/MockBrowser')
const mockBrowser = new MockBrowser()
global.window = mockBrowser.getWindow()
global.document = mockBrowser.getDocument()
global.navigator = mockBrowser.getNavigator()
However, you should be careful with this approach, as some methods (e.g. getComputedStyle) still will not work.
Maybe you should reconsider why you want to use DOM and events on the server side.
PhantomJS: Headless browser for NodeJS
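PhantomJS is a headless, scriptable WebKit browser that is used for testing, scraping, etc. It is a standalone program rather than a Node module, but it can be driven from NodeJS and gives you a real browser engine with a DOM and events, without a visible window.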
Using CasperJS for scraping
If you want to scrape websites, you may use a library called CasperJS that itself uses PhantomJS. An example:
var casper = require('casper').create();
var links;
function getLinks() {
// Scrape the links from top-right nav of the website
var links = document.querySelectorAll('ul.navigation li a');
return Array.prototype.map.call(links, function (e) {
return e.getAttribute('href')
});
}
// Opens casperjs homepage
casper.start('http://casperjs.org/');
casper.then(function () {
links = this.evaluate(getLinks);
});
casper.run(function () {
for(var i in links) {
console.log(links[i]);
}
casper.done();
});

Identify tab that made request in Firefox Addon SDK

I'm using the Firefox Addon SDK to build something that monitors and displays the HTTP traffic in the browser, similar to HTTPFox or Live HTTP Headers. I am interested in identifying which tab in the browser (if any) generated the request.
Using the observer-service I am monitoring for "http-on-examine-response" events. I have code like the following to identify the nsIDOMWindow that generated the request:
const observer = require("observer-service"),
{Ci} = require("chrome");
function getTabFromChannel(channel) {
try {
var noteCB= channel.notificationCallbacks ? channel.notificationCallbacks : channel.loadGroup.notificationCallbacks;
if (!noteCB) { return null; }
var domWin = noteCB.getInterface(Ci.nsIDOMWindow);
return domWin.top;
} catch (e) {
dump(e + "\n");
return null;
}
}
function logHTTPTraffic(sub, data) {
sub.QueryInterface(Ci.nsIHttpChannel);
var tab = getTabFromChannel(sub);
console.log(tab);
}
observer.add("http-on-examine-response", logHTTPTraffic);
Mostly cribbed from the documentation for how to identify the browser that generated the request. Some is also taken from the Google PageSpeed Firefox addon.
Is there a recommended or preferred way to go from the nsIDOMWindow object domWin to a tab element in the SDK tabs module?
I've considered something hacky like scanning the tabs list for one with a URL that matches the URL for domWin, but then I have to worry about multiple tabs having the same URL.

You have to keep using the internal packages. From what I can tell, getTabForWindow() function in api-utils/lib/tabs/tab.js package does exactly what you want. Untested code:
var tabsLib = require("sdk/tabs/tab.js");
return tabsLib.getTabForWindow(domWin.top);

The API has changed since this was originally asked/answered...
It should now (as of 1.15) be:
return require("sdk/tabs/utils").getTabForWindow(domWin.top);

As of Addon SDK version 1.13 change:
var tabsLib = require("tabs/tab.js");
to
var tabsLib = require("sdk/tabs/helpers.js");

If anyone still cares about this:
Although the Addon SDK is being deprecated in favor of the newer WebExtensions API, I want to point out that
var a_tab = require("sdk/tabs/utils").getTabForContentWindow(window)
returns a different 'tab' object than the one you would typically get by using
worker.tab in a PageMod.
For example, a_tab will not have the 'id' attribute, but it will have a linkedPanel property that is similar to the 'id' attribute.
