How to get the URLs of a page's requests? - javascript

I am working on a Chrome app and I need to find one of the request URLs (the request is initiated in a JS script).
After loading, the page's script asks for .../online_mektep/lesson/L_(page id)/index.json, and I need this page id. How can I find out the URL?
The only way I can see right now is to modify the original script with a web request and just grab the data before the request is sent. Are there other ways?

Not sure if I completely understand what you're trying to accomplish; however, maybe you can add a webRequest listener and get the url, then split the URL afterwards to get the route parameter you want (note that this requires the webRequest permission in your extension's manifest):
chrome.webRequest.onBeforeRequest.addListener(
  function (details) {
    console.log('onBeforeRequest', details.url);
    const yourUrl = details.url; // example: ".../online_mektep/lesson/L_(page id)/index.json"
    const pathArray = yourUrl.split('/');
    const lessonSegment = pathArray.find((s) => s.startsWith('L_'));
    console.log(lessonSegment && lessonSegment.split('_')[1]); // should output (page id)
  },
  { urls: ['<all_urls>'] } // a filter object is required as the second argument
);
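If you already have the request URL as a string, the URL API gives you the path segments directly. A minimal sketch (the function name and the exact .../online_mektep/lesson/L_<id>/index.json shape are assumptions):

```javascript
function extractPageId(requestUrl) {
  // Split the path into segments, dropping the empty string from the leading '/'
  const segments = new URL(requestUrl).pathname.split('/').filter(Boolean);
  // Find the segment carrying the id, e.g. "L_12345"
  const lesson = segments.find((s) => s.startsWith('L_'));
  return lesson ? lesson.slice('L_'.length) : null;
}
```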

Related

Extracting pdf link from a webpage with href="#" using python: AJAX post request not returning expected result

I’m currently trying to download a pdf from a website (I’m trying to automate the process) and I have tried numerous different approaches. I’m currently using python and selenium/phantomjs to first find the pdf href link on the webpage source and then use something like wget to download and store the pdf on my local drive.
Whilst I have no issues finding all the href links with find_elements_by_xpath("//a/@href") on the page, or narrowing in on the element that has the url path with find_element_by_link_text('Active Saver') and then printing it using the get_attribute('href') method, it does not display the link correctly.
This is the source element (an a tag) that I need the link from:
href="#" data-ng-mouseup="loadProductSheetPdf($event, download.ProductType)" target="_blank" data-ng-click="$event.preventDefault()" analytics-event="{event_name:'file_download', event_category: 'download', event_label:'product summary'}" class="ng-binding ng-isolate-scope">Active Saver<
As you can see the href attribute is href="#" and when I run get_attribute('href') on this element I get:
https://www.bupa.com.au/health-insurance/cover/active-saver#
Which is not the link to the PDF. I know this because when I open the page in Firefox and inspect the element I can see the actual, JavaScript executed source:
href="https://bupaanzstdhtauspub01.blob.core.windows.net/productfiles/J6_ActiveSaver_NSWACT_20180401_000000.pdf" data-ng-mouseup="loadProductSheetPdf($event, download.ProductType)" target="_blank" data-ng-click="$event.preventDefault()" analytics-event="{event_name:'file_download', event_category: 'download', event_label:'product summary'}" class="ng-binding ng-isolate-scope">Active Saver<
This https://bupaanzstdhtauspub01.blob.core.windows.net/productfiles/J6_ActiveSaver_NSWACT_20180401_000000.pdf is the link I need.
https://www.bupa.com.au/health-insurance/cover/active-saver is the link to the page that houses the PDF. As you can see the PDF is stored on another domain, not www.bupa.com.au.
Any help with this would be very appreciated.
I realised that this is actually an AJAX request, and when executed it obtains the PDF url that I'm after. I'm now trying to figure out how to extract that url from the response object sent via a post request.
My code so far is:
import requests
from lxml.etree import fromstring
url = "post_url"
data = {data dictionary to send with the request, extracted from dev tools}
response = requests.post(url, data)
response.json()
However, I keep getting an error indicating that no JSON object could be decoded. I can look at the response using response.text, and I get
u'<html>\r\n<head>\r\n<META NAME="robots" CONTENT="noindex,nofollow">\r\n<script src="/_Incapsula_Resource?SWJIYLWA=719d34d31c8e3a6e6fffd425f7e032f3">\r\n</script>\r\n<script>\r\n(function() { \r\nvar z="";var b="7472797B766172207868723B...";for (var i=0;i<b.length;i+=2){z=z+parseInt(b.substring(i, i+2), 16)+",";}z = z.substring(0,z.length-1); eval(eval(\'String.fromCharCode(\'+z+\')\'));})();\r\n</script></head>\r\n<body>\r\n<iframe style="display:none;visibility:hidden;" src="//content.incapsula.com/jsTest.html" id="gaIframe"></iframe>\r\n</body></html>'
(hex payload truncated for readability; it decodes to an obfuscated Incapsula bot-detection script, not the page content)
This clearly does not have the url I'm after. The frustrating thing is that I can see the url was obtained when I used Firefox's dev tools:
[Screenshot of Firefox Dev Tools showing the link]
Can anyone help me with this?
I was able to solve this by ensuring that both my header information and the request payload (data) sent with the post request were complete and accurate (obtained from Firefox dev tools' web console). Once I was able to receive the response data for the post request, it was relatively trivial to extract the url linking to the pdf file I wanted to download. I then downloaded the pdf using urlretrieve from the urllib module. I modeled my script on the script from this page. However, I also ended up using urllib2.Request from the urllib2 module instead of requests.post from the requests module; for some reason the urllib2 module worked more consistently than the Requests module. My working code ended up looking like this (these two methods come from my class object, but this shows the working code):
....
def post_request(self, url, data):
    self.data = data
    self.url = url
    req = urllib2.Request(self.url)
    req.add_header('Content-Type', 'application/json')
    res = urllib2.urlopen(req, self.data)
    out = json.load(res)
    return out

def get_pdf(self):
    link = 'https://www.bupa.com.au/api/cover/datasheets/search'
    directory = '/Users/U1085012/OneDrive/PDS data project/Bupa/PDS Files/'
    excess = [None, 0, 50, 100, 500]
    # singles
    for product in get_product_names_singles():
        self.search_request['PackageEntityName'] = product
        print product
        if 'extras' in product:
            self.search_request['ProductType'] = 2
        else:
            self.search_request['ProductType'] = 1
        for i in range(len(excess)):
            try:
                self.search_request['Excess'] = excess[i]
                payload = json.dumps(self.search_request)
                output = self.post_request(link, payload)
            except urllib2.HTTPError:
                continue
            else:
                break
        path = output['FilePath'].encode('ascii')
        file_name = output['FileName'].encode('ascii')
        # check to see if the file exists; if not, retrieve it
        if not os.path.exists(directory + file_name):
            ul.urlretrieve(path, directory + file_name)

Remove parameter from url not working

I'm attempting to remove a url parameter status from the url but in the following alert, the parameter is still there.
var addressurl = location.href.replace(separator + "status=([^&]$|[^&]*)/i", "");
alert(addressurl);
location.href= addressurl;
How do I solve this?
You are confusing regex with strings.
It should be:
var addressurl = location.href.replace(separator, '').replace(/status=([^&]$|[^&]*)/i, '');
A JavaScript context applies only to the page you are currently on.
When you reload, redirect, or move to any other page, JavaScript changes made on the previous page will not carry over; that has to be handled on the server side.
A refresh repeats the last request to the server, which is going to ignore your JavaScript changes. Instead, navigate to the new url with window.location = addressurl;
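If the page runs in a modern browser, the URL and URLSearchParams APIs sidestep the regex entirely; a sketch (the helper name is illustrative):

```javascript
function removeParam(href, name) {
  // Parse the URL, drop the named query parameter, and reserialize
  const url = new URL(href);
  url.searchParams.delete(name);
  return url.toString();
}
```

Then navigate with window.location = removeParam(location.href, 'status');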

What Event Is Sent Before Url Request?

Is there a way to get the url string before the request is sent to the website? I found a post about it here:
https://forums.mozilla.org/addons/viewtopic.php?f=7&t=11259&p=26111
but I could not find anything about how to "hook into the Browser:OpenLocation command" in the addon SDK.
Basically what I am doing is this:
Check the url that is about to be requested to see if it matches my RegExp.
If it matches, change the userAgent that is sent to the website. (By setting general.userAgent.override)
Thus I cannot check the url after the page starts loading since the request will have already have been sent, and I would rather not reload the page as it would delay browsing.
Thanks!
Yes, check out the docs here:
https://developer.mozilla.org/en-US/docs/XUL/School_tutorial/Intercepting_Page_Loads#HTTP_Observers
This code will work with the Add-on SDK: it checks that the url matches mysite and then sets a MyBrowser/1.0 User-Agent for just that site. It makes the change only when that site is detected, rather than using the general.userAgent.override pref.
var chrome = require("chrome");
chrome.Cc["@mozilla.org/observer-service;1"].getService(chrome.Ci.nsIObserverService).addObserver({
  observe: function(subject, topic, data) {
    var channel = subject.QueryInterface(chrome.Ci.nsIHttpChannel);
    if (/mysite/.test(channel.originalURI.host)) {
      channel.setRequestHeader("User-Agent", "MyBrowser/1.0", false);
    }
  }
}, "http-on-modify-request", false);

Read window.location.hash servlet-side not possible?

In my web app, a user can click an item in a list, and I modify the url in their browser:
<li>Horse</li>
<li>Cow</li>
<li>Goat</li>
function onListItemClicked() {
  window.location.hash = item.name;
}
this will change the url in the user's browser to:
www.example.com#Horse
www.example.com#Cow
www.example.com#Goat
if I'm reading correctly, we can't get the # part of the url servlet-side, right? If the user copies and pastes the url from their browser to a friend, it would be cool if I could generate the page already initialized with the item they clicked.
It looks like this is not possible, so I'll have to load the appropriate page via javascript after the document finishes loading.
Thanks
No, you can't do this from the server side. URL fragments are purely client side. You need to do this on the client side during page load.
window.onload = function() {
  var hash = window.location.hash;
  // Do your business thing here based on the hash.
};
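To also react when the user edits the hash after the page has loaded, listen for the hashchange event as well; a sketch (the function names are illustrative):

```javascript
// Extract the item name from a location hash, e.g. "#Horse" -> "Horse"
function itemFromHash(hash) {
  return hash.charAt(0) === '#' ? hash.slice(1) : hash;
}

function handleHash() {
  var item = itemFromHash(window.location.hash);
  if (item) {
    // Select/load the corresponding item client-side here
    console.log('Selected item:', item);
  }
}

// Guarded so the snippet also parses outside a browser
if (typeof window !== 'undefined') {
  window.addEventListener('load', handleHash);
  window.addEventListener('hashchange', handleHash);
}
```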

How to use javascript to get information from the content of another page (same domain)?

Let's say I have a web page (/index.html) that contains the following
<li>
<div>item1</div>
details
</li>
and I would like to have some javascript on /index.html to load that
/details/item1.html page and extract some information from that page.
The page /details/item1.html might contain things like
<div id="some_id">
picture
map
</div>
My task is to write a greasemonkey script, so changing anything serverside is not an option.
To summarize, javascript is running on /index.html, and I would like that javascript code to add some information to /index.html, extracted from both /index.html and /details/item1.html.
My question is how to fetch information from /details/item1.html.
I currently have written code to extract the link (e.g. /details/item1.html)
and pass this on to a method that should extract the wanted information (at first
just .innerHTML from the some_id div is ok, I can process further later).
The following is my current attempt, but it does not work. Any suggestions?
function get_information(link)
{
  var obj = document.createElement('object');
  obj.data = link;
  document.getElementsByTagName('body')[0].appendChild(obj);
  var some_id = document.getElementById('some_id');
  if (!some_id) {
    alert("some_id == NULL");
    return "";
  }
  return some_id.innerHTML;
}
First:
function get_information(link, callback) {
  var xhr = new XMLHttpRequest();
  xhr.open("GET", link, true);
  xhr.onreadystatechange = function() {
    if (xhr.readyState === 4) {
      callback(xhr.responseText);
    }
  };
  xhr.send(null);
}
then
get_information("/details/item1.html", function(text) {
var div = document.createElement("div");
div.innerHTML = text;
// Do something with the div here, like inserting it into the page
});
I have not tested any of this - off the top of my head. YMMV
As only one page exists in the client (browser) at a time, and all other (virtual/possible) pages live on the server, how would you get information from another page using JavaScript? You will have to interact with the server at some point to retrieve the second page.
If you can, integrate some AJAX-request to load the second page (and parse it), but if that's not an option, I'd say you'll have to load all pages that you want to extract information from at the same time, hide the bits you don't want to show (in hidden DIVs?) and then get your index (or whoever controls the view) to retrieve the needed information from there ... even though that sounds pretty creepy ;)
You can load the page in a hidden iframe and use normal DOM manipulation to extract the results, or get the text of the page via AJAX, grab the part between <body...> and </body>, and temporarily inject it into a div. (The second might fail for some exotic elements like ins.) I would expect Greasemonkey to have more powerful functions than normal Javascript for stuff like that, though - it might be worth thumbing through the documentation.
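The grab-the-body step can be sketched as a pure string operation (assuming the fetched page contains a single body element; the function name is an assumption):

```javascript
function extractBody(html) {
  // Locate the opening <body ...> tag and skip past its closing '>'
  var start = html.search(/<body[^>]*>/i);
  if (start === -1) return html; // no body tag: return the page as-is
  start = html.indexOf('>', start) + 1;
  // Cut just before the closing </body> tag, if present
  var end = html.search(/<\/body>/i);
  return end === -1 ? html.slice(start) : html.slice(start, end);
}
```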
