In Python, I can retrieve Javascript from an HTML document using the following code.
import urllib2, jsbeautifier
from bs4 import BeautifulSoup
f = urllib2.urlopen("http://www.google.com.ph/")
soup = BeautifulSoup(f, "lxml")
script_raw = str(soup.script)
script_pretty = jsbeautifier.beautify(script_raw)
print(script_pretty)
But how about if the script comes from a Javascript file on the server, like:
<script src="some/directory/example.js" type="text/javascript">
Is it possible to retrieve "example.js"? If so, how?
Context: I'm examining the Javascript of phishing web pages.
<script src="some/directory/example.js" type="text/javascript">
The tag above makes the browser fetch some/directory/example.js from the server, relative to the page's URL. You can retrieve the file the same way: resolve the src attribute against the page address and request the resulting URL.
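A minimal sketch of that approach, reusing the urllib2/BeautifulSoup setup from the question (the loop over script tags and the urljoin call are my additions):

import urllib2, urlparse
from bs4 import BeautifulSoup

page_url = "http://www.google.com.ph/"
soup = BeautifulSoup(urllib2.urlopen(page_url), "lxml")

# Find every <script> that references an external file via src.
for tag in soup.find_all("script", src=True):
    # Resolve relative paths like "some/directory/example.js"
    # against the page URL before requesting them.
    script_url = urlparse.urljoin(page_url, tag["src"])
    print(urllib2.urlopen(script_url).read())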
The easiest way is to right-click on the page in your browser, choose View Page Source, click on the .js link, and the script will be there.
If you want to load a script at run time, e.g. when one part of your JavaScript depends on another script, you can use require.js.
Related
I have a website in several languages and it happens that there are some visitors who for some reason are viewing a page in a different language than that of their browser settings.
Therefore, if their browser language is one of those the site is translated into, I suggest they switch to their preferred language.
I do this with a div containing a message at the top of the page.
According to your settings, we suggest you view the content of this page in the following language: {language link}.
{Denied option with link}
The message is shown in the current language (I have all the translations, of course), while {language link} is taken from the user's settings. The JavaScript code is generated server-side with PHP and embedded inline in the HTML:
<script>
// instructions loaded, matching user conditions...
</script>
Is there any way to manage this in an external js file? How would I do that? Obviously, I'm not interested in preparing n*n combinations of files and serving the one that matches the visitor's situation.
Thank you!
You can have the server render a small inline script that assigns the data to a global variable, in addition to your external script.
In the server-rendered page, it'd look something like:
<script>
var userSettingsLanguageLink = '{language link}';
</script>
And then in your external script, something like:
if (userSettingsLanguageLink !== getBrowserLanguage()) {
    // show the language-suggestion banner here
}
I'm building a web scraper using BeautifulSoup. Some websites have JavaScript-rendered content and don't load with urllib3, so I use Selenium for them. But Selenium takes too long to respond, and I need a more efficient scraper because I use the same generalized scraper across multiple websites. So I'm wondering: is there a way to detect whether a website has JavaScript-rendered content, so that I only use Selenium when needed and go with the faster urllib otherwise?
from selenium import webdriver
from bs4 import BeautifulSoup
import time
browser = webdriver.Chrome()
strt = time.time()
browser.get("https://www.amazon.jobs/en/locations/bangalore-india")
#time.sleep(10)
html = browser.page_source  # HTML after the browser has executed the page's JavaScript
soup = BeautifulSoup(html, 'lxml')
li = soup.find_all('ul')
print(li)
print('load time=' + str(time.time() - strt))
Here is a simple check using Selenium:
jsSize = len(browser.find_elements_by_xpath("/html/head/script"))
if jsSize > 0:
    print("Page contains javascript")
The script tag is used to define a client-side script (JavaScript).
The element either contains script statements, or it points to an external script file through the src attribute.
Right-click on the webpage you want to scrape, go to View Page Source, and look for a tag named script; its presence indicates that the page you are trying to scrape also uses JavaScript.
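If you want to automate that check without paying the Selenium start-up cost on every page, one lightweight option is to fetch the raw HTML with requests first and only fall back to Selenium when the static HTML lacks the elements you need. A sketch, with the caveat that the heuristic below (looking for the <ul> elements the question's scrape targets) is just an example and will need tuning per site:

import requests
from bs4 import BeautifulSoup

def needs_selenium(url):
    # Fetch the raw HTML without executing any JavaScript.
    soup = BeautifulSoup(requests.get(url).text, "lxml")
    # If the static HTML already has the elements we want, skip Selenium;
    # otherwise assume the content is rendered client-side.
    return len(soup.find_all("ul")) == 0 and len(soup.find_all("script")) > 0

url = "https://www.amazon.jobs/en/locations/bangalore-india"
if needs_selenium(url):
    print("Page likely renders its content with JavaScript; use Selenium")
else:
    print("Static HTML is enough; use requests/urllib")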
I have an existing Python script that was written using urllib2 to download from a http:// link:
import urllib2
import os.path
import os
from os import chdir, getcwd, listdir, path
print "downloading with urllib2"
f = urllib2.urlopen('http://www.dcregs.dc.gov/Notice/DownLoad.aspx?VersionID=4613531')
data = f.read()
with open("11-B300.doc", "wb") as code:
    code.write(data)
print "All done downloads!"
The source web page has been reformatted to use a "javascript:__doPostBack" address:
javascript:__doPostBack('ctl00$MainContent$rpt_ruleList$ctl02$Label1','')
My presumption is that there is some package, similar to urllib2, that will allow me to download the same information via the "javascript:__doPostBack" formatted address, or to call the http URL where the information is located and download it from there.
The existing script was working well for my purposes, so I would like to limit the additional coding, if possible.
Is there an alternative to urllib2 that will allow me to download the information in a similar manner?
Or am I going to have to get more sophisticated in my solution (e.g., using Selenium to scrape the information)? (Do I want to get more sophisticated so that I don't have to manage updates to individual urls?)
Thanks for your help in advance.
This is because the site you're using is built on .NET WebForms, which manages the state of the page and its interactions through hidden form variables.
So, in short, you'll need to click the link via something like Selenium, as you say.
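That said, if you want to stay close to your existing urllib2-style script, WebForms postbacks can often be replayed without a browser: scrape the hidden state fields from the page and POST them back with __EVENTTARGET set to the link's target. A minimal sketch with requests (the listing-page URL below is a hypothetical placeholder; the __EVENTTARGET value is the one from the question, and whether this site accepts a scripted postback is untested):

import requests
from bs4 import BeautifulSoup

session = requests.Session()
page_url = "http://www.dcregs.dc.gov/Notice/RuleList.aspx"  # hypothetical: the page containing the link
soup = BeautifulSoup(session.get(page_url).text, "lxml")

# WebForms expects its hidden state fields (__VIEWSTATE,
# __EVENTVALIDATION, ...) to be echoed back on every postback.
form_data = {
    field["name"]: field.get("value", "")
    for field in soup.find_all("input", {"type": "hidden"})
    if field.has_attr("name")
}
form_data["__EVENTTARGET"] = "ctl00$MainContent$rpt_ruleList$ctl02$Label1"
form_data["__EVENTARGUMENT"] = ""

# The postback response should be the document itself (or a redirect to it).
response = session.post(page_url, data=form_data)
with open("11-B300.doc", "wb") as code:
    code.write(response.content)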
I am looking to get the contents of a text file hosted on my website using Python. The server requires JavaScript to be enabled on your browser. Therefore when I run:
import urllib2
target_url = "http://09hannd.me/ai/request.txt"
data = urllib2.urlopen(target_url)
I receive a html page saying to enable JavaScript.
I was wondering if there was a way of faking having JS enabled or something.
Thanks
Selenium is the way to go here, but there is another "hacky" option.
Based on this answer: https://stackoverflow.com/a/26393257/2517622 - the server's JavaScript check appears to boil down to setting a __test cookie, so supplying the expected value yourself satisfies it:
import requests
url = 'http://09hannd.me/ai/request.txt'
response = requests.get(url, cookies={'__test': '2501c0bc9fd535a3dc831e57dc8b1eb0'})
print(response.content) # Output: find me a cafe nearby
I would probably suggest tools like dryscrape: https://github.com/niklasb/dryscrape
Additionally, you can find more info here: Using python with selenium to scrape dynamic web pages
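For completeness, here is what the Selenium route would look like for the same file. This is a sketch assuming Chrome, and assuming the browser wraps the plain-text response in a <pre> element once the JavaScript check has passed:

from selenium import webdriver

driver = webdriver.Chrome()
driver.get("http://09hannd.me/ai/request.txt")
# Browsers typically render a text/plain response inside a <pre> tag.
print(driver.find_element_by_tag_name("pre").text)
driver.quit()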
I'm trying to execute a YUI js script with Selenium's js.executeScript method.
The script is being executed by Selenium WebDriver in order to simulate a "click" in a hybrid mobile app (the button is inside a WebView).
String IncludeYUI = "script = document.createElement('script');script.type = 'text/javascript';script.async = true;script.onload = function(){};script.src = '"
+ YUI_PATH
+ "';document.getElementsByTagName('head')[0].appendChild(script);";
js.executeScript(IncludeYUI);
where YUI_PATH is a URL - https://cdnjs.cloudflare.com/ajax/libs/yui/3.18.0/yui/.....
The problem is that I do not have access to the external network from the current site,
so I was thinking of saving the script inside the project and loading it from the file system.
But this is js; it has no access to the FS...
Any ideas how to load the script?
Thanks
So, you're loading an html page somewhere, right? Conceptually you would load your JS file the same way: you make a request to your server to load the JS file, just like you did to load your html page.
That would look like this:
<script src="scripts/yourFile.js">
Also, I've never seen anyone load a js file the way you do in your code sample... I would definitely recommend just putting a script tag in your html.
You may want to post your html code as well; we'll be able to provide better help. I'll update this answer accordingly if needed.
Finally, after many tries, someone suggested I work with jQuery.
After some digging, I used executeScript with jQuery's tap, and it worked:
js.executeScript("$('#btn_login_button').trigger('tap');");
I'm still wondering why all the other methods using click and the element's coordinates didn't work.