Extracting JavaScript-rendered data from a web page

What I need to accomplish in the end is:
A. Send a URL to the form on this page: youtube-mp3.org
B. Get the src attribute of a link on the resulting page.
I'm using Ruby on Rails and tried this to send the request and get the body of the resulting page:
require 'net/http'
require 'uri'
yt_uri = URI('http://www.youtube-mp3.org')
params = { :id => "youtube-url" , :value => "http://www.youtube.com/watch?v=KMU0tzLwhbE" }
yt_uri.query = URI.encode_www_form(params)
res = Net::HTTP.get_response(yt_uri)
res.body
It works fine, but the website uses JavaScript to render the link, so the link does not show up in the source. Instead I get:
<noscript>
<div class="warning">You have to enable JavaScript to use this Service!</div>
</noscript>
Is there a way around this? I'm open to any suggestions.

There are two routes:
Actually execute the JavaScript, and then do the scraping. This is heavyweight, both in terms of resources and in terms of work required.
Figure out what the JavaScript in question is actually doing.
In this case, it's pretty easy. Go to http://www.youtube-mp3.org, open up your browser's trusty network debugger, and use the web form. Now, go back and inspect the requests and responses.
In my case, there appear to be five calls to external elements:
/a/pushitem
rectangle.htm
skyscraper.htm
/a/iteminfo
i.ytimg.com/vi/KMU0tzLwhbE
There's nothing interesting in the first three requests, but the fourth has some interesting looking JSON, and the last is a thumbnail image for the video.
The text from /a/iteminfo:
info = { "title" : "Developers", "image" : "http://i.ytimg.com/vi/KMU0tzLwhbE/default.jpg", "length" : "3", "status" : "serving", "progress_speed" : "", "progress" : "", "ads" : "", "pf" : "", "h" : "a0bb1715519025e36487b173b231295c" };
And, for those following along at home, the link src jsamm is trying to ferret out:
http://www.youtube-mp3.org/get?video_id=KMU0tzLwhbE&h=a0bb1715519025e36487b173b231295c&r=1380935176286
video_id is pretty easy to figure out, and we already have it. The h value came back in that JSON blob. r is a little more mysterious, but it looks remarkably like the current Unix epoch time plus 3 extra digits. Oh wait, that's what JavaScript's new Date().getTime() gives you: milliseconds since the epoch!
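The reasoning above can be sketched in a few lines of Python. This is only an illustration of how the pieces fit together, assuming the /a/iteminfo response still has the `info = {...};` shape quoted above; the function names `parse_iteminfo` and `build_get_url` are made up for this example and no request is sent.

```python
import json
import time
from urllib.parse import urlencode


def parse_iteminfo(text):
    """Turn the 'info = {...};' text returned by /a/iteminfo into a dict."""
    start = text.index("{")
    end = text.rindex("}") + 1
    return json.loads(text[start:end])


def build_get_url(video_id, info, now_ms=None):
    """Assemble the /get URL from the video id, the h value and a timestamp."""
    if now_ms is None:
        # Same value as JavaScript's new Date().getTime(): milliseconds since the epoch.
        now_ms = int(time.time() * 1000)
    query = urlencode({"video_id": video_id, "h": info["h"], "r": now_ms})
    return "http://www.youtube-mp3.org/get?" + query


# Shortened copy of the response text shown above.
sample = 'info = { "title" : "Developers", "status" : "serving", "h" : "a0bb1715519025e36487b173b231295c" };'
info = parse_iteminfo(sample)
print(build_get_url("KMU0tzLwhbE", info, now_ms=1380935176286))
```

With the fixed timestamp this reproduces the link quoted above; leaving `now_ms` out uses the current time instead.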
Anyway, don't do this. Not only are you being a jerk to whoever runs youtube-mp3.org, you're almost certainly violating the YouTube terms of service, and you're swimming in ugly copyright waters.

Related

Python POST Request Not Returning HTML, Requesting JavaScript Be Enabled

I'm trying to sign in to my Wells Fargo account and scrape my transaction history so that I can use them to track my finances. I am able to do the scraping part if I can get to the HTML of the page. The problem I'm having is getting there and the below code is returning a whole lot of gibberish to me.
# Bring in BeautifulSoup, urllib and requests.
import bs4
import urllib.request
import requests
# Post the login form to the website.
url = 'https://connect.secure.wellsfargo.com/auth/login/do'
payload = {"j_username": "USERNAME", "j_password": "PASSWORD"}
r = requests.post(url, data=payload)
print(r.text)
This code is outputting the following:
<html><head><meta http-equiv="Pragma" content="no-cache"/>
<meta http-equiv="Expires" content="-1"/>
<meta http-equiv="CacheControl" content="no-cache"/>
<script>
(function(){
var securemsg;
var dosl7_common;
window["bobcmn"] = "1011200000002200000001300000021application/x-www-form-urlencoded3000000088adfa450300000008TSPD_101300000014%2fauth%2flogin%2fdo300000000300000006/TSPD/300000008TSPD_101300000005https3000000b6#sCmnToken#0BC26lnGAWSD9m6NkEoMZy0dIjA7Os6O4oLerWkImSHetiQqPjvoid03xpkXMNwHZ4wUmjd9+FeNk7M7zEe5ESlixC/1O8E7X61l10gL4ddUAhMNR4LaIYlGkq+hckjmRwTXudNvohk90GvOs8Ea9fFIoAAAAAE=#eCmnToken#200000000";
try{(function(){try{var jS,JS,LS=1,oS=1,OS=1,zS=1,S_=1,__=1,i_=1,I_=1,j_=1;for(var L_=0;L_<JS;++L_)LS+=2,oS+=2,OS+=2,zS+=2,S_+=2,__+=2,i_+=2,I_+=2,j_+=3;jS=LS+oS+OS+zS+S_+__+i_+I_+j_;window.i===jS&&(window.i=++jS)}catch(o_){window.i=jS}var O_=window.sdkljshr489=!0;function z_(S){window.sdkljshr489&&S&&(O_=!1);return O_}function Z_(){}z_(window[Z_.name]===Z_);z_("undefined"===window.vodsS0);window.vodsS0=null;z_(/\x3c/.test(function(){return"\x3c"})&/x3d/.test(function(){return"0";"x3d"}));
var s_=/mobi/i.test(navigator.userAgent),Si=+new Date,_i=s_?3E4:3E3;function ii(){return z_(Si+_i<(Si=+new Date))}
(function Ii(){var J=!1;function l(J){for(var l=0;J--;)l+=L(document.documentElement,null);return l}function L(J,l){var Z="vi";l=l||new z;return _S(J,function(J){J.setAttribute("data-"+Z,l.SS());return L(J,l)},null)}function z(){this.O=1;this.L=0;this._=this.O;this.j=null;this.SS=function(){this.j=this.L+this._;if(!isFinite(this.j))return this.reset(),this.SS();this.L=this._;this._=this.j;this.j=null;return this._};this.reset=function(){this.O++;this.L=0;this._=this.O}}var Z=!1;function s(J,l){var L=
document.createElement(J);l=l||document.body;l.appendChild(L);L&&L.style&&(L.style.display="none")}function iS(l,L){L=L||l;var z="|";function s(J){J=J.split(z);var l=[];for(var L=0;L<J.length;++L){var Z="",lS=J[L].split(",");for(var SS=0;SS<lS.length;++SS)Z+=lS[SS][SS];l.push(Z)}return l}var _S=0,IS="datalist,details,embed,figure,hrimg,strong,article,formaddress|audio,blockquote,area,source,input|canvas,form,link,tbase,option,details,article";IS.split(z);IS=s(IS);IS=new RegExp(IS.join(z),"g");while(IS.exec(l))IS=
new RegExp((""+new Date)[8],"g"),J&&(Z=O_),++_S;return L(_S&&1)}function _S(J,l,L){(L=L||Z)&&s("div",J);J=J.children;var z=0;for(var _S in J){L=J[_S];try{L instanceof HTMLElement&&(l(L),++z)}catch(IS){}}return z}iS(Ii,l)})();window.oi={iI:"08c787b5a40180002943d30328de8438de8cc553d459dcd4fc6c4cb17feaa34f085900356d674a1888119e0ea122f11994fc63fbabf471ce1f60053949777f087711d376633d1c30cd2e2f14295017cd8afeedacf0c4783d8b9ec0abec9808a830fa17d4cc351f649688f2b9c98cc0961ddcaf13fb0e7020486252f76f751366cdb10741f04ad6fd"};function _(S){return 753>S}function I(){var S=arguments.length;for(var J=0;J<S;++J)arguments[J]-=38;return String.fromCharCode.apply(String,arguments)}function O(S){return S.toString(36)}(function ji(J){return J?0:ji(J)*ji(J)})(ii());var v;})();}finally{sdkljshr489=false;ie9rgb4=void(0);};
eval((ie9rgb4=function (){var m='function () {/*fQb f_TcC}-di`U_V YU)bWR$+dbikuVe^SdY_^uvkdbikfQb ZCy:Cy<C-!y_C-!y?C-!+V_bufQb <O-}+<O,:C+xx<Ov<Cx-"y_Cx-"y?Cx-#+ZC-<Cx_Cx?C+gY^T_g{Y---ZCssugY^T_g{Y-xxZCvmSQdSXu_OvkgY^T_g{Y-ZCmfQb ?O-gY^T_g{cT[\\ZcXb$()-n}+Ve^SdY_^ jOuCvkgY^T_g{cT[\\ZcXb$()ssCssu?O-n!v+bUdeb^ ?OmVe^SdY_^ JOuvkmjOugY^T_gKJO{^Q]UM---JOv+jOuoe^TUVY^UTo---gY^T_g{f_TcC}v+gY^T_g{f_TcC}-^e\\\\+jOu|Lh#S|{dUcduVe^SdY_^uvkbUdeb^oLh#Somvs|h#T|{dUcduVe^SdY_^uvkbUdeb^o}o+oh#Tomvv+\r\nfQb cO-|]_RY|Y{dUcdu^QfYWQd_b{ecUb1WU^dvyCY-x^Ug 4QdUyOY-cO/#5$*#5#+Ve^SdY_^ YYuvkbUdeb^ jOuCYxOY,uCY-x^Ug 4QdUvvmuVe^SdY_^uvkfQb C-kTUSbi`d*Ve^SdY_^uCvkdbikbUdeb^ :C?>{`QbcUuVe^SdY_^uCvkC-C{c`\\Yduo\\ov+fQb :-oo+V_bufQb \\-}+\\,C{\\U^WdX+xx\\v:x-CdbY^W{Vb_]3XQb3_TUuCK\\Mv+bUdeb^ :muCvvmSQdSXu\\vkmmm+bUdeb^ C-kS_^VYWebQdY_^*C{TUSbi`duo!"#\\#$\\)\'\\))\\!!&\\!}%\\!!(\\!}!\\#$\\%(\\#$\\!!}\\!!!\\#$\\$$\\#$\\!}}\\!}!\\)(\\!!\'\\!}#\\!}#\\!}%\\!!}\\!}#\\#$\\%(\\#$\\!!}\\!!!\\#$\\$$\\#$\\!})\\!!!\\!}}\\!!\'\\!}(\\!}!\\$)\\#$\\%(\\#$\\!}!\\!!}\\)\'\\)(\\!}(\\!}!\\!}}\\#$\\$$\\#$\\!})\\!!!\\!}}\\!!\'\\!}(\\!}!\\%}\\#$\\%(\\#$\\!}!\\!!}\\)\'\\)(\\!}(\\!}!\\!}}\\#$\\$$\\#$\\!})\\!!!\\!}}\\!!\'\\!}(\\!}!\\%!\\#$\\%(\\#$\\!}!\\!!}\\)\'\\)(\\!}(\\!}!\\!}}\\#$\\$$\\#$\\!})\\!!!\\!}}\\!!\'\\!}(\\!}!\\%"\\#$\\%(\\#$\\!}!\\!!}\\)\'\\)(\\!}(\\!}!\\!}}\\#$\\!"%ovmmvuv+\r\ncUSebU]cW-kcZC*Ve^SdY_^uCvkbUdeb^ cUSebU]cWK?u"(()\'vMucUSebU]cW{jYuuOu!&}vy}vyCyOu)""v/}*!vyVe^SdY_^uvkbUdeb^ CdbY^WK9u!$}y!%"y!$)y!$\'y!}%y!$"y!#%y!%"y!}%y!$)y!#(y!#)vMu=QdXK?u"&"}&}!!vMu=QdXK?u!&%}$\'#\'#$vMuvwuOu))#v/#$"*"%&vxuOu""$v/!*}vvruOu")\'v/"%&*#""vvmvK?u)!("#)vMuoovmyjC*Ve^SdY_^uCvkbUdeb^uuCsuOu"%#v/"%%*""}vv,,uOu)$\'v/"!*"$vluCsuOu$(&v/&%"(}*&&%%%vv,,uOu)}!v/%*(vlC..uOu(()v/)*(vsuOu&&"v/&%"(}*\'##\'\'vlC..uOu%%)v/"$*#"vsuOu!&$v/"%%*#"\'vv...uOu!)"vy}vmy9}*Ve^SdY_^uCy:vkV_bufQb 
\\-ooy<-uOu%&)vy\r\n}v+<,CK?u!")$#))"}%vM+<xxv\\x-CdbY^WKoLe}}&&b_]3Lh&(Qb3_TUoMuCK9u!#\'y!$"y!#%y!%"y!}%y!$)y!#(y!#)y!}#y!%$vMuu<xCK?u!")$#))"}%vMz:vrCK?u!")$#))"}%vMvv+bUdeb^ \\myYZC*Ve^SdY_^uCy:vkbUdeb^ cUSebU]cW{9}uCyCK?u!")$#))"}%vMz:vmy<O*Ve^SdY_^uCy:vkYVuCK?u!")$#))"}%vMn-:K?u!")$#))"}%vMvdXb_g cUSebU]cW{:CuCvycUSebU]cW{:Cu:vyoo+V_bufQb \\-ooy<-uOu\'&#vy}v+<,CK?u!")$#))"}%vM+<xxv\\x-CdbY^WKoLe}}&&bLh&V]Le}}$#XLh&!bLe}}$#_Lh&$UoMuCK9u!#\'y\r\n!$"y!#%y!%"y!}%y!$)y!#(y!#)y!}#y!%$vMu<vN:KoLe}}&#XQbLh$#_TU1doMu<vv+bUdeb^ \\my<C*Ve^SdY_^uCy:vkbUdeb^uuC...uOu""!vy}vvxu:...uOu"")vy}vvsuOu)$&v/"!$\'$(#&$\'*$")$)&\'")%vv...uOu&$#vy}vmyO:*Ve^SdY_^uCy:vkbUdeb^uuC...uOu()(vy}vvz:suOu!}&v/$")$)&\'")%*"!$\'$(#&$\'vv...uOu#\'}vy}vmy_%*Ve^SdY_^uCy:y\\vkdbikYVuCK?u!")$#))"}%vMn-uOu("#v/""*!&vvdXb_goo+YVu:K?u!")$#))"}%vMn-uOu&}&v/(*&vvdXb_goo+fQb <-cUSebU]cW{c_uCv+<KOu()vy}M-cUSebU]cW{jCu<KOu\'(&vy}Mv+<KOu\'(#v/}*!M-cUSebU]cW{jCu<KOu)!(v/\r\n}*!Mv+<KOu)"(v/!*"M-cUSebU]cW{jCu<KOu&\'}vy"Mv+<KOu\'(%v/"*#M-cUSebU]cW{jCu<KOu!\')vy#Mv+fQb j-cUSebU]cW{c_u:vyJ-cUSebU]cW{jCujKOu\'$}vy}Mvyc-cUSebU]cW{jCujKOu(\'}v/}*!MvyYC-u\\/Ou)"&v/"!$\'$(#&$\'*$"$\'})\'"#}$*uOu)&#vy}vv...uOu$\'#vy}v+YVu\\vV_bufQb OC-Ou""!v/!%*!"+OC.-uOu"%#vy}v+OCzzvfQb CC-cUSebU]cW{<CuJ,,uOu#\'%v/$*#vNJ...uOu$#%v/%*#vyJvy\\C-cUSebU]cW{<CuYCy<KYC...uOu""#v/!!*!}vsuOu"}\'v/#*"vMvyc-cUSebU]cW{O:ucyCCN\\CvyYC-cUSebU]cW{O:uYCyOu"}%v/"&%$$#%\'&)*"!$!&$}&("vycC-cUSebU]cW{<Cuc,,uOu\'\'\'vy$vNc...uOu!\')v/\r\n%*"vycvyJC-
*************************************************************
""}"!&2) %%}%"&"6 3%21#225 2"24}2"( "22$%1)" %32#&1}$ 3"4\'661\' 2%4}36#! "34))5(2 %24515!4 )2&$3"2} 53&#6""& \'%&11#)3 }"&4)#}1 )3})}&1) 52}5#&#6 \'"}\'&\'(% }%}}%\'!# )%26$1(" 5"2(\'1!$ \'22!"215 }32&!2#( )"4"(5)2 5%4%25}4 \'343562\' }24246"! (&4#4"4$ 6!4$5"$" &(442#6( !641(#&5 (!25!&34 6&2)"&%2 &62}\'\'5! !(2\'$\'\'\' ((}(%15& 66}6&1\'} &&}&#231 !!}!}2%3 (6&%)566 6(&"15&) &!&2664# !&&336$% 1}}15"\'( 4\'}44"55 $5}$(#%$ #)}#2#3" 1\'&\'"&&! 4}&}!&6\' $)&)$\'$4 #5&5\'\'42 154!&1$1 4)4&%143 $}46}2&& #\'4(#26} 1)2315%# 4522)53% $\'2"36\'6 #}2%665) 24246"!3 31213"(1 %#2#)##} "$2$1#1& 214}#&}% 344\'}&)# %$45%\'") "#4)&\'26 2#&&\'1"5 3$&!$12( %4&(!2}" "1&6"2)$ 2$}225#\' 3#}3(51! %1}%46!2 "4}"56(4oK?u!\'$#))!)(#vMuuu\\ON\r\n9CK9u!#\'y!$"y!#%y!%"y!}%y!$)y!#(y!#)y!}#y!%$vMu\\vvsuOu$&&v/"%%*#\'!vvwuOu\'!!v/)*&vyOu"}!v/(*!!v+\\ON-uOu$!$vyz!v+\\O-=QdXK?u!##($vMu\\Ov+\\On-`QbcU9^duJv/uJCxxycUdDY]U_eduCyuOu!#}vy}vvv*:uvmU\\cU :uvmVe^SdY_^ :uvkfQb C-cUSebU]cWK?u!#"$()#vMu:Ox9u)&vxcxoLe}}#QoxYCx9u)&vx\\OyoLh#}Le}}##ov+T_c\\\'OS_]]_^{9Ju<yCyooyOu"($v/%5#*$}\'#yYYuvvmV_bufQb \\-cUSebU]cW{?CugY^T_g{_Y{Y9yoLe}}#}Lh#!ovy\\-T_c\\\'OS_]]_^{::u\\yn!vy<-cUSebU]cW{:Cu\\KOu&#"vy}Mvy\r\nj-\\KOu!!#v/!*}MyJ-\\KOu$&v/"*!Myc-\\KOu!)%v/#*"MyYC-\\KOu&&#vy$MyOC-\\KOu(%$vy%MyCC-\\KOu$}\'v/&*#MK9u!#\'y!$"y!#%y!%"y!}%y!$)y!#(y!#)y!}#y!%$vMuuOu$$&vy}vvy\\C-1bbQiujvycC-=QdXK?u##")&vMuCCzOCKoLe}}&#Lh&(Le}}&!Lh\'"Le}}$#Lh&VLe}}&$Lh&%Le}}$!Lh\'$oMuuOu&%$vy}vvxuOu!)"v/!*}vyjvyJC-uOu\'"%vy}v+JC,j+JCxxv\\CKJCM-OC+fQb JC-uOu)"}vy}vy:Oy9Cy\\O+cUdDY]U_eduCyuOu"\'\'vy}vvmv+\r\nVe^SdY_^ OuCvkbUdeb^ \'%#.CmVe^SdY_^ 9uvkfQb C-QbWe]U^dc{\\U^WdX+V_bufQb :-}+:,C+xx:vQbWe]U^dcK:Mz-#(+bUdeb^ CdbY^W{Vb_]3XQb3_TU{Q``\\iuCdbY^WyQbWe]U^dcvmVe^SdY_^ ?uCvkbUdeb^ C{d_CdbY^Wu#&vmuVe^SdY_^ ZYu:vkbUdeb^ :/}*ZYu:vwZYu:vmvuYYuvv+fQb f+mvuv+mVY^Q\\\\ikcT[\\ZcXb$()-VQ\\cU+YU)bWR$-f_YTu}v+m+*/;}'.slice(15,-4);for(var 
i=0,c=8,j=0,l,r='';l=m.charCodeAt(i);++i)c=String.fromCharCode(l<33||l>=126?l:(93+l-((-76E-3+''+({}).a).slice(7).charCodeAt(j%'1')))%93+33),r+=c,j-=c.indexOf('\x0d');return r;})());
})();
</script>
<noscript>Please enable JavaScript to view the page content.</noscript>
</head><body>
</body></html>
I apologize for the hideous formatting but I didn't know what to do with it. Also, I removed a large, arbitrary portion in the middle that I replaced with the asterisks for the sake of length.
To me, the key thing I'm seeing is "Please enable JavaScript to view the page content." Is this output actually JS and how do I handle whatever it is with Python? I simply have no clue what this is telling me and I greatly appreciate any help you can provide.
Thanks.

I know that a great deal of time has passed on this, but I can give some closure here. What you're seeing is bot-defeat code sold by the good fellows at F5 Networks, Inc., designed to prevent naive webcrawlers and scrapers from being able to access sites that use it.
Briefly, this is obfuscated Javascript which calculates a value through a series of iterative steps which exercise various browser-specific Javascript capabilities, and makes use of some rather rude Javascript language behavior. That value is sent back to Wells Fargo as cookies and part of the webforms required for navigation. Just using a headless browser is not going to cut it - there are a few tricks in the calculation designed specifically to counter headless browsers and the Javascript engines that work with them. Missing any of the tricks will not cause any sort of failure; instead, it will just throw off the end result in a way which makes it difficult for you to tell what you missed.
It is, in theory, possible to decipher the code and emulate all the calculations in the language of your choice; I know of a successful countermeasure written by a data aggregation company, but the code is not open for public perusal. Alternately, you could figure out what you need to correctly execute it as-is in a JS interpreter. I don't remember all the details, but it's easier than it looks. You don't need to reverse engineer the whole thing, you just need to run it in the right environment. You need a dummy window object and more dummies for whatever else the code is looking for like navigator.userAgent in your environment, plus maybe other things.
For practical purposes, it's probably not worth it to write a countermeasure. Ask to be whitelisted if you're an organization.
If you are interested in the challenge, here is a (perhaps obvious) starting point - the long string of gibberish in the eval((ie9rgb4=function (){var m='function () ... .slice ... portion is ciphered code. The immediately following for loop contains character transformations. You can replicate the operation being done in that loop to decipher the first level of obfuscation. Log on to the site through your normal browser with a debugger active, observe the requests and cookies sent for an idea of the final goal you're looking for, and try to correlate that with the JS code you see.
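As a rough sketch of that first step, the for loop in the dump above can be replicated in Python. This is my reading of that particular loop, not a general decoder: the expression (-76E-3+''+({}).a).slice(7) evaluates to "ndefined", and j%'1' is always 0, so the key is the character code of "n". Other deployments of the script may use a different key or range. The name `decode_layer` is made up for this example.

```python
def decode_layer(ciphered):
    """Undo the character rotation applied by the for loop in the dump above."""
    # ("-0.076" + "undefined").slice(7) is "ndefined"; index j % '1' is always 0.
    key = ord("n")
    out = []
    for ch in ciphered:
        code = ord(ch)
        if code < 33 or code >= 126:
            # Whitespace, control characters and '~' pass through unchanged.
            out.append(ch)
        else:
            # Rotation within the 93 printable characters from '!' to '}'.
            out.append(chr((93 + code - key) % 93 + 33))
    return "".join(out)


# First few characters of the ciphered string from the dump.
print(decode_layer("fQb f_TcC}-di`U_V"))  # prints: var vodsS0=typeof
```

The output is still minified, obfuscated JavaScript; this only removes the outermost layer.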
You may also find the following mapping of values useful at some point:
{"$$$", "7"},
{"$$$$", "f"},
{"$$$_", "e"},
{"$$_", "6"},
{"$$_$", "d"},
{"$$__", "c"},
{"$_", "constructor"},
{"$_$", "5"},
{"$_$$", "b"},
{"$_$_", "a"},
{"$__", "4"},
{"$__$", "9"},
{"$___", "8"},
{"_", "u"},
{"_$", "o"},
{"_$$", "3"},
{"_$_", "2"},
{"__", "t"},
{"__$", "1"},
{"___", "0"}

This can be handled with Splash (another JavaScript renderer, besides Selenium). Since I use Scrapy, I use scrapy-splash. In my spider, Splash alone is not enough: the Splash request needs a Lua script that returns the cookies from the web page, or else the request is still blocked by the F5 security mechanism. After getting the cookies, request the page again using the generated cookies, and done!
The code in the Scrapy spider looks like this:
from scrapy_splash import SplashRequest

# These methods go inside your scrapy.Spider subclass.
def start_requests(self):
    # First pass: render the page and return the cookies set by the script.
    lua_script = '''
    function main(splash)
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(2))
        return {
            html = splash:html(),
            cookies = splash:get_cookies(),
        }
    end
    '''
    yield SplashRequest(self.start_urls[0], self.parse,
                        endpoint='execute',
                        args={'wait': 1, 'lua_source': lua_script})

def parse(self, response):
    # Second pass: load the same page again with the cookies from the first pass.
    lua_script = '''
    function main(splash)
        splash:init_cookies(splash.args.cookies)
        local url = splash.args.url
        assert(splash:go(url))
        assert(splash:wait(2))
        return {
            html = splash:html(),
        }
    end
    '''
    yield SplashRequest(self.start_urls[0], self.parse_result,
                        endpoint='execute',
                        args={'wait': 1, 'lua_source': lua_script},
                        dont_filter=True)

def parse_result(self, response):
    # Do your Scrapy parsing here.
    pass

Some websites that make use of JavaScript can't be scraped just by downloading the HTML and passing it to an HTML parser, because the content is simply not there. Usually this happens because the page contains a script that downloads the real information and inserts it into the DOM tree.
In these cases it's not enough to download the page; you need a web browser engine with JavaScript support that you can control from Python.
Here is a list of projects you could use for this, covering different programming languages: https://github.com/dhamaniasad/HeadlessBrowsers. I have worked with Selenium and it works fine, but I am not sure about its support for Python 3.5.

Perl Mechanize with Javascript

I started working with Perl's WWW::Mechanize and took on a task to automate, but got stuck on the JavaScript in the website.
The website I am trying my code on uses JavaScript-based navigation between menu sections (the URL stays the same).
Take a look here
The code I have written so far gets me the link which redirects to the menu, as shown in the image.
use WWW::Mechanize;
use Data::Dumper;

my $url = "https://my.testingurl.com/nav/";
my $mech = WWW::Mechanize->new(autocheck => 1);
$mech->get($url);
# Fill in and submit the login form.
$mech->form_name("LoginForm");
$mech->field('UserName', 'username');
$mech->field('UserPassword', 'password');
$mech->submit_form();
my $page = $mech->content;
# Follow the meta refresh redirect by hand, if there is one.
if ($page =~ /<meta\s+http-equiv="refresh"\s+content="\d+;\s*url=([^"+]*)"/mi) {
    $url = $1;
}
$mech->get($url);
print Dumper $mech->find_link(text_regex => qr/View Results/);
And this is the output:
$VAR1 = bless( [
'#',
'View Results',
undef,
'a',
bless( do{\(my $o = 'https://my.testingurl.com/nav/')}, 'URI::https' ),
{
'onclick' => 'PageActionGet(\'ChangePage\',\'ResultsSection\',\'\',\'\', true)',
'href' => '#'
}
], 'WWW::Mechanize::Link' );
Now I have no idea how to proceed: how do I click the link shown in the output, and do the same with the other parts of the navigation?
Please help.

You can't. WWW::Mechanize doesn't support JavaScript.

WWW::Mechanize doesn't support JavaScript. This leaves you with two basic options:
Reverse engineer the JavaScript, scrape any data you need out of it with Mechanize, then trigger any HTTP interactions yourself. In this case, it might involve extracting the "ResultsSection" string and matching it to some data from elsewhere in the page (or possibly an external JavaScript file).
Switch to a different tool which does support JavaScript (such as Selenium).
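As a small illustration of the first option, here is a sketch in Python (used for the added examples on this page; the same regexes carry over to Perl) that pulls the function name and arguments out of the onclick value shown in the question. The helper name `parse_onclick` is made up, and what PageActionGet then does with those arguments (which URL it requests, with which parameters) still has to be read from the site's own JavaScript.

```python
import re


def parse_onclick(onclick):
    """Split "Func('a','b', true)" into the function name and a list of raw arguments."""
    match = re.fullmatch(r"\s*(\w+)\((.*)\)\s*;?\s*", onclick)
    if match is None:
        raise ValueError("not a simple function call: %r" % onclick)
    name, arg_text = match.groups()
    # Each argument is either a single-quoted string (possibly empty)
    # or a bare token such as true, false or a number.
    args = [quoted if bare == "" else bare
            for quoted, bare in re.findall(r"'([^']*)'|([^\s,']+)", arg_text)]
    return name, args


onclick = "PageActionGet('ChangePage','ResultsSection','','', true)"
name, args = parse_onclick(onclick)
print(name, args)
```

This deliberately ignores escaped quotes inside string arguments, which the sample value does not contain.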

Scraping a webpage with python to get onclick values

First of all I have to say: please be patient with me, because I am not familiar with the topic I am about to describe.
I'd like to download the intraday historical values of some equities from the Frankfurt Boerse website. Take this equity for example: http://www.boerse-frankfurt.de/en/equities/adidas+ag+DE000A1EWWW0/price+turnover+history/tick+data#page=1
As you can see there are two options: trades on Frankfurt and trades on Xetra. I'd love to download the latter. I tried to scrape the data, but my knowledge of Python is very poor.
How can I 'select' the desired onclick option?
Thanks in advance for your replies. Regards
P.S. For your information, I noticed the following when inspecting the Xetra element: its value changes when I move on to the next page, and if I come back the value is different again. Here is an example. The first time on page 1 I got
a onclick="d39081344_fkt_set_par('6');d39081344_fkt_set_active(this);" class="brs_d39081344_li current last"
then I moved on to page 2 and got
a onclick="d51109535_fkt_set_par('6');d51109535_fkt_set_active(this);" class="brs_d51109535_li current last"
and coming back to page 1 I got
a onclick="d96086211_fkt_set_par('6');d96086211_fkt_set_active(this);" class="brs_d96086211_li current last"

The trick is to look at what calls are made when you navigate through the pages. Your browser's network analysis tool is invaluable for this. When I go from page to page, a POST is made to http://www.boerse-frankfurt.de/en/parts/boxes/history/_tickdata_full.m with data about the request.
The goal is then to replicate and loop those requests using Python. Here is code to get you started:
import requests

url = 'http://www.boerse-frankfurt.de/en/parts/boxes/history/_tickdata_full.m'
# Form data copied from the request seen in the browser's network tool.
data = {
    'component_id': 'PREKOP97077bf9dec39f14320bf9d40b636c7c589',
    'page': '3',
    'page_size': '50',
    'boerse_id': '6',
    'titel': 'Tick-Data',
    'lang': 'en',
    'text': 'LOcbaec84ecad1b94ad2fd257897c87361',
    'items_per_page': '50',
    'template': '0',
    'pages_total': '50',
    'use_external_secu': '1',
    'item_count': '2473',
    'include_url': '/parts/boxes/history/_tickdata_full.m',
    'ag': '291',
    'secu': '291',
}
r = requests.post(url, data=data)
print(r.text)  # here is your data of interest; it still needs to be parsed
That is the general idea. You would then put that in a loop, adding one to the page parameter each time.
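A minimal sketch of that loop follows. It uses only the standard library so it runs without extra packages; `requests.post(url, data=payload)` would do the same job. The names `page_payloads` and `fetch_page` are made up for this example, the base dictionary is shortened for illustration (in practice you would pass the full form data shown above), and the UTF-8 decoding is an assumption about the response.

```python
import urllib.parse
import urllib.request

TICKDATA_URL = "http://www.boerse-frankfurt.de/en/parts/boxes/history/_tickdata_full.m"


def page_payloads(base_payload, first_page, last_page):
    """Yield one copy of the form data per page, changing only the page number."""
    for page in range(first_page, last_page + 1):
        payload = dict(base_payload)
        payload["page"] = str(page)
        yield payload


def fetch_page(payload, url=TICKDATA_URL):
    """POST one page's form data and return the response body as text."""
    body = urllib.parse.urlencode(payload).encode("ascii")
    with urllib.request.urlopen(url, data=body) as response:
        return response.read().decode("utf-8", errors="replace")


# Shortened form data, for illustration only.
base = {"page": "1", "page_size": "50", "boerse_id": "6", "lang": "en"}
pages = [payload["page"] for payload in page_payloads(base, 1, 3)]
print(pages)
```

Calling `fetch_page(payload)` for each generated payload performs the actual requests; a short pause between calls is kinder to the server.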

Can I secure this javascript code in a database?

I currently have this code and I want to know how to store it, and then use it, in a database:
var stores = {
"McDonalds" : .90,
"Target" : .92,
"iTunes" : .95,
"Starbucks" : .87,
"Best Buy" : .93,
}
This list will be different and much bigger, but that's an example. It is currently put into action using:
<script src="location"></script>
I want to hide it in a database so that it isn't accessible to customers or competitors. How can I do that? And, when doing so, how would I then have my page access it instead of using script src?

You can't hide this from your customers, and still have your customers use that data in their browser. That isn't how the Internet works. If the browser needs to read that data, the user can also read that data.
If you can move whatever calculation you're doing server-side, that might be an option, but these are pretty simple values, and I'm guessing that people will have little difficulty guessing them simply by examining the inputs and outputs of your algorithm.
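To make both points concrete, here is a small sketch. It assumes, purely for illustration, that the page multiplies an amount by the store's value; the question does not say what the calculation actually is, and `STORE_RATES` and `quote` are made-up names. The table stays on the server and only the result is sent to the browser, yet a single request is enough to recover the value.

```python
# Hypothetical server-side table: it is never sent to the browser.
STORE_RATES = {
    "McDonalds": 0.90,
    "Target": 0.92,
    "iTunes": 0.95,
    "Starbucks": 0.87,
    "Best Buy": 0.93,
}


def quote(store, amount):
    """Server-side calculation: only this result would be returned to the client."""
    return round(amount * STORE_RATES[store], 2)


# What a curious customer or competitor can do with one input/output pair.
offer = quote("Target", 100.0)
inferred_rate = offer / 100.0
print(offer, inferred_rate)
```

Moving the table server-side hides the list as a whole, but not any individual value that a visitor can probe.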

Using jQuery on a string containing HTML

I'm trying to make a field similar to the facebook share box where you can enter a url and it gives you data about the page, title, pictures, etc. I have set up a server side service to get the html from the page as a string and am trying to just get the page title. I tried this:
function getLinkData(link) {
    link = '/Home/GetStringFromURL?url=' + link;
    $.ajax({
        url: link,
        success: function (data) {
            $('#result').html($(data).find('title').html());
            $('#result').fadeIn('slow');
        }
    });
}
which doesn't work, however the following does:
$(data).appendTo('#result')
var title = $('#result').find('title').html();
$('#result').html(title);
$('#result').fadeIn('slow');
but I don't want to write all the HTML to the page, as in some cases it redirects and does all sorts of nasty things. Any ideas?
Thanks
Ben

Try using filter rather than find:
$('#result').html($(data).filter('title').html());

To do this with jQuery, .filter is what you need (as lonesomeday pointed out):
$("#result").text($(data).filter("title").text());
However do not insert the HTML of the foreign document into your page. This will leave your site open to XSS attacks.
As has been pointed out, this depends on the browser's innerHTML implementation, so it does not work consistently.
Even better is to do all the relevant HTML processing on the server. Sending only the relevant information to your JS will make the client code vastly simpler and faster. You can whitelist safe/desired tags and attributes without ever worrying about anything dangerous getting sent to your users. Processing the HTML on the server will not slow down your site. Your language already has excellent HTML parsers, so why not use them?
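As an illustration of the server-side approach, here is a sketch using Python's built-in html.parser. The question's /Home/GetStringFromURL service is presumably written in another language, so treat this as the shape of the idea rather than a drop-in; `TitleParser` and `extract_title` are made-up names. The service would then return only the title text, not the foreign HTML.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text of the first <title> element and ignore everything else."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        return "".join(self._parts).strip()


def extract_title(html_text):
    """Return the page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html_text)
    parser.close()
    return parser.title


page = "<html><head><title> Hello World </title></head><body><script>alert(1)</script></body></html>"
print(extract_title(page))  # prints: Hello World
```

On the client, the returned string can then be inserted with .text(), so nothing from the foreign page is ever parsed as HTML in the browser.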

When you place an entire HTML document into a jQuery object, all but the content of the <body> gets stripped away.
If all you need is the content of the <title>, you could try a simple regex:
var title = /<title>([^<]+)<\/title>/.exec(dat)[ 1 ];
alert(title);
Or using .split():
var title = dat.split( '<title>' )[1].split( '</title>' )[0];
alert(title);

The alternative is to look for the title yourself. Fortunately, unlike most parse-your-own-HTML questions, finding the title is very easy because it doesn't allow any nested elements. Look in the string for something like <title>(.*)</title> and you should be set.
(yes yes yes I know never use regex on html, but this is an exceptionally simple case)

We Keep Coding
