Perl Mechanize with JavaScript

I started working with Perl's WWW::Mechanize and took on a task to automate, but got stuck on the JavaScript in the website.
The site I am testing my code against uses JavaScript-based navigation between menu sections (the URL stays the same).
Take a look here.
The code I have written so far gets me the link that redirects to the menu, as shown in the image.
$url="https://my.testingurl.com/nav/";
my $mech=WWW::Mechanize->new(autocheck => 1);
$mech->get($url);
$mech->form_name("LoginForm");
$mech->field('UserName','username');
$mech->field('UserPassword','password');
$mech->submit_form();
my $page=$mech->content;
if($page =~ /<meta\s+http-equiv="refresh"\s+content="\d+;\s*url=([^"+]*)"/mi)
{$url=$1 }
$mech->get($url);
print Dumper $mech->find_link(text_regex=>qr/View Results/);
and this is the output.
$VAR1 = bless( [
          '#',
          'View Results',
          undef,
          'a',
          bless( do{\(my $o = 'https://my.testingurl.com/nav/')}, 'URI::https' ),
          {
            'onclick' => 'PageActionGet(\'ChangePage\',\'ResultsSection\',\'\',\'\', true)',
            'href' => '#'
          }
        ], 'WWW::Mechanize::Link' );
Now I am clueless about how to proceed: how do I click the link shown in the output, and do the same with the other parts of the navigation?
Please help.

You can't. WWW::Mechanize doesn't support JavaScript.

WWW::Mechanize doesn't support JavaScript. This leaves you with two basic options:
1. Reverse engineer the JavaScript, scrape any data you need out of it with Mechanize, then trigger any HTTP interactions yourself. In this case, it might involve extracting the "ResultsSection" string and matching it to some data from elsewhere in the page (or possibly an external JavaScript file); see the sketch after this list.
2. Switch to a different tool which does support JavaScript (such as Selenium).
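A rough sketch of option 1, in Python for brevity (the onclick string is the one from the Dumper output above; the endpoint "https://my.testingurl.com/nav/PageAction" and the field names are hypothetical placeholders, to be replaced with whatever the browser's network tab shows PageActionGet actually requesting):

import re
import requests

# the onclick handler as it appears in the Dumper output above
onclick = "PageActionGet('ChangePage','ResultsSection','','', true)"

# pull the quoted arguments out of the handler
args = re.findall(r"'([^']*)'", onclick)
action, section = args[0], args[1]  # 'ChangePage', 'ResultsSection'

# hypothetical replay of the request the handler fires; the real endpoint and
# field names must come from the network tab, and you would reuse the
# logged-in session rather than a bare requests call
resp = requests.post("https://my.testingurl.com/nav/PageAction",
                     data={"action": action, "section": section})
print(resp.text)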

Related

Scraping a webpage with python to get onclick values

First of all I have to say: be patient with me, because I am not familiar with the topic I am about to describe.
I'd like to download the intraday historical values of some equities from the Frankfurt Boerse website. Take this equity as an example: http://www.boerse-frankfurt.de/en/equities/adidas+ag+DE000A1EWWW0/price+turnover+history/tick+data#page=1
As you can see there are two options: trades on Frankfurt and trades on Xetra. I'd like to download the latter. I tried to scrape the data, but my knowledge of Python is very poor.
How can I 'select' the desired onclick option?
Thanks in advance for your replies. Regards
PS: For your information, I noticed the following while inspecting the Xetra element: its value changes when I move to the next page, and when I come back the value is different again. An example: the first time on page 1 I got
a onclick="d39081344_fkt_set_par('6');d39081344_fkt_set_active(this);" class="brs_d39081344_li current last"
then I moved on to page 2 and got
a onclick="d51109535_fkt_set_par('6');d51109535_fkt_set_active(this);" class="brs_d51109535_li current last"
and coming back to page 1 I got
a onclick="d96086211_fkt_set_par('6');d96086211_fkt_set_active(this);" class="brs_d96086211_li current last"
The trick is to look at what calls are made when you navigate through the pages. Your browser's network analysis tool is invaluable for this. When I go from page to page, a POST is made to 'http://www.boerse-frankfurt.de/en/parts/boxes/history/_tickdata_full.m' with data about the request.
Then the goal is to replicate and loop the requests using python. Here is code to get you started:
import requests

r = requests.post('http://www.boerse-frankfurt.de/en/parts/boxes/history/_tickdata_full.m', data={
    'component_id': 'PREKOP97077bf9dec39f14320bf9d40b636c7c589',
    'page': '3',
    'page_size': '50',
    'boerse_id': '6',
    'titel': 'Tick-Data',
    'lang': 'en',
    'text': 'LOcbaec84ecad1b94ad2fd257897c87361',
    'items_per_page': '50',
    'template': '0',
    'pages_total': '50',
    'use_external_secu': '1',
    'item_count': '2473',
    'include_url': '/parts/boxes/history/_tickdata_full.m',
    'ag': '291',
    'secu': '291',
})
print(r.text)  # here is your data of interest, it still needs to be parsed
That is the general idea. You would then put that in a loop, adding one to the page parameter each time.
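For illustration, a hedged sketch of that loop (the form fields are copied from the request above; the component_id and text tokens look session-specific, so they will likely need to be re-scraped from the page rather than hard-coded):

import requests

URL = 'http://www.boerse-frankfurt.de/en/parts/boxes/history/_tickdata_full.m'

# fields copied from the captured request; the two long tokens are
# session-specific and may need to be refreshed for each session
base_data = {
    'component_id': 'PREKOP97077bf9dec39f14320bf9d40b636c7c589',
    'page_size': '50', 'boerse_id': '6', 'titel': 'Tick-Data', 'lang': 'en',
    'text': 'LOcbaec84ecad1b94ad2fd257897c87361',
    'items_per_page': '50', 'template': '0', 'pages_total': '50',
    'use_external_secu': '1', 'item_count': '2473',
    'include_url': '/parts/boxes/history/_tickdata_full.m',
    'ag': '291', 'secu': '291',
}

for page in range(1, 51):  # pages_total was 50 in the example request
    data = dict(base_data, page=str(page))
    r = requests.post(URL, data=data)
    print(r.text)  # raw HTML fragment for that page; still needs parsing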

Possible to dump AJAX content from webpage?

I would like to dump all the names on this page and on the remaining 146 pages.
The red/orange previous/next buttons use JavaScript, it seems, and the names are fetched via AJAX.
Question
Is it possible to write a script to crawl the 146 pages and dump the names?
Are there Perl modules for this kind of thing?
You can use WWW::Mechanize or another crawler for this. Web::Scraper might also be a good idea.
use strict;
use warnings;
use Web::Scraper;
use URI;
use Data::Dump;

# First, create your scraper block
my $scraper = scraper {
    # grab the text nodes from all elements with class type_firstname
    # (that way you could also classify them by type)
    process ".type_firstname", "list[]" => 'TEXT';
};

my @names;
foreach my $page ( 1 .. 146 ) {
    # Fetch the page (add the page number param)
    my $res = $scraper->scrape( URI->new("http://www.familiestyrelsen.dk/samliv/navne/soeginavnelister/godkendtefornavne/drengenavne/?tx_lfnamelists_pi2[gotopage]=" . $page) );

    # add them to our list of names
    push @names, $_ for @{ $res->{list} };
}

dd \@names;
It will give you a very long list with all the names. Running it may take some time. Try with 1..1 first.
In general, try using WWW::Mechanize::Firefox, which will essentially remote-control Firefox.
For that particular page, though, you can use something as simple as HTTP::Tiny.
Just make POST requests to the URL and pass the parameter tx_lfnamelists_pi2[gotopage] with values from 1 to 146.
Example at http://hackst.com/#4sslc for page #30.
Moral of the story: always look in Chrome's Network tab and see what requests the web page makes.
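In case the example link above goes dead, here is a minimal sketch of that looped POST, written with Python's requests for illustration (the answer suggests Perl's HTTP::Tiny; the URL and parameter name are taken from the answers above):

import requests

URL = ("http://www.familiestyrelsen.dk/samliv/navne/soeginavnelister/"
       "godkendtefornavne/drengenavne/")

for page in range(1, 147):
    # the page number goes in the tx_lfnamelists_pi2[gotopage] parameter
    r = requests.post(URL, data={"tx_lfnamelists_pi2[gotopage]": str(page)})
    print(r.text)  # HTML containing the names; still needs to be parsed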

extracting javascript rendered data from a web page

What I need to accomplish in the end is:
A. send a URL to the form on this page: youtube-mp3.org
B. get the src attribute of a link on the resulting page.
I'm using Ruby on Rails and tried this method to send the request and get the body of the resulting page:
require 'uri'
require 'net/http'

yt_uri = URI('http://www.youtube-mp3.org')
params = { :id => "youtube-url", :value => "http://www.youtube.com/watch?v=KMU0tzLwhbE" }
yt_uri.query = URI.encode_www_form(params)

res = Net::HTTP.get_response(yt_uri)
res.body
and it works fine, but the problem is that the website uses JavaScript to render the link, so it does not show up in the source. Instead I get
<noscript>
<div class="warning">You have to enable JavaScript to use this Service!</div>
</noscript>
Is there a way around this? I'm open to any suggestions.
There are two routes:
1. Actually execute the JavaScript, and then do the scraping. This is heavyweight, both in terms of resources and in terms of the work required.
2. Figure out what the JavaScript in question is actually doing.
In this case, it's pretty easy. Go to http://www.youtube-mp3.org, open up your browser's trusty network debugger, and use the web form. Now, go back and inspect the requests and responses.
In my case, there appear to be five calls to external elements:
- /a/pushitem
- rectangle.htm
- skyscraper.htm
- /a/iteminfo
- i.ytimg.com/vi/KMU0tzLwhbE
There's nothing interesting in the first three requests, but the fourth has some interesting-looking JSON, and the last is a thumbnail image for the video.
The text from /a/iteminfo:
info = { "title" : "Developers", "image" : "http://i.ytimg.com/vi/KMU0tzLwhbE/default.jpg", "length" : "3", "status" : "serving", "progress_speed" : "", "progress" : "", "ads" : "", "pf" : "", "h" : "a0bb1715519025e36487b173b231295c" };
And, for those following along at home, the link src that jsamm is trying to ferret out:
http://www.youtube-mp3.org/get?video_id=KMU0tzLwhbE&h=a0bb1715519025e36487b173b231295c&r=1380935176286
video_id is pretty easy to figure out, and we already have it. The h value came back in that JSON blob. r is a little more mysterious, but it looks remarkably like the current unix epoch with 3 extra digits. Oh wait: that's what JavaScript's Date.getTime() gives you!
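For reference, the same millisecond-epoch value is easy to reproduce outside the browser; a Python equivalent of JavaScript's Date.getTime():

import time

# milliseconds since the unix epoch, like JavaScript's Date.getTime()
r = int(time.time() * 1000)
print(r)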
Anyway, don't do this. Not only are you being a jerk to whoever runs youtube-mp3.org, you're almost certainly violating the YouTube terms of service, and you're swimming in ugly copyright waters.

How to create a search function using JavaScript or jQuery for Bible verses, for instance

I'm trying to create a linked set of drop-down lists for the Bible on my webpage; I'm new to this. What I want is to let the user select or input the verse they want to go to, click a Submit button, and then be taken to that section of the Bible. For instance, if the user inputs Mathew 1:2-10, it should take them to that passage. How do I start? And are JavaScript and jQuery even the right tools, or do I need another programming language?
Thank you.
If you are planning to store the Bible text in a data base on a server, then you would need two programming languages: one for the server and one for the client (browser). JavaScript (with or without jQuery) would be a good tool for the client. Alternatives would be ActionScript (Flash) or a Java applet, but I would not recommend either of them for this.
On the server side, it completely depends on the nature of the server. Probably the most common combination is PHP and MySQL, although there are lots of other possibilities. For instance, you could store the data in XML files on the server and use XSLT to format the results for display in HTML on the client. That's the approach taken by tanach.us (for the Hebrew Bible).
Intrigued by such a task, I looked into it. Yes, it can be done, but don't just wave away @Ted's comment: a database-backed application would indeed be the better option.
But if you don't mind the work and will only use it on a small scale, it is possible to create a JavaScript-based application to serve your pages.
You could use an iframe to serve the pages. By creating a selection box that populates a second one, you can make an acceptable application. The pages are collected in JavaScript objects and served. In this example the domain of w3schools is used.
var Mathew = {
    verses: ["verse2_1", "verse2_2", "verse2_3"],
    verse2_1: "html/html_iframe.asp",
    verse2_2: "tags/tag_select.asp",
    verse2_3: "jquery/default.asp"
};
The first selection box will contain a hand coded option list
<select name="book" id="book">
    <option value="choose">Please choose</option>
    <option value="Mathew">Mathew</option>
    <option value="John">John</option>
</select>
The second option list is automatically populated with the help of the JavaScript object.
function populateSecondSelect(book) {
    if (book == "choose") {
        $("#verses").children().remove();
        $("#verses").append("<option>choose a book first</option>");
        $("button").prop("disabled", true);
        return;
    }
    $("button").prop("disabled", false);

    // look up the book object by name on the global scope (avoids eval)
    var obj = window[book];

    $("#verses").children().remove();
    $(obj.verses).each(function () {
        $("<option/>", {
            name: this,
            id: this,
            value: this,
            text: this
        }).appendTo("#verses");
    });
}
With the second selection made, the button can be clicked to serve the page:
function fetchVerse() {
    var book = $("#book").val();
    var verse = $("#verses").val();

    // property lookup instead of eval; baseUrl is defined elsewhere
    // (the w3schools domain in this example)
    var url = baseUrl + window[book][verse];
    $("#frame").attr("src", url);
}
The whole thing is working in a fiddle: http://jsfiddle.net/djwave28/nEqeK/4/
It is fun, but a database for a whole Bible is better.

Using jQuery on a string containing HTML

I'm trying to make a field similar to the Facebook share box, where you can enter a URL and it gives you data about the page: title, pictures, etc. I have set up a server-side service to get the HTML of the page as a string, and am trying to just get the page title. I tried this:
function getLinkData(link) {
    link = '/Home/GetStringFromURL?url=' + link;
    $.ajax({
        url: link,
        success: function (data) {
            $('#result').html($(data).find('title').html());
            $('#result').fadeIn('slow');
        }
    });
}
which doesn't work; however, the following does:
$(data).appendTo('#result')
var title = $('#result').find('title').html();
$('#result').html(title);
$('#result').fadeIn('slow');
but I don't want to write all the HTML to the page, as in some cases it redirects and does all sorts of nasty things. Any ideas?
Thanks
Ben
Try using filter rather than find. When jQuery parses a full HTML string, the <title> element ends up as a top-level node in the resulting collection, so .filter (which matches the top-level nodes themselves) finds it, while .find (which only looks at descendants) does not:
$('#result').html($(data).filter('title').html());
To do this with jQuery, .filter is what you need (as lonesomeday pointed out):
$("#result").text($(data).filter("title").text());
However do not insert the HTML of the foreign document into your page. This will leave your site open to XSS attacks.
As has been pointed out, this depends on the browser's innerHTML implementation, so it does not work consistently.
Even better is to do all the relevant HTML processing on the server. Sending only the relevant information to your JS will make the client code vastly simpler and faster. You can whitelist safe/desired tags and attributes without ever worrying about dangerous content being sent to your users. Processing the HTML on the server will not slow down your site, and your language already has excellent HTML parsers, so why not use them?
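As a sketch of that server-side approach, here is title extraction in Python using only the standard library (the question's backend looks like ASP.NET MVC, so the language here is purely illustrative):

from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collects the text content of the <title> element."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

parser = TitleParser()
parser.feed("<html><head><title>Example Page</title></head><body></body></html>")
print(parser.title)  # -> Example Page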
When you place an entire HTML document into a jQuery object, all but the content of the <body> gets stripped away.
If all you need is the content of the <title>, you could try a simple regex:
var title = /<title>([^<]+)<\/title>/.exec(data)[1];
alert(title);
Or using .split():
var title = data.split('<title>')[1].split('</title>')[0];
alert(title);
The alternative is to look for the title yourself. Fortunately, unlike most parse-your-own-HTML questions, finding the title is very easy, because it doesn't allow any nested elements. Look in the string for something like <title>(.*)</title> and you should be set.
(Yes, yes, yes, I know: never use regex on HTML, but this is an exceptionally simple case.)
