Web scraping of modal window(dialogue box) using jsoup - javascript

I am studying about the project in which I have to extract the data from the website . The project is in java and the website is in java script . I am using Jsoup to extract the data from the website But there are some modal windows(dialogue box , pop up windows) present in the web page.So Is it possible to extract the data of modal windows using jsoup?????
So if answer is yes , then how could I do it?? please provide links and if not, then what are the other best ways to do it???
Thanks for your help. I really appreciate it.

I assume that the modal is generated by Javascript.
Jsoup is just a parser. This means that it will make an HTTP request (GET or POST, whatever you tell it to do) and the server (website) will respond with the initial html. By saying initial, I mean the html before any javascript is executed.
Javascript can generate html (like the modal in question), but this is not visible to Jsoup because a parser can only read, it cannot execute code. The browser is able to generate the modal because it includes a Javascript execution engine that parses and executes Javascript.
When you visit a web page you don't know what is dynamic (generated by Javascript) and what is static (fetched by the server as is).
A little trick to check what is dynamic and what is static (static is visible to Jsoup) is to do the following:
Visit the web page you want to parse (with chrome if possible, mozilla will work too I think).
Press Ctrl + U. This will open a new tab.
The new tab will contain some mesh of html, css and js. This is what the server fetches to the browser and is also visible to Jsoup.
If the modal is in there, then great, it is visible to Jsoup. If not, then you have to use a library that acts as a headless browser.
A headless browser is essentially a browser without the graphical interface. It can parse and execute Javascript. It "sees" what a normal browser sees.
The most common library used is selenium webdriver. Be careful, selenium is a testing framework that has a lot of parts. What you need is the webdriver.
There a lot of examples out there with ready made code to get you started.

Related

Pentest pure JavaScript (qooxdoo) Website

I'm wondering how I could Pentest a website made completely in JavaScript, for example using the qooxdoo Framework.
Those websites do not contain any requests to the server which respond with HTML content. Only one Javascript file gets transmitted when loading the page (which is an almost empty html page with just the link to the javascript file) and the page is then beeing set up by the loaded JS file, without any line of HTML written by the developer.
Typically, there would be some Spidering/Crawling in most Web App Scanners (like Nexpose), which check a website for links and forms and crawl any link they find which directs to the same domain and test any parameter found on these links. I assume those scanners would not have any effect on a pure JS page.
Then there's the other possibility: A proxy server (like Burp Suite) which captures any traffic beeing sent to a server and is able to check any found parameters on this requests. This would probably work to test the API-Server which is located behind the Website (for example to find SQL injections).
But: Is there any way to test the client, for example for XSS (self or stored)?
Or more in general: What types of attacks would you typically need to check in such a pure JS web application? What tools could help with that?

Automating Add to Cart Javascript Sequence without browser open with Python

I'm looking for a method to automate an add-to-cart process using Python WITHOUT needing to have a browser window open.
I've tried using modules such as mechanize but it does not have the functionality of directly "clicking" a web element
Currently I've been able to automate this process using Selenium but the process of having to open the browser and load web elements, photos, etc adds up to a lengthy process where time is of the essence.
An example page that I would like to automate is here :
http://store.nike.com/us/en_us/pd/kd-vi-elite-basketball-shoe/pid-972328/pgid-972324?cp=usns_twit_041214_basketball_kdelitehome
Any direction is greatly appreciated.
It seems that in the web page you listed, the "Add to Cart" button is actually a form submit button. What you can do is simulate the submission of the form by doing a POST request, with all the necessary form parameters, which you can get from all the <input> elements on the page.
A possible python implementation may be:
Download the page with urllib2. You will probably have to enable cookies.
Parse the page using BeautifulSoup or similar, and find all the <input> tags and their values.
Construct a new POST request with all these params (while maintaining cookies).
You can use your Browser's Network sniffing capabilities to see an actual request being sent, and try to mimic it using the above tools.
Hope it helps.

Process web page in JavaScript/browser and inserting into database?

I'm processing URLs with my program,
Basically what it does is, file_get_contents($url) to get web content and attach a JavaScript code at the bottom which processes the HTML source for the biggest image and inserts into the database by ajax.
The problem is I have to instantiate Firefox browser for the processing. I really don't need Firefox to render the page visually and other workings. All I want is for my script to do its job.
So is there a way to use Firefox's HTML/CSS and JavaScript engine without having to call the entire browser?

How can I get the HTML generated with javascript?

I want to get the HTML content of a web page but most of the content is generated by javascript.
Is it posible to get this generated HTML (with python if posible)?
The only way I know of to do this from your server is to run the page in an actual browser engine that will parse the HTML, build the normal DOM environment, run the javascript in the page and then reach into that DOM engine and get the innerHTML from the body tag.
This could be done by firing up Chrome with the appropriate URL from Python and then using a Chrome plugin to fetch the dynamically generated HTML after the page was done initializing itself and communicate back to your Python.
Checkout Selenium. It have a python driver, which might be what you're looking for.
If most of the content is generated by Javascript then the Javascript may be doing ajax calls to retrieve the content. You may be able to call those server side scripts from your Python app.
Do check that it doesn't violate the website's terms though and get permission.

web crawler/spider to fetch ajax based link

I want to create a web crawler/spider to iteratively fetch all the links in the webpage including javascript-based links (ajax), catalog all of the Objects on the page, build and maintain a site hierarchy. My question is:
Which language/technology should be better (to fetch javascript-based links)?
Is there any open source tools there?
Thanks
Brajesh
You can automate the browser. For example, have a look at http://watir.com/
Fetching ajax links is something that even the search-giants haven't accomplished yet. It is because, the ajax links are dynamic and the command and response both vary greatly as per the user's actions. That's probably why, SEF-AJAX (Search Engine Friendly AJAX) is now being developed. It is a technique that makes a website completely indexable to search engines that when visited by a web browser, acts as a web application. For reference, you may check this link: http://nixova.com
No offence but I dont see any way of tracking ajax links. That's where my knowledge ends. :)
you can do it with php, simple_html_dom and java. let the php crawler copy the pages on your local machine or webserver, open it with an java application (jpane or something) mark all text as focused and grab it. send it to your database or where you want to store it. track all a tags or tags with an onclick or mouseover attribute. check what happens when you call it again. if the source html (the document returned from server) size or md5 hash is different you know its an effective link and can grab it. i hope you can understand my bad english :D

Categories