Screen Scraping from a web page with a lot of Javascript [closed] - javascript

Closed 6 years ago.
I have been asked to write an app which screen scrapes info from an intranet web page and presents certain info from it in a nice, easy-to-view format. The web page is a real mess and requires the user to click on half a dozen icons to discover whether an ordered item has arrived or has been receipted. As you can imagine, users find this irritating to say the least, and it would be nice to have an app anyone can use that lists the state of their orders on a single screen.
Yes, I know a better solution would be to rewrite the web app, but that would involve calling in the vendor and would cost us a small fortune.
Anyway, while looking into this I discovered the web page I want to scrape is mostly JavaScript (although it doesn't use any AJAX techniques). Does anyone know if a library or program exists which I could feed with the JavaScript and which would then spit out the DOM for my app to parse?
I can pretty much write the app in any language but my preference would be JavaFX just so I could have a play with it.
Thanks for your time.
Ian

You may consider using HtmlUnit.
It's a Java class library made to automate browsing without having to control a browser, and it integrates the Mozilla Rhino JavaScript engine to process JavaScript on the pages it loads. There's also a JRuby wrapper for it, named Celerity. Its JavaScript support is not perfect right now, but if your pages don't use many hacks things should work fine, and the performance should be way better than controlling a browser. Furthermore, you don't have to worry about cookies being persisted after your scraping is over, or about all the other nasty things connected to controlling a browser (history, autocomplete, temp files, etc.).

Since you say that no AJAX is used, all the info is already present in the HTML source. The JavaScript just renders it based on user clicks. So you need to reverse engineer how the application works, parse the HTML and the JavaScript code, and extract the useful information. It is strictly a matter of text parsing: you shouldn't need to run the JavaScript and produce a new DOM, which would be much more difficult.
If AJAX were used, your job would be easier. You could find out how the AJAX services work (they probably return JSON or XML) and extract the information.
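As a rough illustration of the text-parsing approach, here is a minimal Python sketch. Everything about the page is an assumption made up for the example: the `orders` variable name, its fields, and the fact that the data happens to be written as valid JSON inside an inline script. A real page would need its own pattern after you have read its source.

```python
import json
import re

# Hypothetical page source: the order data sits in an inline script,
# and the page's own JavaScript only decides what to show on each click.
PAGE_SOURCE = """
<html><body>
<script type="text/javascript">
var orders = [{"id": 101, "item": "Toner", "arrived": true, "receipted": false},
              {"id": 102, "item": "Paper", "arrived": false, "receipted": false}];
function showOrder(i) { /* renders orders[i] when an icon is clicked */ }
</script>
</body></html>
"""


def extract_orders(source):
    """Return the list assigned to the (assumed) 'orders' variable, or []."""
    # Grab the array literal between 'var orders =' and the first '];'.
    match = re.search(r"var\s+orders\s*=\s*(\[.*?\]);", source, re.DOTALL)
    if match is None:
        return []
    # Only works when the literal is valid JSON (quoted keys, no functions).
    return json.loads(match.group(1))


for order in extract_orders(PAGE_SOURCE):
    status = "arrived" if order["arrived"] else "not arrived"
    print(order["id"], order["item"], status)
```

If the embedded data is a JavaScript literal that is not valid JSON (unquoted keys, trailing commas), `json.loads` will reject it and the pattern matching has to be done field by field instead.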

You could consider using a Greasemonkey script. Greasemonkey is a very powerful Firefox add-on that allows you to run your own script alongside those of specific web sites. This allows you to modify how the web site is displayed and to add or remove content. You can even use it to do AJAX-style lookups and add dynamic content.
If your tool is for in house use, and users are all happy to use Firefox then this could be a winner.
Regards

I suggest the IRobotSoft web scraper. It is free, dedicated screen-scraping software with the best JavaScript support. You can create and test a robot with its visual interface. You can also embed it into your own application using its ActiveX control and hide the browser window.

I'd go with Perl's Win32::IE::Mechanize which lets you automate Internet Explorer. You should be able to click on icons and extract text while letting MSIE do the annoying tasks of processing all the JS.

I agree with kgiannakakis' answer. I'd be surprised if you couldn't reverse engineer the JavaScript to identify where the information comes from and then write some simple Python scripts using urllib2 and the Beautiful Soup library to scrape the same information.
If Python and scraping are new to you, there are some excellent tutorials available on how to get going.
[Edit] Looks like there's a Python version of mechanize too. Time to re-write some scrapers I developed a while back! :-)
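To give a feel for how small such a script can be, here is a sketch that pulls the cells out of an HTML table. Note the substitutions: it uses the standard library's `html.parser` rather than Beautiful Soup so that it runs without installing anything, and it parses a made-up `SAMPLE` string rather than fetching a page (in Python 3 the fetch would go through `urllib.request`, the successor to urllib2). The table layout is an assumption, not the real intranet page.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None       # cells of the row being read, or None
        self._in_cell = False  # True while inside a <td>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            # Rows with no <td> cells (for example header rows) are skipped.
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data.strip()


def scrape_rows(html):
    """Return a list of rows, each a list of cell strings."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical markup standing in for the real page.
SAMPLE = (
    "<table>"
    "<tr><th>Item</th><th>Status</th></tr>"
    "<tr><td>Toner</td><td> Arrived </td></tr>"
    "<tr><td>Paper</td><td>On order</td></tr>"
    "</table>"
)

print(scrape_rows(SAMPLE))
```

Beautiful Soup would shorten the parsing part considerably and copes better with broken markup; the structure of the script stays the same.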

I created a project, site2archive, that uses PhantomJS to render pages, including the JS-generated parts, and wget to scrape them. PhantomJS is based on WebKit, which provides a browsing environment similar to Safari and Google Chrome.

Related

Is it possible to edit and save website content to server, making the change viewable by everyone?

I have created a website for a third party, who have no experience in editing HTML. However, the third party wishes to be able to edit the content on the website without having to open the files and edit it this way, they wish to do it somewhat WYSIWYG (For example, hit an "edit" button for the content they wish to edit). Is this possible to achieve? It is not an internal website, it has user tracking (this should obviously only be available to admin users).
Is there a way of making contents of a div editable, then saving the change directly to the server, so the content gets updated publicly?
I am currently researching the topic, and although I have found some indications that the solution may be a PHP script, I have yet to find any definitive solutions or examples of similar functionality.
Yes, you will need a backend language or framework to achieve this. While JavaScript is used to interact with the page, the actual storage of information requires a database or similar technology.
Unfortunately which backend language or framework to choose really is the million dollar question. It largely depends on exactly what you are trying to accomplish, what your client or user is comfortable with, and how much experience you have programming.
PHP is a fast and time-tested backend language. Node is the new kid on the block, and it is also very popular. Java and .NET are on the way out. You can dig up a bunch more, including Go, Python, Haskell, etc.
You can use one of the languages listed above and start scripting away, but this can be time consuming and error prone. Most people use a framework to get started, and program using that framework's tools. The most popular PHP framework is WordPress, but it is designed for blogs and might not fit your use case. I use the framework Craft CMS, which is very customizable. But from the way you are phrasing the question, a framework might be overkill. This is really up to you to decide after researching the available options and comparing them to what you wish to accomplish.
For the WYSIWYG, you might want to look into the following tools for the client to edit content:
https://imperavi.com/redactor/
https://ckeditor.com/
Hopefully this provides some direction, happy coding!

Somebody help me to answer me why do we use script in asp.net? [closed]

Closed 7 years ago.
When I build a website with ASP.NET, the server calls SQL Server (through ADO.NET, LINQ, or Entity Framework), gets the data back, and sends it to the client side. I use controls such as GridView to show the data. I have also done some optimization (SQL Server: stored procedures, indexes, and partitioning to speed up data retrieval; website: a simple UI without too many effects).
But why, when the server returns data to the client, do so many people use scripts (such as JavaScript, jQuery, Node.js, AngularJS, Bootstrap, React, google o/i....) to show it in the web page?
Is that slower or faster than using GridView?
Also, browser makers allow users to turn scripts off on the client side. So why do we use scripts (*.js) when the user can stop them?
Many people who use a recent version of ASP.NET (2013) on the server also use scripts there. So is ASP.NET plus scripts faster or slower than ASP.NET alone?
Please help me answer this.
(I apologize that my English is not good.)
Thank you so much.
In the early years of Web development, Javascript on the client side provided for considerable enhancement of the client's "user experience" that static HTML delivered from the server did not. This includes such things as the enabling or disabling of certain interface features based on user input, the appearance or hiding of certain regions of a display based on user input, or combination of other pieces of data.
As web development evolved, the need for even more robust client-side interaction with back-end web servers became evident, and the "frameworks" you mentioned all work in various ways to improve the design, responsiveness, and behavior of a web-based application in ways beyond just enabling or disabling a button. This amounts to complex data binding, callbacks to web services, reducing server round-trips, and creating rich client interfaces, to name only a few.
They're all tools, each with their own role, each working to make web applications a bit more robust than those of the generation before them.
If I understand your question right, the answer comes down to speed and preference.
Firstly, if you disable client-side javascript, your asp.net controls aren't going to really work anyway. You'll find few places that still disable this so it's not really a concern people have anymore.
Secondly, it comes down to where you want to focus development effort and what kind of developers you have. If you have a lot of people used to working on the backend (C#) who want to stay there, then using ASP.NET controls and the like makes development easier.
If you have JavaScript developers or people who want to use it, then you have more options that let you decouple your server-side code from your front-end code. This can work out well for maintenance purposes.
The real point is that if you can utilize ajax (http://www.w3schools.com/ajax/default.asp) within your web application, you can make it a lot more responsive. ASP.NET Controls can often cause your page to refresh and cause unnecessary server-side computing to get the data and re-render the entire page (or partial page with asp.net mvc). Using new technologies like angular and others you listed, you can focus data computation and network traffic only on what's important.
For example, if you need a table to change what data is loaded, you can make an ajax request JUST for the data you need to load and then just render that portion on the client.
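The server side of that idea can be sketched in a few lines. This is an illustration in Python rather than ASP.NET, and the names (`ROWS`, `rows_as_json`) and the shape of the JSON payload are assumptions made for the example: the point is only that the response carries one page of data, not a re-rendered page.

```python
import json

# Hypothetical data set standing in for a database query result.
ROWS = [{"id": i, "name": "Item %d" % i} for i in range(1, 8)]


def rows_as_json(rows, page, page_size):
    """Return only the rows for one page of the table, serialized as JSON."""
    start = (page - 1) * page_size
    chunk = rows[start:start + page_size]
    return json.dumps({"page": page, "total": len(rows), "rows": chunk})


# What an AJAX handler might send back for "page 2, three rows per page".
print(rows_as_json(ROWS, page=2, page_size=3))
```

The client-side script then replaces just the table body with the returned rows, so neither the server nor the network has to deal with the rest of the page again.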
First of all, every "script" you mentioned (jQuery, AngularJS, Bootstrap, React) is a library written in JavaScript, except Node, which isn't even front-end. And I'm not sure what you meant by google o/i... JavaScript is currently the only language which works in all browsers.
The initial purpose of JavaScript was to check form values before sending data to servers. It quickly evolved past that, although its usage was throttled by browser adoption, which is still a problem today.
Nowadays we can use JavaScript to render whole web pages. When the page first opens, it can help with rendering, so the server doesn't have to do all the work and can just send plain data, usually as JSON. It's also used to add content to the page later without reloading it (AJAX). The best-known examples are real-time chat systems, like the one on Facebook. This greatly improves the user experience; I can't imagine how terrible it would be if the whole page reloaded to display a single new message.
A user can disable JavaScript in their browser, and the page then probably won't work unless there is a fallback design for such cases, but I do not know why someone would disable it. To be honest, most regular users probably don't even know that this can be done or where the setting is.

Is there an easy way to make Javascript apps SEO friendly? [closed]

Closed 9 years ago.
I've been looking at using a new workflow process for web development. Yeoman, Grunt, and Bower with AngularJS seem like a great solution for front-end development. The only downside is that the SEO is absolutely horrible. This seems like a HUGE component of the business decision driving adoption of these services, yet I can't find any solutions.
What's a solid solution for making SEO-friendly javascript apps?
The current standard practice for making ajax heavy sites/apps SEO friendly is to use snapshots. See the google tutorial on this here: https://developers.google.com/webmasters/ajax-crawling/docs/html-snapshot and here: https://developers.google.com/webmasters/ajax-crawling/docs/specification
To summarize, you add this tag <meta name="fragment" content="!"> to your DOM. The crawler will see this and redirect itself from www.example.com to www.example.com?_escaped_fragment_= where it will be expecting the snapshot of the page.
You could manually copy the html from your site after all ajax is finished, and create your snapshot files yourself. However, this could be quite a nuisance. Instead, you could use PhantomJS to automate this process for you. Personally, I am going to use .htaccess to send the escaped_fragment requests to a single php file which has cached markup created from the content manager when the edits were made. This allows it to recreate the markup for crawlers to view (but no functionality for humans).
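To make the URL mapping concrete, here is a small Python sketch of how a crawler following that scheme would turn a page URL into the URL it requests the snapshot from. The function name `crawler_url` is made up for the example, and the set of characters left unescaped is an approximation of the linked specification rather than a verified copy of it.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def crawler_url(url):
    """Map a page URL to the _escaped_fragment_ URL a crawler would fetch."""
    parts = urlsplit(url)
    fragment = parts.fragment
    if fragment.startswith("!"):
        state = fragment[1:]   # "#!key=value" style URL
    elif fragment == "":
        state = ""             # no fragment: page opted in via the meta tag
    else:
        return url             # ordinary "#anchor", left alone
    # Escape characters that would be ambiguous in a query string (&, #, %, +).
    escaped = "_escaped_fragment_=" + quote(state, safe="=/!,:;@$'()*")
    query = parts.query + "&" + escaped if parts.query else escaped
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(crawler_url("http://www.example.com/"))
print(crawler_url("http://www.example.com/ajax.html#!key=value"))
```

On the server, the `.htaccess` rule or PHP file mentioned above only has to check for the `_escaped_fragment_` parameter and return the cached markup for that state.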
Here's a relevant piece of info from Debunking 10 common KnockoutJS myths. I assume it applies more or less equally to Angular.
Graceful degradation in absense of javascript depends on the way your
application has been architectured. Although KO being a pure
javascript library, does not offer any support for graceful
degradation in absence of javascript, nevertheless unlike many of the
competing technologies it does not hinder graceful degradation.
To create a KO application that degrades gracefully, just ensure that
the initial state of the page that is rendered by the server suffices
to convey the information that a user should see in absence of
javascript. Fallback mechanisms (eg simple forms and links) should be
available that provide the complete (or partial) application
functionality in absence of javascript. Then when you create your view
models you can instantiate them from the data already available from
the DOM and future data can be loaded via ajax without refreshing the
page.
A good example for this functionality can be a grid. The basic HTML
page served by the server can contain a simple HTML table with support
for traditional links for pagination. Then you can create your view
models from the data present in the table ( or ajax if a bit of
redundant data load does not matter for you) and utilize KO for
interactive bindings.
Since KO does not use special inline markup or custom html tags, but
rather simple data-bind attributes which are anyways not visible in
absence of javascript, it does not hinder graceful degradation.

How does Gmail , Twitter, Grooveshark, and those web app built with pure javascript UI prevent people from stealing their code? [closed]

Closed 10 years ago.
I am currently considering building a single-page web app using a RESTful API and putting the entire UI logic in JavaScript on the client side. This design concept has been adopted by Twitter and several other web apps.
However, I am wondering how to prevent users from stealing my JavaScript code, since my app logic is all stored in JavaScript. Do products like Gmail, Grooveshark, or Twitter not care about this issue? Do they not care if people can replicate their app just by copying the JavaScript? If so, does that not bring a lot of risk to the business?
I hope someone can answer my question, as I am trying to figure out how other people build their apps and whether anyone has similar concerns about this issue.
On a pure technical level you can't. Any Javascript code readable by a browser can be read by a developer UserAgent. In fact there are browser addons which allow the user to read the Javascript behind or linked by any web page.
Having said that, you can make hijacking of your Javascript code harder by using Minification. (eg: http://code.google.com/p/minify/)
As previously stated, there is no way to prevent "code stealing". Just remember we are in a world where code isn't valued anymore. It's so easy to build an application that what really matters is the branding around it.
Anyone can build a Facebook of their own, but the real value is the number of users on Facebook. I don't believe that companies try to protect their code anymore; they in fact make it easy for you to get it via GitHub or the like. Talking about their products and the way they are made is more beneficial to them than you think.
Just take a look at Twitter Bootstrap. The investment they put into that code is well rewarded by all the people building apps on their technology. It reinforces the technical value of their systems.
You can minify/obfuscate your JavaScript code, making it essentially unreadable.
For example: http://code.google.com/p/minify/
or check this question:
How can I obfuscate (protect) JavaScript?
If your business requirements state that your source must remain a closely guarded secret and you are attempting to make a single webpage that contains all your business logic you have a conflicting design.
No matter how much obfuscation or minification you perform on your client-side code, there is going to be a way (anything from simple browser plugins to Firebug can do this) to deobfuscate your code.
There is no such thing as real "security through obscurity".
Take a look at:
http://a0.twimg.com/b/1/bundle/phoenix-core-en-201112200936.js
http://a2.twimg.com/b/1/bundle/phoenix-more-en-201112200936.js
And consider how hard it is to extract useful information from the code.
This is some of the JavaScript code that your browser downloads when you visit a page on Twitter. This code has been minified (to make it more efficient to move around the network) and obfuscated (to make it harder to read). These techniques make it much harder for the casual user to re-use or reverse-engineer your code. Tools for doing this are widespread and include Google's Closure Compiler, Yahoo's YUI Compressor, and others.
No such tool is perfect, however. They won't stop a determined hacker -- of course, a determined hacker could probably just reproduce the functionality, which leads to your best defense, IMHO -- which is your copyright.
When you create software, that software is protected by copyright law, in much the same way as other works are (see Software Copyright). If you create a hot new javascript app, and someone rips the code and puts it in their app, you have grounds for legal action. However, the law doesn't just prevent them from using it exactly "as is". From Wikipedia:
There is a certain amount of work that goes into making copyright
successful and just as with other works, copyright for computer
programs prohibits not only literal copying, but also copying of
"nonliteral elements", such as program structure and design.
This can be very valuable protection.

Alternatives to GWO (Google Website Optimizer)? [closed]

Closed 7 years ago.
We're using GWO (Google Website Optimizer) now. The multivariate and A/B testing is exactly what we need and works great from the perspective of showing the variations to the users. However, we have several issues that make me want to use a different tool:
Statistics are inaccurate compared to Google Analytics, so we now disregard them and have to manually check
Previews typically don't work
Cannot have dynamic content in variations (I know about variation_content, but I cannot get it to work and nobody in google's forums has been able to help.. I suspect google may have stopped supporting this)
Documentation is poor, there's a techie guide with well-known inaccuracies which haven't been fixed in well over a year.
The html/javascript code we modify our multivariate test sections with is ugly and makes our pages fail standards validation
Only 8 test sections per page, problem there is we want to allow our marketeers the ability to do everything they need from within GWO, but now they need to mark off which test sections they want/don't want in our custom tool
Different experiment key for every test, again it makes marketeers need to work with our code sometimes
Is there a good tool like GWO that works with Google Analytics (which I love)?
UPDATE: We went with Optimizely and have generally been happy. However, it can be difficult to work with because it does a little too much for you. You edit your webpage directly from their UI, but of course that isn't always easy or even possible. Particularly when Javascript is involved. Our UI often gets screwed up in the process. I liked GWO's approach to this in that the developer sections off the code and the marketer can then fill in those sections with variables the developer allows for. To me that's ideal, except that GWO, of course, doesn't actually work.
There's a very similar competitor to Optimizely called Visual Website Optimizer. Also looks very nice, but has the same issue I describe above.
Is there a GWO that works?
You should take a look at Optimizely.
Doesn't require creating invalid code.
Easy to create variations on the fly, though only A/B, not MVT.
Simple WYSIWYG test design, on the fly.
Real time data.
Retroactive goals
With regex/head matching for experiments, you can set the experiment to work on dynamic pages.
You can set a Google Analytics custom variable for the experiment that will pass the variation the end user sees as a custom variable. (It even allows you to set what slot you want it to use.)
The test variations are basically just jQuery manipulations of the DOM, so if you know a little jQuery, it's very easy to extend its capabilities even further than the very generous WYSIWYG GUI.
Installation is easy: You only need to include a single script tag, one time, on any experiment or goal page.
I have found Adobe Test&Target to have all the features you need. It is very easy to create experiments, variations, and conversion goals. You can easily inject jQuery snippets to create new variations, click Save, and the test is running in production.
I have no idea how much it costs, but I'm guessing it is not cheap.
Now that Google Website Optimizer has killed multivariate testing in its new version (Google Analytics Content Experiments), we have launched Convert Experiments on Convert.com for people looking for a GWO alternative with MVT.
Yes I am Founder of Convert Insights, the company behind this tool...
Dennis
Re your update: I have tried both GWO and Optimizely, and I'd go with Optimizely every time.
You say you wish it worked a little more like GWO - if you want, instead of manipulating the elements of the design using their GUI, you can just redirect each variant to a different URL:
https://help.optimizely.com/hc/en-us/articles/200040675
There are a few other tools which do A/B and MVT. Aren't free, but check them out for yourself: Omniture, Webtrends Optimize, SiteSpect.
Hope this helps!
You can also try VWO. It does MVT as well as A/B testing and is also a good tool. Optimizely is easier to use though so you might want to evaluate both for your scenario.
