Cannot get form from webpage - javascript

I am trying to get the login form from:
https://www.etoro.com/login
When I inspect the page in Chrome I can see the form element; however, when I use the Jaunt API in Java I cannot get the form.
UserAgent userAgent = new UserAgent();            // com.jaunt.UserAgent
userAgent.visit("https://www.etoro.com/login");   // fetches only the server's static HTML
List<Form> forms = userAgent.doc.getForms();      // forms present in that HTML
System.out.println(forms.size());                 // prints 0
I have little experience in HTML so any direction would be great!
This is my first post, so if I haven't done something correctly please let me know.
Thank you very much!

Well, you are out of luck with a simple Java web scraper.
If you look at the page source in the browser, you will see that it consists mainly of one long <script>. The whole login form is then created in the browser by JavaScript.
If you absolutely must scrape this exact form, you need a tool that can execute JavaScript. For this, you could use PhantomJS. That's basically a complete browser that can be controlled through a JavaScript API.
Search Google for phantomjs web scraping to get started.
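A minimal sketch of such a PhantomJS script, written here in TypeScript (compile it to plain ES5 JavaScript before running it with the phantomjs binary). The declare lines merely stand in for PhantomJS's built-in globals, and counting forms is just an illustration of what you might do once the page's JavaScript has run:

declare const require: (module: string) => any;         // PhantomJS module loader
declare const phantom: { exit(code?: number): void };    // PhantomJS global

const page = require('webpage').create();                // headless WebKit page

page.open('https://www.etoro.com/login', (status: string) => {
  if (status !== 'success') {
    console.log('Page failed to load');
    phantom.exit(1);
    return;
  }
  // This runs inside the page after its scripts have executed, so the
  // dynamically built login form should now exist in the DOM.
  const formCount = page.evaluate(() => document.querySelectorAll('form').length);
  console.log('forms found: ' + formCount);
  phantom.exit();
});

Because the form is built by AngularJS after the initial response, you may still need a short delay or a wait for a specific selector before evaluating.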

Related

Is there a way to retrieve information from a webpage and use that info to prefill another page?

I'd like to create a browser extension or a javascript to retrieve pieces of information from a currently open webpage and use that info to prefill a form on another webpage.
So, for example, if I'm on a StackOverflow page, I'd like a script that takes info from that page (title, question,...) and prefill that data in a new webpage (eg: https://stackoverflow.com/questions/ask).
I'm not an expert in coding, but I created some scripts using Python and Selenium (nothing too fancy though). I looked for a similar question, but I didn't find anything. Does anyone have an idea on how I could accomplish something like that?
You can do it with the help of a content script: fetch the required data from the open webpage using a content script and store it in extension storage. Then you can inject the data into the required input on the target webpage with executeScript().
For more details, see https://developer.chrome.com/extensions/content_scripts
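A rough sketch of that flow, assuming a Manifest V2 extension with the Chrome typings installed; the h1 selector, the scrapedTitle storage key, and the #title input are made-up examples rather than real StackOverflow internals:

// content-script.ts -- injected into the page you are reading (per the manifest's matches)
const scrapedTitle = document.querySelector('h1')?.textContent?.trim() ?? '';
chrome.storage.local.set({ scrapedTitle });   // hand the value to the rest of the extension

// background.ts (or a popup script) -- later, with the target form open in the active tab
chrome.storage.local.get('scrapedTitle', ({ scrapedTitle }) => {
  chrome.tabs.executeScript({
    // '#title' is a hypothetical input on the page being prefilled
    code: 'document.querySelector("#title").value = ' + JSON.stringify(scrapedTitle) + ';',
  });
});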

I can't see data that is on the web page in the source code, but I can see it via inspect element

I've been scratching my head since yesterday trying to figure this out.
When I navigate to the account settings page and view the source code, there is literally no user-specific data like name, email, gender, etc., but when I check via inspect element it's there. The same happens with other pages, like order history.
I'm assuming the data is being generated dynamically (am I right?)
I have two questions about this.
How do developers do this?
What's the purpose of doing this? Since developers take the extra pain of generating the data dynamically, it must be solving some problem; otherwise why would they do it?
By generating the page dynamically, developers can improve the user experience. For example, if the settings page were a separate HTML file, the user would have to make a request to your server and wait for that file before seeing the page (maybe 1/3 of a second). If instead the developer generates the page dynamically with JavaScript or a framework, the templates and rendering logic are already on the user's machine, so the page can appear much more quickly (~1/500 of a second); only the user-specific data still has to be fetched.
Hope this helps.
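As a sketch of the pattern being described (the /api/account URL, the field names, and the settings element are assumptions for illustration): the HTML that "view source" shows contains only an empty placeholder, and the user-specific data arrives later as JSON and is written into the DOM by script, which is why only inspect element sees it.

interface Account {
  name: string;
  email: string;
}

async function renderSettings(): Promise<void> {
  const res = await fetch('/api/account');          // this response never appears in "view source"
  const account: Account = await res.json();
  const el = document.getElementById('settings');   // an empty <div id="settings"> in the static HTML
  if (el) {
    el.textContent = account.name + ' <' + account.email + '>';
  }
}

renderSettings();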

DOM and JavaScript Engine used in Office365 Thick Clients

The product I work on offers SSO into Office365, through both the web and native "thick" (rich) clients. Part of SSO-ing into an Office365 app, such as Excel, involves displaying my product's login page inside the login popup window of the thick client. The problem is that, on Windows only, I get many JavaScript errors when the JavaScript included in our login page executes (the page happens to use AngularJS, but I suspect many frameworks/libraries would be incompatible). It appears that console is not supported, along with document.body and many other "essentials".
Does anyone have any knowledge of the DOM and script engines used here? The first page shown in the SSO flow is Microsoft's login page, where you enter your email address; it then redirects to my product's login page (mapped by the domain of the email address), and their page seems to render fine, so clearly it's possible to get HTML and JS to work well enough. I'd also take a recommendation on any kind of shim/polyfill that would help me get moving.
After doing some more digging, I was able to solve the problem by specifying an HTTP response header named X-UA-Compatible with the value IE=edge, which tells IE to render using the latest document standards. The web view was originally rendering in IE7 compatibility mode, which explains why none of my JS was working as intended.
See https://stackoverflow.com/a/6771584/3822733 for more information on X-UA-Compatible, this is the question/answer that helped me solve this problem.
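For illustration only, here is roughly what sending that header looks like from a small Node/TypeScript server (the server itself is an assumption; any web server or framework can set the same header, and only the header name and value matter):

import * as http from 'http';

http.createServer((req, res) => {
  // Tell embedded IE-based web views to use the newest document mode
  // instead of falling back to IE7 compatibility mode.
  res.setHeader('X-UA-Compatible', 'IE=edge');
  res.setHeader('Content-Type', 'text/html');
  res.end('<!DOCTYPE html><html><body>Login page goes here</body></html>');
}).listen(8080);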

How do you keep content from your previous web page after clicking a link?

I'm sorry if this is a newbie question but I don't really know what to search for either. How do you keep content from a previous page when navigating through a web site? For example, the right side Activity/Chat bar on facebook. It doesn't appear to refresh when going to different profiles; it's not an iframe and doesn't appear to be ajax (I could be wrong).
Thanks,
I believe what you're seeing in Facebook is not actual "page loads", but clever use of AJAX or AHAH.
So ... imagine you've got a web page. It contains links. Each of those links has a "hook" -- a chunk of JavaScript that gets executed when the link gets clicked.
If your browser doesn't support JavaScript, the link works as it normally would on an old-fashioned page, and loads another page.
But if JavaScript is turned on, then instead of navigating to the href, the code run by the hook sends a request to a different URL that returns just the HTML needed to replace a DIV that's already showing somewhere on the page.
There's still a real link in the HTML just in case JS doesn't work, so the HTML you're seeing looks as it should. Try disabling JavaScript in your browser and see how Facebook works.
Live updates like this are all over the place in Web 2.0 applications, from Facebook to Google Docs to Workflowy to Basecamp, etc. The "better" tools provide the underlying HTML links where possible so that users without JavaScript can still get full use of the applications. (This is called Progressive Enhancement or Graceful degradation, depending on your perspective.) Of course, nobody would expect Google Docs to work without JavaScript.
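A bare-bones sketch of that hook pattern (the data-fragment attribute, the /fragments URL, and the content div are invented for the example): the link keeps its real href for browsers without JavaScript, and with JavaScript on, the click is intercepted and only a fragment of HTML is swapped into an existing div, so things like a chat sidebar never reload.

document.querySelectorAll<HTMLAnchorElement>('a[data-fragment]').forEach((link) => {
  link.addEventListener('click', async (event) => {
    event.preventDefault();                              // stop the normal full page load
    const response = await fetch('/fragments' + new URL(link.href).pathname);
    const html = await response.text();
    const target = document.getElementById('content');   // the DIV being replaced
    if (target) {
      target.innerHTML = html;                           // everything outside it stays put
    }
  });
});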
In the case of a chat like Facebook's, you must save the entire conversation on the server side (for example, in a database). Then, when the user changes pages, you can restore the state of the conversation either on the server side (with PHP) or by querying your server the way you already do for the chat (JavaScript + AJAX).
This isn't done in Javascript. It needs to be done using your back-end scripting language.
In PHP, for example, you use Sessions. The variables set by server-side scripts can be maintained on the server and tied together (between multiple requests/hits) using a cookie.
One really helpful trick is to run HTTPFox in Firefox so you can actually monitor what's happening as you browse from one page to the next. You can check out the POST/Cookies/Response tabs and watch for which web methods are being called by the AJAX-like behaviors on the page. In doing this you can generally deduce how data is flowing to and from the pages, even though you don't have access to the server side code per se.
As for the answer to your specific question, there are too many approaches to list (cookies, server side persistence such as session or database writes, a simple form POST, VIEWSTATE in .net, etc..)
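To make the session approach concrete, here is the same idea sketched in Node/TypeScript rather than PHP, using express and express-session (the /chat route and the message list are made up); the session middleware ties requests from the same visitor together with a cookie, which is exactly the role PHP sessions play above.

import express from 'express';
import session from 'express-session';

const app = express();
app.use(session({ secret: 'change-me', resave: false, saveUninitialized: false }));

app.get('/chat', (req, res) => {
  // req.session persists on the server between this visitor's requests.
  const s = req.session as any;                 // keep the sketch free of extra typings
  s.messages = s.messages ?? [];
  s.messages.push('visited at ' + new Date().toISOString());
  res.json(s.messages);
});

app.listen(3000);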
You can reopen your last closed web page by pressing Ctrl+Shift+T, and then save its content as you like. For example, if I closed a web page about document sharing and I am now on a travel web page, pressing Ctrl+Shift+T automatically reopens my last web page. This shortcut works in Firefox, Internet Explorer, Opera, and more. Hope this answer is helpful to you.

Web crawler/spider to fetch AJAX-based links

I want to create a web crawler/spider to iteratively fetch all the links on a webpage, including JavaScript-based (AJAX) links, catalog all of the objects on the page, and build and maintain a site hierarchy. My questions are:
Which language/technology is better suited to fetching JavaScript-based links?
Are there any open-source tools for this?
Thanks
Brajesh
You can automate the browser. For example, have a look at http://watir.com/
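The same browser-automation idea sketched with Puppeteer in TypeScript rather than Watir (which is Ruby); because a real browser executes the page's JavaScript, links added by AJAX end up in the DOM and can be collected like any other anchor. The example.com URL is a placeholder.

import puppeteer from 'puppeteer';

async function collectLinks(url: string): Promise<string[]> {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle0' });   // wait for AJAX traffic to settle
  const links = await page.$$eval('a[href]', (anchors) =>
    anchors.map((a) => (a as HTMLAnchorElement).href)
  );
  await browser.close();
  return links;
}

collectLinks('https://example.com').then((links) => console.log(links.join('\n')));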
Fetching AJAX links is something that even the search giants haven't fully accomplished yet. That is because AJAX links are dynamic, and both the request and the response vary greatly depending on the user's actions. That's probably why SEF-AJAX (Search Engine Friendly AJAX) is being developed: it is a technique for making a website completely indexable by search engines while, when visited by a web browser, it acts as a web application. For reference, you may check this link: http://nixova.com
No offence, but I don't see any way of tracking AJAX links. That's where my knowledge ends. :)
You can do it with PHP, simple_html_dom, and Java. Let the PHP crawler copy the pages to your local machine or web server, open them with a Java application (a JPanel or something), mark all the text as focused, and grab it. Send it to your database or wherever you want to store it. Track all a tags, or tags with an onclick or mouseover attribute, and check what happens when you call them again. If the source HTML (the document returned from the server) differs in size or MD5 hash, you know it is an effective link and can grab it. I hope you can understand my bad English :D
