Python - How to scrape multiple dynamically updated forms / webpages?

I've been trying to scrape a dynamically updated website. Each page contains hundreds of rows, and the site has thousands of pages in total (each page is reached by clicking a "next" button or a page number at the bottom, just like at the bottom of a Google search results page).
While I've been able to scrape the pages successfully, I haven't achieved 100% accuracy in my results, mainly because the pages are dynamically updated with JavaScript. When a user logs in to their account, the system moves their row back to the very top of the first page. So, for example, if I were on page 100, about to scrape page 101, and a user listed on page 101 logged in to their account, I would miss that user's info. Given the volume of activity, this is quite problematic.
I tried running my automation during the wee hours, but realized there are users worldwide, so that failed. I also can't scrape pages in parallel, because the forms are loaded through JavaScript and I've had to use Selenium to click through one page at a time. (There's no unique URL per page; I've also looked through my browser's Network tab, but no variable changes when I click to another page.) I also tried accessing the API following the instructions here, but the link I was able to obtain only returns the information on the current page, so it's no different from what I can already get from the HTML source.
What are my options? Is there some way I can capture all the information at once so that I don't risk missing anything?
I know people will ask for the URL, but unfortunately I can't give it away. Even if I did, I couldn't give away the username and password. I'm a beginner at web scraping, so any help is really appreciated!

If you've got no problem hitting the pages as many times as you want, and the information never disappears, just cycle through all the pages as fast as you can, over and over again. In Selenium you can control multiple tabs and/or browsers simultaneously, all sharing the same session cookie, to make your scraping faster.
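As a rough sketch of what that could look like with the Node selenium-webdriver bindings (the URL, the selectors, the login step, and the loop structure here are all assumptions you would adapt to the real site): one browser logs in, its cookies are copied into a second browser, and both sweep the pages repeatedly, merging rows into a shared set so a row that jumps back to page 1 mid-sweep is still caught on another pass.

const { Builder, By } = require('selenium-webdriver');

const SITE_URL = 'https://example.com/list'; // placeholder, not the real site

// Walk every page once, collecting the rows, then return so the caller
// can start a fresh pass from page 1.
async function sweep(driver, seen) {
  await driver.get(SITE_URL);
  while (true) {
    for (const row of await driver.findElements(By.css('table tr'))) {
      seen.add(await row.getText()); // a Set dedupes rows across passes
    }
    const next = await driver.findElements(By.css('a.next')); // hypothetical selector
    if (next.length === 0) break; // no "next" button: last page reached
    await next[0].click();
    await driver.sleep(500); // crude wait for the JS to render the next page
  }
}

(async () => {
  const a = await new Builder().forBrowser('chrome').build();
  const b = await new Builder().forBrowser('chrome').build();

  await a.get(SITE_URL);
  // ... perform the login in browser `a` here ...

  // Clone the logged-in session into the second browser. addCookie only
  // works once that browser is already on the same domain.
  await b.get(SITE_URL);
  for (const c of await a.manage().getCookies()) {
    await b.manage().addCookie(c);
  }

  const seen = new Set();
  // Two staggered, never-ending sweeps: whatever one browser misses when
  // a row jumps back to page 1, the other picks up shortly afterwards.
  await Promise.all([
    (async () => { while (true) await sweep(a, seen); })(),
    (async () => { while (true) await sweep(b, seen); })(),
  ]);
})();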

Related

Problem with Document expired after pressing "back button" in browser

We are using an e-commerce script which is encoded with ionCube. In the product catalog we have filters. They are sent via the POST method; because of the encoded script I cannot switch to GET, even if I wanted to.
When the user goes to a product's details by choosing one of the filtered products in the catalog, and then tries to go back via the browser's back button, he gets "Document expired". After clicking the refresh button, the correct catalog page is shown with all the filters that were chosen.
We tried setting this on the server:
ini_set('session.cache_limiter','public');
It helps with the above problem, but it breaks the cart page - everything goes haywire.
I tried many scripts found on Stack Overflow and in other places on the net, but none of them work.
Please note that I also cannot edit the PHP, because of ionCube. When I try to add anything to index.php, I get a corruption notice after the page reloads.
Any solution?

Is it possible to get url of child iframe with PHP after user navigates?

We are building an educational tool whereby students open a website in another tab/window and then search around the other site. Once they find the information, they enter the URL of the page they were on into a box. It's a bit clunky, and what we want to do is let them open the new site (bbc.co.uk for example) within an iframe that has a header at the top allowing them to return to their workbook.
When they navigate around the BBC site, we would like them to be able to click a button on our frame which grabs the URL they are on, plus some other info like the page title, and inserts it automatically into their workbook.
However, I can't seem to find how to grab the URL of the page being viewed within the iframe. As we send them to the BBC, I can get the iframe's initial src easily enough, but as soon as they start moving around the BBC site doing their research, there is no way for the parent page (on our domain) to see what page they are on.
I know this is not possible in JS due to the same-origin policy, but I was wondering if there is a workaround, or any other way to grab the URL. Our current way of doing things is clunky; we want to make the tool a lot easier to use.
Thanks
Paul

Go back button in single-page webapp

I'm building my first web app. I've got a small navigation bar in the header where a back button should be placed. The pages of the app are all placed in one document, so the divs are set to
<div id="page_1" class="page" style="display:none">
and are shown by clicking on a link:
onclick="show('Page_dash');
Now I want a back button which goes back to the last shown page. I've tried this:
onclick="history.go(-1);
but it's not working, because there is only one document which contains all the pages, so history.go(-1) goes back to the last visited website. So I'm looking for a good, fast and simple solution!
thanks
In order to create an effective single-page application (SPA), you need to implement a method of tracking history that feels traditional to your end users. There are a few different techniques for this, but as a developer of enterprise-level single-page applications, I highly recommend the URL hash method.
This technique lets your end users bookmark specific "pages" in your single-page app and use their browser's back button to return to the previous page. End users can become extremely frustrated with a single-page app if they hit the back button expecting the previous page and instead land back on Google, or whatever site they visited before yours.
Here is some additional reading on the subject: URL Hash Techniques
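To make that concrete for the show()/div-per-page setup described in the question, here is a minimal sketch of the hash approach (the .page class and the page_1 id come from the question; everything else is illustrative):

// Show exactly one page div at a time.
function show(id) {
  document.querySelectorAll('.page').forEach(function (el) {
    el.style.display = (el.id === id) ? 'block' : 'none';
  });
}

// Links set the hash instead of calling show() directly; each hash change
// creates a real history entry, e.g. <a href="#Page_dash">Dashboard</a>
function route() {
  show(location.hash.slice(1) || 'page_1'); // default to the first page
}

window.addEventListener('hashchange', route); // fires on back/forward too
window.addEventListener('load', route);       // honors bookmarked #hashes

A back button in the navigation bar can then simply call history.go(-1), because every page change is now a genuine history entry.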

detecting user closing tabs or leaving the site

I'd like to be able to tell when a user leaves the site so I can update a field in my database (specifically, a last_visit field used to determine unread items for each user). However, a user sometimes has several tabs of the site open, so I can't rely on onbeforeunload alone to accomplish this.
Ideally, I would update this field only when the tab being closed is the site's last open tab.
On the other hand, maybe I could get more functionality by simply using a table to record read items for several days and assuming that threads older than that are read by default.
What do you think?
Regards
All I can think of is using either cookies or local storage to record the time at which they last viewed your site, updated on each page load. That way, once they close all the tabs where your website is open, the cookie/local-storage entry stops updating, and you can read that value later when they return.
So run this every time the page loads:
window.localStorage.setItem('lastVisit', Date.now());
And to grab it:
var lastVisit = window.localStorage.getItem('lastVisit');
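One caveat worth adding: localStorage stores strings, so convert the value back to a number before doing arithmetic on it. A hedged sketch of how the value might be used on return (the 30-minute threshold and the markItemsReadBefore helper are made up for illustration):

var lastVisit = Number(window.localStorage.getItem('lastVisit'));
if (lastVisit && Date.now() - lastVisit > 30 * 60 * 1000) {
  // No tab has updated the timestamp for 30 minutes: treat this as a
  // new visit and mark older threads as read (app-specific logic).
  markItemsReadBefore(lastVisit); // hypothetical helper
}
window.localStorage.setItem('lastVisit', Date.now());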

Do links with javascript slow down a page?

Due to an issue that came up with a website, I have to use JavaScript for all of the links on the page,
like so...
<img src="image.png"/>
Will having many such JavaScript links on the webpage slow it down significantly?
Does the JavaScript run when the page initially loads, or only when a link is clicked?
EDIT: For those asking why I'm doing this: I'm creating an iPad site. When you use the 'Add to Home Screen' button to add the site as an icon, it lets users view the site with no address bar.
However, every time a link is clicked, it reopens Safari in a new window with the address bar back.
The only solution I could find was using a JavaScript-based link instead of a plain HTML link to open the page.
For further reference see...
iPad WebApp Full Screen in Safari
2nd answer
"It only opens the first (bookmarked) page full screen. Any next page will be opened WITH the address bar visible again. Whatever meta tag you put into your page header..."
3rd answer down
"If you want to stay in a browser without launching a new window use this HTML code:
a href="javascript:this.location = 'index.php?page=1'"
"
I can see this adding to the bandwidth needs of a site marginally (very marginally), but the render time and the click response time shouldn't be noticeably affected.
If it is a large concern I would recommend benchmarking the two different approaches to compare the real impact.
What do you mean by "slow it down"?
Page load time? That depends on the number of links on your page; it would have to be a LOT to be noticeable. Execution time? Again, not noticeable.
The better question to ask is: are you OK with effectively deleting your website for those without JavaScript?
Also, if you are worried about SEO, you will need to take additional measures to ensure your site can still be indexed. (I doubt Google follows those kinds of URLs... but I could be wrong.)
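One common way to keep the links crawlable (and usable without JavaScript) is to leave a real href in place and intercept the click with an inline handler, something like this sketch (untested against the full-screen behavior described above):

<a href="index.php?page=1" onclick="window.location = this.href; return false;"><img src="image.png"/></a>

Crawlers and non-JS users follow the href, while JS users navigate via script, which is the same mechanism the quoted javascript: links rely on.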
EDIT: Now that you've explained your situation above, you could simply "hide" the address bar instead. See this SO question.
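For reference, the classic trick from that era was to scroll the page down by one pixel after load, which made Mobile Safari collapse its address bar (it only works when the page is taller than the viewport, and modern iOS versions ignore it):

window.addEventListener('load', function () {
  setTimeout(function () {
    window.scrollTo(0, 1); // nudges the page so Safari hides the bar
  }, 0);
});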
