Puppeteer - Scraping FiveThirtyEight Presidential Approval doesn't work

I saw one other SO question, but it had no answers, and was from 2018. Am I supposed to intercept XHR? JSON Get requests? Any help getting started here appreciated.
I try to scrape by selector or xpath, and it always times out saying the xpath doesn't exist. I've tried various functions to wait until the page is well and truly loaded, no dice. Something funky seems to be occurring for puppeteer that it doesn't actually load the page, whereas for me, as a real human, it well and fully loads, and I see an element with the approval rating.

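The rating is rendered inside an iframe, which is why selectors run against the outer page never match. Here is its src: https://projects.fivethirtyeight.com/trump-approval-ratings/promo.html
If you load that URL directly, you can get the values with: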
document.querySelector('.approve .val').innerHTML
// "42.6"
document.querySelector('.disapprove .val').innerHTML
// "52.7"
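To use this from Puppeteer, one option is to navigate straight to the iframe URL instead of the outer page. The following is a minimal sketch, assuming the `puppeteer` package is installed and that the page still uses the `.approve .val` and `.disapprove .val` selectors quoted above; `readApproval` and `main` are hypothetical names chosen for illustration.

```javascript
// URL of the iframe document, taken from the answer above.
const PROMO_URL =
  'https://projects.fivethirtyeight.com/trump-approval-ratings/promo.html';

// Hypothetical helper: reads both numbers from an already-open page.
// `page` is expected to behave like a Puppeteer Page object.
async function readApproval(page) {
  // Wait until the element has been rendered by the page's scripts.
  await page.waitForSelector('.approve .val', { timeout: 30000 });
  const approve = await page.$eval('.approve .val', (el) => el.innerHTML.trim());
  const disapprove = await page.$eval('.disapprove .val', (el) => el.innerHTML.trim());
  return { approve: parseFloat(approve), disapprove: parseFloat(disapprove) };
}

// Hypothetical entry point; assumes `npm install puppeteer` has been run.
async function main() {
  const puppeteer = require('puppeteer');
  const browser = await puppeteer.launch();
  try {
    const page = await browser.newPage();
    await page.goto(PROMO_URL, { waitUntil: 'networkidle2' });
    console.log(await readApproval(page));
  } finally {
    await browser.close();
  }
}

// To run it: main().catch(console.error);
```

If you would rather stay on the outer page, Puppeteer also exposes the page's frames through `page.frames()`, and the same selectors can be evaluated on the matching frame.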

Related

Google Tag Manager and Single Page APP: How to read a DOM Element?

I want to create a variable on my GTM to store a DOM element. I've tried a custom javascript, for example:
function() {
  return document.querySelector('.room__price-value').innerText;
}
But it doesn't work: in GTM preview I always see null. I think the issue is the single-page app.
And I can't involve the programmers.
Any suggestions?
Thanks

Make sure that when in preview, you check the value of your variable on every event on the left.
Make sure when you execute document.querySelector('.room__price-value').innerText; in your console, you actually get the result you're expecting.
Change the code until you get the result in the local console.
Make sure your timing is good, as Eike mentioned in the comment.
What you're doing there seems right. That's pretty much how you do a DOM scrape in GTM when front-end devs aren't available. If the suggestions still don't work for you, give us more info on what you're doing, what your DOM looks like, what your debugging showed, and so on. As much potentially useful context as you can.
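On the timing point: if the element is not in the DOM yet when the variable is evaluated, `querySelector` returns null and reading `.innerText` on it throws. A minimal sketch of a guarded read follows; `readPrice` is a hypothetical helper, and the document is passed in as a parameter only so the function can be exercised outside a browser.

```javascript
// Hypothetical helper: returns the trimmed price text, or undefined when the
// element has not been rendered yet (common in single-page apps).
function readPrice(doc) {
  var el = doc.querySelector('.room__price-value');
  if (!el) {
    return undefined; // element not in the DOM at this point in time
  }
  return (el.innerText || el.textContent || '').trim();
}

// In a GTM Custom JavaScript variable the same logic would be inlined into
// the anonymous function, using the global `document`:
// function() {
//   var el = document.querySelector('.room__price-value');
//   return el ? el.innerText.trim() : undefined;
// }
```

With this guard, an undefined value in preview tells you the element was missing on that event, so you can move the read to a later event where it exists.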

Programatically retrieve count of javascript errors on page

I'd like to write a test case (using Selenium, but that's not the point of this question) to validate that my web application has no script errors/warnings or unhandled exceptions at certain points in time (like after initializing a major library).
This information can easily be seen in the debug consoles of most browsers. Is it possible to execute a JavaScript statement to get this information programmatically?
It's okay if it's different for each browser, I can deal with that.

I read about a similar issue not long ago (as far as I understand your problem) here.
The idea is the following:
I found, however, that I was often getting JavaScript errors when the page first loaded (because I was working on the JS and was introducing errors), so I was looking for a quick way to add an assert to my test to check whether any JS errors occurred. After some Googling I came to the conclusion that there is nothing built into Selenium to support this, but there are a number of hacks that can be used to accomplish it. I'm going to describe one of them here. Let me state again, for the record, that this is pretty hacky. I'd love to hear from others who may have better solutions.
I simply add a script to my page that will catch any JS errors by intercepting the window.onerror event:
<script type="text/javascript">
  window.onerror = function (msg) {
    $("body").attr("JSError", msg);
  };
</script>
This will cause an attribute called JSError with a value corresponding to the JavaScript error message to be added to the body tag of my document if a JavaScript error occurs. Note that I'm using jQuery to do this, so this specific example won't work if jQuery fails to load. Then, in my Selenium test, I just use the command assertElementNotPresent with a target of //body[@JSError]. Now, if any JavaScript errors occur on the page my test will fail and I'll know I have to address them first. If, for some strange reason, I want to check for a particular JavaScript error, I could use the assertElementPresent command with a target of //body[@JSError='the error message'].
Hope this fresh idea helps you :)
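Since the question asks for a count of errors, a variation of the same idea is to collect errors in an array instead of a body attribute, which also removes the jQuery dependency. This is a sketch only: `installErrorCollector` and the `__jsErrors` property are hypothetical names, and the window object is passed in so the function can be tried outside a browser.

```javascript
// Hypothetical helper: records every error reported to window.onerror.
function installErrorCollector(win) {
  win.__jsErrors = [];
  var previous = win.onerror;
  win.onerror = function (message, source, lineno) {
    win.__jsErrors.push({ message: String(message), source: source, line: lineno });
    // Chain to a handler that was installed earlier, if there was one.
    if (typeof previous === 'function') {
      return previous.apply(this, arguments);
    }
    return false; // do not suppress the browser's default error reporting
  };
  return win.__jsErrors;
}

// In the page, as early as possible: installErrorCollector(window);
```

A test can then read the count by executing a script such as `return window.__jsErrors.length;` through the driver's script-execution call and asserting that it is zero. Note that this only sees errors raised after the collector is installed.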

try {
  // code
} catch (exception) {
  // send ajax request: exception.message, exception.stack, etc.
}
More info - MDN Documentation

Google AdSense JavaScript causing multiple page-loads?

Update
Ok - I now know where the multiple page loads are coming from! (However, the mystery is not yet solved).
It seems that immediately after a request is made to a page containing AdSense ads, Google makes a request for exactly the same URL (one or more times)
e.g. this is what the logs look like (note requests from Mediapartners-Google):
2011-07-20 09:50:20 xxx.xxx.xxx.xxx GET /requestedURL/ 80 - xxx.xxx.xxx.xxx Mozilla/5.0+(Browserstring removed) 200 0 0 1140
2011-07-20 09:50:20 xxx.xxx.xxx.xxx GET /requestedURL/ 80 - 66.249.72.52 Mediapartners-Google 200 0 64 218
2011-07-20 09:50:22 xxx.xxx.xxx.xxx GET /requestedURL/ 80 - 66.249.72.52 Mediapartners-Google 200 0 0 171
(I should have paid more attention to the IIS logs, rather than my own application logs - it just didn't occur to me that these multiple, identical, simultaneous requests could have been coming from different sources). This also explains why I couldn't find anything strange when analysing the request with Wireshark, and why Fiddler didn't show anything strange.
So the question for the bounty now becomes:
Why is Google making these requests so quickly after the page is requested? (I know they need to assess the page for content, but immediately after, and multiple times, seems like abuse to me.)
What can I do to stop this?
And out of interest:
Has anyone else seen something similar in their logs? (or is this something weird with my AdSense account)
Ok, I'll apologise in advance for the length!...
This question is related to this one, regarding Google AdSense JavaScript code causing errors (of the form Unable to post message to googleads.g.doubleclick.net. Recipient has origin something.com).
I won't duplicate all of the information there, but the conclusion seems to be that the AdSense JS is buggy. (please read the question for background if you have time).
I knew about this problem for some time, but decided to live with the JS errors rather than pulling AdSense from the site.
However, recently I noticed that in my ASP.NET MVC2 application, Controller Actions seemed to be called twice per page request (sometimes even 3 times). Oddly, it was only happening on the production server. After some thought I realised that one difference between the Dev and Production environments was that the AdSense JavaScript was only active in production.
To test this I removed all AdSense code from one of the production pages, and lo and behold, the multiple-page-load problem went away!
I thought that perhaps it was the fact that there were general JS errors on the page that was causing the problem, so to test this I introduced some simple errors into my own JS code, however this did not cause the multiple-page-load problem to reappear.
One known situation where pages can be called multiple times per request is when there are image tags with empty src attributes, or external resource references with empty src attributes. Crucially, the most upvoted answer to the AdSense JS Bug question notes that:
"The targetOrigin argument in this call, this.la is set to
http://googleads.g.doubleclick.net. However, the new iframe was
written with its src set to about:blank."
This seems eerily similar to the empty src issue. It seems too much of a coincidence, and currently I'm of the opinion that this is the problem.
[EDIT: This was a red herring]
However, I've no idea where to go from here. These multiple action calls are causing real problems (I'm having to use code blocking, serialised transactions, and all sorts of nasty hacks to limit adverse effects). Of course, I could be barking up the wrong tree entirely - I'm puzzled that I can't find any other references to this, given the ubiquity of AdSense, and the nature of the problem (but then again the conclusions of the AdSense JS Bug question are also surprising). I would love this to turn out to be a stupid mistake on my part, so I need a sanity check.
I'd like to ask the community:
Has anyone else experienced this problem, or can anyone who is using AdSense replicate and confirm it? [See note below]
Assuming the problem is what it seems, what can I do? (other than pulling AdSense of course)
If not, then what might be causing this?
To summarise:
- My actions are being executed 2 (sometimes 3) times per page request.
- THIS ONLY HAPPENS WHEN GOOGLE ADSENSE ADS ARE PRESENT
- I removed all AdSense JS and introduced an error into my own JS: actions are called only once...
- A similar problem can happen when empty src properties are present on the page
- An answer to a previous question summarises that the AdSense JS sets a src="about:blank" on an iframe
- I have come to the conclusion that the src="about:blank" from the AdSense code is the most likely source of the problem.
- If I disable JavaScript on the browser, the problem goes away
Just to document the things I have ruled out:
This is happening across browsers: Chrome(12) Firefox(5) and IE(8).
I have disabled all plugins on browsers (YSlow, Firebug etc...)
There are no empty src (src=""/src="#") for images, or other external resources in the html in my code
There are no empty url references in the css ( url('') )
It's unlikely to be a server-side code/config problem, as it doesn't happen in Dev (and one of the few differences between Dev and Production is the absence of AdSense JS in Dev)
Note: For anyone looking to replicate this, it should be noted that, strangely, when the multiple action calls happen Fiddler shows only one request being sent to the server. I have no idea why this should be the case, but the server logging doesn't lie :) Perhaps someone who has prior experience with this problem when caused by empty src attributes in img tags can say whether they have seen the same behaviour with Fiddler.
Requested extra information
HTML (@Ivan)
Here's how I'm implementing the AdSense (ids removed)
<%@ Control Language="C#" Inherits="System.Web.Mvc.ViewUserControl" %>
<div class="ad">
<%if (!HttpContext.Current.IsDebuggingEnabled) { %>
<script type="text/javascript"><!--
google_ad_client = "ca-pub-xxxxxxxxxxxxxxx";
/* xxxxxxxxxxxxxxx */
google_ad_slot = "xxxxxxxxx";
google_ad_width = 728;
google_ad_height = 15;
//-->
</script>
<script type="text/javascript" src="http://pagead2.googlesyndication.com/pagead/show_ads.js">
</script>
<%} else { %>
<img src="/Content/images/googleAdMock728x15_4_e.gif" width="728" height="15" />
<%} %>
</div>
This is being inserted by a RenderPartial in the View:
<% Html.RenderPartial("AdSense_XXXXXX"); %>
TCP Logging (@Tomas)
So far I have done a wireshark capture:
on client when requesting page on production with problem
on client when requesting page on production without problem (i.e. Adsense Removed)
I can't really see a significant difference between the two (although my network skills are not great). One thing to note is that they both seem to have a TCP retransmission of the HTTP request immediately after the initial request - I don't know the significance of that. I can confirm though that in case 1 the server logs reported 2 executions, and in case 2 only one execution.
Next I will try TCP logging on the server side in both cases, and post results here.

Mediabot is the name given to the web crawler that Google uses to crawl webpages for purposes of analysing the content so Google AdSense can serve contextually relevant advertising to the page.
In my experience, it is unpredictable and, yes, it can be pretty heavy and annoying.
If you don't want Mediapartner bot to access a specific page, you can disallow it in your robots.txt with:
#
# disallow adsense bot
#
User-agent: Mediapartners-Google
Disallow: path to your specific page
This has the drawback that untargeted ads will be served on that specific page.
If you are always seeing this pattern on the same page with different query strings, adding a rel="canonical" link could ease the pain.
If you can't resolve this issue, and you see it as abuse, don't hesitate to ask for help in the Crawling, Indexing and Ranking section of the Google support forums.

Given that the behaviour you are observing appears to be hard to avoid, can we focus on workarounds instead?
Can you differentiate requests based on the User-Agent, and filter out the crawler's requests that way?
Could that be a viable approach for you?
If so, you could probably base it on this approach: http://blog.flipbit.co.uk/2009/07/writing-iphone-sites-with-aspnet-mvc.html
There they detect iPhones, but the concept is the same for the Mediapartners-Google bot.
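The detection itself is a substring check on the User-Agent header. The sketch below shows the idea in JavaScript for illustration only (the application in the question is ASP.NET MVC, where the same check would be made against the request's user agent string). `isAdSenseCrawler` and `handleRequest` are hypothetical names; the `Mediapartners-Google` token is the one visible in the IIS logs above.

```javascript
// Returns true when the User-Agent string identifies the AdSense crawler.
function isAdSenseCrawler(userAgent) {
  return typeof userAgent === 'string' &&
    userAgent.indexOf('Mediapartners-Google') !== -1;
}

// Hypothetical wrapper: render the page for everyone, but only run the
// side-effecting work (counters, transactions, ...) for real visitors.
function handleRequest(userAgent, sideEffect) {
  if (isAdSenseCrawler(userAgent)) {
    return 'skipped';
  }
  sideEffect();
  return 'processed';
}
```

Keep in mind that a User-Agent header can be set to anything by the client, so this is suitable for skipping side effects, not for access control.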

Aside from the embedding of the AdSense code itself, there are two things related to AdSense that differ in your two test cases:
What else happens when !HttpContext.Current.IsDebuggingEnabled? This appears to be the de-facto production flag; maybe there is some other nuance somewhere that is happening that depends on this same flag.
Is it possible that Html.RenderPartial("AdSense_XXXXXX") is somehow causing your Controller to jump back to the beginning of its execution?
From your description, it seems like the execution is happening twice on the server but only one request is being sent from the client. This implies a server error, and these two lines are the crux of your AdSense triggering. To further narrow it down, try embedding the AdSense partial directly instead of calling Html.RenderPartial(). If that doesn't change the result, it might be worth a sanity check on what else switches on HttpContext.Current.IsDebuggingEnabled.
Failing that, it might be helpful to know whether your server-side logging takes place as the request is received, before the response is sent, or after the response is sent.

Yes, I just detected this during a TeamView session with my partner. On my box my main page ONLY for my site loads once per request.
Then by coincidence while using Fiddler my partner is getting 4 requests to the sample page. It is a 1.5 MB page with big scripts and lotsa other dependencies so this was truly a WTF moment as I have never seen anything like this in 15 years of web development.
If google is doing this I must say they should realize today's sites might have very big pages and very big audiences. That could mean they are jacking bandwidth by a factor of 4 per request. Like I said, WTF?????
I wish this Q&A had a more definitive resolution.
I do use the Google Translate widget, but this is only occurring on his box and for the main page. The other pages also use the translate widget, and I do request jQuery via the Google CDN. Could anything from Google be doing this?

Google Analytics: _trackPageview without page refresh?

I'm reworking some site tracking for a site I'm working with. For the tracking we are currently using Google Analytics, which seems to be working fairly well. However, I'm having some troubles resembling the ones in this question, but it's old and no one answered, so I'm bumping a bit here. :)
Basically, I'm tracking two kinds of things. Raw pageviews (entering a page), and events on the page (lightbox opened, something important clicked, etc). I'm using _trackPageview for both kinds of events, because I need to be able to track some lightbox flows in GA's goal funnel tracking, and as I understand it _trackEvent calls can't be tracked in goal funnels.
The problem here is that it seems like the way GA works, it doesn't really post its data instantly (firebug doesn't show any requests happening, at least), but defers it to a page refresh or something like that. I'm not totally sure what happens, but basically I'm getting all events up to the first one leading to a page refresh all shuffled up in the funnel and looking like they all happened as an exit from the event causing the refresh. (Did that make sense? :) Is there any way of forcing GA to "flush" an event when it happens and not defer it? Or am I using things totally wrong?
EDIT: I was a bit blind reading the firebug logs... It does actually do the request to __utm.gif with the correct data. Makes the funnel being weird even more strange though, so the basic question is still valid.
Thanks

I made a function for this. We wanted to track how many people click on each one of a few links we have, so we "track pageviews" for it.
function trackPV(trackerCode, url)
{
  var tracker = _gat._getTracker(trackerCode);
  if (url)
  {
    // Track the given URL as a pageview.
    tracker._trackPageview(url);
  }
  else
  {
    // No URL given: track the page you are currently on.
    tracker._trackPageview();
  }
}
Basically, you pass in your tracker code (UA-XXXXX) and, if you'd like to, a URL such as "http://www.example.com/link1". By default it just tracks the page you are on.
Hope this helps.
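For the lightbox steps in the question, the same idea is usually expressed as "virtual pageviews": each step is reported under its own made-up path so it can be used as a funnel step. A minimal sketch follows, assuming the asynchronous ga.js snippet (which defines the global `_gaq` command queue) is what the site uses; `trackVirtualPageview` and the `/virtual/...` path are hypothetical.

```javascript
// The asynchronous ga.js snippet defines this queue; commands pushed onto it
// are executed once ga.js has loaded.
var _gaq = _gaq || [];

// Hypothetical helper: queues a pageview for a virtual path.
function trackVirtualPageview(queue, virtualPath) {
  // The path passed to _trackPageview is expected to start with a slash.
  if (virtualPath.charAt(0) !== '/') {
    virtualPath = '/' + virtualPath;
  }
  queue.push(['_trackPageview', virtualPath]);
  return virtualPath;
}

// Example: report that the lightbox was opened.
trackVirtualPageview(_gaq, '/virtual/lightbox/opened');
```

The funnel steps in the goal configuration would then be defined against these virtual paths.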

I believe each call to _trackPageview will submit a unique request to Google Analytics (via parameters to the __utm.gif object). Google Analytics is pretty tough to debug since there is such a lag between the time you send your data and the time it is actually visible online. Typically, you will have to wait 4+ hours before your data shows up, so maybe you just need to wait to confirm that your code is working.

Hmmm... I really only have experience with the old GA, but it seems to me that your best course of action is decoding the utm.gif request and seeing if it contains incorrect information. Here's a list of debugging tools that Google recommends.

Use "event tracking". At least check it out in the Google Analytics help.

Partial Javascript Statements Logged To Server

I have some code that generates URLs to be used in various places across a site (image src, link hrefs, etc). I am seeing lines in the access logs which show some of the javascript code that generates the URLs masquerading as a file request.
For example, "/this.getIconSrc()" is one that I'm seeing quite a bit. I can't figure out how or why this is occurring and I can't manage to reproduce it without actually entering "http://whateverthesiteis.com/this.getIconSrc()" into the location bar. In most cases, these functions are chained together to generate a URL but the whole function chain does not appear in the server logs, just part of it.
I've probably invested around 30 hours trying to figure out why this is happening but cannot. It doesn't appear to be a browser issue as I've tried in IE 6/7, FF 2/3, Opera, Safari 3, and the problem does not occur. Has anyone else experienced something similar and, if so, what was the solution?

There are three possibilities really:
A bug in your HTML - malformed HTML causing onclick to leak into href, for example
A bug in your JavaScript - myIcon.src = 'this.getIconSrc()'; - note the quotes that shouldn't be there
A poorly-written spider is hitting your site (like @Diodeus said: ___)
Edit:
Check the User Agent and Referrer in your logs - they may offer a clue.
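To see why the second possibility produces exactly the log entry from the question, you can resolve the stray string the way a browser resolves any relative URL. This sketch uses the standard `URL` class (available as a global in Node.js and modern browsers); `resolveLikeBrowser` and the `index.html` page address are hypothetical.

```javascript
// Resolves a src/href value against the address of the page it appears on
// and returns the path the server would be asked for.
function resolveLikeBrowser(srcValue, pageUrl) {
  return new URL(srcValue, pageUrl).pathname;
}

var page = 'http://whateverthesiteis.com/index.html';

// The code was quoted by mistake, so the string itself becomes the URL.
console.log(resolveLikeBrowser('this.getIconSrc()', page));
// prints "/this.getIconSrc()"
```

The same resolution happens for a crawler that naively extracts anything that looks like a URL from inline script, which is why the user agent and referrer are worth checking.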

Are you generating JavaScript calls like this? This may explain it.
___

@RoBorg... I'm thinking the most likely scenario is #3 since this particular function is actually only called in one place...
function whatever() {
  var src = this.getIconSrc();
  return src.replace( /((?:https?:\/\/)?(?:[^\/]+\/)*)[^\/]+/, '$1newimage.png' );
}
