web-scraping a strange html setup with Python-BeautifulSoup & urllib - javascript

The problem is not really extracting the data, but locating it. I am scraping for football data. The site lays it out by total (all years) or by year (season), but the data contained in the HTML is the all-time data, not the season you select, even though the site displays the season statistics. Interestingly, when you load data for a season, the page first loads and briefly displays the all-time data for that variable.
For example: within the "td" tags on line 983 of the HTML source for this site, it says 515 (Chelsea's all-time wins) when I'm viewing the page for Chelsea's wins that season, which should be 26.
Can anyone explain this witchcraft and how to scrape data by season?

Looks like when you select a season, they pull from an API that returns the data in JSON format. This makes your job a lot easier because JSON is easier to parse than HTML.
You can see the requests and responses in Chrome web dev tools:
Press F12 when looking at the page in Chrome.
Go to the Network tab.
Click the Filter icon, then click XHR.
When you choose a season you should see an XHR request to footballapi.pulselive.com.
For example https://footballapi.pulselive.com/football/stats/ranked/teams/wins?page=0&pageSize=20&compSeasons=42&comps=1&altIds=true
Click on that URL in the dev tools and to the right, click the Preview tab to see the response formatted nicely.
I think you'll be able to mimic these requests in your program. You may need to send some of the same request headers because it appears they block it if you try to hit the API directly in the browser.
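A minimal sketch of mimicking that request with the standard library (matching the question's urllib). The `Origin` header value and the meaning of the query parameters are assumptions read off the captured request in dev tools; the season id (42 here) is whatever value appears in `compSeasons` when you pick a season:

```python
import json
import urllib.parse
import urllib.request

BASE = "https://footballapi.pulselive.com/football/stats/ranked/teams/wins"

def build_url(season_id, page=0, page_size=20):
    """Rebuild the stats URL seen in the Network tab for one season."""
    query = urllib.parse.urlencode({
        "page": page,
        "pageSize": page_size,
        "compSeasons": season_id,  # season id from the captured request
        "comps": 1,
        "altIds": "true",
    })
    return f"{BASE}?{query}"

def fetch_season_wins(season_id):
    # The Origin header is copied from the browser request; the API
    # appears to reject requests that don't send it (an assumption).
    req = urllib.request.Request(
        build_url(season_id),
        headers={"Origin": "https://www.premierleague.com"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

print(build_url(42))
```

Inspect the JSON in the Preview tab first to learn the response's field names before writing the parsing code.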

Related

I am using cheerio to grab stats information from https://www.nba.com/players/langston/galloway/204038 but I can't get the table data to show up.

[the information i want to access][1]
[1]: https://i.stack.imgur.com/4SpCU.png
No matter what I do I just can't access the table of stats. I suspect it has to do with there being multiple tables, but I am not sure.
var cheerio = require("cheerio");
var axios = require("axios");

axios
  .get("https://www.nba.com/players/langston/galloway/204038")
  .then(function (response) {
    var $ = cheerio.load(response.data);
    console.log(
      $("player-detail")
        .find("section.nba-player-stats-traditional")
        .find("td:nth-child(3)")
        .text()
    );
  });
The actual HTML returned from your GET request doesn't contain the data or a table. When your browser loads the page, a script is executed that pulls the data using API calls and creates most of the elements on the page.
If you open the Chrome developer tools (CTRL+SHIFT+J), switch to the Network tab, and reload the page, you can see all of the requests taking place. The first one is the HTML that is downloaded by your axios GET request. If you click on it you can see the HTML is very basic compared to what you see when you inspect the page.
If you click on 'XHR', that will show most of the API calls made to get data. There's an interesting one for '204038_profile.json'. If you click on that you can see the information I think you want, in JSON format, which is much easier to use than parsing an HTML table. You can right-click on '204038_profile.json' and copy the full URL:
https://data.nba.net/prod/v1/2019/players/204038_profile.json
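Fetching that JSON directly might look like the sketch below (shown in Python rather than Node for brevity). The key path in `career_summary` is hypothetical; the real field names must be read off the Preview tab in dev tools:

```python
import json
import urllib.request

URL = "https://data.nba.net/prod/v1/2019/players/204038_profile.json"

def fetch_profile(url=URL):
    """Download the same profile JSON the page's own script requests."""
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def career_summary(profile):
    # Hypothetical key path -- check the real structure in the
    # Preview tab before relying on these names.
    return profile["league"]["standard"]["stats"]

# Usage: stats = career_summary(fetch_profile())
```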
NOTE: Most websites will not like you using their data like this, you might want to check what their policy is. They could make it more difficult to access the data or change the urls at any time.
You might want to check out this question or this one about how to load the page and run the javascript to simulate a browser.
The second one is particularly interesting and has an answer explaining how you can intercept and mutate requests from Puppeteer.

Screen-scrape paginated data

I'm trying to grab a list of all of the available stores returned from the search at this website.
https://www.metropcs.com/find-store.html.html
The issue is that it returns only 4 or 5 stores at a time and does not have a 'See All' option. I tried using Postman in Chrome and AutoPager in Firefox to see if I could somehow see all of the data in the background, but I wasn't able to. I also researched JSON interception tools, since I believe the site returns JSON, but I couldn't find any of the actual data I needed.
In the past I was able to hit 'print preview' and grab the list that way (then I just copy-pasted to Excel and ran some custom macros to strip the data I need) but the printer-friendly version is gone now as well.
Any ideas on tools that would allow me to export all of the stores found, especially for larger return sets?
You want to manipulate this request:
https://www.metropcs.com/apps/mpcs/servlet/genericservlet
You'll notice the page sends this (among other things) as the request to that URL:
inputReqParam=
{"serviceProviderName":"Hbase","expectedParams":
{"Corporate Stores":...Truncated for clarity...},
"requestParams":
{"do":"json",
"minLatitude":"39.89234063913044",
"minLongitude":"-74.85258152641507",
"maxLongitude":"-74.96578907358492",
"maxLatitude":"39.979297160869564"
},
"serviceName":"metroPCSStoreLocator"}
You'll need to manipulate the lat and long bounding box to encompass the area you want. (The entire US is something like [-124.848974, 24.396308] to [-66.885444, 49.384358] )
In your favorite browser it should be easy enough to tweak the request to get a JSON response with what you require.
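A sketch of rebuilding that request in Python. The `expectedParams` block (truncated above) would need to be copied verbatim from the captured request; only the bounding-box part is shown here:

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://www.metropcs.com/apps/mpcs/servlet/genericservlet"

def build_payload(min_lat, max_lat, min_lon, max_lon):
    """Recreate inputReqParam with a custom bounding box. The
    "expectedParams" object from the real request is omitted here and
    would need to be pasted in from dev tools."""
    return {
        "serviceProviderName": "Hbase",
        "requestParams": {
            "do": "json",
            "minLatitude": str(min_lat),
            "maxLatitude": str(max_lat),
            "minLongitude": str(min_lon),
            "maxLongitude": str(max_lon),
        },
        "serviceName": "metroPCSStoreLocator",
    }

# Roughly the whole continental US:
payload = build_payload(24.396308, 49.384358, -124.848974, -66.885444)
body = urllib.parse.urlencode({"inputReqParam": json.dumps(payload)}).encode()
# Then: urllib.request.urlopen(urllib.request.Request(ENDPOINT, data=body))
print(payload["requestParams"]["minLatitude"])
```

The server may cap the number of results per response regardless of the box size, in which case tiling the country into smaller boxes and merging the results would be the fallback.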

Using DOMXpath to extract JSON data

I have used PHP Simple HTML DOM with no success on this issue.
Now I have gone to DOMDocument and DOMXpath and this does seem promising.
Here is my issue:
I am trying to scrape data from a page which is loaded via a web service request after the page initially shows. It is only milliseconds but because of this, normal scraping shows a template value as opposed to the actual data.
I have found the endpoint url using chrome developer network settings. So if I enter that url into the browser address bar the data displays nicely in JSON format. All Good.
My problem arises because any time the site is re-visited or the page refreshed, the suffix of the endpoint url is randomly-generated so I can't hard-code this url into my php file. For example the end of the url is "?=253648592" on first visit but on refresh it could be "?=375482910". The base of the url is static.
Without getting into headless browsers (I tried and MY head hurts!) is there a way to have Xpath find this random url when the page loads?
Sorry for being so long-winded but I wanted to explain as best I could.
It's probably much easier and faster to just use a regex if you only need one item/value from the HTML. I would like to give an example, but for that I would need a longer snippet of the HTML that contains the endpoint you want to fetch.
Is it possible to give a snippet of the HTML that contains the endpoint?
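To illustrate the regex idea in general terms (Python here for brevity; PHP's preg_match works the same way): the markup below is entirely hypothetical, since the real page hasn't been shared. The trick is to anchor on the static base of the URL and capture the random suffix:

```python
import re

# Hypothetical markup -- the real page must be inspected; this just
# assumes the endpoint URL appears somewhere in a <script> block.
html = '<script>var feed = "https://example.com/api/stats?=253648592";</script>'

# Anchor on the static part of the URL, capture the random number.
match = re.search(r'https://example\.com/api/stats\?=(\d+)', html)
endpoint = match.group(0)  # the full URL, random suffix included
suffix = match.group(1)    # just the random number
print(endpoint)
```

Once the full URL is extracted on each visit, it can be fetched with a second request to get the JSON.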

google finance zoom in zoom out graph logic

I'm looking for logic behind zoom-able graph like google finance. I know there
are off the shelf components that just do that, but I am looking for a basic
example that explains the logic.
Whoever writes things like that basically has two choices.
1. Load a lot of data, and show only a little bit. When the user changes the zoom, use the data we weren't showing before. Basically, we load all of the data at page-load time so the JavaScript can use it later. This is easier to write, but slow; sometimes you have to load tons of data to do it.
2. Load only the data you need. When the user interacts with the page, make AJAX requests back to the server to load the new data you need.
2a. When you load new data, store everything you've loaded so far, so that you don't need to make more AJAX requests if the user returns to an older zoom setting.
1 + 2. Load only the data you need, then show the page. Then immediately load everything else, but don't show it until/unless they change the zoom settings.
Of these, 2 and 2a are likely the best choices, while #1 is the "get it done quicker" approach.
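The caching in option 2a amounts to keying fetched data by zoom level. A language-neutral sketch (shown in Python; `fetch_range` stands in for the real AJAX call):

```python
calls = {"count": 0}  # counts simulated server round-trips
_cache = {}           # zoom level -> data already fetched

def fetch_range(zoom):
    # Placeholder for the real AJAX/HTTP request to the server.
    calls["count"] += 1
    return [f"point-{zoom}-{i}" for i in range(3)]

def data_for_zoom(zoom):
    if zoom not in _cache:       # fetch only on the first visit
        _cache[zoom] = fetch_range(zoom)
    return _cache[zoom]

data_for_zoom("1y")
data_for_zoom("30d")
data_for_zoom("1y")              # served from the cache, no new fetch
```

After the three calls above, only two fetches have actually happened; returning to the "1y" zoom reuses the cached data.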
Google Chrome (and browsers based on Chromium) includes developer tools with a Network panel that lets you see what happens.
When you load a quote and then change the zoom, you will see a new data request. For example:
https://www.google.com/finance/getprices?q=AA&x=NYSE&i=1800&p=30d&f=d,c,v,o,h,l&df=cpct&auto=1&ts=1382233772497
It makes a new request for each "zoom level", which is necessary because the larger time windows (1 yr, 5 yr) show data at coarser granularity (1 day and 1 week respectively).
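The per-zoom request can be seen as just a parameter change on one base URL. A sketch of building it (note Google has since retired this endpoint, so this only illustrates the pattern; the parameter meanings are inferred from the captured URL):

```python
import urllib.parse

BASE = "https://www.google.com/finance/getprices"

def getprices_url(symbol, exchange, interval_sec, period):
    """Build the per-zoom data URL seen in the Network tab: wider time
    windows pair with a larger bar interval."""
    query = urllib.parse.urlencode({
        "q": symbol,
        "x": exchange,
        "i": interval_sec,   # bar width in seconds (1800 = 30 minutes)
        "p": period,         # time window, e.g. "30d"
        "f": "d,c,v,o,h,l",  # requested fields, as in the captured URL
    })
    return f"{BASE}?{query}"

print(getprices_url("AA", "NYSE", 1800, "30d"))
```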

Anyway to change the "Name", or "Initiator", or add new tabs in Chrome Dev Tool's Network view?

I was wondering if there was anyway to change the name or initiator columns in the Network Tab in Chrome's Dev Tools.
The issue is that I'm currently making a web app that makes tons of POST calls using jQuery. That's all fine and dandy; however, when I have 10+ calls, the Network tab obviously gets flooded with POST entries.
All calls are to the same PHP script, so the Name column is all the same. Also, since I'm using jQuery, the Initiator is set to jQuery. I was wondering if there was any way to customize this view so that I know which script made the POST without having to open each call and look at its properties.
It'd even be nice to see a truncated version of the values sent right in the list view. That way I could look at each call and know exactly what function or script triggered it, or at least have a better idea, rather than seeing 10+ entries named "xxx.php".
You can add custom columns that show the values of response headers by right-clicking on the table header and selecting Response Headers > Manage Header Columns.
You can also hide columns via this right-click menu.
You can also add a query to the url you are posting to, with information about what function you are calling.
Example:
If you are posting to https://myserver.com/api it is the last part api that will be displayed as the name in the network tab.
So you can extend that URL to https://myserver.com/api?whatever and you will see that in the Network tab name. The back-end server can and will just ignore the extra query string.
