Screen-scrape paginated data - javascript

I'm trying to grab a list of all of the available stores returned from the search at this website.
https://www.metropcs.com/find-store.html.html
The issue is that it returns only 4 or 5 at a time and has no 'See All' option. I tried Postman in Chrome and AutoPager in Firefox to see if I could somehow see all of the data in the background, but I wasn't able to. I also researched JSON interception tools, as I believe the site returns JSON, but I couldn't find any of the actual data that I needed.
In the past I was able to hit 'print preview' and grab the list that way (then I just copy-pasted to Excel and ran some custom macros to strip the data I need) but the printer-friendly version is gone now as well.
Any ideas on tools that would allow me to export all of the stores found, especially for larger return sets?

You want to manipulate this request:
https://www.metropcs.com/apps/mpcs/servlet/genericservlet
You'll notice the page sends this (among other things) as the request to that URL:
inputReqParam=
{"serviceProviderName":"Hbase","expectedParams":
{"Corporate Stores":...Truncated for clarity...},
"requestParams":
{"do":"json",
"minLatitude":"39.89234063913044",
"minLongitude":"-74.85258152641507",
"maxLongitude":"-74.96578907358492",
"maxLatitude":"39.979297160869564"
},
"serviceName":"metroPCSStoreLocator"}
You'll need to manipulate the lat and long bounding box to encompass the area you want. (The entire US is something like [-124.848974, 24.396308] to [-66.885444, 49.384358] )
In your favorite browser it should be easy enough to tweak the request to get a JSON response with what you require.
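As a rough sketch of what "tweaking the request" could look like in code, the helper below rebuilds the inputReqParam value for an arbitrary bounding box. The function names are made up for illustration, expectedParams is passed in untouched because it is truncated above (copy it from a request captured in your browser), and form encoding of the body is an assumption to verify against the real request.

```javascript
// Build the JSON string the page sends as inputReqParam.
// expectedParams must be copied from a captured request; it is not reconstructed here.
function buildStoreLocatorParam(box, expectedParams) {
  return JSON.stringify({
    serviceProviderName: "Hbase",
    expectedParams: expectedParams,
    requestParams: {
      do: "json",
      // The captured request sends the coordinates as strings.
      minLatitude: String(box.minLatitude),
      minLongitude: String(box.minLongitude),
      maxLongitude: String(box.maxLongitude),
      maxLatitude: String(box.maxLatitude),
    },
    serviceName: "metroPCSStoreLocator",
  });
}

// Assumption: the parameter is sent as a form-encoded body.
function buildFormBody(box, expectedParams) {
  const body = new URLSearchParams();
  body.set("inputReqParam", buildStoreLocatorParam(box, expectedParams));
  return body.toString();
}
```

Keep the min/max ordering exactly as your own captured request has it; a very large box may also be rejected or capped by the server, so splitting the area into smaller boxes might be necessary.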

Related

dynamically generate content for a page when clicking on product

Hi everyone. I am making a website with t-shirts. I dynamically generate preview cards for products using a JSON file, but I also need to generate content for an HTML file when a card is clicked. So, when I click on it, a new HTML page opens, like product.html?product_id=id. I do not understand how to check for the id (the ?product_id=id part) and generate content for the page based on it. Can anyone please link some guides or good solutions? I don't understand anything :(.
It sounds like you want the user's browser to ask the server to load a particular page based on the value of a variable called product_id.
The way a browser talks to a server is an HTTP request, about which you can learn all the basics on javascript.info and/or MDN.
The ?product_id=id is called the 'query' part of the URL, about which you can learn more on MDN and Wikipedia.
A request that gets a page with this kind of URL from the server is usually a GET request, which is simpler and requires less security than the more versatile POST request type.
You may notice some of the resources talking about AJAX requests (which are used to update part of the current page without reloading the whole thing), but you won't need to worry about this since you're just trying to have the browser navigate to a new page.
Your server needs to have some code to handle any such requests, basically saying:
"If anybody sends an HTTP GET request here, look at the value of the product_id variable and compare it to my available HTML files. If there's a match, send a response with the matching file, and if there's no match, send a page that says 'Error 404'."
That's the quick overview anyway. The resources will tell you much more about the details.
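The server-side rule described above might be sketched like this. It is only an illustration: the productPages catalogue, its contents, and the function name are hypothetical, and a real server would wire this into whatever framework it uses.

```javascript
// Hypothetical catalogue mapping product IDs to page content.
const productPages = {
  "1": "<h1>Plain white tee</h1>",
  "2": "<h1>Striped tee</h1>",
};

// Decide what to send back for a request such as "/product.html?product_id=2".
function handleProductRequest(method, requestUrl, pages) {
  if (method !== "GET") {
    return { status: 405, body: "Method Not Allowed" };
  }
  // Servers usually see only the path and query, so a base is supplied for parsing.
  const url = new URL(requestUrl, "http://localhost");
  const id = url.searchParams.get("product_id");
  if (id !== null && Object.prototype.hasOwnProperty.call(pages, id)) {
    return { status: 200, body: pages[id] };
  }
  return { status: 404, body: "Error 404" };
}
```

For example, handleProductRequest("GET", "/product.html?product_id=2", productPages) yields the striped tee page, while an unknown ID falls through to the 404 response.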
There are several ways to get the parameters from the URL:
Get ID from URL with jQuery
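Without jQuery, one possible approach is the built-in URL API, which is available in current browsers and in Node. The function name and the example.com address below are placeholders; in a page you would pass window.location.href.

```javascript
// Read the product_id query parameter from a full URL string.
// Returns null when the parameter is absent.
function getProductId(href) {
  return new URL(href).searchParams.get("product_id");
}
```

In product.html this could be called as getProductId(window.location.href), and the result used to look up the matching entry in the JSON file that already drives the preview cards.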
It would also make sense to understand what a REST API is and how to build your own, because I think you don't have a backend at the moment.
Here are some references:
https://www.conceptatech.com/blog/difference-front-end-back-end-development
https://www.tutorialspoint.com/nodejs/nodejs_restful_api.htm

I am using cheerio to grab stats information from https://www.nba.com/players/langston/galloway/204038 but I can't get the table data to show up

[the information i want to access][1]
[1]: https://i.stack.imgur.com/4SpCU.png
No matter what I do, I just can't access the table of stats. I suspect it has to do with there being multiple tables, but I am not sure.
var cheerio = require("cheerio");
var axios = require("axios");
axios
  .get("https://www.nba.com/players/langston/galloway/204038")
  .then(function (response) {
    var $ = cheerio.load(response.data);
    console.log(
      $("player-detail").find("section.nba-player-stats-traditional").find("td:nth-child(3)").text()
    );
  });
The actual HTML returned from your GET request doesn't contain the data or a table. When your browser loads the page, a script is executed that pulls the data using API calls and creates most of the elements on the page.
If you open the chrome developer tools (CTRL+SHIFT+J) and switch to the network tab and reload the page you can see all of the requests taking place. The first one is the html that is downloaded in your axios GET request. If you click on that you can see the HTML is very basic compared to what you see when you inspect the page.
If you click on 'XHR' that will show most of the API calls that are made to get data. There's an interesting one for '204038_profile.json'. If you click on that you can see the information I think you want in JSON format which is much easier to use without parsing an html table. You can right-click on '204038_profile.json' and copy the full url:
https://data.nba.net/prod/v1/2019/players/204038_profile.json
NOTE: Most websites will not like you using their data like this, you might want to check what their policy is. They could make it more difficult to access the data or change the urls at any time.
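A minimal sketch of requesting that JSON directly is shown below. It uses the built-in fetch (Node 18+ or a browser) rather than axios, and it assumes the season and player ID slot into the URL the same way as in the one copied above, which is a guess from a single example. The function names are illustrative, and the shape of the returned JSON is not assumed here.

```javascript
// Assumption: season and player ID follow the pattern of the captured URL.
function profileUrl(season, playerId) {
  return `https://data.nba.net/prod/v1/${season}/players/${playerId}_profile.json`;
}

// fetchImpl defaults to the global fetch and can be swapped out for testing.
async function fetchProfile(season, playerId, fetchImpl = fetch) {
  const response = await fetchImpl(profileUrl(season, playerId));
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  // Parsed JSON; inspect it in dev tools to find the fields you need.
  return response.json();
}
```

Calling fetchProfile(2019, 204038).then(console.log) would print whatever the endpoint returns, provided the URL is still served.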
You might want to check out this question or this one about how to load the page and run the JavaScript to simulate a browser.
The second one is particularly interesting and has an answer explaining how you can intercept and mutate requests from puppeteer.

web-scraping a strange html setup with Python-BeautifulSoup & urllib

The problem is not really extracting the data, but locating it. I am scraping for football data. This site lays it out in total (all years) or by year (season); however, the data contained in the HTML is the all-time data, not the data for the season you select, even though the site displays the season statistics. Interestingly, when you load data for a season, the page first loads and briefly displays the all-time value of that variable.
For example, within the "td" tags on line 983 of the HTML source for this site, it says 515 (Chelsea's all-time wins) when I'm viewing the page for Chelsea's wins that season, which should be 26.
Can anyone explain this witchcraft and how to scrape data by season?
Looks like when you select a season, they pull from an API that returns the data in JSON format. This makes your job a lot easier because JSON is easier to parse than HTML.
You can see the requests and responses in Chrome web dev tools:
Press F12 when looking at the page in Chrome.
Go to the Network tab.
Click the Filter icon, then click XHR.
When you choose a season you should see an XHR request to footballapi.pulselive.com.
For example https://footballapi.pulselive.com/football/stats/ranked/teams/wins?page=0&pageSize=20&compSeasons=42&comps=1&altIds=true
Click on that URL in the dev tools and to the right, click the Preview tab to see the response formatted nicely.
I think you'll be able to mimic these requests in your program. You may need to send some of the same request headers because it appears they block it if you try to hit the API directly in the browser.
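To illustrate mimicking these requests, here is a sketch that rebuilds the example URL from its parts and sends it with whatever headers you copy from dev tools. It is written in JavaScript to match the other examples on this page; the same URL can be requested from Python with urllib. The function names are hypothetical, treating the last path segment as a swappable statistic and compSeasons as the season identifier are assumptions based on the single URL above, and no particular header is assumed to be the required one.

```javascript
// Rebuild the stats URL seen in dev tools from its individual parameters.
function buildStatsUrl({ stat = "wins", page = 0, pageSize = 20, compSeasons, comps = 1 }) {
  const url = new URL(`https://footballapi.pulselive.com/football/stats/ranked/teams/${stat}`);
  url.searchParams.set("page", String(page));
  url.searchParams.set("pageSize", String(pageSize));
  url.searchParams.set("compSeasons", String(compSeasons));
  url.searchParams.set("comps", String(comps));
  url.searchParams.set("altIds", "true");
  return url.toString();
}

// headers: copy these from the request shown in the Network tab.
// fetchImpl defaults to the global fetch (Node 18+) and can be replaced for testing.
async function fetchStats(options, headers = {}, fetchImpl = fetch) {
  const response = await fetchImpl(buildStatsUrl(options), { headers });
  if (!response.ok) {
    throw new Error(`Request failed with status ${response.status}`);
  }
  return response.json();
}
```

Incrementing page while the response still contains entries would walk through a result set larger than pageSize.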

Javascript: get data displayed from another url

I was wondering if I could get some data from another website to get it displayed on mine. The good example can be alexa.com. I need to display Alexa traffic rank and reputation in a div for example on my page, so it will be changed dynamically each time Alexa change its data.
Thank you for your help.
One way is to make an AJAX request for the Alexa.com site; once you receive all the HTML, you can use jQuery or something similar to scrape it for the div you want.
It feels kind of dirty, but it's an easy way to get what you want. This assumes their page content isn't loaded dynamically, though.
Edit: See this for more info: Request external website data using jQuery ajax
Use Yahoo YQL (instead of a server-side proxy script, e.g. in PHP).
I have a sneaking suspicion you do not own or control the external site, so getting content from it falls under the cross-domain security restrictions of a modern browser.
So in order to regain 'power to the user', just use http://query.yahooapis.com/.
jQuery would not be strictly needed.
EXAMPLE 1:
Using the SQL-like command:
select title from html
where url="http://stackoverflow.com"
and xpath='//div/h3/a'
The following link will scrape SO for the newest questions (bypassing cross-domain security bull$#!7):
http://query.yahooapis.com/v1/public/yql?q=select%20title%20from%20html%20where%20url%3D%22http%3A%2F%2Fstackoverflow.com%22%20and%0A%20%20%20%20%20%20xpath%3D%27%2F%2Fdiv%2Fh3%2Fa%27%0A%20%20%20%20&format=json&callback=cbfunc
As you can see, this returns a JSON array (one can also choose XML) and passes it to the callback function cbfunc.
Indeed, as a 'bonus' you also save a kitten every time you did not need to regex data out of 'tag-soup'.
Do you hear your little mad scientist inside yourself starting to giggle?
Then see this answer for more info (and don't forget its comments for more examples).
Good Luck!
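The long encoded link in EXAMPLE 1 does not have to be written by hand. As a sketch, assuming the endpoint is reachable, the URL can be assembled from the readable query with the built-in URL API; the function name is made up for illustration, and the exact percent-encoding may differ slightly from the link above while decoding to the same query.

```javascript
// Assemble a YQL request URL from a readable query and a JSONP callback name.
function buildYqlUrl(query, callbackName) {
  const url = new URL("http://query.yahooapis.com/v1/public/yql");
  url.searchParams.set("q", query); // encoded automatically
  url.searchParams.set("format", "json");
  url.searchParams.set("callback", callbackName);
  return url.toString();
}

const exampleQuery =
  "select title from html where url=\"http://stackoverflow.com\" and xpath='//div/h3/a'";
const exampleUrl = buildYqlUrl(exampleQuery, "cbfunc");
```

The resulting exampleUrl can be used as the src of a script tag, with a global cbfunc defined to receive the data.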

Any way to change the "Name" or "Initiator", or add new columns, in Chrome Dev Tools' Network view?

I was wondering if there is any way to change the Name or Initiator columns in the Network tab in Chrome's Dev Tools.
The issue is that I'm currently making a web app that makes tons of POST calls using jQuery. That's all fine and dandy; however, when I have 10+ calls, the Network tab gets flooded with POST calls.
All calls are to the same PHP script, so the Name column is the same for every entry. Also, since I'm using jQuery, the initiator is set to jQuery. I was wondering if there is any way to customize this view so that I know which script is making the POST without having to open each call and look at its properties.
It would even be nice to see a truncated version of the values sent right in the list view. That way I could look at each call and know exactly which function or script made it, or at least have a better idea, rather than seeing 10+ entries of Name: "xxx.php".
You can add custom columns that show the values of response headers by right-clicking the table header and selecting Response Headers > Manage Header Columns.
You can also hide columns via this right-click menu.
You can also add a query to the url you are posting to, with information about what function you are calling.
Example:
If you are posting to https://myserver.com/api it is the last part api that will be displayed as the name in the network tab.
So you can extend that url with https://myserver.com/api?whatever and you will see that in the network tab name. The back end server can and will just ignore that extra query in the url.
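A small helper can add such a label consistently. This is just a sketch: the tagUrl name and the caller parameter are made up, and any query string the server ignores would do.

```javascript
// Append a label to the endpoint so each call is distinguishable in the Network tab.
function tagUrl(endpoint, label) {
  const url = new URL(endpoint);
  url.searchParams.set("caller", label); // "caller" is an arbitrary, hypothetical name
  return url.toString();
}

// With jQuery loaded, a call could then look like:
//   $.post(tagUrl("https://myserver.com/api", "saveUser"), data);
```

The Name column would then show something like api?caller=saveUser instead of a row of identical entries.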
