Retrieving data from AJAX requests using Scrapy - javascript

So I have a product page where several different products are shown, but their details are generated only when I click on a product. I can see the dynamic content being loaded in the XHR tab of the Network console, with a single GET request for every item I click on.
Is there a way to select every product on that page and scrape specific data just as I would normally? I've scraped websites that don't use AJAX with ease, but I'm pretty stumped here. I was going to scrape the products in order, but when I looked at the request URL in the headers, I noticed that the item codes are not sequential: for example, the first item on the page has a product code of 46837 and the one right below it has 68392. Or should I just be using Selenium or Splash to capture these AJAX calls?
Thanks!

Related

Dynamically generate content for a page when clicking on a product

Hi everyone. I am making a t-shirt website. I dynamically generate preview cards for the products from a JSON file, but I also need to generate the content of an HTML page when a card is clicked. So, when I click on a card, a new HTML page opens like product.html?product_id=id. I do not understand how to read the id from the ?product_id=id part and generate the page's content based on it. Can anyone please link some guides or good solutions? I don't understand this at all :(.
It sounds like you want the user's browser to ask the server to load a particular page based on the value of a variable called product_id.
The way a browser talks to a server is an HTTP request, about which you can learn all the basics on javascript.info and/or MDN.
The ?product_id=id is called the 'query' part of the URL, about which you can learn more on MDN and Wikipedia.
A request that fetches a page with this kind of URL from the server is usually a GET request, which is simpler than the more versatile POST request type.
You may notice some of the resources talking about AJAX requests (which are used to update part of the current page without reloading the whole thing), but you won't need to worry about this since you're just trying to have the browser navigate to a new page.
Your server needs to have some code to handle any such requests, basically saying:
"If anybody sends an HTTP GET request here, look at the value of the product_id variable and compare it to my available HTML files. If there's a match, send a response with the matching file, and if there's no match, send a page that says 'Error 404'."
That's the quick overview anyway. The resources will tell you much more about the details.
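For the browser side of this (reading product_id out of the query string and building the page from it), here is a minimal sketch, assuming your product data lives in a file called products.json and each product object has id, name, and price fields (those names are placeholders):

// product.html: read the product_id from the query string,
// look it up in the JSON file, and fill in the page.
const params = new URLSearchParams(window.location.search);
const productId = params.get('product_id');

fetch('products.json')
  .then((response) => response.json())
  .then((products) => {
    const product = products.find((p) => p.id === productId);
    if (product) {
      document.querySelector('#product-name').textContent = product.name;
      document.querySelector('#product-price').textContent = product.price;
    } else {
      document.body.textContent = 'Error 404: product not found';
    }
  });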
There are some existing solutions for getting the parameters from the URL:
Get ID from URL with jQuery
It would also make sense to understand what a REST API is and how to build one yourself, because it sounds like you don't have a backend at the moment.
Here are some references:
https://www.conceptatech.com/blog/difference-front-end-back-end-development
https://www.tutorialspoint.com/nodejs/nodejs_restful_api.htm
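To make that advice concrete, here is a minimal sketch of such an endpoint using Express (one option among many; the route, data, and field names are placeholders):

const express = require('express');
const app = express();

// Placeholder data; in practice this would come from your JSON file
// or a database.
const products = [
  { id: '1', name: 'Plain white tee', price: 15 },
  { id: '2', name: 'Graphic tee', price: 20 },
];

// GET /api/products/1 returns the matching product as JSON, or a 404,
// much as the first answer describes.
app.get('/api/products/:id', (req, res) => {
  const product = products.find((p) => p.id === req.params.id);
  if (product) {
    res.json(product);
  } else {
    res.status(404).json({ error: 'Product not found' });
  }
});

app.listen(3000);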

I am using cheerio to grab stats information from https://www.nba.com/players/langston/galloway/204038 but I can't get the table data to show up

Here is a screenshot of the information I want to access: https://i.stack.imgur.com/4SpCU.png
No matter what I do, I just can't access the table of stats. I suspect it has to do with there being multiple tables, but I am not sure.
var cheerio = require("cheerio");
var axios = require("axios");

axios
  .get("https://www.nba.com/players/langston/galloway/204038")
  .then(function (response) {
    var $ = cheerio.load(response.data);
    console.log(
      $("player-detail")
        .find("section.nba-player-stats-traditional")
        .find("td:nth-child(3)")
        .text()
    );
  });
The actual HTML returned from your GET request doesn't contain the data or a table. When your browser loads the page, a script is executed that pulls the data in via API calls and creates most of the elements on the page.
If you open the chrome developer tools (CTRL+SHIFT+J) and switch to the network tab and reload the page you can see all of the requests taking place. The first one is the html that is downloaded in your axios GET request. If you click on that you can see the HTML is very basic compared to what you see when you inspect the page.
If you click on 'XHR', that will show most of the API calls that are made to get data. There's an interesting one for '204038_profile.json'. If you click on that, you can see the information I think you want in JSON format, which is much easier to use than parsing an HTML table. You can right-click on '204038_profile.json' and copy the full URL:
https://data.nba.net/prod/v1/2019/players/204038_profile.json
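For example, your axios code could hit that endpoint directly (the response's exact shape isn't documented, so log it once and drill into the fields you need):

var axios = require("axios");

axios
  .get("https://data.nba.net/prod/v1/2019/players/204038_profile.json")
  .then(function (response) {
    // response.data is already parsed JSON; print it once to
    // find the stats you're after.
    console.log(JSON.stringify(response.data, null, 2));
  });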
NOTE: Most websites will not like you using their data like this; you might want to check what their policy is. They could make the data harder to access or change the URLs at any time.
You might want to check out this question or this one about how to load the page and run the JavaScript to simulate a browser.
The second one is particularly interesting and has an answer explaining how you can intercept and mutate requests with Puppeteer.
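As a rough illustration of that approach (a sketch, not that answer's exact code), Puppeteer lets you watch and filter a page's requests while the page's own JavaScript runs:

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();

  // Intercept requests: drop heavy resources, let the rest through.
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    if (['image', 'font'].includes(request.resourceType())) {
      request.abort();
    } else {
      request.continue();
    }
  });

  // Log JSON responses so the data endpoints reveal themselves.
  page.on('response', (response) => {
    if (response.url().endsWith('.json')) {
      console.log(response.url());
    }
  });

  await page.goto('https://www.nba.com/players/langston/galloway/204038');
  await browser.close();
})();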

How to get items from cart at homepage when there is NO API

So I've added items to a cart and I need to create a popup on the homepage showing the items in the cart once the user reaches a specific place on the page. The problem is that from the homepage you cannot access the items in that cart.
(BTW, I have to do this in the console of the website.)
I could not find an API that would let me retrieve a JSON object, so I figured I could fetch "website.com/cart", which gives me an HTML string in return. But this is where I'm having problems.
How can I properly use jQuery to scrape that HTML string for the items? The HTML string is also very long; it isn't efficient to append the whole string to the page...
Or is there a better way to go about this?
What I understood is that the list of items in the cart must have been sent to the server, since you've thought about querying website.com/cart. If so, make an API in your backend that you can query from your application using $.post or $.get, and retrieve the cart details in JSON format that you can work with.
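For instance (a sketch assuming you add such a backend endpoint; the /api/cart path and field names are placeholders):

$.getJSON('/api/cart', function (cart) {
  // The server returns the cart as JSON, so no HTML parsing is needed.
  cart.items.forEach(function (item) {
    console.log(item.name, item.quantity);
  });
});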
You don't have to inject the entire HTML string into the page to use jQuery on it. If you do:
const page = $(html);
...that will parse the HTML into DOM elements that are detached from the page, and you can query child elements:
const items = page.find('.cart-item');
But in case you haven't fully ruled it out, check to see if the cart page makes any XHR/fetch requests when it loads; maybe there is an endpoint you can call to get the data directly so you can avoid the "screen scraping" approach.
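Putting the pieces together, a sketch of the whole scraping flow (the '.cart-item' and '.item-name' selectors are placeholders; use whatever classes the real cart markup has):

fetch('https://website.com/cart')
  .then(function (response) { return response.text(); })
  .then(function (html) {
    // Parse the HTML detached from the page, then pull out the items.
    const page = $(html);
    const items = page.find('.cart-item').map(function () {
      return $(this).find('.item-name').text().trim();
    }).get();
    console.log(items);
  });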

How can a web page find what value to POST?

I'm currently making an application for a client to automatically fill some web forms on the website he uses to store his item pricing. The website doesn't have a documented public API, and there doesn't seem to be a way to add bulk pricing on the website itself. In order to accomplish this, I'm making a simple python application that reads his data, then sends POSTs to the website.
Their website is giving me a hard time, however, because it sends payloads containing dozens of fields, while the form used to enter the pricing information only has 4 input fields. On top of that, their website uses AngularJS to generate most of the page, so I can't just find the <form>[...]</form> block and look at what's being sent, because that's not what they use.
Here is what the payload JSON looks like:
{
  "entities": [
    {
      "Price_Line_ID": "{}",
      "Price_List_ID": "{}",
      "Item_ID": "{}",
      "Uofm_ID": "{}",
      "Amount": "{}",
      "Dtstamp": "{}",
      "Tenant_ID": "{}",
      "Created_On": "null",
      "Created_By": "null",
      "Changed_On": "null",
      "Changed_By": "null",
      "Seq": "0",
      "Begin_Qty": "0",
      "End_Qty": "0",
      "Customer_ID": "null",
      "Tax_Before_Discount": "false",
      "Discount_Target": "All",
      "Max_Discount_Amount": "null",
      "Min_Discount_Amount": "null",
      "Customer_Name": "null",
      "Uofm": "null",
      "Item_Number": "null",
      "Uofm_Schedule_ID": "null",
      "Uofm_Schedule": "null",
      "Inactive": "false",
      "entityAspect": {
        "entityTypeName": "PriceLine:#SalesPad.Spo.Api.Model",
        "defaultResourceName": "PriceLines",
        "entityState": "Added",
        "originalValuesMap": {},
        "autoGeneratedKey": {
          "propertyName": "Price_Line_ID",
          "autoGeneratedKeyType": "Identity"
        }
      }
    }
  ],
  "saveOptions": {}
}
The 7 fields at the top (with values of "{}") are values I found by watching GET and POST actions on the website's other pages. I've managed to find where all the values originate from except "Price_Line_ID", because it appears to change from page to page (and it changes again after a price is added).
I know a web page can get the data needed for a POST request either from its own HTML (when using tags like <form>) or from the responses to other GET and POST requests. Is there any other way for a web page to determine a value that will be sent in a POST request?
I'm not very familiar with AngularJS, although from what I understand it essentially generates the page with JavaScript. Does it offer other ways of determining what values are sent in a POST or GET request?
Edit: I've already tracked all the responses from GET and POST requests between logging in and adding a price; the Price_Line_ID field changes from page to page, and adding the price appears to use an ID different from the one received in the GET response. I just want to know the different ways a web page (specifically one using AngularJS) might determine the value of the data sent in POST requests.

How do I pre-cache a page before it redirects to it?

As seen on YouTube: when you click a YT link, before redirecting you it preloads the layout of the new page and then redirects you to it, so you never have to watch the layout elements load.
How do I do this with JavaScript, PHP, HTTP, jQuery, or any other language?
I don't think you really understand what is going on behind the scenes at YouTube. You see, the template/layout doesn't always change; the data does. Unless the new page has a different template (such as sending a message or viewing a user profile), it isn't loading the layout at all: it is fetching the data from a database and then using JavaScript to replace the current data on the page with the data it just fetched. It does this through AJAX. (They use Python on the back end.)
Basically, this is what happens when you click a link for a new video (see the sketch after the list):
1) You click the link.
2) Some JavaScript code makes an XMLHttpRequest to a script on the server which processes the request. A progress bar appears on the screen.
3) The script on the server connects to a database and grabs the information... like other videos in the playlist, comments, the video description, etc. It does this by submitting a query to the database.
4) The query returns the information to the script which in turn organizes it and returns it to the AJAX request (asynchronously, of course).
5) The JavaScript receives the information that it was waiting for and updates the HTML of the page. The JavaScript also does some other stuff behind the scenes, like update the URL and browsing history so that you can hit your "back" button and return to the previous page that you were on. (If the template for the newly requested page is different, the JavaScript will restructure the HTML of the page appropriately.)
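Here is a simplified sketch of that pattern, assuming a hypothetical /api/video endpoint that returns the new page's data as JSON (showProgressBar is an assumed helper, and the element IDs are placeholders):

async function navigateTo(videoId) {
  showProgressBar(); // assumed helper that displays the loading bar (step 2)

  // Steps 2-4: fetch only the data for the new page, not a whole page load.
  const response = await fetch(`/api/video?id=${videoId}`);
  const data = await response.json();

  // Step 5: swap the new data into the existing layout...
  document.querySelector('#title').textContent = data.title;
  document.querySelector('#description').textContent = data.description;

  // ...and update the URL/history so the back button still works.
  history.pushState({ videoId }, '', `/watch?v=${videoId}`);
}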
