How to extract data from upbit.com website - javascript

I am readdressing a question that I created a few months ago. I lost access to my account, but happened to stumble upon the question while I was searching around.
My original post was here: Converting JavaScript back to readable HTML in Python script. The problem I am experiencing is that I am not getting the full HTML markup back from the website when I try to web-scrape it. Upbit.com is protected by Cloudflare, so I am using a Python module called cfscrape to bypass it. The Cloudflare module works and gets me partial HTML markup when I output it to a variable, but it is not getting the nested HTML tags at all. The element I am trying to extract starts at a div tag with the id "root", and in the console that div only shows <...> between its opening and closing tags. I am still using the same code as before, so nothing has changed. My best guess now is to try to extract the cookie and maybe pass it into a Python curl request, but I am completely unsure how to do that, hence why I am reaching out to Stack. I am also totally willing to use other programming languages.
import cfscrape

scraper = cfscrape.create_scraper(delay=15)  # returns a CloudflareScraper instance
# Or: scraper = cfscrape.CloudflareScraper()  # CloudflareScraper inherits from requests.Session
print(scraper.get("https://upbit.com/service_center/notice").content)  # => "<!DOCTYPE html><html><head>..."
Edit 1: This is the data that I'm trying to extract. The information I'm looking for is in a table, and I want to retrieve each tag within that table since it contains the content shown on the webpage.
Edit 2: Okay, I figured out what data needs to be passed to bypass the Cloudflare authentication each time using the standard "requests" library in Python. The issue I am having now is that even this still does not return the nested tags. When I make a request it just gets the top-level "root" tag, but not the tags inside that div (as shown in my picture). I have never seen anything like this; typically when you do a GET request it returns all the HTML content on the webpage. Does anyone have any ideas why this would be happening? I'm convinced they are somehow hiding the information using JavaScript, but I don't understand JavaScript well enough to know what to look for when someone tries to obfuscate it.
import cfscrape
import requests
import time
request = "GET / HTTP/1.1\r\n"
scraper = cfscrape.create_scraper(delay=15)
cookie_value, user_agent = cfscrape.get_cookie_string("https://upbit.com/service_center/notice", user_agent='Mozilla/5.0')
request += "Cookie: %s\r\nUser-Agent: %s\r\n" % (cookie_value, user_agent)
#print request
temp = cookie_value.split('; __cfduid=')
cf_clearance = temp[0].split('cf_clearance=')
#print temp[1]
#print cf_clearance[1]
headers = {'User-Agent': 'Mozilla/5.0'}
cookies = {'cf_clearance': cf_clearance[1], '__cfduid':temp[1]}
r = requests.get("https://upbit.com/service_center/notice", cookies=cookies, headers=headers).content
print r
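Since the "root" div is filled in client-side (the site is a JavaScript single-page app), the nested tags never exist in the raw HTML the server sends; you need something that actually executes the page's JavaScript. One option, as a rough sketch in Node.js with puppeteer (whether headless Chrome clears the Cloudflare check for this particular site is not guaranteed):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  // Wait until the network goes quiet so the client-side app has rendered
  await page.goto('https://upbit.com/service_center/notice', { waitUntil: 'networkidle2' });
  const html = await page.content(); // the fully rendered markup, nested tags included
  console.log(html);
  await browser.close();
})();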

Related

Python web scraper of javascript Div table

I have a wholesale website (behind a login) from which I am trying to scrape inventory levels. I've created my Python script and it is giving a 200 response for the login.
I'm trying to figure out how to scrape the inventory. I'm 99% sure it is rendered by JavaScript, but even if it is, I don't know how to return the data, since it is in divs and not a table (and I don't want to return every div).
This is the html page source
https://jsfiddle.net/3t6vjyLx/1/
the code is in the jsfiddle---too large to post here
When I inspect the element I can see the product-count markup, but it doesn't come back in the page source.
What do I need to do to load the page fully in my Python script so that I am able to pull that product-count?
There will be 64 separate product-counts (8 locations and 5 sizes each)... is there a way to have it saved in a table in a specific way so that it is sorted by size? Since this wasn't created with a table, that makes it more difficult, but I want to learn how to do it.
Thanks!
This is the inspect of the element: https://i.stack.imgur.com/L2MZV.png
One solution is to use a library like requests_html to create an HTMLSession() that loads the JavaScript elements, which you can then parse.
The code could look something like this:
from requests_html import HTMLSession

def get_html(url):
    session = HTMLSession()
    r = session.get(url)
    r.html.render()  # renders the JavaScript HTML and stores it in {obj}.html.html
    return r.html.html
While this solution may not be the most elegant (web scraping rarely is), I believe it should be sufficient if you're only scraping a small amount of data.

dynamically generate content for a page when clicking on product

Hi everyone. I am making a website with t-shirts. I dynamically generate preview cards for the products using a JSON file, but I also need to generate content for an HTML file when clicking on a card. So, when I click on it, a new HTML page opens, like product.html?product_id=id. I do not understand how to check for the id, or this ?product_id=id part, and generate the page content based on the id. Can anyone please link some guides or good solutions? I don't understand anything :(.
It sounds like you want the user's browser to ask the server to load a particular page based on the value of a variable called product_id.
The way a browser talks to a server is an HTTP request, about which you can learn all the basics on javascript.info and/or MDN.
The ?product_id=id is called the 'query' part of the URL, about which you can learn more on MDN and Wikipedia.
A request that gets a page with this kind of URL from the server is usually a GET request, which is simpler and carries fewer security requirements than the more common and versatile POST request type.
You may notice some of the resources talking about AJAX requests (which are used to update part of the current page without reloading the whole thing), but you won't need to worry about this since you're just trying to have the browser navigate to a new page.
Your server needs to have some code to handle any such requests, basically saying:
"If anybody sends an HTTP GET request here, look at the value of the product_id variable and compare it to my available HTML files. If there's a match, send a response with the matching file, and if there's no match, send a page that says 'Error 404'."
That's the quick overview anyway. The resources will tell you much more about the details.
There are some solutions for how you can get the parameters from the URL:
Get ID from URL with jQuery
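For a site with no backend at all, you can also read the query string directly in the browser and build the page from your JSON file. A minimal client-side sketch (the products.json file name, its id/title fields, and the #product element are all assumed names, not anything from your code):

const params = new URLSearchParams(window.location.search);
const productId = params.get('product_id'); // the value after ?product_id=

// "products.json", the "id"/"title" fields, and "#product" are assumptions
fetch('products.json')
  .then(function (response) { return response.json(); })
  .then(function (products) {
    var product = products.find(function (p) { return String(p.id) === productId; });
    document.querySelector('#product').textContent =
      product ? product.title : 'Product not found';
  });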
It would also make sense to understand what a REST API is and how to build your own, because I think you don't have a backend at the moment.
Here are some refs:
https://www.conceptatech.com/blog/difference-front-end-back-end-development
https://www.tutorialspoint.com/nodejs/nodejs_restful_api.htm

I am using cheerio to grab stats information from https://www.nba.com/players/langston/galloway/204038 but I can't get the table data to show up

The information I want to access: https://i.stack.imgur.com/4SpCU.png
No matter what I do I just can't access the table of stats. I suspect it has to do with there being multiple tables, but I am not sure.
var cheerio = require("cheerio");
var axios = require("axios");

axios
  .get("https://www.nba.com/players/langston/galloway/204038")
  .then(function (response) {
    var $ = cheerio.load(response.data);
    console.log(
      $("player-detail")
        .find("section.nba-player-stats-traditional")
        .find("td:nth-child(3)")
        .text()
    );
  });
The actual HTML returned from your GET request doesn't contain the data or a table. When your browser loads the page, a script is executed that pulls the data in using API calls and creates most of the elements on the page.
If you open the Chrome developer tools (CTRL+SHIFT+J), switch to the Network tab, and reload the page, you can see all of the requests taking place. The first one is the HTML that is downloaded in your axios GET request. If you click on it you can see the HTML is very basic compared to what you see when you inspect the page.
If you click on 'XHR', that will show most of the API calls that are made to get data. There's an interesting one for '204038_profile.json'. If you click on it you can see the information I think you want, in JSON format, which is much easier to use than parsing an HTML table. You can right-click on '204038_profile.json' and copy the full URL:
https://data.nba.net/prod/v1/2019/players/204038_profile.json
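For example, you could fetch that JSON with the same axios setup and skip cheerio entirely. A short sketch (the shape of the payload is an assumption, so print it and inspect it before drilling into specific fields):

var axios = require("axios");

axios
  .get("https://data.nba.net/prod/v1/2019/players/204038_profile.json")
  .then(function (response) {
    // response.data is already parsed JSON -- no HTML parsing needed.
    // Print it first to see the structure before picking out fields.
    console.log(JSON.stringify(response.data, null, 2));
  });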
NOTE: Most websites will not like you using their data like this, you might want to check what their policy is. They could make it more difficult to access the data or change the urls at any time.
You might want to check out this question or this one about how to load the page and run the JavaScript to simulate a browser.
The second one is particularly interesting: it has an answer explaining how you can intercept and mutate requests from puppeteer.
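As a rough sketch of that interception idea, using puppeteer's request-interception API (the page URL is the one from the question; what you do with each request is up to you):

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.setRequestInterception(true);
  page.on('request', (request) => {
    // Log each request the page makes; requests can also be aborted or modified here.
    console.log(request.url());
    request.continue();
  });
  await page.goto('https://www.nba.com/players/langston/galloway/204038');
  await browser.close();
})();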

Retrieving data from browser to JavaScript

I have just started working with JS, and I've managed to post data from a MySQL DB to the website using Node.js, Jade, and plain JS.
Now I'm trying to do it the other way around, i.e. getting data from the website to the JS code and then inserting it into the DB.
What I'm thinking of is simply a text field with a button. When I fill in the text field and press the button, the value is collected by the JS script and then inserted into the DB.
I am, however, having problems with Jade and the listener, and I'm unable to even do a console.log using the listener.
This is what I've got so far in my .jade file.
extends layout

script.
  var something = function() {
    console.log('something')
  }

block content
  button(onclick='something()') Click
The website renders nicely, but nothing is printed when I click the button.
If someone could give a hint on how to fetch the data in my .js file that would also be appreciated.
In the context of the WWW there are two places that JavaScript can run.
On the server, e.g. with node.js
On the browser, embedded in a <script> element
Since you want to put the data into a database on the server, you want to go with option 1. So don't use a <script> element.
Use a <form> instead. (You could use client-side JS to read the data from the form and send it to the server with Ajax, but that seems overcomplicated for your needs, and should be layered on top of a plain HTML solution if you were to go down that route.)
form(action="/myendpoint" method="post")
label
| Data
textarea(name="foo")
button Submit
Then you just need to write server side code to retrieve that. The specifics of that will depend on how you are implementing the HTTP server in Node.
The question How do you extract POST data in Node.js? provides some starting points.
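As a minimal sketch of that server-side piece, assuming Express (the question doesn't name a framework, so that choice is an assumption; the /myendpoint path and the "foo" field match the Jade form above):

const express = require('express');
const app = express();

// Parse application/x-www-form-urlencoded bodies (what a plain HTML form posts)
app.use(express.urlencoded({ extended: false }));

app.post('/myendpoint', (req, res) => {
  const data = req.body.foo; // "foo" is the textarea's name in the form above
  // ...insert `data` into the MySQL DB here...
  res.send('Received: ' + data);
});

app.listen(3000);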

How to call a specific python function from within a Django template

I have a button to copy data from a user-uploaded file to the clipboard in a specific format. I already have that data saved in the database, as it was uploaded in a separate file form. Currently, clicking the copy-to-clipboard button hits a copy_data view in my views.py, which handles the HTTP request and then redirects back to the template containing the button with something like this:
HttpResponseRedirect('previous/template/here')
This works fine, except that because the button links to my copy_data view, which then redirects to the original view containing the button, the entire page reloads, which is undesirable.
I think a better solution would be to somehow bind a python function directly to the button click rather than worrying about redirecting from one view to another.
I've found many examples using Ajax, but haven't found any that work for my use case. I bound a click event to the button without any problems, but I am stuck on figuring out how to tie the Python function to the click.
How can I bind a python function in my Django template upon a button press?
It's tough to tell for sure, but I think you're mixing sync/async paradigms here. When you generate requests with Ajax, you don't (generally) want to return a redirect; you want to return data. That might be JSON data, data formatted as a specific MIME type, or even just text. One way this might look at a high level is:
from django.http import HttpResponse

def copy_data(request):
    # get posted data
    submitted = request.POST
    # do whatever is necessary to create the document
    data = ???
    # first, we'll need a response
    resp = HttpResponse()
    # set the content type, if needed
    resp.content_type = 'text/???; charset=utf-8'
    # the response has a file-like interface
    resp.write(data)
    return resp
Obviously, this would need work to suit your purpose, but that's the high-level approach.
It doesn't sound like you're returning JSON, but there's a special response object (JsonResponse) for that now if you need it.
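On the template side, the button's click handler would then request that view and use the returned data directly instead of following a redirect. A rough sketch (the /copy-data/ URL and the button id are assumptions, and Django's CSRF token handling is omitted for brevity):

document.querySelector('#copy-btn').addEventListener('click', function () {
  // POST matches the view above; add the X-CSRFToken header in a real setup.
  fetch('/copy-data/', { method: 'POST' })
    .then(function (response) { return response.text(); })
    .then(function (text) { return navigator.clipboard.writeText(text); });
});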
