Python web scraper of a JavaScript div table

I have a wholesale website (behind a login) from which I am trying to scrape inventory levels. I've created my Python script and it gives a 200 response for the login.
I'm trying to figure out how to scrape the inventory. I'm 99% sure the content is rendered by JavaScript, but even then I don't know how to extract the data, since it is in divs and not a table (and I don't want to return every div).
This is the html page source
https://jsfiddle.net/3t6vjyLx/1/
the code is in the jsfiddle---too large to post here
When I inspect the element I can see the product-count values, but they are not present in the page source.
What do I need to do to load the page fully in my Python script so that I am able to pull that product-count?
There will be 64 separate product-counts (8 locations and 5 sizes each)... is there a way to save them in a table so that they are sorted by size? Since this wasn't created with a table that makes it more difficult, but I want to learn how to do it.
Thanks!
This is the Inspect view of the element: https://i.stack.imgur.com/L2MZV.png

One solution is to use a library like requests_html to create an HTMLSession() that renders the JavaScript elements, which you can then parse.
The code could look something like this:
from requests_html import HTMLSession

def get_html(url):
    session = HTMLSession()
    r = session.get(url)
    r.html.render()  # executes the JavaScript and stores the rendered markup in r.html.html
    return r.html.html
While this solution may not be the most elegant (web scraping rarely is), I believe it should be sufficient if you're only scraping a small amount of data.
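Once the rendered HTML is in hand, the 64 product-counts can be grouped by size without needing an actual table element. A minimal sketch of the grouping step, assuming the rendered divs carry a "product-count" class plus location and size data attributes (the class and attribute names here are hypothetical stand-ins; the real ones must be read from the inspected element in the question):

```python
import re
from collections import defaultdict

# Hypothetical rendered HTML: the real class/attribute names must be taken
# from the inspected element, this only illustrates the grouping.
html = """
<div class="product-count" data-location="Dallas" data-size="S">12</div>
<div class="product-count" data-location="Dallas" data-size="M">7</div>
<div class="product-count" data-location="Reno" data-size="S">3</div>
"""

pattern = re.compile(
    r'<div class="product-count" data-location="([^"]+)" data-size="([^"]+)">(\d+)</div>'
)

# Group counts by size so each row of the resulting table is one size across locations
table = defaultdict(dict)
for location, size, count in pattern.findall(html):
    table[size][location] = int(count)

for size in sorted(table):
    print(size, table[size])
```

Each size becomes one row keyed by location, which is then straightforward to dump to CSV or a pandas DataFrame.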

Related

How can I get the link from this table content (I guess it's JavaScript)? (Without Selenium)

I'm trying to get the href from these table contents, but it is not available in the HTML code. [edited # 3:44 pm 10/02/2019] I will scrape this site, and others similar to it, on a daily basis and compare against the previous day's data, so I can pick up the new entries. [/edited]
I found a similar (but simpler) solution, but it uses chromedriver (link). I'm looking for a solution that doesn't use Selenium.
Site: http://web.cvm.gov.br/app/esforcosrestritos/#/detalharOferta?ano=MjAxOQ%3D%3D&valor=MTE%3D&comunicado=MQ%3D%3D&situacao=Mg%3D%3D
If you click on the first part of the table (as below)
You will get to this site:
http://web.cvm.gov.br/app/esforcosrestritos/#/enviarFormularioEncerramento?type=dmlldw%3D%3D&ofertaId=ODc2MA%3D%3D&state=eyJhbm8iOiJNakF4T1E9PSIsInZhbG9yIjoiTVRFPSIsImNvbXVuaWNhZG8iOiJNUT09Iiwic2l0dWFjYW8iOiJNZz09In0%3D
How can I scrape the first site to get all the links it have in the tables? (to go for the second "links")
When I use requests.get it doesn't even get the content of the table. Any help?
import requests

link_cvm = "http://web.cvm.gov.br/app/esforcosrestritos/#/detalharOferta?ano=MjAxOQ%3D%3D&valor=MTE%3D&comunicado=MQ%3D%3D&situacao=Mg%3D%3D"
html_code = requests.get(link_cvm)
print(html_code.text)
The second page you are taken to is dynamically loaded using JavaScript. The data you are looking for is served by another URL, in JSON format. There is a lot of information about this around; for one example of many, see this.
In your case, you can get to it this way:
import requests
import json

url = 'http://web.cvm.gov.br/app/esforcosrestritos/enviarFormularioEncerramento/getOfertaPorId/8760'
resp = requests.get(url)
data = json.loads(resp.content)  # equivalently: resp.json()
print(data)
The output is the information on that page.
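The link between the two URLs is that the query-string values are URL-encoded base64: decoding ofertaId=ODc2MA%3D%3D yields the numeric ID 8760 that appears in the JSON endpoint above. A sketch of that decoding step, so each link scraped from the listing can be turned into its getOfertaPorId endpoint:

```python
import base64
from urllib.parse import unquote

def decode_param(value):
    # "ODc2MA%3D%3D" -> "ODc2MA==" -> b"8760" -> "8760"
    return base64.b64decode(unquote(value)).decode()

oferta_id = decode_param("ODc2MA%3D%3D")   # "8760"
ano = decode_param("MjAxOQ%3D%3D")         # "2019"

url = ("http://web.cvm.gov.br/app/esforcosrestritos/"
       "enviarFormularioEncerramento/getOfertaPorId/" + oferta_id)
print(ano, url)
```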

Scraping information from specific remote site PHP

I'm trying to scrape information from this site using PHP; however, the information I'm looking for seems to be generated through JavaScript or similar. I would be grateful for any suggestions on what approach to take!
This is the remote site that I'm trying to fetch data from: http://www.riksdagen.se/sv/webb-tv/video/debatt-om-forslag/yrkestrafik-och-taxi_H601TU11
The page contains a video, and beneath the headline "Anförandelista" there are a number of names/links to individual time spots in the video.
I want to use PHP to automatically fetch the names and links in this list and store them in a database. However, this information is not included in the HTML source, and thus I fail to retrieve it.
Any ideas on how I can remotely access the information using an automated script? Or in which direction I should look for a solution? Any pointers are very much appreciated.
You can get this info as a JSON response from the API call the page makes. I don't know PHP, but a quick Google search shows that handling JSON there is possible and fairly straightforward. I give an example Python script at the bottom.
The API call is this
http://www.riksdagen.se/api/videostream/get/H601TU11
It returns JSON that includes the speaker list (the full response includes the speech text as well).
PHP
Looking at this question you could start with something like:
$array = json_decode(file_get_contents('http://www.riksdagen.se/api/videostream/get/H601TU11'));
An example in Python, if wanted:
import requests
import pandas as pd

r = requests.get('http://www.riksdagen.se/api/videostream/get/H601TU11').json()

results = []
for item in r['videodata'][0]['speakers']:
    start = item['start']
    duration = item['duration']
    speaker = item['text']
    results.append([speaker, start, duration])

df = pd.DataFrame(results, columns=['Speaker', 'Start', 'Duration'])
print(df)
You cannot get information loaded by JavaScript using a PHP-only solution. cURL, file_get_contents, and similar options will only get the server response for you; they will not execute JavaScript, as it is a client-side script.
For that you will need to use a headless browser (there are multiple to choose from: Chromium, Google Chrome with its new headless mode, or the Selenium WebDriver are just a few of the most popular ones).

How to extract data from upbit.com website

I am readdressing a question that I created a few months ago; I lost access to my account, but happened to stumble upon this question while searching around.
My original post was here: Converting JavaScript back to readable HTML in Python script. The problem I am experiencing is that I am not getting the full HTML markup back when I try to webscrape the website. Upbit.com is protected by Cloudflare, so I am using a Python module called cfscrape to bypass it. The Cloudflare module works and gets me partial HTML markup when I output it to a variable, but it is not getting the nested HTML tags at all. The tag I am trying to extract from is a div with the id "root". In the console it only shows that div tag with <...> between its open and close tags. I am still using the same code as before, so nothing has changed. My best guess now is to try to extract the cookie and maybe pass it into a Python curl request, but I am completely unsure of how to do that, hence why I am reaching out to Stack Overflow. I am also totally willing to use other programming languages.
import cfscrape

scraper = cfscrape.create_scraper(delay=15)  # returns a CloudflareScraper instance
# Or: scraper = cfscrape.CloudflareScraper()  # CloudflareScraper inherits from requests.Session
print(scraper.get("https://upbit.com/service_center/notice").content)  # => "<!DOCTYPE html><html><head>..."
Edit 1: This is the data that I'm trying to extract. The information I'm looking for is in a table, and I want to retrieve every tag within that table, since it contains the content shown on the webpage.
Edit 2: Okay, I figured out what data needs to be passed to bypass the Cloudflare authentication each time using the standard "requests" library in Python. The issue I am having now is that even this is still not getting the nested tags. When I make a request it just gets the top-level "root" tag, but not the tags inside that div (as shown in my picture). I have never seen anything like this; typically when you do a GET request it returns all the HTML content of the webpage. Does anyone have any ideas why this would be happening? I'm convinced they are somehow hiding the information using JavaScript, but I don't understand JavaScript well enough to know what to look for when someone tries to obfuscate it.
import cfscrape
import requests

request = "GET / HTTP/1.1\r\n"
scraper = cfscrape.create_scraper(delay=15)
cookie_value, user_agent = cfscrape.get_cookie_string("https://upbit.com/service_center/notice", user_agent='Mozilla/5.0')
request += "Cookie: %s\r\nUser-Agent: %s\r\n" % (cookie_value, user_agent)

# Split the combined cookie string into the two Cloudflare cookies
temp = cookie_value.split('; __cfduid=')
cf_clearance = temp[0].split('cf_clearance=')

headers = {'User-Agent': 'Mozilla/5.0'}
cookies = {'cf_clearance': cf_clearance[1], '__cfduid': temp[1]}
r = requests.get("https://upbit.com/service_center/notice", cookies=cookies, headers=headers).content
print(r)
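One way to confirm that the "hidden" content really is rendered client-side (rather than stripped by Cloudflare) is to check whether the #root div in the raw response is empty. A small stdlib-only sketch; the HTML string below is a stand-in for the real response body:

```python
from html.parser import HTMLParser

class RootDivChecker(HTMLParser):
    """Collects the text content found inside <div id="root">."""
    def __init__(self):
        super().__init__()
        self.in_root = False
        self.depth = 0
        self.text = []

    def handle_starttag(self, tag, attrs):
        if self.in_root:
            self.depth += 1
        elif tag == "div" and ("id", "root") in attrs:
            self.in_root = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.in_root:
            self.depth -= 1
            if self.depth == 0:
                self.in_root = False

    def handle_data(self, data):
        if self.in_root:
            self.text.append(data)

# Stand-in for the server response: the kind of shell a JS-rendered site returns
html = '<html><body><div id="root"></div><script src="app.js"></script></body></html>'
checker = RootDivChecker()
checker.feed(html)
print("rendered client-side" if not "".join(checker.text).strip() else "content present")
```

If the div is empty in the raw response, no amount of cookie handling will help; the table is built in the browser, so the fix is either rendering the page (headless browser) or finding the JSON endpoint the page calls.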

How to more efficiently generate large amounts of HTML

I have a web app running on Google Apps Script. When a user searches for some data, I generate templated HTML on the server and return it to the client, which populates a table (each table row is an accordion that expands down to show much more detailed info for each item).
The problem is that the HTML generation takes ~20 seconds if the user does a search that returns all the data. It returns ~3.5 MB of HTML to the client.
I tried to use jQuery templates, but each row may contain different data, and the format of that data may change periodically; I ended up with more templates than web pages. It's not really maintainable to manage a ton of jQuery templates when 15 lines of code (as a "scriptlet") on the server can create the same HTML.
So my question is: how can you serve a large chunk of data to a client and generate HTML without relying on templates for each data format?
If this is not descriptive enough, please let me know.
The problem is that the HTML generation takes ~20 seconds
Generate the HTML on the client side; the server should only return data.
if the user does a search that returns all data. It returns ~3.5MB of HTML to the client.
Do not return all the data at once; first return only minimal information, such as the number of pages, categories, etc.
When the user selects a page, the client sends a request to get the details from the server.
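The paging idea above can be sketched in a few lines; plain Python stands in here for the Apps Script server function, and the names and page size are illustrative:

```python
def get_page(data, page, page_size=50):
    # Return one slice of results plus paging metadata, so the client
    # renders rows itself instead of receiving megabytes of HTML at once.
    start = page * page_size
    items = data[start:start + page_size]
    return {
        "page": page,
        "total_pages": -(-len(data) // page_size),  # ceiling division
        "items": items,
    }

rows = [{"id": i, "name": "item %d" % i} for i in range(230)]
resp = get_page(rows, page=0)
print(resp["total_pages"], len(resp["items"]))  # 5 50
```

The first response tells the client how many pages exist; each subsequent request returns only the 50 rows actually being displayed.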

Using DOMXpath to extract JSON data

I have used PHP Simple HTML DOM with no success on this issue.
Now I have moved to DOMDocument and DOMXPath, and this does seem promising.
Here is my issue:
I am trying to scrape data from a page which is loaded via a web service request after the page initially shows. It is only milliseconds but because of this, normal scraping shows a template value as opposed to the actual data.
I have found the endpoint url using chrome developer network settings. So if I enter that url into the browser address bar the data displays nicely in JSON format. All Good.
My problem arises because any time the site is re-visited or the page refreshed, the suffix of the endpoint url is randomly-generated so I can't hard-code this url into my php file. For example the end of the url is "?=253648592" on first visit but on refresh it could be "?=375482910". The base of the url is static.
Without getting into headless browsers (I tried, and MY head hurts!), is there a way to have XPath find this random URL when the page loads?
Sorry for being so long-winded but I wanted to explain as best I could.
It's probably much easier and faster to just use a regex if you only need one item/value from the HTML. I would like to give an example, but for that I would need a larger snippet of the HTML that contains the endpoint you want to fetch.
Is it possible to give a snippet of the HTML that contains the endpoint?
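In the meantime, here is the shape a regex solution would take, sketched in Python for brevity (PHP's preg_match works the same way). The base URL and page snippet are hypothetical stand-ins, since only the "?=253648592" suffix pattern is known from the question:

```python
import re

# Hypothetical page source; the endpoint base is an assumed stand-in
html = '<script>fetch("https://example.com/api/data?=253648592")</script>'

# Match the static base followed by the randomly generated numeric suffix
match = re.search(r'https://example\.com/api/data\?=\d+', html)
endpoint = match.group(0) if match else None
print(endpoint)
```

Since the suffix changes on every load, extract the full URL fresh from each fetched page before requesting the JSON.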
