Beautiful Soup returns only JavaScript code? - javascript

I want to scrape data from the following website: https://dell.secure.force.com/FAP/?c=de&l=de&pt=findareseller
I tried to get the data from the network tab, but it returned nothing. Then I tried BeautifulSoup, but it returns only JavaScript with empty tbody tags, even though the inspector shows the data in a table.
import requests
from bs4 import BeautifulSoup

url = 'https://dell.secure.force.com/FAP'
headers = {
    'Connection': 'keep-alive'
}
data = {
    'pt': "findareseller"
}

page = requests.get(url, params=data)
soup = BeautifulSoup(page.text, 'html.parser')
soup.find_all('table')  # returns only javascript code.
Can someone help me with how to scrape this data?

soup.find_all('table') returns a list of all table elements.
So to find your specific element, look for a distinct property that sets it apart from the other tables (such as an id or class).
To access an element's attributes, use t[0].attrs to get a dict of them and, for example, t[0]["width"] to read one of them.
Also: by using soup.select('table') instead, you can pass CSS selectors as the string input, so you won't have to deal with BeautifulSoup's own filter functions.
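A small self-contained illustration of those points (the resellerTable id is made up just to show the pattern):
from bs4 import BeautifulSoup

html = '<table id="resellerTable" width="100%"><tr><td>ACME GmbH</td></tr></table>'
soup = BeautifulSoup(html, 'html.parser')

tables = soup.find_all('table', id='resellerTable')   # narrow the search with a distinct attribute
print(tables[0].attrs)       # {'id': 'resellerTable', 'width': '100%'}
print(tables[0]['width'])    # 100%

# the same element via a CSS selector with soup.select()
print(soup.select('table#resellerTable')[0]['width'])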

Thank you all.
I figured out the answer.
I used the network tab to watch the search requests and found the search URL. To confirm it was the right one, I double-clicked it and it returned the exact same page. So I copied the request as cURL (bash) and pasted it into Postman as a raw-text import. It turned out the page actually uses a POST request. After switching to a POST request, I was able to fetch the data I needed.
Below is the request with POST:
response = requests.request("POST", url, headers=headers, data=payload)
Then I parsed the response with BeautifulSoup as soup:
st = soup.find('input')['value']  # returns the data I needed
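Putting those pieces together, a self-contained sketch of that flow looks like this; the endpoint, headers, and payload are placeholders for whatever your own network tab or Postman import shows, not the real values:
import requests
from bs4 import BeautifulSoup

# PLACEHOLDERS: substitute the values captured from your own network tab
url = 'https://example.invalid/search-endpoint'   # placeholder: the captured search URL
headers = {'Connection': 'keep-alive'}            # placeholder: headers from the captured request
payload = {'pt': 'findareseller'}                 # placeholder: form fields from the captured request

response = requests.request("POST", url, headers=headers, data=payload)
soup = BeautifulSoup(response.text, 'html.parser')

# the value I needed was in the first <input> element of the response
value = soup.find('input')['value']
print(value)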

Related

Python BeautifulSoup Webscraping: Simulate Click to Scrape All Pages

I ran into an interesting issue while trying to scrape http://www.o-review.com/database_filter_model.php?table_name=glasses&tag= containing 42 pages of data. I was able to successfully scrape the first page of information, but while trying to scrape all pages I found that the URL remains unchanged, and changing the page uses a button at the bottom of the website.
The html code in the inspector reads:
<div onclick="filter_page('1')" class="filter_nav_button round5" style="cursor:pointer;"><img src="/images/icon_arrow_next.svg"></div>
I'm very new at scraping and python but was told I need to simulate a "click" in the javascript which I have absolutely no idea how to do, and wasn't sure if it could be hard-coded. My weak attempt to try something so far:
from requests import get
from bs4 import BeautifulSoup as bs

response = get('http://www.o-review.com/database_filter_model.php?table_name=glasses&tag=')
soup = bs(response.text, 'html.parser')
print(soup)

for page in range(1, 42):
    # this only finds the button div, it doesn't actually change the page
    pages = soup.find('div', attrs={'onclick': 'filter_page()'})
Hopefully, someone has solved this issue in the past. Help would be greatly appreciated! Thanks!
Edit: Here is the code I'm trying to add:
## Find All Frame models
for find_frames in soup.find_all('a', class_='round5 grid_model'):
    # Each iteration grabs child text and prints it
    all_models = find_frames.text
    print(all_models)
This would be added where the comment was to add code! Thanks!
The request is made via a POST request; you can check my previous answer to see how to find the actual API.
Also, the choice of html.parser versus lxml is not part of your issue.
The reason I used lxml is that it's faster than html.parser, according to the documentation.
import requests
from bs4 import BeautifulSoup
from pprint import pp


def main(url):
    with requests.Session() as req:
        for page in range(1, 44):
            print("[*] - Extracting Page# {}".format(page))
            data = {
                "table_name": "glasses",
                "family": "",
                "page": "{}".format(page),
                "sort": "",
                "display": "list",
            }
            r = req.post(url, data=data)
            soup = BeautifulSoup(r.text, 'lxml')
            pp([x.text for x in soup.select('.text-clip') if x.get_text(strip=True)])


if __name__ == "__main__":
    main('http://www.o-review.com/ajax/database_filter_model.php')
I saw the answer to this question and your comment. The reason αԋɱҽԃ αмєяιcαη's code works is that it sends the request to the actual AJAX API the site gets its data from. You can easily use your browser's developer tools to track it down. It's not because of lxml or anything like that; you just had to find the right source ;)
And of course αԋɱҽԃ αмєяιcαη should have explained some parts of his answer to clarify everything for you.
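Building on that answer, here is an untested sketch for the frame-model names the question's edit asked about; it assumes the AJAX response contains the same <a class="round5 grid_model"> anchors as the rendered page (if it doesn't, keep using the .text-clip selector above):
import requests
from bs4 import BeautifulSoup

url = 'http://www.o-review.com/ajax/database_filter_model.php'
models = []
with requests.Session() as req:
    for page in range(1, 44):
        data = {
            "table_name": "glasses",
            "family": "",
            "page": str(page),
            "sort": "",
            "display": "list",
        }
        r = req.post(url, data=data)
        soup = BeautifulSoup(r.text, 'lxml')
        # grab the text of every frame-model link on this page (assumed markup)
        models.extend(a.get_text(strip=True) for a in soup.select('a.round5.grid_model'))

print(len(models))
print(models[:10])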

How to webscrape data from a webpage with dynamic HTML (Python)?

I'm trying to figure out how to scrape the data from the following url: https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx
Here is the type of data (the screenshot is omitted here): a searchable table of NICU facilities with fields such as name, city, state, level, and bed count.
It appears that everything is populated from a database and loaded into the webpage via JavaScript.
I've done something similar in the past using selenium and PhantomJS but I can't figure out how to get these data fields in Python.
As expected, I can't use pd.read_html for this type of problem.
Is it possible to parse the results from:
from selenium import webdriver
url="https://www.aap.org/en-us/advocacy-and-policy/aap-health-initiatives/nicuverification/Pages/NICUSearch.aspx"
browser = webdriver.PhantomJS()
browser.get(url)
content = browser.page_source
Or maybe to access the actual underlying data?
If not, what are other approaches short of copy and pasting for hours?
EDIT:
Building on the answer below from @thenullptr, I have been able to access the material, but only on page 1. How can I adapt this to go across all of the pages (recommendations on how to parse it properly are welcome)? My end goal is to have this in a pandas DataFrame.
import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.post(
    url='https://search.aap.org/nicu/',
    data={'SearchCriteria.Level': '1', 'X-Requested-With': 'XMLHttpRequest'},
)  # key:value
html = r.text

# Parsing the HTML
soup = BeautifulSoup(html.split("</script>")[-1].strip(), "html.parser")
div = soup.find("div", {"id": "main"})
div = soup.findAll("div", {"class": "blue-border panel list-group"})


def f(x):
    ignore_fields = ['Collapse all', 'Expand all']
    output = list(filter(bool, map(str.strip, x.text.split("\n"))))
    output = list(filter(lambda x: x not in ignore_fields, output))
    return output


results = pd.Series(list(map(f, div))[0])
To follow on from my last comment, the below should give you a good starting point. When looking through the XHR calls, you just want to see what data is being sent to and received from each one, to pinpoint the one you need. Below is the raw POST data sent to the API when doing a search; it looks like you need to fill in at least one of the criteria and always include the last field (X-Requested-With).
{
"SearchCriteria.Name": "smith",
"SearchCriteria.City": "",
"SearchCriteria.State": "",
"SearchCriteria.Zip": "",
"SearchCriteria.Level": "",
"SearchCriteria.LevelAssigner": "",
"SearchCriteria.BedNumberRange": "",
"X-Requested-With": "XMLHttpRequest"
}
Here is a simple example of how you can send a POST request using the requests library. The web page will reply with the raw data, so you can use BS or similar to parse it and get the information you need.
import requests

r = requests.post('https://search.aap.org/nicu/',
                  data={'SearchCriteria.Name': 'smith', 'X-Requested-With': 'XMLHttpRequest'})  # key:value
print(r.text)
prints <strong class="col-md-8 white-text">JOHN PETER SMITH HOSPITAL</strong>...
https://requests.readthedocs.io/en/master/user/quickstart/
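To address the edit's end goal of a pandas DataFrame: a minimal sketch, assuming the markup shown above (each facility name sits in a <strong class="col-md-8 white-text"> element). The pagination parameter isn't shown anywhere in this thread, so this only handles the first page of results:
import requests
import pandas as pd
from bs4 import BeautifulSoup

r = requests.post(
    'https://search.aap.org/nicu/',
    data={'SearchCriteria.Level': '1', 'X-Requested-With': 'XMLHttpRequest'},
)
soup = BeautifulSoup(r.text, 'html.parser')

# facility names, based on the <strong class="col-md-8 white-text"> markup shown above
names = [el.get_text(strip=True) for el in soup.select('strong.col-md-8.white-text')]
df = pd.DataFrame({'facility_name': names})
print(df.head())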

scrapy + selenium: <a> tag has no href, but content is loaded by javascript

I'm almost there with my first try at using Scrapy and Selenium to collect data from a website with JavaScript-loaded content.
Here is my code:
# -*- coding: utf-8 -*-
import scrapy
from selenium import webdriver
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.webdriver.common.by import By
import time


class FreePlayersSpider(scrapy.Spider):
    name = 'free_players'
    allowed_domains = ['www.forge-db.com']
    start_urls = ['https://www.forge-db.com/fr/fr11/players/?server=fr11']
    driver = {}

    def __init__(self):
        self.driver = webdriver.Chrome('/home/alain/Documents/repository/web/foe-python/chromedriver')
        self.driver.get('https://forge-db.com/fr/fr11/players/?server=fr11')

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        # time.sleep(1)
        sel = Selector(text=self.driver.page_source)
        players = sel.xpath('.//table/tbody/tr')
        for player in players:
            joueur = player.xpath('.//td[3]/a/text()').get()
            guilde = player.xpath('.//td[4]/a/text()').get()
            yield {
                'player': joueur,
                'guild': guilde
            }

        next_page_btn = self.driver.find_element_by_xpath('//a[@class="paginate_button next"]')
        if next_page_btn:
            time.sleep(2)
            next_page_btn.click()
            yield scrapy.Request(url=self.start_urls, callback=self.parse)

        # Close the selenium driver, so in fact it closes the testing browser
        self.driver.quit()

    def parse_players(self):
        pass
I want to collect the user names and their guilds and output them to a CSV file.
For now my issue is proceeding to the NEXT PAGE and parsing the JavaScript-loaded content again.
Even if I'm able to simulate a click on the NEXT button, I'm not 100% sure the code will go through all pages, and I'm not able to parse the new content using the same function.
Any idea how I could solve this issue?
Thanks.
Instead of using Selenium, you should try to recreate the request that updates the table. If you look closely in Chrome DevTools, you can see that the request is made with parameters and that the response comes back with the data in a nicely structured format.
Please see here with regards to dynamic content in Scrapy. As it explains, the first step is to ask whether it's necessary to recreate browser activity, or whether you can get the information you need by reverse engineering the HTTP requests. Sometimes the information is hidden inside <script></script> tags and you can use a regex or some string methods to get what you want. Rendering the page and then simulating browser activity should be thought of as a last resort.
Before I go into some background on reverse engineering the requests: the website you're trying to get information from only requires reverse engineering the HTTP requests.
Reverse Engineering HTTP requests in Scrapy
For the website itself, we can use Chrome DevTools by right-clicking the page and choosing Inspect. Clicking the Network tab lets you see all the requests the browser makes to render the page. In this case you want to see what happens when you click next.
Image1: here
Here you can see all the requests made when you click next on the page. I always look for the biggest sized response as that'll most likely have your data.
Image2: here
Here you can see the request headers/params etc... the things you need to make a proper HTTP request. We can see that the requested URL is actually getplayers.php, with all the params needed to get the next page added on. If you scroll down you can see all the same parameters it sends to getplayers.php. Keep this in mind: sometimes we need to send headers, cookies, and parameters.
Image3: here
Here is the preview of the data we would get back from the server if we make the correct request, it's a nice neat format which is great for scraping.
Now, you could copy the headers, parameters, and cookies into Scrapy here, but it's always worth checking first whether just making an HTTP request with the URL alone gives you the data you want; if so, that's the simplest way.
In this case it does, and in fact you get the data back in a nice, neat format.
Code example
import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['forge-db.com']

    def start_requests(self):
        url = 'https://www.forge-db.com/fr/fr11/getPlayers.php?'
        yield scrapy.Request(url=url)

    def parse(self, response):
        for row in response.json()['data']:
            yield {'name': row[2], 'guild': row[3]}
Settings
In settings.py, you need to set ROBOTSTXT_OBEY = False. The site doesn't want you to access this data, so we need to set it to False. Be careful: you could end up getting banned from the server.
I would also suggest a couple of other settings, to be respectful and to cache the results, so that if you want to play around with this large dataset you don't hammer the server.
CONCURRENT_REQUESTS = 1
DOWNLOAD_DELAY = 3
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'
Comments on the code
We make a request to https://www.forge-db.com/fr/fr11/getPlayers.php? and if you were to print the response you would get all the data from the table; it's quite a lot... It looks like it's in JSON format, so we use Scrapy's newer feature for handling JSON and converting it into a Python dictionary: response.json(). Be sure you have an up-to-date Scrapy to take advantage of this. Otherwise you could use the json library that Python provides to do the same thing, as sketched below.
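A minimal sketch of that json-library fallback (the extract_players helper is just for illustration):
import json

def extract_players(raw_json):
    # raw_json is response.text, the body returned by getPlayers.php;
    # json.loads gives the same dict that response.json() would return
    data = json.loads(raw_json)
    return [{'name': row[2], 'guild': row[3]} for row in data['data']]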
Now you have to look at the preview data a bit here, but the individual rows are within response.json()['data'][i], where i is the row index. The name and guild are in response.json()['data'][i][2] and response.json()['data'][i][3], so we loop over every entry of response.json()['data'] and grab the name and guild.
If the data weren't as structured as it is here and needed modifying, I would strongly urge you to use Items or ItemLoaders for creating the fields that you then output. You can modify the extracted data more easily with ItemLoaders, and you can deal with duplicate items etc. using a pipeline. These are just some thoughts for the future; I almost never yield a plain dictionary when extracting data, particularly for large datasets.
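For illustration only, a minimal sketch of the Item/ItemLoader pattern mentioned above, applied to the name/guild fields (the item class, processors, and helper are my own example, and it assumes a recent Scrapy, which ships the itemloaders package):
import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from scrapy.loader import ItemLoader


class PlayerItem(scrapy.Item):
    # strip whitespace on input, keep a single value on output
    name = scrapy.Field(input_processor=MapCompose(str.strip), output_processor=TakeFirst())
    guild = scrapy.Field(input_processor=MapCompose(str.strip), output_processor=TakeFirst())


def load_player(row):
    # row is one entry of response.json()['data'], as in the spider above
    loader = ItemLoader(item=PlayerItem())
    loader.add_value('name', row[2])
    loader.add_value('guild', row[3])
    return loader.load_item()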

How does jQuery Autocomplete dynamically filter responses

I am currently using http://www.devbridge.com/sourcery/components/jquery-autocomplete/#jquery-autocomplete to autocomplete input.
My question is: how is the demo at the above link automatically filtering results?
If I use a local datastore, it filters the results for me.
<script>
  var suggestions = [
    "Afghan",
    "African",
    "Senegalese",
    "American",
    "Arabian",
    "Arab Pizza",
    "Argentine",
    "Armenian",
    "Asian Fusion",
    "Asturian",
    "Australian",
    "Austrian"
  ];

  $('#categories').autocomplete({
    // serviceUrl: '/autocomplete/categories',
    lookup: suggestions,
    delimiter: ',',
    maxHeight: 200,
    minChars: 2
  });
</script>
However, if I instead replace "lookup:" with an external datastore (serviceUrl), the results are no longer filtered.
Here's my code for the external-calls version:
class AjaxHandler(webapp2.RequestHandler):
    def __init__(self, request, response):
        self.initialize(request, response)
        self.categories = []
        with open("static/categories.data") as categories_file:
            for entry in categories_file:
                self.categories.append(str(entry))
                print entry

    def get(self):
        suggestions = {"suggestions": self.categories}
        self.response.write(json.dumps(suggestions))
        self.response.headers.add_header("Content-Type", "application/json; charset=UTF-8")
With this version, it's still doing an edit-distance with all of the entries, but filtering is no longer working.
Here is their API: https://github.com/devbridge/jQuery-Autocomplete
There's a bunch of options there, and if anyone can give me some pointers to which one might help, that'd be great.
That demo isn't using an external data source.
But I'm not sure what you're asking: the whole point of using an external data source is that the source does the filtering; it only returns values that match the token sent with the Ajax GET. Otherwise you might as well include all the data in the original request.
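For what it's worth, here is a sketch of what that server-side filtering could look like in your webapp2 handler; by default the devbridge plugin sends the typed text as the query parameter (configurable via its paramName option), but double-check that against the plugin docs:
import json
import webapp2


class AjaxHandler(webapp2.RequestHandler):
    def get(self):
        # the plugin sends the current input as ?query=... by default
        query = self.request.get('query', '').lower()
        categories = ["Afghan", "African", "Senegalese", "American"]  # normally loaded from static/categories.data
        matches = [c for c in categories if query in c.lower()]
        self.response.headers['Content-Type'] = 'application/json; charset=UTF-8'
        self.response.write(json.dumps({"suggestions": matches}))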
When you try to make a request to another server from your JavaScript, it will usually be blocked by the web browser because of security concerns (you can google the keyword "cross-domain JavaScript request").
If you are using Java, you can write some Java code, a controller or servlet (not JavaScript), that makes the request to the other server and passes the result to your HTML, just like a bridge. You can do the same thing with PHP or Python.

Scraping with Nokogiri and Ruby before and after JavaScript changes the value

I have a program that scrapes values from https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj
My current code is:
doc = Nokogiri::HTML(open(source_url))
puts doc.css('span.indexDate').text
date = doc.css('span.indexDate').text
date = Date.parse(date)
puts date
values = doc.css('table#CdsIndexTable td.col2 span')
puts values
This scrapes the date and the values of the second column from the "CDS Indexes" table correctly, which is fine. Now I want to scrape the same values from the "Bond Indexes" table, and that's where I'm facing the problem.
I can see that a JavaScript function changes the table without reloading the page and without changing the URL. The difference between the two tables is that their IDs are different, which is exactly as it should be. But unfortunately, when I try:
values = doc.css('table#BondIndexTable')
puts values
I get nothing from the Bond Indexes table. But I get values from CDS Indexes table if I use:
values = doc.css('table#CdsIndexTable')
puts values
How can I get the values from both tables?
You can use Capybara with the Poltergeist driver to execute the JavaScript and render the page. Poltergeist is a wrapper for the PhantomJS headless browser. Here's an example of how you can do it:
require 'rubygems'
require 'nokogiri'
require 'capybara'
require 'capybara/dsl'
require 'capybara/poltergeist'

Capybara.default_driver = :poltergeist
Capybara.run_server = false

module GetPrice
  class WebScraper
    include Capybara::DSL

    def get_page_data(url)
      visit(url)
      doc = Nokogiri::HTML(page.html)
      doc.css('td.col2 span')
    end
  end
end

scraper = GetPrice::WebScraper.new
puts scraper.get_page_data('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj').map(&:text).inspect
Visit here for a complete example using Amazon.com:
https://github.com/wakproductions/amazon_get_price/blob/master/getprice.rb
If you don't want to use PhantomJS you can also use the network sniffer on Firefox or Chrome development tools, and you will see that the HTML table data is returned with a javascript POST request to the server.
Then, instead of opening the original page URL with Nokogiri, you'd run this POST from your Ruby script and parse and interpret that data instead. It looks like it's just JSON data with HTML embedded into it. You could extract the HTML and feed that to Nokogiri.
It requires a bit of extra detective work, but I've used this method many times with JavaScript web pages and scraping. It works OK for most simple tasks, but it requires a bit of digging into the inner workings of the page and network traffic.
Here's an example of the JSON data from the Javascript POST request:
Bonds:
https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ
CDS:
https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=cds&ClientCode=WSJ
Here's the quick and dirty solution just so you get an idea. This will grab the cookie from the initial page and use it in the request to get the JSON data, then parse the JSON data and feed the extracted HTML to Nokogiri:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'json'
# Open the initial page to grab the cookie from it
p1 = open('https://web.apps.markit.com/WMXAXLP?YYY2220_zJkhPN/sWPxwhzYw8K4DcqW07HfIQykbYMaXf8fTzWT6WKnuivTcM0W584u1QRwj')
# Save the cookie
cookie = p1.meta['set-cookie'].split('; ',2)[0]
# Open the JSON data page using our cookie we just obtained
p2 = open('https://web.apps.markit.com/AppsApi/GetIndexData?indexOrBond=bond&ClientCode=WSJ',
'Cookie' => cookie)
# Get the raw JSON
json = p2.read
# Parse it
data = JSON.parse(json)
# Feed the html portion to Nokogiri
doc = Nokogiri.parse(data['html'])
# Extract the values
values = doc.css('td.col2 span')
puts values.map(&:text).inspect
=> ["0.02%", "0.02%", "n.a.", "-0.03%", "0.02%", "0.04%",
"0.01%", "0.02%", "0.08%", "-0.01%", "0.03%", "0.01%", "0.05%", "0.04%"]
PhantomJS is a headless browser with a JavaScript API. Since you need to run the scripts on the page you are scraping, a browser will do that for you; and PhantomJS will allow you to manipulate and scrape the page after the script execution.
