I'm trying to scrape information from this site using PHP; however, the information I'm looking for seems to be generated through JavaScript or similar. I would be grateful for any suggestions on what approach to take!
This is the remote site that I'm trying to fetch data from: http://www.riksdagen.se/sv/webb-tv/video/debatt-om-forslag/yrkestrafik-och-taxi_H601TU11
The page contains a video, and beneath the heading "Anförandelista" there are a number of names/links to individual time spots in the video.
I want to use PHP to automatically fetch the names and links in this list and store them in a database. However, this information is not included in the HTML source, and thus I fail to retrieve it.
Any ideas on how I can remotely access the information using an automated script? Or in which direction I should look for a solution? Any pointers are very much appreciated.
You can get this info as a JSON response from the API call the page makes. I don't know PHP yet, but a quick Google search shows that handling JSON in PHP is straightforward. I give an example Python script at the bottom.
The API call is this:
http://www.riksdagen.se/api/videostream/get/H601TU11
It returns JSON with the speaker list (the response includes the text of each speech as well).
PHP
Looking at this question you could start with something like:
$array = json_decode(file_get_contents('http://www.riksdagen.se/api/videostream/get/H601TU11'));
Example Python, if wanted:
import requests
import pandas as pd

r = requests.get('http://www.riksdagen.se/api/videostream/get/H601TU11').json()

results = []
for item in r['videodata'][0]['speakers']:
    start = item['start']
    duration = item['duration']
    speaker = item['text']
    row = [speaker, start, duration]
    results.append(row)

df = pd.DataFrame(results, columns=['Speaker', 'Start', 'Duration'])
print(df)
You cannot get information loaded by JS with a PHP-only solution. cURL, file_get_contents, and similar options will only get the server response for you; they will not execute JS, as it is a client-side script.
For that you will need to use a headless browser (there are multiple to choose from: Chromium, Google Chrome with its new headless mode, or the Selenium WebDriver are just a few of the most popular ones).
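As a minimal sketch of that route in Python with Selenium and headless Chrome (assuming chromedriver is installed and on your PATH; the target URL is the page from the question):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")  # run Chrome without a visible window

driver = webdriver.Chrome(options=options)
driver.get("http://www.riksdagen.se/sv/webb-tv/video/debatt-om-forslag/yrkestrafik-och-taxi_H601TU11")

html = driver.page_source  # the DOM after the page's JS has run
driver.quit()

From there you can feed html to any ordinary HTML parser, since the JS-generated elements are now part of it.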
Related
I have a wholesale website (behind a login) from which I am trying to scrape inventory levels. I've created my Python script, and it is giving a 200 response for the login.
I'm trying to figure out how to scrape the inventory. I'm 99% sure it is rendered by JavaScript, but even so, I don't know how to return the data, since it is in divs and not a table (and I don't want to return every div).
This is the HTML page source: https://jsfiddle.net/3t6vjyLx/1/ (the code is in the jsfiddle; it's too large to post here).
When I inspect the element, I can see the rendered markup (see the screenshot linked below).
What do I need to do to load the page fully in my Python script so that I am able to pull that product-count?
There will be 64 separate product-counts (8 locations and 5 sizes each)... is there a way to have it saved in a table in a specific way so that it is sorted by size? Since this wasn't created with a table, that makes it more difficult, but I want to learn how to do it.
Thanks!
This is the inspect view of the element: https://i.stack.imgur.com/L2MZV.png
One solution is to use a library like requests_html to create an HTMLSession() that loads the JavaScript elements, which you can then parse.
The code could look something like this:
from requests_html import HTMLSession
def get_html(url):
    session = HTMLSession()
    r = session.get(url)
    r.html.render()  # renders javascript html and stores it in {obj}.html.html
    return r.html.html
While this solution may not be the most elegant (web scraping rarely is), I believe it should be sufficient if you're only scraping a small amount of data.
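Once the page has rendered, you could pull the counts out with the library's CSS-selector API. A rough sketch, assuming the counts live in elements with a class like 'product-count' (check the inspect view for the real selector) and leaving the login step aside:

from requests_html import HTMLSession

url = "https://example.com/inventory"  # placeholder for the wholesale site's inventory page

session = HTMLSession()
r = session.get(url)
r.html.render()  # execute the page's javascript first

# '.product-count' is an assumed selector; substitute whatever the inspect view shows
for element in r.html.find('.product-count'):
    print(element.text)

You could then append each value to a list along with its location and size, and sort or tabulate it however you like.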
[The information I want to access][1]
[1]: https://i.stack.imgur.com/4SpCU.png
No matter what I do, I just can't access the table of stats. I suspect it has to do with there being multiple tables, but I am not sure.
var cheerio = require("cheerio");
var axios = require("axios");

axios
  .get("https://www.nba.com/players/langston/galloway/204038")
  .then(function (response) {
    var $ = cheerio.load(response.data);
    console.log(
      $("player-detail")
        .find("section.nba-player-stats-traditional")
        .find("td:nth-child(3)")
        .text()
    );
  });
The actual HTML returned from your GET request doesn't contain the data or a table. When your browser loads the page, a script is executed that pulls the data in via API calls and creates most of the elements on the page.
If you open the Chrome developer tools (CTRL+SHIFT+J), switch to the Network tab, and reload the page, you can see all of the requests taking place. The first one is the HTML that is downloaded in your axios GET request. If you click on it, you can see the HTML is very basic compared to what you see when you inspect the page.
If you click on 'XHR', that will show most of the API calls that are made to get data. There's an interesting one for '204038_profile.json'. If you click on it, you can see the information I think you want, in JSON format, which is much easier to use than parsing an HTML table. You can right-click on '204038_profile.json' and copy the full URL:
https://data.nba.net/prod/v1/2019/players/204038_profile.json
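If you'd rather pull that JSON directly, a quick sketch (in Python here just for brevity; the structure of the response is best explored by printing it first):

import requests

url = "https://data.nba.net/prod/v1/2019/players/204038_profile.json"
data = requests.get(url).json()

# Print the top-level keys first, then drill down to the stats you need
print(data.keys())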
NOTE: Most websites will not like you using their data like this; you might want to check what their policy is. They could make it more difficult to access the data, or change the URLs, at any time.
You might want to check out this question or this one about how to load the page and run the JavaScript to simulate a browser.
The second one is particularly interesting; it has an answer explaining how you can intercept and mutate requests from Puppeteer.
I'm trying to get the href from these table contents, but it is not available in the HTML code. [edited # 3:44 pm 10/02/2019] I will scrape this site, and others similar to it, on a daily basis and compare with the previous day's data, so that I get the daily new info. [/edited]
I found a similar (but simpler) solution, but it uses chromedriver (link). I'm looking for a solution that doesn't use Selenium.
Site: http://web.cvm.gov.br/app/esforcosrestritos/#/detalharOferta?ano=MjAxOQ%3D%3D&valor=MTE%3D&comunicado=MQ%3D%3D&situacao=Mg%3D%3D
If you click on the first part of the table, you will get to this site:
http://web.cvm.gov.br/app/esforcosrestritos/#/enviarFormularioEncerramento?type=dmlldw%3D%3D&ofertaId=ODc2MA%3D%3D&state=eyJhbm8iOiJNakF4T1E9PSIsInZhbG9yIjoiTVRFPSIsImNvbXVuaWNhZG8iOiJNUT09Iiwic2l0dWFjYW8iOiJNZz09In0%3D
How can I scrape the first site to get all the links it has in the tables (in order to get to the second set of links)?
When I use requests.get it doesn't even get the content of the table. Any help?
import requests

link_cvm = "http://web.cvm.gov.br/app/esforcosrestritos/#/detalharOferta?ano=MjAxOQ%3D%3D&valor=MTE%3D&comunicado=MQ%3D%3D&situacao=Mg%3D%3D"
html_code = requests.get(link_cvm)
print(html_code.text)
The second page you are taken to is dynamically loaded using JavaScript. The data you are looking for is contained in another page, in JSON format. Search around; there is a lot of information about this. For one example of many, see this.
In your case, you can get to it this way:
import requests
import json
url = 'http://web.cvm.gov.br/app/esforcosrestritos/enviarFormularioEncerramento/getOfertaPorId/8760'
resp = requests.get(url)
data = json.loads(resp.content)
print(data)
The output is the information on that page.
For some reason the National Weather Service's XML site does not work for me. When I say "does not work", I mean that I've tried both XMLHttpRequest and ajax to GET the XML data from http://w1.weather.gov/xml/current_obs/KSFO.xml in order to write a script that displays current weather conditions. This is my code:
(function () {
    updateWeather();
})();

function updateWeather() {
    var url = "http://w1.weather.gov/xml/current_obs/KSFO.xml";
    $.ajax({
        url: url,
        dataType: 'xml',
        error: function (xhr) {
            document.getElementById("weatherbox").innerHTML = "error" + xhr.status + xhr.statusText;
        },
        success: function (result, status, xhr) {
            document.getElementById('weatherbox').innerHTML = "success";
        }
    });
}
I know that you typically cannot request information cross-domain, but the NWS site says it is open to the public, and since it seems as though nobody else has this problem, it must not be a cross-domain error. Still, I have tried using crossDomain: true in the ajax call. I have tried making the URL "https:...." instead, but that did nothing. I've tried specifying type: 'GET' in the ajax call as well. Every time I run the script it returns error0error. Does anyone have any ideas? A working implementation of an ajax call would be even better; I've been working at this for days, and it's driving me crazy that I can't seem to retrieve this data.
In response to the first comment: I looked into it before, but it seems like the SOAP service is for requesting data packages, such as "the weather in SF from January to September" or something. And from the looks of this:
"XML Feeds of Current Weather Conditions
This page provides access to observed current weather conditions for about 1,800 locations across the United States and US Territories. Two file formats designed for computer to computer data transfer are provided. RSS and XML lists are provided to aid the automated dissemination of this information. More information on RSS and XML formats/feeds. Comments and feedback are welcome. There is additional information about this offering via this Product Description Document.
Select a State or Territory to locate XML weather observations feeds available:
Select a State/Territory above to list display list of observations stations An index list of all available stations is available in XML (900kb): XML Format"
and
"About XML
NWS offers hourly weather observations formatted with xml tags to aid in the parsing of the information by automated programs used to populate databases, display information on webpages or other similar applications. This format is not to be confused with RSS and cannot be read by RSS readers and aggregators. These files present more detailed information than the RSS feeds in strings friendly for parsing. Both the RSS and XML feeds offer URLs to icon images. Additionally, A list of what phrases may appear in the XML tag and suggested icons is available. To access these feeds, select a state and then the last XML link in the column."
from this site: http://w1.weather.gov/xml/current_obs/
I should be able to just use the XML from the link I posted above to retrieve current observation data, not packages like one would use for calculating or predicting forecast trends. And it seems as though the SOAP request service actually would not work for my purposes, because I cannot just order one data point.
You could use a JSONP request to avoid getting CORS errors, but this SOAP service does not wrap its data in a script. Try reading the docs here; you'll most probably have to create a client. NWS also provides a RESTful API; read the tutorials here.
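As a rough illustration of the REST route (sketched in Python rather than browser JS; the endpoint shape here is taken from api.weather.gov and may differ from what was available when this was asked):

import requests

# Latest observation for KSFO as JSON; NWS asks for a User-Agent header
url = "https://api.weather.gov/stations/KSFO/observations/latest"
obs = requests.get(url, headers={"User-Agent": "weather-demo"}).json()

print(obs["properties"]["textDescription"])
print(obs["properties"]["temperature"])

Getting JSON back also sidesteps the XML parsing entirely.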
If you can use a PHP proxy, then look at http://www.webresourcesdepot.com/cross-domain-javascript-with-simple-php-proxy/ for the solution, and the corresponding code link at pastebin.
To summarize, the solution uses an intermediary to the remote site that sits at the same location as your JS code. You invoke the proxy by setting the url parameter to your target. Let's say you saved the proxy code as 'weatherproxy.php' and your web server supports PHP with cURL; then you would set your variable as
var url = 'weatherproxy.php?url=http://w1.weather.gov/xml/current_obs/KSFO.xml';
With no other options passed to the proxy, on success it will return JSON of the form:
{ status: { http_code: 200 }, contents: "your xml contents as a string" }
From there you would have to invoke an XML parser on 'contents'. Alternatively, there is a parameter you can supply to the proxy to return the raw XML: '&mode=native'. I'm not sure, though, that jQuery can properly handle the XML that comes back.
Have fun exploring the code.
To document some recent events, I saved all tweets containing a particular hashtag. Now I have about 50,000 tweets that I want to publish. To save bandwidth and server load, I just want to send the raw tweet text to the client and then render it with JavaScript (linking hashtags, usernames, and URLs).
Is there already a JavaScript library that can parse a raw tweet and create an HTML representation of it?
twitterlib.render() looks like a good start... assuming you have parsed JSON tweet data:
<script src="twitterlib.js"></script>
<script>
var parsed_tweet_data = getTweetData(...); // get a Tweet JS object...
var html = twitterlib.render(parsed_tweet_data);
// Do something with the rendered html now...
</script>
Here's a twitterlib walkthrough on SlideShare.net (slide 17 has a demo.)
Have you considered using the Twitter oEmbed API? It basically lets you request the "official" embedded-tweet HTML programmatically via an anonymous API (no authentication required). This would at least make it easy to meet the display requirements without reinventing the wheel. You can even do this client-side, depending on your use case.
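A minimal server-side sketch in Python, assuming the standard publish.twitter.com/oembed endpoint (the tweet URL below is just a placeholder):

import requests

# Returns the official embedded-tweet HTML; no authentication required
tweet_url = "https://twitter.com/Interior/status/507185938620219395"  # placeholder
resp = requests.get("https://publish.twitter.com/oembed", params={"url": tweet_url})

print(resp.json()["html"])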
I'm grappling with this same issue, so let us know what you try and how it works for your project.