Scraping a messy javascript-heavy website with python


I am trying to scrape the links to the individual apartment listings from this page:
https://www.sreality.cz/en/search/to-rent/apartments?page=2
For instance, for the first apartment I would like to obtain this link:
https://www.sreality.cz/en/detail/lease/flat/1+kt/plzen-jizni-predmesti-technicka/25873756#img=0&fullscreen=false
However, the website relies heavily on JavaScript. With requests.get() I only obtain an uninformative chunk of HTML:
from requests import get
i = 2
url = f"https://www.sreality.cz/en/search/to-rent/apartments?page={i}"
response = get(url)
print(response.text)
-----------------------------
<!doctype html>
<html lang="{{ html.lang }}" ng-app="sreality" ng-controller="MainCtrl">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1.0,minimal-ui">
<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
<title ng:bind-template="{{metaSeo.title}}">Sreality.cz • reality a nemovitosti z celé ČR</title>
<meta name="description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
<meta property="og:title" content="Sreality.cz • reality a nemovitosti z celé ČR">
<meta property="og:type" content="website">
<meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png">
-----------------------------
ETC ...
The question is therefore: how should I go about simple scraping of websites of this kind?
Thanks in advance for the help.

I don't think the website has a public API, but by looking at the API calls in the browser's network tab I could fetch the details you need and assemble them into links. Have a look at the code below.
Let me know if you have any questions :)
import time
import requests

page = 2
numberofresults = 20
# current time in milliseconds, sent as the "tms" parameter
epochmiliseconds = round(time.time() * 1000)
paramsdict = {
    "category_main_cb": 1,
    "category_type_cb": 2,
    "page": page,
    "per_page": numberofresults,
    "tms": epochmiliseconds,
}
data = requests.get("https://www.sreality.cz/api/en/v2/estates", params=paramsdict).json()
for lead in data["_embedded"]["estates"]:
    locality = lead["seo"]["locality"]
    name = lead["name"]
    hash_id = lead["hash_id"]
    # pick the layout token from the name (the part containing "+", e.g. "1+kt")
    typedata = [s for s in name.split(" ") if "+" in s][0].replace("\u00a0", " ").split(" ")[0]
    print(f"https://www.sreality.cz/en/detail/lease/flat/{typedata}/{locality}/{hash_id}")
Output:

First, ask the website's owners whether they provide an API for the information you need.
requests alone will not work for pages that build their content with JavaScript. Use Selenium on its own, or Scrapy in combination with scrapy-selenium; both load and execute the page's JavaScript during scraping.
Feel free to ask if you have any other questions.
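
The Selenium route suggested above could look roughly like the following. This is only a sketch under assumptions: the names render_page, extract_detail_links and DetailLinkParser are made up for illustration, the CSS selector a[href*='/detail/'] is a guess based on the example link in the question, and a local Chrome installation is assumed.

```python
from html.parser import HTMLParser


class DetailLinkParser(HTMLParser):
    """Collect href values of <a> tags whose URL contains '/detail/'."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # keep each detail link once, in page order
        if href and "/detail/" in href and href not in self.links:
            self.links.append(href)


def extract_detail_links(html):
    """Return the detail links found in an already rendered HTML string."""
    parser = DetailLinkParser()
    parser.feed(html)
    return parser.links


def render_page(url, timeout=15):
    """Load url in headless Chrome and return the HTML after JavaScript has run."""
    # Imported here so the parsing helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    options = webdriver.ChromeOptions()
    options.add_argument("--headless=new")
    driver = webdriver.Chrome(options=options)
    try:
        driver.get(url)
        # Assumed selector: wait until at least one detail link has been rendered.
        WebDriverWait(driver, timeout).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, "a[href*='/detail/']"))
        )
        return driver.page_source
    finally:
        driver.quit()


# Possible usage (needs Selenium and Chrome):
#   html = render_page("https://www.sreality.cz/en/search/to-rent/apartments?page=2")
#   for link in extract_detail_links(html):
#       print(link)
```

The parsing step is kept separate from the browser step so it can be checked against a saved HTML snippet without starting a browser.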

Related

How to display information from a website on my discord bot?

I'm building a Discord bot (JavaScript, Node.js, discord.js) based on an online multiplayer game. The bot is nearly finished except for one feature I would really like to add. The game publishes highscores here: https://www.hzgaming.net/high.php. I want to display those highscores, for example money (https://www.hzgaming.net/high.php?scores=money) and materials (https://www.hzgaming.net/high.php?scores=materials). When a user types a command such as '!highscore money', the bot should show the highscores from the corresponding link above, and likewise '!highscore materials' should show the materials highscores. The data has to come from that page because it is updated continuously. I'm fairly sure this is possible, because a similar bot already does the same thing. Example code in the answer would be much appreciated, so that it is easy to understand.
An example of the command is given below:
user - !highscore money
BOT - Money High Scores
Celia_Fernandz - $41,085,610 total wealth
Armando_Domrani - $40,204,664 total wealth
Sergio_Box - $38,199,486 total wealth
Tony_Sativa - $30,193,261 total wealth
Aminox_Trigui - $28,052,188 total wealth
Ben_Martin - $23,439,003 total wealth
Daryl_Grimes - $17,128,518 total wealth
Luccas_Von_Koening - $16,457,964 total wealth
Charlie_Hustle - $14,452,056 total wealth
Kevin_Maddox - $13,630,605 total wealth
user- !highscore materials
BOT - 1. Chapo_Diamond - 5,749,300 materials
2. Van_Damme - 4,923,046 materials
3. Brandon_Heath_Tsung - 3,906,395 materials
4. Armando_Domrani - 3,241,925 materials
5. Tazz_Equinox - 3,187,045 materials
6. Danny_Ted - 2,868,088 materials
7. Jack_Paterson - 2,748,249 materials
8. John_Dixon - 2,548,250 materials
9. Gab_Alphonse - 2,252,285 materials
10. Don_Thomax - 2,131,177 materials
(All replies by the bot will be embeds.)
(The values should keep updating, which is why I gave the links above.)
Please note that the code should be JavaScript using discord.js and Node.js. Thank you <3 :)

I'm not sure if that's possible because of the "Just checking your computer, this will only take a few seconds" page that will always pop up (for DDoS protection). For example, this is what I got in my console after requesting some data:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1">
<meta name="robots" content="noindex, nofollow">
<meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<meta http-equiv="refresh" content="12">
<link rel="Shortcut Icon" href="https://www.hzgaming.net/favicon.ico" type="image/x-icon">
<title>Just a moment...</title>
<style>#font-face{font-family:Open Sans;font-style:normal;font-weight:400;font-display:swap;src:local("Open Sans Regular"),local("OpenSans-Regular"),url(data:font/woff2;base64,
d09GMgABAAAAACjgAA4AAAAAUhQAACiIAAEAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGiIbEBwaBmAAZBEQCoGANONLC4
M8AAE2AiQDhnAEIAWDWgeQShv4QBXjmFXAxsEAi709IypHX42iQhIH/39MoGOIFG6KqtsXWLsNGTEJswgLuVELAuzt
6BPcQHcKGcv5HKXdi+eKlYT6O/H7D3cgR+jqXNVLasoPuSn55q2H3pbvh3OAu9IvBQY5QmOf5EL/td9nz5tz7szuhl
FHoURhUQJqVpEko1I+Ki4WZRSx2vo+qLaiHszxbb7Ne0BkhgdhTe1mgqThm6figcxKku0p+W2vqybIl4ofbmvDwzr/
// and so on... for a very long time.
I don't think there's a way to get past this, but in case you do find one, this is how I would fetch the data:
// this uses the node-fetch npm package - https://www.npmjs.com/package/node-fetch
const fetch = require('node-fetch');
fetch('https://www.hzgaming.net/high.php?scores=money')
    .then((res) => res.text())
    .then((body) => console.log(body));

How to set scale in WKWebView?

I'm trying to modify the scale of a page loaded in a WKWebView. Judging by the HTML, the change is applied, but the actual scale in the WKWebView doesn't change.
Here's how I inject the script:
let script =
    "var viewport = document.querySelector(\"meta[name=viewport]\");" +
    "viewport.setAttribute('content', 'width=device-width, initial-scale=0.4, user-scalable=0');"
let userScript = WKUserScript(source: script,
                              injectionTime: WKUserScriptInjectionTime.atDocumentEnd,
                              forMainFrameOnly: true)
userContentController.addUserScript(userScript)
Print the html:
webView.evaluateJavaScript("document.getElementsByTagName('html')[0].innerHTML", completionHandler: { (innerHTML, error) in
    print(innerHTML)
})
HTML result in console:
Optional(<!--<![endif]--><head>
<title>Bitcoin (BTC) $2391.84 (1.85%) | CoinMarketCap</title>
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=0.4, user-scalable=0">
<meta name="google-site-verification" content="EDc1reqlQ-zAgeRrrgAxRXNK-Zs9JgpE9a0wdaoSO9A">
<meta property="og:type" content="website">
I'm asking myself if this is the right approach. I also tried: let script = document.body.style.zoom = '0.8'; and let script = document.body.style.webkitTransform = 'scale(0.8)'; Both work in terms of scale, but with this method the time-period slider in this highchart loses its functionality. Any new ideas would be much appreciated.

Add <meta name="viewport" content="width=device-width, shrink-to-fit=YES"> to your HTML file.
(I added right after but not sure if you can add it anywhere)
In my case I have html files for privacy in my Xcode project. I added that javascript code right below the and it did the trick for me.
It works as if you used scalePageToFit in UIWebView.
Hope it helps.

Errors in parsing a webpage using jQuery?

I am trying to parse a webpage using jQuery. This is my code:
$.get(u, function(data)
{
    console.log(data);
    $(data).find('meta').each(function()
    {
        console.log($(this).text());
        //alert($(this).text());
        alert($(this).attr('content'));
        console.log($(this).attr('content'));
    });
});
The page source is here.
There are many meta tags on this page, but it is only able to parse a few of them, namely:
<meta property="og:type" content="website" />
<meta property="og:title" content="Affect and Engagement in Game-BasedLearning Environments" />
<meta property="og:description" content="The link between affect and student learning has been the subject of increasing attention in recent years. Affective states such as flow and curiosity tend to have positive correlations with learning while negative states such as boredom and frustrat..."/>
<meta property="og:url" content="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6645369" />
<meta property="og:image" content="http://ieeexplore.ieee.org/assets/img/logo-ieee-200x200.png" />
<meta property="og:site_name" content="IEEE Xplore" />
<meta property="fb:app_id" content="179657148834307" />
What am I doing wrong?

jQuery will skip all meta tags that are not self-closed (that is, that do not end in "/>").
You would therefore be forced to fix up the response with a regular expression, which is admittedly ugly.
It's very sad that we still live in a world where such HTML is out there.
If you really want to go down that path, you could do something like this
data = data.replace(/(<meta[^>]*[^\/>])>/g, "$1/>");
before doing the "find" on it.
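
The effect of that substitution can be tried in isolation. Below is a minimal sketch of the same idea using Python's re module rather than JavaScript; the helper name close_meta_tags is made up for illustration, and the pattern assumes [^>]* so that one match stays inside a single tag.

```python
import re

# [^>]* keeps a match inside one tag; the final [^/>] skips tags that
# already end in "/>", so they are left untouched.
UNCLOSED_META = re.compile(r"(<meta[^>]*[^/>])>")


def close_meta_tags(html):
    """Rewrite <meta ...> as <meta .../> and leave every other tag alone."""
    return UNCLOSED_META.sub(r"\1/>", html)


sample = '<meta name="a" content="b"><meta property="c" content="d" />'
print(close_meta_tags(sample))
# prints: <meta name="a" content="b"/><meta property="c" content="d" />
```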

You must append the response to an element and then get the meta.
$(data).appendTo("#test");
var m = $('#test').find('meta');
alert(m.length);
$('#test').html(''); // delete content
Example: http://jsfiddle.net/B4ATz/

Set meta tags dynamically on a page with JavaScript

I want to change the title on every page with my code, but when I view the source code I still see the old title.
Here is my code:
window.onload = function () {
    setProductMeta();
};
function setProductMeta() {
    document.getElementById("description").setAttribute("content", "dynamic meta description");
    document.getElementById("keywords").setAttribute("content", "dynamic meta keywords");
    document.title = "Point of Sale System";
}
And these are the meta tags it targets:
<title>Welcome to Atmostphere Technology</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta id="title" name="title" content="Welcome to Atmostphere Technology" />
<meta id="description" name="description" content="atmos is best service in cambodia" />
<meta id="keywords" name="keywords" content="atmos is best service in cambodia" />

The server delivers the source code to the browser.
The browser converts the markup into a DOM.
Then JavaScript runs and performs whatever DOM manipulation you have written. It never touches the source code.
While it is sometimes useful to change the title with JavaScript (e.g. Facebook does it to indicate the number of alerts since the page was last interacted with), almost everything that consumes metadata does not execute JavaScript. If you want to change something fundamental about the page, do it in the source code, not with client-side programming.
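
To illustrate the point about consumers that do not execute JavaScript, here is a small sketch of such a consumer written in Python. The HeadReader class and read_head function are hypothetical; the sample markup is adapted from the question, with the title change placed in an inline script.

```python
from html.parser import HTMLParser


class HeadReader(HTMLParser):
    """Read <title> and the description meta tag from raw HTML source."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content")

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        # script contents also arrive here, but only as plain text
        if self._in_title:
            self.title += data


def read_head(source):
    reader = HeadReader()
    reader.feed(source)
    return reader.title, reader.description


SOURCE = """<html><head>
<title>Welcome to Atmostphere Technology</title>
<meta id="description" name="description" content="atmos is best service in cambodia" />
<script>document.title = "Point of Sale System";</script>
</head><body></body></html>"""

print(read_head(SOURCE))
# prints: ('Welcome to Atmostphere Technology', 'atmos is best service in cambodia')
```

The script is never run, so the reader only ever sees the title and description that the server sent.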

Timeline Action Layout - No Attachment displayed

I am trying to post an action to the Facebook Timeline using the JS API:
FB.api('/me/application:action_type' + '?object_type=' + document.location.href, 'post',
    function (response) {
        if (!response || response.error) {
            alert("error");
        } else {
            alert("success");
        }
    });
Posting works well and the API returns no error. A new activity appears on the Timeline, but only as small text within the "recent activities" box, which looks like this:
What could be the problem if the action is not displayed as in the Attachment Preview of the Action Type settings, which looks like this:
I have linked all the properties of the Object Type and tested my object URL with the Facebook Debugging Tool, and all the attributes appear to be parsed correctly by the Facebook scraper.
I also defined an aggregation layout for the action type. So why is no attachment displayed?

You can see a single-action attachment layout on your timeline by setting "Shown on timeline" instead of "Allowed on timeline", but by default you will never see a single action on the timeline. You will see the single-action attachment in the ticker (and maybe in the news stream).
If the user doesn't change the display mode, you will only see aggregations on the timeline.

I also reported this issue as a bug to Facebook. Their reply was that this behavior is by design: the attachment layout only appears in the activity log or when multiple activities have been posted to a user's timeline.

Have you set the object parameters on your web page? For instance:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# app: http://ogp.me/ns/fb/app#">
<meta property="fb:app_id" content="123" />
<meta property="og:type" content="app:action" />
<meta property="og:url" content="http://www.example.com/" />
<meta property="og:title" content="Testing Title" />
<meta property="og:description" content="testing Description" />
<meta property="og:image" content="http://www.example.com/image.jpg" />
You will need to get the correct code off the Facebook Developers website but it is essential that you create your object in order for Facebook to get the parameters from your webpage.
You can test it by simply going into aggregations->preview->add-action, and in event, just paste the webpage. You will see instantly if it works.
