I am trying to parse a webpage using jQuery. This is my code:
$.get(u, function(data)
{
console.log(data);
$(data).find('meta').each(function()
{
console.log($(this).text());
//alert($(this).text());
alert($(this).attr('content'));
console.log($(this).attr('content'));
});
});
The page source is here.
There are many meta tags in this page, but it is only able to parse a few of them, namely:
<meta property="og:type" content="website" />
<meta property="og:title" content="Affect and Engagement in Game-BasedLearning Environments" />
<meta property="og:description" content="The link between affect and student learning has been the subject of increasing attention in recent years. Affective states such as flow and curiosity tend to have positive correlations with learning while negative states such as boredom and frustrat..."/>
<meta property="og:url" content="http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=6645369" />
<meta property="og:image" content="http://ieeexplore.ieee.org/assets/img/logo-ieee-200x200.png" />
<meta property="og:site_name" content="IEEE Xplore" />
<meta property="fb:app_id" content="179657148834307" />
What am I doing wrong?
jQuery will skip all meta tags that are not self-closed, that is, tags that end with > instead of />.
You would therefore be forced to parse the response with a regular expression, which is ugly.
It's very sad that we still live in a world where such HTML is out there.
If you really want to go down that path, you could do something like this
data = data.replace(/(<meta[^>]*[^\/>])>/g, "$1/>");
before doing the "find" on it.
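As a rough sketch of the regular-expression route mentioned above, the two helpers below self-close meta tags and pull out their content attributes from a plain HTML string. The function names selfCloseMetaTags and extractMetaContents are made up for this example and are not part of jQuery. The patterns assume double-quoted attributes and no > character inside attribute values, so treat this as a starting point rather than a robust HTML parser.

```javascript
// Hypothetical helpers for illustration; neither name comes from jQuery.

// Turn <meta ...> into <meta .../> unless the tag is already self-closed.
function selfCloseMetaTags(html) {
  return html.replace(/<meta\b([^>]*[^\/>])>/gi, '<meta$1/>');
}

// Collect the content attribute of every meta tag that has one.
// Assumes double-quoted attribute values.
function extractMetaContents(html) {
  const contents = [];
  const tagPattern = /<meta\b[^>]*>/gi;
  const contentPattern = /\bcontent\s*=\s*"([^"]*)"/i;
  for (const tag of html.match(tagPattern) || []) {
    const found = contentPattern.exec(tag);
    if (found) {
      contents.push(found[1]);
    }
  }
  return contents;
}

const sample =
  '<meta charset="utf-8">' +
  '<meta property="og:type" content="website" />';

console.log(selfCloseMetaTags(sample));
console.log(extractMetaContents(sample));
```

The first helper leaves tags that already end in /> untouched, so it can be run on the response before handing it to $(data).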
You must append the response to an element and then get the meta.
$(data).appendTo("#test");
var m = $('#test').find('meta');
alert(m.length);
$('#test').html(''); // delete content
Example: http://jsfiddle.net/B4ATz/
I was trying to scrape the household links from this page :
https://www.sreality.cz/en/search/to-rent/apartments?page=2
For instance, for the first apartment I would like to obtain the link with:
https://www.sreality.cz/en/detail/lease/flat/1+kt/plzen-jizni-predmesti-technicka/25873756#img=0&fullscreen=false
However, the website relies heavily on JavaScript. By using requests.get() I only obtain an uninformative chunk of HTML code:
from requests import get
i = 2
url = f"https://www.sreality.cz/en/search/to-rent/apartments?page={i}"
response = get(url)
print(response.text)
-----------------------------
<!doctype html>
<html lang="{{ html.lang }}" ng-app="sreality" ng-controller="MainCtrl">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1.0,minimal-ui">
<!--- Nastaveni meta pres JS a ne pres Angular, aby byla nastavena default hodnota pro agenty co nezvladaji PhantomJS --->
<title ng:bind-template="{{metaSeo.title}}">Sreality.cz • reality a nemovitosti z celé ČR</title>
<meta name="description" content="Největší nabídka nemovitostí v ČR. Nabízíme byty, domy, novostavby, nebytové prostory, pozemky a další reality k prodeji i pronájmu. Sreality.cz">
<meta property="og:title" content="Sreality.cz • reality a nemovitosti z celé ČR">
<meta property="og:type" content="website">
<meta property="og:image" content="https://www.sreality.cz/img/sreality-logo-og.png">
-----------------------------
ETC ...
The question is therefore: how should I proceed with some simple scraping for websites of this kind?
Thanks in advance for the help.
I don't think that website has a public API, but by looking at the API calls in the browser's network tab I could fetch the details you need and build the links from them. Have a look at the code below.
Let me know if you have any questions :)
import time
import requests

page = 2
numberofresults = 20
# Current time in milliseconds since the epoch, sent as the "tms" parameter.
epochmiliseconds = round(time.time() * 1000)
paramsdict = {
    "category_main_cb": 1,
    "category_type_cb": 2,
    "page": page,
    "per_page": numberofresults,
    "tms": epochmiliseconds,
}
data = requests.get("https://www.sreality.cz/api/en/v2/estates", params=paramsdict).json()
for lead in data["_embedded"]["estates"]:
    locality = lead["seo"]["locality"]
    name = lead["name"]
    hash_id = lead["hash_id"]
    # Take the layout (e.g. "1+kt") from the listing name: the first word containing "+".
    typedata = [s for s in name.split(" ") if "+" in s][0].replace("\u00a0", " ").split(" ")[0]
    print(f"https://www.sreality.cz/en/detail/lease/flat/{typedata}/{locality}/{hash_id}")
First, ask the website's owners whether they provide an API for the information you need.
requests alone will not work for pages that are rendered with JavaScript. You should use Selenium, or Scrapy in combination with scrapy-selenium; both let the JavaScript run during scraping.
Feel free to ask if you have any other questions.
Pug template - how to add a span tag with class as interpolation into a mixins argument to surround one word so I can style it differently?
Here is the mixin from _section-headline.pug:
mixin section-headline(tagline, title, description, idSection)
  section.section(id=idSection)
    .container
      .row.row--tablet-center
        .col-xs-12.col-sm-10.col-md-12.col-lg-10
          .headline
            div.headline__top-line
            p.headline__tagline=tagline
            h1.headline__title=title
            p.headline__description=description
I'm calling it into team.pug:
doctype html
include ./includes/global/_landing.pug
//- the section-headline mixin is included here
include ./includes/global/_section-headline.pug
html
  head
    <meta charset="UTF-8">
    <title>Team - Cerv PC</title>
    <meta http-equiv="X-UA-Compatible" content="IE=edge">
    <meta name="viewport" content="width=device-width, initial-scale=1">
    <link rel="shortcut icon" type="image/png" href="/images/favicon.png">
  body
    - var greenText = '<span class="type--color-green">language</span>';
    +section-headline('guided by diversity', 'We speak your ' + greenText, 'We offer personal injury legal services in Hindi, Punjabi, and Urdu. We are happy to accommodate clients speaking other languages through the use of interpreters.', 'contact-us')
So, I'm trying to place a span tag, with or without text, into the mixin argument so that only one word is green rather than the whole title. How can I do this? I tried a few things but nothing seems to work.
I tried !{greenText}, but it shows up as false in the rendered HTML.
use
h1.headline__title !{title}
instead of
h1.headline__title=title OR h1.headline__title=!{title}
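To illustrate why the two forms render differently, here is a small sketch in plain JavaScript. The escapeHtml function is a simplified stand-in written for this example, not Pug's actual implementation; it only mimics the kind of escaping that buffered output (=title) applies, while !{title} inserts the string as it is.

```javascript
// Simplified stand-in for the HTML escaping applied to buffered template
// output. This is an illustration, not Pug's real escaping code.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

const greenText = '<span class="type--color-green">language</span>';
const title = 'We speak your ' + greenText;

// Roughly what h1.headline__title=title gives: the tags appear as literal text.
const escaped = '<h1 class="headline__title">' + escapeHtml(title) + '</h1>';

// Roughly what h1.headline__title !{title} gives: the span is real markup.
const unescaped = '<h1 class="headline__title">' + title + '</h1>';

console.log(escaped);
console.log(unescaped);
```

Pug's unescaped buffered code, h1.headline__title!= title, should give the same result as !{title}. Either way, only pass strings you trust, since the markup is inserted without escaping.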
I am using the JavaScript SDK for one of my Facebook canvas game applications. I am trying to implement a custom story share dialog to post a story on the user's wall.
The information available on the developer site is unclear and limited. The code below shares a story using the Open Graph API; it is taken from the Facebook developer site (https://developers.facebook.com/docs/sharing/reference/share-dialog). The code provided by Facebook works fine, as it uses a predefined action_type.
FB.ui({
method: 'share_open_graph',
action_type: 'og.likes',
action_properties: JSON.stringify({
object:'https://developers.facebook.com/docs/',
})
}, function(response){});
I have created an object (cricket) and an action (play) for the custom story on the Open Graph tab of my app in the FB developer console. I have also created a self-hosted object (an HTML page) called cricket.html; its content is below. I verified the page with the Open Graph Object Debugger, which shows all the information I provided, with no errors or warnings.
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
<meta property="fb:app_id" content="*************" />
<meta property="og:type" content="appnamespace:cricket" />
<meta property="og:title" content="App for u" />
<meta property="og:url" content="https://example.com/appnamespace/cricket.html" />
<meta property="og:description" content="Find me on facebook for u" />
<meta property="og:image" content="https://example.com/appnamespace/image/any_time_share.png" />
</head>
<body>
</body>
</html>
Below is my code, where I am replacing:
og.likes ---to---> appnamespace:play ("play" is my action).
Am I doing anything wrong here? Please let me know.
function customshare()
{
FB.ui({
method: 'share_open_graph',
action_type: 'appnamespace:play',
action_properties: JSON.stringify(
{
object:'https://example.com/appnamespace/cricket.html',
})
},
function(response){});
}
However, I am getting an error while executing FB.ui with method: 'share_open_graph' for the custom share.
I solved my problem by simply changing this:
object:'https://example.com/appnamespace/cricket.html',
To
cricket:'https://example.com/appnamespace/cricket.html',
I guess you need to provide the URL in the url field of the object for it to work; also, judging from your error message, the cricket field is missing.
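The fix above comes down to using the custom object type name as the key inside action_properties. A minimal sketch of that idea follows; buildCustomShareParams is a hypothetical helper written for this example, not part of the Facebook SDK, and the namespace, action and URL are the placeholders from the question.

```javascript
// Hypothetical helper (not part of the Facebook SDK): builds the argument
// for FB.ui so that the key inside action_properties matches the custom
// object type name instead of the generic "object".
function buildCustomShareParams(appNamespace, actionName, objectType, objectUrl) {
  const properties = {};
  properties[objectType] = objectUrl; // e.g. { cricket: 'https://...' }
  return {
    method: 'share_open_graph',
    action_type: appNamespace + ':' + actionName,
    action_properties: JSON.stringify(properties)
  };
}

const params = buildCustomShareParams(
  'appnamespace',
  'play',
  'cricket',
  'https://example.com/appnamespace/cricket.html'
);
console.log(params);

// In the browser, with the SDK loaded, this would be passed on as:
// FB.ui(params, function (response) {});
```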
I want to change the title on every page of my site, but when I view the source code I still see the old title.
Here is my code:
window.onload = function (){
setProductMeta();
}
function setProductMeta(){
var des = document.getElementById("description").setAttribute("content","dynamic meta description");
document.getElementById("keywords").setAttribute("content","dynamic meta keywords");
document.title = "Point of Sale System";
}
And these are the meta tags above it:
<title>Welcome to Atmostphere Technology</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Style-Type" content="text/css" />
<meta id="title" name="title" content="Welcome to Atmostphere Technology" />
<meta id="description" name="description" content="atmos is best service in cambodia" />
<meta id="keywords" name="keywords" content="atmos is best service in cambodia" />
The server delivers the source code to the browser.
The browser converts the markup into a DOM.
Then JavaScript runs and any DOM manipulation you perform is performed. It never touches the source code.
While it is sometimes useful to change the title with JavaScript (e.g. Facebook do it to indicate the number of alerts since the page was last interacted with), almost everything that consumes meta data does not execute JavaScript. If you want to change something fundamental about the page, do it in the source code, not with client side programming.
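As one possible way to "do it in the source code", the sketch below builds the title and meta tags as a string on the server, so they are already in the markup the browser (or a crawler) receives. The renderHead and escapeHtml names are invented for this example, and how the resulting string gets into your page depends on your server setup, which the question does not describe.

```javascript
// Hypothetical server-side helper: builds the head markup as a string so the
// per-page title and meta tags are part of the delivered source.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

function renderHead(page) {
  return [
    '<title>' + escapeHtml(page.title) + '</title>',
    '<meta id="description" name="description" content="' +
      escapeHtml(page.description) + '" />',
    '<meta id="keywords" name="keywords" content="' +
      escapeHtml(page.keywords) + '" />'
  ].join('\n');
}

console.log(renderHead({
  title: 'Point of Sale System',
  description: 'dynamic meta description',
  keywords: 'dynamic meta keywords'
}));
```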
I am trying to post an action to the Facebook Timeline using the JS API
FB.api('/me/application:action_type' + '?object_type='+document.location.href,'post',
function(response) {
if (!response || response.error) {
alert("error");
} else {
alert("success");
}
});
Posting works quite well and the API returns no error. A new activity appears on the Timeline, but only as a small text entry within the "recent activities" box.
What could be the problem if the action is not displayed as in the Attachment Preview of the Action Type settings?
I have linked all the properties from the Object Type and tested my Object URL with the Facebook Debugging Tool, and it looks like all the attributes can be parsed correctly by the Facebook scraper.
I also defined an aggregation layout for the action type. So what can be the reason that no attachment is displayed?
You can see a single action attachment layout on your timeline by setting "Shown on timeline" instead of "Allowed on timeline", but by default you will never see a single action on the timeline. You will see the single action attachment in the ticker (and maybe in the news stream).
If the user doesn't change the display mode, you will only see aggregations on a timeline.
I also reported this issue as a bug to Facebook. Their reply was that this behavior is by design and the attachment layout only appears in the activity log or when multiple activities have been posted to a user's timeline.
Have you set the object parameters on your web page? For instance:
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:fb="http://ogp.me/ns/fb#">
<head prefix="og: http://ogp.me/ns# fb: http://ogp.me/ns/fb# app: http://ogp.me/ns/fb/app#">
<meta property="fb:app_id" content="123" />
<meta property="og:type" content="app:action" />
<meta property="og:url" content="http://www.example.com/" />
<meta property="og:title" content="Testing Title" />
<meta property="og:description" content="testing Description" />
<meta property="og:image" content="http://www.example.com/image.jpg" />
You will need to get the correct code off the Facebook Developers website but it is essential that you create your object in order for Facebook to get the parameters from your webpage.
You can test it by simply going into aggregations->preview->add-action, and in event, just paste the webpage. You will see instantly if it works.