Stripping content from a website to my website - javascript

I am trying to make a website that streams from a wiki page and pulls the content down into my page.
Before anyone says it is illegal to scrape a website, mind you this is a wiki site, and at the bottom of each page of that site there is:
Content is available under Attribution-Noncommercial-Share Alike 3.0 Unported.
Meaning I am free to use and REUSE the info that is provided to me.
This is the wiki page: http://wiki.mabinogiworld.com/
Basically I am trying to make a website that takes the server online status table directly and puts it into my webpage, but at the same time I want to keep it updated, so it has to re-fetch the table the next time the webpage is refreshed.
With this, I ran into the cross-domain issue and found something related to YQL that seems to be able to help me, but I still can't figure it out.
This is what I did so far:
YUI().use("yql", function (Y)
{
    var query = 'SELECT * FROM html WHERE url="http://wiki.mabinogiworld.com/" and xpath="//div/table"';
    Y.YQL(query, function (results)
    {
        var temp;
        var size = 0;
        temp = results.query.results.table;
        size = temp.length;
        for (var i = 0; i < size; i++)
        {
            // Loop through the results and find the exact table I want
        }
    }); // close the Y.YQL callback
});     // close the YUI().use callback
With the above code (the loop is too messy, so I cut it out) I am able to get the exact table I want, with all its sub-columns and rows, but it is returned in a structure that I have no idea how to translate back into HTML.
What can I do to get the table from the wiki page and put it onto my webpage? And what is the variable type of "results" anyway? I can't seem to use it in any way other than property access.
Thank you.

Try doing something like what is posted here: YQL JSON script not returning?
Basically it makes cross-domain AJAX possible with the help of YQL.
Source: http://net.tutsplus.com/tutorials/javascript-ajax/quick-tip-cross-domain-ajax-request-with-yql-and-jquery/
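For reference, the pattern from that tutorial boils down to something like the sketch below. Treat it as a sketch: the public YQL endpoint and the data.query.results shape follow Yahoo's documented API, but the helper name is made up.

function crossDomainGet(site, callback)
{
    // Build a YQL statement that fetches the remote page's HTML;
    // callback=? makes jQuery switch to JSONP, dodging the same-origin policy
    var yql = 'http://query.yahooapis.com/v1/public/yql?q=' +
              encodeURIComponent('select * from html where url="' + site + '"') +
              '&format=json&callback=?';
    $.getJSON(yql, function (data)
    {
        if (data.query && data.query.results)
        {
            callback(data.query.results);
        }
    });
}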
Well, if you really want to keep the formatting and the style of the table, make your own table, put your own style onto it, and then extract the info out of YQL and start populating the table; a rough sketch of that follows below. That way it can be done with your method. YQL is really useful; I started playing with it a bit and find it very powerful.
Not sure if that would violate the copyright rules or not, though, since you are indeed reusing the data in your own format.
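Here is what the populating step could look like with jQuery. The tr/td property names mirror how YQL maps XHTML into JSON, but the exact shape depends on the table you select, so treat this as an outline rather than working code:

function buildTable(yqlTable)
{
    var $table = $('<table class="server-status"></table>'); // your own table, your own style
    $.each([].concat(yqlTable.tr || []), function (i, row)
    {
        var $row = $('<tr></tr>');
        $.each([].concat(row.td || []), function (j, cell)
        {
            // a cell comes back as a plain string or an object, depending on its markup
            $row.append($('<td></td>').text(typeof cell === 'string' ? cell : (cell.content || '')));
        });
        $table.append($row);
    });
    $('#serverStatus').append($table);
}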

YQL Solution
First off, your XPath query is way too broad. Looking at the wiki page's source, I came up with this:
//div[@id='mw-content-text']/table//table[@class='center']
Unfortunately, the table that you want doesn't have an ID on it, so selecting tables with a center class was the best I could do. This returns 5 different tables; you want the first one. I tried to use the "first element" predicate (table[@class='center'][1]), but that didn't seem to do anything. Notice that the XML in the <results> element is straight XHTML that you could dump into your page. (That's assuming you're requesting the results as XML, not JSON.)
I found Yahoo's YQL Console really helpful. It lets you fine-tune your query before trying to incorporate it with JavaScript to parse the results.
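Plugged back into the original YUI snippet, the refined query would look roughly like this (a sketch; since the [1] predicate didn't work, it just takes the first table from the results in JavaScript):

YUI().use("yql", function (Y)
{
    var query = 'SELECT * FROM html WHERE url="http://wiki.mabinogiworld.com/" ' +
                'AND xpath="//div[@id=\'mw-content-text\']/table//table[@class=\'center\']"';
    Y.YQL(query, function (results)
    {
        // results.query.results.table holds the 5 matching tables;
        // the server status table is the first one
        var serverTable = results.query.results.table[0];
    });
});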
jQuery Solution
This isn't the optimal solution, but it circumvents the need to parse XML in JavaScript or convert JSON to HTML. You can do an AJAX call to get the HTML and then strip out everything besides the table:
var scrapeUrl = 'http://www.example.com/';
$.ajax({
    type: "GET",
    url: scrapeUrl,
    success: function (html) {
        var $scrapedElement = $(html).find("h1");
        $("#scrapedDataDiv").html($scrapedElement);
    },
    error: function () {
        alert("Problem getting table");
    }
});
In this example, the code downloads the page at www.example.com and scrapes out all of the h1 tags, thanks to jQuery's handy selectors. The h1 tags are then placed in a div with the id scrapedDataDiv.
Obviously, you still have to deal with XSS/same-origin issues. You can do this by setting up a proxy on your server.
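If your server runs node.js, a bare-bones proxy can be as small as the sketch below (no error handling, and the /proxy path is an arbitrary choice); your $.ajax call would then target /proxy on your own origin instead of the wiki directly:

var http = require('http');

http.createServer(function (req, res) {
    if (req.url === '/proxy') {
        // fetch the remote page server-side and stream it back same-origin
        http.get('http://wiki.mabinogiworld.com/', function (remote) {
            res.writeHead(remote.statusCode, { 'Content-Type': 'text/html' });
            remote.pipe(res);
        });
    } else {
        res.writeHead(404);
        res.end();
    }
}).listen(8080);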

Related

How do I retrieve a variable's content using Python Requests module?

I started a little personal project just for fun. I hope posting what I'm doing here doesn't break any of the local rules. If so, let me know and I'll take the question down. No need to flag me for anything.
I'm trying to pull the background image URL of my Chromium homepage. Just for reference, the URL is https://www.mystart.com/new-tab/newtab/. When going to this page, nice background images are loaded. I'm trying to grab those images for personal, not commercial, use.
What I've traced down is that the page listed above calls out to another similar page: https://www.mystart.com/new-tab/newtab/newtab/. Currently, on lines #1622 through #1636, two significant lines read:
var fastload = JSON.parse(localStorage.getItem('FASTLOAD_WALLPAPER_557b2c52a6fde1413ac3a48a'))
...
var url = fastload.info.cache_url || fastload.info.data_uri || fastload.info.image;
The value returned in url is the URL of the background image. If I drop into the Chromium console and use console.log(url), I see the exact data I'm trying to scrape. I'm wondering how I do that through Python, since the actual text value of url is not visible in the page source.
I have looked all over to try to find the localStorage object definition, with no luck. I'm pulling the page with result = requests.get("https://www.mystart.com/new-tab/newtab/newtab/") and then looking through result.text. I've also tried using BeautifulSoup to parse through things (not that this is really any different), but I'm still not getting the results I'm looking for.
Being that I'm a hobbyist coder, I feel like I'm missing something simple. I've searched for answers, but I must be using the wrong keywords. I'm finding a lot of answers about parsing URLs that can be read directly, but not about reading one out of the contents of a variable.
If you look at the requests being made, there is a JSON response with info for 350 images. The image_id is used in the URL, e.g.
https://gallery.mystartcdn.com/mystart/images/<image_id>.jpeg
so for id=154_david-wilson-moab:
https://gallery.mystartcdn.com/mystart/images/154_david-wilson-moab.jpeg
Parse the JSON and get the url for every image.
Note: this is not an answer to your question, but it looks like an XY problem - this solves the underlying problem of retrieving the image urls.

I can't find Xpath on this website or maybe wrong syntax

I'm trying to scrape data from this URL: https://drive.getbigger.io/#/stores. However, I couldn't find the XPath of the text I want to export, which is the producers' offers.
First I tried the IMPORTXML function in Google Sheets:
=IMPORTXML(A1;"/html/body/flt-ruler-host/div[23]/p")
and it gave me an N/A error: "the imported content is empty".
So I tried to scrape the website with add-ons and ParseHub, and every time I got a .csv file in which I couldn't find the data I want to export.
Also, I can't find the right XPath for the data I would like to scrape: when I use the inspection tool, the data isn't in the <body> part.
However, the XPath I used in my IMPORTXML function is some code I found in the <body> part which is close to the text I'd like to extract (the producer's offer).
It seems that the XPath I am looking for is tied to some JavaScript code in the <head> part; also, when I hover over the page with the selection tool in order to scrape the data, it selects the whole page, maybe because there is a scrolling <div>.
So I wonder if the website uses some kind of protection against scraping.
Please, guys, tell me:
Can I find the right XPath in order to scrape with the IMPORTXML function?
Should I extract the data with a Python script?
If the website blocks my attempts, how could I get around that?
You won't be able to scrape anything with the IMPORTXML formula, since the website uses dynamic rendering (JavaScript).
So yes, Python + Selenium (or other combinations) could do the job. The website won't block you if you follow some rules (switch the user-agent, add pauses between requests).
You would probably need these XPaths:
Product description:
//p[1][string-length(text())>5][parent::flt-dom-canvas]
Product price:
//p[3][contains(text(),"€") and not (contains(text(),","))][parent::flt-dom-canvas]
However, I think the most elegant way to get the data is to use the API the website relies upon. With Google Sheets and a custom ImportJSON script, you can obtain something like this (result for "fromage" as the query):
It won't work out of the box; you'll have to modify some parts of the script, since it won't load a JSON that is called with POST and needs headers in the request. In a nutshell, you need to construct the payload part, add headers to the request ("Bearer XXXXX"), and add a parameter to a function to retrieve the results.
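In Apps Script terms, the modified fetch would look something like this sketch. UrlFetchApp.fetch(url, params) is the real Apps Script call, but the endpoint URL, the token, and the payload fields here are placeholders rather than the site's actual API:

function fetchOffers(query) {
  var params = {
    method: 'post',
    contentType: 'application/json',
    headers: { Authorization: 'Bearer XXXXX' },   // the bearer token found in the network tab
    payload: JSON.stringify({ query: query })     // hypothetical payload shape
  };
  var response = UrlFetchApp.fetch('https://example.com/api/search', params); // placeholder URL
  return JSON.parse(response.getContentText());
}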
All this depends on your objective and your expected output.
EDIT: For references (constructing the payload, adding parameters) you can read:
https://developers.google.com/apps-script/reference/url-fetch/url-fetch-app#fetchurl,-params
Also look at the network tab of your browser's developer tools in order to find the URL of the API and the correct parameters to send.

How to mimic Facebook's "link share" functionality using node.js and javascript

So what I want to mimic is the link-share feature Facebook provides. You simply enter the URL, and then FB automatically fetches an image, the title, and a short description from the target website. How would one program this in JavaScript with node.js and whatever other JavaScript libraries may be required? I found an example using PHP's fopen function, but I'd rather not include PHP in this project.
Is what I'm asking an example of web scraping? Do I just need to retrieve the data from inside the meta tags of the target website, and then also get the image tags using CSS selectors?
If someone can point me in the right direction, that'd be greatly appreciated. Thanks!
Look at THIS post. It discusses scraping with node.js.
HERE you have lots of previous info on scraping with javascript and jquery.
That said, Facebook doesn't actually guess what the title, description, and preview are; they (at least most of the time) get that info from meta tags present on sites that want to be more accessible to FB users.
Maybe you could make use of that existing metadata to pull titles, descriptions, and image previews. The docs on the available metadata are HERE.
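To give an idea, here is a sketch of reading those Open Graph tags with the older jsdom.env API (the og:* property names are Facebook's documented convention; the URL and the rest of the wiring are illustrative):

var jsdom = require('jsdom'); // older jsdom (pre-v10) API

jsdom.env({
    url: 'http://www.example.com/',
    // keep the fetched page's own scripts (ads etc.) from running
    features: { ProcessExternalResources: false },
    done: function (errors, window) {
        if (errors) { return console.error(errors); }
        function og(name) {
            var el = window.document.querySelector('meta[property="og:' + name + '"]');
            return el ? el.getAttribute('content') : null;
        }
        console.log(og('title'), og('description'), og('image'));
    }
});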
Yes, web scraping is required, and that's the easy part. The hard part is the generic algorithm for finding headings and the relevant text and images.
How to scrape
You can use jsdom to download a page and build a DOM structure on your server, then scrape it with jQuery on your server. You can find a good tutorial at blog.nodejitsu.com/jsdom-jquery-in-5-lines-on-nodejs, as suggested by @generalhenry above.
What to scrape
I guess a good way to find the heading would be:
var h;
for (var i = 1; i <= 6; i++) { // try h1 first, then h2, and so on
    var $heading = $('h' + i).first();
    if ($heading.length) { // .first() always returns a (truthy) jQuery object, so test length
        h = $heading;
        break;
    }
}
Now h will hold the heading, or stay undefined if none is found. The alternative to this could be to simply grab the page's title tag. :)
As for the images: list all, or the first few, images on the page that are reasonably large, i.e. so as to filter out sprites used for buttons, arrows, etc.
And while fetching the remote data, make sure the ProcessExternalResources flag is off. This ensures that script tags for ads do not pollute the fetched page.
And yes, the relevant text would be in the tags that follow h.

Creating a scroll down form using java - is this the code?

I am trying to understand the code (if this is even the code) which comes from the thisnext.com website. Basically it allows any user surfing any website to post recommendations. I'm interested to know what the code means.
javascript:(function () {
    var x = document.getElementsByTagName('head').item(0);
    var so = document.createElement('script');
    var h = location.hostname.split('.');
    var a = new Array();
    for (i = h.length - 1; i >= 0; i = i - 1) { a[a.length] = h[i]; }
    var d = a.join('/');
    var s = 'http://www.thisnext.com/js/bookmarklet/' + d + '/';
    if (typeof so != 'object') so = document.standardCreateElement('script');
    so.setAttribute('src', s);
    so.setAttribute('type', 'text/javascript');
    x.appendChild(so);
})();
Thank you in advance!
The script loads another script (probably some sort of JSONP setup) from that site (thisnext.com). It does this by creating a new script element and building its URL from the current page's hostname (the hostname of whatever page the bookmarklet is run on).
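Tracing it by hand for a page on http://www.example.com/ shows how that URL gets built:

var h = 'www.example.com'.split('.'); // ["www", "example", "com"]
var a = new Array();
for (var i = h.length - 1; i >= 0; i = i - 1) { a[a.length] = h[i]; }
var d = a.join('/');                  // "com/example/www"
// so the injected script's src becomes:
// http://www.thisnext.com/js/bookmarklet/com/example/www/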
I edited the code to get rid of the "%20" characters; those were there because the script is set up to be used as a bookmarklet.
The direct answer to your question is "no". The script may load some other code that does what you're looking for, but by itself it does not do anything of the sort.

question regarding google maps api

If I'm loading data for the markers from a database, do I write the output queried from the DB into a JavaScript file, or is there a cleaner way of doing it?
thanks
Yeah, writing to a file is a good way to do it. Just write the data as JSON. Your file would look like:
var map = {waypoints:[...]};
And then you can do:
for (var i = 0; i < map.waypoints.length; ++i) {
    addWaypoint(map.waypoints[i]);
}
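addWaypoint isn't defined above; assuming the Google Maps JavaScript API v3 and waypoints with lat/lng fields, it could be a thin wrapper like this sketch:

// mapInstance is an existing google.maps.Map; wp is one entry from map.waypoints
function addWaypoint(wp) {
    return new google.maps.Marker({
        position: new google.maps.LatLng(wp.lat, wp.lng),
        map: mapInstance
    });
}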
I actually do some static caching of nodes using this method: http://www.trailbehind.com/site_media/javascript/gen/national-parks.js
We use that set of National Parks a lot, so we cache it. But we also have urls where you can fetch JSON for a node on the fly, such as: http://www.trailbehind.com/map/node/7538973/632/735/
This URL gets the map for node 7538973, and specifies the dimensions of their map in pixels as well.
The needed JavaScript can of course be generated by whatever language you prefer to use; see e.g. pymaps for a Python example. While pymaps actually inserts the JS code into an HTML template, if you're writing a web app you can perfectly well choose to serve that JS code on the fly at an appropriate URL and use that URL in a <script> tag in your pages.
Depending on the size of your application, you may want to consider printing out plain JavaScript.
I have a map that uses server-side clustering, so markers update frequently. I found that parsing JSON markers slowed the app significantly, and simply wasn't necessary.
If speed is an issue, I'd suggest removing all of the unnecessary layers possible (JSON, AJAX, etc.). If it's not, you'll be just fine with JSON, which is cleaner.
I agree with Andrew's answer (+1).
I guess the only point I would add is that rather than including some server side generated JavaScript, you could use an AJAX request to grab that data. Something like:
var request = new Request.JSON({
    url: 'get_some_json.php',
    onSuccess: function (data) {
        // do stuff with the data
    }
}).get();
(This is a MooTools AJAX thing, but you could use any kind of AJAX request object.)
Edit: ChrisB makes a good point about the performance of parsing JSON responses, and re-reading my answer I certainly didn't make myself clear. I think AJAX requests are suitable for re-requesting data based on parameters generated by user interaction. An example use case might be a user filtering the data displayed on the map: you might grab the filtered data via an AJAX/JSON request rather than reloading the page.
