Web-scraping a website, that is being loaded with javascript (using javascript)


I am trying to gather line-ups from football/soccer reports. I decided to web-scrape the data from a reports provider, but their websites are loaded with javascript.
To be more specific, let's take this link to a flashscores.co.uk match.
First, they restrict CORS, which means I used allorigins.me to avoid it and then I used this code:
function readurl(url, elementID){
    var url = "http://allorigins.me/get?url=" + encodeURIComponent(url) + "&callback=?";
    var xhttp = new XMLHttpRequest();
    xhttp.onreadystatechange = function() {
        if (this.readyState == 4 && this.status == 200) {
            document.getElementById(elementID).innerHTML = this.responseText;
        }
    };
    xhttp.open("GET", url, true);
    xhttp.send();
}
The result was something like this, and it looks the same all the way down (still \n and \t, not the real content). I guess the problem is that the flashscores website uses JavaScript to load the data, but allorigins.me did not "wait" until the whole website was loaded. Here is another look, where it seems the content is being loaded with JavaScript.
The desired result is to gather the starting elevens of both teams (Allonso M., Arrizabalaga K., Azpilicueta C.,...). I inspected the website and found that every name is inside an HTML tag: <div class="name">PLAYER'S NAME HERE</div>.
Any idea how to avoid both problems at once?
CORS restriction
The delay before the web is "filled" with data from javascript
I am trying to use client-side languages (no PHP).
Thank you :)

There are a few problems with your question:
CORS is used to protect resources on the server side, and you need the client side resources, which are mostly public, so you do not need a way to avoid it.
The problem is not "waiting" until the page has loaded; the problem is that you need to run the page's scripts yourself.
I recommend using something like JSDom with Node.js for this task. It should be quite simple.
A great blog post about web scraping with Node.js (without script execution): here
official JSDom npm page: here
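Once the page's scripts have run (for example inside JSDom) and the rendered HTML is available as a string, pulling out the names is the easy part. A minimal sketch, assuming the <div class="name"> markup described in the question; extractPlayerNames is a hypothetical helper, and the regular expression only handles this simple case where each div holds plain text:

```javascript
// Hypothetical helper: collects the text of every <div class="name">...</div>
// in an already-rendered HTML string. Assumes the divs contain plain text only.
function extractPlayerNames(html) {
    var pattern = /<div class="name">([^<]*)<\/div>/g;
    var names = [];
    var match;
    while ((match = pattern.exec(html)) !== null) {
        names.push(match[1].trim());
    }
    return names;
}

// Sample input made up for illustration, not real output from the site.
var sample = '<div class="name">Arrizabalaga K.</div>\n\t<div class="name">Azpilicueta C.</div>';
console.log(extractPlayerNames(sample));
```

For anything more complicated than this, querying the DOM that JSDom builds is likely to be more robust than a regular expression.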
Good Luck !

Related

Perform window.location.replace multiple times in for loop

I'm not able to get my script to run window.location.replace multiple times. I have a flask application that creates files from the unique ids found in access logs streamed on the server. Once a file is on the server, I have a flask route that, when the user is redirected to https://somewebsite.com/getFile/<id>, then pushes the created file to the client.
Here's my script below:
<script>
var xhr = new XMLHttpRequest();
xhr.onreadystatechange = function(){
    var response = '';
    if(xhr.readyState === 4 && xhr.status === 200){
        response = xhr.responseText;
        response = response.substring(0, response.length-1);
        response = response.split('\n');
        for(x in response){
            url_path = response[x];
            window.location.replace(url_path);
        };
    };
};
xhr.open('GET', '{{ url_for('stream') }}', true);
xhr.send();
</script>
I did a few console.log() calls to see if the for loop is running correctly, and it is. Even the url_path given to window.location.replace is correct. One thing to note is that when redirected to https://somewebsite.com/getFile/<id>, the browser doesn't technically change to that URL path, because flask isn't rendering a template but returning a file download, so the browser stays at the current URL path after the file is downloaded.
I'm not sure why I am not able to get the script to run window.location.replace more than once. It seems like if there's 2 url_path in the response object, only the last one is being downloaded. Same goes with 3 or 4 paths. Any insight would be helpful. Thanks

An alternative solution I just figured out was instead of redirecting the same browser to multiple urls, why not just open them up in different tabs. I did this by using window.open, which to my surprise did work.
However, my browser pop up blocker did block them out at first, but after changing the settings and allowing pop ups for the page, I was successfully able to have multiple files downloaded to the client. Also, since there's no template rendering on the server side, the tabs themselves don't actually pop up, so the browser won't be flooded with tabs.
I'm still interested in knowing why window.location.replace didn't work if anyone knows why.

Your code is not working because window.location.replace literally replaces the current document (see documentation) with the one at the new URL.
The call only schedules that navigation. The loop keeps running, and each later call to window.location.replace cancels the navigation that is still pending, which is why only the last URL in the response is ever downloaded.
That is why window.open works for your situation: each call opens a new document separate from the one it is called from. But be careful, because not all parameters work in all browsers (check this for compatibility specifications).
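If pop-up blockers are a concern, another commonly used approach is to point one hidden iframe at each download URL, so the current document is never replaced. A rough sketch, assuming the server sends each file as a download (as the flask route in the question appears to do); startDownloads and its doc parameter are hypothetical names:

```javascript
// Hypothetical helper: creates one hidden iframe per URL so that every
// download starts without navigating away from the current page.
// `doc` defaults to the browser's document; it is a parameter only so the
// function can be exercised outside a browser.
function startDownloads(urls, doc) {
    doc = doc || document;
    return urls.map(function (url) {
        var frame = doc.createElement('iframe');
        frame.style.display = 'none';
        frame.src = url;
        doc.body.appendChild(frame);
        return frame;
    });
}
```

In the question's handler this would take the place of the for loop, e.g. startDownloads(response). Browsers may still ask the user before allowing several automatic downloads from one page.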

Give out widget (web application) with activation code

I don't know if this is the right place to ask this question but I'm just going to do it.
I've been trying to figure out how I want to give out my web application.
This is my situation:
I've created a web application. People who want to use this application are free to do so. BUT they need to be signed up on our website.
The application needs to be bound to a unique key. This key is generated as soon as they sign up on our website. The application is hosted on our server.
The web application needs to be easy to implement.
I've seen:
I've seen other services generating a JS script, for example:
<script type='text/javascript' data-cfasync='false'>window.exampleApi = { l: [], t: [], on: function () { this.l.push(arguments); } }; (function () { var done = false; var script = document.createElement('script'); script.async = true; script.type = 'text/javascript'; script.src = 'https://app.example.com/VisitorWidget/WidgetScript'; document.getElementsByTagName('HEAD').item(0).appendChild(script); script.onreadystatechange = script.onload = function (e) { if (!done && (!this.readyState || this.readyState == 'loaded' || this.readyState == 'complete')) { var w = new PCWidget({c: 'e01fe420-5c14-55p0-bbec-229c7d9t2f0cf', f: true }); done = true; } }; })();</script>
What have I done so far:
I have made a simple web application that requires you to sign up and log in. From this point on you get an iframe that you can use. I used an iframe just for testing. The web application consists of HTML, CSS, PHP(mostly), JS and jQuery.
I've tried:
I've tried this. I got stuck on the Python part. I have never used/looked into this language.
Also, I'm kind of afraid that people are going to "use" my web application without the right to do so.
What I'm thinking is that the generated key needs to be sent to our website to check if the key is correct.
Tips, tricks, guides?
Do you have any tips or tricks? Maybe criticism?
JSONP, CORS or anything? Never done JSONP or CORS, so any tips about that would be nice too!
Anything is appreciated!

Yes, in general, you got the idea right.
The client signs in to your website, registers his domain and receives an ID (let's call it that).
He embeds the JS in his site with the ID filled in.
An authentication request that holds the ID is sent to the server, which also validates the domain (otherwise, it would be easy to just copy the JS and put it on my own site).
As for the exact implementation, there are tons of examples out there. There is no one-size-fits-all way, and that is the best part of it: fully custom to fit your needs. For example, Twitter:
<a class="twitter-timeline"
data-widget-id="600720083413962752"
href="https://twitter.com/TwitterDev"
data-tweet-limit="3">
Tweets by #TwitterDev
</a>
Since php tag is provided and used, I see no reason why to mix it with python unless something really specific is required (and I doubt that). So stick to php for now. About the security of the application - using the ID and some domain that requests will be validated against is fair enough. Maybe there are some extra metrics larger sites/services are using, but don't worry about it much.
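The ID-plus-domain check could look roughly like this. It is written in JavaScript only for illustration (the same logic translates directly to PHP), and everything named here is hypothetical: the registry object, the sample ID and domain, and the function name. It also assumes the request's Origin (or Referer) header is available to the server; clients that are not browsers can fake that header, so this stops casual copying rather than a determined attacker.

```javascript
// Hypothetical in-memory registry: widget ID -> domain registered at sign-up.
// A real application would look this up in its database.
var registeredWidgets = {
    'demo-id-123': 'customer.example.com'
};

// Returns true only when the ID is known and the request comes from the
// domain registered for it. `originHeader` is the value of the request's
// Origin (or Referer) header, e.g. "https://customer.example.com".
function isWidgetRequestAllowed(widgetId, originHeader, registry) {
    if (!originHeader || !Object.prototype.hasOwnProperty.call(registry, widgetId)) {
        return false;
    }
    try {
        return new URL(originHeader).hostname === registry[widgetId];
    } catch (e) {
        return false; // malformed header value
    }
}
```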
Extra read:
What are the differences between JSON and JSONP?
CORS - What is the motivation behind introducing preflight requests?

Chrome Extension / App Settings Javascript

So I'm trying to think of a method to get a "local" file, not from the user but from the Google Chrome App or Extension itself (whichever it may be, since I am building both). Basically, it'll be my settings JSON. I need access to it from my Options page and from my Background page; access from my content scripts would be nice, but I can live without it.
Sample Settings.json
{
    "defaultResults": "all",
    "view":"full",
    "results":"cur",
    "count":"5",
    "omni":{
        "h8":{
            "title":"Hello World",
            "url":"www.someurl.com"
        }
    }
}
So does anyone have any real options for this? I'm not positive whether Google has already implemented a native function for this, such as chrome.getAppFile("file URL"); or something of that nature. I'd rather not use Ajax inside my app for this file, and I'd rather not use it everywhere. So hopefully someone here will have a reasonable idea how I should go about this.

You can do this in 2 ways, as described in this answer.
By assigning your JSON object to a variable, saving the script as settings.js, and including it in your background page, as follows:
settings.js looks as follows:
var settings = {"param":value,...}; //Your JSON object
then, in your background page:
<script src="settings.js"></script>
By making an AJAX call to your settings.json from your background page:
var xhr = new XMLHttpRequest();
xhr.onreadystatechange = handleStateChange; // Implemented elsewhere.
xhr.open("GET", chrome.extension.getURL('/config_resources/config.json'), true);
xhr.send();
As discussed in the comments, you cannot use FileSystem API in chrome extensions. Only chrome apps have access to it. Either way, the FileSystem API works in a sandboxed zone, so I'm not sure if you can write to files packed with the extension.
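For the AJAX variant, the handler that is "implemented elsewhere" might look something like the sketch below. readSettings and its callback signature are hypothetical names chosen for this example, and the sketch assumes the packaged file contains valid JSON like the sample in the question:

```javascript
// Hypothetical handler body: once the request has finished, parse the
// response text as JSON and hand the result to a callback.
function readSettings(xhr, onSettings) {
    if (xhr.readyState !== 4) {
        return; // request not finished yet
    }
    var settings;
    try {
        settings = JSON.parse(xhr.responseText);
    } catch (err) {
        onSettings(err, null); // the file did not contain valid JSON
        return;
    }
    onSettings(null, settings);
}

// Possible wiring, replacing the handleStateChange assignment:
// xhr.onreadystatechange = function () {
//     readSettings(xhr, function (err, settings) { /* use settings.view etc. */ });
// };
```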

Injecting HTML into existing web pages

I'm interested in the concept of injecting a bit of HTML into existing web pages to perform a service. The idea is to create an improved bookmarking system - but I digress, the specific implementation is unimportant. I'm quite new to web development and so I have no definite idea as to how to accomplish this, though I have noticed a couple of possibilities.
I found out I can right click > 'inspect element' and proceed to edit my browser's version of the HTML corresponding with the webpage I'm viewing. I assume that this means I can edit what I see and interact with. Could I possibly create a script that ran from a button on bookmarks bar that injected an Iframe which linked to a web service of my making? (And deleted itself after being used).
Could I possibly use a chrome extension to accomplish this? I have no experience with creating extensions and so I have no clue what they're capable of - though I wouldn't be against learning.
Which of these would be best? If they are even valid ideas. Or is there another way that I've yet to know of?
EDIT: The goal is to have a user click a button in the browser if they would like to save this page. They are then presented an interface visually independent of the rest of the page that allows them to categorize this webpage according to their interests. It would take the current link, add some information such as a comment, rating, etc. and add it to the user's data. This is meant as a sort of side-service to a website whose purpose would be to better organize and display the browsing information of the user.

Yes, you can absolutely do this. You're asking about Bookmarklets.
A bookmarklet is just a bookmark where the URL is a piece of JavaScript instead of a URL. They are very simple, yet can be capable of doing anything to a web page. Full JavaScript access.
A bookmarklet can be engaged on any web page -- the user simply has to click the bookmark(let) to launch it on the current page.
Bookmark = "http://chasemoskal.com/"
Bookmarklet = "javascript:(function(){ alert('I can do anything!') })();"
That's all it is. You can create a bookmarklet link which can be clicked-and-dragged onto a bookmark bar like this:
Bookmarklet
Bookmarklets can be limited in size, however, you can load an entire external script from the bookmarklet.
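Loading an external script from a bookmarklet can be sketched as follows. makeLoaderBookmarklet is a hypothetical helper and the script URL is a placeholder; the helper just builds the javascript: URL that would go into the bookmark:

```javascript
// Builds a bookmarklet URL that injects an external script into the current
// page. `scriptUrl` is a placeholder for wherever the real script is hosted.
function makeLoaderBookmarklet(scriptUrl) {
    var body =
        '(function(){' +
        'var s=document.createElement("script");' +
        's.src=' + JSON.stringify(scriptUrl) + ';' +
        'document.body.appendChild(s);' +
        '})();';
    // Percent-encode the code so it survives being stored as a URL.
    return 'javascript:' + encodeURIComponent(body);
}

console.log(makeLoaderBookmarklet('https://example.com/bookmark-service.js'));
```

The printed string can be used as the href of a link that users drag to their bookmark bar. Pages with a strict Content Security Policy may refuse to load the external script.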

You can achieve something like the <iframe> you describe. Here are some steps that may help; simply put:
Create an XMLHttpRequest object and make a request for a page through it.
Set the innerHTML of an element to the response text of that request, i.e. the HTML structure.
Let's assume you have an element with id="Result" in your HTML. The request goes like this:
var req = new XMLHttpRequest();
req.open('GET', 'http://example.com/mydocument.html', true);
req.onreadystatechange = function (aEvt) {
    if (req.readyState == 4 && req.status == 200) {
        Result.innerHTML = req.responseText;
    }
};
req.send(null);
Here's an improved version in the form of a fiddle.
When you're done, you can delete that injected HTML by simply:
Result.innerHTML = '';
And then anything inside it will be gone.
However, you can't make requests to other servers this way because of the browser's same-origin policy: the page you request has to be on the same domain as yours, unless the other server explicitly allows cross-origin requests. Take a look at Using XMLHttpRequest on the MDN reference pages for more information.

Read a text file

I have looked everywhere and surprisingly can't find a good solution to this! I've got the following code that is supposed to read a text file and display its contents. But it's not reading, for some reason. Am I doing something wrong?
FTR, I can't use PHP for this. It's gotta be Javascript.
var txtFile = new XMLHttpRequest();
txtFile.open("GET", "http://www.mysite.com/todaysTrivia.txt", true);
txtFile.send(null);
txtFile.onreadystatechange = function() {
    if (txtFile.readyState == 4) { // Makes sure the document is ready to parse.
        alert(txtFile.responseText+" - "+txtFile.status);
        //if (txtFile.status === 200) { // Makes sure it's found the file.
        var doc = document.getElementById("Trivia-Widget");
        if (doc) {
            doc.innerHTML = txtFile.responseText ;
        }
        //}
    }
    txtFile.send(null);
}
Any good ideas what I'm doing wrong? It just keeps giving me a zero status.
EDIT: I guess it would be a good idea to explain why I need this code. It's basically a widget that other folks can put on their own websites that grabs a line of text from my website and displays it on theirs. The problem is that it really can't be server-side since I've got zero control over everyone else's sites that use this.

If this is cross domain, you won't be able to do this with an xmlhttprequest due to the same origin policy.
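Whether a request counts as cross domain can be checked by comparing origins: scheme, host and port must all match. A small sketch using the standard URL class; isSameOrigin is a hypothetical helper and the second page URL is made up for the example. It does not model servers that opt in to cross-origin requests by sending CORS headers:

```javascript
// Two URLs share an origin only if scheme, host and port all match.
function isSameOrigin(pageUrl, requestUrl) {
    // Relative request URLs are resolved against the page URL first.
    return new URL(requestUrl, pageUrl).origin === new URL(pageUrl).origin;
}

// A page on mysite.com requesting its own file: same origin.
console.log(isSameOrigin('http://www.mysite.com/index.html', '/todaysTrivia.txt')); // true

// A widget embedded on someone else's site requesting the same file: cross origin.
console.log(isSameOrigin('http://other-site.example/page.html', 'http://www.mysite.com/todaysTrivia.txt')); // false
```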

This example contains jQuery code.
var text;
$.get( "proxy.php", function(data) {
    // With a plain $.get, data is already the response body.
    text = data;
});
Then in proxy.php:
<?php
header('Content-Type: text/plain'); // the proxied file is plain text, not XML
$daurl = 'http://www.mysite.com/todaysTrivia.txt';
$handle = fopen($daurl, "r");
if ($handle) {
while (!feof($handle)) {
$buffer = fgets($handle, 4096);
echo $buffer;
}
fclose($handle);
}
Example taken from here:
http://jquery-howto.blogspot.com/2009/04/cross-domain-ajax-querying-with-jquery.html
As explained before, xmlhttp is designed to forbid cross domain requests for security reasons. But nothing prevents you from doing this on your server in PHP.
Another example can be found here: http://usejquery.com/posts/9/the-jquery-cross-domain-ajax-guide

Your problem could be with the fact that you can only request XML data from the same domain via Javascript. This is the biggest issue with AJAX calls - if the text file is on another server, you can't get it via AJAX. If it's on the same server, make your request using a relative URL (no http://).
EDIT
Now that I know what you're trying to accomplish ... my recommendation would be to use an iFrame. Build the system on your server using server-side code and allow remote sites to embed an iFrame to display the output on their own sites. NetworkedBlogs uses this for displaying Facebook features on remote sites. iGoogle uses it extensively with their various Apps and Gadgets. It's a fairly tried-and-true method.
The advantage of using an iFrame is that you'll still have control over most of the content of the widget, but you can give end-users control over the styling (just have your iFrame application accept arguments via query variables to change colors, positions, and sizes).
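Passing styling options through query variables could be sketched like this. buildWidgetUrl, the widget address and the option names (color, width) are all hypothetical; the iFrame application on your server would read whichever parameters you decide to support:

```javascript
// Builds the iFrame src for the widget, appending each styling option
// as a query variable.
function buildWidgetUrl(baseUrl, options) {
    var url = new URL(baseUrl);
    Object.keys(options).forEach(function (key) {
        url.searchParams.set(key, String(options[key]));
    });
    return url.toString();
}

// Placeholder address and options, for illustration only.
console.log(buildWidgetUrl('http://www.mysite.com/trivia-widget', { color: '#336699', width: 300 }));
```

The resulting address is what a remote site would put in the src attribute of its <iframe> tag.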

Assuming the AJAX stuff is right (which I haven't confirmed): You say you can't use PHP for this - if you just mean you need it to use javascript asynchronously but can still use server code in some places, what about using PHP (or any server-side language) to do the actual work and return it to the page through AJAX/javascript - this would solve the problem Alex brings up.
So instead of getting from mysite.com/something.txt from javascript, get it from SomeAjaxHelper.php (or aspx or whatever).

For cross domain, you would have to use dynamic script tags to fetch data asynchronously. The todaysTrivia file would be a .js file that stores the data as JSON. Google for "dynamic script tags cross domain" if you want to use this technique.
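The dynamic script tag technique (often called JSONP) can be sketched as below. It assumes the remote .js file wraps its data in a call to an agreed function, for example showTrivia({"text": "..."}); that function name, loadJsonp, and the doc and globalObj parameters are all illustrative. In a browser the last two would simply be document and window. Since the browser executes whatever the remote file contains, this should only be used with servers you trust.

```javascript
// Loads a cross-domain .js file by inserting a <script> tag and waits for
// that file to call the temporary global function `callbackName`.
function loadJsonp(url, callbackName, onData, doc, globalObj) {
    var script = doc.createElement('script');
    globalObj[callbackName] = function (data) {
        delete globalObj[callbackName]; // clean up the temporary global
        doc.head.removeChild(script);   // and the script tag
        onData(data);
    };
    script.src = url;
    doc.head.appendChild(script);
}

// Possible browser usage (URL and callback name are placeholders):
// loadJsonp('http://www.mysite.com/todaysTrivia.js', 'showTrivia',
//           function (data) { /* display data.text */ }, document, window);
```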
