Scraping a rendered JavaScript webpage - javascript

I'm trying to build a short Python program that extracts PewDiePie's subscriber count, which is updated every second on Socialblade, and displays it in the terminal. I want to refresh this data roughly every 30 seconds.
I've tried PyQt, but it's slow. I then turned to dryscrape, which is slightly faster but still doesn't work the way I want. I've just found Invader and written some short code that has the same problem: the number returned is the one from before the JavaScript on the page is executed:
from invader import Invader
url = 'https://socialblade.com/youtube/user/pewdiepie/realtime'
invader = Invader(url, js=True)
subscribers = invader.take(['#rawCount', 'text'])
print(subscribers.text)
I know that this data is accessible via the site's API, but it's not always working; sometimes it just redirects to this.
Is there a way to get this number after the JavaScript on the page has modified the counter, and not before? And which method seems best to you? Extract it:
from the original page, which keeps returning the same number for hours?
from the API page, which breaks when no cookies are sent and after a certain amount of time?
Thanks for your advice!

If you want to scrape a web page that has parts of it loaded in by JavaScript, you pretty much need to use a real browser.
In Python this can be achieved with pyppeteer:
import asyncio
from pyppeteer import launch

async def main():
    # Launch a real Chromium instance (headless=False opens a visible window)
    browser = await launch(headless=False)
    page = await browser.newPage()
    # Wait until the network is idle so the page's JavaScript has finished updating the counter
    await page.goto('https://socialblade.com/youtube/user/pewdiepie/realtime', {
        'waitUntil': 'networkidle0'
    })
    # Read the counter element's text after the scripts have run
    count = int(await page.Jeval('#rawCount', 'e => e.innerText'))
    print(count)
    await browser.close()

asyncio.get_event_loop().run_until_complete(main())
Note: it does not seem like the website you mentioned above updates the subscriber count frequently any more (even with JavaScript). See: https://socialblade.com/blog/abbreviated-subscriber-counts-on-youtube/
For best results and reliability you will probably need to set the user agent (page.setUserAgent in pyppeteer), keep it up to date, and use proxies (so your IP does not get banned). This can be a lot of work.
It might be easier and cheaper (in time, and compared to buying a large pool of proxies) to use a service that handles this for you, like Scraper's Proxy. It uses a real browser, returns the resulting HTML after the JavaScript has run, and routes all of your requests through a large network of proxies, so you can send a lot of requests without getting your IP banned.
Here is an example using the Scraper's Proxy API getting the count directly from YouTube:
import requests
from pyquery import PyQuery

# Send request to API
url = "https://scrapers-proxy2.p.rapidapi.com/javascript"
params = {
    "click_selector": '#subscriber-count',  # (wait-for-selector work-around)
    "wait_ajax": 'true',
    "url": "https://www.youtube.com/user/PewDiePie"
}
headers = {
    'x-rapidapi-host': "scrapers-proxy2.p.rapidapi.com",
    'x-rapidapi-key': "<INSERT YOUR KEY HERE>"  # TODO
}
response = requests.request("GET", url, headers=headers, params=params)

# Query html
pq = PyQuery(response.text)
count_text = pq('#subscriber-count').text()

# Extract count from text like "111M subscribers"
clean_count_text = count_text.split(' ')[0]
# Note: abbreviated counts with a decimal point (e.g. "29.4M") would need extra handling
clean_count_text = clean_count_text.replace('K', '000')
clean_count_text = clean_count_text.replace('M', '000000')
count = int(clean_count_text)
print(count)
I know this is a bit late, but I hope this helps.

Related

Is it possible to link a random html site with node javascript?

Is it possible to link a random site with Node.js? When I say that, I mean: is it possible to link it with only a URL? If not, then I'm guessing it means having the file.html inside the JavaScript directory. I really want to know if it's possible, because the HTML is not mine and I can't add the line of code to link it with JS, which goes something like (not 100% sure) <src = file.html>.
I tried doing document = require('./page.html'); and ('./page'), but it didn't work, and when I removed the .html at the end of the require it would say module not found.
My key point is that the site shows the player count on some servers, and I want to get that number by linking the site with JS and then using it in some code I already have (tested in the inspect-element console), but I don't know how to link it properly to JS.
If you wanna take a look at the site here it is: https://portal.srbultras.info/#servers
If you have any ideas on how to link a stranger's HTML with JS, I'd really appreciate hearing them!
You cannot require HTML files unless you use something like Webpack with html-loader, but even in that case you can only require local files. What you can do, however, is send an HTTP request to the website. This way you get the same HTML your browser receives whenever you open a webpage. After that you will have to parse the HTML in order to get the data you need. The jsdom package can be used for both steps:
const { JSDOM } = require('jsdom');

JSDOM.fromURL('https://portal.srbultras.info/')
  .then(({ window: { document } }) => {
    const servers = Array.from(
      document.querySelectorAll('#servers tbody>tr')
    ).map(({ children }) => {
      const name = children[3].textContent;
      const [ip, port] = children[4]
        .firstElementChild
        .textContent
        .split(':');
      const [playersnum, maxplayers] = children[5]
        .lastChild
        .textContent
        .split('/')
        .map(n => Number.parseInt(n));
      return { name, ip, port, playersnum, maxplayers };
    });
    console.log(servers);
    /* Your code here */
  });
However, grabbing the server information from a random website is not really what you want to do, because there is a way to get it directly from the servers. Counter Strike 1.6 servers seem to use the GoldSrc / Source Server Protocol that lets us retrieve information about the servers. You can read more about the protocol here, but we are just going to use the source-server-query package to send queries:
const query = require('source-server-query');

const servers = [
  { ip: '51.195.60.135', port: 27015 },
  { ip: '51.195.60.135', port: 27017 },
  { ip: '185.119.89.86', port: 27021 },
  { ip: '178.32.137.193', port: 27500 },
  { ip: '51.195.60.135', port: 27018 },
  { ip: '51.195.60.135', port: 27016 }
];
const timeout = 5000;

Promise.all(servers.map(server => {
  return query
    .info(server.ip, server.port, timeout)
    .then(info => Object.assign(server, info))
    .catch(console.error);
})).then(() => {
  query.destroy();
  console.log(servers);
  /* Your code here */
});
Update
servers is just a normal JavaScript array consisting of objects that describe servers, and you can see its structure when it is logged into the console after the information has been received, so it should not be hard to work with. For example, you can access the playersnum property of the third server in the list by writing servers[2].playersnum. Or you can loop through all the servers and do something with each of them by using functions like map and forEach, or just a normal for loop.
But note that in order to use the data you get from the servers, you have to put your code in the callback function passed to the then method of Promise.all(...), i.e. where console.log(servers) is located. This has to do with the fact that it takes some time to get the responses from the servers, and for that reason server queries are normally asynchronous, meaning that the script continues execution even though it has not received the responses yet. So if you try to access the information in the global scope instead of the callback function, it is not going to be there just yet. You should read about JavaScript Promises if you want to understand how this works.
Another thing you may want to do is to filter out the servers that did not respond to the query. This can happen if a server is offline, for example. In the solution I have provided, such servers are still in the servers array, but they only have the ip and port properties they had originally. You could use filter in order to get rid of them. Do you see how? Tell me if you still need help.
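As a rough sketch of that filtering step (to go where /* Your code here */ is in the snippet above): the exact property names on the query response depend on the source-server-query package, so playersnum, maxplayers, and name below are taken from the discussion above and may need adjusting.
// Keep only servers that actually answered the query (the ones that were merged with extra info)
const online = servers.filter(server => 'playersnum' in server);

online.forEach(server => {
  console.log(server.name + ': ' + server.playersnum + '/' + server.maxplayers + ' players');
});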

Is there a way to lower HTTPS requests where I have to request a large amount of IDs

I have a site, say example.com
And I have an array of IDs that can correspond to raw JSON on the site.
I can request it via example.com/<name>/<id>
The thing is, the number of IDs I have is over 100, so I have to make over 100 HTTPS requests. I don't think the site lets me do example.com/<name>?ids=[ids] (plugging in the array of IDs as a query component). Is there a way I can optimize this?
I'm using Node.js with the HTTPS module.
You can split the whole list into chunks (let's say of length 5) and make all the requests within a chunk in parallel. That way you reduce the total time needed to make all the requests while not overloading the site by firing every single request in parallel at once.
// `chunk` comes from the `lodash` package
const _ = require('lodash');
// axios is used here only for demonstration purposes
const axios = require('axios');

// (this loop must run inside an async function, since it uses await)
const idChunks = _.chunk(ids, 5);
for (const idChunk of idChunks) {
  const results = await Promise.all(
    idChunk.map(id => axios.get(`example.com/<name>/${id}`))
  );
  // ... processing results
}

js function calling an api doesn't respond with expected values

I am trying to collect all the market pairs from a crypto exchange using its API, but I'm not sure how to select the proper field in the JSON object, as it does not seem to work.
The API: https://ftx.com/api/markets
My code:
requests.js
import axios from 'axios';
import parsers from './parsers';

async function ftxMarkets() {
  const ftxResponse = await axios.get('https://ftx.com/api/markets');
  return parsers.ftxMarkets(ftxResponse.data);
}
parsers.js
function ftxMarkets(data) {
  const [ftxMarketPairs] = data;
  let ftxPairs = data.map(d => d.name);
  console.log(ftxPairs);
};
I'm not sure about d.name in the parsers.js file, but I tried the same code with another exchange, changing just that part, and it worked, so I guess that's where the problem comes from; although I can't be sure, and I don't know what to replace it with.
Thanks
I ran the API call, and looking at the response I see a result key with the list of all the crypto data. So I am guessing it'll work if you call the parser with the result object like this:
return parsers.ftxMarkets(ftxResponse.result);
// try parsers.ftxMarkets(ftxResponse.data.result) if the above one doesn't work
and then in the parser it should work normally
function ftxMarkets(data) {
  let ftxPairs = data.map(d => d.name);
  console.log(ftxPairs);
};
Update:
Since ftxResponse.data.result works, your issue should be a CORS issue, and to fix that there are two options:
A CORS plugin in the web browser (not recommended in production).
Proxy it through a server, i.e. request the resource through a proxy. The simplest way is to write a small Node server (or, if you already have a back-end associated with your front-end, you can use that) which makes the request to the third-party API and sends back the response. On that server's response you can then allow the cross-origin header.
For option 2, if you already have a Node.js server, you can use the cors npm package, call the third-party API from the server, and serve the response to the front end with CORS enabled. A rough sketch of such a proxy follows.
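A minimal sketch of that proxy, assuming Express and the cors package are installed (the /api/markets route name and the port are just examples):
const express = require('express');
const cors = require('cors');
const axios = require('axios');

const app = express();
app.use(cors()); // allow cross-origin requests from your front end

// The browser calls this endpoint instead of ftx.com directly
app.get('/api/markets', async (req, res) => {
  try {
    const ftxResponse = await axios.get('https://ftx.com/api/markets');
    res.json(ftxResponse.data);
  } catch (err) {
    res.status(502).json({ error: 'Upstream request failed' });
  }
});

app.listen(3000, () => console.log('Proxy listening on port 3000'));
The front end then requests http://localhost:3000/api/markets instead of the FTX API directly, so the browser's same-origin restrictions no longer apply.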

Google BigQuery with javascript - how to get query exe time and how much data (in size) processed by query

I am new to BigQuery and need more functions in BigQuery + JavaScript so I can get the total execution time and how many GB of data were processed by a query.
How can I get the total execution time and the processed data size from the JavaScript API?
E.g. Query complete (1.6s elapsed, 35.7 GB processed)
The above is the example result I want from the JavaScript API.
I can get the total processed bytes from the response, but where do I get the query execution time from? I don't want to run a timer (to calculate the time) before and after the query executes.
I also need some information on how to see the executed query history from the JavaScript API.
Thanks in advance.
To determine how long a job took, you can compare statistics.creationTime, statistics.startTime, and statistics.endTime, depending on your needs. These can be accessed from the jobs.list or jobs.get API. These responses will also contain the bytes processed by a query in the statistics.query.totalBytesProcessed field.
To retrieve a history of jobs (including queries, and any other load, copy, or extract jobs you may have run) you can call the jobs.list API.
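As a rough sketch of such a history call, using the same gapi.client.request pattern as the snippet below (the project ID here is a placeholder, and maxResults is optional):
// List recent jobs (queries, loads, copies, extracts) for a project
var listRequest = {
  'path': 'https://clients6.google.com/bigquery/v2/projects/YOUR_PROJECT_ID/jobs',
  'method': 'GET',
  'params': {'maxResults': 20}
};
gapi.client.request(listRequest).execute(function(response) {
  console.log(response.jobs); // each listed job carries its own statistics block
});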
Specifically in JS, if you have a query response containing a jobReference, you can run something like the following to retrieve the full job details using the jobs.get method and log them to the console. The logged response should contain the fields linked above.
var projectId = response['jobReference']['projectId'];
var jobId = response['jobReference']['jobId'];
var path = 'https://clients6.google.com/bigquery/v2/projects/' + projectId + '/jobs/' + jobId;
var request = {'path': path, 'method': 'GET'};
gapi.client.request(request).execute(function(response) { console.log(response) });
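Building on that, a rough sketch of formatting the elapsed time and processed bytes from the returned job resource (the statistics timestamps and totalBytesProcessed come back as strings, so they are converted to numbers first):
gapi.client.request(request).execute(function(job) {
  // startTime/endTime are epoch milliseconds returned as strings
  var elapsedMs = Number(job.statistics.endTime) - Number(job.statistics.startTime);
  // totalBytesProcessed is a string of bytes; 2^30 bytes per GB is used here
  var gbProcessed = Number(job.statistics.query.totalBytesProcessed) / Math.pow(1024, 3);
  console.log('Query complete (' + (elapsedMs / 1000).toFixed(1) + 's elapsed, ' +
    gbProcessed.toFixed(1) + ' GB processed)');
});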

How to filter out the most active users from fan page?

I am creating a new website. I want to promote it using another, topic-related web service of mine. I want to send some gifts to the people who popularized my first website and fan page. How do I filter out, let's say, the 20 users who like/share/comment on most of my posts?
Any suitable programming language will do.
[EDIT]
OK... to be honest, I'm looking for a way to parse a fan page that is not mine. I want to send gifts to the most active users of my competition's fan page, to simply bribe them a little :)
There are a number of ways, I'll start with the easiest...
Say there's a brand name or #hashtag involved; then you could use the search API as such: https://graph.facebook.com/search?q=watermelon&type=post&limit=1000 and then iterate over the data, say the latest 1000 posts (the limit param), to find the mode user (the one that comes up the most) across all the statuses.
Say it's just a page; then you can access the /<page>/posts endpoint (e.g.: https://developers.facebook.com/tools/explorer?method=GET&path=cocacola%2Fposts), as that'll give you a list of the latest posts (they're paginated so you can iterate over the results), and this'll include a list of the people who like the posts and who comment on them; you can then find the mode user and so on.
In terms of code you can use anything; you can even run this locally on your machine using a simple web server (such as MAMP or WAMP, etc.) or the CLI. The response is all JSON, and modern languages are able to handle it. Here's a quick example I knocked up for the first method in Python:
import json
import urllib2
from collections import Counter

def search():
    req = urllib2.urlopen('https://graph.facebook.com/search?q=watermelon&type=post')
    res = json.loads(req.read())
    users = []
    for status in res['data']:
        users.append(status['from']['name'])
    count = Counter(users)
    print count.most_common()

if __name__ == '__main__':
    search()
I've stuck it up on github if you want to refer to it later: https://github.com/ahmednuaman/python-facebook-search-mode-user/blob/master/search.py
When you run the code it'll return an ordered list of the mode users within that search, e.g. the ones who've posted the most statuses containing the specific search term. This can easily be adapted for the second method should you wish to use it.
Based on Ahmed Nuaman's answer (please also give them +1), I have prepared this piece of code:
Example of usage:
To analyze the most active Facebook users of http://www.facebook.com/cern:
$ python FacebookFanAnalyzer.py cern likes
$ python FacebookFanAnalyzer.py cern comments
$ python FacebookFanAnalyzer.py cern likes comments
Notes: shares and nested comments are not supported.
File: FacebookFanAnalyzer.py
# -*- coding: utf-8 -*-
import json
import urllib2
import sys
from collections import Counter

reload(sys)
sys.setdefaultencoding('utf8')

###############################################################
###############################################################
#### PLEASE PASTE YOUR TOKEN HERE; YOU CAN GENERATE IT ON:
#### https://developers.facebook.com/tools/explorer
#### GENERATE AND PASTE A NEW ONE WHEN THIS ONE STOPS WORKING
token = 'AjZCBe5yhAq2zFtyNS4tdPyhAq2zFtyNS4tdPw9sMkSUgBzF4tdPw9sMkSUgBzFZCDcd6asBpPndjhAq2zFtyNS4tsBphqfZBJNzx'
attrib_limit = 100
post_limit = 100
###############################################################
###############################################################

class FacebookFanAnalyzer(object):

    def __init__(self, fanpage_name, post_limit, attribs, attrib_limit):
        self.fanpage_name = fanpage_name
        self.post_limit = post_limit
        self.attribs = attribs
        self.attrib_limit = attrib_limit
        self.data = {}

    def make_request(self, attrib):
        global token
        url = 'https://graph.facebook.com/' + self.fanpage_name + '/posts?limit=' + str(self.post_limit) + '&fields=' + attrib + '.limit(' + str(self.attrib_limit) + ')&access_token=' + token
        print "Requesting '" + attrib + "' data: " + url
        req = urllib2.urlopen(url)
        res = json.loads(req.read())
        if res.get('error'):
            print res['error']
            exit()
        return res

    def grep_data(self, attrib):
        res = self.make_request(attrib)
        lst = []
        for status in res['data']:
            if status.get(attrib):
                for person in status[attrib]['data']:
                    if attrib == 'likes':
                        lst.append(person['name'])
                    elif attrib == 'comments':
                        lst.append(person['from']['name'])
        return lst

    def save_as_html(self, attribs):
        filename = self.fanpage_name + '.html'
        f = open(filename, 'w')
        f.write(u'<html><head></head><body>')
        f.write(u'<table border="0"><tr>')
        for attrib in attribs:
            f.write(u'<td>' + attrib + u'</td>')
        f.write(u'</tr>')
        for attrib in attribs:
            f.write(u'<td valign="top"><table border="1">')
            for d in self.data[attrib]:
                f.write(u'<tr><td>' + unicode(d[0]) + u'</td><td>' + unicode(d[1]) + u'</td></tr>')
            f.write(u'</table></td>')
        f.write(u'</tr></table>')
        f.write(u'</body>')
        f.close()
        print "Saved to " + filename

    def fetch_data(self, attribs):
        for attrib in attribs:
            self.data[attrib] = Counter(self.grep_data(attrib)).most_common()

def main():
    global post_limit
    global attrib_limit
    fanpage_name = sys.argv[1]
    attribs = sys.argv[2:]
    f = FacebookFanAnalyzer(fanpage_name, post_limit, attribs, attrib_limit)
    f.fetch_data(attribs)
    f.save_as_html(attribs)

if __name__ == '__main__':
    main()
Output:
Requesting 'comments' data: https://graph.facebook.com/cern/posts?limit=50&fields=comments.limit(50)&access_token=AjZCBe5yhAq2zFtyNS4tdPyhAq2zFtyNS4tdPw9sMkSUgBzF4tdPw9sMkSUgBzFZCDcd6asBpPndjhAq2zFtyNS4tsBphqfZBJNzx
Requesting 'likes' data: https://graph.facebook.com/cern/posts?limit=50&fields=likes.limit(50)&access_token=AjZCBe5yhAq2zFtyNS4tdPyhAq2zFtyNS4tdPw9sMkSUgBzF4tdPw9sMkSUgBzFZCDcd6asBpPndjhAq2zFtyNS4tsBphqfZBJNzx
Saved to cern.html
Read the list of posts on the page at the page's /feed connection and track the user IDs of those users who posted and commented on each post, building a list of who does it the most often.
Then store those somewhere and use the stored list in the part of your system which decides who to send the bonuses to.
e.g.
http://graph.facebook.com/cocacola/feed returns all the recent posts on the cocacola page, and you could track the IDs of the posters, commenters, and likers to determine who the most active users are.
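A rough sketch of that tallying step in Node.js follows; it assumes Node 18+ for the built-in fetch, that a valid access_token is appended to the feed URL, and that the from/comments field layout mirrors the Graph API responses discussed above:
// Count how often each user shows up as a poster or commenter in a page's feed
async function mostActiveUsers(feedUrl) {
  const res = await fetch(feedUrl); // e.g. 'https://graph.facebook.com/cocacola/feed?access_token=...'
  const feed = await res.json();

  const counts = new Map();
  const bump = id => counts.set(id, (counts.get(id) || 0) + 1);

  for (const post of feed.data || []) {
    if (post.from) bump(post.from.id); // the poster
    for (const comment of (post.comments && post.comments.data) || []) {
      if (comment.from) bump(comment.from.id); // each commenter
    }
  }

  // Sorted from most to least active
  return [...counts.entries()].sort((a, b) => b[1] - a[1]);
}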
Write a PHP or jQuery script which is executed when a user clicks like or share on your website, just before the actual share/like is sent to Facebook, and record the user info and the post he/she shared/liked. Now you can track who shared your posts the most.
The PHP/jQuery script will act as a middle man, so don't use the Facebook share/like script directly. I will try to find the code I have written for this method. I used PHP & MySQL. Try to use jQuery; this will give a better result in terms of hiding the process (I mean the data will be recorded without reloading the page).
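A minimal sketch of that middle-man step in jQuery; the /track-share.php endpoint, the data-post-id attribute, and currentUserId are hypothetical placeholders, and the real Facebook share/like call would go where indicated:
// Record the click on your own server before handing off to the Facebook share
$('.share-button').on('click', function (event) {
  event.preventDefault();
  var postId = $(this).data('post-id'); // hypothetical data attribute on the button

  // Send the tracking info via AJAX so the page does not reload
  $.post('/track-share.php', { post_id: postId, user_id: currentUserId })
    .always(function () {
      // ...then trigger the actual Facebook share/like here
    });
});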
Your question is nice, but it is quite hard. (Actually, at first I thought this was impossible, so I built a quite different solution...) One of the best ways is to create a network where your viewers can register with a form that requires the official URLs of their social networking pages, and where they can also indicate that they don't have that kind of network:
“Do you want to share some of our page? Please register here first..”
That way, they get a specific URL that they want to share while they're on your website, but they don't know they're being traced when that specific URL is visited. (Every time a specific URL gets visited, the IP is tracked and the number of visits is incremented in a database.) Give them a dynamic URL at the top of your website, in a text area on every page, to track them. Or use scripting to automate adding a tracing query string to the URLs of your site.
I think there's free software for building an affiliate network that makes this easy! If your viewers really love your website, they'll register to be affiliates. But that is something different; an affiliate network is quite different from the network mentioned in the paragraphs above.
I think you can also use Google Analytics to trace referrals that didn't come from URLs with a dynamic query string, like Digital Point, but not from social networks like Facebook, because you wouldn't get the exact referral paths from that kind of social network due to the query path. However, you can use it to track the other networks. Also, AddThis Analytics is good for non-query-string URLs.
The two kinds of referrals in Google Analytics are under the “Traffic Sources” menu of the standard reports:
Traffic Sources
Sources
Referrals
Social
Network Referrals
This answer is pretty messy, but sometimes quite useful. Other than that, please check the links below:
Publishing with an App Access Token - Facebook Developers
Facebook for Websites - Facebook Developers
Like - Facebook Developers
Open Graph Overview - Facebook Developers
