Just to clarify in advance, I don't have a Facebook account and I have no intent to create one. Also, what I'm trying to achieve is perfectly legal in my country and the USA.
Instead of using the Facebook API to get the latest timeline posts of a Facebook page, I want to send a GET request directly to the page URL (e.g. this page) and extract the posts from the HTML source code.
(I'd like to get the text and the creation time of the post.)
When I run this in the web console:
document.getElementsByClassName('userContent')
I get a list of elements containing the text of the latest posts.
But I'd like to extract that information from a Node.js script. I could probably do it quite easily using a headless browser like puppeteer or the like, but that would create a ton of unnecessary overhead. I'd really like to use a simple approach like downloading the HTML code, passing it to cheerio and using cheerio's jQuery-like API to extract the posts.
Here is my attempt at exactly that:
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');
rp.get('https://www.facebook.com/pg/officialstackoverflow/posts/').then( postsHtml => {
const $ = cheerio.load(postsHtml);
const timeLinePostEls = $('.userContent');
console.log(timeLinePostEls.html()); // should NOT be null
const newestPostEl = timeLinePostEls.get(0);
console.log(newestPostEl.html()); // should NOT be null
const newestPostText = newestPostEl.text();
console.log(newestPostText);
//const newestPostTime = newestPostEl.parent(??).child('.livetimestamp').title;
//console.log(newestPostTime);
}).catch(console.error);
Unfortunately, $('.userContent') does not work. However, I was able to verify that the data I'm looking for is embedded somewhere in that HTML code.
But I couldn't come up with a good regex approach or the like to extract that data.
Depending on the post content the number of HTML tags within the post varies heavily.
Here is a simple example of a post containing one link:
<div class="_5pbx userContent _3576" data-ft="{&quot;tn&quot;:&quot;K&quot;}"><p>We're proud to be named one of Built In NYC's Best Places to Work in 2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for Best Perks and Benefits. See what it took to make the list and check out our profile to see some of our job openings. http://*******/2H3Kbr2</p></div>
Formatted in a more readable form it looks somewhat like this:
<div class="_5pbx userContent _3576" data-ft="{&quot;tn&quot;:&quot;K&quot;}">
<p>
We're proud to be named one of Built In NYC's Best Places to Work in
2019, ranking in the top 10 for Best Midsize Places to Work and top 3 (!) for
Best Perks and Benefits. See what it took to make the list and check out our
profile to see some of our job openings.
SHORT_LINK.....
</p>
</div>
This regex seems to work okay, but I don't think it is very reliable:
/<div class="[^"]+ userContent [^"]+" data-ft="[^"]+">(.+?)<\/div>/g
If, for example, the post contained another div element, it wouldn't work properly. In addition, I have no way of knowing the time/date the post was created using this approach.
Any ideas how I could relatively reliably extract the most recent 2-3 posts including the creation date/time?
Okay, I finally figured it out. I hope this will be useful to others. This function will extract the 20 latest posts, including the creation time:
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');
function GetFbPosts(pageUrl) {
const requestOptions = {
url: pageUrl,
headers: {
'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0'
}
};
return rp.get(requestOptions).then( postsHtml => {
const $ = cheerio.load(postsHtml);
const timeLinePostEls = $('.userContent').map((i,el)=>$(el)).get();
const posts = timeLinePostEls.map(post=>{
return {
message: post.html(),
created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
}
});
return posts;
});
}
GetFbPosts('https://www.facebook.com/pg/officialstackoverflow/posts/').then(posts=>{
// Log all posts
for (const post of posts) {
console.log(post.created_at, post.message);
}
});
Since Facebook messages can have complicated formatting, the message is not plain text but HTML. You can remove the formatting and get just the text by replacing message: post.html() with message: post.text().
Edit:
If you want to get more than the latest 20 posts, it is more complicated. The first 20 posts are served statically on the initial HTML page. All following posts are retrieved via AJAX in chunks of 8 posts.
It can be achieved like this:
// make sure your node.js version supports async/await (v10 and above should be fine)
// npm i request cheerio request-promise-native
const rp = require('request-promise-native'); // requires installation of `request`
const cheerio = require('cheerio');
class FbScrape {
constructor(options={}) {
this.headers = options.headers || {
'User-Agent': 'Mozilla/5.0 (X11; Fedora; Linux x86_64; rv:64.0) Gecko/20100101 Firefox/64.0' // you may have to update this at some point
};
}
async getPosts(pageUrl, limit=20) {
const staticPostsHtml = await rp.get({ url: pageUrl, headers: this.headers });
if (limit <= 20) {
return this._parsePostsHtml(staticPostsHtml);
} else {
let staticPosts = this._parsePostsHtml(staticPostsHtml);
const nextResultsUrl = this._getNextPageAjaxUrl(staticPostsHtml);
const ajaxPosts = await this._getAjaxPosts(nextResultsUrl, limit-20);
return staticPosts.concat(ajaxPosts);
}
}
_parsePostsHtml(postsHtml) {
const $ = cheerio.load(postsHtml);
const timeLinePostEls = $('.userContent').map((i,el)=>$(el)).get();
const posts = timeLinePostEls.map(post => {
return {
message: post.html(),
created_at: post.parents('.userContentWrapper').find('.timestampContent').html()
}
});
return posts;
}
async _getAjaxPosts(resultsUrl, limit=8, posts=[]) {
const responseBody = await rp.get({ url: resultsUrl, headers: this.headers });
const extractedJson = JSON.parse(responseBody.substr(9));
const postsHtml = extractedJson.domops[0][3].__html;
const newPosts = this._parsePostsHtml(postsHtml);
const allPosts = posts.concat(newPosts);
const nextResultsUrl = this._getNextPageAjaxUrl(postsHtml);
if (allPosts.length+1 >= limit)
return allPosts;
else
return await this._getAjaxPosts(nextResultsUrl, limit, allPosts);
}
_getNextPageAjaxUrl(html) {
return 'https://www.facebook.com' + /"(\/pages_reaction_units\/more[^"]+)"/g.exec(html)[1].replace(/&amp;/g, '&') + '&__a=1';
}
}
const fbScrape = new FbScrape();
const minimum = 28; // minimum number of posts to request; gets rounded up to 20, 28, 36, 44, 52, 60, 68 etc. because of the page sizes (page 1 = 20; all following pages = 8)
fbScrape.getPosts('https://www.facebook.com/pg/officialstackoverflow/posts/', minimum).then(posts => { // get at least the 28 latest posts
// Log all posts
for (const post of posts) {
console.log(post.created_at, post.message);
}
});
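The rounding mentioned in the comment on `minimum` follows from the page sizes (20 static posts, then AJAX chunks of 8). A minimal sketch of that arithmetic, assuming those page sizes still hold (Facebook may change them); `expectedPostCount` is a hypothetical helper name and not part of the class above:

```javascript
// Page sizes as described above; both values are assumptions that may change.
const STATIC_PAGE_SIZE = 20;
const AJAX_PAGE_SIZE = 8;

// Hypothetical helper: how many posts a request for at least `minimum`
// posts should yield when results arrive in whole pages.
function expectedPostCount(minimum) {
  if (minimum <= STATIC_PAGE_SIZE) return STATIC_PAGE_SIZE;
  const ajaxPages = Math.ceil((minimum - STATIC_PAGE_SIZE) / AJAX_PAGE_SIZE);
  return STATIC_PAGE_SIZE + ajaxPages * AJAX_PAGE_SIZE;
}

console.log(expectedPostCount(28)); // 28
console.log(expectedPostCount(30)); // 36
```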
I am developing an application in Node.js. It makes a request to the GitLab API and obtains the raw file data from it.
I would like to find every ID that follows a particular keyword in that data and collect them all. I am confused about this part and unable to proceed. Can someone please explain how to achieve this?
I would like to read the IDs that appear after the keyword Scenario:, i.e. in this case I would like to get A001-D002 and K002-M002. These IDs can be anything and can appear anywhere within the file content. I would like to read them and store them in an array for that particular file.
Following is the sample file data:
FileName: File Data
Background:
This is some random background
Scenario: A001-D002 My first scenario
Given I am sitting on a plane
When I am offered drinks
Scenario: K002-M002 My second scenario
Given when I book the uber taxi
When I get the notifications
I do not understand how to iterate over the file content, match each occurrence, and extract the IDs.
Following is the code that I have which makes the request to GitLab and obtains the raw file content:
./index.js:
const express = require('express');
const http = require("http");
const bodyParser = require('body-parser');
const app = express();
const port = process.env.PORT || 9000;
const gitlabDump = require("./controller/GitLabDump");
app.use(bodyParser.json());
app.use(bodyParser.urlencoded({ extended: true }));
//Make NodeJS to Listen to a particular Port in Localhost
app.listen(port, function(){
  // gitlabDump accepts only a callback, so no other argument is passed
  gitlabDump.gitlabDump(function(data){
    console.log("Completed Execution for GitLab")
    process.exit();
  })
});
My ./controller/GitLabDump.js:
const request = require('request');
const https = require('https');
const axios = require('axios');
exports.gitlabDump = function(callback){
var gitlabAPI = "https://gitlab.com/api/v4/projects/<project_id>/repository/files/tree/<subfolders>/<fileName>/raw?ref=master&private_token=<privateToken>";
//Make the request to the each file and read its raw contents
request(gitlabAPI, function(error, response, body) {
const featureFileData = JSON.parse(JSON.stringify(body)).toString();
console.log(featureFileData)
for(const match of featureFileData.matchAll("Scenario:")){
console.log(match);
}
callback("Completed");
})
}
I am able to print the file contents. Can someone please explain to me how I can iterate over the raw file contents and get all the required IDs?
I suggest analyzing the string line by line (I assume your data is structured as in your example). This is easier to understand and to code than a regex.
The example below represents your request callback function.
I split the code into 3 steps:
search for the filename
search for the lines we are interested in (those starting with the word "Scenario")
extract the ID with a filter function
Afterwards you can easily change the ID filter (txt.substr(0, txt.indexOf(' '))) to a more suitable expression for extracting your text.
The result is sent to a callback function with the filename as the first argument and all IDs as the second, as in your example.
((callback) => {
const featureFileData = `FileName: File Data
Background:
This is some random background
Scenario: A001-D002 My first scenario
Given I am sitting on a plane
When I am offered drinks
Scenario: K002-M002 My second scenario
Given when I book the uber taxi
When I get the notifications`;
// find "filename"
const filenames = featureFileData.split('\n')
.filter(line => line.trim().substr(0,8) === 'FileName')
.map((raw) => {
if(!raw) return 'unknown';
const sentences = raw.trim().split(':');
if(sentences[1] && sentences[1].length) {
return sentences[1].trim();
}
});
// filter the "Scenario" lines
const scenarioLines = featureFileData.split('\n')
.map((line) => {
if(line.trim().substr(0,8) === 'Scenario') {
const sentences = line.trim().split(':');
if(sentences[1] && sentences[1].length) {
return sentences[1].trim();
}
}
return false;
})
.filter(r => r !== false);
// search ids
const ids = scenarioLines.map(txt => txt.substr(0, txt.indexOf(' ')));
callback(filenames[0], ids);
})(console.log)
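For comparison, the same IDs can also be collected with a single regular expression and matchAll (which the question already tried with a plain string). This is only a sketch: the helper name `extractScenarioIds` is made up for illustration, and it assumes each ID is the first whitespace-delimited token after Scenario: on its line.

```javascript
// Hypothetical helper: returns every token that directly follows "Scenario:".
// Assumes one scenario per line and that the ID contains no whitespace.
function extractScenarioIds(text) {
  // ^ with the m flag anchors at each line start; matchAll requires the g flag.
  const pattern = /^[ \t]*Scenario:[ \t]*(\S+)/gm;
  return Array.from(text.matchAll(pattern), (match) => match[1]);
}

const sample = [
  "FileName: File Data",
  "Background:",
  "This is some random background",
  "Scenario: A001-D002 My first scenario",
  "Given I am sitting on a plane",
  "Scenario: K002-M002 My second scenario",
].join("\n");

console.log(extractScenarioIds(sample)); // [ 'A001-D002', 'K002-M002' ]
```

The line-by-line version above is easier to adapt if the ID format becomes more complicated; the regex version is shorter when the format stays this simple.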
I am fetching data from different API with javascript's fetch API. But how can I find out how many bytes are sent on each request for analytics?
The request could be in any method.
I know that I can get the amount of bytes received with
response.headers.get("content-length").
I need to find out a way to get the amount of bytes sent on the frontend (browser or mobile using React Native). Ideally, it would be the total size of the request, but just the size of the request body would be good enough.
You can get the value that will be set in the Content-Length header by reading the Request's body as text and checking the length of the returned string:
(async () => {
const formdata = new FormData();
const file = new Blob(["data".repeat(1024)])
formdata.append("key", file)
const req = new Request("/", { method: "POST", body: formdata });
// note that we .clone() the Request
// so that we can still use the original one with fetch()
console.log((await req.clone().text()).length);
fetch(req);
console.log("check the Network panel of your dev tools to see the sent header");
})();
However, this only applies to requests where this header is sent, i.e. not to GET and HEAD requests.
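One caveat with the approach above: the length of a string counts UTF-16 code units, not bytes, so it only matches the number of bytes sent when the body is pure ASCII. A small sketch of measuring the encoded size instead, using the standard TextEncoder; `utf8ByteLength` is a name chosen here for illustration:

```javascript
// Hypothetical helper: size in bytes of a string once encoded as UTF-8,
// which is how fetch() encodes string bodies.
function utf8ByteLength(text) {
  return new TextEncoder().encode(text).length;
}

const body = JSON.stringify({ name: "José" });
console.log(body.length);          // 15 code units
console.log(utf8ByteLength(body)); // 16 bytes, because "é" takes two bytes
```

For non-string bodies such as FormData or Blob, reading the cloned request with arrayBuffer() and taking its byteLength should give the byte count directly.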
A quick solution that I used - a tiny middleware (I use Express):
const socketBytes = new Map();
app.use((req, res, next) => {
req.socketProgress = getSocketProgress(req.socket);
next();
});
/**
* return kb read delta for given socket
*/
function getSocketProgress(socket) {
const currBytesRead = socket.bytesRead;
let prevBytesRead;
if (!socketBytes.has(socket)) {
prevBytesRead = 0;
} else {
prevBytesRead = socketBytes.get(socket).prevBytesRead;
}
socketBytes.set(socket, {prevBytesRead: currBytesRead})
return (currBytesRead-prevBytesRead)/1024;
}
And then you can use req.socketProgress in your middlewares.
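Note that this measures bytes received on the server side, per socket. Because the Map above keeps a reference to every socket it has seen, a WeakMap may be a better fit for long-running servers. The sketch below is a variation rather than the original code: `createReadDeltaTracker` is a hypothetical name, and a plain object with a bytesRead property stands in for a real socket.

```javascript
// Variation on getSocketProgress using a WeakMap, so entries disappear
// once a socket is garbage collected. Returns bytes, not kilobytes.
function createReadDeltaTracker() {
  const previous = new WeakMap();
  return function readDelta(socket) {
    const current = socket.bytesRead;
    const delta = current - (previous.get(socket) || 0);
    previous.set(socket, current);
    return delta;
  };
}

// Stand-in for a net.Socket reused by two keep-alive requests.
const fakeSocket = { bytesRead: 2048 };
const readDelta = createReadDeltaTracker();
console.log(readDelta(fakeSocket)); // 2048
fakeSocket.bytesRead = 5120;
console.log(readDelta(fakeSocket)); // 3072
```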
I'm developing an AWS Lambda in TypeScript that uses Axios to get data from an API. That data is then filtered and put into a DynamoDB table.
The code looks as follows:
export {};
const axios = require("axios");
const AWS = require('aws-sdk');
exports.handler = async (event: any) => {
const shuttleDB = new AWS.DynamoDB.DocumentClient();
const startDate = "2021-08-16";
const endDate = "2021-08-16";
const startTime = "16:00:00";
const endTime = "17:00:00";
const response = await axios.post('URL', {
data:{
"von": startDate+"T"+startTime,
"bis": endDate+"T"+endTime
}}, {
headers: {
'x-rs-api-key': KEY
}
}
);
const params = response.data.data;
const putPromise = params.map(async(elem: object) => {
delete elem.feat1;
delete elem.feat2;
delete elem.feat3;
delete elem.feat4;
delete elem.feat5;
const paramsDynamoDB = {
TableName: String(process.env.TABLE_NAME),
Item: elem
}
shuttleDB.put(paramsDynamoDB).promise();
});
await Promise.all(putPromise);
};
This mostly works. When the test button is pushed the first time, everything seems to run: I get all the console.log output I added during development, but the data is not put into the DB.
On the second try the output is the same, but the data is successfully put into the DB.
Any ideas regarding this issue? How can I make sure the data is put into the DB on the first try?
Thanks in advance!
You need to return the promise from the DB call:
return shuttleDB.put(paramsDynamoDB).promise();
Without the return, each async callback resolves to undefined right away, so Promise.all does not wait for the writes and the Lambda can finish before they complete.
Also, Promise.all rejects as soon as any call fails (unlike Promise.allSettled, which waits for every call), so it may be worth logging any errors that occur.
Better still, take a look at transactWrite (https://docs.aws.amazon.com/AWSJavaScriptSDK/latest/AWS/DynamoDB/DocumentClient.html#transactWrite-property) to ensure that either everything or nothing gets written.
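To illustrate both points, here is a self-contained sketch in which `fakePut` is a made-up stand-in for shuttleDB.put(...).promise() (no AWS SDK is involved) and `writeAll` is a hypothetical helper. The map callback returns each promise, and Promise.allSettled is used so that every failure can be logged:

```javascript
// Made-up stand-in for shuttleDB.put(params).promise(): resolves after a
// short delay, or rejects when the item has no id.
function fakePut(item, table) {
  return new Promise((resolve, reject) => {
    setTimeout(() => {
      if (item.id === undefined) return reject(new Error("missing id"));
      table.push(item);
      resolve();
    }, 5);
  });
}

// Waits for every write and returns the reasons of the ones that failed.
async function writeAll(items, table) {
  // The callback returns the promise, so the array holds the real writes.
  const pending = items.map((item) => fakePut(item, table));
  const results = await Promise.allSettled(pending);
  return results
    .filter((result) => result.status === "rejected")
    .map((result) => result.reason);
}

const table = [];
writeAll([{ id: 1 }, { name: "no id" }, { id: 3 }], table).then((failures) => {
  console.log(table.length); // 2
  failures.forEach((error) => console.error("write failed:", error.message));
});
```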
I am fairly new to JS/Winappdriver.
The application I am trying to test is a windows based "Click Once" application from .Net, so I have to go to a website from IE and click "Install". This will open the application.
Once the application is running, I have no way to connect to the application to perform my UI interactions while using JavaScript.
Using C#, I was looping through the processes looking for a process name, get the window handle, convert it to hex, add that as a capability and create the driver - it worked. Sample code below,
public Setup_TearDown()
{
string TopLevelWindowHandleHex = null;
IntPtr TopLevelWindowHandle = new IntPtr();
foreach (Process clsProcess in Process.GetProcesses())
{
if (clsProcess.ProcessName.StartsWith($"SomeName-{exec_pob}-{exec_env}"))
{
TopLevelWindowHandle = clsProcess.Handle;
TopLevelWindowHandleHex = clsProcess.MainWindowHandle.ToString("x");
}
}
var appOptions = new AppiumOptions();
appOptions.AddAdditionalCapability("appTopLevelWindow", TopLevelWindowHandleHex);
appOptions.AddAdditionalCapability("ms:experimental-webdriver", true);
appOptions.AddAdditionalCapability("ms:waitForAppLaunch", "25");
AppDriver = new WindowsDriver<WindowsElement>(new Uri(WinAppDriverUrl), appOptions);
AppDriver.Manage().Timeouts().ImplicitWait = TimeSpan.FromSeconds(60);
}
How do I do this in JavaScript? I can't seem to find any code examples.
Based on an example from this repo, I tried the following in JS to find the process to latch on to but without luck.
import {By2} from "selenium-appium";
// this.appWindow = this.driver.element(By2.nativeAccessibilityId('xxx'));
// this.appWindow = this.driver.element(By2.nativeXpath("//Window[starts-with(@Name,\"xxxx\")]"));
// this.appWindow = this.driver.elementByName('WindowsForms10.Window.8.app.0.13965fa_r11_ad1');
// this.appWindow = this.driver.elementByName('xxxxxxx');
async connectAppDriver(){
await this.waitForAppWindow();
var appWindow = await this.appWindow.getAttribute("NativeWindowHandle");
let hex = (Number(appWindow)).toString(16);
var currentAppCapabilities =
{
"appTopLevelWindow": hex,
"platformName": "Windows",
"deviceName": "WindowsPC",
"newCommandTimeout": "120000"
}
let driverBuilder = new DriverBuilder();
await driverBuilder.stopDriver();
this.driver = await driverBuilder.createDriver(currentAppCapabilities);
return this.driver;
}
I keep getting this error in Winappdriver
{"status":13,"value":{"error":"unknown error","message":"An unknown error occurred in the remote end while processing the command."}}
I've also opened this ticket here.
It seems like such an easy thing to do, but I couldn't figure this one out.
Are there any Node packages I could use to get the top-level window handle easily?
I am open to suggestions on how to tackle this issue while using JavaScript for Winappdriver.
Hope this helps someone out there.
I got around this by creating an exe in C# that generates the hex handle of the app to connect to, based on the process name. It looks something like this:
public string GetTopLevelWindowHandleHex()
{
string TopLevelWindowHandleHex = null;
IntPtr TopLevelWindowHandle = new IntPtr();
foreach (Process clsProcess in Process.GetProcesses())
{
if (clsProcess.ProcessName.StartsWith(_processName))
{
TopLevelWindowHandle = clsProcess.Handle;
TopLevelWindowHandleHex = clsProcess.MainWindowHandle.ToString("x");
}
}
if (!String.IsNullOrEmpty(TopLevelWindowHandleHex))
return TopLevelWindowHandleHex;
else
throw new Exception($"Process: {_processName} cannot be found");
}
Called it from JS to get the hex of the top-level window handle, like this:
// needs: const path = require('path'); const { execFileSync } = require('child_process');
async getHex () {
  const pathToDir = path.join(process.cwd(), "features\\support\\ProcessUtility");
  const pathToExe = path.join(pathToDir, "GetWindowHandleHexByProcessName.exe");
  // execFileSync blocks until the exe exits and returns its stdout; it does not take a callback
  const data = execFileSync(pathToExe, [this.processName], {cwd: pathToDir, encoding: 'utf-8'});
  console.log("Data(hex): " + data);
  return data.toString().trim();
}
Used the hex to connect to the app like this,
async connectAppDriver(hex) {
console.log(`Hex received to connect to app using hex: ${hex}`);
const currentAppCapabilities=
{
"browserName": '',
"appTopLevelWindow": hex.trim(),
"platformName": "Windows",
"deviceName": "WindowsPC",
"newCommandTimeout": "120000"
};
const appDriver = await new Builder()
.usingServer("http://localhost:4723/wd/hub")
.withCapabilities(currentAppCapabilities)
.build();
await driver.startWithWebDriver(appDriver);
return driver;
}
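Solution:
In WebDriverJS (used by Selenium / Appium), use getDomAttribute instead of getAttribute. As the logs below show, getAttribute is sent as a POST to /execute/sync, which WinAppDriver does not implement, whereas getDomAttribute is sent as a plain GET for the attribute. It took several hours to find :(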
element.getAttribute("NativeWindowHandle")
POST: /session/270698D2-D93B-4E05-9FC5-3E5FBDA60ECA/execute/sync
Command not implemented: POST: /session/270698D2-D93B-4E05-9FC5-3E5FBDA60ECA/execute/sync
HTTP/1.1 501 Not Implemented
let topLevelWindowHandle = await element.getDomAttribute('NativeWindowHandle')
topLevelWindowHandle = parseInt(topLevelWindowHandle).toString(16)
GET /session/DE4C46E1-CC84-4F5D-88D2-35F56317E34D/element/42.3476754/attribute/NativeWindowHandle HTTP/1.1
HTTP/1.1 200 OK
{"sessionId":"DE4C46E1-CC84-4F5D-88D2-35F56317E34D","status":0,"value":"3476754"}
and topLevelWindowHandle now holds the hex value :)
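The conversion step on its own can be sketched as follows, using the handle value from the log above. `windowHandleToHex` is a hypothetical helper name; the explicit radix and the NaN check are additions for safety, not something the driver requires:

```javascript
// Hypothetical helper: converts the decimal NativeWindowHandle string
// returned by the driver into the hex form used for appTopLevelWindow.
function windowHandleToHex(nativeWindowHandle) {
  const handle = Number.parseInt(nativeWindowHandle, 10);
  if (Number.isNaN(handle)) {
    throw new Error(`Not a numeric window handle: ${nativeWindowHandle}`);
  }
  return handle.toString(16);
}

console.log(windowHandleToHex("3476754")); // 350d12
```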
Working on a Chrome Extension, which needs to integrate with IndexedDB. Trying to figure out how to use Dexie.JS. Found a bunch of samples. Those don't look too complicated. There is one specific example particularly interesting for exploring IndexedDB with Dexie at https://github.com/dfahlander/Dexie.js/blob/master/samples/open-existing-db/dump-databases.html
However, when I run the one above (the "dump utility"), it does not see my IndexedDB databases, telling me: There are no databases at the current origin.
From the developer tools Application tab, under Storage, I see my IndexedDB database.
Is this some sort of a permissions issue? Can any indexedDB database be accessed by any tab/user?
What should I be looking at?
Thank you
In chrome/opera, there is a non-standard API webkitGetDatabaseNames() that Dexie.js uses to retrieve the list of database names on current origin. For other browsers, Dexie emulates this API by keeping an up-to-date database of database-names for each origin, so:
For chromium browsers, Dexie.getDatabaseNames() will list all databases at current origin, but for non-chromium browsers, only databases created with Dexie will be shown.
If you need to dump the contents of each database, have a look at this issue, that basically gives:
interface TableDump {
table: string
rows: any[]
}
// Named exportDatabase / importDatabase because export and import are reserved words
function exportDatabase(db: Dexie): Promise<TableDump[]> {
  return db.transaction('r', db.tables, () => {
    return Promise.all(
      db.tables.map(table => table.toArray()
        .then(rows => ({table: table.name, rows: rows}))));
  });
}
function importDatabase(data: TableDump[], db: Dexie) {
  return db.transaction('rw', db.tables, () => {
    return Promise.all(data.map(t =>
      db.table(t.table).clear()
        .then(() => db.table(t.table).bulkAdd(t.rows))));
  });
}
Combine the functions with JSON.stringify() and JSON.parse() to fully serialize the data.
const db = new Dexie('mydb');
db.version(1).stores({friends: '++id,name,age'});
(async ()=>{
  // Export
  const allData = await exportDatabase(db);
  const serialized = JSON.stringify(allData);
  // Import (the keys must be quoted for JSON.parse to accept the string)
  const jsonToImport = '[{"table": "friends", "rows": [{"id":1,"name":"foo","age":33}]}]';
  const dataToImport = JSON.parse(jsonToImport);
  await importDatabase(dataToImport, db);
})();
A working example for dumping data to a JSON file using the current indexedDB API as described at:
https://developers.google.com/web/ilt/pwa/working-with-indexeddb
https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API/Using_IndexedDB
The snippet below will dump recent messages from a gmail account with the Offline Mode enabled in the gmail settings.
// indexedDB.open() takes only a name and a version; it returns a request object, not a promise
var openRequest = indexedDB.open("your_account@gmail.com_xdb", 109);
openRequest.onerror = (event) => {
  console.log("oh no!");
};
openRequest.onsuccess = (event) => {
  console.log(event);
  // the opened database is the result of the open request
  var db = event.target.result;
  var transaction = db.transaction(["item_messages"]);
  var objectStore = transaction.objectStore("item_messages");
  var allItemsRequest = objectStore.getAll();
  allItemsRequest.onsuccess = function () {
    var all_items = allItemsRequest.result;
    console.log(all_items);
    // save items as JSON file
    var bb = new Blob([JSON.stringify(all_items)], { type: "text/plain" });
    var a = document.createElement("a");
    a.download = "gmail_messages.json";
    a.href = window.URL.createObjectURL(bb);
    a.click();
  };
};
Running the code above from DevTools > Sources > Snippets will also let you set breakpoints and debug and inspect the objects.
Make sure you set the right version of the database as the second parameter to indexedDB.open(...). To peek at the value used by your browser the following code can be used:
indexedDB.databases().then(
function(r){
console.log(r);
}
);
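Where it is supported, indexedDB.databases() resolves to an array of objects with name and version properties, so the version can also be looked up programmatically instead of being read off the console. A minimal sketch, where `findDatabaseVersion` is a hypothetical helper and the array in the example is made-up data standing in for the real result:

```javascript
// Hypothetical helper: picks the version of the named database out of the
// array that indexedDB.databases() resolves to; undefined if not found.
function findDatabaseVersion(databases, name) {
  const entry = databases.find((database) => database.name === name);
  return entry ? entry.version : undefined;
}

// Made-up data in the shape returned by indexedDB.databases().
const listed = [
  { name: "mydb", version: 10 },
  { name: "other", version: 3 },
];
console.log(findDatabaseVersion(listed, "other")); // 3
```

In a browser, the result of indexedDB.databases() would be passed in place of `listed`, and the returned version used as the second argument to indexedDB.open(...).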