I want to use Node.js as a tool for website scraping. I have already implemented a script which logs me in to the system and parses some data from the page.
The steps are defined like this:
Open login page
Enter login data
Submit login form
Go to desired page
Grab and parse values from the page
Save data to file
Exit
Obviously, the problem is that my script has to log in every time, and I want to eliminate that. I want to implement some kind of cookie management system, where I can save cookies to a .txt file, and then during the next request I can load the cookies from the file and send them in the request headers.
This kind of cookie management system is not hard to implement, but the problem is how to access cookies in Node.js. The only way I found is through the request module's response object, where you can use something like this:
request.get({
    headers: requestHeaders,
    uri: user.getLoginUrl(),
    followRedirect: true,
    jar: jar,
    maxRedirects: 10
}, function(err, res, body) {
    if (err) {
        console.log('GET request failed, here is the error:');
        console.log(err);
        return; // don't touch res when the request failed
    }
    // Get cookies from the response, keeping only the name=value part of each
    var responseCookies = res.headers['set-cookie'];
    var requestCookies = '';
    for (var i = 0; i < responseCookies.length; i++) {
        var oneCookie = responseCookies[i].split(';');
        requestCookies = requestCookies + oneCookie[0] + ';';
    }
});
Now the content of the variable requestCookies can be saved to a .txt file and loaded the next time the script is executed; this way you can avoid logging the user in every time the script runs.
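For example, a minimal sketch with the built-in fs module (the cookies.txt file name is just an illustration):

var fs = require('fs');

// after a successful login, persist the cookie string
fs.writeFileSync('cookies.txt', requestCookies);

// on the next run, load the cookies and send them manually
var savedCookies = fs.readFileSync('cookies.txt', 'utf8');
requestHeaders['Cookie'] = savedCookies;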
Is this the right way, or is there a method which returns the cookies?
NOTE: If you want to set up your request object to automatically resend received cookies on every subsequent request, use the following lines during object creation:
var request = require("request");
request = request.defaults({jar: true}); // Send cookies on every subsequent request
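If you use a jar, previously saved cookies can also be loaded back into it before the first request. A sketch, assuming the cookie string was saved to cookies.txt as above and that user.getLoginUrl() returns the site's URL:

var fs = require('fs');
var jar = request.jar();

// load each saved "name=value" pair into the jar for the target site
fs.readFileSync('cookies.txt', 'utf8').split(';').forEach(function (c) {
    if (c.trim()) jar.setCookie(request.cookie(c.trim()), user.getLoginUrl());
});

// requests made with this jar now send the saved cookies automatically
request.get({uri: user.getLoginUrl(), jar: jar}, function (err, res, body) {
    // already logged in, no login form needed
});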
In my case, I've used the 'http' library like the following:
http.get(url, function(response) {
    variable = response.headers['set-cookie'];
});
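To send those cookies back on a later request with 'http', they can be passed in the headers option (a sketch; it assumes a Node version where http.get accepts an options object, and 'variable' is the set-cookie array captured above):

// keep only the "name=value" part of each cookie and join them
var cookieHeader = variable.map(function (c) {
    return c.split(';')[0];
}).join('; ');

http.get(url, { headers: { 'Cookie': cookieHeader } }, function (response) {
    // the server now sees the cookies from the earlier response
});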
This function gets a specific cookie value from a server response (in TypeScript):
function getResponseCookieValue(res: Response, param: string) {
    const setCookieHeader = res.headers.get('Set-Cookie');
    // match "param=value" at the start of the header or after a comma separator
    const parts = setCookieHeader?.match(new RegExp(`(^|, )${param}=([^;]+)`));
    return parts ? parts[2] : undefined;
}
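A usage sketch, assuming a server-side fetch implementation such as node-fetch (browsers hide the Set-Cookie header from scripts, so this only works server-side; the cookie name 'session' and the login call are illustrative), inside an async function:

const res = await fetch('https://example.com/login', { method: 'POST', body: loginForm });
const sessionId = getResponseCookieValue(res, 'session'); // undefined if not present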
I use Axios personally.
axios.request(options).then(function (response) {
    console.log(response.config.headers.Cookie);
}).catch(function (error) {
    console.error(error);
});
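Note that response.config.headers.Cookie shows the cookie that was sent with the request. To read the cookies the server set, you would look at the response headers instead (a sketch; the set-cookie header is only exposed when Axios runs in Node, not in browsers):

axios.request(options).then(function (response) {
    // cookies the server asked us to store
    var cookies = response.headers['set-cookie'];
    console.log(cookies);
});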
I wish to give the client feedback before a redirect occurs, so it can store it in session storage; then, when the cached page arrives from the service worker, the client checks session storage while the page is being rendered (not after!) and can handle the cached response accordingly.
I tried:
Adding a custom header to the response, but the client JavaScript can't read it for security reasons.
I have tried to edit the response directly. This only works for GET requests. Unfortunately, when I sync a POST request it returns a redirect, so it then looks like a normal GET. So I need some additional way of saying "this is a GET after a synced POST, tell the user the POST was saved", not just a normal "get the page".
postMessage, but it is very slow (see the side note below).
localStorage and sessionStorage are forbidden inside a service worker.
I could write to IndexedDB in the service worker, and then read from the client. But IndexedDB is such a confusing beast I really don't want to go down this route.
URL search parameters, but the redirect-and-URL-cleaning strategy became spaghetti code very quickly. The server would have to clean up URLs, and so would the service worker for the injected query args.
Is there any recommended mechanism for relaying information to a client that would suit this purpose?
Side note about postMessage being slow:
I currently use postMessage, but the problem is that it's really slow, and I think the reason is this:
Client attempts offline POST
The service worker serializes and stores it for when the client is online again. In the fetch handler it responds with the cached response. It also sends an async postMessage to tell the client it was saved. Unfortunately, if I await the postMessage, it errors out the fetch, so one has to leave it async, which means the postMessage happens only after the redirect
Client receives redirect response
Client redirects
Client paints the page
The cached page is shown
Only about two seconds later does the 'was saved' banner appear
Here's some code, if applicable:
Note: Originally the code would set a value in session storage when receiving the message (I assumed it would arrive before the redirect), and then pop it after the redirect at page render. However, because the postMessage was arriving so much later, I changed to performing the change on the page directly.
async function msgClientSyncSaved(event) {
    const data = {
        type: 'MSG_SYNC_SAVED',
    };
    const client = await getClient(event);
    client.postMessage(data);
}
// Applicable parts of runFetch:
async function runFetch(event) {
    const urlObj = new URL(event.request.url);
    if (utils.getIsMethodTx(event.request.method)) {
        // If a Sync URL
        const clonedRequest = event.request.clone();
        const response = await new strategies.NetworkOnlyStratey(log, event, cacheMutator).run();
        if (!response.isDefaultResponse && !response.isCachedResponse) {
            event.waitUntil(syncAllRequests());
            return response;
        } else {
            const [syncKey, syncValue] = settings.PWA_SYNC_POST_URL_PARAM.split("=");
            if (urlObj.searchParams.get(syncKey) === syncValue) {
                // A failed POST that requires SYNCING
                console.log(`SW: Sync later: ${event.request.method} to ${event.request.url}`);
                event.waitUntil(storeRequest(clonedRequest)); // no need to wait for this to finish before returning the response
                event.waitUntil(msgClientSyncSaved(event)); // <-- HERE: message the client
                // After a POST, return a redirect
                urlObj.searchParams.delete(syncKey);
                const redirectUrl = String(urlObj);
                // 302 means GET the redirect, 307 means POST to the redirect
                console.log('REDIRECT TO', redirectUrl);
                return Response.redirect(redirectUrl, 302);
            }
        }
    }
}

function handleFetch(event) {
    event.respondWith(runFetch(event));
}

self.addEventListener("fetch", handleFetch);
Receiver on the client side:
async function handleMessage(event) {
    switch (event.data.type) {
        case 'MSG_SYNC_SAVED':
            document.body.setAttribute('data-pwa-cached-page', 'true data-tx');
            break;
    }
}

navigator.serviceWorker.addEventListener("message", handleMessage);
I would like to use the output of my Node.js script. This is my code:
var fs = require('fs'); // File System

var rutaImagen = 'C:/Users/smontesc/Desktop/imagenes1/'; // Location of images

fs.readdir(rutaImagen, function(err, files) {
    if (err) { throw err; }
    var imageFile = getNewestFile(files, rutaImagen);
    // process imageFile here or pass it to a function...
    console.log(imageFile);
});

function getNewestFile(files, path) {
    var out = [];
    files.forEach(function(file) {
        var stats = fs.statSync(path + "/" + file);
        if (stats.isFile()) {
            out.push({"file": file, "mtime": stats.mtime.getTime()});
        }
    });
    out.sort(function(a, b) {
        return b.mtime - a.mtime;
    });
    return (out.length > 0) ? out[0].file : "";
}
The result is printed with console.log(imageFile); I want to use that result in my JavaScript project, like:
<script>
document.write(imageFile)
</script>
All this is to get the newest file created in a directory, because I can't do it directly in client-side JavaScript.
Thanks a lot
First, there are several fundamental things about how the client/server relationship between the browser and a web server works that we need to establish. That will then offer a framework for discussing how to solve your problem.
Images are displayed in a browser, not with document.write(), but by inserting an image tag in your document that points to the URL of a specific image.
For a web page to get a result from the server, that result either has to be embedded in the web page when it is originally fetched from the server, or the JavaScript in the page has to request the information from the server with an Ajax request. An Ajax request is an HTTP request: the JavaScript in your web page forms an HTTP request and sends it to your server; your server receives that request and sends back a response, which the JavaScript in your web page receives and can then do something with.
To implement something where your web page requests data from your back-end, you will have to have a web server on your back-end that can respond to Ajax requests sent from the web page. You cannot just run a script on your server and magically modify a web page displayed in a browser. Without the type of structure described in the previous points, your web page has no connection at all to your server: the web page can't directly reach your server's file system, and the server can't directly touch the displayed web page.
There are a number of possible schemes for implementing this type of connection. What I think would work best is to define an image URL that, when requested by any browser, returns the newest image in your particular directory on your server. Then you would just embed that particular URL in your web page, and any time that image was refreshed or displayed, your server would send the newest version. Your server probably also needs to make sure the browser does not cache that URL, by setting appropriate cache headers, so that it won't mistakenly display a previously cached version of the image.
The web page could look like this:
<img src='http://mycustomdomain.com/dimages/newest'>
Then, you'd set up a web server at mycustomdomain.com that is publicly accessible (from the open internet - you choose your own domain obviously) that has access to the desired images and you'd create a route on that web server that answers to the /dimages/newest request.
Using Express as your web server framework, this could look like this:
const app = require('express')();
const fs = require('fs');
const util = require('util');
const readdir = util.promisify(fs.readdir);
const stat = util.promisify(fs.stat);

// middleware to use in routes that you don't want any caching on
function nocache(req, res, next) {
    res.header('Cache-Control', 'private, no-cache, no-store, must-revalidate, proxy-revalidate');
    res.header('Expires', '-1');
    res.header('Pragma', 'no-cache');
    next();
}
const rutaImagen = 'C:/Users/smontesc/Desktop/imagenes1/'; // Location of images

// function to find the newest image
// returns a promise that resolves with the full path of the image
// or rejects with an error
async function getNewestImage(root) {
    let files = await readdir(root);
    let results = [];
    for (const f of files) {
        const fullPath = root + "/" + f;
        const stats = await stat(fullPath);
        if (stats.isFile()) {
            results.push({file: fullPath, mtime: stats.mtime.getTime()});
        }
    }
    results.sort(function(a, b) {
        return b.mtime - a.mtime;
    });
    return (results.length > 0) ? results[0].file : "";
}
// route for fetching that image (the nocache middleware runs before the handler)
app.get('/dimages/newest', nocache, function(req, res) {
    getNewestImage(rutaImagen).then(img => {
        res.sendFile(img, {cacheControl: false});
    }).catch(err => {
        console.log('getNewestImage() error', err);
        res.sendStatus(500);
    });
});

// start your web server
app.listen(80);
To be able to use that result in your JavaScript project, we definitely have to create an API with a particular route that responds with the imageFile. Then, in your JavaScript project, you can use XMLHttpRequest (XHR) objects or the Fetch API to talk to the server and get the result.
The core idea is we definitely need both server-side and client-side programming to perform that functionality.
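A minimal sketch of that idea, reusing the getNewestImage() function from above (the /api/newest-image route name and the 'newest' element id are arbitrary):

// server: return the newest file name as JSON
app.get('/api/newest-image', function (req, res) {
    getNewestImage(rutaImagen).then(img => {
        res.json({ imageFile: img });
    }).catch(err => res.sendStatus(500));
});

// client: fetch the result and put it in the page
fetch('/api/newest-image')
    .then(response => response.json())
    .then(data => {
        document.getElementById('newest').textContent = data.imageFile;
    });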
My goal is to fetch the status data from a UBNT radio (https://www.ubnt.com/) using an HTTP request. The web interface URL is formatted as http://192.168.0.120/status.cgi. Making the request requires an authentication cookie. Using a cookie copied from the existing web interface, I am able to successfully retrieve the data.
This is my current code using the Meteor framework.
var radioHost = "http://192.168.0.120";

HTTP.call("POST", radioHost + "/login.cgi", {
    headers: {
        "Content-Type": "multipart/form-data"
    },
    data: {
        username: "ubnt",
        password: "ubnt"
    }
}, (err, res) => {
    if (err) return console.log(err);
    var cookie = res.headers["set-cookie"][0];
    HTTP.call("GET", radioHost + "/status.cgi", {
        headers: {
            cookie
        }
    }, (err, res) => {
        if (err) return console.log("Error");
        console.log(res);
    });
});
The above code makes both requests successfully. However, the server responds to the first request with a faulty token (the "set-cookie" string). Using the cookie from the existing web interface, the response is correct.
Here is a library written in Python that I believe does a similar thing. https://github.com/zmousm/ubnt-nagios-plugins
I believe my problem lies within the HTTP request and the web api not cooperating with the username and password.
Thanks in advance for any help.
A direct POST request to a URL is not the recommended way. When you open a browser you don't log in directly; you fetch the page first and then submit the login form.
Not simulating this behavior may break certain sites, depending on how the server works.
So, to simulate what a real user/browser would do, make a GET request first and then the POST.
Also capture any cookies from the first GET request and pass them on to the next one, as in the sketch below.
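With Meteor's HTTP that could look roughly like this (a sketch, assuming the radio sets a session cookie on the initial GET; the cookie handling mirrors the code in the question):

// 1) GET the login page first, like a browser would
HTTP.call("GET", radioHost + "/login.cgi", (err, res) => {
    if (err) return console.log(err);
    var preCookie = res.headers["set-cookie"][0];
    // 2) then POST the credentials, sending back the cookie from the GET
    HTTP.call("POST", radioHost + "/login.cgi", {
        headers: {
            "Content-Type": "multipart/form-data",
            cookie: preCookie
        },
        data: { username: "ubnt", password: "ubnt" }
    }, (err, res) => {
        if (err) return console.log(err);
        // the session cookie from this response should now be usable for /status.cgi
    });
});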
So my scenario is: a user clicks a button on a web app; this triggers a server-side POST request to an internal (i.e. non-public) API sitting on another server on the same network, which should return a PDF that my server then proxies (pipes) back to the user.
I want to just proxy the PDF body content directly to the client without creating a tmp file.
I have this code, which works using the npm request module, but it does not feel right:
var pdfRequest = request(requestOptions);

pdfRequest.on('error', function (err) {
    utils.sendErrorResponse(500, 'PROBLEM PIPING PDF DOWNLOAD: ' + err, res);
});

pdfRequest.on('response', function (resp) {
    if (resp.statusCode === 200) {
        pdfRequest.pipe(res);
    } else {
        utils.sendErrorResponse(500, 'PROBLEM PIPING PDF DOWNLOAD: RAW RESP: ' + JSON.stringify(resp), res);
    }
});
Is this the correct way to pipe the PDF response?
Notes:
I need to check the status code to conditionally handle errors; the payload for the POST is contained in requestOptions (I know this part is all correct).
I would like to keep using the request module.
I definitely do not want to be creating any temp files.
If possible, I would also like to modify the Content-Disposition header to set a custom filename; I know how to do this without using pipes (a sketch of doing it while piping follows below).
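For reference, a sketch of setting the filename while still piping (the 'custom.pdf' name is a placeholder; headers must be set before the first byte is piped through):

pdfRequest.on('response', function (resp) {
    if (resp.statusCode === 200) {
        // set the download filename before streaming the body through
        res.setHeader('Content-Disposition', 'attachment; filename="custom.pdf"');
        pdfRequest.pipe(res);
    } else {
        utils.sendErrorResponse(500, 'PROBLEM PIPING PDF DOWNLOAD: RAW RESP: ' + JSON.stringify(resp), res);
    }
});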
Hi everybody.
My web application is based on asynchronous requests. A timer widget is working and updating its status every second via AJAX (yes, it is necessary).
I am sending my CSRF tokens with each AJAX request:
project_data.append(csrf_name_key,csrf_name_value);
project_data.append(csrf_value_key,csrf_value_value);
And in the response I am updating these global variables:
function setCSRF(response) {
    csrf_name_key = response.nameKey;
    csrf_name_value = response.name;
    csrf_value_key = response.valueKey;
    csrf_value_value = response.value;
}
Everything is generally fine. But if I fire another AJAX request, for example when I mark a task in the todo list as "done", it sometimes ends with an error, because I am sending that request before I have received the new tokens from the previous one.
I really don't know how to solve this problem. My first idea was to keep a stack-like array of 5 different tokens, but one HTTPS request = one pair of tokens, so I can't generate them in advance.
Maybe some kind of queue of AJAX requests, but then how to run them at the right time? I don't know.
My actual pseudo-solution is "if it failed, try again, at most 10 times":
if (e.target.response == "Failed CSRF check!") {
    if (failedAjax < 10) checkForSurvey();
    failedAjax++;
    return;
}
It generally works, but errors appear in the console and it is a very dirty solution.
I am using the Slim 3 microframework with its CSRF extension. I would really appreciate help with this interesting problem.
I will be very thankful,
Arthur
There are some options for you:
Use a stack of CSRF tokens inside your JavaScript code
Use a CSRF token which can be used more than once (not so secure)
Use a queue for the requests
A stack for the tokens
The Slim-Csrf middleware provides the functionality to generate these tokens; you just need to get them to the client side (a client-side sketch follows the token dump below).
You could build an API for getting 5 CSRF tokens; calling this API would itself consume one CSRF token.
Add an API route and generate the tokens there:
$app->get('/foo', function ($request, $response, $args) {
    // check valid csrf token
    $tokens = [];
    for ($i = 0; $i < 5; $i++) {
        $tokens[] = $this->csrf->generateToken();
    }
    return $response->withJson($tokens);
});
Now the CSRF tokens are valid throughout the whole user session.
Guard::generateToken() returns something like this:
array (size=2)
    'csrf_name' => string 'csrf58e669ff70da0' (length=17)
    'csrf_value' => string '52ac7689d3c6ea5d01889d711018f058' (length=32)
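On the client side, the response from that route could feed a simple token stack that is consulted before every AJAX call (a sketch; '/foo' matches the route above and the refill threshold is arbitrary):

var tokenStack = [];

// refill the stack with 5 fresh token pairs from the API above
function refillTokens() {
    return fetch('/foo')
        .then(function (response) { return response.json(); })
        .then(function (tokens) { tokenStack = tokenStack.concat(tokens); });
}

// take one token pair for the next request; refill when running low
function takeToken() {
    if (tokenStack.length <= 1) refillTokens();
    return tokenStack.pop(); // { csrf_name: ..., csrf_value: ... }
}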
A multi-use CSRF token
For that, Slim-Csrf already provides functionality with its token persistence mode. It can be enabled through the constructor or the Guard::setPersistentTokenMode(bool) method. In my example, I'm doing it with the method:
$container['csrf'] = function ($c) {
    $guard = new \Slim\Csrf\Guard;
    $guard->setPersistentTokenMode(true);
    return $guard;
};
Here is the PhpDoc for the persistentTokenMode attribute:
/**
 * Determines whether or not we should persist the token throughout the duration of the user's session.
 *
 * For security, Slim-Csrf will *always* reset the token if there is a validation error.
 * @var bool True to use the same token throughout the session (unless there is a validation error),
 * false to get a new token with each request.
 */
A queue for the AJAX requests
Add a queue for the requests; this could delay the execution of your requests, but there will always be a valid CSRF token.
The following should be seen as pseudocode, as I haven't tested it yet:
var requestQueue = [];
var isInRequest = false;
var csrfKey = '';   // should be set on page load, to have a valid token at the start
var csrfValue = '';

function newRequest(onSuccessCallback, data) { // add all parameters you need
    // add the request to the queue
    requestQueue.push(function() {
        isInRequest = true;
        // add the CSRF fields to data before sending
        $.ajax({
            data: data,
            url: "serverscript.xxx",
            success: function(response) {
                // update csrfKey & csrfValue from the response here
                isInRequest = false;
                tryExecuteNextRequest(); // try to execute the next request
                onSuccessCallback(response); // process the received data
            }
        });
    });
    tryExecuteNextRequest();
}

function tryExecuteNextRequest() {
    // only start the next request if none is currently running and the queue is not empty
    if (!isInRequest && requestQueue.length != 0) {
        var nextRequest = requestQueue.shift();
        nextRequest(); // execute next request
    }
}
Generally, you can simply eliminate CSRF by not accepting cookies for authentication.
You can save the authentication token in localStorage and send it as a header with each request.
This way you'll never need to worry about CSRF and its tokens.
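A minimal sketch of that pattern (the storage key and endpoint are just examples):

// after login: keep the token out of cookies
localStorage.setItem('authToken', token);

// on every request: send it as a header instead of a cookie
fetch('/api/endpoint', {
    headers: { 'Authorization': 'Bearer ' + localStorage.getItem('authToken') }
});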