Optimize crawler script on cronjob - javascript

I have about 66 million domains in a MySQL table. I need to run a crawler on all of them and set the row's count column to 1 when the crawl completes.
The crawler script is written in PHP, using a PHP crawler library. Here is the script:
set_time_limit(10000);
try {
    $strWebURL = $_POST['url'];
    $crawler = new MyCrawler();
    $crawler->setURL($strWebURL);
    $crawler->addContentTypeReceiveRule("#text/html#");
    $crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
    $crawler->enableCookieHandling(true);
    $crawler->setTrafficLimit(1000 * 1024);
    $crawler->setConnectionTimeout(10);
    // start of the table
    echo '<table border="1" style="margin-bottom:10px;width:100% !important;">';
    echo '<tr>';
    echo '<th>URL</th>';
    echo '<th>Status</th>';
    echo '<th>Size (bytes)</th>';
    echo '<th>Page</th>';
    echo '</tr>';
    $crawler->go();
    echo '</table>';
    $this->load->model('urls');
    $this->urls->incrementCount($_POST['id'], 'urls');
} catch (Exception $e) {
}
$this->urls->incrementCount() only updates the row, marking its count column = 1.
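In other words, incrementCount() boils down to a single UPDATE. A minimal sketch of that semantics (SQLite is used here only so the snippet is self-contained; the real model would run the same statement against MySQL, and only the `urls` table with its `id` and `count` columns is taken from the question):

```python
import sqlite3

# Stand-in for the question's `urls` table (id, domain, count).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE urls (id INTEGER PRIMARY KEY, domain TEXT, count INTEGER DEFAULT 0)")
conn.execute("INSERT INTO urls (domain) VALUES ('example.com')")

def increment_count(conn, row_id):
    """Mark one domain as crawled by setting its count column to 1."""
    conn.execute("UPDATE urls SET count = 1 WHERE id = ?", (row_id,))
    conn.commit()

increment_count(conn, 1)
```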
Because I have 66M domains, I need to run this as a cronjob on my server, and since cronjobs run on the command line I needed a headless browser, so I chose PhantomJS; the crawler doesn't work the way I want it to without the headless browser.
The first problem I faced was loading the domains from the MySQL DB and running the crawler script from a JS script.
I tried this: create a PHP script that returns the domains as JSON, load it from a JS file, loop over the domains, and run the crawler. But it didn't work very well and got stuck after some time.
The next thing I tried, which I'm still using, is a Python script that loads the domains directly from the MySQL DB and runs the PhantomJS script on each domain.
Here is the code:
import MySQLdb
import httplib
import sys
import subprocess
import json

args = sys.argv
db = MySQLdb.connect("HOST", "USER", "PW", "DB")
cursor = db.cursor()
# tablecount = args[1]
frm = args[1]
limit = args[2]
try:
    sql = "SELECT * FROM urls WHERE count = 0 LIMIT %s,%s" % (frm, limit)
    cursor.execute(sql)
    print "TOTAL RECORDS: " + str(cursor.rowcount)
    results = cursor.fetchall()
    count = 0
    for row in results:
        try:
            domain = row[1].lower()
            idd = row[0]
            command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain, idd)
            print command
            proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
            script_response = proc.stdout.read()
            print script_response
        except:
            print "error running crawler: " + domain
except:
    print "Error: unable to fetch data"
db.close()
It takes two arguments that set the LIMIT range for selecting domains from the database. It loops over the domains and runs this command using subprocess:
command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain, idd)
proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
script_response = proc.stdout.read()
print script_response
crawler2.js also takes two arguments: the first is the domain, and the second is the id used to update count = 1 when the crawl completes.
This is crawler2.js:
var args = require('system').args;
var address = '';
var id = '';
args.forEach(function (arg, i) {
    if (i == 1) {
        address = arg;
    }
    if (i == 2) {
        id = arg;
    }
});
address = "http://www." + address;
var page = require('webpage').create(),
    server = 'http://www.EXAMPLE.net/main/crawler',
    data = 'url=' + address + '&id=' + id;
console.log(data);
page.open(server, 'post', data, function (status) {
    if (status !== 'success') {
        console.log(address + ' Unable to post!');
    } else {
        console.log(address + ' : done');
    }
    phantom.exit();
});
It works well, but my script gets stuck after some time and needs to be restarted, and the log shows nothing wrong. I need to optimize this process and run the crawler as fast as I can; any help would be appreciated.

Web crawler programmer here. :)
Your Python script executes PhantomJS serially. You should do it in parallel: launch the Phantom process and move on; don't wait for it.
In PHP, that would look like this:
exec("/your_executable_path > /dev/null &");
Also, don't use PhantomJS if you don't need to. It renders everything, so each instance will need more than 50 MB of memory.
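Staying in the asker's Python driver, the same advice can be applied with a bounded worker pool plus a per-process timeout, which also addresses the "gets stuck and needs a restart" symptom: one hung PhantomJS process can no longer stall the whole batch. A sketch (Python 3 assumed; the real phantomjs command line would replace the harmless echo demos, and the worker count is a tunable, not a recommendation):

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def run_crawler(command, timeout=60):
    """Run one crawler process; kill it if it exceeds the timeout
    so a single hung domain cannot stall the whole batch."""
    try:
        proc = subprocess.run(command, shell=True, capture_output=True,
                              text=True, timeout=timeout)
        return proc.stdout
    except subprocess.TimeoutExpired:
        return "TIMEOUT: %s" % command

def crawl_all(commands, workers=20):
    # Threads are fine here: each thread just blocks on its subprocess,
    # so up to `workers` crawler processes run concurrently.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(run_crawler, commands))

# Demo with harmless shell commands instead of phantomjs:
results = crawl_all(["echo one", "echo two", "echo three"], workers=3)
```

pool.map preserves input order, so each result can still be matched back to its row id for the count = 1 update.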

Is it possible to post multiple Live sensor data to a single Google Spread Sheet?

I made a weighing machine using an M5Stack. The measured data is saved in a sheet called "logsheet", and another sheet (called "chartsheet") creates a gauge chart from that live data. Now I want to build 60 more weighing machines like this and display their live data in a single spreadsheet. Is it possible to create a sheet like this?
This is my Arduino code (only the postValues function):
void postValues(float a){
    DynamicJsonDocument doc(capacity);
    doc["sensor"] = a;
    serializeJson(doc, Serial);
    Serial.println("");
    serializeJson(doc, buffer, sizeof(buffer));
    HTTPClient http;
    Serial.println(http.begin(host));
    http.addHeader("Content-Type", "application/json");
    int status_code = http.POST((uint8_t*)buffer, strlen(buffer));
    Serial.printf("status_code=%d\r\n", status_code);
    if( status_code == 200 ){
        Stream* resp = http.getStreamPtr();
        DynamicJsonDocument json_response(255);
        deserializeJson(json_response, *resp);
        serializeJson(json_response, Serial);
        Serial.println("");
    }else{
        Serial.println(http.getString());
    }
    http.end();
}
And my Apps Script code:
function doPost(e) {
    var postjsonString = e.postData.getDataAsString();
    var postdata = JSON.parse(postjsonString);
    var spreadsheet = SpreadsheetApp.openById("####"); // sheet ID
    var logSheet = spreadsheet.getSheetByName("logSheet") || spreadsheet.insertSheet("logSheet");
    var chartSheet = spreadsheet.getSheetByName("chartSheet") || spreadsheet.insertSheet("chartSheet");
    var sensor_data = postdata.sensor;
    var date_time = Utilities.formatDate(new Date(), 'JST', 'yyyy年M月d日 H時m分s秒');
    var values = [date_time, sensor_data];
    logSheet.appendRow(values);
    // chart
    chartSheet.getRange("A1:B1").setValues([values]);
    var max = 25;
    var min = 0;
    var charts = chartSheet.getCharts();
    if (charts.length == 0) {
        var chart = chartSheet.newChart()
            .setChartType(Charts.ChartType.GAUGE)
            .addRange(chartSheet.getRange('B1'))
            .setPosition(3, 1, 0, 0)
            .setOption('height', 300)
            .setOption('width', 300)
            .setOption('title', 'Weighing Gauge')
            .setOption('max', max)
            .setOption('min', min)
            .build();
        chartSheet.insertChart(chart);
    }
}
First things to check
Make sure your web app is deployed to be accessible to "Anyone, even anonymous"
IMPORTANT! Try deploying the web app again with the Old Editor - the New IDE has issues with deployment of web apps and sometimes redeploying it in the old editor fixes things.
Sample web app
function doPost(e) {
    let file = SpreadsheetApp.getActive();
    let sheet = file.getSheetByName("Sheet1");
    let id = e.parameter.id;
    let weight1 = e.parameter.weight1;
    let weight2 = e.parameter.weight2;
    sheet.appendRow([id, weight1, weight2]);
}
Then accessing with a simple POST request with CURL from any machine:
curl -d "id=0001&weight1=100&weight2=200" -X POST https://script.google.com/a/egs-sbt095.eu/macros/s/[SCRIPTID]/exec
This produces a new row in the spreadsheet (result screenshot omitted).
This example uses cURL just to show that now the spreadsheet can receive data from anywhere. Here is another variation of the same POST request:
curl -X POST "https://script.google.com/a/egs-sbt095.eu/macros/s/[SCRIPTID]/exec?id=0001&weight1=100&weight2=200"
Finally, remember to make sure you are redeploying your web app every time you make changes!

server progress indicator using XMLHttpRequest [duplicate]

Question:
How to read and echo file size of uploaded file being written at server in real time without blocking at both server and client?
Context:
Progress of file upload being written to server from POST request made by fetch(), where body is set to Blob, File, TypedArray, or ArrayBuffer object.
The current implementation sets File object at body object passed to second parameter of fetch().
Requirement:
Read and echo to client the file size of file being written to filesystem at server as text/event-stream. Stop when all of the bytes, provided as a variable to the script as a query string parameter at GET request have been written. The read of the file currently takes place at a separate script environment, where GET call to script which should read file is made following POST to script which writes file to server.
Have not reached error handling of potential issue with write of file to server or read of file to get current file size, though that would be next step once echo of file size portion is completed.
Presently attempting to meet requirement using php. Though also interested in c, bash, nodejs, python; or other languages or approaches which can be used to perform same task.
The client side javascript portion is not an issue. I am simply not versed enough in php, one of the most common server-side languages used on the web, to implement the pattern without including parts which are not necessary.
Motivation:
Progress indicators for fetch?
Related:
Fetch with ReadableStream
Issues:
Getting
PHP Notice: Undefined index: HTTP_LAST_EVENT_ID in stream.php on line 7
at terminal.
Also, if substitute
while(file_exists($_GET["filename"])
&& filesize($_GET["filename"]) < intval($_GET["filesize"]))
for
while(true)
produces error at EventSource.
Without a sleep() call, the correct file size for a 3.3MB file, 3321824, was dispatched to the message event and printed at the console 61921, 26214, and 38093 times, respectively, when the same file was uploaded three times. The expected result is the size of the file as it is being written at
stream_copy_to_stream($input, $file);
rather than the size of the uploaded file object. Are fopen() or stream_copy_to_stream() blocking with respect to a different php process at stream.php?
Tried so far:
The php attempt is adapted from:
Beyond $_POST, $_GET and $_FILE: Working with Blob in JavaScript and PHP
Introduction to Server-Sent Events with PHP example
php
// can we merge `data.php`, `stream.php` to same file?
// can we use `STREAM_NOTIFY_PROGRESS`
// "Indicates current progress of the stream transfer
// in bytes_transferred and possibly bytes_max as well" to read bytes?
// do we need to call `stream_set_blocking` to `false`
// data.php
<?php
$filename = $_SERVER["HTTP_X_FILENAME"];
$input = fopen("php://input", "rb");
$file = fopen($filename, "wb");
stream_copy_to_stream($input, $file);
fclose($input);
fclose($file);
echo "upload of " . $filename . " successful";
?>
// stream.php
<?php
header("Content-Type: text/event-stream");
header("Cache-Control: no-cache");
header("Connection: keep-alive");
// `PHP Notice: Undefined index: HTTP_LAST_EVENT_ID in stream.php on line 7` ?
$lastId = $_SERVER["HTTP_LAST_EVENT_ID"] || 0;
if (isset($lastId) && !empty($lastId) && is_numeric($lastId)) {
    $lastId = intval($lastId);
    $lastId++;
}
// else {
//   $lastId = 0;
// }
// while current file size read is less than or equal to
// `$_GET["filesize"]` of `$_GET["filename"]`
// how to loop only when above is `true`
while (true) {
    $upload = $_GET["filename"];
    // is this the correct function and variable to use
    // to get written bytes of `stream_copy_to_stream($input, $file);`?
    $data = filesize($upload);
    // $data = $_GET["filename"] . " " . $_GET["filesize"];
    if ($data) {
        sendMessage($lastId, $data);
        $lastId++;
    }
    // else {
    //   close stream
    // }
    // not necessary here, though without it thousands of `message` events
    // will be dispatched
    // sleep(1);
}
function sendMessage($id, $data) {
    echo "id: $id\n";
    echo "data: $data\n\n";
    ob_flush();
    flush();
}
?>
javascript
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<input type="file">
<progress value="0" max="0" step="1"></progress>
<script>
const [url, stream, header] = ["data.php", "stream.php", "x-filename"];
const [input, progress, handleFile] = [
    document.querySelector("input[type=file]")
    , document.querySelector("progress")
    , (event) => {
        const [file] = input.files;
        const [{size:filesize, name:filename}, headers, params] = [
            file, new Headers(), new URLSearchParams()
        ];
        // set `filename`, `filesize` as search parameters for `stream` URL
        Object.entries({filename, filesize})
            .forEach(([...props]) => params.append.apply(params, props));
        // set header for `POST`
        headers.append(header, filename);
        // reset `progress.value`, set `progress.max` to `filesize`
        [progress.value, progress.max] = [0, filesize];
        const [request, source] = [
            new Request(url, {
                method: "POST", headers: headers, body: file
            })
            // https://stackoverflow.com/a/42330433/
            , new EventSource(`${stream}?${params.toString()}`)
        ];
        source.addEventListener("message", (e) => {
            // update `progress` here,
            // call `.close()` when `e.data === filesize`
            // `progress.value = e.data`, should be this simple
            console.log(e.data, e.lastEventId);
        }, true);
        source.addEventListener("open", (e) => {
            console.log("fetch upload progress open");
        }, true);
        source.addEventListener("error", (e) => {
            console.error("fetch upload progress error");
        }, true);
        // sanity check for tests,
        // we don't need `source` when `e.data === filesize`;
        // we could call `.close()` within `message` event handler
        setTimeout(() => source.close(), 30000);
        // we don't need `source` to be in `Promise` chain,
        // though we could resolve if `e.data === filesize`
        // before `response`, then wait for `.text()`; etc.
        // TODO: if and where to merge or branch `EventSource`,
        // `fetch` to single or two `Promise` chains
        const upload = fetch(request);
        upload
            .then(response => response.text())
            .then(res => console.log(res))
            .catch(err => console.error(err));
    }
];
input.addEventListener("change", handleFile, true);
</script>
</body>
</html>
You need clearstatcache() to get the real file size. With a few other bits fixed, your stream.php may look like the following:
<?php
header("Content-Type: text/event-stream");
header("Cache-Control: no-cache");
header("Connection: keep-alive");
// Check whether the header has been sent, to avoid `PHP Notice: Undefined index: HTTP_LAST_EVENT_ID in stream.php`
// php 7+
//$lastId = $_SERVER["HTTP_LAST_EVENT_ID"] ?? 0;
// php < 7
$lastId = isset($_SERVER["HTTP_LAST_EVENT_ID"]) ? intval($_SERVER["HTTP_LAST_EVENT_ID"]) : 0;
$upload = $_GET["filename"];
$data = 0;
// if the file already exists, its initial size can be bigger than the new one, so we need to ignore it
$wasLess = $lastId != 0;
while ($data < $_GET["filesize"] || !$wasLess) {
    // system calls are expensive and are cached on the assumption that file stats rarely change,
    // so we clear the cache to get the most up-to-date data
    clearstatcache(true, $upload);
    $data = filesize($upload);
    $wasLess |= $data < $_GET["filesize"];
    // don't send a stale filesize
    if ($wasLess) {
        sendMessage($lastId, $data);
        $lastId++;
    }
    // not necessary here, though without it thousands of `message` events will be dispatched;
    // millions on a poor connection with large files. 1 second might be too much, but 50 messages a second should be okay
    //sleep(1);
    usleep(20000);
}
function sendMessage($id, $data)
{
    echo "id: $id\n";
    echo "data: $data\n\n";
    ob_flush();
    // no need to flush(). It adds the content length of the chunk to the stream
    // flush();
}
A few caveats:
Security, or rather the lack of it. As I understand it, this is a proof of concept and security is the least of your concerns, yet the disclaimer should be there: this approach is fundamentally flawed and should be used only if you don't care about DoS attacks or about information on your files leaking out.
CPU. Without usleep the script will consume 100% of a single core. With a long sleep you risk the whole file being uploaded within a single iteration, so the exit condition is never met. If you are testing locally, remove the usleep entirely, since uploading MBs locally is a matter of milliseconds.
Open connections. Both apache and nginx/fpm have a finite number of php processes to serve requests, and a single file upload occupies two of them for the duration of the upload. With slow bandwidth or forged requests this can be quite long, and the web server may start rejecting requests.
Client-side part. You need to analyse the response and stop listening to the events once the file is fully uploaded.
EDIT:
To make it more or less production-friendly, you will need in-memory storage such as redis or memcache to store the file metadata.
When making the POST request, add a unique token which identifies the file, plus the file size.
In your javascript:
const fileId = Math.random().toString(36).substr(2); // or anything more unique
...
const [request, source] = [
    new Request(`${url}?fileId=${fileId}&size=${filesize}`, {
        method: "POST", headers: headers, body: file
    })
    , new EventSource(`${stream}?fileId=${fileId}`)
];
....
In data.php, register the token and report progress in chunks:
....
$fileId = $_GET['fileId'];
$fileSize = $_GET['size'];
setUnique($fileId, 0, $fileSize);
while ($uploaded = stream_copy_to_stream($input, $file, 1024)) {
    updateProgress($fileId, $uploaded);
}
....
....
/**
 * Check if Id is unique, and store processed as 0 and full_size as $size
 * Set a reasonable TTL for the key, e.g. 1hr
 *
 * @param string $id
 * @param int $processed
 * @param int $size
 * @throws Exception if id is not unique
 */
function setUnique($id, $processed, $size) {
    // implement with your storage of choice
}
/**
 * Updates uploaded size for the given file
 *
 * @param string $id
 * @param int $processed
 */
function updateProgress($id, $processed) {
    // implement with your storage of choice
}
So your stream.php doesn't need to hit the disk at all, and can sleep for as long as is acceptable for the UX:
....
list($progress, $size) = getProgress('non_existing_key_to_init_default_values');
$lastId = 0;
while ($progress < $size) {
    list($progress, $size) = getProgress($_GET["fileId"]);
    sendMessage($lastId, $progress);
    $lastId++;
    sleep(1);
}
.....
/**
 * Get progress of the file upload.
 * If the id is not there yet, returns [0, PHP_INT_MAX]
 *
 * @param $id
 * @return array [$bytesUploaded, $fileSize]
 */
function getProgress($id) {
    // implement with your storage of choice
}
The problem of two open connections cannot be solved unless you give up EventSource for good old polling. The response time of stream.php without the loop is a matter of milliseconds, and it is quite wasteful to keep the connection open all the time unless you need hundreds of updates a second.
You need to break the file into chunks with javascript and send those chunks. When a chunk is uploaded, you know exactly how much data was sent. This is the only way, and by the way, it is not hard:
// assumes file.startByte / file.stopByte were initialised (e.g. to 0 and 100000) before the first chunk
file.startByte += 100000;
file.stopByte += 100000;
var reader = new FileReader();
reader.onloadend = function (evt) {
    data.blob = btoa(evt.target.result);
    // do the upload here (I do it with jQuery ajax)
};
var blob = file.slice(file.startByte, file.stopByte);
reader.readAsBinaryString(blob);


Chat using PHP, websockets and MySQL

I'm trying to write a chat using this PHP class: https://github.com/ghedipunk/PHP-Websockets. Right now the client-side JS connects to the server and sends a request every 3 seconds; on every request the server checks the database for messages and sends them to the client. I don't think it's good that the client sends a request every 3 seconds, and I wonder how to make the server send a message to the client only when there is a new message in the database.
This is my extended process function:
protected function process($user, $message) {
    $databaseHandler = mysqli_connect('blahblah');
    $decoded = json_decode($message, true);
    $chat_id = $decoded['chatid'];
    $limit = $decoded['limit'];
    $user_id = filterString($decoded['userid'], $databaseHandler);
    $SQLQuery = "SELECT chat_messages.message, users.name FROM chat_messages JOIN users ON chat_messages.user_id=users.user_id WHERE chat_messages.chat_id = '$chat_id' ORDER BY chat_messages.message_id DESC LIMIT $limit,5;";
    $SQLResult = mysqli_query($databaseHandler, $SQLQuery);
    $i = 0;
    $arr = array();
    while ($row = mysqli_fetch_assoc($SQLResult)) {
        $i++;
        $arr[$i] = $row['name'] . ': ' . $row['message'];
    }
    $this->send($user, json_encode($arr));
}
This is the JS function:
function refreshchat() {
    try {
        socket.send('{"token":"' + token + '","userid":' + userid + ',"chatid":' + chatid + ',"limit":' + limit + '}');
    } catch (ex) {
        console.log(ex);
    }
    setTimeout(refreshchat, 3000);
}
I skipped the authorization stuff in the PHP for clarity.
Thank you in advance!
Well, you are using the setTimeout JavaScript function, which only schedules a single run. You can use the setInterval function instead, which runs repeatedly. (http://www.w3schools.com/jsref/met_win_setinterval.asp)
It seems you want a feature that lets the server inform the client that there is a new chat message. I think using a timer in JavaScript is a good option. You can also check the Push Notifications API: https://developers.google.com/web/updates/2015/03/push-notifications-on-the-open-web.
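Beyond those answers, the usual way to avoid polling with a WebSocket server like this is to push: when process() stores a new message, broadcast it right then to every socket joined to that chat, instead of clients asking every 3 seconds. A language-agnostic sketch of the bookkeeping (written in Python for brevity; ChatHub and its method names are invented for illustration, and the DB write is elided):

```python
class ChatHub:
    """Tracks connected clients per chat and pushes new messages to them,
    instead of clients polling the database on a timer."""
    def __init__(self):
        self.clients = {}  # chat_id -> set of per-client send callables

    def join(self, chat_id, send):
        """Register a client's send function for one chat room."""
        self.clients.setdefault(chat_id, set()).add(send)

    def post(self, chat_id, user, text):
        """Store the message (DB write would go here), then push it
        immediately to every client in the room."""
        for send in self.clients.get(chat_id, ()):
            send("%s: %s" % (user, text))

# Demo: a list stands in for one connected client's socket.
hub = ChatHub()
inbox = []
hub.join(1, inbox.append)
hub.post(1, "alice", "hi")
```

In the PHP-Websockets class, join() corresponds to the connect handler and post() to the code path that inserts a row into chat_messages.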

Batch File > Javascript > WinSCP > Check if file exists

I have a batch file that launches a .js file which, via WinSCP, checks whether a file exists on the server and reports the result back to the batch file.
The problem is: it always returns "not found", and I cannot figure out why. I am unsure how to use a wildcard in this scenario.
The batch file looks like this:
cscript /nologo file.js
if errorlevel 1 goto notfound
exit
:notfound
(another script to copy a file over)
Only one file can exist on the server at once. So every ten minutes this batch file runs, checks if there is a file, and if not, copies one over.
The file.js:
// Configuration
// Remote file to search for
var FILEPATH = "../filepath/TSS*";
// Session to connect to
var SESSION = "mysession@someplace.come";
// Path to winscp.com
var WINSCP = "c:\\program files (x86)\\winscp\\winscp.com";

var filesys = WScript.CreateObject("Scripting.FileSystemObject");
var shell = WScript.CreateObject("WScript.Shell");
var logfilepath = filesys.GetSpecialFolder(2) + "\\" + filesys.GetTempName() + ".xml";
var p = FILEPATH.lastIndexOf('/');
var path = FILEPATH.substring(0, p);
var filename = FILEPATH.substring(p + 1);
var exec;

// run winscp to check for file existence
exec = shell.Exec("\"" + WINSCP + "\" /log=\"" + logfilepath + "\"");
exec.StdIn.Write(
    "option batch abort\n" +
    "open \"" + SESSION + "\"\n" +
    "ls \"" + path + "\"\n" +
    "exit\n");

// wait until the script finishes
while (exec.Status == 0)
{
    WScript.Sleep(100);
    WScript.Echo(exec.StdOut.ReadAll());
}

if (exec.ExitCode != 0)
{
    WScript.Echo("Error checking for file existence");
    WScript.Quit(1);
}

// look for log file
var logfile = filesys.GetFile(logfilepath);
if (logfile == null)
{
    WScript.Echo("Cannot find log file");
    WScript.Quit(1);
}

// parse XML log file
var doc = new ActiveXObject("MSXML2.DOMDocument");
doc.async = false;
doc.load(logfilepath);
doc.setProperty("SelectionNamespaces",
    "xmlns:w='http://winscp.net/schema/session/1.0'");
var nodes = doc.selectNodes("//w:file/w:filename[@value='" + filename + "']");
if (nodes.length > 0)
{
    WScript.Echo("File found");
    // signalize file existence to calling process;
    // you can also continue with processing (e.g. downloading the file)
    // directly from the script here
    WScript.Quit(0);
}
else
{
    WScript.Echo("File not found");
    WScript.Quit(1);
}
On line 4 it says:
var FILEPATH = "../filepath/TSS*";
That star is what is giving me issues, I think. I need to look for a file which starts with TSS but has a timestamp tacked on the end, so I need to use a wildcard after TSS.
So what I need help with is: making this process return true if any file matching TSS* exists.
Any help would be much appreciated.
EDIT:
var nodes = doc.selectNodes("//w:file/w:filename[starts-with(@value, 'TSS')]");
This code does not seem to work. If it did, it seems like it would solve all my problems.
You need to correct the XPath expression in the var nodes... line.
Try something like this:
doc.setProperty("SelectionLanguage", "XPath"); // added in edit
var nodes = doc.selectNodes("//w:file/w:filename[starts-with(@value, '" + filename + "')]");
and delete the asterisk from FILEPATH.
Note: the first line is required in order to use XPath as the query language rather than the default (and old) XSLPattern, which doesn't support methods such as starts-with or contains.
SelectionLanguage Property (MSDN).
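If the MSXML XPath route stays troublesome, the same "does any listed filename start with TSS" check can be done by walking the parsed log directly. A sketch using Python's standard library purely to illustrate the idea (the XML here is a minimal stand-in mimicking the namespace and `filename value=` shape of WinSCP's session log, not a verbatim log):

```python
import xml.etree.ElementTree as ET

# Minimal stand-in for WinSCP's XML session log (namespace as in the question).
LOG = """<session xmlns="http://winscp.net/schema/session/1.0">
  <ls>
    <file><filename value="TSS20240101.txt"/></file>
    <file><filename value="other.txt"/></file>
  </ls>
</session>"""

W = "{http://winscp.net/schema/session/1.0}"

def file_exists(xml_text, prefix):
    """Return True if any listed filename starts with the given prefix,
    i.e. the prefix plus a trailing wildcard."""
    root = ET.fromstring(xml_text)
    return any(el.get("value", "").startswith(prefix)
               for el in root.iter(W + "filename"))
```

The prefix test plays the role of the `TSS*` wildcard, so no XPath starts-with support is needed.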
You can use the stat command. You can even inline the WinSCP script into the batch file:
@echo off
set REMOTE_PATH=/home/user/test.txt
winscp.com /command ^
    "option batch abort" ^
    "open mysession" ^
    "stat %REMOTE_PATH%" ^
    "exit"
if errorlevel 1 goto error
echo File %REMOTE_PATH% exists
rem Do something
exit 0
:error
echo Error or file %REMOTE_PATH% not exists
exit 1
An alternative is using Session.FileExists from the WinSCP .NET assembly.
For further details, see the WinSCP article Checking file existence.
