server progress indicator using XMLHttpRequest [duplicate] - javascript

Question:
How to read and echo file size of uploaded file being written at server in real time without blocking at both server and client?
Context:
Progress of a file upload being written to the server from a POST request made by fetch(), where the body is set to a Blob, File, TypedArray, or ArrayBuffer object.
The current implementation sets a File object as the body of the object passed as the second parameter to fetch().
Requirement:
Read and echo to the client the file size of the file being written to the filesystem at the server as text/event-stream. Stop when all of the bytes, provided to the script as a query string parameter of the GET request, have been written. The read of the file currently takes place in a separate script environment, where the GET call to the script which should read the file is made following the POST to the script which writes the file to the server.
Have not yet reached error handling of potential issues with the write of the file to the server or the read of the file to get the current file size, though that would be the next step once the echo of the file size portion is completed.
Presently attempting to meet the requirement using PHP, though also interested in C, bash, Node.js, Python, or other languages or approaches which can be used to perform the same task.
The client-side JavaScript portion is not an issue. I am simply not versed enough in PHP, one of the most common server-side languages used on the world wide web, to implement the pattern without including parts which are not necessary.
Motivation:
Progress indicators for fetch?
Related:
Fetch with ReadableStream
Issues:
Getting
PHP Notice: Undefined index: HTTP_LAST_EVENT_ID in stream.php on line 7
at the terminal.
Also, substituting
while(file_exists($_GET["filename"])
&& filesize($_GET["filename"]) < intval($_GET["filesize"]))
for
while(true)
produces an error at EventSource.
Without the sleep() call, the correct file size for a 3.3MB file, 3321824, was dispatched to the message event and printed at the console 61921, 26214, and 38093 times, respectively, when the same file was uploaded three times. The expected result is the file size of the file as it is being written at
stream_copy_to_stream($input, $file);
instead of the file size of the uploaded file object. Are fopen() or stream_copy_to_stream() blocking with respect to a different PHP process at stream.php?
Tried so far:
The PHP is attributed to
Beyond $_POST, $_GET and $_FILE: Working with Blob in JavaScript and PHP
Introduction to Server-Sent Events with PHP example
php
// can we merge `data.php`, `stream.php` to same file?
// can we use `STREAM_NOTIFY_PROGRESS`
// "Indicates current progress of the stream transfer
// in bytes_transferred and possibly bytes_max as well" to read bytes?
// do we need to call `stream_set_blocking` to `false`
// data.php
<?php
$filename = $_SERVER["HTTP_X_FILENAME"];
$input = fopen("php://input", "rb");
$file = fopen($filename, "wb");
stream_copy_to_stream($input, $file);
fclose($input);
fclose($file);
echo "upload of " . $filename . " successful";
?>
// stream.php
<?php
header("Content-Type: text/event-stream");
header("Cache-Control: no-cache");
header("Connection: keep-alive");
// `PHP Notice: Undefined index: HTTP_LAST_EVENT_ID in stream.php on line 7` ?
$lastId = $_SERVER["HTTP_LAST_EVENT_ID"] || 0;
if (isset($lastId) && !empty($lastId) && is_numeric($lastId)) {
$lastId = intval($lastId);
$lastId++;
}
// else {
// $lastId = 0;
// }
// while current file size read is less than or equal to
// `$_GET["filesize"]` of `$_GET["filename"]`
// how to loop only when above is `true`
while (true) {
$upload = $_GET["filename"];
// is this the correct function and variable to use
// to get written bytes of `stream_copy_to_stream($input, $file);`?
$data = filesize($upload);
// $data = $_GET["filename"] . " " . $_GET["filesize"];
if ($data) {
sendMessage($lastId, $data);
$lastId++;
}
// else {
// close stream
// }
// not necessary here, though without it thousands of `message` events
// will be dispatched
// sleep(1);
}
function sendMessage($id, $data) {
echo "id: $id\n";
echo "data: $data\n\n";
ob_flush();
flush();
}
?>
javascript
<!DOCTYPE html>
<html>
<head>
</head>
<body>
<input type="file">
<progress value="0" max="0" step="1"></progress>
<script>
const [url, stream, header] = ["data.php", "stream.php", "x-filename"];
const [input, progress, handleFile] = [
document.querySelector("input[type=file]")
, document.querySelector("progress")
, (event) => {
const [file] = input.files;
const [{size:filesize, name:filename}, headers, params] = [
file, new Headers(), new URLSearchParams()
];
// set `filename`, `filesize` as search parameters for `stream` URL
Object.entries({filename, filesize})
.forEach(([...props]) => params.append.apply(params, props));
// set header for `POST`
headers.append(header, filename);
// reset `progress.value` set `progress.max` to `filesize`
[progress.value, progress.max] = [0, filesize];
const [request, source] = [
new Request(url, {
method:"POST", headers:headers, body:file
})
// https://stackoverflow.com/a/42330433/
, new EventSource(`${stream}?${params.toString()}`)
];
source.addEventListener("message", (e) => {
// update `progress` here,
// call `.close()` when `e.data === filesize`
// `progress.value = e.data`, should be this simple
console.log(e.data, e.lastEventId);
}, true);
source.addEventListener("open", (e) => {
console.log("fetch upload progress open");
}, true);
source.addEventListener("error", (e) => {
console.error("fetch upload progress error");
}, true);
// sanity check for tests,
// we don't need `source` when `e.data === filesize`;
// we could call `.close()` within `message` event handler
setTimeout(() => source.close(), 30000);
// we don't need `source` to be in `Promise` chain,
// though we could resolve if `e.data === filesize`
// before `response`, then wait for `.text()`; etc.
// TODO: if and where to merge or branch `EventSource`,
// `fetch` to single or two `Promise` chains
const upload = fetch(request);
upload
.then(response => response.text())
.then(res => console.log(res))
.catch(err => console.error(err));
}
];
input.addEventListener("change", handleFile, true);
</script>
</body>
</html>

You need clearstatcache() to get the real file size. With a few other bits fixed, your stream.php may look like the following:
<?php
header("Content-Type: text/event-stream");
header("Cache-Control: no-cache");
header("Connection: keep-alive");
// Check if the header's been sent to avoid `PHP Notice: Undefined index: HTTP_LAST_EVENT_ID in stream.php on line `
// php 7+
//$lastId = $_SERVER["HTTP_LAST_EVENT_ID"] ?? 0;
// php < 7
$lastId = isset($_SERVER["HTTP_LAST_EVENT_ID"]) ? intval($_SERVER["HTTP_LAST_EVENT_ID"]) : 0;
$upload = $_GET["filename"];
$data = 0;
// if file already exists, its initial size can be bigger than the new one, so we need to ignore it
$wasLess = $lastId != 0;
while ($data < $_GET["filesize"] || !$wasLess) {
// system calls are expensive and are being cached with assumption that in most cases file stats do not change often
// so we clear cache to get most up to date data
clearstatcache(true, $upload);
$data = filesize($upload);
$wasLess |= $data < $_GET["filesize"];
// don't send stale filesize
if ($wasLess) {
sendMessage($lastId, $data);
$lastId++;
}
// not necessary here, though without it thousands of `message` events will be dispatched
//sleep(1);
// millions on poor connection and large files. 1 second might be too much, but 50 messages a second must be okay
usleep(20000);
}
function sendMessage($id, $data)
{
echo "id: $id\n";
echo "data: $data\n\n";
ob_flush();
// no need to flush(). It adds content length of the chunk to the stream
// flush();
}
A few caveats:
Security. I mean the lack of it. As I understand it, this is a proof of concept and security is the least of your concerns, yet the disclaimer should be there. This approach is fundamentally flawed, and should be used only if you don't care about DOS attacks or about information on your files leaking out.
CPU. Without usleep the script will consume 100% of a single core. With a long sleep you are at risk of uploading the whole file within a single iteration and the exit condition will never be met. If you are testing it locally, the usleep should be removed completely, since it is a matter of milliseconds to upload MBs locally.
Open connections. Both apache and nginx/fpm have a finite number of PHP processes that can serve requests. A single file upload will take two of them for the time required to upload the file. With slow bandwidth or forged requests, this time can be quite long, and the web server may start to reject requests.
Clientside part. You need to analyse the response and finally stop listening to the events when the file is fully uploaded.
EDIT:
To make it more or less production friendly, you will need an in-memory storage like Redis or Memcached to store the file metadata.
When making the POST request, add a unique token which identifies the file, along with the file size.
In your javascript:
const fileId = Math.random().toString(36).substr(2); // or anything more unique
...
const [request, source] = [
new Request(`${url}?fileId=${fileId}&size=${filesize}`, {
method:"POST", headers:headers, body:file
})
, new EventSource(`${stream}?fileId=${fileId}`)
];
....
In data.php register the token and report progress by chunks:
....
$fileId = $_GET['fileId'];
$fileSize = $_GET['size'];
setUnique($fileId, $fileSize);
$uploaded = 0;
while ($bytes = stream_copy_to_stream($input, $file, 1024)) {
    $uploaded += $bytes;
    updateProgress($fileId, $uploaded);
}
....
/**
* Check if Id is unique, and store processed as 0, and full_size as $size
* Set reasonable TTL for the key, e.g. 1hr
*
* @param string $id
* @param int $size
* @throws Exception if id is not unique
*/
function setUnique($id, $size) {
// implement with your storage of choice
}
/**
* Updates uploaded size for the given file
*
* @param string $id
* @param int $processed
*/
function updateProgress($id, $processed) {
// implement with your storage of choice
}
So your stream.php doesn't need to hit the disk at all, and can sleep as long as is acceptable for the UX:
....
list($progress, $size) = getProgress('non_existing_key_to_init_default_values');
$lastId = 0;
while ($progress < $size) {
list($progress, $size) = getProgress($_GET["fileId"]);
sendMessage($lastId, $progress);
$lastId++;
sleep(1);
}
.....
/**
* Get progress of the file upload.
* If id is not there yet, returns [0, PHP_INT_MAX]
*
* @param $id
* @return array $bytesUploaded, $fileSize
*/
function getProgress($id) {
// implement with your storage of choice
}
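By way of illustration, here is a minimal sketch of the three storage helpers backed by Redis through the phpredis extension; the host, port, key names, and the one-hour TTL are assumptions for the example, not part of the original answer:
<?php
// Sketch of the storage helpers using the phpredis extension.
// Host, port, key prefix and TTL are illustrative assumptions.
function redisClient() {
    static $redis = null;
    if ($redis === null) {
        $redis = new Redis();
        $redis->connect("127.0.0.1", 6379);
    }
    return $redis;
}

function setUnique($id, $size) {
    // NX makes the write fail if the key already exists, i.e. the id is taken
    if (!redisClient()->set("upload:$id:size", $size, ["nx", "ex" => 3600])) {
        throw new Exception("id is not unique");
    }
    redisClient()->set("upload:$id:progress", 0, ["ex" => 3600]);
}

function updateProgress($id, $processed) {
    redisClient()->set("upload:$id:progress", $processed, ["ex" => 3600]);
}

function getProgress($id) {
    $size = redisClient()->get("upload:$id:size");
    if ($size === false) {
        // id not registered yet: report "nothing uploaded of an unknown size"
        return [0, PHP_INT_MAX];
    }
    return [(int) redisClient()->get("upload:$id:progress"), (int) $size];
}
Memcached would work the same way; only these three helpers would change.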
The problem with 2 open connections cannot be solved unless you give up EventSource for good old polling. The response time of stream.php without the loop is a matter of milliseconds, and it is quite wasteful to keep the connection open all the time, unless you need hundreds of updates a second.
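For the polling route, the same getProgress() helper can back a loop-free endpoint that the client requests on a timer; the file name progress.php and the JSON shape are made up for this sketch:
<?php
// progress.php (hypothetical): answers one poll and exits, no long-lived connection.
// Assumes the getProgress() helper above is available here (via include or autoload).
header("Content-Type: application/json");
header("Cache-Control: no-cache");
list($progress, $size) = getProgress($_GET["fileId"]);
echo json_encode(["uploaded" => $progress, "size" => $size]);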

You need to break the file into chunks with JavaScript and send those chunks. When a chunk is uploaded you know exactly how much data was sent.
This is the only way, and by the way it is not hard.
file.startByte += 100000;
file.stopByte += 100000;
var reader = new FileReader();
reader.onloadend = function(evt) {
data.blob = btoa(evt.target.result);
/// Do upload here, I do with jQuery ajax
}
var blob = file.slice(file.startByte, file.stopByte);
reader.readAsBinaryString(blob);
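On the server side, the receiving script only has to decode each chunk and append it; the byte count it reports back is exact progress. A minimal sketch, assuming the chunk arrives as base64 in a blob field alongside a filename field (both field names and the script name are made up here):
<?php
// chunk_upload.php (hypothetical): append one decoded chunk, report total bytes stored.
$name = basename($_POST["filename"]);  // never trust a client-supplied path
$path = __DIR__ . "/uploads/" . $name;
file_put_contents($path, base64_decode($_POST["blob"]), FILE_APPEND);
clearstatcache(true, $path);           // filesize() results are cached per request
echo json_encode(["received" => filesize($path)]);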

Related

Executing NodeJS script From PHP (Laravel), not finding Node Modules in production

I have this function in my API Controller that executes a node file and it works normally on localhost, here is the script.
GifApiController.php
public function saveAndDissectGif(Request $request) {
$gif = $request->file;
//save the file temporarily in the asset-uploader directory
Storage::disk('roblogif')->put('/',$gif);
//dissect the gif into frames and get the frame count
$output = $this->executeNodeFile('dissectGif.js');
$frame_count = $output[0];
//there's a 6.5 second delay between each frame, divide by 60 to get it in minutes
//assume that 20 percent of the frames will fail, which means 20 percent will have a 1 minute delay
$time_in_minutes = $frame_count * ( 6.5 / 60 ) + ( ( $frame_count * 0.2 ) );
return ceil($time_in_minutes);
}
private function executeNodeFile($javascript_file_name) {
exec("cd ".__DIR__."; cd ../../asset-uploader; node ".$javascript_file_name, $output, $err);
if($err)
return $err;
return $output;
}
dissectGif.js gets all the frames from a gif and saves them in a folder (using the gif-frames library). To get the output from the file, I am console logging it.
dissectGif.js
const fs = require("fs")
const gifFrames = require('gif-frames');
async function startApp() {
// get the gif
let gif = fs.readdirSync("./gif");
gif = gif[Object.keys(gif)[0]];
// dissect gif into frames
await gifFrames({ url: './gif/'+gif, frames: 'all', outputType: 'png', cumulative: true }).then(function (frameData) {
frameData.forEach(async function (frame) {
await frame.getImage().pipe(fs.createWriteStream('./gif-frames/'+frame.frameIndex+'.png'));
});
});
//get and return the frame count so we can estimate loading time
return fs.readdir('./gif-frames', (err, files) => {
console.log(files.length)
return files.length
});
}
startApp()
This works great on localhost. $output[0] in GifApiController.php gets the files.length from the dissectGif.js file. But when I host the site, I get an error saying Undefined array key 0 which makes me think that in production mode, it's not finding the fs and the gif-frames libraries.
I tried to put the full directory of the libraries such as:
const fs = require("../../node_modules/fs")
but that didn't work. I also tried running npm install before executing the script, but that didn't work either.
I thought the issue could be that the gif was not getting saved, but I checked and it was.
Does anybody have an idea on how to solve this issue?

Why are my temporary files being deleted on one PHP page but not another?

I have two different PHP pages, both with backends and front ends, driven by Vue, thus there is an accompanying .js file for each page.
In one page, let's say WebPageA, I successfully upload a file, use fetch in the .js file to send it via POST to my backend, gather some information from the upload, unpack some data and send it back to the .js script via the fetch response. Later down the line in WebPageA, after some user input, I then send another POST via fetch to my backend, and in the case of WebPageA I still have access to the temporary file. It did not get deleted via PHP's temporary file handling, which deletes tmp files if they are not moved or renamed. As you'll see in the code I post below, I do not rename or move the temp file and yet I have access to it when I POST to my backend the second time. Here is that code:
include "../../classes/psql_connector.php";
if (empty($_POST['req'])) {
exit();
}
if ($_POST['req'] === 'file_info' || $_POST['req'] === 'generate_plot') {
if (empty($_FILES)) {
exit();
}
$df = $_FILES['datafile'] ?? false;
if (!$df) {
echo json_encode(['success' => false]);
exit();
}
$ext = pathinfo($df['name'], PATHINFO_EXTENSION);
if (!in_array(strtolower($ext), ['tdms', 'lvm', 'csv'])) {
echo json_encode([
'success' => false,
'status' => 'file_extension'
]);
exit();
}
}
if ($_POST['req'] === 'file_info') {
$cmd = "python C:\\xampp\\htdocs\\luwak\\pages\\gp\\unpacker.py " . $df['tmp_name'] . " " . $ext;
$output = shell_exec($cmd. ' 2>&1 &');
$info = json_decode($output);
echo json_encode([
'success' => $info !== null,
'status' => 'done',
'info' => $info,
'output' => $output,
]);
This is the first portion of my backend for WebPageA. When I post here again from my .js file I use the request code (req) 'generate_plot' where I have access to $df['tmp_name'] and $ext even though I didn't save or rename the uploaded file the last time this PHP script ran.
Now on my second webpage, WebPageB, I do the exact same thing. The first 25 lines of the backend PHP file for these two webpages are nearly identical, the only difference being I don't check for .lvm or .csv file types on $ext. When I first POST to WebPageB's backend, it works as intended the file is processed and data is unpacked. However after the user inputs some information and I POST a new FormObject to my backend with a different req code, $_FILES is now empty, and of course $df and $ext are not available as those variables cannot be initialized without the necessary temp file. Here is the backend code for WebPageB:
<?php
include "../../classes/psql_connector.php";
if (empty($_POST['req'])) {
exit();
}
if ($_POST['req'] === 'file_info') {
if (empty($_FILES)) {
exit();
}
$df = $_FILES['datafile'] ?? false;
if (!$df) {
echo json_encode(['success' => false]);
exit();
}
$ext = pathinfo($df['name'], PATHINFO_EXTENSION);
if ($ext != 'tdms') {
echo json_encode([
'success' => false,
'status' => 'file_extension'
]);
exit();
}
}
if ($_POST['req'] === 'file_info') {
$cmd = "python C:\\xampp\\htdocs\\luwak\\pages\\cc\\unpacker.py " . $df['tmp_name'] . " " . $ext;
$output = shell_exec($cmd. ' 2>&1 &');
$info = json_decode($output);
echo json_encode([
'success' => $info !== null,
'status' => 'done',
'info' => $info,
'output' => $output,
]);
I cannot for the life of me figure out why in one situation with nearly identical code a temp file is not being deleted while in the other it is.
Just to be clear I am not sending a POST via a form in the HTML produced by the frontend PHP script. I am using the fetch method within my Vue js file, here is an example:
fetch("pages/gp/gpBackend.php", {
method: "POST",
body: v.form_data_obj
})
.then((response) => {
return response.json()
})
.then(function (data) {
v.processing = false
// handle errors first
if (!data.success) {
if (data.status === 'file_extension') {
v.status_msg = "This tool only supports CSV or TDMS file format."
v.has_error = true
} else {
v.error_details = data.output
v.status_msg = "An error occurred and your file could not be processed."
v.has_error = true
}
return
}
v.status_msg = "File processed successfully. Please choose what data to plot below."
v.grouped_axis_info = data.info.group_channels
v.data_file_info = data.info
v.x_end_default = data.info.death_timestamp
v.clearError()
I am more than happy to clear up any ambiguities that might show themselves in my question, but if anyone has any insight, or perhaps there's some face-palm worthy solution I am overseeing, please let me know.
Thank you
It took some hard digging into my Javascript before I realized that with WebPageA I was using the same FormData object on the JS side to send as the body of the POST via fetch. Meaning the datafile object was being passed along both times I POSTed to my backend. However on WebPageB I actually created a new FormData object within my second POST module and was not including the data file object in its body. Therefore it was never receiving another file. An embarrassing head slapper for sure, but lesson learned.

How to save video recorded with MediaRecorder API to php server?

I can record a video with the webcam, play the resulting blob in the browser and download it to the local machine, but when I save the file to the server it is unreadable. I have tried sending the chunks to the server and concatenating them there, and also sending the whole blob, but the result is the same (unreadable video).
I first read the blobs with a FileReader(), which gives a base64 result, then send it to the server, where I base64_decode() it and save it to a folder.
JS code:
var reader = new FileReader();
reader.readAsDataURL(chunks[index]);
reader.onload = function () {
upload(reader.result, function(response){
if(response.success){
// upload next chunk
}
});
};
on the server:
$json = json_decode( $request->getContent(), true );
$chunk = base64_decode( $json["chunk"] );
// all chunks get
file_put_contents("/uploadDirecotry/chunk".$json['index'].".webm", $json["chunk"]);
When all chunks are uploaded:
for ($i = 0; $i < $nrOfChunks; $i++) {
$file = fopen("/uploadDirectory/chunk".$i.".webm", 'rb');
$buff = fread($file, 1024000);
fclose($file);
$final = fopen("/processed/".$video->getFileName()."-full.webm", 'ab');
$write = fwrite($final, $buff);
fclose($final);
unlink("/uploadDirectory/chunk".$i.".webm");
}
I don't know what I am doing wrong. I've been trying for more than a week to make it work, but it won't. Please help!
You have to save the decoded chunk.
Instead of this
file_put_contents("/uploadDirecotry/chunk".$json['index'].".webm", $json["chunk"]);
use this
file_put_contents("/uploadDirecotry/chunk".$json['index'].".webm", $chunk);
Also, I suggest opening the final file in write mode before your for loop and closing it after the loop, instead of reopening it every time inside the loop.
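A minimal sketch of that suggestion, reusing the variables from the question's loop; file_get_contents() also replaces the fixed-size fread(), which would truncate any chunk larger than 1024000 bytes:
<?php
// Open the final file once in append mode, then copy every chunk into it.
$final = fopen("/processed/".$video->getFileName()."-full.webm", "ab");
for ($i = 0; $i < $nrOfChunks; $i++) {
    $chunkPath = "/uploadDirectory/chunk".$i.".webm";
    fwrite($final, file_get_contents($chunkPath)); // whole chunk, regardless of size
    unlink($chunkPath);
}
fclose($final);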

Optimize crawler script on cronjob

I have about 66 million domains in a MySQL table. I need to run the crawler on all the domains and update the row to count = 1 when the crawler has completed.
The crawler script is in PHP using a PHP crawler library.
Here is the script.
set_time_limit(10000);
try{
$strWebURL = $_POST['url'];
$crawler = new MyCrawler();
$crawler->setURL($strWebURL);
$crawler->addContentTypeReceiveRule("#text/html#");
$crawler->addURLFilterRule("#\.(jpg|jpeg|gif|png)$# i");
$crawler->enableCookieHandling(true);
$crawler->setTrafficLimit(1000 * 1024);
$crawler->setConnectionTimeout(10);
//start of the table
echo '<table border="1" style="margin-bottom:10px;width:100% !important;">';
echo '<tr>';
echo '<th>URL</th>';
echo '<th>Status</th>';
echo '<th>Size (bytes)</th>';
echo '<th>Page</th>';
echo '</tr>';
$crawler->go();
echo '</table>';
$this->load->model('urls');
$this->urls->incrementCount($_POST['id'],'urls');
}catch(Exception $e){
}
$this->urls->incrementCount(); only updates the row to mark the count column = 1.
Because I have 66M domains I needed to run a cronjob on my server, and as the cronjob runs on the command line I needed a headless browser, so I chose PhantomJS, because the crawler doesn't work the way I want it to without the headless browser (PhantomJS).
The first problem I faced was loading domains from the MySQL DB and running the crawler script from a JS script.
I tried this: create a PHP script that returns domains in JSON form, load it from the JS file, foreach the domains and run the crawler. It didn't work very well and got stuck after some time.
The next thing I tried, which I'm still using, is a Python script that loads the domains directly from the MySQL DB and runs the PhantomJS script on each domain.
Here is the code:
import MySQLdb
import httplib
import sys
import subprocess
import json
args = sys.argv;
db = MySQLdb.connect("HOST","USER","PW","DB")
cursor = db.cursor()
#tablecount = args[1]
frm = args[1]
limit = args[2]
try:
sql = "SELECT * FROM urls WHERE count = 0 LIMIT %s,%s" % (frm,limit)
cursor.execute(sql)
print "TOTAL RECORDS: "+str(cursor.rowcount)
results = cursor.fetchall()
count = 0;
for row in results:
try:
domain = row[1].lower()
idd = row[0]
command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain,idd)
print command
proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
script_response = proc.stdout.read()
print script_response
except:
print "error running crawler: "+domain
except:
print "Error: unable to fetch data"
db.close()
It takes 2 arguments to set the LIMIT for selecting domains from the database.
It foreaches the domains and runs this command using subprocess:
command = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js %s %s" % (domain,idd)
print command
proc = subprocess.Popen(command, shell=True, stdout=subprocess.PIPE)
script_response = proc.stdout.read()
print script_response
The crawler2.js file also takes 2 args: the first is the domain and the second is the id, used to update count = 1 when the crawler has completed.
This is crawler2.js:
var args = require('system').args;
var address = '';
var id = '';
args.forEach(function(arg, i) {
if(i == 1){
address = arg;
}
if(i == 2){
id = arg;
}
});
address = "http://www."+address;
var page = require('webpage').create(),
server = 'http://www.EXAMPLE.net/main/crawler',
data = 'url='+address+'&id='+id;
console.log(data);
page.open(server, 'post', data, function (status) {
if (status !== 'success') {
console.log(address+' Unable to post!');
} else {
console.log(address+' : done');
}
phantom.exit();
});
It works well, but my script gets stuck after some time and needs to be restarted, and the log shows nothing wrong.
I need to optimize this process and run the crawler as fast as I can; any help would be appreciated.
A web crawler programmer is in here. :)
Your Python executes the phantom serially. You should do it in parallel. To do that, execute the phantom and then leave it; don't wait for it.
In PHP, it would be like this:
exec("/your_executable_path > /dev/null &");
Don't use phantom if you don't need to. It renders everything. > 50MB of memory will be needed.
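For illustration, a sketch of that idea with the batch loop moved into PHP: each phantomjs call is backgrounded so exec() returns immediately instead of waiting. The credentials and the LIMIT are placeholders; the paths and table layout are copied from the question:
<?php
// Sketch: launch phantomjs for a batch of domains without waiting on each one.
$pdo = new PDO("mysql:host=HOST;dbname=DB", "USER", "PW");
$rows = $pdo->query("SELECT * FROM urls WHERE count = 0 LIMIT 100", PDO::FETCH_NUM);
foreach ($rows as $row) {
    $id = $row[0];
    $domain = strtolower($row[1]);
    $cmd = "/home/wasif/public_html/phantomjs /home/wasif/public_html/crawler2.js "
        . escapeshellarg($domain) . " " . escapeshellarg($id);
    // "> /dev/null 2>&1 &" detaches the process so the loop moves on immediately
    exec($cmd . " > /dev/null 2>&1 &");
}
Some cap on the number of concurrent phantomjs processes would still be needed, given the > 50MB of memory each one takes.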
