How to chunk a string using JavaScript

I have a string that is larger than 32 KB. It needs to be chunked, with every chunk having a size limit of 32 KB. Is this possible using JavaScript? I can only find code for cutting or splitting a string, which I think is not related to my task:
stringChop = function(str, size){
if (str == null)
return [];
str = String(str);
return size > 0 ? str.match(new RegExp('.{1,' + size + '}', 'g')) : [str];
}
I also have code to check the size in bytes:
const byteSize = str => new Blob([str]).size;
const result = byteSize("sample")

You really don't want to "spend time" splitting large strings in Node.
If you have to use vanilla JavaScript
This is entirely possible with JavaScript (and you're pretty close). Though this is more elegant without regular expressions and with generators:
function* chunk(str, size = 3) {
for(let i = 0; i < str.length; i+= size ) yield str.slice(i, i + size);
}
[...chunk('hello world')]; // ["hel", "lo ", "wor", "ld"];
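Note that the generator above measures size in UTF-16 code units (`str.length`), not bytes. If the 32 KB limit applies to the encoded size, as the Blob-based `byteSize` check in the question suggests, a byte-aware variant could look like the sketch below. The names `utf8Length` and `chunkByBytes` and the default limit are hypothetical, and the sketch assumes the target encoding is UTF-8.

```javascript
// Number of bytes a single code point occupies in UTF-8.
// A lone surrogate is counted as 3 bytes, matching what TextEncoder emits (U+FFFD).
function utf8Length(codePoint) {
  if (codePoint < 0x80) return 1;
  if (codePoint < 0x800) return 2;
  if (codePoint < 0x10000) return 3;
  return 4;
}

// Hypothetical helper: yield pieces of `str` whose UTF-8 size is at most maxBytes.
function* chunkByBytes(str, maxBytes = 32 * 1024) {
  if (maxBytes < 4) throw new RangeError('maxBytes must be at least 4');
  let current = '';
  let currentBytes = 0;
  // for...of iterates by code point, so surrogate pairs are never split.
  for (const ch of str) {
    const size = utf8Length(ch.codePointAt(0));
    if (currentBytes + size > maxBytes) {
      yield current;
      current = '';
      currentBytes = 0;
    }
    current += ch;
    currentBytes += size;
  }
  if (current) yield current;
}
```

For example, `[...chunkByBytes(bigString)]` gives an array of pieces that join back to the original string, each within the byte limit.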
If you can use Node.js
I'd read the file you want to split with a createReadStream and then write it to different files when it reaches the limit. This is much more effective since you don't create many small strings or keep all the data in memory:
const fs = require('fs');
const util = require('util');

(async () => {
  let currentFileIndex = 0, currentBytes = 0;
  let currentFile = fs.createWriteStream(`${currentFileIndex}.csv`);
  for await (const chunk of fs.createReadStream('input.csv')) {
    currentBytes += chunk.length;
    if (currentBytes > 32000) { // or whatever limit you want
      currentFile.end(); // probably wait for the callback here
      currentBytes = chunk.length; // this chunk goes into the new file
      currentFile = fs.createWriteStream(`${++currentFileIndex}.csv`);
    }
    // wait until the chunk has been written before reading the next one
    await util.promisify(cb => currentFile.write(chunk, cb))();
  }
  currentFile.end(); // close the last file
})();

Related

Reverse a string in JavaScript without using an array

A simple way of reversing a string is as below:
const test = 'hello';
let i = 0;
let j = test.length - 1;
while (i < j) {
let temp = test[i];
test[j] = test[i];
test[i] = temp;
i++;
j--;
}
console.log(test);
If we try to access the string using an index, it works fine. For example, console.log(test[2]) prints 'l'.
But reversing a string using the method above returns the unchanged string 'hello'. We need to use an array, reverse it, and then join it to return the reversed string, but in that case we use extra space. Can we do it without using extra space?
Strings are immutable in JavaScript, so they cannot be changed in place. Any new string requires a new memory allocation. Even something as simple as
const str1 = "hello";
const str2 = str1[0];
leaves two strings in memory: "hello" and "h".
Since any attempt to produce a string will create at least one new string, it is therefore impossible to reverse a string without allocating space for a new string where the characters are reversed.
The minimum space complexity for this task is thus O(n): it scales linearly with the string length. Creating an array that can be rearranged in place and then joined back into the reversed string fulfils this.
Here is a recursive way of doing it:
const rev = s => s.length>1 ? s.at(-1)+rev(s.slice(0,-1)) : s;
console.log(rev("This is a test string."))
The final line of your question means that the answer is "no". We cannot do this without using extra space [in userland JS].
We could, however, do this if we relied on a function written in a systems programming language. And this is the C code used by V8 for Array#join. In such a language the binary representation of the reversed string could be constructed step by step and simply cast to be a UTF-16 string in the final step. I presume this approximates what Array#join does under the hood.
If your requirement is simply to avoid using an array, the following simply successively pulls the code units from the end of the input string and builds a new string from them.
This will fail horribly with surrogate pairs (e.g. emoji) and grapheme clusters.
const reverse = (s) => {
let result = ''
for(let x = s.length-1; x >= 0; x--) {
result += s[x]
}
return result
}
console.log(reverse('hello'))
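If surrogate pairs matter, one possible variation on the loop above is to iterate by code point instead of by index. This is only a sketch (the name `reverseByCodePoint` is made up here): it still allocates new strings on every step, and it still breaks grapheme clusters such as combining accents or emoji joined with zero-width joiners.

```javascript
// Reverse a string by code point, without creating an array.
// for...of walks the string one code point at a time, so a surrogate
// pair (e.g. an emoji) is moved as a unit instead of being torn apart.
const reverseByCodePoint = (s) => {
  let result = '';
  for (const ch of s) {
    result = ch + result; // prepend, so the last code point ends up first
  }
  return result;
};
```

Calling `reverseByCodePoint('a😀b')` keeps the emoji intact, whereas the index-based version swaps the two halves of its surrogate pair.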
What about a hacky for loop?
const rev = (str) => {
for(var i = str.length - 1; i >= 0; i--) {
str += str[i];
}
return str.slice(str.length / 2, str.length);
}
console.log(rev("t"));
console.log(rev("te"));
console.log(rev("tes"));
console.log(rev("test"));
OP
"Can we do it without using an extra space."
nope.
Anyhow ... here is the OP's while-based approach, which this time does not try to change characters in place but instead a) removes character by character from the end of the input value while b) concatenating them into the reversed result string ...
function reverseStringWithoutHelpOfArray(value) {
value = String(value); // let i = 0;
let result = ''; // let j = test.length - 1;
// while (i < j) {
while (value) { // let temp = test[i];
result = result + value.slice(-1); // test[j] = test[i];
value = value.substring(0, value.length - 1); // test[i] = temp;
} // i++; j--;
return result; // }
}
console.log(
reverseStringWithoutHelpOfArray('hallo')
);

IMAP UTF-7 conversion with native Javascript

I have been trying to find good JavaScript code for converting IMAP UTF-7 mailbox names to and from regular (UTF-16) JavaScript strings. There seems to be no such work done. Has anyone built one of these, or have one available to share? I am happy to build one but didn't want to if someone already has it.
From the specs, it looks like the text between '&' and '-' is first base64-decoded and then interpreted as UTF-16 big-endian; encoding is the reverse: runs of non-ASCII text are converted to UTF-16 big-endian and then base64-encoded. The base64 alphabet's "+/" is represented as "+," here, rather than the "-_" used by the URL- and filename-safe variant.
Let me know if anyone has a solution and I will be happy to use it, or write one and put it on GitHub!
Thanks
Vijay
I think I found a simple enough solution to this, as no one responded. Hopefully somebody finds this little script helpful for converting UTF7
>z='Συστήματα_Ανίχνευσης_Εισ & related security.pdf'
>encode_imap_utf7(z)
"&A6MDxQPDA8QDtwMBA7wDsQPEA7E-_&A5EDvQO5AwEDxwO9A7UDxQPDA7cDwg-_&A5UDuQPD- &- related security.pdf"
>decode_imap_utf7(encode_imap_utf7(z))
"Συστήματα_Ανίχνευσης_Εισ & related security.pdf"
>decode_imap_utf7(encode_imap_utf7(z)) == z
true
/* See RFC 2060 - no / ~ \ in folder names */
function ureplacer(pmatch) {
var ret = ""
pmatch = pmatch.replace(/\,/g,'/')
var ix = pmatch.substr(1,pmatch.length-2)
if (ix.length % 4 != 0)
ix = ix.padEnd(ix.length+ 4 - ix.length % 4,"=")
try {
var dx = atob(ix)
for (var j = 0; j < dx.length; j = j+2) {
ret = ret + String.fromCharCode((dx.charCodeAt(j) << 8) + dx.charCodeAt(j+1))
}
} catch(err) {
console.log("Error in decoding foldername IMAP UTF7, sending empty string back")
console.log(err)
ret = ""
}
return ret
}
function breplacer(umatch) {
var bst = ""
for (var i=0; i < umatch.length; i++) {
var f = umatch.charCodeAt(i)
bst = bst + String.fromCharCode(f >> 8) + String.fromCharCode(f & 255)
}
try {
bst = '&'+btoa(bst).replace(/\//g,',').replace(/=+/,'')+'-'
}catch(err) {
console.log("Error in encoding foldername IMAP UTF7, sending empty string back")
console.log(err)
bst = ""
}
return bst
}
function decode_imap_utf7(mstring) {
  var stm = new RegExp(/(\&[A-Za-z0-9\+\,]+\-)/,'g')
  // global regex so every escaped "&-" is restored, not just the first one
  return mstring.replace(stm,ureplacer).replace(/&-/g,'&')
}
function encode_imap_utf7(ustring) {
  ustring = ustring.replace(/\/|\~|\\/g,'')
  var vgm = new RegExp(/([^\x20-\x7e]+)/,'g')
  // global regex so every literal "&" is escaped, not just the first one
  return ustring.replace(/&/g,'&-').replace(vgm,breplacer)
}
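As a cross-check of the rule described in the question (UTF-16 big-endian, then base64 with "," in place of "/" and no padding), here is a Node-only sketch that handles a single shifted run using Buffer instead of atob/btoa. The helper names `encodeRun` and `decodeRun` are hypothetical, and the sketch relies on Node's base64 decoder accepting input without "=" padding.

```javascript
// Encode one run of non-ASCII text as an IMAP UTF-7 shifted sequence.
function encodeRun(run) {
  const bytes = Buffer.alloc(run.length * 2);
  for (let i = 0; i < run.length; i++) {
    bytes.writeUInt16BE(run.charCodeAt(i), i * 2); // UTF-16 big-endian code units
  }
  const b64 = bytes.toString('base64').replace(/\//g, ',').replace(/=+$/, '');
  return '&' + b64 + '-';
}

// Decode one "&...-" shifted sequence back to a JavaScript string.
function decodeRun(shifted) {
  const b64 = shifted.slice(1, -1).replace(/,/g, '/');
  const bytes = Buffer.from(b64, 'base64');
  let out = '';
  for (let i = 0; i + 1 < bytes.length; i += 2) {
    out += String.fromCharCode(bytes.readUInt16BE(i));
  }
  return out;
}
```

With the sample folder name above, `encodeRun('Εισ')` produces `&A5UDuQPD-`, the same sequence that appears in the console session.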

How to Directly Instantiate WebAssembly Module in JavaScript

The examples I've seen show essentially this:
fetch('simple.wasm').then(response =>
response.arrayBuffer()
).then(bytes =>
WebAssembly.instantiate(bytes, {})
).then(result =>
result.instance.exports...
)
But I would like to do it without making that extra HTTP request. Wondering if the only way is this (or some variation of it, which would be helpful to know):
var binary = '...mywasmbinary...'
var buffer = new ArrayBuffer(binary.length)
var view = new DataView(buffer)
for (var i = 0, n = binary.length; i < n; i++) {
var x = binary[i]
view.setInt8(i * 8, x)
}
Wondering if I have to worry about endianness or anything like that.
Or perhaps doing something with URL and blobs might be better, I'm not sure.
Yes, you are correct: in order to inline wasm modules and avoid the HTTP request, you'll have to perform some sort of encoding. I'd recommend Base64-encoded strings, as they are a compact and widely supported text form.
You can encode as follows:
const readFileSync = require('fs').readFileSync;
const wasmCode = readFileSync(id);
const encoded = Buffer.from(wasmCode, 'binary').toString('base64');
You can then load the module as follows:
var encoded = "... contents of encoded from above ...";
function asciiToBinary(str) {
if (typeof atob === 'function') {
// this works in the browser
return atob(str)
} else {
// this works in node
return Buffer.from(str, 'base64').toString('binary');
}
}
function decode(encoded) {
var binaryString = asciiToBinary(encoded);
var bytes = new Uint8Array(binaryString.length);
for (var i = 0; i < binaryString.length; i++) {
bytes[i] = binaryString.charCodeAt(i);
}
return bytes.buffer;
}
// instantiate() returns a promise that resolves to { module, instance }
var promise = WebAssembly.instantiate(decode(encoded), {});
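To make the idea concrete, here is a small self-contained sketch that inlines the smallest valid module (just the 8-byte header: the `\0asm` magic number plus version 1) as Base64. The helper name `decodeBase64` is hypothetical. Because the bytes are copied one at a time, endianness does not come into play.

```javascript
// The 8 header bytes 00 61 73 6d 01 00 00 00, Base64-encoded.
const encodedModule = 'AGFzbQEAAAA=';

// Decode Base64 to bytes in either Node (Buffer) or the browser (atob).
function decodeBase64(b64) {
  if (typeof Buffer !== 'undefined') {
    return Uint8Array.from(Buffer.from(b64, 'base64'));
  }
  // atob yields a "binary string": one character per byte
  return Uint8Array.from(atob(b64), (c) => c.charCodeAt(0));
}

const wasmBytes = decodeBase64(encodedModule);

// instantiate() resolves to { module, instance }; an empty module has no exports.
const exportNames = WebAssembly.instantiate(wasmBytes, {})
  .then(({ instance }) => Object.keys(instance.exports));
```

A real module would be inlined the same way. Only the Base64 string changes, along with the import object if the module needs imports.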

Breaking Down a String into Maximum Character Sections in JavaScript

I need to break apart strings in JavaScript into chunks of no greater than 100 characters while maintaining breaks between words. I have a function in my own personal library for chunkifying a string into 100-character sections, but I can't seem to wrap my head around how to adapt it to avoid splitting in the middle of a word. I figure something can be managed using regular expressions or something, but it just isn't coming to me. One caveat to any solution is that it has to be pure JavaScript, no jQuery, and the environment has no access to browser-related globals.
-- EDIT --
Ok, I've written some code, but I'm getting strange results...
function chunkify(str) {
var wsRegEx = /\S/;
var wsEndRegEx = /\s$/;
var wsStartRegEx = /^\s/;
var chunks = new Array();
var startIndex = 0;
var endIndex = 100;
var totalChar = 0;
while (true) {
if (totalChar >= str.length) break;
var chunk = str.substr(startIndex,endIndex-startIndex);
while (wsStartRegEx.test(chunk)) {
startIndex++;
endIndex++;
totalChar++;
chunk = str.substr(startIndex,endIndex-startIndex);
}
if (!wsEndRegEx.test(chunk)) {
while (wsRegEx.test(chunk.charAt(endIndex))) {
endIndex--;
}
chunk = str.substr(startIndex,endIndex-startIndex);
}
chunks.push(chunk);
totalChar += chunk.length;
startIndex = endIndex;
endIndex += 100;
}
return chunks;
}
A previous version I posted wasn't counting chunks correctly, but this version, which does seem to break correctly, is now breaking mid word.
-- EDIT #2 --
I think I got it working great now. This seems to do the trick:
function chunkify(str) {
var wsRegEx = /\S/;
var chunks = new Array();
var startIndex = 0;
var endIndex = 100;
while (startIndex < str.length) {
while (wsRegEx.test(str.charAt(endIndex))) {
endIndex--;
}
if (!wsRegEx.test(str.charAt(startIndex)))
startIndex++;
chunks.push(str.substr(startIndex, endIndex - startIndex));
startIndex = endIndex;
endIndex += 100;
}
return chunks;
}
Is there a cleaner way to do this, or have I gotten this to be about as efficient as it'll get?
I have tried to spec this out for you, so you understand one way it can be done
function chunkify (str) {
var chunks = [];
var startIdx = 0, endIdx;
//Traverse through the string, 100 characters at a go
//If the character in the string after the next 100 (str.charAt(x)) is not a whitespace char, try the previous character(s) until a whitespace character is found.
//Split on the whitespace character and add it to chunks
return chunks
}
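The outline above could be filled in roughly as follows. This is one possible sketch, not the only way to do it: the name `chunkifyByWords` and the `max` parameter are made up here, and a single word longer than `max` is simply cut at the limit, which is an assumption the question does not address.

```javascript
// Split `str` into chunks of at most `max` characters, breaking at whitespace.
function chunkifyByWords(str, max = 100) {
  const chunks = [];
  let start = 0;
  while (start < str.length) {
    // Skip whitespace so a chunk never starts with a space.
    while (start < str.length && /\s/.test(str[start])) start++;
    if (start >= str.length) break;

    let end = Math.min(start + max, str.length);
    // If the cut would land in the middle of a word, back up to the
    // whitespace that precedes that word.
    if (end < str.length && /\S/.test(str[end])) {
      let back = end;
      while (back > start && /\S/.test(str[back - 1])) back--;
      if (back > start) end = back; // otherwise: one over-long word, cut it at `max`
    }
    chunks.push(str.slice(start, end).trimEnd());
    start = end;
  }
  return chunks;
}
```

It uses only core string methods, so it should also run in environments without browser globals.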
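Here's a way to do it with a regex, although this version cuts every 100 characters regardless of word boundaries (and . does not match line breaks):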
chunks = str.match(/.{1,100}/g);

Convert large array of integers to unicode string and then back to array of integers in node.js

I have some data which is represented as an array of integers and can be up to 200 000 elements. The integer value can vary from 0 to 200 000.
To emulate this data (for debugging purposes) I can do the following:
let data = [];
let len = 200000
for (let i = 0; i < len; i++) {
data[i] = i;
}
To convert this array of integers as an unicode string I perform this:
let dataAsText = data.map((e) => {
return String.fromCodePoint(e);
}).join('');
When I want to convert back to an array of integers the array appears to be longer:
let dataBack = dataAsText.split('').map((e) => {
return e.codePointAt(0);
});
console.log(dataBack.length);
How come? What is wrong?
Extra information:
I use codePointAt/fromCodePoint because it can deal with all unicode values (up to 21 bits) while charCodeAt/fromCharCode fails.
Using, for example, .join('123') and .split('123') makes dataBack the same length as data. But this isn't an elegant solution, because the string dataAsText becomes unnecessarily large.
If len is less than or equal to 65536 (2^16, the number of values that fit in 16 bits), everything works fine, which is strange.
EDIT:
I use codePoint because I need to convert the data as unicode text so that the data is short.
More about codePoint vs charCode with an example:
If we convert 150000 to a character then back to an integer with codePoint:
console.log(String.fromCodePoint("150000").codePointAt(0));
this gives us 150000 which is correct. Doing the same with charCode fails and prints 18928 (and not 150000):
console.log(String.fromCharCode("150000").charCodeAt(0));
That's because higher code point values yield two 16-bit words (code units), as can be seen in this snippet:
var s = String.fromCodePoint(0x2F804)
console.log(s); // Shows one character
console.log('length = ', s.length); // 2, because encoding is \uD87E\uDC04
var i = s.codePointAt(0);
console.log('CodePoint value at 0: ', i); // correct
var i = s.codePointAt(1); // Should not do this, it starts in the middle of a sequence!
console.log('CodePoint value at 1: ', i); // misleading
In your code things go wrong when you do split, as there the words making up the string are all split, discarding the fact that some pairs are intended to combine into a single character.
You can use the ES6 solution to this, where the spread syntax takes this into account:
let dataBack = [...dataAsText].map((e, i) => {
// etc.
Now your counts will be the same.
Example:
// (Only 20 instead of 200000)
let data = [];
for (let i = 199980; i < 200000; i++) {
data.push(i);
}
let dataAsText = data.map(e => String.fromCodePoint(e)).join("");
console.log("String length: " + dataAsText.length);
let dataBack = [...dataAsText].map(e => e.codePointAt(0));
console.log(dataBack);
Surrogates
Be aware that in the range 0 ... 65535 there are ranges reserved for so-called surrogates, which only represent a character when combined with another value. You should not iterate over those expecting that these values represent a character on their own. So in your original code, this is another source of errors.
To fix this, you should really skip over those values:
for (let i = 0; i < len; i++) {
if (i < 0xd800 || i > 0xdfff) data.push(i);
}
In fact, there are many other code points that do not represent a character.
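Putting the two points above together (skip the surrogate range when generating, and iterate by code point when reading back), a full round trip over the question's range might look like this sketch. It assumes the data only needs to survive as a JavaScript string in memory.

```javascript
// Build the test data, leaving out the surrogate range U+D800..U+DFFF.
const data = [];
for (let i = 0; i < 200000; i++) {
  if (i < 0xd800 || i > 0xdfff) data.push(i);
}

// One code point per integer; values above 0xFFFF take two UTF-16 code units.
const dataAsText = data.map((cp) => String.fromCodePoint(cp)).join('');

// Array.from iterates the string by code point, unlike split('').
const dataBack = Array.from(dataAsText, (ch) => ch.codePointAt(0));

const sameLength = dataBack.length === data.length;
const sameValues = dataBack.every((value, index) => value === data[index]);
```

Here `dataAsText.length` is larger than `data.length` because it counts code units, while `dataBack.length` matches `data.length` again.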
I have a feeling split doesn't work with Unicode values above 65535. A quick test shows that such characters take up two elements each after splitting.
Perhaps look at this post and answers, as they ask a similar question
I don't think you want codePointAt (or charCodeAt) at all. To convert a number to a string, just use String; to have a single delimited string with all the values, use a delimiter (like ,); to convert it back to a number, use the appropriate one of Number, the unary +, parseInt, or parseFloat (in your case, probably Number or +):
// Only 20 instead of 200000
let data = [];
for (let i = 199980; i < 200000; i++) {
data.push(i);
}
let dataAsText = data.join(",");
console.log(dataAsText);
let dataBack = dataAsText.split(",").map(Number);
console.log(dataBack);
If your goal with codePointAt is to keep the dataAsText string short, then you can do that, but you can't use split to recreate the array because JavaScript strings are UTF-16 (effectively) and split("") will split at each 16-bit code unit rather than keeping code points together.
A delimiter would help there too:
// Again, only 20 instead of 200000
let data = [];
for (let i = 199980; i < 200000; i++) {
data.push(i);
}
let dataAsText = data.map(e => String.fromCodePoint(e)).join(",");
console.log("String length: " + dataAsText.length);
let dataBack = dataAsText.split(",").map(e => e.codePointAt(0));
console.log(dataBack);
If you're looking for a way to encode a list of integers so that you can safely transmit it over a network, node Buffers with base64 encoding might be a better option:
let data = [];
for (let i = 0; i < 200000; i++) {
data.push(i);
}
// encoding
var ta = new Int32Array(data);
var buf = Buffer.from(ta.buffer);
var encoded = buf.toString('base64');
// decoding
var buf = Buffer.from(encoded, 'base64');
var ta = new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength >> 2);
var decoded = Array.from(ta);
// same?
console.log(decoded.join() == data.join())
Your original approach won't work because not every integer has a corresponding code point in Unicode.
Update: if the data doesn't have to pass through a text-only channel, there's no need for base64; just store the buffer as is:
const fs = require('fs');
// saving
var ta = new Int32Array(data);
fs.writeFileSync('whatever', Buffer.from(ta.buffer));
// loading
var buf = fs.readFileSync('whatever');
var loadedData = Array.from(new Uint32Array(buf.buffer, buf.byteOffset, buf.byteLength >> 2));
// same?
console.log(loadedData.join() == data.join())
