If I have a random string and want to encode it into another string that only contains alphanumeric characters, what is the most efficient way to do it in JavaScript / NodeJS?
Obviously it must be possible to convert the output string back to the original input string when needed.
Thanks!
To encode to an alphanumeric string you should use an alphanumeric encoding. Some popular ones include hexadecimal (base16), base32, base36, base58 and base62. The alternatives to hexadecimal are used because the larger alphabet results in a shorter encoded string. Here's some info:
Hexadecimal is popular because it is supported almost everywhere and maps each byte to exactly two characters.
Base32 and Base36 are useful for case-insensitive encodings. Base32 is more human readable because it removes some easy-to-misread letters. Base32 is used in gaming and for license keys.
Base58 and Base62 are useful for case-sensitive encodings. Base58 is also designed to be more human readable by removing some easy-to-misread letters. Base58 is used by Flickr, Bitcoin and others.
In NodeJS hexadecimal encoding is natively supported, and can be done as follows:
// Encode (Buffer.from replaces the deprecated new Buffer constructor)
var hex = Buffer.from(string, 'utf8').toString('hex');
// Decode
var decoded = Buffer.from(hex, 'hex').toString('utf8');
It's important to note that there are different implementations of some of these. For example, Flickr and Bitcoin use different implementations of Base58.
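Base62 has no native support in Node.js, so a reversible encoder has to be written by hand or taken from a package. As a minimal sketch, the string's UTF-8 bytes can be treated as one large BigInt and rewritten in base 62. The names base62Encode and base62Decode and the alphabet order are hypothetical choices for illustration, not a standard, and the loop is not tuned for long inputs:

```javascript
// Hypothetical helpers, not taken from any library.
const ALPHABET = '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz';

function base62Encode(text) {
  const bytes = Buffer.from(text, 'utf8');
  // Prefix a 0x01 sentinel byte so leading zero bytes survive the number conversion.
  let n = BigInt('0x01' + bytes.toString('hex'));
  let out = '';
  while (n > 0n) {
    out = ALPHABET[Number(n % 62n)] + out;
    n /= 62n;
  }
  return out;
}

function base62Decode(encoded) {
  let n = 0n;
  for (const ch of encoded) {
    const digit = ALPHABET.indexOf(ch);
    if (digit < 0) throw new Error('Invalid base62 character: ' + ch);
    n = n * 62n + BigInt(digit);
  }
  const hex = n.toString(16);
  // A valid value is the sentinel nibble "1" followed by whole bytes.
  if (hex[0] !== '1' || hex.length % 2 !== 1) {
    throw new Error('Input was not produced by base62Encode');
  }
  return Buffer.from(hex.slice(1), 'hex').toString('utf8');
}

const encoded = base62Encode('dssldjf348902.-/dsfkjl');
console.log(/^[0-9A-Za-z]+$/.test(encoded)); // true
console.log(base62Decode(encoded) === 'dssldjf348902.-/dsfkjl'); // true
```

The output is shorter than the hexadecimal form of the same input, at the cost of more work per character.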
Why not just store the 2 strings in different variables so no need to convert back?
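To strip everything that is not alphanumeric you could use a regex replace like so (note that this is lossy, so the original string cannot be recovered from the result):
var str='dssldjf348902.-/dsfkjl';
var patt=/[^a-zA-Z0-9]+/g;
var str2 = str.replace(patt,'');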
str2 becomes dssldjf348902dsfkjl
Need to convert the unicode of the SOH value '\u0001' to Ascii. Why is this not working?
var soh = String.fromCharCode(01);
It returns '\u0001'
Or when I try
var soh = '\u0001'
It returns a smiley face.
How can I get the unicode to become the proper SOH value (a blank, unprintable character)?
JS has no ASCII strings, they're intrinsically UTF-16.
In a browser you're out of luck. If you're coding for node.js you're lucky!
You can use a buffer to transcode strings into octets and then manipulate the binary data at will. But you won't necessarily get a valid string back out of the buffer once you've messed with it.
Either way you'll have to read more about it here:
https://mathiasbynens.be/notes/javascript-encoding
or here:
https://nodejs.org/api/buffer.html
EDIT: in the comment you say you use node.js, so this is an excerpt from the second link above.
const buf5 = Buffer.from('test');
// Creates a Buffer containing the UTF-8 bytes [0x74, 0x65, 0x73, 0x74].
To create the SOH character embedded in a common ASCII string, use the escape sequence \x01 like so:
const bufferWithSOH = Buffer.from("string with \x01 SOH", "ascii");
This should do it. You can then send the bufferWithSOH content to an output stream such as a network, console or file stream.
Node.js documentation will guide you on how to use strings in a Buffer pretty well, just look up the second link above.
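The value in the question was most likely already correct: a console just renders U+0001 in its own way, either as the escape '\u0001' or as a glyph such as a smiley. A minimal sketch to check this, with illustrative variable names that are not from the question:

```javascript
// All three spellings produce the same one-character string (code unit 1).
const fromCharCode = String.fromCharCode(1);
const unicodeEscape = '\u0001';
const hexEscape = '\x01';
console.log(fromCharCode === unicodeEscape && unicodeEscape === hexEscape); // true

// Encoded as ASCII, the character becomes the single byte 0x01.
const sohBytes = Buffer.from('A' + hexEscape + 'B', 'ascii');
console.log(sohBytes); // <Buffer 41 01 42>
```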
In ASCII, SOH is the single byte 0x01 (the two bytes 0x00 0x01 are its UTF-16 big-endian form). You would need to extract the hex digits of the code point after the \u, call parseInt with radix 16, then extract the bytes from the resulting Number. JavaScript might not be the best language for this.
Some texts contain six-digit emoji codes. I need to convert them to Unicode with JavaScript.
Just like this:
origin: 328054
Unicode: \ue052 ( U+E052 'the dog face' Emoji )
How can I convert this six-digit emoji code to Unicode with JavaScript?
origin: 328054
I have no idea what you mean. If treated as decimal, 328054 is U+50176, which is not an assigned Unicode character. If treated as hexadecimal, it lies outside the range of code points Unicode can represent (the maximum is U+10FFFF).
Unicode: \ue052 ( U+E052 )
U+E052 is reserved for private use. You don't mean that one. It seems to have been used by SoftBank to encode the Dog Face emoji. Unless you live in Japan and use their network, it will hardly work for you.
'the dog face' Emoji
is assigned U+1F436: 🐶.
How can I encode this in Javascript?
JavaScript uses UTF-16, and since your code point is higher than U+FFFF, you will need two code units to encode it as a surrogate pair. You can still easily get the string from the code point by using String.fromCodePoint:
var df = String.fromCodePoint(0x1F436);
df.length; // 2
You can get the character codes that you need for escaping from that string using the charCodeAt method:
String.fromCodePoint(0x1F436).charCodeAt(0).toString(16) // d83d
String.fromCodePoint(0x1F436).charCodeAt(1).toString(16) // dc36
So the JS string literal you seem to be after is "\ud83d\udc36".
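The same pair can be derived by hand with the UTF-16 surrogate arithmetic. The sketch below uses a hypothetical helper name, toUnicodeEscape, that is not part of any library:

```javascript
// Hypothetical helper: build the \uXXXX escape(s) for a single code point.
function toUnicodeEscape(codePoint) {
  const unit = (n) => '\\u' + n.toString(16).padStart(4, '0');
  if (codePoint <= 0xFFFF) {
    return unit(codePoint); // fits in one UTF-16 code unit
  }
  const offset = codePoint - 0x10000;     // 20 bits remain
  const high = 0xD800 + (offset >> 10);   // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits
  return unit(high) + unit(low);
}

console.log(toUnicodeEscape(0x1F436)); // \ud83d\udc36
console.log(String.fromCharCode(0xD83D, 0xDC36) === String.fromCodePoint(0x1F436)); // true
```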
I am looking for way in JavaScript to convert non-ASCII characters in a string to their closest equivalent, similarly to what the PHP iconv function does. For instance if the input string is Rånades på Skyttis i Ö-vik, it should be converted to Ranades pa skyttis i o-vik. I had a look at phpjs but iconv isn't included.
Is it possible to perform such conversion in JavaScript, if so how?
Notes:
more generally this process of conversion is called transliteration
my use-case is the creation of URL slugs
The easiest way I've found:
var str = "Rånades på Skyttis i Ö-vik";
var combining = /[\u0300-\u036F]/g;
console.log(str.normalize('NFKD').replace(combining, ''));
For reference see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
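For the URL slug use-case, the same normalization can be combined with lowercasing and hyphenation. This is only a sketch: slugify is a hypothetical name, and letters that have no decomposition (for example "ø" or "ß") are replaced by hyphens here rather than transliterated.

```javascript
// Hypothetical helper: strip diacritics, then keep only [a-z0-9] separated by hyphens.
function slugify(text) {
  return text
    .normalize('NFKD')                // split letters from their combining marks
    .replace(/[\u0300-\u036F]/g, '')  // drop the combining marks
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, '-')      // collapse everything else into hyphens
    .replace(/^-+|-+$/g, '');         // trim leading and trailing hyphens
}

console.log(slugify('Rånades på Skyttis i Ö-vik')); // ranades-pa-skyttis-i-o-vik
```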
I would recommend the unidecode package; it will also map Greek and Cyrillic letters to their closest ASCII symbols:
unidecode('Lillı Celiné Никита Ödipus');
'Lilli Celine Nikita Odipus'
That's because iconv is a natively compiled UNIX utility that sits behind most i18n character map conversion functions.
You won't find it in JavaScript unless you access some browser component.
Encoding is a property of the document, so most JavaScript implementations simply ignore it.
You'll need a pure JS library for unaccenting strings. It would be best to have one for the specific language you need.
The simplest way is via translation tables or even regex replaces,
like here: http://lehelk.com/2011/05/06/script-to-remove-diacritics/
Also check this thread: Replacing diacritics in Javascript
I'd like to write some IDs for use in URLs in Crockford's base32. I'm using the base32 npm module.
So, for example, if the user types in http://domain/page/4A2A I'd like it to map to the same underlying ID as http://domain/page/4a2a
This is because I want human-friendly URLs, where the user doesn't have to worry about the difference between upper- and lower-case letters, or between "l" and "1" - they just get the page they expect.
But I'm struggling to implement this, basically because I'm too dim to understand how encoding works. First I tried:
var encoded1 = base32.encode('4a2a');
var encoded2 = base32.encode('4A2A');
console.log(encoded1, encoded2);
But they map to different underlying IDs:
6hgk4r8 6h0k4g8
OK, so maybe I need to use decode?
var encoded1 = base32.decode('4a2a');
var encoded2 = base32.decode('4A2A');
console.log(encoded1, encoded2);
No, that just gives me empty strings:
" "
What am I doing wrong, and how can I get 4a2a and 4A2A to map to the same thing?
For an incoming request, you'll want to decode the URL fragment. When you create URLs, you will take your identifier and encode it. So, given a URL http://domain/page/dnwnyub46m50, you will take that fragment and decode it. Example:
#> echo 'dnwnyub46m50'| base32 -d
my_id5
The library you linked to is case-insensitive, so you get the same result this way:
echo 'DNWNYUB46M50'| base32 -d
my_id5
When dealing with any encoding scheme (Base-16/32/64), you have two basic operations: encode, which works on a raw stream of bits/bytes, and decode which takes an encoded set of bytes and returns the original bit/byte stream. The Wikipedia page on Base32 encoding is a great resource.
When you decode a string, you get raw bytes: it may be that those bytes are not compatible with ASCII, UTF-8, or some other encoding which you are trying to work with. This is why your decoded examples look like spaces: the tools you are using do not recognize the resulting bytes as valid characters.
How you go about encoding identifiers depends on how your identifiers are generated. You didn't say how you were generating the underlying identifiers, so I can't make any assumptions about how you should handle the raw bytes that come out of the decoder, nor about the content of the raw bytes being passed into the encoder.
It's also important to mention that the library you linked to is not compatible with Crockford's Base32 encoding. The library excludes I, L, O, S, while Crockford's encoding excludes I, L, O, U. This would be a problem if you were trying to interoperate with another system that used a different library. If no one besides you will ever need to decode your URL fragments, then interoperability doesn't matter.
The source of your confusion is that base64 and base32 are methods of representing numbers, whereas in your examples you are attempting to encode or decode text strings.
Encoding and decoding text strings as base32 is done by first converting the string into a large number. In your first examples, where you are encoding "4a2a" and "4A2A", those are strings with two different numeric values, which consequently translate to encoded base32 numbers with two different values, 6hgk4r8 and 6h0k4g8.
When you "decode" 4a2a and 4A2A you say you get empty strings. However, this is not true: the strings are not empty, they contain what the decoded number looks like when interpreted as a string. That is to say, it looks like nothing because 4a2a produces an unprintable character. It's invisible. What you want is to feed the encoder numbers, not strings.
JavaScript has
parseInt(num, 32)
and
num.toString(32)
built in, in a way that's compatible with Java and across JavaScript versions. Note that these use the digits 0-9 and a-v, which is not Crockford's alphabet.
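If the underlying IDs are non-negative integers (an assumption, since the question does not say how they are generated), a Crockford-style mapping can be sketched as follows. encodeId and decodeId are hypothetical names, and the decoder folds case and the look-alike letters before reading digits:

```javascript
// Crockford's alphabet: digits plus letters, without I, L, O and U.
const CROCKFORD = '0123456789ABCDEFGHJKMNPQRSTVWXYZ';

// Hypothetical helper: integer ID -> URL fragment.
function encodeId(id) {
  if (!Number.isSafeInteger(id) || id < 0) {
    throw new RangeError('id must be a non-negative safe integer');
  }
  let out = '';
  do {
    out = CROCKFORD[id % 32] + out;
    id = Math.floor(id / 32);
  } while (id > 0);
  return out;
}

// Hypothetical helper: URL fragment -> integer ID, forgiving case and look-alikes.
function decodeId(text) {
  const normalized = text
    .toUpperCase()
    .replace(/O/g, '0')      // letter O reads as zero
    .replace(/[IL]/g, '1');  // letters I and L read as one
  if (normalized.length === 0) throw new Error('Empty id');
  let id = 0;
  for (const ch of normalized) {
    const digit = CROCKFORD.indexOf(ch);
    if (digit < 0) throw new Error('Invalid character: ' + ch);
    id = id * 32 + digit;
  }
  return id;
}

console.log(decodeId('4a2a') === decodeId('4A2A')); // true
console.log(encodeId(decodeId('4a2a')));            // 4A2A
```

With this shape, the request handler decodes the fragment from the URL and looks the page up by the integer, and links are generated with encodeId.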
The JavaScript length property and substring function do not take non-ASCII characters into account.
I have a function that truncates the user's input to 400 characters if they enter more than 400 characters.
e.g
function reduceInput (data) {
  if (data.length > 400) {
    return data.substring(0, 400);
  }
  // Return the input unchanged when it is already short enough.
  return data;
}
However, if non-ASCII chars are entered (double-byte chars) then this does not work. It does not take the character types into account.
I have another function that loops over each character and, if it is non-ASCII, increments a counter and then works out what the true count is. It works but it is a bit of a hack.
Is there a more efficient approach to doing this or is there no other alternative?
Thanks
The native character set of JavaScript and web browsers in general is UTF-16. Strings are sequences of UTF-16 code units. There is no concept of "double byte" character encodings.
If you want to calculate how many bytes a String will take up in a particular double-byte encoding, you will need to know what encoding it is and how to encode it yourself; that information will not be accessible to JavaScript natively. So for example with Shift_JIS you will have to know which characters are kana that can be encoded to a single byte, and which take part in double-byte kanji sequences.
There is not any encoding that stores all code units that represent ASCII in one byte and all code units other than ASCII in two bytes, so whatever question you are trying to solve by counting non-ASCII as two, the loop-and-add probably isn't the right answer.
In any case, the old-school double-byte encodings are a horrible anachronism to be avoided wherever possible. If you want a space-efficient byte encoding, you want UTF-8. It's easy to calculate the length of a string in UTF-8 bytes because JS has a sneaky built-in UTF-8 encoder you can leverage:
var byten= unescape(encodeURIComponent(chars)).length;
Snipping a string to 400 bytes is somewhat trickier because you want to avoid breaking a multi-byte sequence. You'll get an exception if you try to UTF-8-decode something with a broken sequence at the end, so catch it and try again:
var bytes= unescape(encodeURIComponent(chars)).slice(0, 400);
while (bytes.length>0) {
try {
chars= decodeURIComponent(escape(bytes));
break
} catch (e) {
bytes= bytes.slice(0, -1);
}
}
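An alternative sketch of the same byte-limited truncation uses TextEncoder and TextDecoder, which are available as globals in current browsers and recent Node.js versions. truncateUtf8 is a hypothetical name; instead of catching a decode error, it backs the cut point up past any UTF-8 continuation bytes:

```javascript
// Hypothetical helper: cut a string to at most maxBytes UTF-8 bytes
// without splitting a multi-byte sequence.
function truncateUtf8(text, maxBytes) {
  const bytes = new TextEncoder().encode(text);
  if (bytes.length <= maxBytes) return text;
  let end = maxBytes;
  // bytes[end] is the first byte that would be dropped. If it is a
  // continuation byte (10xxxxxx), the cut falls inside a sequence.
  while (end > 0 && (bytes[end] & 0xC0) === 0x80) end--;
  return new TextDecoder('utf-8').decode(bytes.subarray(0, end));
}

console.log(new TextEncoder().encode('ååå').length); // 6
console.log(truncateUtf8('ååå', 3));                 // å
```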
But it's unusual to want to limit input based on the number of bytes it will take up in a particular encoding. A straight limit on the number of characters is far more typical. What are you trying to do?
A regex can do the job, I think ([\s\S] is used instead of . so that line breaks are matched too):
var data = /[\s\S]{0,400}/.exec(originalData)[0];
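Both length and a plain regex count UTF-16 code units, so a character outside the Basic Multilingual Plane counts as two and can be split in half at the limit. If the limit is meant in code points, a minimal sketch (truncateCodePoints is a hypothetical name) is to iterate the string, which walks it by code point:

```javascript
// Hypothetical helper: keep at most maxChars code points.
function truncateCodePoints(text, maxChars) {
  const codePoints = Array.from(text); // string iteration yields whole code points
  return codePoints.length > maxChars
    ? codePoints.slice(0, maxChars).join('')
    : text;
}

console.log('🐶'.length);                      // 2
console.log(Array.from('🐶').length);          // 1
console.log(truncateCodePoints('a🐶b', 2));    // a🐶
```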