For the API I'm working on, I have to map bytes to characters by these rules:
Bytes 0x20 to 0x24 inclusive are mapped to the corresponding characters U+0020 to U+0024 inclusive.
Bytes 0x26 to 0x7E inclusive are mapped to the corresponding characters U+0026 to U+007E inclusive.
Other bytes are mapped to the three character sequence consisting of a percent character followed by two uppercase hex characters, e.g. byte 0 maps to “%00” and byte 0xAB maps to “%AB”.
That's for encoding; I also have to write a function for decoding.
Is this maybe some existing encoding? I googled it, but couldn't find anything.
I know U+ is for Unicode.
Can I just map it like:
if(bytesArray[i] == 0x21)
{
bytesArray[i] = U+0021;
}
?
Since you are translating values in a string, you can use String.replace. The only thing that actually needs replacing is the %XX notation; other characters can stay unchanged.
In this case you can pass a user-defined function as the replacement argument of replace:
decodedArray = byteArray.replace(/%([0-9A-F][0-9A-F])/g,
    function (match, p1) {
        // p1 captures the two hex digits; turn them back into one character
        return String.fromCharCode(parseInt(p1, 16));
    });
This is a safe operation because, per the specification, a literal % character may not appear in the encoded string (in case you're wondering: that is the code 0x25 that gets singled out in the otherwise printable plain ASCII range). Even if a stray % does appear in the string, it will only be replaced if it is actually followed by two valid uppercase hex digits.
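For instance, with a sample input assumed for illustration:
// "%25" is the escaped percent sign itself; "%0A" is a line feed
var byteArray = "100%25 done%0A";
var decodedArray = byteArray.replace(/%([0-9A-F][0-9A-F])/g,
    function (match, p1) {
        return String.fromCharCode(parseInt(p1, 16));
    });
// decodedArray is "100% done\n"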
Encoding is similarly straightforward:
encodedArray = originalString.replace(/[^ -$&-~]/g,
    function (match) {
        // char codes below 0x10 need a leading zero in the hex pair
        if (match.charCodeAt(0) < 16)
            return '%0' + match.charCodeAt(0).toString(16).toUpperCase();
        return '%' + match.charCodeAt(0).toString(16).toUpperCase();
    });
Note that the character class in the regex (space through $, & through ~) is the exact ASCII representation of your to-be-translated code list; it is negated because you want to match all characters outside that range.
This works better than a mapping, because a mapping would require you to loop over each character manually and inspect it to decide whether or not it needs translating. That would work reasonably well for encoding (you'd need a list of all codes, each mapped to a one- or three-character string), but for decoding you would need to look two characters ahead, test whether they are valid (!), and skip over them if so. Compared to that, the replace solution ought to be much faster.
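As a quick sanity check of the encoder (sample input assumed; note its char codes all fit in one byte):
var encodedArray = "na\u00EFve caf\u00E9".replace(/[^ -$&-~]/g,
    function (match) {
        if (match.charCodeAt(0) < 16)
            return '%0' + match.charCodeAt(0).toString(16).toUpperCase();
        return '%' + match.charCodeAt(0).toString(16).toUpperCase();
    });
// encodedArray is "na%EFve caf%E9"; running the decoder above restores the original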
Related
How can this be possible:
var string1 = "🌀", string2 = "🌀🌂";
//comparing the charCode
console.log(string1.charCodeAt(0) === string2.charCodeAt(0)); //true
//comparing the character
console.log(string1 === string2.substring(0,1)); //false
//This is giving me a headache.
http://jsfiddle.net/DerekL/B9Xdk/
If their char codes are the same in both strings, comparing the characters themselves should also return true. It is true when I put in a and ab, but when I put in these strings, it simply breaks.
Some said that it might be the encoding that is causing the problem. But since it works perfectly fine when there's only one character in the string literal, I assume encoding has nothing to do with it.
(This question addresses the core problem in my previous questions. Don't worry, I deleted them already.)
In JavaScript, strings are treated as sequences of characters rather than bytes, but only characters that fit in a single 16-bit code unit count as one "character". The majority of characters cause no issues, but in this case the emoji don't "fit", and so each of them occupies 2 characters as far as JavaScript is concerned.
In this case you need to do:
string2.substring(0, 2) // "🌀"
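To see what is going on under the hood (the two values below are the UTF-16 surrogate pair that encodes U+1F300):
var s = "🌀";                    // U+1F300, outside the 16-bit range
s.length;                        // 2, stored as a surrogate pair
s.charCodeAt(0).toString(16);    // "d83c" (high surrogate)
s.charCodeAt(1).toString(16);    // "df00" (low surrogate)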
For more information on Unicode quirkiness, see UTF-8 Everywhere.
substring's parameters are the index where it starts and the index where it ends, whereas if you change it to substr, the parameters are the index where to start and how many characters to take.
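Side by side, both return the cyclone emoji from the example above:
"🌀🌂".substring(0, 2); // start index 0, end index 2 -> "🌀"
"🌀🌂".substr(0, 2);    // start index 0, 2 characters -> "🌀"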
You can use the localeCompare method to compare 2 strings:
string1.localeCompare(string2);
Summary
Can you use a regular expression to match multiple characters, but replace individual characters with specific replacements?
For instance, replace \ with \\ and replace " with \x22 and replace ' with \x27.
It is my understanding that this is simply not possible, as you can use the captured sub-matches within the replacement, but not with any level of logic that would allow you to conditionally output text if a sub-match took place.
The following VB.NET code is obviously totally incorrect, but gives you an idea of my thinking... (i.e. if there was a replacement command that allowed you to say "if sub-match 1 happened, then output \\ instead")
RegEx.Replace(text, "(\)?("")?(')?", "{if($1,'\\')}{if($2,'\x22')}{if($3,'\x27')}")
(This would be for use with .NET RegEx class, but would be useful for use with javascript RegExp class)
Background
More for interest than actual need, but I've been playing with encoding text for use within javascript parameters. (Well, the need is certainly there, but the interest is efficiency.)
I've been using the standard String.Replace, and doing some tests for performance with the following two functions...
Public Function GetJSSafeString(ByVal text As String) As String
    Return text.Replace("\", "\\").Replace("""", "\x22").Replace("'", "\x27")
End Function

Public Function GetJSSafeString2(ByVal text As String) As String
    If text.Contains("\") Then
        text = text.Replace("\", "\\")
    End If
    If text.Contains("""") Then
        text = text.Replace("""", "\x22")
    End If
    If text.Contains("'") Then
        text = text.Replace("'", "\x27")
    End If
    Return text
End Function
Using two strings, both around 200 characters in length (the first does not contain any characters to be converted; the second contains one of each character to be converted: \, " and '), I ran each of the two strings through the two functions 100000 times each.
The four results are coming out (in total-milliseconds) roughly as...
GetJSSafeString, no converted characters: 182.0364
GetJSSafeString, converted characters: 316.0632
GetJSSafeString2, no converted characters: 60.012
GetJSSafeString2, converted characters: 354.0708
So obviously GetJSSafeString2 is best if there are no replacements, and worst if there are characters to convert (but not by much, so it looks like the better choice).
But it got me thinking... could this be done with a single regular expression?
And if so, would it be faster than either of the two above functions?
The solution in JavaScript:
var text="this is a test \\ with \"things\" to ' replace";
var h={'\\':'\\\\', '"':"\\x22", "'":"\\x27"}; //we define here the replacements
text=text.replace(/("|\\|')/g,function(match){return h[match]});
alert(text); //prints: this is a test \\ with \x22things\x22 to \x27 replace
Note: this document on replace is worth reading
Big thanks to @psxls for his answer, which will be useful for future JavaScript implementations.
His answer made me look at the overloads for the .NET Regex.Replace function (which, to be honest, I should have done in the first place, my bad)... and there is a MatchEvaluator delegate.
So I have implemented the following code as a test (to complement the code already in my answer)...
Public Function GetJSSafeString3(ByVal text As String) As String
    Return Regex.Replace(text, "(\\|""|')", New MatchEvaluator(AddressOf GetJSSafeString3Eval))
End Function

Public Function GetJSSafeString3Eval(ByVal textMatch As Match) As String
    Select Case textMatch.Value
        Case "\"
            Return "\\"
        Case """"
            Return "\x22"
        Case "'"
            Return "\x27"
    End Select
    Return ""
End Function
And the results are as I expected... this is far, far less efficient than either of the functions in my original question. (The following are in milliseconds.)
GetJSSafeString, no converted characters: 182
GetJSSafeString, converted characters: 316
GetJSSafeString2, no converted characters: 60
GetJSSafeString2, converted characters: 354
GetJSSafeString3, no converted characters: 477
GetJSSafeString3, converted characters: 856
As the majority of the strings that I will be converting will not contain any of the characters mentioned, I am implementing the GetJSSafeString2 function, as that is by far the most efficient for the majority of situations.
I need to do something like this:
Have a variable of some type.
Run in a loop and assign all the possible ASCII characters to this variable and print them, one by one.
Is something similar possible for Unicode also?
I'm not sure how exactly you want to print, but this will console.log the printable ASCII characters:
for (var i = 32; i < 127; ++i) console.log(String.fromCharCode(i));
You can document.write them instead if that's your intention. And if the environment is Unicode, it should work for Unicode as well, I believe.
Others have shown how to print the printable Ascii characters. It is possible to print all other Ascii characters, too, though they are control characters with system-dependent effect (often no effect). To create a string containing all Ascii characters, you could do this:
var s = '';
for (var i = 0; i <= 127; i++) s += String.fromCharCode(i);
Unicode is much more tricky, because the Unicode coding space, from 0 to 0x10FFFF, contains a large number of unassigned code points as well as code points designated as noncharacters. There are also Private Use code points, which may be used to denote characters by “private agreement” but have no generally assigned meaning. Moreover, many Unicode characters are nonspacing, i.e. meant to combine with the preceding character (e.g., turning “a” to “â”), so you can’t visually print them in a row. There is no simple way in JavaScript to determine, from an integer, the class of the corresponding code point – you might need to read the UnicodeData.txt file, parse it, and use the information there to classify code points.
Finally, there is the programming issue that the JavaScript concept of character corresponds to a 16-bit code unit (not code point), and any Unicode code point larger than 0xFFFF needs to be represented using two code units (so-called surrogates). If you are using JavaScript in the context of an HTML document and you want to print characters in the HTML content, then the simplest way is to use character references like &#x10400; (which denotes the Unicode character at code point 10400 hexadecimal) and assign the string to the innerHTML property of an element.
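Here is a minimal sketch of that surrogate computation (the helper name codePointToString is mine, not a built-in):
function codePointToString(cp) {
    // Code points up to 0xFFFF fit in a single code unit
    if (cp <= 0xFFFF) return String.fromCharCode(cp);
    // Otherwise split into a high and a low surrogate
    cp -= 0x10000;
    return String.fromCharCode(0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF));
}
codePointToString(0x10400); // the character that &#x10400; denotes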
If you need to write ranges of Unicode characters, you might take a look at the Full Unicode Input utility that I recently wrote. Its source code illustrates some ways of dealing with Unicode characters in JavaScript.
Some of the ASCII characters are non-printable, but to get the characters from 32 (space) to 126 (~), for example, you would use:
var s = '';
for (var i = 32; i <= 126; i++) s += String.fromCharCode(i);
The Unicode character set has more than 110,000 different characters (see Unicode), but a normal font doesn't contain all of them, so you can't display them all anyway. You would have to specify which parts of the character space you are interested in.
I have an ASP Classic page with SHIFT_JIS charset. The meta tag under the page's head section is like this:
<meta http-equiv="Content-Type" content="text/html; charset=shift_jis">
My page has a text box (txtName) that should only allow 200 characters. I have a Javascript function that validates the character length, which is called on the onclick() event of my Submit button.
if (document.frmPage.txtName.value.length > 200) {
    alert("You have exceeded the maximum length of 200.");
    return false;
}
The problem is, Javascript is not getting the correct length of Japanese characters encoded in SHIFT_JIS. For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript only recognizes it as one character, probably because of the Unicode encoding that Javascript uses by default. Some characters like ケ take 2 or 3 characters in SHIFT_JIS.
If I only depend on the length provided by Javascript, overlong Japanese input would pass the page validation and then fail to save in the database because of the 200 maximum length of the DB column.
The browser that I'm using is Internet Explorer. Is there a way to get the SHIFT_JIS length of the Japanese character using Javascript? Is it possible to convert from Unicode to SHIFT_JIS using Javascript? How?
Thanks for the help!
For example, the character 测 has a SHIFT_JIS length of 8 characters, but Javascript is only recognizing it as one character, probably because of the Unicode encoding
Let's be clear: 测, U+6D4B (Han Character 'measure, estimate, conjecture') is a single character. When you encode it to a particular encoding like Shift-JIS, it may very well become multiple bytes.
In general JavaScript doesn't make encoding tables available so you can't find out how many bytes a character will take up. If you really need to, you have to carry around enough data to work it out yourself. For example, if you assume that the input contains only characters that are valid in Shift-JIS, this function would work out how many bytes are needed by keeping a list of all the characters that are a single byte, and assuming every other character takes two bytes:
function getShiftJISByteLength(s) {
    // ASCII plus the halfwidth katakana below are the single-byte
    // characters in Shift-JIS; everything else is counted as two bytes
    return s.replace(/[^\x00-\x80｡｢｣､･ｦｧｨｩｪｫｬｭｮｯｰｱｲｳｴｵｶｷｸｹｺｻｼｽｾｿﾀﾁﾂﾃﾄﾅﾆﾇﾈﾉﾊﾋﾌﾍﾎﾏﾐﾑﾒﾓﾔﾕﾖﾗﾘﾙﾚﾛﾜﾝﾞﾟ]/g, 'xx').length;
}
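For example:
getShiftJISByteLength("ﾃｽﾄtest"); // 7: ASCII and halfwidth katakana are one byte each
getShiftJISByteLength("テスト");   // 6: the fullwidth characters count as two bytes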
However, there are no 8-byte sequences in Shift-JIS, and the character 测 is not available in Shift-JIS at all. (It's a Chinese character not used in Japan.)
Why you might be thinking it constitutes an 8-byte sequence is this: when a browser can't submit a character in a form, because it does not exist in the target charset, it replaces it with an HTML character reference: in this case &#27979;. This is a lossy mangling: you can't tell whether the user typed literally 测 or &#27979;. And if you are displaying the submitted content &#27979; as 测 then that means you are forgetting to HTML-encode your output, which probably means your application is highly vulnerable to cross-site scripting.
The only sensible answer is to use UTF-8 instead of Shift-JIS. UTF-8 can happily encode 测, or any other character, without having to resort to broken HTML character references. If you need to store content limited by encoded byte length in your database, there is a sneaky hack you can use to get the number of UTF-8 bytes in a string:
function getUTF8ByteLength(s) {
    return unescape(encodeURIComponent(s)).length;
}
although probably it would be better to store native Unicode strings in the database so that the length limit refers to actual characters and not bytes in some encoding.
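The hack works because encodeURIComponent percent-escapes the string's UTF-8 bytes, and unescape then folds each %XX group back into a single code unit, so the resulting length equals the UTF-8 byte count. For example:
getUTF8ByteLength("abc"); // 3: ASCII is one byte per character
getUTF8ByteLength("测");  // 3: U+6D4B encodes to three UTF-8 bytes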
You are getting confused between characters and bytes. 测 is ONE character, however you look at it. In UTF-16 (which is what Javascript uses), it's two BYTES. In Shift_JIS, 8 bytes, apparently. But in both cases, it's ONE character. So what you are trying to do is limit the text length to 200 BYTES. Since Javascript is using UTF-16 (UCS-2, really) you can get its byte length by multiplying the string length by 2, but that doesn't help you with Shift_JIS. Then again, you should probably consider switching to Unicode anyway, if you're working with Javascript...
The JavaScript length property and substring function do not take non-ASCII characters into account.
I have a function that substrings the user's input to 400 characters if they enter more than 400 characters, e.g.:
function reduceInput(data) {
    var reducedSize = data;
    if (data.length > 400) {
        reducedSize = data.substring(0, 400);
    }
    return reducedSize;
}
However, if non-ASCII characters are entered (double-byte characters), this does not work: it does not take the character types into account.
I have another function that loops over each character and, if it is non-ASCII, increments a counter and then works out what the true count is. It works, but it is a bit of a hack.
Is there a more efficient approach to doing this or is there no other alternative?
Thanks
The native character set of JavaScript and web browsers in general is UTF-16. Strings are sequences of UTF-16 code units. There is no concept of "double byte" character encodings.
If you want to calculate how many bytes a String will take up in a particular double-byte encoding, you will need to know what encoding it is and how to encode it yourself; that information will not be accessible to JavaScript natively. So for example with Shift_JIS you will have to know which characters are kana that can be encoded to a single byte, and which take part in double-byte kanji sequences.
There is no encoding that stores all characters representing ASCII in one byte and all characters other than ASCII in two bytes, so whatever question you are trying to solve by counting non-ASCII as two, the loop-and-add approach probably isn't the right answer.
In any case, the old-school double-byte encodings are a horrible anachronism to be avoided wherever possible. If you want a space-efficient byte encoding, you want UTF-8. It's easy to calculate the length of a string in UTF-8 bytes because JS has a sneaky built-in UTF-8 encoder you can leverage:
var byten = unescape(encodeURIComponent(chars)).length; // UTF-8 byte count of chars
Snipping a string to 400 bytes is somewhat trickier because you want to avoid breaking a multi-byte sequence. You'll get an exception if you try to UTF-8-decode something with a broken sequence at the end, so catch it and try again:
var bytes = unescape(encodeURIComponent(chars)).slice(0, 400);
while (bytes.length > 0) {
    try {
        // throws URIError while the tail is a broken multi-byte sequence
        chars = decodeURIComponent(escape(bytes));
        break;
    } catch (e) {
        bytes = bytes.slice(0, -1); // drop one byte and retry
    }
}
But it's unusual to want to limit input based on the number of bytes it will take up in a particular encoding; a straight limit on the number of characters is far more typical. What are you trying to do?
A regex can do the job, I think:
var data = /.{0,400}/.exec(originalData)[0];