I would like to convert LTR strings into RTL strings. They are displayed incorrectly when they contain RTL characters (for example, characters in Hebrew or Arabic).
I saw this thread: Concat RTL string with LTR string in javascript
However, there are many use cases where concatenating is not enough, for example when there are parentheses in the string.
Is it wiser, therefore, to use a 3rd-party package? Do you have any recommendations for such a package?
My usage: I'm passing strings to a 3rd-party PDF renderer, which displays them. The strings are in Hebrew, but the PDF renderer treats them as LTR.
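One lightweight approach, if the renderer honors Unicode bidi control characters, is to wrap each string in a right-to-left embedding. This is only a sketch; `wrapRtl` is a hypothetical helper name, not part of any library:

```javascript
// Sketch: wrap a string in RIGHT-TO-LEFT EMBEDDING (U+202B) and
// POP DIRECTIONAL FORMATTING (U+202C) so that a bidi-aware renderer
// lays the whole run out right-to-left, parentheses included.
function wrapRtl(s) {
  return "\u202B" + s + "\u202C";
}

console.log(wrapRtl("שלום (hello)"));
```

If the PDF renderer performs no bidi processing at all, these invisible control characters will not help, and a dedicated bidi library that reorders the text into visual order is the safer route.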
I have found this sentence while reading one of the JavaScript books:
JavaScript programs are written using the Unicode character set
What I don't understand is: how does a .js file make sure that whatever I write in it is from the Unicode character set?
Does that mean that whenever I type on my computer's keyboard, it always uses Unicode? How does it work?
This means that the language definition employs the Unicode character set. In particular, this usually means that string literals can include Unicode characters, and it may also mean that identifiers can include some Unicode characters (I don't know JavaScript well, but this is allowed in the Haskell language, for example).
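For JavaScript specifically, both of these are legal (the variable names here are purely illustrative):

```javascript
const greeting = "שלום, Unicode";  // Unicode characters in a string literal
const straße = "street";           // a Unicode letter (ß) in an identifier
console.log(greeting, straße);
```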
Now, a JavaScript implementation can choose any way to map the bytes in a .js file into its internal Unicode representation. It may assume that all .js files are written in UTF-8, or in 7-bit ASCII, or anything else. You need to consult the implementation's manual to find out.
And yes, you need to know that any file consists of bytes, not characters. How the characters you type in the editor are converted to bytes stored in the file is up to your editor (usually it offers a choice between a local 8-bit encoding, UTF-8, and sometimes UTF-16). How the bytes stored in the file are converted back to characters is up to your language implementation (in this case, the JavaScript one).
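You can see the character-to-byte step directly in a browser or Node.js with TextEncoder, which always encodes to UTF-8:

```javascript
// One character, U+00E9 ("é"), becomes two bytes in UTF-8.
const bytes = new TextEncoder().encode("é");
console.log(bytes.length);        // 2
console.log(bytes[0], bytes[1]);  // 195 169 (0xC3 0xA9)
```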
I'm currently using Visual Studio Community and learning JavaScript/HTML5. I came across a problem where I'm trying to store a character code in a variable using JavaScript; however, when I try to display it, it shows the actual code instead of the intended character.
<script type="text/javascript">
    x = "&#24171;";
    function ranAlert() {
        alert(x);
    }
</script>
<form>
<input type="button" value="me" onclick="ranAlert()" />
</form>
The output here, when I click on the button, is the literal text "&#24171;". But how would I get it to display the actual intended character "幫"?
For your character, you need to use JavaScript notation and not HTML, because you're not using the string for any HTML markup; you're just passing it to alert():
alert("\u5e6b");
The JavaScript notation uses hex, so you'd need to translate 24171 to hex notation. You can do that in your browser console:
(24171).toString(16)
gives 5e6b.
Since this character is in the Basic Multilingual Plane and can be written directly in a Unicode-encoded source file, a simpler way to deal with the problem is to simply use the character itself in the string:
alert("幫");
For some Unicode codepoints, representation in JavaScript requires a pair of JavaScript characters, because JavaScript strings use UTF-16 encoding and some characters require more than 16 bits to represent. The extremely helpful website FileFormat.org has resources for Unicode, including ready-to-use representations for every character in several popular languages including JavaScript.
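For example, U+1F60E (😎) is outside the BMP, so it occupies two UTF-16 code units; in modern (ES2015+) engines you can write it either with a \u{...} codepoint escape or as an explicit surrogate pair:

```javascript
console.log("\u{1F60E}" === "\uD83D\uDE0E"); // true — the same string
console.log("\u{1F60E}".length);             // 2 — two UTF-16 code units
```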
If you want to use that codepoint in CSS, say as the value of a content: property, you escape slightly differently. The CSS encoding scheme allows 6-hex-digit values:
.my-glyph { content: "\005E6B"; }
Finally, note that using the term "character" is not entirely accurate, because Unicode reflects the fact that many writing systems compose glyphs from several parts. I'm not very strict on terminology in general, so don't follow my example; but it's important to keep in mind that what looks like a single glyph (what an alphabet user would call a "character") may actually be made up of several separate codepoints.
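For instance, "é" can be a single codepoint or a letter plus a combining mark, and String.prototype.normalize converts between the two forms:

```javascript
const composed = "\u00E9";     // "é" as one codepoint (NFC form)
const decomposed = "e\u0301";  // "e" + COMBINING ACUTE ACCENT (NFD form)
console.log(composed === decomposed);                  // false
console.log(composed === decomposed.normalize("NFC")); // true
```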
I am setting up a little website and would like to make it international. All the content will be stored in an external xml in different languages and parsed into the html via javascript.
Now the problem is, there are also German umlauts, Russian, Chinese and Japanese characters, and also right-to-left languages like Arabic and Farsi.
What would be the best way/solution? Is there an "international encoding" which can display all languages properly? Or is there any other solution you would suggest?
Thanks in advance!
All of the Unicode transformations (UTF-8, UTF-16, UTF-32) can encode all Unicode characters. You pick which you want to use based on the size: If most of your text is in western scripts, probably UTF-8, as it will use only one byte for most of the characters, but 2, 3, or 4 if needed. If you're encoding far east scripts, you'll probably want one of the other transformations.
The fundamental thing here is that it's all Unicode; the transformations are just different ways of representing the same characters.
The co-founder of Stack Overflow had a good article on this topic: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
Regardless of what encoding you use for your document, note that if you're doing processing of these strings in JavaScript, JavaScript strings are UTF-16 (except that invalid values are tolerated), even if the document is in UTF-8 or UTF-32. This means that, for instance, each of those emojis people are so excited about these days looks like two "characters" to JavaScript, because it takes two words of UTF-16 to represent. Like 😎, for instance:
console.log("😎".length); // 2
So you'll need to be careful not to split up the two halves of characters that are encoded in two words of UTF-16.
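One way to avoid that is to iterate by codepoint rather than by code unit; string iteration (for...of, the spread operator) is codepoint-aware:

```javascript
const s = "I am 😎";
console.log(s.length);       // 7 — .length counts UTF-16 code units
console.log([...s].length);  // 6 — spreading iterates by codepoint
```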
The normal (and recommended) solution for multilingual sites is to use UTF-8. That can deal with any characters that have been assigned Unicode codepoints, with a couple of caveats:
Unicode is a versioned standard, and different JavaScript implementations may support different Unicode versions.
If your text includes characters outside of the Unicode Basic Multilingual Plane (BMP), then you need to do your text processing (in Javascript) in a way that is Unicode aware. For instance, if you use the Javascript String class you need to take proper account of surrogate pairs when doing text manipulation.
(A JavaScript String is actually encoded as UTF-16. While it has methods that let you work with Unicode codepoints, methods and attributes such as substring and length use code-unit rather than codepoint indexing. If you are not careful, you can end up splitting a string between the low and high halves of a surrogate pair. The result will be something that cannot be displayed properly. This only affects codepoints in the higher planes ... but that includes the new emoji codepoints.)
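A quick illustration of the pitfall, and of the codepoint-aware accessor codePointAt:

```javascript
const s = "a😎b";
const broken = s.substring(0, 2);  // cuts between the surrogate halves
console.log(broken === "a\uD83D"); // true — ends in a lone high surrogate
console.log(s.codePointAt(1).toString(16)); // "1f60e" — the full codepoint
```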
Is it safe to write JavaScript source code (to be executed in the browser) that includes UTF-8 character literals?
For example, I would like to use an ellipses literal in a string as such:
var foo = "Oops… Something went wrong";
Do "modern" browsers support this? Is there a published browser support matrix somewhere?
JavaScript is by specification a Unicode language, so Unicode characters in strings should be safe. You can use hex escapes (\u8E24) as an alternative. Make sure your script files are served with proper content type headers.
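Concretely, the two spellings produce identical strings (the ellipsis is U+2026):

```javascript
const literal = "Oops… Something went wrong";
const escaped = "Oops\u2026 Something went wrong";
console.log(literal === escaped); // true
```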
Note that characters beyond one- and two-byte sequences are problematic, and that JavaScript regular expressions are terrible with characters beyond the first codepage. (Well maybe not "terrible", but primitive at best.)
You can also use Unicode letters, Unicode combining marks, and Unicode connector punctuation characters in identifiers, in case you want to impress your friends. Thus
var wavy﹏line = "wow";
is perfectly good JavaScript (but good luck with your bug report if you find a browser where it doesn't work).
Read all about it in the spec, or use it to fall asleep at night :)
Is there any way to disallow all symbols, punctuation, block elements, geometric shapes and dingbats such as these:
✁ ✂ ✃ ✄ ✆ ✇ ✈ ✉ ✌ ✍ ✎ ✏ ✐ ✑ ✒ ✓ ✔ ✕ ⟻ ⟼ ⟽ ⟾ ⟿ ⟻ ⟼ ⟽ ⟾ ⟿ ▚ ▛ ▜ ▝ ▞ ▟
without writing all of them down in the regular expression pattern, while allowing all other normal language characters, such as Chinese, Arabic etc., like these:
文化中国 الجزيرة نت
?
I'm building a javascript validation function and my real problem is that I can't use:
[a-zA-Z0-9]
Because this excludes a lot of languages too, not just the symbols.
The Unicode standard divides up all the possible characters into code charts. Each code chart contains related characters. If you want to exclude (or include) only certain classes of characters, you will have to make a suitable list of exclusions (or inclusions). Unicode is big, so this might be a lot of work.
Not really.
JavaScript doesn't support Unicode Character Properties. The closest you'll get is excluding ranges by Unicode code point as Greg Hewgill suggested.
For example, to match all of the characters from the Arrows block through the Block Elements block (which includes the mathematical operators):
/[\u2190-\u259F]/
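Note that since ES2018, JavaScript regular expressions do support Unicode property escapes when the u flag is set, which addresses the original question much more directly:

```javascript
// \p{L} = any letter, \p{N} = any number, in any script (ES2018+, /u flag).
const lettersNumbersSpaces = /^[\p{L}\p{N}\s]+$/u;
console.log(lettersNumbersSpaces.test("文化中国 الجزيرة نت")); // true
console.log(lettersNumbersSpaces.test("✈ ✉"));                // false
```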
This depends on your regex dialect. Unfortunately, probably most existing JavaScript engines don't support Unicode character classes.
In regex engines such as the one in (recent) Perl or .Net, Unicode character classes can be referenced.
\p{L}: any kind of letter from any language.
\p{N}: any number symbol from any language (including, as I recall, the Indian and Arabic and CJK number glyphs).
Because Unicode supports composed and decomposed glyphs, you may run into certain complexities: namely, if only decomposed forms exist, you might accidentally exclude some diacritic marks in your matching pattern, and you may need to explicitly allow glyphs of the type Mark. You can mitigate this somewhat by first normalizing the string using, if I recall correctly, Normalization Form KC (NFKC), which composes characters that have a composed form. In environments that support Unicode well, there's usually a function that allows you to normalize Unicode strings fairly easily (true in Java and .NET, at least).
Edited to add: If you've started down this path, or have considered it, in order to regain some sanity, you may want to experiment with the Unicode Plugin for XRegExp (which will require you to take a dependency on XRegExp).
JavaScript regular expressions do not have native Unicode support. An alternative is to validate (or sanitize) the string server-side, or to use a non-native regex library. While I've never used it, XRegExp is such a library, and it has a Unicode plugin.
Take a look at the Unicode Planes. You probably want to exclude everything but planes 0 and 2. After that, it gets ugly as you'll have to exclude a lot of plane 0 on a case-by-case basis.