need shaped forms for arabic letters - javascript

In Arabic, a character is shaped differently depending on whether it is at the beginning, middle, or end of a word.
e.g. يَسْتَفْعِلُ
The second last consonant, ع -- looks like this when written independently.
But looks very different when inside the word, ﻌ - as you can see.
I want to show the shaped form in a div.
When I copy just one letter from the word above, it pastes as ع
I want ﻌ
I tried splitting the word into letters in Ruby, no luck.
I tried splitting in JS, no luck.
My final goal is to render this in HTML: given an Arabic word, split it into its letters while retaining their shaped forms.
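One approach that may help here, offered as a sketch and not as a confirmed answer: wrap each isolated letter in U+200D ZERO WIDTH JOINER so the renderer picks the joined glyph. The function name `shapedParts` and the two joining tables are assumptions made for illustration; the tables cover only common Arabic letters, and actual rendering depends on the font and browser.

```javascript
const ZWJ = "\u200D"; // ZERO WIDTH JOINER: asks the renderer for a joined form

// Assumed, partial tables. Letters that never connect to the letter after them:
const RIGHT_JOINING = new Set(
  Array.from("\u0622\u0623\u0624\u0625\u0627\u0629\u062F\u0630\u0631\u0632\u0648")
);
// Letters that connect on neither side (hamza):
const NON_JOINING = new Set(["\u0621"]);

const joinsForward = (ch) => !RIGHT_JOINING.has(ch) && !NON_JOINING.has(ch);
const joinsBackward = (ch) => !NON_JOINING.has(ch);

function shapedParts(word) {
  // One cluster = a base letter plus the marks (harakat) that follow it.
  const clusters = word.match(/\P{M}\p{M}*/gu) || [];
  const bases = clusters.map((c) => Array.from(c)[0]);
  return clusters.map((cluster, i) => {
    const prev = i > 0 ? bases[i - 1] : null;
    const next = i < bases.length - 1 ? bases[i + 1] : null;
    const joinPrev = prev !== null && joinsForward(prev) && joinsBackward(bases[i]);
    const joinNext = next !== null && joinsForward(bases[i]) && joinsBackward(next);
    return (joinPrev ? ZWJ : "") + cluster + (joinNext ? ZWJ : "");
  });
}
```

Each returned string can be placed in its own div; a letter surrounded by joiners on both sides should be drawn in its medial form.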

Related

General (rough) algorithm for transliterating a RTL language to a LTR language

I am beginning to think about how to transliterate an RTL string (e.g. Arabic, Hebrew) to an LTR string (i.e. the romanization of the sounds/letters). It's relatively straightforward if it's LTR -> LTR, but mentally trickier for RTL -> LTR. For LTR -> LTR, you could have a simple mapping from each letter in A to each letter in B. Maybe multiple A's combined make a B in some cases, or a single A makes a chain of Bs.
a b
- -
X 1
YZ 2
ABC 3
D 456
E 78
Then given a string like XYZYZDDEABC you would get 122456456783. Basic enough, though the actual algorithm would be a bit tricky because it might have to look ahead and prioritize among the elements. But this is the gist of it.
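A minimal sketch of that mapping, assuming "prioritization" simply means preferring the longest matching key at each position. The names `TABLE` and `transliterate` are made up for illustration, and unmapped characters are copied through unchanged, which is also an assumption.

```javascript
// Mapping table from the example above.
const TABLE = { X: "1", YZ: "2", ABC: "3", D: "456", E: "78" };

function transliterate(input, table = TABLE) {
  // Longest keys first, so a multi-letter rule wins over a shorter one.
  const keys = Object.keys(table).sort((a, b) => b.length - a.length);
  let out = "";
  let i = 0;
  while (i < input.length) {
    const key = keys.find((k) => input.startsWith(k, i));
    if (key === undefined) {
      out += input[i]; // no rule applies: copy the character unchanged
      i += 1;
    } else {
      out += table[key];
      i += key.length;
    }
  }
  return out;
}

console.log(transliterate("XYZYZDDEABC")); // 122456456783
```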
Now for a RTL -> LTR transformation, I'm confused on two levels. First, how do you iterate through a RTL string? The characters are actually in LTR order, correct? It's just the visual layout in browsers and such which makes it RTL. So from a code perspective, your RTL language is actually read LTR (it's not like we have to do anything in reverse or anything). Just making sure I'm interpreting this correctly. That would mean I can just do like the above LTR -> LTR transformation for all intents and purposes.
If it's not like that, and there's something else to consider, I would like to know generally how to do this. If a language is needed for a demo, then JavaScript would be good.
You're correct. Text is stored in "logical order", which is the order it would be typed (or, in most cases, the order in which it is spoken). So you don't need to take directionality into account during transliteration.
Note that in many writing systems, including both Arabic and Hebrew, numbers are written "big-endian", with the most significant digit on the left. They are also typed in this order, meaning that the text is actually bidirectional. That is also the case when texts of different directionality are mixed together, such as when names written in Latin script are included in an Arabic or Hebrew document. Fortunately, you don't need to worry about that either, unless you're writing a Unicode renderer. (If you are, you'd need to read Annex 9 to the Unicode standard, which goes into all the details of bidirectional rendering.)
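To illustrate logical order, here is a small sketch that iterates a Hebrew word stored with escapes. The variable names are arbitrary; the point is only that index 0 holds the first letter typed, even though it is displayed rightmost.

```javascript
// "shalom" in Hebrew: shin, lamed, vav, final mem (displayed right to left).
const shalom = "\u05E9\u05DC\u05D5\u05DD";

// Iteration follows storage (logical) order, not visual order.
const letters = Array.from(shalom);
const codePoints = letters.map((ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0")
);

console.log(codePoints.join(" ")); // U+05E9 U+05DC U+05D5 U+05DD
```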

Avoid visual movement of word when showing sentence dynamically

In Cocos Creator, I want to show the dialog in a way like Typed.js: basically, showing the letters as if someone is typing them.
The problem is that I have a Label of a determined width, and, when there is a word near the end of the size of the Label, it starts writing on a line, and finishes writing on another line.
I would like that word to start writing it at the start of the next line, but I don't know how to do it.
The thing that I have tried:
RichText for Cocos Creator, to make the part of the sentence that still needs to be written transparent, but RichText in Cocos Creator cannot be transparent.
Try to write the word before rendering it, and then check the size of the Label, to see if the dimension has changed, so I will have to set the last word on a new line, but the Label size is not updated until rendered, and I want it before rendering.
Any idea on how to accomplish this?
I found a solution: first put all the words on the scene without showing them, calculate the size of every word, then work out at which word the sentence would break and add a line break before that word. I don't know if it's the most optimized approach, but it works and I don't know any other way to do it.
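The pre-measuring idea can be sketched independently of the engine. Here `measure` is a hypothetical callback standing in for whatever width measurement is available (it is not a Cocos Creator API), and `insertLineBreaks` is an illustrative name.

```javascript
// Insert "\n" before any word that would overflow maxWidth, so the typed-out
// text never starts a word on one line and finishes it on the next.
function insertLineBreaks(text, maxWidth, measure) {
  const lines = [];
  let line = "";
  for (const word of text.split(" ")) {
    const candidate = line === "" ? word : line + " " + word;
    if (line !== "" && measure(candidate) > maxWidth) {
      lines.push(line); // current line is full: close it
      line = word;      // the overflowing word starts the next line
    } else {
      line = candidate;
    }
  }
  if (line !== "") lines.push(line);
  return lines.join("\n");
}

// Example with a stand-in measure that counts characters.
console.log(insertLineBreaks("the quick brown fox", 10, (s) => s.length));
```

The text with breaks inserted can then be revealed one character at a time.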

How to split Japanese Characters in a Sentence into an Array using Client Side Javascript?

I'd like to put all of the individual Japanese characters into an array. For example entering 攻壳机动队 into a textarea (html) and creating an array with each character ['攻','壳','机','动','队'] in javascript. Duplicates should be kept.
I'd like to split by punctuation and spaces but, with Japanese, the sentences don't have spaces so I'm not sure how I can take each individual character and put them into an array. (I know some words consist of multiple characters but I am currently interested looking at how to separate each character to put in an array, multi-character words would be the next step).
Just using myString.split("") will split each character.
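One caveat worth a sketch: `split("")` cuts at UTF-16 code units, so characters outside the Basic Multilingual Plane come apart into two halves. `Array.from` iterates by code point instead. The function name `splitCharacters` is illustrative.

```javascript
// Split a string into characters by code point, not by UTF-16 code unit.
function splitCharacters(text) {
  return Array.from(text);
}

console.log(splitCharacters("攻壳机动队")); // five one-character strings

// U+20BB7 is stored as a surrogate pair, so the two methods disagree on it.
const rare = "\u{20BB7}";
console.log(rare.split("").length);         // 2
console.log(splitCharacters(rare).length);  // 1
```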
As for the second part, I think you'll find that to be very difficult. It's the same difficulty as coding for the English case of splitting the string thisismyexamplestring into coherent words. The computer won't know offhand, and you can't really write rules stating where a split should generally occur that also account for multi-character words.
If, for example, you had a textarea that asked for a user to talk about their computer, then the character '电' would most likely be followed by the character '脑', and you could probably apply some logic to combine those characters into one array index, but that might not always be the case.
I used Chinese in my example, but the principle is the same (I don't know Japanese, sorry).
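Where it is available, the built-in `Intl.Segmenter` offers dictionary-based word segmentation, which may cover the multi-character case. This is a sketch, not part of the answer above: `splitWords` is an illustrative name, the results depend on the runtime's locale data, and the fallback to single characters is an assumption for environments without the API.

```javascript
function splitWords(text, locale = "zh") {
  if (typeof Intl === "undefined" || typeof Intl.Segmenter !== "function") {
    return Array.from(text); // assumed fallback: one entry per character
  }
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Each segment object carries the matched substring in its `segment` field.
  return Array.from(segmenter.segment(text), (s) => s.segment);
}

console.log(splitWords("电脑很快"));
```

The segments always concatenate back to the original string; where the boundaries fall is up to the segmenter.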

How to break up non-english text into constituent characters in javascript?

I am trying to draw text along a curve on html5 canvas. To do this, I need to break up input text into constituent characters which can individually be rotated and translated etc. The breaking up of text is easy for English. Given input string s, s[i] gives the ith character. But this does not work for non-english strings. I have a jsfiddle here illustrating the problem: http://jsfiddle.net/c6HV8/. Note that the fiddle appears differently in Chrome and IE at time of this writing. To see what the problem is, consider you have non-english text in a string s. Create a text node to which you pass s. Next, create a text node for each s[i] and display the text nodes adjacent to each other. Now compare the results. They are not the same. How can I break up non-english text into constituent characters in javascript, so that the two results are the same?
भाईसाब :) So as I'm sure you already know, the problem is that fillText and createTextNode both work on the entire string, so they are able to evaluate the string along with all the diacritic marks (combining characters). However, when you call fillText and createTextNode per character, the diacritics are no longer attached to the characters they belong to. Hence they are evaluated and drawn individually, which is why you see the diacritic along with the dotted circle (a kind of placeholder that says: put a character here).
There is no easy way to do this, really. Your algorithm would basically have to be like this:
Look up the current character from the string.
Find all successive characters that are diacritics and then combine all of them into a new string.
Render that string using fillText.
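The grouping steps above can be sketched with a Unicode property escape in place of a hand-built list of diacritic code points. This is an assumption-laden simplification: `\p{M}` matches any combining mark, `splitClusters` is an illustrative name, and conjuncts formed with a virama are still split apart.

```javascript
// Group each base character with the combining marks that follow it.
// Stray leading marks are kept as their own group.
function splitClusters(text) {
  return text.match(/\P{M}\p{M}*|\p{M}+/gu) || [];
}

const clusters = splitClusters("भाईसाब");
console.log(clusters.length); // 4 groups from 6 code points
```

Each group can then be passed to fillText as a single string.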
You can check out the results here on a forked version of your fiddle. I modified the sample text to add some more complex characters just to make sure that the algorithm works properly. The code could definitely be cleaned up; I just did it as a proof-of-concept.
The hard part is coming up with a list of code-points for diacritics for all languages if you want to internationalize this. This answer provides a list that should help you get started.

Remove Jargon but keep real characters

I'm getting bombarded by spam with posts like the one below, so what would be the best and most efficient way to remove all the jargon from something like this:
<texarea id="comment">ȑ̉̽ͧ̔͆ͦ̊͛̿͗҉̷̢̧̫̗̗͎͈͕e̷̪͓̼̼̣̻̻͙͔̳̘̗͙̬̱͎ͭ̃͗ͩͯͥͬ̂ͧ͐͌̑̅͢͜ͅd̴̦̺̖̣͎̲̥͕̗̺̯̤͗ͬ͌ͧ̓͒ͭ́̋ͩͥ͊̇̓̌ͫ̃́́͠</textarea>
I'm assuming RegEx, but what exactly are those things called, and how would they be referenced in a RegExp? The problem lies within a <textarea> tag: upon retrieving the value, I'd like to remove all that jargon and have it display only the real characters, which in this case should be red.
Allowing other kinds of Unicode characters is essential, but not characters that stack on top of each other.
Zalgo waits behind the wall.
You want to filter out combining characters, such as the diacritical marks listed here.
You should be able to get away with a simple character class pattern match, i.e.:
fooString.replace(/[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/g, "");
If you want to limit content to one combination per character (not that this really alleviates all negative side-effects), you could simply use
fooString.replace(/([\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f])[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]*/g, "$1");
EDIT: Added a number of other combining character ranges. This is most likely still not exhaustive.
Removing combining diacriticals will make input of some languages (such as Vietnamese) difficult or impossible, so you should reconsider.
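One possible way to soften that trade-off, offered as a sketch and not as the method from the answer above: normalize to NFC first, so that accents with a precomposed form (including Vietnamese ones) are folded into single characters, and only then strip the marks left over. The name `stripStackedMarks` is illustrative, and scripts whose marks have no precomposed forms (Arabic vowel signs, Devanagari) would still lose them.

```javascript
function stripStackedMarks(text) {
  return text
    .normalize("NFC")          // fold base + mark into precomposed letters where they exist
    .replace(/\p{M}+/gu, "");  // drop whatever combining marks remain
}

// "e" + dot below + circumflex composes to a single Vietnamese letter and survives.
console.log(stripStackedMarks("e\u0323\u0302") === "\u1EC7"); // true
// Extra stacked marks on "a" are removed after the first one composes.
console.log(stripStackedMarks("a\u0300\u0301\u0302b") === "\u00E0b"); // true
```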
