How to break up non-English text into constituent characters in JavaScript?

I am trying to draw text along a curve on an HTML5 canvas. To do this, I need to break up the input text into constituent characters that can be individually rotated, translated, and so on. Breaking up the text is easy for English: given an input string s, s[i] gives the ith character. But this does not work for non-English strings. I have a jsfiddle here illustrating the problem: http://jsfiddle.net/c6HV8/. Note that the fiddle renders differently in Chrome and IE at the time of this writing. To see the problem, suppose you have non-English text in a string s. Create a text node from s. Next, create a text node for each s[i] and display those text nodes adjacent to each other. Now compare the two results: they are not the same. How can I break up non-English text into constituent characters in JavaScript so that the two results are the same?

भाईसाब :) As I'm sure you already know, the problem is that fillText and createTextNode both work on the entire string, so they can evaluate the string along with all of its diacritic marks (combining characters). However, when you call fillText and createTextNode per character, the diacritics are no longer attached to the characters they belong to. They are evaluated and drawn individually, which is why you see each diacritic on a dotted circle (a kind of placeholder that says: put a character here).
There is no easy way to do this, really. Your algorithm would basically have to be like this:
Look up the current character from the string.
Find all successive characters that are diacritics and then combine all of them into a new string.
Render that string using fillText.
You can check out the results here on a forked version of your fiddle. I modified the sample text to add some more complex characters just to make sure that the algorithm works properly. The code could definitely be cleaned up; I just did it as a proof-of-concept.
The hard part is coming up with a list of code-points for diacritics for all languages if you want to internationalize this. This answer provides a list that should help you get started.
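The steps above can be sketched as follows. This is a minimal sketch, not the code from the fiddle: the function name splitIntoClusters is hypothetical, and it assumes a JavaScript engine that supports Unicode property escapes (\p{M} with the u flag, ES2018), which stands in for a hand-maintained list of diacritic code points. It only groups each base character with the combining marks that follow it, so script-specific rules are not covered (for example, a conjunct joined by a virama is still split in two).

```javascript
// Hypothetical helper: group each base character with the combining marks after it.
// Assumes support for Unicode property escapes (the "u" flag with \p{M}).
function splitIntoClusters(text) {
  // \P{M}\p{M}* : one non-mark code point followed by any number of marks
  // \p{M}+      : stray marks that have no base character before them
  return text.match(/\P{M}\p{M}*|\p{M}+/gu) || [];
}

// Each cluster can then be drawn on its own, e.g. ctx.fillText(cluster, x, y).
const clusters = splitIntoClusters("e\u0301a"); // "e" + combining acute, then "a"
console.log(clusters.length); // 2
```

Each returned string can be passed to fillText separately, so a mark is always rendered together with its base character.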

Related

need shaped forms for arabic letters

In Arabic, a character shows up differently depending on whether it is at the beginning of a word, middle, or end.
e.g. يَسْتَفْعِلُ
The second-to-last consonant, ع, looks like this when written on its own.
But it looks very different inside the word: ﻌ.
I want to show the shaped form in a div.
When I copy just one letter from above word, it pastes like ع
I want ﻌ
I tried splitting the word into letters in Ruby, no luck.
I tried splitting in JS, no luck.
My final goal is to render this in HTML. Given an Arabic word, find its broken parts, retaining their shaped forms.

Split a long string into chunks using REGEX 'Lookbehind' as optional

I'm working on a regex that lets me split into chunks a long text that could have #variables# inside. The rules to do the splitting basically are:
Split at each #photo# or #childphoto# variable, looking behind or ahead for text so that sentences are not cut.
Each chunk should contain at most one #photo# or #childphoto# variable.
Each chunk should be shorter than 350 characters.
A chunk should not cut words or sentences.
A chunk should not cut any of the other text variables that may appear in the text (#anyOtherVariables#).
Currently, I have this Regex
/^.*[\S\s]{0,350}[\s\S](?<=(#photo#|#childphoto#)).*/
That currently works with the JavaScript .match() method to extract the chunks of text that contain the variables, using the lookbehind approach. It does not work for the other chunks, which do not satisfy the lookbehind condition. Is there a way to include those parts as well?
Here are the regex and the test case: https://regex101.com/r/kdKHkQ/1
I would really appreciate any help with this.

Here is a single JavaScript regex that does what you have specified:
^\b(?=([^]*))[^]{0,350}$(?<=(?![^]{1,}\1$)(?:#(photo|childphoto)#)?[^]*?)(?<!(?=\1$)(?:[^]*?#(photo|childphoto)#){2}[^]*?)
Demo on regex101
It enforces the 350 character limit by taking a snapshot (using lookahead) at the beginning, consuming and capturing up to 350 characters, and then using a lookbehind to look no further back than the snapshotted beginning, to assert that one of the variables in question is inside the just-captured string. Then it uses a negative lookbehind to enforce that there are not two or more of the variables in question in the just-captured string.
I did not understand your rule "The chunk should not have to cut any of the possible text variables into the text #anyOtherVariables#". If by that you mean that the lines containing variables other than #photo# or #childphoto# should be skipped over (not matched), then this regex does not do that, but it could be easily modified to do so.
Now, practically speaking, it would probably be better to implement this in code, or a combination of code and regex, but this demonstrates that exactly what you asked is possible with a pure regex.
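As a rough illustration of the "implement this in code" route, here is a minimal sketch. It is not equivalent to the regex above: the names chunkText and countPhotoVars are hypothetical, the sentence splitter is deliberately naive (text up to ., ! or ?), and a single sentence that is itself too long or contains two photo variables is emitted as its own oversized chunk instead of being cut.

```javascript
// Hypothetical helpers; the names and the sentence-splitting rule are assumptions.
const PHOTO_VAR = /#(?:photo|childphoto)#/g;

function countPhotoVars(s) {
  return (s.match(PHOTO_VAR) || []).length;
}

function chunkText(text, maxLen = 350) {
  // Naive sentence split: a run of text, its closing punctuation, trailing whitespace.
  const sentences = text.match(/[^.!?]+[.!?]*\s*|[.!?]+\s*/g) || [];
  const chunks = [];
  let current = "";
  for (const sentence of sentences) {
    const candidate = current + sentence;
    // A chunk must stay under maxLen and hold at most one photo variable.
    const fits = candidate.trim().length < maxLen && countPhotoVars(candidate) <= 1;
    if (current !== "" && !fits) {
      chunks.push(current.trim()); // close the chunk before this sentence
      current = sentence;
    } else {
      current = candidate;
    }
  }
  if (current.trim() !== "") chunks.push(current.trim());
  return chunks;
}

console.log(chunkText("Hello #photo# world. Another #childphoto# here. Plain text."));
// [ 'Hello #photo# world.', 'Another #childphoto# here. Plain text.' ]
```

Because whole sentences are the unit of accumulation, words, sentences, and other #variables# are never cut, provided no variable contains sentence punctuation.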
I would like to point out that calling this "splitting by each #photo# or #childphoto# variable" is disingenuous; if I took that literally, it would break your other rule that a chunk should not cut sentences. That is probably why you got downvoted.
I'm posting my answer here, despite the fact that you got downvoted, because I already answered this on reddit and you disappeared without commenting.

Are there W3C specifications for line wrapping behavior in an HTML <textarea>?

I'm working on a project to generate an image that replicates user-entered text input in a <textarea>. The best approach I've found is to read the <textarea>'s value and draw it on a <canvas>, from which I can read a blob and save to a file.
Here is a very naive example, just to give an idea of what I'm trying to do:
https://codepen.io/troywarr/pen/wmbMVq
Note that when the text that you enter in the <textarea> exceeds the length of the first line, the text in the <textarea> wraps, but the text on the <canvas> does not. That's the crux of the code that I still need to write, but first I need to fully understand <textarea>'s line wrapping behavior so that I can recreate it with a JavaScript algorithm.
By playing around with a <textarea> in Google Chrome, I've determined some basic behavior when a word exceeds the line width (as set by the rendered width of the <textarea>):
Words that are preceded by a space will wrap, and the width of the preceding space is not factored into the rendered width of the previous line.
Words that are hyphenated with a figure dash (‒, U+2012), en dash (–, U+2013), or em dash (—, U+2014) will wrap after the last hyphen that fits on the current line. I'm still exploring if there are other characters that hyphenate.
Words that exceed the total line width will be broken mid-word, with subsequent characters overflowing to the next line(s).
However, experience tells me that there are likely to be some edge cases and cross-browser functional differences that I haven't considered. I searched for specifications that describe how exactly <textarea>s should wrap text, but I haven't found any. The closest I've gotten is the spec for the wrap attribute, but it doesn't offer much help.
Is this behavior just dictated by different rendering engines (i.e., not W3C-specified), or am I just missing the W3C specification? If the former, what behavior have I missed, and what do I need to be aware of in terms of cross-browser functional differences?
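The three wrapping rules observed above could be approximated along these lines. This is only a sketch of those observations, not specified browser behavior: wrapText and its measure callback are hypothetical names, dash handling is left out, and runs of multiple spaces get no special treatment. In a browser, measure could be s => ctx.measureText(s).width for a canvas 2D context ctx.

```javascript
// Hypothetical helper: greedy word wrap using a caller-supplied width function.
function wrapText(text, maxWidth, measure) {
  const lines = [];
  for (const paragraph of text.split("\n")) {
    let line = "";
    for (const word of paragraph.split(" ")) {
      const candidate = line === "" ? word : line + " " + word;
      if (measure(candidate) <= maxWidth) {
        line = candidate; // the word fits on the current line
        continue;
      }
      if (line !== "") {
        lines.push(line); // wrap; the separating space is dropped, not measured
        line = "";
      }
      // A word wider than the whole line is broken mid-word.
      let rest = word;
      while (measure(rest) > maxWidth && rest.length > 1) {
        let cut = 1;
        while (cut < rest.length && measure(rest.slice(0, cut + 1)) <= maxWidth) cut++;
        lines.push(rest.slice(0, cut));
        rest = rest.slice(cut);
      }
      line = rest;
    }
    lines.push(line);
  }
  return lines;
}

// Using string length as a stand-in for pixel width:
console.log(wrapText("ab cd ef", 5, (s) => s.length)); // [ 'ab cd', 'ef' ]
console.log(wrapText("abcdefgh", 3, (s) => s.length)); // [ 'abc', 'def', 'gh' ]
```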

How to split Japanese Characters in a Sentence into an Array using Client Side Javascript?

I'd like to put all of the individual Japanese characters into an array. For example entering 攻壳机动队 into a textarea (html) and creating an array with each character ['攻','壳','机','动','队'] in javascript. Duplicates should be kept.
I'd like to split by punctuation and spaces, but Japanese sentences don't have spaces, so I'm not sure how to take each individual character and put it into an array. (I know some words consist of multiple characters, but for now I am interested in separating each character into an array; multi-character words would be the next step.)

Just using myString.split("") will split each character.
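One caveat worth showing: split("") cuts the string into UTF-16 code units, so a character outside the Basic Multilingual Plane comes out as two broken halves. A small sketch of the difference, using Array.from, which iterates by code point (the variable names are only for illustration):

```javascript
// Characters from the question: all inside the BMP, so both approaches agree.
const title = "攻壳机动队";
console.log(title.split("").length);   // 5
console.log(Array.from(title).length); // 5

// "\uD842\uDFB7" is U+20BB7, a CJK character stored as a surrogate pair.
const rare = "\uD842\uDFB7\u91CE"; // two characters
const byCodeUnit = rare.split("");     // splits the surrogate pair in half
const byCodePoint = Array.from(rare);  // keeps each character whole
console.log(byCodeUnit.length);  // 3
console.log(byCodePoint.length); // 2
```

The spread form [...myString] behaves the same way as Array.from(myString).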
As for the second part, I think you'll find that to be very difficult. It's the same difficulty as the English case of splitting the string thisismyexamplestring into coherent words. The computer won't know offhand, and you can't really add rules stating where a split should occur in general to account for multi-character words.
If, for example, you had a textarea that asked a user to talk about their computer, then the character '电' would most likely be followed by the character '脑', and you could probably apply some logic to combine those characters into one array index, but that might not always be the case.
I used Chinese in my example, but the principle is the same (I don't know Japanese, sorry).
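For the multi-character-word step, newer JavaScript engines provide Intl.Segmenter, which can propose locale-aware word boundaries. A hedged sketch, assuming Intl.Segmenter is available in the target environment (older browsers lack it); the function name segmentWords is hypothetical, and the boundaries it returns vary by engine and will not always match a human reader's judgment.

```javascript
// Hypothetical helper: ask the engine for word-level segments of a string.
// Assumes Intl.Segmenter exists in the runtime.
function segmentWords(text, locale = "ja") {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  // Each item yielded by segment() exposes the matched substring as .segment.
  return Array.from(segmenter.segment(text), (part) => part.segment);
}

const parts = segmentWords("电脑很好", "zh");
console.log(parts); // segment boundaries depend on the engine's data
```

The segments always concatenate back to the original string, so no characters are lost even when a boundary is placed differently than expected.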

Remove Jargon but keep real characters

I'm getting bombarded by spam posts like the one below, so what would be the best and most efficient way of removing all the jargon from something like this:
<textarea id="comment">ȑ̉̽ͧ̔͆ͦ̊͛̿͗҉̷̢̧̫̗̗͎͈͕e̷̪͓̼̼̣̻̻͙͔̳̘̗͙̬̱͎ͭ̃͗ͩͯͥͬ̂ͧ͐͌̑̅͢͜ͅd̴̦̺̖̣͎̲̥͕̗̺̯̤͗ͬ͌ͧ̓͒ͭ́̋ͩͥ͊̇̓̌ͫ̃́́͠</textarea>
I'm assuming a regex, but what exactly are those things called, and how would they be referenced in a RegExp? The problem lies within a <textarea> tag; upon retrieving the value, I'd like to remove all that jargon and display only the real characters, which in this case should be "red".
Allowing other kinds of Unicode characters is essential, but not characters that stack on top of each other.

Zalgo waits behind the wall.
You want to filter out combining characters, such as the diacritical marks listed here.
You should be able to get away with a simple character class pattern match, i.e.:
fooString.replace(/[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]/g, "");
If you want to limit content to one combining mark per character (not that this really alleviates all negative side effects), you could simply use
fooString.replace(/([\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f])[\u0300-\u036f\u0483-\u0489\u1dc0-\u1dff\u20d0-\u20ff\ufe20-\ufe2f]*/g, "$1");
EDIT: Added a number of other combining character ranges. This is most likely still not exhaustive.
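As an alternative to listing ranges by hand, engines that support Unicode property escapes can match every combining mark with \p{M}. The sketch below is an assumption-laden variant of the same idea, not the code from this answer: limitCombiningMarks is a hypothetical name, and the maxMarks parameter is added so that text which legitimately carries a couple of marks per letter can be left intact.

```javascript
// Hypothetical helper: keep at most maxMarks combining marks after each character.
// Assumes support for \p{M} with the "u" flag (ES2018).
function limitCombiningMarks(text, maxMarks = 0) {
  // Match a run of more than maxMarks marks and keep only the first maxMarks.
  const longRun = new RegExp(`(\\p{M}{${maxMarks}})\\p{M}+`, "gu");
  return text.replace(longRun, "$1");
}

const zalgo = "r\u0300\u0301e\u0302d\u0303\u0304\u0305";
console.log(limitCombiningMarks(zalgo));                // "red"
console.log(limitCombiningMarks(zalgo, 1).length);      // 6: each letter keeps one mark
console.log(limitCombiningMarks("e\u0323\u0302", 2));   // unchanged: only two marks
```

With maxMarks set to 0 this strips every mark; a small positive value removes the stacking while leaving short mark sequences alone.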

Removing combining diacriticals will make input of some languages (such as Vietnamese) difficult or impossible, so you should reconsider.
