Remove niqqud from a string in JavaScript

I have the exact problem described here:
removing Hebrew "niqqud" using r
Have been struggling to remove niqqud ( diacritical signs used to represent vowels or distinguish between alternative pronunciations of letters of the Hebrew alphabet). I have for instance this variable: sample1 <- "הֻסְמַק"
And I cannot find an effective way to remove the signs below the letters.
But in my case I have to do this in JavaScript.
Based on the UTF-8 values table described here, I have tried this regex without success.

Just a slight problem with your regex. Try the following:
const input = "הֻסְמַק";
console.log(input)
console.log(input.replace(/[\u0591-\u05C7]/g, ''));
/*
$ node index.js
הֻסְמַק
הסמק
*/

nj_’s answer is great.
Just to add a bit (because I don’t have enough reputation points to comment directly) -
[\u0591-\u05C7] may be too broad a brush. See the relevant table here: https://en.wikipedia.org/wiki/Unicode_and_HTML_for_the_Hebrew_alphabet#Compact_table
Rows 059x and 05Ax are for t'amim (accents/cantillation marks).
Niqqud per se is in rows 05Bx and 05Cx.
And as Avraham commented, you can run into an issue if two words are joined by a maqaf (05BE): removing it leaves you with run-on words.
If you want to remove only t'amim but keep niqqud, use /[\u0591-\u05AF]/g. If you want to avoid the issue raised by Avraham, you have two options: either keep the maqaf, or replace it with a dash:
//keep the original makafim
const input = "כִּי־טוֹב"
console.log(input)
console.log(input.replace(/([\u05B0-\u05BD]|[\u05BF-\u05C7])/g,""));
//replace makafim with dashes
console.log(input.replace(/\u05BE/g,"-").replace(/[\u05B0-\u05C7]/g,""))
/*
$ node index.js
כִּי־טֽוֹב
כי־טוב
כי-טוב
*/
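One more thing to be aware of: both patterns above remove a contiguous range, so they also take out the punctuation characters that sit inside it (paseq 05C0, sof pasuq 05C3, nun hafukha 05C6). Here is a minimal sketch, assuming you want to drop only the combining marks and leave all punctuation (including the maqaf) alone. The function name stripHebrewMarks and its keepNiqqud option are made up for illustration:

```javascript
// Combining marks only. Punctuation such as maqaf (U+05BE), paseq (U+05C0),
// sof pasuq (U+05C3) and nun hafukha (U+05C6) is deliberately not listed.
const TAAMIM = /[\u0591-\u05AF]/g; // cantillation marks
const NIQQUD = /[\u05B0-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7]/g; // points and dots

// Hypothetical helper: always removes t'amim, and removes niqqud unless asked to keep it.
function stripHebrewMarks(text, { keepNiqqud = false } = {}) {
  const withoutTaamim = text.replace(TAAMIM, "");
  return keepNiqqud ? withoutTaamim : withoutTaamim.replace(NIQQUD, "");
}

// "ki-tov" with niqqud and a maqaf, written with escapes to avoid copy/paste damage
const input = "\u05DB\u05B4\u05BC\u05D9\u05BE\u05D8\u05D5\u05B9\u05D1";
console.log(stripHebrewMarks(input)); // letters and maqaf only
```

Pass { keepNiqqud: true } if you only want the cantillation marks gone.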

Related

Regex-rule for matching single word during input (TipTap InputRule)

I'm currently experimenting with TipTap, an editor framework.
My goal is to build a Custom Node extension for TipTap that wraps a single word in <w>-Tags, whenever a user is typing text. In TipTap I can write an InputRule with Regex for this purpose
For example the rule /(?:^|\s)((?:~)((?:[^~]+))(?:~))$/ will match text between two tildes (~text~) and wrap it with <strike>-Tags.
Click here for my Codesandbox
I was trying for so long and can't figure it out. Here are the rules that I tried:
/**
* Regex that matches a word node during input
*/
// Will match words between two tilde characters; I'm using this expression from the documentation as my starting point.
//const inputRegex = /(?:^|\s)((?:~)((?:[^~]+))(?:~))$/
// Will match a word but will append the following text to that word without the space inbetween
//const inputRegex = /\b\w+\b\s$/
// Will match a word but will append the following text to previous word without the space inbetween; Will work with double spaces
//const inputRegex = /(?:^|\s\b)(?:[^\s])(\w+\b)(?:\s)$/
// Will match a word but will swallow every second character
//const inputRegex = /\b([^\s]+)\b$/g
// Will match every second word
//const inputRegex = /\b([^\s]+)\b\s(?:\s)$/
// Will match every word but swallow spaces; Will work if I insert double spaces
const inputRegex = /\b([^\s]+)(?:\b)\s$/
The problem here is the choice of delimiter, which is space.
This becomes clear when we see the code for markInputRule.ts (line 37 to be precise)
if (captureGroup) {
const startSpaces = fullMatch.search(/\S/)
const textStart = range.from + fullMatch.indexOf(captureGroup)
const textEnd = textStart + captureGroup.length
const excludedMarks = getMarksBetween(range.from, range.to, state.doc)
When '~' is used as the delimiter, the input rule places the start and end markers inside the delimiters and passes only the enclosed text to the extension tag (CustomItalic, in your case). You can test this by entering strike-through text enclosed in '~': the '~' characters are stripped out and the text is put inside the strike-through tag.
This is exactly the cause of your double-space problem: when your match is a word plus a space, the space is treated like a delimiter and dropped, and only the word is placed inside the tag.
I have tried to work around this using negative look-ahead patterns, but the problem remains in the code of the file mentioned above.
What I would suggest here is to copy the code in markInputRule.ts and make a custom InputRule as per your requirements, which would be way easier than working with the in-built one. Hope this helps.
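To make the index arithmetic concrete, here is a small stand-alone sketch in plain JavaScript. It does not call TipTap at all; it only repeats the fullMatch / captureGroup calculation from the snippet quoted above, using your last regex and an assumed input of "hello " (the variable names typed and dropped are mine):

```javascript
// The asker's last rule: the word is captured, the trailing space is not.
const inputRegex = /\b([^\s]+)(?:\b)\s$/;

const typed = "hello "; // assumed text in the editor right after pressing space
const match = inputRegex.exec(typed);

const fullMatch = match[0];    // "hello " (includes the space)
const captureGroup = match[1]; // "hello"  (space excluded)

// Same arithmetic as in the quoted snippet, relative to the start of the match
const textStart = fullMatch.indexOf(captureGroup);
const textEnd = textStart + captureGroup.length;

// Whatever lies in the full match but outside the capture group is what gets lost
const dropped = fullMatch.slice(0, textStart) + fullMatch.slice(textEnd);

console.log(JSON.stringify({ fullMatch, captureGroup, dropped }));
```

The dropped part is exactly the space, which is why typing a second space appears to fix it.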
I assume the problem lies within the "space". Depending on the browser, the final "space" is either not represented at all in the underlying html (Firefox) or replaced with &nbsp; (e.g. Chrome).
I suggest you replace the \s with (\s|&nbsp;) in your regex.

Remove everything after constant using regex

I've got XML that has additional information, BLAH, in each tag. When creating the tags, I've separated the extra info from the tag name with a constant (XMLSPLIT as constant XML_SPLITTER)... I needed to do this because I'm generating my XML from a JSON object and I can't have multiple keys that are the same thing... but in the XML output, can't have that superfluous stuff.
For example:
....
<SetXMLSPLITBLAH>
<Value>9</Value>
<SetType>
<Name>Foo</Name>
</SetType>
</SetXMLSPLITBLAH>
...
So, after generating the XML, I go through and clean it. I'm trying to do it with a regex. I figure, I want to remove anything on a line after the splitter and replace it with just the >.
let reg = new RegExp("<Set"+XML_SPLITTER+"(.*)\/g");
cleanXML = dirtyXML.replace(reg, "<Set>")
This fails to work.
I will note that I tried reg = /<Set(.*)/g; and that worked just fine... but it also captures "SetType" and any other tag that starts with "<Set".
It's because ^ is a Regex special character that indicates "beginning of line". You'd need to escape it like \^ for this to work. Something like /<Set\^\^[^>]*>/g should do the trick.
Small note: The above regex assumes that the "BLAH" string in your example will never contain the > character... but if it does, then your XML is super malformed anyway.
Using .* will match > and if - for some reason - your XML file is not broken up into multiple lines (i.e. minified), you'll match more than you should. To avoid this, you can use [^>]* to match everything up to the >.
Since you've gracefully included a splitter, it'll make matching much easier and much more predictable (as you mentioned, you match SetType without a splitter).
Without a splitter, you'd have to use a regex pattern that resembles <Set(?!Type>)[^>]* or <Set(?!(?:Type|SomethingElse)>)[^>]* if you had more than just one suffix to Set that should remain. These methods use a negative lookahead to assert what follows does not match.
var str = `<SetXMLSPLITBLAH>
<Value>9</Value>
<SetType>
<Name>Foo</Name>
</SetType>
</SetXMLSPLITBLAH>`
var XML_SPLITTER = 'XMLSPLIT'
var p = `(</?)Set${XML_SPLITTER}[^>]*`
var r = new RegExp(p,'g')
var x = str.replace(r,'$1Set')
console.log(x)
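If the splitter can contain regex special characters (such as the ^ mentioned earlier), it has to be escaped before it goes into new RegExp. A rough sketch of that idea, assuming a splitter of "^^". The helper names escapeRegExp and stripSplitterSuffix are hypothetical, not from any library:

```javascript
// Escape every character that has a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Remove the splitter and everything after it, up to (not including) the closing ">".
function stripSplitterSuffix(xml, splitter) {
  const pattern = new RegExp("(</?Set)" + escapeRegExp(splitter) + "[^>]*", "g");
  return xml.replace(pattern, "$1");
}

const dirtyXML = "<Set^^BLAH><Value>9</Value><SetType><Name>Foo</Name></SetType></Set^^BLAH>";
console.log(stripSplitterSuffix(dirtyXML, "^^"));
```

Because the splitter must follow "Set" directly, tags like SetType are left untouched.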

Velocity RegEx for case insensitive match

I want to implement a sort of Glossary in a Velocity template (using Javascript). The following is the use case:
there are a large number of items whose description may contain references to predefined terms
there is a list of terms which are defined -> allGlossary
I want to automatically find and mark all items in the allGlossary list that appear in the description of all my items
Example:
allGlossary = ["GUI","RGB","fine","Color Range"]
Item Description: The interface (GUI) shall be generated using only a pre-defined RGB color range.
After running the script, I would expect the Description to look like this:
"The interface (GUI) shall be generated using only a pre-defined RGB Color Range."
NOTE: even though "fine" does appear in the Description (inside "pre-defined"), it shall not be marked.
I thought of splitting the description of each item into words but then I miss all the glossary items which have more than 1 word. My current idea is to look for each item in the list in each of the descriptions but I have the following limitations:
I need to find only matches for the exact items in the 2 lists (single and multiple words)
The search has to be case insensitive
Items found may be at the beginning or end of the text and separated by various symbols: space, comma, period, parentheses, brackets, etc.
I have the following code which works but is not case-insensitive:
#set($desc = $item.description)
#foreach($g in $allGlossary)
#set($desc = $desc.replaceAll("\b$g\b", "*$g*"))
#end##foreach
Can someone please help with making this case-insensitive? Or does anyone have a better way to do this?
Thanks!
UPDATE:
based on the answer below, I tried to do the following in my Velocity Template page:
#set($allGlossary = ["GUI","RGB","fine","Color Range"])
#set($itemDescription = "The interface (GUI) shall be generated using only a pre-defined RGB color range.")
<script type="text/javascript">
var allGlossary = new Array();
var itemDescription = "$itemDescription";
</script>
#foreach($a in $allGlossary)
<script type="text/javascript">
allGlossary.push("$a");
console.log(allGlossary);
</script>
#end##foreach
<script type="text/javascript">
console.log(allGlossary[0]);
</script>
The issue is that if I try to display the whole allGlossary Array, it contains the correct elements. As soon as I try to display just one of them (as in the example), I get the error Uncaught SyntaxError: missing ) after argument list.
You mentioned that you are using JavaScript for these calculations. So one simple way would be to iterate over your allGlossary array, create a regular expression for each item, and use it to find and replace all occurrences in the text.
To find only values which are between word boundaries, you can use \b. This allows matches like (RGB) or Color Range?. To match case-insensitively, you can use the i flag, and to find every instance in a string (not just the first one), you can use the global flag g.
Dynamic creation of regular expressions (with variables in them) is only supported in JavaScript if you use the constructor notation of a regular expression (don't forget to escape backslashes inside the string). For static regular expressions, you could also use: /\bRGB\b/ig. Here is the dynamic one:
new RegExp("\\b("+item+")\\b", 'gi');
Here is a fully functional example based on your sample string. It replaces every item of the allGlossary array with a HTML wrapped version of it.
var allGlossary = ["GUI","RGB","fine","Color Range"]
var itemDescription = "The interface (GUI) shall be generated using only a pre-defined RGB color range.";
for(var i=0; i<allGlossary.length; i++) {
var item = allGlossary[i];
var regex = new RegExp("\\b("+item+")\\b", 'gi');
itemDescription = itemDescription.replace(regex, "<b>$1</b>");
}
console.log(itemDescription);
If this is not your expected solution, you can leave a comment below.
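A possible variation, in case your glossary terms can contain characters like "+", "(" or ".": escape each term first and build a single alternation, so the text is scanned once and the inserted markup is never re-scanned by a later term. This is only a sketch; escapeRegExp and markGlossaryTerms are names I made up:

```javascript
// Escape every character that has a special meaning in a regular expression.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Wrap every whole-word, case-insensitive occurrence of any term in <b>...</b>.
function markGlossaryTerms(description, terms) {
  if (terms.length === 0) return description; // nothing to look for
  const alternation = terms.map(escapeRegExp).join("|");
  const regex = new RegExp("\\b(" + alternation + ")\\b", "gi");
  return description.replace(regex, "<b>$1</b>");
}

const allGlossary = ["GUI", "RGB", "fine", "Color Range"];
const itemDescription =
  "The interface (GUI) shall be generated using only a pre-defined RGB color range.";
console.log(markGlossaryTerms(itemDescription, allGlossary));
```

Note that \b only works as expected for terms that start and end with a word character (letter, digit or underscore).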

How to highlight a substring containing a random character between two known characters using javascript?

I have a bunch of strings in a data frame as given below.
v1 v2
ARSTNFGATTATNMGATGHTGNKGTEEFR SEQUENCE1
BRCTNIGATGATNLGATGHTGNQGTEEFR SEQUENCE2
ARSTNFGATTATNMGATGHTGNKGTEEFR SEQUENCE3
I want to search and highlight some selected substrings within each string in v1 column. For example, assuming first letter in the substring being searched as "N" and the last letter as "G", and the middle one could be any letter as in "NAG" or "NBG" or "NCG" or "NDG" and so on. To highlight the substring of three characters as shown below, I am writing 26 lines of code to display in R Shiny tab assuming there could be any of the 26 letters in between "N" and "G". I am just trying to optimize the code. I am new to JS. Hope I was clear. If not before down voting please let me know should you need more explanation or details.
ARSTNFGATTATNMGATGHTGNKGTEEFR
BRCTNIGATGATNLGATGHTGNQGTEEFR
ARSTNFGATTATNMGATGHTGNKGTEEFR
The abridged code with representative 2 lines (first and last line) of the 26 lines of the code I use are provided here.
datatable(DF, options = list(rowCallback=JS("function(row,data) {
data[0] = data[0].replace(/NAG/g,'<span style=\"color:blue; font-weight:bold\">NAG</span>');
.....
data[0] = data[0].replace(/NZG/g, '<span style=\"color:blue; font-weight:bold\">NZG</span>');
$('td:eq(0)', row).html(data[0]);}"), dom = 't'))
I think the regex you want is: /N[A-Z]G/g
If you also want it to work for lower case: /N[A-Za-z]G/g
I found a simple solution. Maybe it will be useful to someone like me.
datatable(DF, options = list(rowCallback = JS("function(row,data) {
data[0] = data[0].replace(/N[A-Z]G/g,'<span style=\"color:blue; font-weight:bold\">$&</span>');
$('td:eq(0)', row).html(data[0]);}"), dom = 't'))
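If the first and last letters need to vary, the same replace can be wrapped in a small plain-JavaScript function. This is just a sketch outside of Shiny/DataTables; highlightTriplets is a hypothetical name, it uses a plain b tag instead of the styled span, and it assumes first and last are single uppercase letters (so no escaping is done):

```javascript
// Wrap every three-letter run "first + any uppercase letter + last" in <b>...</b>.
// Assumes `first` and `last` are single uppercase letters.
function highlightTriplets(sequence, first, last) {
  const pattern = new RegExp(first + "[A-Z]" + last, "g");
  return sequence.replace(pattern, "<b>$&</b>"); // $& is the matched text itself
}

console.log(highlightTriplets("ARSTNFGATTATNMGATGHTGNKGTEEFR", "N", "G"));
```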

how do I get this regex to mimic a look-behind?

/(([$]*)([A-Z]{1,3})([$]*)([0-9]{1,5}))/gi
Regex running on Debuggex
This is for pulling cell refs out of spreadsheet formulas and checking to see if the formula contains an absolute ref. The problem is that it's matching an invalid cell, the last one here:
a1
$a1
$A$5
A5*4
A20+45
A34/A$23
A1*6
A1*A45
$AAA11
AAA33
AA33:A33
$AAAAA44 // <-- not a valid cell!
It's matching the AAA44 in $AAAAA44, but it shouldn't. All the rest of the capture groups etc are working correctly -- each of those rows but the last one are correctly grabbing 1 or more cell refs. A negative lookahead seems like the right way to go, but after mucking with it for a good long while I must admit to being stuck.
If you can't match for ^...$ then you may still be able to introduce some \b matching
/foo\bbar/.test('foobar'); // false
/foo\b\d/.test('foo1'); // false
/foo\b.\d/.test('foo+1'); // true
So your RegExp would look like (I left in your capture groups)
var re = /(?:\b|^)((\$?)([a-z]{1,3})(\$?)(\d{1,5}))(?:\b|$)/i;
re.test('$AAAAA44'); // false
re.test('$AAA44'); // true
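For what it's worth, if your JavaScript engine supports ES2018 lookbehind assertions (an assumption; older browsers do not), you no longer need to mimic one. A rough sketch with a global regex that pulls all refs out of a formula. The names CELL_REF and extractCellRefs are mine, and I dropped the outer capture group, so the groups are: $ before the column, column letters, $ before the row, row digits:

```javascript
// A ref must not be preceded by a letter or "$" and must not be followed by another digit.
const CELL_REF = /(?<![A-Z$])(\$?)([A-Z]{1,3})(\$?)(\d{1,5})(?!\d)/gi;

// Return every cell reference found in the formula as a plain string.
function extractCellRefs(formula) {
  return Array.from(formula.matchAll(CELL_REF), (m) => m[0]);
}

console.log(extractCellRefs("A34/A$23"));
console.log(extractCellRefs("AA33:A33"));
console.log(extractCellRefs("$AAAAA44")); // too many letters, so nothing is found
```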
